Eval
Evaluate and rank agent results by metric or LLM judge for an AgentHub session.
Install
Quick install
npx skills add https://github.com/alirezarezvani/claude-skills/tree/main/engineering/agenthub/skills/eval
npx skills add alirezarezvani/claude-skills --skill eval --agent claude-code
npx skills add alirezarezvani/claude-skills --skill eval --agent cursor
npx skills add alirezarezvani/claude-skills --skill eval --agent codex
npx skills add alirezarezvani/claude-skills --skill eval --agent opencode
npx skills add alirezarezvani/claude-skills --skill eval --agent github-copilot
npx skills add alirezarezvani/claude-skills --skill eval --agent windsurf
More install options
Shorthand — useful for multi-skill repos:
npx skills add alirezarezvani/claude-skills --skill eval
Manual — clone the repo and drop the folder into your agent's skills directory:
git clone https://github.com/alirezarezvani/claude-skills.git
cp -r claude-skills/engineering/agenthub/skills/eval ~/.claude/skills/
/hub:eval — Evaluate Agent Results
Rank all agent results for a session. Supports metric-based evaluation (run a command), LLM judge (compare diffs), or hybrid.
Usage
/hub:eval # Eval latest session using configured criteria
/hub:eval 20260317-143022 # Eval specific session
/hub:eval --judge # Force LLM judge mode (ignore metric config)
What It Does
Metric Mode (eval command configured)
Run the evaluation command in each agent's worktree:
python {skill_path}/scripts/result_ranker.py \
--session {session-id} \
--eval-cmd "{eval_cmd}" \
--metric {metric} --direction {direction}
Output:
RANK  AGENT    METRIC  DELTA  FILES
1     agent-2  142ms   -38ms  2
2     agent-1  165ms   -15ms  3
3     agent-3  190ms   +10ms  1
Winner: agent-2 (142ms)
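The ranking shown above can be sketched in a few lines of Python. This is an illustrative sketch only, not the code in `result_ranker.py`: the `Result` type, the `rank_results` function, and the `"lower"`/`"higher"` direction values are assumptions made for the example. The baseline of 180 ms is inferred from the table, where every METRIC minus its DELTA gives 180.

```python
from dataclasses import dataclass


@dataclass
class Result:
    """One agent's measured outcome (hypothetical structure for this sketch)."""
    agent: str
    metric: float  # measured value, e.g. latency in ms
    files: int     # number of files the agent changed


def rank_results(results, baseline, direction="lower"):
    """Return (rank, agent, metric, delta, files) tuples, best first.

    direction="lower" means a smaller metric wins; "higher" means a larger
    metric wins. delta is metric - baseline, so a negative delta is an
    improvement when lower is better.
    """
    if direction not in ("lower", "higher"):
        raise ValueError(f"unknown direction: {direction!r}")
    ordered = sorted(results, key=lambda r: r.metric,
                     reverse=(direction == "higher"))
    return [
        (rank, r.agent, r.metric, r.metric - baseline, r.files)
        for rank, r in enumerate(ordered, start=1)
    ]


# Values taken from the example table; baseline inferred as 180 ms.
results = [
    Result("agent-1", 165, 3),
    Result("agent-2", 142, 2),
    Result("agent-3", 190, 1),
]
for row in rank_results(results, baseline=180):
    print(row)
```

The first row returned is the winner. Passing `direction="higher"` reverses the order for metrics such as throughput, where larger values are better.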
LLM Judge Mode (no eval command, or --judge flag)
For each agent:
- Get the diff: git diff {base_branch}...{agent_branch}
- Read the agent's result post from .agenthub/board/results/agent-{i}-result.md
- Compare all diffs and rank by:
- Correctness — Does it solve the task?
- Simplicity — Fewer lines changed is better (when equal correctness)
- Quality — Clean execution, good structure, no regressions
Present rankings with justification.
Example LLM judge output for a content task:
RANK  AGENT    VERDICT                           WORD COUNT
1     agent-1  Strong narrative, clear CTA       1480
2     agent-3  Good data points, weak intro      1520
3     agent-2  Generic tone, no differentiation  1350
Winner: agent-1 (strongest narrative arc and call-to-action)
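The Simplicity criterion (fewer lines changed wins when correctness is equal) amounts to a two-level sort. The sketch below assumes the judge has already assigned each agent a numeric correctness score; that score, the tuple layout, and both function names are hypothetical and exist only for this example. The Quality criterion is left out because it has no obvious numeric form.

```python
def count_changed_lines(diff_text):
    """Count added and removed lines in unified diff text.

    Simplification: lines starting with "+++" or "---" are treated as file
    headers, so a changed line whose own content begins with "++" or "--"
    would be missed.
    """
    changed = 0
    for line in diff_text.splitlines():
        if line.startswith(("+++", "---")):
            continue  # file header, not a content change
        if line.startswith(("+", "-")):
            changed += 1
    return changed


def rank_by_judgement(candidates):
    """Sort (agent, correctness, diff_text) tuples, best first.

    Higher correctness wins; among equal correctness, the smaller diff wins.
    """
    return sorted(
        candidates,
        key=lambda c: (-c[1], count_changed_lines(c[2])),
    )


small_diff = "--- a/x.py\n+++ b/x.py\n@@ -1,2 +1,2 @@\n-old\n+new\n context\n"
large_diff = "--- a/x.py\n+++ b/x.py\n@@ -1 +1,3 @@\n-old\n+new\n+extra\n+more\n"

ranking = rank_by_judgement([
    ("agent-1", 1, large_diff),
    ("agent-2", 1, small_diff),
    ("agent-3", 0, small_diff),
])
for agent, correctness, diff in ranking:
    print(agent, correctness, count_changed_lines(diff))
```

The diff text would come from the `git diff {base_branch}...{agent_branch}` step. The three-dot form compares the agent branch against its merge base with the base branch, so only the agent's own changes are counted.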
Hybrid Mode
- Run metric evaluation first
- If top agents are within 10% of each other, use LLM judge to break ties
- Present both metric and qualitative rankings
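The tie check in the second step could look like the following sketch. The source does not say how "within 10%" is computed, so this example assumes a relative difference measured against the best metric value. The function name and parameters are hypothetical.

```python
def agents_needing_judge(metrics, direction="lower", tolerance=0.10):
    """Return the agents whose metric is within `tolerance` of the best value.

    metrics maps agent name to metric value. If more than one agent is
    returned, the metric alone does not separate them and the LLM judge
    would break the tie.
    """
    if not metrics:
        return []
    values = metrics.values()
    best = min(values) if direction == "lower" else max(values)
    if best == 0:
        # Relative difference is undefined at zero; only exact ties count.
        return [agent for agent, value in metrics.items() if value == 0]
    return [
        agent
        for agent, value in metrics.items()
        if abs(value - best) / abs(best) <= tolerance
    ]


# Values from the metric-mode example: 165 ms is about 16% above 142 ms,
# so agent-2 wins on the metric alone and no judge is needed.
clear_win = agents_needing_judge({"agent-1": 165, "agent-2": 142, "agent-3": 190})
print(clear_win)

# A closer race: 150 ms is under 6% above 142 ms, so both go to the judge.
close_race = agents_needing_judge({"agent-1": 150, "agent-2": 142})
print(close_race)
```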
After Eval
- Update session state:
python {skill_path}/scripts/session_manager.py --update {session-id} --state evaluating
- Tell the user:
- Ranked results with winner highlighted
- Next step: /hub:merge to merge the winner
- Or /hub:merge {session-id} --agent {winner} to be explicit
SKILL.md source
---
name: eval
description: Evaluate and rank agent results by metric or LLM judge for an AgentHub session.
---
# /hub:eval — Evaluate Agent Results
Rank all agent results for a session. Supports metric-based evaluation (run a command), LLM judge (compare diffs), or hybrid.
## Usage
```
/hub:eval # Eval latest session using configured criteria
/hub:eval 20260317-143022 # Eval specific session
/hub:eval --judge # Force LLM judge mode (ignore metric config)
```
## What It Does
### Metric Mode (eval command configured)
Run the evaluation command in each agent's worktree:
```bash
python {skill_path}/scripts/result_ranker.py \
--session {session-id} \
--eval-cmd "{eval_cmd}" \
--metric {metric} --direction {direction}
```
Output:
```
RANK  AGENT    METRIC  DELTA  FILES
1     agent-2  142ms   -38ms  2
2     agent-1  165ms   -15ms  3
3     agent-3  190ms   +10ms  1
Winner: agent-2 (142ms)
```
### LLM Judge Mode (no eval command, or --judge flag)
For each agent:
1. Get the diff: `git diff {base_branch}...{agent_branch}`
2. Read the agent's result post from `.agenthub/board/results/agent-{i}-result.md`
3. Compare all diffs and rank by:
   - **Correctness** — Does it solve the task?
   - **Simplicity** — Fewer lines changed is better (when equal correctness)
   - **Quality** — Clean execution, good structure, no regressions
Present rankings with justification.
Example LLM judge output for a content task:
```
RANK  AGENT    VERDICT                           WORD COUNT
1     agent-1  Strong narrative, clear CTA       1480
2     agent-3  Good data points, weak intro      1520
3     agent-2  Generic tone, no differentiation  1350
Winner: agent-1 (strongest narrative arc and call-to-action)
```
### Hybrid Mode
1. Run metric evaluation first
2. If top agents are within 10% of each other, use LLM judge to break ties
3. Present both metric and qualitative rankings
## After Eval
1. Update session state:
```bash
python {skill_path}/scripts/session_manager.py --update {session-id} --state evaluating
```
2. Tell the user:
   - Ranked results with winner highlighted
   - Next step: `/hub:merge` to merge the winner
   - Or `/hub:merge {session-id} --agent {winner}` to be explicit