★ Featured Development

Eval

Evaluate and rank agent results by metric or LLM judge for an AgentHub session.

Authoralirezarezvani

Version1.0.0

LicenseMIT

Token count~591

UpdatedJun 4, 2026

Install

Quick install

via npx skills · works with 57+ agents

npx skills add https://github.com/alirezarezvani/claude-skills/tree/main/engineering/agenthub/skills/eval

Or pick agent:

npx skills add alirezarezvani/claude-skills --skill eval --agent claude-code

npx skills add alirezarezvani/claude-skills --skill eval --agent cursor

npx skills add alirezarezvani/claude-skills --skill eval --agent codex

npx skills add alirezarezvani/claude-skills --skill eval --agent opencode

npx skills add alirezarezvani/claude-skills --skill eval --agent github-copilot

npx skills add alirezarezvani/claude-skills --skill eval --agent windsurf

More install options

Shorthand — useful for multi-skill repos:

npx skills add alirezarezvani/claude-skills --skill eval

Manual — clone the repo and drop the folder into your agent's skills directory:

git clone https://github.com/alirezarezvani/claude-skills.git

cp -r claude-skills/engineering/agenthub/skills/eval ~/.claude/skills/

How to use: Once installed, ask your agent to "use the eval skill" or describe what you want (e.g. "Evaluate and rank agent results by metric or LLM judge for an AgentHub session"). Requires Node.js 18+.

/hub:eval — Evaluate Agent Results

Rank all agent results for a session. Supports metric-based evaluation (run a command), LLM judge (compare diffs), or hybrid.

Usage

/hub:eval                           # Eval latest session using configured criteria
/hub:eval 20260317-143022           # Eval specific session
/hub:eval --judge                   # Force LLM judge mode (ignore metric config)

What It Does

Metric Mode (eval command configured)

Run the evaluation command in each agent's worktree:

python {skill_path}/scripts/result_ranker.py \
  --session {session-id} \
  --eval-cmd "{eval_cmd}" \
  --metric {metric} --direction {direction}

Output:

RANK  AGENT       METRIC      DELTA      FILES
1     agent-2     142ms       -38ms      2
2     agent-1     165ms       -15ms      3
3     agent-3     190ms       +10ms      1

Winner: agent-2 (142ms)

LLM Judge Mode (no eval command, or --judge flag)

For each agent:

Get the diff: git diff {base_branch}...{agent_branch}

Read the agent's result post from .agenthub/board/results/agent-{i}-result.md

Compare all diffs and rank by:

Correctness — Does it solve the task?

Simplicity — Fewer lines changed is better (when equal correctness)

Quality — Clean execution, good structure, no regressions

Present rankings with justification.

Example LLM judge output for a content task:

RANK  AGENT    VERDICT                               WORD COUNT
1     agent-1  Strong narrative, clear CTA            1480
2     agent-3  Good data points, weak intro           1520
3     agent-2  Generic tone, no differentiation       1350

Winner: agent-1 (strongest narrative arc and call-to-action)

Hybrid Mode

Run metric evaluation first
If top agents are within 10% of each other, use LLM judge to break ties
Present both metric and qualitative rankings

After Eval

Update session state:

python {skill_path}/scripts/session_manager.py --update {session-id} --state evaluating

Tell the user:

Ranked results with winner highlighted
Next step: /hub:merge to merge the winner
Or /hub:merge {session-id} --agent {winner} to be explicit

SKILL.md source

---
name: eval
description: Evaluate and rank agent results by metric or LLM judge for an AgentHub session.
---

# /hub:eval — Evaluate Agent Results

Rank all agent results for a session. Supports metric-based evaluation (run a command), LLM judge (compare diffs), or hybrid.

## Usage

```
/hub:eval                           # Eval latest session using configured criteria
/hub:eval 20260317-143022           # Eval specific session
/hub:eval --judge                   # Force LLM judge mode (ignore metric config)
```

## What It Does

### Metric Mode (eval command configured)

Run the evaluation command in each agent's worktree:

```bash
python {skill_path}/scripts/result_ranker.py \
  --session {session-id} \
  --eval-cmd "{eval_cmd}" \
  --metric {metric} --direction {direction}
```

Output:
```
RANK  AGENT       METRIC      DELTA      FILES
1     agent-2     142ms       -38ms      2
2     agent-1     165ms       -15ms      3
3     agent-3     190ms       +10ms      1

Winner: agent-2 (142ms)
```

### LLM Judge Mode (no eval command, or --judge flag)

For each agent:
1. Get the diff: `git diff {base_branch}...{agent_branch}`
2. Read the agent's result post from `.agenthub/board/results/agent-{i}-result.md`
3. Compare all diffs and rank by:
   - **Correctness** — Does it solve the task?
   - **Simplicity** — Fewer lines changed is better (when equal correctness)
   - **Quality** — Clean execution, good structure, no regressions

Present rankings with justification.

Example LLM judge output for a content task:
```
RANK  AGENT    VERDICT                               WORD COUNT
1     agent-1  Strong narrative, clear CTA            1480
2     agent-3  Good data points, weak intro           1520
3     agent-2  Generic tone, no differentiation       1350

Winner: agent-1 (strongest narrative arc and call-to-action)
```

### Hybrid Mode

1. Run metric evaluation first
2. If top agents are within 10% of each other, use LLM judge to break ties
3. Present both metric and qualitative rankings

## After Eval

1. Update session state:
```bash
python {skill_path}/scripts/session_manager.py --update {session-id} --state evaluating
```

2. Tell the user:
   - Ranked results with winner highlighted
   - Next step: `/hub:merge` to merge the winner
   - Or `/hub:merge {session-id} --agent {winner}` to be explicit

Related skills 6

caveman

★ Featured

Ultra-compressed communication mode. Cuts token usage ~75% by speaking like caveman while keeping full technical accuracy. Supports intensity levels: lite, full (default), ultra, wenyan-lite, wenyan-full, wenyan-ultra. Use when user says "caveman mode", "talk like caveman", "use caveman", "less tokens", "be brief", or invokes /caveman. Also auto-triggers when token efficiency is requested.

juliusbrussee 167k

Development

secure-linux-web-hosting

★ Featured

Use when setting up, hardening, or reviewing a cloud server for self-hosting, including DNS, SSH, firewalls, Nginx, static-site hosting, reverse-proxying an app, HTTPS with Let's Encrypt or ACME clients, safe HTTP-to-HTTPS redirects, or optional post-launch network tuning such as BBR.

xixu-me 155k

Development

readme-i18n

★ Featured

Use when the user wants to translate a repository README, make a repo multilingual, localize docs, add a language switcher, internationalize the README, or update localized README variants in a GitHub-style repository.

xixu-me 155k

Development

lark-shared

★ Featured

Use when first setting up lark-cli, running auth login, switching user/bot identity (--as), handling permission denied or scope errors, needing to update lark-cli, or seeing _notice in JSON output.

larksuite 155k

Development

improve-codebase-architecture

★ Featured

Find deepening opportunities in a codebase, informed by the domain language in CONTEXT.md and the decisions in docs/adr/. Use when the user wants to improve architecture, find refactoring opportunities, consolidate tightly-coupled modules, or make a codebase more testable and AI-navigable.

mattpocock 151k

Development

paper-context-resolver

★ Featured

Optional RigorPilot helper for README-first deep learning repo reproduction. Use only when the README and repository files leave a narrow reproduction-critical gap and the task is to resolve a specific paper detail such as dataset split, preprocessing, evaluation protocol, checkpoint mapping, or runtime assumption from primary paper sources while recording conflicts. Do not use for general paper summary, repo scanning, environment setup, command execution, title-only paper lookup, or replacin...

lllllllama 127k

Development