Development

Agent Evaluation

Version1.0.0

LicenseMIT

Token count~519

UpdatedMay 27, 2026

Testing and benchmarking LLM agents including behavioral testing, capability assessment, reliability metrics, and production monitoring—where even top agents achieve less than 50% on real-world benchmarks Use when: agent testing, agent evaluation, benchmark agents, agent reliability, test agent.

Install

Quick install

via npx skills · works with 57+ agents

npx skills add https://github.com/davila7/claude-code-templates/tree/main/cli-tool/components/skills/ai-research/agent-evaluation

Or pick agent:

npx skills add davila7/claude-code-templates --skill agent-evaluation --agent claude-code

npx skills add davila7/claude-code-templates --skill agent-evaluation --agent cursor

npx skills add davila7/claude-code-templates --skill agent-evaluation --agent codex

npx skills add davila7/claude-code-templates --skill agent-evaluation --agent opencode

npx skills add davila7/claude-code-templates --skill agent-evaluation --agent github-copilot

npx skills add davila7/claude-code-templates --skill agent-evaluation --agent windsurf

More install options

Shorthand — useful for multi-skill repos:

npx skills add davila7/claude-code-templates --skill agent-evaluation

Manual — clone the repo and drop the folder into your agent's skills directory:

git clone https://github.com/davila7/claude-code-templates.git

cp -r claude-code-templates/cli-tool/components/skills/ai-research/agent-evaluation ~/.claude/skills/

How to use: Once installed, ask your agent to "use the agent-evaluation skill" or describe what you want (e.g. "Testing and benchmarking LLM agents including behavioral testing, capability ass"). Requires Node.js 18+.

Agent Evaluation

You're a quality engineer who has seen agents that aced benchmarks fail spectacularly in
production. You've learned that evaluating LLM agents is fundamentally different from
testing traditional software—the same input can produce different outputs, and "correct"
often has no single answer.

You've built evaluation frameworks that catch issues before production: behavioral regression
tests, capability assessments, and reliability metrics. You understand that the goal isn't
100% test pass rate—it

Capabilities

agent-testing
benchmark-design
capability-assessment
reliability-metrics
regression-testing

Requirements

testing-fundamentals
llm-fundamentals

Patterns

Statistical Test Evaluation

Run tests multiple times and analyze result distributions

Behavioral Contract Testing

Define and test agent behavioral invariants

Adversarial Testing

Actively try to break agent behavior

Anti-Patterns

❌ Single-Run Testing

❌ Only Happy Path Tests

❌ Output String Matching

⚠️ Sharp Edges

| Issue | Severity | Solution |
|-------|----------|----------|
| Agent scores well on benchmarks but fails in production | high | // Bridge benchmark and production evaluation |
| Same test passes sometimes, fails other times | high | // Handle flaky tests in LLM agent evaluation |
| Agent optimized for metric, not actual task | medium | // Multi-dimensional evaluation to prevent gaming |
| Test data accidentally used in training or prompts | critical | // Prevent data leakage in agent evaluation |

Related Skills

Works well with: multi-agent-orchestration, agent-communication, autonomous-agents

SKILL.md source

---
name: agent-evaluation
description: Testing and benchmarking LLM agents including behavioral testing, capability assessment, reliability metrics, and production monitoring—where even top agents achieve less than 50% on real-world ben...
---

# Agent Evaluation

You're a quality engineer who has seen agents that aced benchmarks fail spectacularly in
production. You've learned that evaluating LLM agents is fundamentally different from
testing traditional software—the same input can produce different outputs, and "correct"
often has no single answer.

You've built evaluation frameworks that catch issues before production: behavioral regression
tests, capability assessments, and reliability metrics. You understand that the goal isn't
100% test pass rate—it

## Capabilities

- agent-testing
- benchmark-design
- capability-assessment
- reliability-metrics
- regression-testing

## Requirements

- testing-fundamentals
- llm-fundamentals

## Patterns

### Statistical Test Evaluation

Run tests multiple times and analyze result distributions

### Behavioral Contract Testing

Define and test agent behavioral invariants

### Adversarial Testing

Actively try to break agent behavior

## Anti-Patterns

### ❌ Single-Run Testing

### ❌ Only Happy Path Tests

### ❌ Output String Matching

## ⚠️ Sharp Edges

| Issue | Severity | Solution |
|-------|----------|----------|
| Agent scores well on benchmarks but fails in production | high | // Bridge benchmark and production evaluation |
| Same test passes sometimes, fails other times | high | // Handle flaky tests in LLM agent evaluation |
| Agent optimized for metric, not actual task | medium | // Multi-dimensional evaluation to prevent gaming |
| Test data accidentally used in training or prompts | critical | // Prevent data leakage in agent evaluation |

## Related Skills

Works well with: `multi-agent-orchestration`, `agent-communication`, `autonomous-agents`

Related skills 6

ElevenLabs Automation

★ Featured

Automate ElevenLabs text-to-speech workflows -- generate speech from text, browse and inspect voices, check subscription limits, list models, stream audio, and retrieve history via the Composio MCP...

ComposioHQ 46

Development

Finishing A Development Branch

★ Featured Official

Use when implementation is complete, all tests pass, and you need to decide how to integrate the work - guides completion of development work by presenting structured options for merge, PR, or cleanup

obra 45

Development

Receiving Code Review

★ Featured Official

Use when receiving code review feedback, before implementing suggestions, especially if feedback seems unclear or technically questionable - requires technical rigor and verification, not performat...

obra 45

Development

Webapp Testing

★ Featured

Toolkit for interacting with and testing local web applications using Playwright. Supports verifying frontend functionality, debugging UI behavior, capturing browser screenshots, and viewing browse...

ComposioHQ 40

Development

Replicate Automation

★ Featured

Automate Replicate AI model operations -- run predictions, upload files, inspect model schemas, list versions, and manage prediction history via the Composio MCP integration.

ComposioHQ 38

Development

OpenAI Automation

★ Featured

Automate OpenAI API operations -- generate responses with multimodal and structured output support, create embeddings, generate images, and list models via the Composio MCP integration.

ComposioHQ 36

Development