Eval Guide

Eval enablement accelerator — help customers think through "what does good look like" for their AI agent, then generate a structured eval plan and test cases…

Author: microsoft

Version: 1.0.0

License: MIT

Token count: ~7,490

Updated: Jun 5, 2026

Install

Quick install

via npx skills · works with 57+ agents

npx skills add https://github.com/microsoft/eval-guide/tree/HEAD/skills/eval-guide

Or pick agent:

npx skills add microsoft/eval-guide --skill eval-guide --agent claude-code

npx skills add microsoft/eval-guide --skill eval-guide --agent cursor

npx skills add microsoft/eval-guide --skill eval-guide --agent codex

npx skills add microsoft/eval-guide --skill eval-guide --agent opencode

npx skills add microsoft/eval-guide --skill eval-guide --agent github-copilot

npx skills add microsoft/eval-guide --skill eval-guide --agent windsurf

More install options

Shorthand — useful for multi-skill repos:

npx skills add microsoft/eval-guide --skill eval-guide

Manual — clone the repo and drop the folder into your agent's skills directory:

git clone https://github.com/microsoft/eval-guide.git

cp -r eval-guide/skills/eval-guide ~/.claude/skills/

How to use: Once installed, ask your agent to "use the eval-guide skill" or describe what you want (e.g., "help me think through what good looks like for my AI agent"). Requires Node.js 18+.


Eval Guide — Enablement Accelerator

Help customers go from "I don't know where to start with eval" to "I have a plan, test cases, and know how to interpret results" — in one session. The customer becomes self-sufficient for future eval cycles.

Eval-First Mindset

You do NOT need a built agent to start. All you need is an idea, a description, or even a vague goal. This skill is designed around the eval-first approach: define what "good" looks like and write your evals before you build the agent or feature.

Why eval-first?

Evals sharpen your thinking. Writing test cases forces you to articulate exactly what the agent should and shouldn't do — before you spend time building it.

Evals become your spec. The eval plan from Stage 1 and test cases from Stage 2 double as your agent's acceptance criteria. Build the agent to pass these tests.

Evals prevent drift. When you define success upfront, you avoid scope creep and "it seems to work" thinking. You'll know objectively whether the agent meets the bar.

Start here whether you:

Have only a rough idea ("we want an HR bot")

Have a written description but no agent yet

Have a built agent you want to evaluate

Are adding a new feature to an existing agent

Stages 0 (Discover), 1 (Plan), and 2 (Generate) all work without a running agent. They help you think through your agent's purpose, design a structured eval plan, and generate test cases — all before writing a single line of agent configuration. Stage 3 (Run) is the only stage that requires a live agent, and it's optional.

This skill is grounded in Microsoft's Eval Scenario Library, Triage & Improvement Playbook, and MS Learn agent evaluation documentation.

Important: You are an enablement accelerator, not a replacement. Each stage generates artifacts the customer can use immediately AND explains the reasoning so they internalize the methodology. After one session, they should be able to do the next eval without us.

Interactive Dashboard Workflow

Each stage produces an interactive HTML dashboard that opens directly in the browser. The dashboard runs against a tiny localhost HTTP server (serve.py --serve); the customer never sees, downloads, or moves a JSON file. Feedback flows from the browser → server → the AI's bash stdout, in one step.
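The browser → server → stdout round trip can be sketched with Python's standard library. This is a minimal illustration under stated assumptions, not the shipped serve.py: the handler class, the helper names, and the response body are hypothetical. Only the `/api/feedback` path, port 3118, and the marker lines (quoted in the flow that follows) come from this document.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# Marker lines as quoted in this document; everything else here is illustrative.
BEGIN = "===EVAL_GUIDE_FEEDBACK_BEGIN==="
END = "===EVAL_GUIDE_FEEDBACK_END==="


def render_feedback_block(feedback: dict) -> str:
    """Wrap the feedback JSON in marker lines so a caller can find it in stdout."""
    return "\n".join([BEGIN, json.dumps(feedback), END])


class FeedbackHandler(BaseHTTPRequestHandler):
    """Hypothetical handler: accept one feedback POST, print it, stop the server."""

    def do_POST(self):
        if self.path != "/api/feedback":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        feedback = json.loads(self.rfile.read(length))
        print(render_feedback_block(feedback), flush=True)
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(b'{"ok": true}')
        # shutdown() waits for serve_forever() to exit, so call it from another thread.
        threading.Thread(target=self.server.shutdown, daemon=True).start()

    def log_message(self, format, *args):
        pass  # suppress per-request logging (it goes to stderr by default)


def serve_until_feedback(port: int = 3118) -> None:
    """Block until one feedback POST arrives, then return."""
    server = HTTPServer(("127.0.0.1", port), FeedbackHandler)
    try:
        server.serve_forever()
    finally:
        server.server_close()
```

Calling `serve_until_feedback()` blocks the calling process, which mirrors how the AI's bash command waits on the customer's click.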

Flow at each review-stage dashboard (Plan, Generate, Interpret):

Complete the stage's analysis.

Write stage data to a JSON file (e.g., stage-1-data.json).

Launch with --serve mode. The AI's bash blocks until the customer clicks Approve or Regenerate:

python "$(ls ~/.claude/skills/eval-guide/dashboard/serve.py 2>/dev/null || ls ~/.claude/plugins/cache/*/eval-guide/*/skills/eval-guide/dashboard/serve.py 2>/dev/null | head -1)" --stage <name> --serve --data <file>.json

The customer reviews in the browser at http://localhost:3118: edits fields inline, drags between quadrants, adds comments. Edits auto-save to the localhost server.

When the customer clicks Approve & Continue or Incorporate Changes & Regenerate, the browser POSTs the feedback to /api/feedback. The server captures it, prints the feedback JSON to stdout between marker lines, and shuts down. No file is downloaded; the customer never moves anything.

Parse the feedback from the bash command's stdout — look for the block:

```
===EVAL_GUIDE_FEEDBACK_BEGIN===
{ "stage": "...", "status": "confirmed" | "changes_requested", "edits": {...}, "comments": "..." }
===EVAL_GUIDE_FEEDBACK_END===
```

Decode the JSON between those markers — that's the customer's feedback. (<stage>-feedback.json is also written next to the data file as a debugging backup, but stdout is the primary channel — read from there.)

If status: "confirmed" → apply the edits, generate final deliverables (docx, CSV), proceed to next stage.

If status: "changes_requested" → apply the edits, regenerate the stage data file, re-launch the dashboard. Same loop.
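The parse-and-branch steps above can be sketched as follows. This is an illustration, not the skill's actual code: the function names and the returned action labels are hypothetical, and the sample stdout (including its first log line) is made up for the demo.

```python
import json
import re

BEGIN = "===EVAL_GUIDE_FEEDBACK_BEGIN==="
END = "===EVAL_GUIDE_FEEDBACK_END==="


def extract_feedback(stdout: str) -> dict:
    """Return the JSON object printed between the marker lines."""
    pattern = re.escape(BEGIN) + r"\s*(.*?)\s*" + re.escape(END)
    match = re.search(pattern, stdout, flags=re.DOTALL)
    if match is None:
        raise ValueError("no feedback block found in stdout")
    return json.loads(match.group(1))


def next_action(feedback: dict) -> str:
    """Map the feedback status to the next step in the loop (labels are hypothetical)."""
    status = feedback.get("status")
    if status == "confirmed":
        return "apply_edits_then_generate_deliverables"
    if status == "changes_requested":
        return "apply_edits_then_relaunch_dashboard"
    raise ValueError(f"unexpected status: {status!r}")


# Made-up stdout capture: one unrelated log line, then the marked feedback block.
sample = "\n".join([
    "server started",
    BEGIN,
    '{"stage": "plan", "status": "changes_requested", "edits": {}, "comments": "merge dimensions"}',
    END,
])
feedback = extract_feedback(sample)
print(next_action(feedback))  # apply_edits_then_relaunch_dashboard
```

Searching for the markers rather than parsing all of stdout keeps the parser robust to any other lines the command prints.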

The orient stage is a pre-built static HTML (dashboard/orient-dashboard.html) — agent-agnostic, no serve.py, no JSON write, no feedback file. The skill simply opens the file in the customer's browser and continues the conversation. See Session Start: Orient below.

Stages with dashboards: Discover (0), Plan (1), Generate (2), Interpret (4). Stage 3 (Run) executes tests directly.

Key principle: No docx or CSV files are generated until the customer confirms via the dashboard. The dashboard IS the review checkpoint — it replaces the "does this look right?" chat-based confirmation with a structured, visual review.

Before You Start

Start from wherever the customer is. Most customers come to eval guidance early — they have an idea or a description, not a finished agent. That's exactly right. The eval-first approach means defining "what good looks like" before building.

Ask: "Tell me about the agent you're building or planning to build. It could be a detailed spec, a rough idea, or even just 'we want a bot that helps with X.' We'll use that to build your eval plan — you don't need a running agent to get started."

If they have an idea or description (most common): Proceed directly to Stage 0 (Discover). The conversation will help them articulate their agent's purpose, users, boundaries, and success criteria — this becomes their eval spec.

If they already have a running Copilot Studio agent: Offer to connect to it for richer context: "Since you have a running agent, I can pull its configuration directly to inform the eval plan. Want to share your tenant ID so I can connect?" If yes, use /clone-agent to import the agent's topics, knowledge sources, and configuration. Use this to pre-fill the Agent Vision in Stage 0.

If they already have eval results: Route directly to Stage 4 (Interpret).

The key message: Writing evals early makes the agent better. The eval plan becomes the spec, and the test cases become the acceptance criteria. Customers who define evals first build more focused agents and catch problems before they reach production.

Session Start: Orient

Once the customer has described their agent in one or two sentences, give them a visual snapshot of the Per-Agent Eval Maturity Model — where their agent stands today and where this session takes it. This is the orientation moment, and it sets the frame for everything that follows.

What to do

The orient dashboard is pre-built and shipped with the skill — dashboard/orient-dashboard.html. It is identical for every agent (the maturity model and "what you walk away with" are agent-agnostic), so there is no per-session JSON write and no Python launch. Don't ask for the agent name yet — Stage 0 captures it where it's actually needed for deliverable filenames.

Open the static dashboard in the customer's default browser. Use the OS launcher and the install-resolved path:

```
ORIENT_HTML="$(ls ~/.claude/skills/eval-guide/dashboard/orient-dashboard.html 2>/dev/null || ls ~/.claude/plugins/cache/*/eval-guide/*/skills/eval-guide/dashboard/orient-dashboard.html 2>/dev/null | head -1)"
case "$(uname -s 2>/dev/null)" in
  Darwin) open "$ORIENT_HTML" ;;
  Linux) xdg-open "$ORIENT_HTML" ;;
  *) cmd.exe /C start "" "$ORIENT_HTML" ;; # Windows / Git Bash
esac
```

The ls ... | head -1 fallback resolves the file regardless of install location — user-global skills first (~/.claude/skills/eval-guide/), plugin-cache second.

For dev installs (skill checked out at an arbitrary path, not in ~/.claude/), the AI should know the absolute path of the SKILL.md it's reading and substitute <SKILL.md-dir>/dashboard/orient-dashboard.html.
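The same lookup order can be expressed in Python, which may be easier to test than the shell one-liner. This is a sketch under assumptions: the function name and parameters are hypothetical, and treating the dev checkout as the last fallback is my reading of the text rather than documented behavior.

```python
from pathlib import Path
from typing import List, Optional


def resolve_skill_file(relative: str,
                       home: Optional[Path] = None,
                       dev_skill_dir: Optional[Path] = None) -> Optional[Path]:
    """Find a file shipped with the skill, checking install locations in order."""
    home = home or Path.home()
    candidates: List[Path] = []
    # 1. User-global skills directory.
    candidates.append(home / ".claude" / "skills" / "eval-guide" / relative)
    # 2. Plugin cache; sorted() mimics the first-match behavior of `ls ... | head -1`.
    cache = home / ".claude" / "plugins" / "cache"
    if cache.is_dir():
        candidates.extend(sorted(cache.glob(f"*/eval-guide/*/skills/eval-guide/{relative}")))
    # 3. Dev checkout: the directory that holds the SKILL.md being read (assumed fallback).
    if dev_skill_dir is not None:
        candidates.append(dev_skill_dir / relative)
    for candidate in candidates:
        if candidate.is_file():
            return candidate
    return None
```

A call such as `resolve_skill_file("dashboard/orient-dashboard.html")` returns `None` when no install is found, so the caller can report a clear error instead of opening an empty path.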

This is a read-only stage. There is no feedback file, no confirmation gate, and no serve.py involvement. The customer reviews the snapshot in the browser while the conversation continues in chat.

While the dashboard is open, narrate one sentence in chat: "This is the eval maturity model — five pillars of eval practice, five levels each. Today's session takes Pillars 1, 2, and 4 to L300 Systematic; Pillars 3 and 5 reach L200 Defined via the reference protocols you'll get at the end."

Proceed to Stage 0 (Discover) without waiting. The dashboard is informational.

When to rebuild the static HTML: if templates/orient.html, templates/base.html, or examples/stage-orient-data.json change, run python dashboard/build-orient.py once and check in the regenerated orient-dashboard.html. The build script reuses serve.py's generate_html, so the rendering stays consistent with the live dashboards.

Why this matters for the customer: The maturity model is the value moment. Without it, the customer sees a series of stages with no map. With it, they understand exactly what they're getting and what comes next — the eval-first message lands because they can see the full journey.

Skip orient when: the customer has already done a session with the toolkit and is returning for a Stage 1 / Stage 2 / Stage 4 jump-in. Don't re-orient someone who already has the map.

How to Route

| Customer says... | Start at |
| --- | --- |
| "We're planning to build an agent for..." | Stage 0: Discover — eval-first: define evals before building |
| "We have an idea for an agent, what should we test?" | Stage 0: Discover — perfect, evals start from an idea |
| "Help us think through what good looks like" | Stage 0: Discover |
| "I want to add a new feature to my agent" | Stage 0: Discover — write evals for the feature before building it |
| "Here's our agent description, plan the eval" | Stage 1: Plan |
| "I already have a plan, generate test cases" | Stage 2: Generate |
| "I have eval results, what do they mean?" | Stage 4: Interpret |

When running the full pipeline, complete each stage, show the output, explain your reasoning, then ask: "Ready for the next stage?"

Eval Maturity Journey

Use the Per-Agent Eval Maturity Model to orient customers on where they are today and where this session takes them. Five pillars of eval practice, five levels each — from L100 Initial (no practice in place) to L500 Optimized (continuous improvement built into operations). Assume the agent starts at L100 Initial on all pillars. This session targets L300 Systematic on Pillars 1, 2, and 4 (in-session deliverables) and L200 Defined on Pillars 3 and 5 (via reference protocols delivered alongside the session).

The full 5×5 definitions live in maturity-model.md — that file is the canonical reference. Update it first when level definitions change.

| Pillar | What it measures | After this session | Mechanism |
| --- | --- | --- | --- |
| 1 — Define what "good" means | Acceptance criteria quality | L300 Systematic ✓ | Stage 0 (Discover) + Stage 1 (Plan) |
| 2 — Build your eval sets | Coverage and versioning | L300 Systematic ✓ | Stage 2 (Generate) |
| 3 — Run evals across the lifecycle | Where and when evals execute (offline, pre-deploy, production) | L200 Defined ✓ | `rerun-protocol-<agent>-<date>.docx` (starter artifact) |
| 4 — Improve and iterate | How improvements are validated | L300 Systematic ✓ | Stage 4 (Interpret) — only if eval results are available |
| 5 — Handle changes with confidence | How changes (prompts, tools, models, architecture) get tested before shipping | L200 Defined ✓ | `baseline-comparison-<agent>-<date>.xlsx` (starter artifact) |

Pillars 3 and 5 stop at L200 Defined this session. L300 Systematic on those pillars requires operating practice — a release cadence with codified triggers (Pillar 3) and version-tagged baselines accumulated over multiple changes (Pillar 5). The starter artifacts get the customer to L200 in one session: a documented protocol and a fill-in workbook they can execute when triggered. Generate rerun-protocol-<agent>-<date>.docx and baseline-comparison-<agent>-<date>.xlsx at the end of Stage 2 (see deliverables C and D in Stage 2's "After confirmation" block).
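As a small illustration, the per-pillar targets and the two starter-artifact filenames could be derived like this. The slug rule (lowercase, spaces to hyphens) and the ISO date format are assumptions; the text only fixes the `<agent>-<date>` pattern, and the function name is hypothetical.

```python
from datetime import date
from typing import Dict

# Session targets per pillar, as stated in the maturity table.
SESSION_TARGETS: Dict[int, str] = {
    1: "L300 Systematic",
    2: "L300 Systematic",
    3: "L200 Defined",
    4: "L300 Systematic",
    5: "L200 Defined",
}


def starter_artifacts(agent: str, on: date) -> Dict[int, str]:
    """Return the starter-artifact filename for each L200 pillar."""
    slug = agent.strip().lower().replace(" ", "-")  # assumption: slug format
    stamp = on.isoformat()                          # assumption: YYYY-MM-DD
    return {
        3: f"rerun-protocol-{slug}-{stamp}.docx",
        5: f"baseline-comparison-{slug}-{stamp}.xlsx",
    }


print(starter_artifacts("HR Helper", date(2026, 6, 5))[3])
# rerun-protocol-hr-helper-2026-06-05.docx
```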

Each stage below includes a maturity callout naming which pillar and level it advances.

How This Maps to Microsoft's Official Evaluation Framework

Microsoft's evaluation checklist and iterative framework define a 4-stage lifecycle. Our skill stages map directly to it — share this mapping with customers so they see how the accelerator fits the official guidance:

| Microsoft's 4 stages | What it means | Our skill stages | Other eval skills |
| --- | --- | --- | --- |
| Stage 1: Define — Create foundational test cases with clear acceptance criteria | Translate agent scenarios into testable components before you even have a working agent | Stage 0 (Discover) + Stage 1 (Plan) + Stage 2 (Generate) | eval-suite-planner, eval-generator |
| Stage 2: Baseline — Run tests, measure, enter the evaluate→analyze→improve loop | Establish quantitative baseline, categorize failures by quality signal, iterate | Stage 3 (Run) + Stage 4 (Interpret) | eval-result-interpreter |
| Stage 3: Expand — Add variation, architecture, and edge-case test categories | Build comprehensive suite: High Value · Low Risk (regression), Variations (generalization), Architecture (diagnostic), Edge cases (robustness) | Repeat Stage 1–2 with broader categories | eval-suite-planner (expansion sets) |
| Stage 4: Operationalize — Establish cadence, triggers, continuous monitoring | Run core on every change, full suite weekly + before releases, track quality signals over time | Stage 4 (Interpret) ongoing | eval-triage-and-improvement |

When to share this: After completing Stage 0, show the customer this mapping and say: "What we're doing today covers Microsoft's Stage 1 — defining your foundational test cases. Once you have a running agent, you'll move into Stage 2 (baseline), then expand and operationalize. The checklist template helps you track progress."

Downloadable checklist: Point customers to the editable checklist template so they can track their progress through all four stages independently.

Stage 0: Discover

Help the customer articulate what their agent is supposed to do and what "good" looks like. This is the most important stage — it shapes everything downstream.

What you walk away with

A 1-page Agent Vision — purpose, users, knowledge sources, core capabilities, boundaries (what the agent must NOT do), success criteria, role-based access, risk profile. Written down, not assumed.

Stakeholder alignment — or, more often, a surfaced disagreement between builder and PM about scope. 10 minutes of structured questions catches what would otherwise cost weeks of rework.

The spec every later stage depends on. Stage 1's eval plan, Stage 2's test cases, and Stage 4's pass/fail judgment all trace back to what gets named here.

When this stage is wrong for you

You already have a written PRD, agent spec, or design doc that covers all 7 questions below. Bring it and skip to Stage 1.

You have eval results in hand and need triage now — go straight to Stage 4.

Your agent is a 50-topic monster. One Stage 0 pass won't fit; run Stage 0 per top-level capability.

What to do — extract Vision, apply safe defaults, proceed to Stage 1

Don't ask Q1–Q7 in chat. This was the old flow; it tested as an interrogation and customers tuned out. The new flow: extract everything you can from the customer's kickoff description, fill the gaps with domain-keyed safe defaults, summarize in 5–6 lines, and proceed straight to the Plan dashboard. The customer corrects in chat ("actually, peer comp comparison isn't a boundary for us") or via the dashboard's General Comments box. Nothing is locked until they confirm in the dashboard.

Step 1 — Pre-extract from the kickoff
From the customer's 1–4 sentence description, extract:

Purpose — usually the first clause ("Personalized HR support…")

Users — usually implied ("employees," "customers," "internal teams")

Capabilities — usually a list ("benefits, training, policies")

Knowledge sources — sometimes named, often categorized ("official company resources" → SharePoint TBD)

Tone hints — sometimes explicit ("trusted HR colleague," "efficient")

Personalization hints — words like "personalized," "your," "based on your role"

If the kickoff is too thin (one sentence with no domain hint), ask one clarifying question — "Two more sentences on what it does and who uses it would help me draft a Vision faster" — then resume.

Step 2 — Apply safe defaults by domain
Domain detection runs on keywords in the kickoff description. Pick the matching default set:

| Domain trigger keywords | Default boundaries (what NOT to do) | Default risk profile |
| --- | --- | --- |
| HR / ESS / employee / benefits / policy / leave / payroll | Legal advice; medical advice; salary negotiation; performance review interpretation; HR investigation details; peer compensation comparison; PII about other employees | HIGH (privacy + regulated content) |
| Customer support / refunds / billing / accounts | Refunds beyond policy; account-specific data outside this user's scope; legal-binding promises; competitor product recommendations | HIGH (customer trust + financial) |
| Knowledge / documentation / FAQ / wiki | Content beyond the named knowledge sources; opinions framed as facts; regulated advice (legal/medical/financial) | MEDIUM (defaults higher if regulated content domain) |
| IT / helpdesk / troubleshooting | Remote-execute actions on user systems; reset credentials without verification; security advice that bypasses policy | MEDIUM (HIGH if security/privacy adjacent) |
| Agentic / tool-using / "submits" / "schedules" / "books" | Irreversible actions without confirmation; actions outside user's authorization scope; anything requiring approval the agent can't get | HIGH (writes to systems) |
| No domain detected | "Outside the named knowledge sources" + "anything the user-cohort isn't authorized for" + 1 generic safety guardrail | MEDIUM (default cautious) |

Default success criteria (always include unless customer overrides):

Most user questions answered directly (deflection / self-service rate)

Out-of-scope questions routed clearly to the right human or resource (graceful handoff)

Zero privacy / boundary breaches

Default knowledge sources when only categorized:

"some SharePoint sites" / "internal docs" → flag as Multiple SharePoint sites (TBD — name in Plan dashboard) so the customer can fill names without us blocking on it.

Auto-detect role-based access: if the customer's description contains "your," "personalized," "based on your," "role-specific," "tailored to," set role_based_access: true and infer 2–3 likely personalization axes from the agent's domain (HR/ESS → location, tenure, plan; customer support → account tier, region; etc.). Customer corrects if wrong.
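Keyword-driven domain detection and the role-based-access check might look like the sketch below. It is only an illustration: the short domain labels, whole-word matching, case-sensitive handling of acronyms such as "IT", and "first matching row wins" are all assumptions, since the text does not say how ties or partial matches are resolved.

```python
import re
from typing import Tuple

# (label, trigger keywords, default risk) in table order; labels are shorthand.
DOMAINS = [
    ("HR / ESS", ["HR", "ESS", "employee", "benefits", "policy", "leave", "payroll"], "HIGH"),
    ("Customer support", ["customer support", "refunds", "billing", "accounts"], "HIGH"),
    ("Knowledge", ["knowledge", "documentation", "FAQ", "wiki"], "MEDIUM"),
    ("IT / helpdesk", ["IT", "helpdesk", "troubleshooting"], "MEDIUM"),
    ("Agentic", ["agentic", "tool-using", "submits", "schedules", "books"], "HIGH"),
]

# Phrases listed in the text as signals of personalization.
ROLE_PHRASES = ["your", "personalized", "based on your", "role-specific", "tailored to"]


def _mentions(text: str, phrase: str) -> bool:
    """Whole-word match; all-caps acronyms are matched case-sensitively (assumption)."""
    flags = 0 if phrase.isupper() else re.IGNORECASE
    return re.search(rf"\b{re.escape(phrase)}\b", text, flags) is not None


def detect_domain(kickoff: str) -> Tuple[str, str]:
    """Return (domain label, default risk); first matching row wins (assumption)."""
    for label, keywords, risk in DOMAINS:
        if any(_mentions(kickoff, keyword) for keyword in keywords):
            return label, risk
    return "No domain detected", "MEDIUM"


def detect_role_based_access(kickoff: str) -> bool:
    return any(_mentions(kickoff, phrase) for phrase in ROLE_PHRASES)


kickoff = "Personalized HR support for employees: benefits, training, policies"
print(detect_domain(kickoff))             # ('HR / ESS', 'HIGH')
print(detect_role_based_access(kickoff))  # True
```

Whole-word matching means plurals such as "employees" need their own entries; a production version would likely want stemming or a richer classifier.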

Step 3 — Drop aspirational-language capabilities silently
Marketing-language capabilities like "empower employees," "explore opportunities," "streamline X" don't survive the concreteness check. Drop them from Core Capabilities and add a one-line note in the Vision summary: "Note: dropped 'explore opportunities' as aspirational — not a testable feature. Tell me if it's actually a concrete capability and I'll add it back."

This is silent removal with a flagged note, not a question. Customer can flag if they disagree.
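One way to picture the "drop with a flagged note" behavior is the sketch below. The verb list holds only the three examples given in the text, and matching on the first word is an assumed heuristic; the skill itself presumably relies on the model's judgment rather than a fixed list.

```python
from typing import List, Tuple

# Verbs taken from the examples in the text; a real list would be longer (assumption).
ASPIRATIONAL_VERBS = ("empower", "explore", "streamline")


def drop_aspirational(capabilities: List[str]) -> Tuple[List[str], List[str]]:
    """Split capabilities into (kept, notes), one note per dropped capability."""
    kept: List[str] = []
    notes: List[str] = []
    for capability in capabilities:
        words = capability.strip().split()
        first_word = words[0].lower() if words else ""
        if first_word in ASPIRATIONAL_VERBS:
            notes.append(
                f"Note: dropped '{capability}' as aspirational (not a testable feature). "
                "Tell me if it's actually a concrete capability and I'll add it back."
            )
        else:
            kept.append(capability)
    return kept, notes


kept, notes = drop_aspirational(
    ["answer benefits questions", "explore opportunities", "Empower employees"]
)
print(kept)  # ['answer benefits questions']
```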

Step 4 — Show the Vision summary in chat (5–6 lines, no questions)
Display the pre-extracted Vision compactly:

```
Agent Vision: [Name]

Purpose: [one sentence from kickoff]
Users: [extracted or default]
Knowledge: [named sources, or "TBD — confirm in Plan dashboard"]
Capabilities: [3–5 from kickoff, aspirational dropped]
Boundaries: [domain default set, listed]
Success: [default 3 criteria]
Role-based: [auto-detected: yes/no, with axes]
Risk profile: [domain default: HIGH/MEDIUM/LOW]
```

Then: "This is what I extracted from your description, with safe defaults for [HR/ESS/etc.] domain agents filling the gaps. Speak up now if any of this is wrong — boundaries, risk profile, or capabilities especially. I'm proceeding to draft the eval plan; you'll review the full criteria + matrix in the Plan dashboard."

Don't gate on customer confirmation. Write stage-0-data.json and proceed to Stage 1 immediately. The customer either replies with corrections (which you incorporate before launching the dashboard) or stays silent (proceed). The Plan dashboard is the real review surface.
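Rendering the summary and writing stage-0-data.json can be sketched as below. The field names mirror the summary template, but the actual JSON schema that serve.py expects is not shown in this document, so the `AgentVision` dataclass, its keys, and both function names are assumptions.

```python
import json
from dataclasses import asdict, dataclass, field
from pathlib import Path
from typing import List


@dataclass
class AgentVision:
    """Hypothetical container; keys mirror the summary template, not a known schema."""
    name: str
    purpose: str
    users: str
    knowledge: str
    capabilities: List[str]
    boundaries: List[str]
    success: List[str]
    role_based_access: bool
    risk_profile: str
    personalization_axes: List[str] = field(default_factory=list)


def render_summary(v: AgentVision) -> str:
    """Produce the compact chat summary, one line per Vision field."""
    role = "no"
    if v.role_based_access:
        axes = ", ".join(v.personalization_axes)
        role = f"yes ({axes})" if axes else "yes"
    return "\n".join([
        f"Agent Vision: {v.name}",
        "",
        f"Purpose: {v.purpose}",
        f"Users: {v.users}",
        f"Knowledge: {v.knowledge}",
        f"Capabilities: {', '.join(v.capabilities)}",
        f"Boundaries: {', '.join(v.boundaries)}",
        f"Success: {', '.join(v.success)}",
        f"Role-based: {role}",
        f"Risk profile: {v.risk_profile}",
    ])


def write_stage_data(v: AgentVision, path: str = "stage-0-data.json") -> None:
    """Persist the Vision so the next stage can read it."""
    Path(path).write_text(json.dumps(asdict(v), indent=2), encoding="utf-8")
```

Keeping the Vision in one structure means chat corrections only need to update a field and re-run `write_stage_data` before the Plan dashboard launches.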

Why this works

Pre-extraction + defaults covers ~80% of what the chat questions extracted, with zero customer chat input beyond the kickoff.

Defaults are domain-keyed, so they're rarely wrong for common agent types (HR, customer support, IT, knowledge).

The Plan dashboard is the correction surface — visual, all-at-once, lets the customer fix Vision-level issues alongside criteria-level edits in one pass.

Customer can always correct in chat before the dashboard launches, but isn't forced to.

When this approach is wrong (revert to gap-question batch)

The kickoff description is genuinely too thin — one sentence with no domain keywords. Ask one clarifying question to get enough material for safe defaults.

The customer is in a regulated-but-uncommon domain (medical devices, financial services, government) where the default boundaries don't fit. After step 2, ask: "Domain looks like [X] — your boundaries are usually [Y]. Anything specific I should add for your context?"

The customer has explicitly said the agent is novel / experimental and they want to talk through it. Default to conversation mode for these — but they're a small minority.

Stage 1: Plan

Using the Agent Vision, produce a structured eval suite plan. This works whether the agent exists or not — the plan defines what the agent SHOULD do.

What you walk away with

10–15 acceptance criteria phrased as "The agent should…" (or "should NOT…" for negative tests). Testable, prioritizable, reviewable.

Each criterion placed on a Value × Risk matrix — High Value · High Risk (highest investment), High Value · Low Risk (expected behavior, occasional misses tolerable), Low Value · High Risk (low traffic, zero tolerance for failure), Low Value · Low Risk (light coverage). The matrix is what keeps the plan tractable.

Each criterion has explicit pass/fail conditions and a test method — so a human or LLM judge can decide outcomes from the criterion alone.

A .docx eval plan for stakeholder review (PM, security, business owner). The artifact for sign-off.
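To make the Value × Risk placement concrete, here is a small sketch that groups criteria by quadrant and attaches the coverage guidance quoted above. The dictionary-based criterion shape and the value/risk ratings in the example are illustrative assumptions, not output of the skill.

```python
from collections import defaultdict
from typing import Dict, List

# Coverage guidance per (value, risk) quadrant, as described in the text.
COVERAGE = {
    ("high", "high"): "highest investment",
    ("high", "low"): "expected behavior, occasional misses tolerable",
    ("low", "high"): "low traffic, zero tolerance for failure",
    ("low", "low"): "light coverage",
}


def quadrant_label(value: str, risk: str) -> str:
    return f"{value.capitalize()} Value · {risk.capitalize()} Risk"


def group_by_quadrant(criteria: List[dict]) -> Dict[str, List[str]]:
    """Group criterion statements under their quadrant label."""
    grouped = defaultdict(list)
    for item in criteria:
        key = (item["value"], item["risk"])
        if key not in COVERAGE:
            raise ValueError(f"value and risk must be 'high' or 'low', got {key}")
        grouped[quadrant_label(*key)].append(item["criterion"])
    return dict(grouped)


# Example ratings are made up for illustration.
plan = [
    {"criterion": "The agent should return the correct PTO days for the employee's office and tenure.",
     "value": "high", "risk": "low"},
    {"criterion": "The agent should refuse salary or compensation queries and direct the user to HR.",
     "value": "low", "risk": "high"},
]
for label, statements in group_by_quadrant(plan).items():
    print(label, len(statements))
```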

When this stage is wrong for you

You already have written acceptance criteria covering all four Value × Risk quadrants. Bring them and skip to Stage 2.

You're testing a single new feature on an existing agent. Run a mini Stage 1 on just that feature; don't redo the whole plan.

Your agent has 50+ topics. Run Stage 1 per top-level capability; one pass won't fit.

What to do

Determine eval depth from agent architecture:

Different agent architectures require different eval layers. Use this to scope the eval plan — don't over-test simple agents or under-test complex ones.

| Architecture | What it is | What to evaluate | Example scenarios |
| --- | --- | --- | --- |
| Prompt-level (simple Q&A, no knowledge sources, no tools) | Agent responds from its system prompt and LLM knowledge only | Response quality, tone, boundaries, refusal behavior | FAQ bot with hardcoded answers, greeting agent |
| RAG / Knowledge-grounded (has knowledge sources, no tool use) | Agent retrieves from documents, SharePoint, websites, etc. | Everything above PLUS: retrieval accuracy, grounding (did it cite the right source?), hallucination prevention, completeness | HR policy bot, IT knowledge base agent |
| Agentic (multi-step, tool use, orchestration) | Agent calls APIs, uses connectors, makes decisions, chains actions | Everything above PLUS: tool selection accuracy, action correctness, error recovery, multi-turn context retention, task completion rate | Expense submission agent, incident triage bot, booking agent |

Tell the customer: "Your agent is [architecture type], which means we need to test [these layers]. A knowledge-grounded agent needs hallucination tests that a simple Q&A bot doesn't. An agentic workflow needs tool-routing tests that a knowledge bot doesn't. This scopes your eval so you're testing what actually matters."

Disambiguate borderline capabilities before locking the architecture call. Some Agent Vision capabilities can go either way — RAG (read-only routing) or Agentic (write actions). When the Vision contains any of these phrasings, ask the customer explicitly before classifying:

| Ambiguous Vision phrasing | The disambiguation question |
| --- | --- |
| "Help update personal info" / "Update settings" / "Edit profile" | Does the agent take the action (calls an API/connector to write), or just tell the user where to do it themselves? |
| "Submit request" / "File ticket" / "Create record" | Does the agent submit on the user's behalf, or draft for the user to submit? |
| "Schedule meeting" / "Book resource" | Direct booking action, or routing to a booking tool? |
| "Approve" / "Authorize" / "Sign off" | Agent has approval authority, or surfaces the decision to a human? |
| "Send email" / "Notify" / "Message" | Agent writes/sends, or drafts for user review? |

If write → Agentic (add Tool Invocation + action-correctness criteria). If route-only → RAG (no Tool Invocation criteria needed). Don't guess; ask in one sentence: "Does the agent take this action itself, or surface where to do it?"

Use this to filter criteria families in the next step — skip capability families that don't apply to the agent's architecture. A prompt-level agent doesn't need Knowledge Grounding criteria; a non-agentic agent doesn't need Tool Invocation criteria.
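The "everything above PLUS" rule is cumulative, which a short sketch makes explicit. The layer lists come from the architecture table; the identifiers, the two-flag classifier, and the function names are hypothetical simplifications.

```python
from typing import List

# Ordered from simplest to most complex; each tier adds to the ones before it.
ARCHITECTURES = ["prompt-level", "rag", "agentic"]

LAYERS = {
    "prompt-level": ["response quality", "tone", "boundaries", "refusal behavior"],
    "rag": ["retrieval accuracy", "grounding", "hallucination prevention", "completeness"],
    "agentic": ["tool selection accuracy", "action correctness", "error recovery",
                "multi-turn context retention", "task completion rate"],
}


def classify(has_knowledge_sources: bool, takes_write_actions: bool) -> str:
    """Write actions imply Agentic; read-only retrieval implies RAG."""
    if takes_write_actions:
        return "agentic"
    if has_knowledge_sources:
        return "rag"
    return "prompt-level"


def eval_layers(architecture: str) -> List[str]:
    """Return the layers for this tier plus every simpler tier."""
    tier = ARCHITECTURES.index(architecture)
    layers: List[str] = []
    for name in ARCHITECTURES[: tier + 1]:
        layers.extend(LAYERS[name])
    return layers


print(len(eval_layers(classify(has_knowledge_sources=True, takes_write_actions=False))))  # 8
```

The `takes_write_actions` flag is exactly what the disambiguation question settles: the same phrasing lands in RAG or Agentic depending on the customer's answer.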

Identify the families of acceptance criteria to write:

Acceptance criteria come in two families — functional (what the agent should do for users) and capability (how well it should do it). Use the table to pick the families that apply.

| If the agent... | Functional criteria | Capability criteria |
| --- | --- | --- |
| Answers questions from knowledge sources | Information Retrieval | Knowledge Grounding + Compliance |
| Executes tasks via APIs/connectors | Request Submission | Tool Invocations + Safety |
| Walks users through troubleshooting | Troubleshooting | Knowledge Grounding + Graceful Failure |
| Guides through multi-step processes | Process Navigation | Trigger Routing + Tone & Quality |
| Routes conversations to teams/departments | Triage & Routing | Trigger Routing + Graceful Failure |
| Handles sensitive data | (add to whichever applies) | Safety + Compliance |
| All agents (always include) | — | Red-Teaming |

Explain your picks AND your skips. This is a pedagogy moment, not a checklist. Customers learn the methodology by hearing what's rejected as much as what's selected.

Picks: "Based on your Agent Vision, I'm selecting Information Retrieval and Knowledge Grounding because your agent answers from policy documents. I'm also including Red-Teaming — every agent needs adversarial testing."

Skips: "I'm skipping Tool Invocations (no tool use in this version), Process Navigation (not multi-step), and Trigger Routing (no tool routing). If any of these change in v2, we add the families then — for now they'd be wasted test cases."

The "Skipping: ... because ..." narration is mandatory whenever a family is excluded. It signals to the customer that the eval scope is fitted to their agent, not a generic kitchen sink. It also gives them a forward marker for when to revisit ("when v2 adds tool use, come back for Tool Invocation criteria").
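Selecting families and producing the skip narration can be sketched from the table above. The trait identifiers and the wording of the generated sentence are hypothetical; in a real session the "because" clause would name the customer's actual situation.

```python
from typing import Dict, List, Set, Tuple

# Table rows: trait -> (functional family or None, capability families).
FAMILY_TABLE = {
    "answers_from_knowledge": ("Information Retrieval", ["Knowledge Grounding", "Compliance"]),
    "executes_tasks": ("Request Submission", ["Tool Invocations", "Safety"]),
    "troubleshooting": ("Troubleshooting", ["Knowledge Grounding", "Graceful Failure"]),
    "multi_step_process": ("Process Navigation", ["Trigger Routing", "Tone & Quality"]),
    "routes_conversations": ("Triage & Routing", ["Trigger Routing", "Graceful Failure"]),
    "sensitive_data": (None, ["Safety", "Compliance"]),
}
ALWAYS_INCLUDE = ["Red-Teaming"]


def _families(row: Tuple) -> List[str]:
    functional, capability = row
    return ([functional] if functional else []) + capability


def select_families(traits: Set[str]) -> Tuple[List[str], Dict[str, str]]:
    """Return (picks, skips); skips maps each excluded family to the missing trait."""
    picks = list(ALWAYS_INCLUDE)
    for trait, row in FAMILY_TABLE.items():
        if trait in traits:
            picks.extend(f for f in _families(row) if f not in picks)
    skips: Dict[str, str] = {}
    for trait, row in FAMILY_TABLE.items():
        for family in _families(row):
            if family not in picks:
                skips.setdefault(family, trait)
    return picks, skips


def narrate_skips(skips: Dict[str, str]) -> List[str]:
    return [f"Skipping: {family} because this version has no '{trait}' behavior."
            for family, trait in skips.items()]


picks, skips = select_families({"answers_from_knowledge"})
print(picks)
# ['Red-Teaming', 'Information Retrieval', 'Knowledge Grounding', 'Compliance']
```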

Write acceptance criteria:

Each criterion is a single testable statement starting with "The agent should…" (or "The agent should NOT…" for negative tests). Write 10–15 criteria across the families identified above.

Criteria plan table:

| # | Acceptance Criterion | Quality Dimension | Method |
| --- | --- | --- | --- |

Quality dimension naming — keep it broad. Aim for 4–6 dimensions, not 8–12.

Customers fragment dimensions when the AI does. "Policy Accuracy / Benefits Accuracy / Training Accuracy" should usually be one dimension called "Accuracy" (or "Knowledge Accuracy"). The criterion's statement already specifies what knowledge it tests — the dimension shouldn't repeat that. Consolidate aggressively.

Default dimension set for most agents:

| Dimension | What it groups |
| --- | --- |
| Accuracy | All factual correctness criteria, regardless of which knowledge source they hit (policy, benefits, training, FAQs, etc.) |
| Grounding | Citation enforcement; agent references the source it used |
| Hallucination Prevention | "The agent should NOT invent facts not in sources" criteria |
| Routing | Out-of-scope handling, escalation, handoff to right resource |
| Tone | Tone, empathy, brand voice, persona criteria |
| Boundaries / Safety | Refusals on legal / medical / privacy / regulated topics — pair with Low Value · High Risk quadrant |
| Adversarial / Red-Teaming | Jailbreak resistance, boundary probing |
| Personalization (if role-based access) | Role/cohort-specific behavior |

Most agents need 4–6 of these, not all 8. Don't create per-source or per-topic dimensions (e.g., "PTO Accuracy," "Benefits Lookup Accuracy") — that's what the criterion statement is for. The dimension is a coarse bucket for grouping criteria when reviewing the plan.

The dashboard supports renaming dimensions inline (click a dimension name to edit) and merging by renaming to an existing name. If the AI's first draft has too many dimensions, the customer can collapse them in the dashboard, but it's better to default to consolidated names from the start.

Good criteria are:

Behavior-led — start with "The agent should…" and describe an observable behavior

Verifiable — pass or fail can be judged by comparing a response to the criterion

Scoped — one behavior per criterion; don't stack multiple checks into one row

Grounded — tied to the Agent Vision (a capability, boundary, knowledge source, or user cohort)
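Only part of this checklist can be checked mechanically. The sketch below covers the two rules that are purely structural (the "The agent should" opener and the 10–15 plan size); whether a criterion is verifiable, scoped, and grounded still needs a human or LLM reviewer. The function names are hypothetical.

```python
from typing import List

PREFIX = "The agent should"
NEGATIVE_PREFIX = "The agent should NOT"


def is_negative_test(criterion: str) -> bool:
    return criterion.startswith(NEGATIVE_PREFIX)


def lint_plan(criteria: List[str]) -> List[str]:
    """Return a list of structural issues; an empty list means none were found."""
    issues: List[str] = []
    if not 10 <= len(criteria) <= 15:
        issues.append(f"plan has {len(criteria)} criteria; expected 10 to 15")
    for number, criterion in enumerate(criteria, start=1):
        if not criterion.startswith(PREFIX):
            issues.append(f'#{number} does not start with "{PREFIX}"')
    return issues


draft = [
    "The agent should refuse salary or compensation queries and direct the user to HR.",
    "Answers PTO questions correctly.",
]
for issue in lint_plan(draft):
    print(issue)
# plan has 2 criteria; expected 10 to 15
# #2 does not start with "The agent should"
```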

Examples:

"The agent should return the correct PTO days for the employee's office and tenure, with a citation to the source policy."

"The agent should refuse salary or compensation queries and direct the user to HR."

"The agent should

SKILL.md source

---
name: eval-guide
description: Eval enablement accelerator — help customers think through "what does good look like" for their AI agent, then generate a structured eval plan and test cases…
---

# eval-guide

Eval enablement accelerator — help customers think through "what does good look like" for their AI agent, then generate a structured eval plan and test cases…


## Eval Guide — Enablement Accelerator

## Eval-First Mindset

Why eval-first?

* Evals sharpen your thinking. Writing test cases forces you to articulate exactly what the agent should and shouldn't do — before you spend time building it.

* Evals become your spec. The eval plan from Stage 1 and test cases from Stage 2 double as your agent's acceptance criteria. Build the agent to pass these tests.

* Evals prevent drift. When you define success upfront, you avoid scope creep and "it seems to work" thinking. You'll know objectively whether the agent meets the bar.

Start here whether you:

* Have only a rough idea ("we want an HR bot")

* Have a written description but no agent yet

* Have a built agent you want to evaluate

* Are adding a new feature to an existing agent

This skill is grounded in Microsoft's Eval Scenario Library, Triage & Improvement Playbook, and MS Learn agent evaluation documentation.

## Interactive Dashboard Workflow

Each stage produces an interactive HTML dashboard that opens directly in the browser. The dashboard runs against a tiny localhost HTTP server (`serve.py --serve`); the customer never sees, downloads, or moves a JSON file. Feedback flows from the browser → server → the AI's `bash` stdout, in one step.

Flow at each review-stage dashboard (Plan, Generate, Interpret):

* Complete the stage's analysis.

* Write stage data to a JSON file (e.g., `stage-1-data.json`).

* Launch with `--serve` mode. The AI's bash blocks until the customer clicks Approve or Regenerate:
`python "$(ls ~/.claude/skills/eval-guide/dashboard/serve.py 2>/dev/null || ls ~/.claude/plugins/cache/*/eval-guide/*/skills/eval-guide/dashboard/serve.py 2>/dev/null | head -1)" --stage <name> --serve --data <file>.json`

* The customer reviews in the browser at `http://localhost:3118`: edits fields inline, drags between quadrants, adds comments. Edits auto-save to the localhost server.

* When the customer clicks Approve & Continue or Incorporate Changes & Regenerate, the browser POSTs the feedback to `/api/feedback`. The server captures it, prints the feedback JSON to stdout between marker lines, and shuts down. No file is downloaded; the customer never moves anything.

* Parse the feedback from the bash command's stdout — look for the block:

```
===EVAL_GUIDE_FEEDBACK_BEGIN===
{ "stage": "...", "status": "confirmed" | "changes_requested", "edits": {...}, "comments": "..." }
===EVAL_GUIDE_FEEDBACK_END===
```

Decode the JSON between those markers — that's the customer's feedback. (`<stage>-feedback.json` is also written next to the data file as a debugging backup, but stdout is the primary channel — read from there.)

* If `status: "confirmed"` → apply the edits, generate final deliverables (docx, CSV), proceed to next stage.

* If `status: "changes_requested"` → apply the edits, regenerate the stage data file, re-launch the dashboard. Same loop.
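The parse-and-branch steps above can be sketched in Python as follows. This is a minimal illustration, not the skill's own code: the marker strings and the two `status` values come from the text above, while the function names `extract_feedback` and `next_step` and the returned action labels are hypothetical.

```python
import json

BEGIN = "===EVAL_GUIDE_FEEDBACK_BEGIN==="
END = "===EVAL_GUIDE_FEEDBACK_END==="


def extract_feedback(stdout: str) -> dict:
    """Return the JSON object printed between the feedback markers."""
    start = stdout.find(BEGIN)
    if start == -1:
        raise ValueError("begin marker not found in stdout")
    body_start = start + len(BEGIN)
    end = stdout.find(END, body_start)
    if end == -1:
        raise ValueError("end marker not found in stdout")
    return json.loads(stdout[body_start:end])


def next_step(feedback: dict) -> str:
    """Map the feedback status to a (hypothetical) action label."""
    status = feedback.get("status")
    if status == "confirmed":
        return "finalize"      # apply edits, generate deliverables, move on
    if status == "changes_requested":
        return "regenerate"    # apply edits, rewrite stage data, relaunch
    raise ValueError(f"unexpected status: {status!r}")
```

Raising on a missing marker, rather than returning an empty dict, makes it obvious when the server exited without the customer clicking either button.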

The orient stage is a pre-built static HTML (`dashboard/orient-dashboard.html`) — agent-agnostic, no `serve.py`, no JSON write, no feedback file. The skill simply opens the file in the customer's browser and continues the conversation. See Session Start: Orient below.

Stages with dashboards: Discover (0), Plan (1), Generate (2), Interpret (4). Stage 3 (Run) executes tests directly.

## Before You Start

* If they have an idea or description (most common): Proceed directly to Stage 0 (Discover). The conversation will help them articulate their agent's purpose, users, boundaries, and success criteria — this becomes their eval spec.

* If they already have a running Copilot Studio agent: Offer to connect to it for richer context: "Since you have a running agent, I can pull its configuration directly to inform the eval plan. Want to share your tenant ID so I can connect?" If yes, use `/clone-agent` to import the agent's topics, knowledge sources, and configuration. Use this to pre-fill the Agent Vision in Stage 0.

* If they already have eval results: Route directly to Stage 4 (Interpret).

## Session Start: Orient

### What to do

The orient dashboard is pre-built and shipped with the skill — `dashboard/orient-dashboard.html`. It is identical for every agent (the maturity model and "what you walk away with" are agent-agnostic), so there is no per-session JSON write and no Python launch. Don't ask for the agent name yet — Stage 0 captures it where it's actually needed for deliverable filenames.

* Open the static dashboard in the customer's default browser. Use the OS launcher and the install-resolved path:

```
ORIENT_HTML="$(ls ~/.claude/skills/eval-guide/dashboard/orient-dashboard.html 2>/dev/null || ls ~/.claude/plugins/cache/*/eval-guide/*/skills/eval-guide/dashboard/orient-dashboard.html 2>/dev/null | head -1)"
case "$(uname -s 2>/dev/null)" in
  Darwin) open "$ORIENT_HTML" ;;
  Linux)  xdg-open "$ORIENT_HTML" ;;
  *)      cmd.exe /C start "" "$ORIENT_HTML" ;;  # Windows / Git Bash
esac
```

The `ls ... | head -1` fallback resolves the file regardless of install location — user-global skills first (`~/.claude/skills/eval-guide/`), plugin-cache second.

For dev installs (skill checked out at an arbitrary path, not in `~/.claude/`), the AI should know the absolute path of the SKILL.md it's reading and substitute `<SKILL.md-dir>/dashboard/orient-dashboard.html`.

This is a read-only stage. There is no feedback file, no confirmation gate, and no `serve.py` involvement. The customer reviews the snapshot in the browser while the conversation continues in chat.

* Proceed to Stage 0 (Discover) without waiting. The dashboard is informational.
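For readers who prefer to see the install-path resolution outside of shell, here is a rough Python equivalent of the "user-global first, plugin cache second" lookup described above. The directory layout is taken from the shell snippet; the function name `resolve_skill_file` and its `home` parameter are hypothetical and exist only to make the sketch testable.

```python
from pathlib import Path
from typing import Optional, Union


def resolve_skill_file(relative: str,
                       home: Optional[Union[str, Path]] = None) -> Optional[Path]:
    """Find a file shipped with the skill, checking install locations in order."""
    base = Path(home) if home is not None else Path.home()

    # 1. User-global skills directory.
    user_global = base / ".claude" / "skills" / "eval-guide" / relative
    if user_global.exists():
        return user_global

    # 2. Plugin cache: take the first match, like `ls ... | head -1`.
    cache = base / ".claude" / "plugins" / "cache"
    if cache.is_dir():
        pattern = f"*/eval-guide/*/skills/eval-guide/{relative}"
        matches = sorted(cache.glob(pattern))
        if matches:
            return matches[0]

    return None  # dev install: fall back to the SKILL.md directory instead
```

A call such as `resolve_skill_file("dashboard/orient-dashboard.html")` returns `None` for dev installs, which is the case where the absolute path of the SKILL.md directory has to be substituted by hand.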

When to rebuild the static HTML: if `templates/orient.html`, `templates/base.html`, or `examples/stage-orient-data.json` change, run `python dashboard/build-orient.py` once and check in the regenerated `orient-dashboard.html`. The build script reuses `serve.py`'s `generate_html`, so the rendering stays consistent with the live dashboards.

Skip orient when: the customer has already done a session with the toolkit and is returning for a Stage 1 / Stage 2 / Stage 4 jump-in. Don't re-orient someone who already has the map.

## How to Route

## Eval Maturity Journey

Use the Per-Agent Eval Maturity Model to orient customers on where they are today and where this session takes them. Five pillars of eval practice, five levels each — from `L100 Initial` (no practice in place) to `L500 Optimized` (continuous improvement built into operations). Assume the agent starts at L100 Initial on all pillars. This session targets L300 Systematic on Pillars 1, 2, and 4 (in-session deliverables) and L200 Defined on Pillars 3 and 5 (via reference protocols delivered alongside the session).

The full 5×5 definitions live in `maturity-model.md` — that file is the canonical reference. Update it first when level definitions change.

| Pillar | What it measures | After this session | Mechanism |
| --- | --- | --- | --- |
| 1 — Define what "good" means | Acceptance criteria quality | L300 Systematic ✓ | Stage 0 (Discover) + Stage 1 (Plan) |
| 2 — Build your eval sets | Coverage and versioning | L300 Systematic ✓ | Stage 2 (Generate) |
| 3 — Run evals across the lifecycle | Where and when evals execute (offline, pre-deploy, production) | L200 Defined ✓ | `rerun-protocol-<agent>-<date>.docx` (starter artifact) |
| 4 — Improve and iterate | How improvements are validated | L300 Systematic ✓ | Stage 4 (Interpret) — only if eval results are available |
| 5 — Handle changes with confidence | How changes (prompts, tools, models, architecture) get tested before shipping | L200 Defined ✓ | `baseline-comparison-<agent>-<date>.xlsx` (starter artifact) |

Pillars 3 and 5 stop at L200 Defined this session. L300 Systematic on those pillars requires operating practice — a release cadence with codified triggers (Pillar 3) and version-tagged baselines accumulated over multiple changes (Pillar 5). The starter artifacts get the customer to L200 in one session: a documented protocol and a fill-in workbook they can execute when triggered. Generate `rerun-protocol-<agent>-<date>.docx` and `baseline-comparison-<agent>-<date>.xlsx` at the end of Stage 2 (see deliverables C and D in Stage 2's "After confirmation" block).

Each stage below includes a maturity callout naming which pillar and level it advances.

## How This Maps to Microsoft's Official Evaluation Framework

Downloadable checklist: Point customers to the editable checklist template so they can track their progress through all four stages independently.

## Stage 0: Discover

Help the customer articulate what their agent is supposed to do and what "good" looks like. This is the most important stage — it shapes everything downstream.

### What you walk away with

* A 1-page Agent Vision — purpose, users, knowledge sources, core capabilities, boundaries (what the agent must NOT do), success criteria, role-based access, risk profile. Written down, not assumed.

* Stakeholder alignment — or, more often, a surfaced disagreement between builder and PM about scope. 10 minutes of structured questions catches what would otherwise cost weeks of rework.

* The spec every later stage depends on. Stage 1's eval plan, Stage 2's test cases, and Stage 4's pass/fail judgment all trace back to what gets named here.

### When this stage is wrong for you

* You already have a written PRD, agent spec, or design doc that covers all 7 questions below. Bring it and skip to Stage 1.

* You have eval results in hand and need triage now — go straight to Stage 4.

* Your agent is a 50-topic monster. One Stage 0 pass won't fit; run Stage 0 per top-level capability.

### What to do — extract Vision, apply safe defaults, proceed to Stage 1

Step 1 — Pre-extract from the kickoff
From the customer's 1–4 sentence description, extract:

* Purpose — usually the first clause ("Personalized HR support…")

* Users — usually implied ("employees," "customers," "internal teams")

* Capabilities — usually a list ("benefits, training, policies")

* Knowledge sources — sometimes named, often categorized ("official company resources" → SharePoint TBD)

* Tone hints — sometimes explicit ("trusted HR colleague," "efficient")

* Personalization hints — words like "personalized," "your," "based on your role"

Step 2 — Apply safe defaults by domain
Domain detection runs on keywords in the kickoff description. Pick the matching default set:

* Most user questions answered directly (deflection / self-service rate)

* Out-of-scope questions routed clearly to the right human or resource (graceful handoff)

* Zero privacy / boundary breaches

Default knowledge sources when only categorized:

* "some SharePoint sites" / "internal docs" → flag as `Multiple SharePoint sites (TBD — name in Plan dashboard)` so the customer can fill names without us blocking on it.

Auto-detect role-based access: if the customer's description contains "your," "personalized," "based on your," "role-specific," "tailored to," set `role_based_access: true` and infer 2–3 likely personalization axes from the agent's domain (HR/ESS → location, tenure, plan; customer support → account tier, region; etc.). Customer corrects if wrong.
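The auto-detection rule can be sketched as a small keyword check. The trigger phrases and the example axes are the ones listed above; the domain keys (`"hr"`, `"customer_support"`), the function name, and the returned dict shape are assumptions made for illustration only.

```python
import re

# Trigger phrases from the rule above, matched on word boundaries.
PERSONALIZATION_PATTERNS = [
    r"\byour\b",
    r"\bpersonalized\b",
    r"\bbased on your\b",
    r"\brole-specific\b",
    r"\btailored to\b",
]

# Likely personalization axes per domain (hypothetical keys).
DOMAIN_AXES = {
    "hr": ["location", "tenure", "plan"],
    "customer_support": ["account tier", "region"],
}


def detect_role_based_access(description: str, domain: str) -> dict:
    """Flag role-based access and suggest axes the customer can correct."""
    text = description.lower()
    hit = any(re.search(p, text) for p in PERSONALIZATION_PATTERNS)
    axes = DOMAIN_AXES.get(domain, []) if hit else []
    return {"role_based_access": hit, "axes": axes}
```

Matching on word boundaries keeps "your" from firing inside unrelated words, but this remains a heuristic, which is why the customer gets the chance to correct it.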

This is silent removal with a flagged note, not a question. Customer can flag if they disagree.

Step 4 — Show the Vision summary in chat (5–6 lines, no questions)
Display the pre-extracted Vision compactly:

```
Agent Vision: [Name]

Purpose: [one sentence from kickoff]
Users: [extracted or default]
Knowledge: [named sources, or "TBD — confirm in Plan dashboard"]
Capabilities: [3–5 from kickoff, aspirational dropped]
Boundaries: [domain default set, listed]
Success: [default 3 criteria]
Role-based: [auto-detected: yes/no, with axes]
Risk profile: [domain default: HIGH/MEDIUM/LOW]
```

Don't gate on customer confirmation. Write `stage-0-data.json` and proceed to Stage 1 immediately. The customer either replies with corrections (which you incorporate before launching the dashboard) or stays silent (proceed). The Plan dashboard is the real review surface.
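As a rough illustration of this step, the sketch below renders the chat summary from a Vision dict and writes the same dict to `stage-0-data.json`. The actual JSON schema expected by the dashboard is not shown in this document, so every key name here (`name`, `purpose`, `role_based`, and so on) is a placeholder assumption, as are the two function names.

```python
import json
from pathlib import Path

# (label shown in chat, hypothetical key in the Vision dict)
SUMMARY_FIELDS = [
    ("Purpose", "purpose"),
    ("Users", "users"),
    ("Knowledge", "knowledge"),
    ("Capabilities", "capabilities"),
    ("Boundaries", "boundaries"),
    ("Success", "success"),
    ("Role-based", "role_based"),
    ("Risk profile", "risk_profile"),
]


def render_vision_summary(vision: dict) -> str:
    """Build the compact chat summary; missing fields show as TBD."""
    lines = [f"Agent Vision: {vision.get('name', 'TBD')}", ""]
    for label, key in SUMMARY_FIELDS:
        value = vision.get(key, "TBD")
        if isinstance(value, list):
            value = ", ".join(value)
        lines.append(f"{label}: {value}")
    return "\n".join(lines)


def write_stage_data(vision: dict, path: str = "stage-0-data.json") -> Path:
    """Persist the Vision so the next stage can load it."""
    target = Path(path)
    target.write_text(json.dumps(vision, indent=2), encoding="utf-8")
    return target
```

Rendering and writing from the same dict keeps the chat summary and the stage data file from drifting apart if the customer replies with corrections.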

Why this works

* Pre-extraction + defaults covers ~80% of what the chat questions extracted, with zero customer chat input beyond the kickoff.

* Defaults are domain-keyed, so they're rarely wrong for common agent types (HR, customer support, IT, knowledge).

* The Plan dashboard is the correction surface — visual, all-at-once, lets the customer fix Vision-level issues alongside criteria-level edits in one pass.

* Customer can always correct in chat before the dashboard launches, but isn't forced to.

When this approach is wrong (revert to gap-question batch)

* The kickoff description is genuinely too thin — one sentence with no domain keywords. Ask one clarifying question to get enough material for safe defaults.

* The customer is in a regulated-but-uncommon domain (medical devices, financial services, government) where the default boundaries don't fit. After step 2, ask: "Domain looks like [X] — your boundaries are usually [Y]. Anything specific I should add for your context?"

* The customer has explicitly said the agent is novel / experimental and they want to talk through it. Default to conversation mode for these — but they're a small minority.

## Stage 1: Plan

Using the Agent Vision, produce a structured eval suite plan. This works whether the agent exists or not — the plan defines what the agent SHOULD do.

### What you walk away with

* 10–15 acceptance criteria phrased as "The agent should…" (or "should NOT…" for negative tests). Testable, prioritizable, reviewable.

* Each criterion placed on a Value × Risk matrix — High Value · High Risk (highest investment), High Value · Low Risk (expected behavior, occasional misses tolerable), Low Value · High Risk (low traffic, zero tolerance for failure), Low Value · Low Risk (light coverage). The matrix is what keeps the plan tractable.

* Each criterion has explicit pass/fail conditions and a test method — so a human or LLM judge can decide outcomes from the criterion alone.

* A `.docx` eval plan for stakeholder review (PM, security, business owner). The artifact for sign-off.
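To make the Value × Risk placement concrete, here is a small sketch that maps a criterion's value and risk ratings to its quadrant and the investment guidance quoted above. The quadrant labels and guidance text come from this section; the `"high"`/`"low"` string encoding and the function name `quadrant` are assumptions.

```python
# Guidance per (value, risk) quadrant, as described in the list above.
QUADRANT_GUIDANCE = {
    ("high", "high"): "highest investment",
    ("high", "low"): "expected behavior, occasional misses tolerable",
    ("low", "high"): "low traffic, zero tolerance for failure",
    ("low", "low"): "light coverage",
}


def quadrant(value: str, risk: str) -> tuple:
    """Return (quadrant label, investment guidance) for one criterion."""
    key = (value.lower(), risk.lower())
    if key not in QUADRANT_GUIDANCE:
        raise ValueError("value and risk must each be 'high' or 'low'")
    label = f"{key[0].capitalize()} Value · {key[1].capitalize()} Risk"
    return label, QUADRANT_GUIDANCE[key]
```

For example, a salary-refusal criterion rated low value and high risk lands in "Low Value · High Risk", which signals few test cases but no tolerated failures.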

### When this stage is wrong for you

* You already have written acceptance criteria covering all four Value × Risk quadrants. Bring them and skip to Stage 2.

* You're testing a single new feature on an existing agent. Run a mini Stage 1 on just that feature; don't redo the whole plan.

* Your agent has 50+ topics. Run Stage 1 per top-level capability; one pass won't fit.

### What to do

* Determine eval depth from agent architecture:

Different agent architectures require different eval layers. Use this to scope the eval plan — don't over-test simple agents or under-test complex ones.

* Identify the families of acceptance criteria to write:

Acceptance criteria come in two families — functional (what the agent should do for users) and capability (how well it should do it). Use the table to pick the families that apply.

* Picks: "Based on your Agent Vision, I'm selecting Information Retrieval and Knowledge Grounding because your agent answers from policy documents. I'm also including Red-Teaming — every agent needs adversarial testing."

* Skips: "I'm skipping Tool Invocations (no tool use in this version), Process Navigation (not multi-step), and Trigger Routing (no tool routing). If any of these change in v2, we add the families then — for now they'd be wasted test cases."

The `Skipping: ... because ...` narration is mandatory when any family is excluded. It signals to the customer that the eval scope is fitted to their agent, not a generic kitchen sink. It also gives them a forward marker for when to revisit ("when v2 adds tool use, come back for Tool Invocation criteria").

* Write acceptance criteria:

Each criterion is a single testable statement starting with "The agent should…" (or "The agent should NOT…" for negative tests). Write 10–15 criteria across the families identified above.

Criteria plan table:

| # | Acceptance Criterion | Quality Dimension | Method |
| --- | --- | --- | --- |

Quality dimension naming — keep it broad. Aim for 4–6 dimensions, not 8–12.

Default dimension set for most agents:

Good criteria are:

* Behavior-led — start with "The agent should…" and describe an observable behavior

* Verifiable — pass or fail can be judged by comparing a response to the criterion

* Scoped — one behavior per criterion; don't stack multiple checks into one row

* Grounded — tied to the Agent Vision (a capability, boundary, knowledge source, or user cohort)

Examples:

* "The agent should return the correct PTO days for the employee's office and tenure, with a citation to the source policy."
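Two of the rules above are mechanical enough to check automatically: the "The agent should…" opening and the 10–15 criteria count. The sketch below is one possible lint pass, with a hypothetical function name; whether a criterion is verifiable, scoped, or grounded still needs human judgment.

```python
REQUIRED_PREFIX = "The agent should"  # also covers "The agent should NOT"
MIN_CRITERIA, MAX_CRITERIA = 10, 15


def lint_criteria(criteria: list) -> list:
    """Return a list of human-readable problems; empty means no issues found."""
    problems = []
    if not MIN_CRITERIA <= len(criteria) <= MAX_CRITERIA:
        problems.append(
            f"expected {MIN_CRITERIA}-{MAX_CRITERIA} criteria, got {len(criteria)}"
        )
    for number, text in enumerate(criteria, start=1):
        if not text.strip().startswith(REQUIRED_PREFIX):
            problems.append(
                f"criterion {number} does not start with '{REQUIRED_PREFIX}'"
            )
    return problems
```

Running this before launching the Plan dashboard catches phrasing slips early, so the customer's review time goes to substance rather than wording.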
