Sets up and runs skillgrade evaluation pipelines for Agent Skills. Use when initializing eval configurations, running trials, reviewing results, or integrating with CI. Don't use for writing grader scripts, general test authoring, or non-agentic documentation.
Install
Quick install
npx skills add https://github.com/mgechev/skillgrade
npx skills add mgechev/skillgrade --agent claude-code
npx skills add mgechev/skillgrade --agent cursor
npx skills add mgechev/skillgrade --agent codex
npx skills add mgechev/skillgrade --agent opencode
npx skills add mgechev/skillgrade --agent github-copilot
npx skills add mgechev/skillgrade --agent windsurf
More install options
Shorthand (useful for multi-skill repos):
npx skills add mgechev/skillgrade
Manual (clone the repo and drop the folder into your agent's skills directory):
git clone https://github.com/mgechev/skillgrade.git
cp -r skillgrade ~/.claude/skills/
Skillgrade Setup
---
name: skillgrade-setup
description: Sets up and runs skillgrade evaluation pipelines for Agent Skills. Use when initializing eval configurations, running trials, reviewing results, or integrating with CI. Don't use for writing grader scripts, general test authoring, or non-agentic documentation.
---
Skillgrade Evaluation Setup
Procedures
Step 1: Install Skillgrade
- Verify Node.js 20+ and Docker are available.
- Run `npm i -g skillgrade` to install the CLI globally.
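The prerequisite check in this step can be scripted before installing. The sketch below is a minimal POSIX shell pre-flight check, not part of skillgrade; the function names `node_major` and `check_prereqs` are hypothetical, and it assumes `node --version` prints a string of the form `v20.11.1`.

```shell
#!/bin/sh
# Hypothetical helper: extract the major version from a string such as "v20.11.1".
node_major() {
  version="${1#v}"                 # drop the leading "v"
  printf '%s\n' "${version%%.*}"   # keep everything before the first dot
}

# Hypothetical helper: report whether Node.js 20+ and Docker are on the PATH.
check_prereqs() {
  if command -v node >/dev/null 2>&1; then
    major="$(node_major "$(node --version)")"
    if [ "$major" -ge 20 ]; then
      echo "node: ok (major $major)"
    else
      echo "node: too old (major $major, need 20+)"
    fi
  else
    echo "node: missing"
  fi

  if command -v docker >/dev/null 2>&1; then
    echo "docker: ok"
  else
    echo "docker: missing"
  fi
}

check_prereqs
```

If both lines report `ok`, the global install should be safe to run. Note that this only confirms the `docker` binary exists, not that the daemon is running.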
Step 2: Initialize an Eval Configuration
- Navigate to the skill directory (must contain a `SKILL.md`).
- Set the appropriate API key environment variable (`GEMINI_API_KEY`, `ANTHROPIC_API_KEY`, or `OPENAI_API_KEY`).
- Run `skillgrade init` to generate an `eval.yaml` with AI-powered tasks and graders.
- If an `eval.yaml` already exists, pass `--force` to overwrite: `skillgrade init --force`.
- Without an API key, a well-commented template is generated instead.
Step 3: Configure eval.yaml
- Read `references/eval-yaml-spec.md` for the full configuration schema.
- Define one or more tasks under the `tasks:` key. Each task requires:
  - `name`: unique task identifier
  - `instruction`: what the agent should accomplish
  - `workspace`: files to copy into the evaluation container
  - `graders`: one or more scoring mechanisms (see the `skillgrade-graders` skill)
- Optionally configure `defaults:` for agent, provider, trials, timeout, and threshold.
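To make the shape of the file concrete, the sketch below writes an illustrative `eval.yaml` into a scratch directory so that it cannot overwrite a real configuration. The top-level keys (`defaults`, `tasks`, `name`, `instruction`, `workspace`, `graders`) and the grader type `deterministic` come from the steps above. The nested fields (`src`, `dest`, `run`), the file paths, the timeout unit, and the task text are assumptions made for illustration; `references/eval-yaml-spec.md` remains the authority on the schema.

```shell
#!/bin/sh
# Write the example into a throwaway directory, never into the skill directory.
scratch_dir="$(mktemp -d)"

cat > "$scratch_dir/eval.yaml" <<'EOF'
# Illustrative only: check references/eval-yaml-spec.md for the real schema.
defaults:
  agent: claude        # gemini | claude | codex
  provider: docker     # docker | local
  trials: 5
  timeout: 300         # unit assumed to be seconds
  threshold: 0.8

tasks:
  - name: fix-linting
    instruction: |
      Fix the lint errors in app.js without changing its behavior.
    workspace:         # entry shape (src/dest) is an assumption
      - src: fixtures/app.js
        dest: app.js
    graders:           # fields other than "type" are assumptions
      - type: deterministic
        run: bash graders/check.sh
EOF

echo "wrote $scratch_dir/eval.yaml"
```

Comparing this file against the output of `skillgrade init` is a quick way to see which of the assumed fields match the generated configuration.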
Step 4: Run Evaluations
- Select an appropriate preset based on the evaluation goal:
  - `--smoke` (5 trials): Quick capability check.
  - `--reliable` (15 trials): Reliable pass rate estimate.
  - `--regression` (30 trials): High-confidence regression detection.
- Run the evaluation: `skillgrade --smoke`.
- Run a specific eval by name: `skillgrade --eval=fix-linting`.
- Run multiple evals: `skillgrade --eval=fix-linting,write-tests`.
- Run only deterministic graders (skip LLM calls): `skillgrade --grader=deterministic`.
- Run only LLM rubric graders: `skillgrade --grader=llm_rubric`.
- The agent is auto-detected from the API key. Override with `--agent=gemini|claude|codex`.
- Override the provider with `--provider=docker|local`.
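One way to see why the larger presets give a more trustworthy estimate is to look at how much a single trial moves the pass rate. The sketch below assumes the pass rate is simply passed trials divided by total trials, which the source does not state explicitly; `points_per_trial` is a hypothetical helper, not a skillgrade command.

```shell
#!/bin/sh
# Hypothetical helper: percentage points that one trial contributes to the pass rate.
points_per_trial() {
  awk -v trials="$1" 'BEGIN { printf "%.1f\n", 100 / trials }'
}

# Preset names and trial counts are taken from the list above.
for preset in smoke:5 reliable:15 regression:30; do
  name="${preset%%:*}"      # text before the colon
  trials="${preset##*:}"    # text after the colon
  printf '%s\n' "--$name: $trials trials, $(points_per_trial "$trials") points per trial"
done
```

Under that assumption a single flaky trial shifts a `--smoke` result by 20 points but a `--regression` result by only about 3.3, which is why the smaller preset suits quick checks rather than pass/fail decisions.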
Step 5: Review Results
- Run `skillgrade preview` for a CLI report.
- Run `skillgrade preview browser` to open the web UI at `http://localhost:3847`.
- Reports are saved to `$TMPDIR/skillgrade/<skill-name>/results/`. Override with `--output=DIR`.
Step 6: Integrate with CI
- Add a GitHub Actions step that installs skillgrade, navigates to the skill directory, and runs with `--regression --ci --provider=local`.
- Use `--provider=local` in CI: the runner is already an ephemeral sandbox, so Docker adds overhead without benefit.
- The `--ci` flag causes a non-zero exit code if the pass rate falls below `--threshold` (default: 0.8).
- Read `references/ci-example.md` for a complete workflow template.
Error Handling
- If `skillgrade init` fails with "No SKILL.md found," verify the current directory contains a valid `SKILL.md` file.
- If evaluation hangs, check Docker is running and the container has network access for API calls.
- If all trials fail with "No API key," ensure the environment variable is exported, not just set inline for a different command.
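The difference between an exported variable and an inline assignment is easy to reproduce without touching a real key. The sketch below uses a placeholder variable, `DEMO_API_KEY` (a hypothetical name, not one skillgrade reads), and child `sh` processes to show which commands can see the value.

```shell
#!/bin/sh
unset DEMO_API_KEY   # start from a clean state

# An inline assignment is visible only to the single command it prefixes.
inline_seen="$(DEMO_API_KEY=abc123 sh -c 'printf %s "${DEMO_API_KEY:-unset}"')"
after_inline="$(sh -c 'printf %s "${DEMO_API_KEY:-unset}"')"

# An exported variable is inherited by every child process started afterwards.
export DEMO_API_KEY=abc123
after_export="$(sh -c 'printf %s "${DEMO_API_KEY:-unset}"')"

echo "command with inline assignment saw: $inline_seen"
echo "next command saw:                   $after_inline"
echo "command after export saw:           $after_export"
```

The middle line prints `unset`, which is the situation behind the "No API key" failure: the key was supplied inline to an earlier command, so the later `skillgrade` process never received it.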
---
Source: https://github.com/mgechev/skillgrade
Author: mgechev
Discovered via: skillsdirectory.com
Genre: ai-agents