LangSmith Code Eval

Creates code-based evaluators for LangSmith-traced agents. Use when building custom evaluation logic, testing tool usage patterns, or scoring agent outputs…

Version: 1.0.0
License: MIT
Token count: ~1,415
Updated: Jun 5, 2026

Install

Quick install

via npx skills · works with 57+ agents
npx skills add https://github.com/langchain-ai/lca-skills/tree/HEAD/skills/langsmith-code-eval
Or pick an agent:
npx skills add langchain-ai/lca-skills --skill langsmith-code-eval --agent claude-code
npx skills add langchain-ai/lca-skills --skill langsmith-code-eval --agent cursor
npx skills add langchain-ai/lca-skills --skill langsmith-code-eval --agent codex
npx skills add langchain-ai/lca-skills --skill langsmith-code-eval --agent opencode
npx skills add langchain-ai/lca-skills --skill langsmith-code-eval --agent github-copilot
npx skills add langchain-ai/lca-skills --skill langsmith-code-eval --agent windsurf
More install options

Shorthand — useful for multi-skill repos:

npx skills add langchain-ai/lca-skills --skill langsmith-code-eval

Manual — clone the repo and drop the folder into your agent's skills directory:

git clone https://github.com/langchain-ai/lca-skills.git
cp -r lca-skills/skills/langsmith-code-eval ~/.claude/skills/
How to use: Once installed, ask your agent to "use the langsmith-code-eval skill" or describe what you want (e.g. "Creates code-based evaluators for LangSmith-traced agents. Use when building cus"). Requires Node.js 18+.

SKILL.md source

---
name: langsmith-code-eval
description: Creates code-based evaluators for LangSmith-traced agents. Use when building custom evaluation logic, testing tool usage patterns, or scoring agent outputs…
---

# langsmith-code-eval

Creates code-based evaluators for LangSmith-traced agents. Use when building custom evaluation logic, testing tool usage patterns, or scoring agent outputs…

## LangSmith Code Evaluator Creation

Creates evaluators for LangSmith experiments through structured inspection and implementation.

## Prerequisites

* `langsmith` Python package installed

* `LANGSMITH_API_KEY` environment variable set (check project's `.env` file)
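
Both prerequisites can be confirmed up front with a small check such as the sketch below. The helper name `check_prerequisites` is hypothetical and not part of the skill. It only tests whether `langsmith` is importable and whether `LANGSMITH_API_KEY` is present in the environment; it does not read a `.env` file.

```python
import importlib.util
import os


def check_prerequisites(env=None):
    """Return a list of human-readable problems; an empty list means ready."""
    env = os.environ if env is None else env
    problems = []
    # find_spec returns None when the package cannot be imported.
    if importlib.util.find_spec("langsmith") is None:
        problems.append("langsmith package is not installed")
    # Treat a missing or empty key as unset.
    if not env.get("LANGSMITH_API_KEY"):
        problems.append("LANGSMITH_API_KEY is not set")
    return problems
```

Calling `check_prerequisites()` with no arguments inspects the current process environment; passing a dict makes the check easy to exercise in isolation.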

## Workflow

Copy this checklist and track progress:

```
Evaluator Creation Progress:
- [ ] Step 1: Gather info from user
- [ ] Step 2: Inspect trace and dataset structure
- [ ] Step 3: Read agent code
- [ ] Step 4: Write evaluator
- [ ] Step 5: Write experiment runner
- [ ] Step 6: Run and iterate
```

### Step 1: Gather Info from User

IMPORTANT: Do NOT search or explore the codebase. Ask the user all of these questions upfront using AskUserQuestion before doing anything else.

Ask the user the following in a single AskUserQuestion call:

* Python command: How do you run Python in this project? (e.g., `python`, `python3`, `uv run python`, `poetry run python`)

* Agent file path: What is the path to your agent file?

* LangSmith project name: What is your LangSmith project name (where traces are logged)?

* LangSmith dataset name: What is the name of the dataset to evaluate against?

* Evaluation goal: What behavior should pass vs fail? Common types:

  * Tool usage: Did the agent call the correct tool?

  * Output correctness: Does output match expected format/content?

  * Policy compliance: Did it follow specific rules?

  * Classification: Did it categorize correctly?

### Step 2: Inspect Trace and Dataset Structure

Using the info from Step 1, run the inspection scripts located in this skill's directory:

```
{python_cmd} {skill_dir}/scripts/inspect_trace.py PROJECT_NAME [RUN_ID]
{python_cmd} {skill_dir}/scripts/inspect_dataset.py DATASET_NAME
```

Replace `{python_cmd}` with the command from Step 1, and `{skill_dir}` with this skill's directory path.
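
The substitution described above can be sketched as a small helper that builds both command lines. The function name `build_inspection_commands` and the example values are hypothetical; only the two script paths and their positional arguments come from the commands shown above.

```python
import shlex
from pathlib import Path


def build_inspection_commands(python_cmd, skill_dir, project_name,
                              dataset_name, run_id=None):
    """Return the two inspection commands as argument lists."""
    # python_cmd may contain several words, e.g. "uv run python".
    prefix = shlex.split(python_cmd)
    scripts = Path(skill_dir) / "scripts"

    trace_cmd = prefix + [str(scripts / "inspect_trace.py"), project_name]
    if run_id:  # RUN_ID is optional
        trace_cmd.append(run_id)

    dataset_cmd = prefix + [str(scripts / "inspect_dataset.py"), dataset_name]
    return [trace_cmd, dataset_cmd]
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` to execute the corresponding script.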

Verify the trace matches the agent:

* Does the trace type match? (e.g., OpenAI trace for OpenAI agent)

* Does it contain the data needed for evaluation?

* If they do not match, clarify with the user before proceeding.

From the dataset inspection, note:

* Input schema (what gets passed to the agent)

* Output schema (reference/expected outputs)

* Metadata fields (e.g., `expected_tool`, `difficulty`, labels)

The dataset metadata often contains ground truth for evaluation (e.g., which tool should be called, expected classification).
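
If the three schemas need to be noted programmatically rather than read off the script output, something like the following sketch may help. It is not the skill's `inspect_dataset.py`; the helper names are hypothetical, and it assumes the dataset has at least one example reachable through the `langsmith` client's `list_examples` call.

```python
def summarize_example(inputs, outputs, metadata):
    """List the keys found in one example's inputs, outputs, and metadata."""
    return {
        "input_keys": sorted(inputs or {}),
        "output_keys": sorted(outputs or {}),
        "metadata_keys": sorted(metadata or {}),
    }


def first_example_summary(dataset_name):
    """Fetch one example from LangSmith and summarize its structure."""
    # Imported lazily so summarize_example works without langsmith installed.
    from langsmith import Client

    client = Client()  # reads LANGSMITH_API_KEY from the environment
    examples = client.list_examples(dataset_name=dataset_name, limit=1)
    example = next(iter(examples), None)
    if example is None:
        return None  # empty dataset
    return summarize_example(example.inputs, example.outputs, example.metadata)
```

A metadata key such as `expected_tool` showing up in `metadata_keys` is a hint that the ground truth lives in metadata rather than in the reference outputs.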

### Step 3: Read Agent Code

Read the agent file provided in Step 1 to identify:

* Entry point function (look for `@traceable` decorator)

* Available tools

* Output format (what the function returns)

### Step 4: Write the Evaluator

Create evaluator functions based on trace and dataset structure. See EVALUATOR_REFERENCE.md for function signatures and return formats.
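
As a rough illustration of a tool-usage evaluator, here is a minimal sketch. `EVALUATOR_REFERENCE.md` is not reproduced here, so the signature is an assumption based on the `langsmith` SDK's convention of evaluators that accept `outputs` and `reference_outputs` dicts and return a dict with `key` and `score`. The field names `tool` and `expected_tool` are hypothetical and should be replaced with whatever Step 2 revealed.

```python
def correct_tool(outputs: dict, reference_outputs: dict) -> dict:
    """Score 1 when the agent called the expected tool, else 0."""
    # Hypothetical field names; adapt to the real trace and dataset schema.
    called = (outputs or {}).get("tool")
    expected = (reference_outputs or {}).get("expected_tool")

    passed = expected is not None and called == expected
    return {
        "key": "correct_tool",
        "score": int(passed),
        "comment": f"expected={expected!r}, called={called!r}",
    }
```

If the ground truth is stored in dataset metadata instead of the reference outputs, the evaluator would need access to the full example rather than only `reference_outputs`.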

### Step 5: Write Experiment Runner

Create a script that:

* Imports the agent's entry function

* Wraps it as a target function

* Runs `evaluate()` or `aevaluate()` against the dataset

See EVALUATOR_REFERENCE.md for `evaluate()` usage.
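
The three bullets above might look like the sketch below. `run_agent` is a hypothetical stand-in for the agent's real entry function (normally imported from the agent file), the `question` input key and the `code-eval` prefix are assumptions, and the `evaluate()` call uses the `langsmith` SDK's `data`, `evaluators`, and `experiment_prefix` parameters.

```python
def run_agent(question: str) -> dict:
    """Hypothetical stand-in; replace with an import of the real entry point."""
    return {"answer": f"echo: {question}", "tool": "search"}


def target(inputs: dict) -> dict:
    """Adapt one dataset example's inputs to the agent's call signature."""
    # "question" is an assumed input key; use the schema found in Step 2.
    return run_agent(inputs["question"])


def run_experiment(dataset_name, evaluators):
    """Run the target over the dataset and score it with the evaluators."""
    # Imported lazily so the wrapper can be tested without network access.
    from langsmith import evaluate

    return evaluate(
        target,
        data=dataset_name,
        evaluators=evaluators,
        experiment_prefix="code-eval",  # assumed label for the experiment
    )
```

Keeping `target` as a thin adapter means the agent code stays untouched and only the input mapping changes when the dataset schema does.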

### Step 6: Run and Iterate

Execute the experiment, review the results in LangSmith, and refine the evaluators as needed.

---

**Source**: https://github.com/langchain-ai/lca-skills/tree/HEAD/skills/langsmith-code-eval
**Author**: langchain-ai
**Discovered via**: mcpservers.org
