Inference Server
Start and test the prime-rl inference server. Use when asked to run inference, start vLLM, test a model, or launch the inference server.
Install
Quick install
npx skills add https://github.com/huggingface/prime-rl/tree/HEAD/skills/inference-server
Per agent:
npx skills add huggingface/prime-rl --skill inference-server --agent claude-code
npx skills add huggingface/prime-rl --skill inference-server --agent cursor
npx skills add huggingface/prime-rl --skill inference-server --agent codex
npx skills add huggingface/prime-rl --skill inference-server --agent opencode
npx skills add huggingface/prime-rl --skill inference-server --agent github-copilot
npx skills add huggingface/prime-rl --skill inference-server --agent windsurf
More install options
Shorthand (useful for multi-skill repos):
npx skills add huggingface/prime-rl --skill inference-server
Manual (clone the repo and drop the folder into your agent's skills directory):
git clone https://github.com/huggingface/prime-rl.git
cp -r prime-rl/skills/inference-server ~/.claude/skills/inference-server
---
Source: https://github.com/huggingface/prime-rl/tree/HEAD/skills/inference-server
Author: huggingface
Discovered via: mcpservers.org
SKILL.md source
---
name: inference-server
description: Start and test the prime-rl inference server. Use when asked to run inference, start vLLM, test a model, or launch the inference server.
---
# inference-server
## Inference Server
## Starting the server
Always use the `inference` entry point; never run `vllm serve` or `python -m vllm.entrypoints.openai.api_server` directly. The entry point runs `setup_vllm_env()`, which configures environment variables (LoRA, multiprocessing) before vLLM is imported.
```
# With a TOML config
uv run inference @ path/to/config.toml

# With CLI overrides
uv run inference --model.name Qwen/Qwen3-0.6B --model.max_model_len 2048 --model.enforce_eager

# Combined
uv run inference @ path/to/config.toml --server.port 8001 --gpu-memory-utilization 0.5
```
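When the server is launched from a Python script (for example a test harness), the same command line can be assembled programmatically. The sketch below is only an illustration: `build_inference_cmd` is a hypothetical helper, not part of prime-rl, and it assumes nothing beyond the `@ path/to/config.toml` and `--key value` forms shown in the commands above.

```python
import shlex
from typing import List, Mapping, Optional


def build_inference_cmd(
    config_path: Optional[str] = None,
    overrides: Optional[Mapping[str, object]] = None,
) -> List[str]:
    """Assemble the argv for `uv run inference` (hypothetical helper)."""
    cmd = ["uv", "run", "inference"]
    if config_path is not None:
        # The TOML config is passed as "@ <path>", as in the examples above.
        cmd += ["@", config_path]
    for key, value in (overrides or {}).items():
        flag = f"--{key}"
        if value is True:
            # Assumption: boolean switches such as --model.enforce_eager
            # are passed as bare flags.
            cmd.append(flag)
        else:
            cmd += [flag, str(value)]
    return cmd


if __name__ == "__main__":
    cmd = build_inference_cmd(
        "path/to/config.toml",
        {"server.port": 8001, "gpu-memory-utilization": 0.5},
    )
    # Print the command for inspection; pass `cmd` to subprocess.Popen to launch it.
    print(shlex.join(cmd))
```

Building the command as a list rather than a single string avoids shell-quoting problems when a config path contains spaces.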
## SLURM scheduling
The `inference` entry point supports optional SLURM scheduling, following the same patterns as SFT and RL.
### Single-node SLURM
```
# inference_slurm.toml
output_dir = "/shared/outputs/my-inference"

[model]
name = "Qwen/Qwen3-8B"

[parallel]
tp = 8

[slurm]
job_name = "my-inference"
partition = "cluster"
```
```
uv run inference @ inference_slurm.toml
```
### Multi-node SLURM (independent vLLM replicas)
Each node runs an independent vLLM instance. There is no cross-node parallelism: tensor parallelism (`tp`) and data parallelism (`dp`) must fit within a single node's GPUs.
```
# inference_multinode.toml
output_dir = "/shared/outputs/my-inference"

[model]
name = "PrimeIntellect/INTELLECT-3-RL-600"

[parallel]
tp = 8
dp = 1

[deployment]
type = "multi_node"
num_nodes = 4
gpus_per_node = 8

[slurm]
job_name = "my-inference"
partition = "cluster"
```
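The single-node constraint can be checked before a job is submitted. The following is a minimal sketch, assuming that one replica occupies `tp * dp` GPUs and that every node hosts exactly one replica, as the description above suggests. `Layout` and `check_layout` are hypothetical names for illustration, not prime-rl APIs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Layout:
    tp: int             # [parallel] tp
    dp: int             # [parallel] dp
    num_nodes: int      # [deployment] num_nodes
    gpus_per_node: int  # [deployment] gpus_per_node


def check_layout(layout: Layout) -> int:
    """Return the total GPUs used, or raise if a replica cannot fit on one node."""
    # Assumption: a replica needs tp * dp GPUs, all on the same node.
    gpus_per_replica = layout.tp * layout.dp
    if gpus_per_replica > layout.gpus_per_node:
        raise ValueError(
            f"tp * dp = {gpus_per_replica} exceeds "
            f"gpus_per_node = {layout.gpus_per_node}"
        )
    # One independent replica per node.
    return gpus_per_replica * layout.num_nodes
```

For the config above (`tp = 8`, `dp = 1`, four nodes of eight GPUs), `check_layout(Layout(8, 1, 4, 8))` returns 32, while raising `dp` to 2 would be rejected because 16 GPUs do not fit on an 8-GPU node.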
### Dry run
Set `dry_run = true` in the config, or pass `--dry-run true` on the command line, to generate the sbatch script without submitting it:
```
uv run inference @ config.toml --dry-run true
```
## Custom endpoints
The server extends vLLM with:
* `/v1/chat/completions/tokens` — accepts token IDs as prompt input (used by multi-turn RL rollouts)
* `/update_weights` — hot-reload model weights from the trainer
* `/load_lora_adapter` — load LoRA adapters at runtime
* `/init_broadcaster` — initialize weight broadcast for distributed training
## Testing the server
```
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-0.6B",
    "messages": [{"role": "user", "content": "Hi"}],
    "max_tokens": 50
  }'
```
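The same smoke test can be run from Python with only the standard library. This is a sketch, not a prime-rl client: `build_chat_request` and `send` are hypothetical helpers, and the response parsing assumes the usual OpenAI-compatible chat completions shape (`choices[0].message.content`).

```python
import json
import urllib.request


def build_chat_request(
    base_url: str, model: str, prompt: str, max_tokens: int = 50
) -> urllib.request.Request:
    """Build the POST request that mirrors the curl command above."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 30.0) -> str:
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    # Assumption: OpenAI-compatible response layout.
    return body["choices"][0]["message"]["content"]
```

With the server running locally, `send(build_chat_request("http://localhost:8000", "Qwen/Qwen3-0.6B", "Hi"))` should return the model's reply as a string. A connection error usually means the server has not finished starting or is listening on a different port.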
## Key files
* `src/prime_rl/entrypoints/inference.py` — entrypoint with local/SLURM routing
* `src/prime_rl/inference/server.py` — vLLM env setup
* `src/prime_rl/configs/inference.py` — `InferenceConfig` and all sub-configs
* `src/prime_rl/inference/vllm/server.py` — FastAPI routes and vLLM monkey-patches
* `src/prime_rl/templates/inference.sbatch.j2` — SLURM template (handles both single and multi-node)
* `configs/debug/infer.toml` — minimal debug config