Saga Orchestration

Author: wshobson
Version: 1.0.0
License: MIT
Token count: ~1,502
Updated: May 27, 2026

Implement saga patterns for distributed transactions and cross-aggregate workflows. Use this skill when implementing distributed transactions across microservices where 2PC is unavailable, designing compensating actions for failed order workflows that span inventory, payment, and shipping services, building event-driven saga coordinators for travel booking systems that must roll back hotel, flight, and car rental reservations atomically, or debugging stuck saga states in production where compensation steps never complete.

Install

Quick install

via npx skills · works with 57+ agents
npx skills add https://github.com/wshobson/agents/tree/main/plugins/backend-development/skills/saga-orchestration
Or pick agent:
npx skills add wshobson/agents --skill saga-orchestration --agent claude-code
npx skills add wshobson/agents --skill saga-orchestration --agent cursor
npx skills add wshobson/agents --skill saga-orchestration --agent codex
npx skills add wshobson/agents --skill saga-orchestration --agent opencode
npx skills add wshobson/agents --skill saga-orchestration --agent github-copilot
npx skills add wshobson/agents --skill saga-orchestration --agent windsurf
Shorthand — useful for multi-skill repos:

npx skills add wshobson/agents --skill saga-orchestration

Manual — clone the repo and drop the folder into your agent's skills directory:

git clone https://github.com/wshobson/agents.git
cp -r agents/plugins/backend-development/skills/saga-orchestration ~/.claude/skills/

How to use: Once installed, ask your agent to "use the saga-orchestration skill" or describe what you want (e.g. "Implement saga patterns for distributed transactions and cross-aggregate workflo"). Requires Node.js 18+.

SKILL.md source

---
name: saga-orchestration
description: Implement saga patterns for distributed transactions and cross-aggregate workflows. Use this skill when implementing distributed transactions across microservices where 2PC is unavailable, designin...
---

# Saga Orchestration

Patterns for managing distributed transactions and long-running business processes without two-phase commit.

## Inputs and Outputs

**What you provide:**
- Service boundaries and ownership (which service owns which step)
- Transaction requirements (which steps must be atomic, which can be eventual)
- Failure modes for each step (transient vs. permanent, retry policy)
- SLA requirements per step (informs timeout configuration)
- Existing event/messaging infrastructure (Kafka, RabbitMQ, SQS, etc.)

**What this skill produces:**
- Saga definition with ordered steps, action commands, and compensation commands
- Orchestrator or choreography implementation for your chosen pattern
- Compensation logic for each participant service (idempotent, always-succeeds)
- Step timeout configuration with per-step deadlines
- Monitoring setup: state machine metrics, stuck saga detection, DLQ recovery
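
As a rough illustration of the first output above, a saga definition can be modeled as an ordered list of steps, each carrying its action command, compensation command, and deadline. This is a minimal sketch, not the skill's own template: the class names `SagaStep` and `SagaDefinition`, the command names, the service names, and the timeout values are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class SagaStep:
    name: str                    # step identifier, e.g. "reserve_inventory"
    service: str                 # owning service (hypothetical names)
    action: str                  # command that performs the step
    compensation: Optional[str]  # command that undoes it; None if nothing to undo
    timeout_seconds: float       # per-step deadline (illustrative values)


@dataclass(frozen=True)
class SagaDefinition:
    name: str
    steps: List[SagaStep] = field(default_factory=list)

    def compensations_after_failure(self, failed_index: int) -> List[str]:
        """Compensation commands for steps completed before failed_index, newest first."""
        completed = self.steps[:failed_index]
        return [s.compensation for s in reversed(completed) if s.compensation]


# Hypothetical order workflow spanning inventory, payment, and shipping.
ORDER_SAGA = SagaDefinition(
    name="order_fulfillment",
    steps=[
        SagaStep("reserve_inventory", "inventory", "ReserveInventory", "ReleaseReservation", 30),
        SagaStep("charge_payment", "payment", "ChargePayment", "RefundPayment", 60),
        SagaStep("create_shipment", "shipping", "CreateShipment", "CancelShipment", 900),
    ],
)
```

If `create_shipment` (index 2) fails, `ORDER_SAGA.compensations_after_failure(2)` yields the refund first and the reservation release second. A step that failed after partially executing may also need its own compensation, which this sketch leaves out.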

---

## When to Use This Skill

- Coordinating multi-service transactions without distributed locks
- Implementing compensating transactions for partial failures
- Managing long-running business workflows (minutes to hours)
- Handling failures in distributed systems where a workflow must either complete or be fully compensated
- Building order fulfillment, approval, or booking processes
- Replacing fragile two-phase commit with async compensation

---

## Detailed section: Core Concepts

Moved to `references/details.md`.

## Detailed section: Templates

Moved to `references/details.md`.

## Best Practices

### Do's

- **Make every step idempotent** — Commands may be replayed on broker reconnect
- **Design compensations carefully** — They are the most critical code path
- **Use correlation IDs** — The `saga_id` must flow through every event and log
- **Implement per-step timeouts** — Never wait indefinitely for a participant reply
- **Log state transitions** — `saga_id`, `step_name`, `old_state → new_state` on every change
- **Test compensation paths explicitly** — Inject failures at each step index in integration tests
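
The correlation-ID and state-transition items above could look roughly like the following. It is a sketch under assumptions: the `SagaState` enum (only `COMPENSATING` and `FAILED` appear in this document; the other members are invented), the dict-shaped saga record, and the `transition` helper are hypothetical names.

```python
import logging
from enum import Enum

logger = logging.getLogger("saga")


class SagaState(str, Enum):
    STARTED = "STARTED"            # assumed state name
    COMPENSATING = "COMPENSATING"
    COMPLETED = "COMPLETED"        # assumed state name
    FAILED = "FAILED"


def transition(saga: dict, step_name: str, new_state: SagaState) -> dict:
    """Return a copy of the saga record in new_state and log the change."""
    old_state = saga["state"]
    # saga_id is the correlation ID, so it appears in every log line.
    logger.info(
        "saga transition saga_id=%s step_name=%s %s -> %s",
        saga["saga_id"], step_name, old_state.value, new_state.value,
    )
    return {**saga, "state": new_state}
```

Routing every state change through one function like this keeps the log format uniform, which makes it easier to search for a single `saga_id` across services.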

### Don'ts

- **Don't assume instant completion** — Sagas are async and may take minutes
- **Don't skip compensation testing** — The rollback path is the hardest to get right
- **Don't couple services directly** — Use async messaging, never synchronous calls inside a saga step
- **Don't ignore partial failures** — A step that partially executed still needs compensation
- **Don't use a global timeout** — Each step has different latency characteristics

---

## Troubleshooting

### Saga stuck in COMPENSATING state

A saga enters compensation but never reaches FAILED. This usually means a compensation handler is throwing an unhandled exception and never publishing `SagaCompensationCompleted`. Add dead-letter queue (DLQ) handling to compensation consumers, and ensure every compensation action publishes a result event even when the underlying operation was already rolled back.

```python
from typing import Any, Dict


# Method on the inventory participant's command handler class.
async def handle_release_reservation(self, command: Dict[str, Any]) -> None:
    try:
        await self.release_reservation(command["original_result"]["reservation_id"])
    except ReservationNotFoundError:
        pass  # Already released: treat as success
    # Reached on success or "already released". Any other exception propagates,
    # so the broker can retry the command or route it to the DLQ.
    await self.event_publisher.publish("SagaCompensationCompleted", {
        "saga_id": command["saga_id"],
        "step_name": "reserve_inventory",
    })
```

### Duplicate saga executions on restart

If your orchestrator service restarts mid-saga, it may replay events and re-execute already-completed steps. Guard every step action with an idempotency key; see **Template 3** in `references/details.md`.
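
One way such a guard might be shaped is sketched below. It is not the referenced template: `IdempotentStepRunner` and its in-memory dict are assumptions for illustration, and a real orchestrator would need a durable store, since an in-memory cache is lost on the very restart this guard is meant to survive.

```python
from typing import Any, Awaitable, Callable, Dict


class IdempotentStepRunner:
    """Runs each (saga_id, step_name) action at most once and caches its result."""

    def __init__(self) -> None:
        # Stand-in for a durable key-value store (assumption for this sketch).
        self._results: Dict[str, Any] = {}

    async def run(self, saga_id: str, step_name: str,
                  action: Callable[[], Awaitable[Any]]) -> Any:
        key = f"{saga_id}:{step_name}"  # idempotency key
        if key in self._results:
            # Replayed command: return the stored result without re-executing.
            return self._results[key]
        result = await action()
        self._results[key] = result
        return result
```

The sketch does not protect against two concurrent deliveries of the same command; that would need an atomic insert-if-absent in the backing store.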

### Choreography saga losing events

In a choreography-based saga, a downstream service may miss an event if it was offline when published. Use a durable message broker (Kafka with replication, RabbitMQ with persistence) and store the current saga state in a dedicated `saga_log` table so you can replay from the last known good step.
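
A `saga_log` table of this kind might be sketched with the standard-library `sqlite3` module as follows. The column names, status strings, and helper functions are assumptions; only the table name comes from the text above, and a production system would more likely use its service database than SQLite.

```python
import sqlite3
from typing import Optional

SCHEMA = """
CREATE TABLE IF NOT EXISTS saga_log (
    saga_id    TEXT    NOT NULL,
    step_index INTEGER NOT NULL,
    step_name  TEXT    NOT NULL,
    status     TEXT    NOT NULL,
    PRIMARY KEY (saga_id, step_index)
)
"""


def record_step(conn: sqlite3.Connection, saga_id: str, step_index: int,
                step_name: str, status: str) -> None:
    """Upsert the outcome of one step; re-recording the same step is harmless."""
    conn.execute(
        "INSERT OR REPLACE INTO saga_log (saga_id, step_index, step_name, status) "
        "VALUES (?, ?, ?, ?)",
        (saga_id, step_index, step_name, status),
    )
    conn.commit()


def last_completed_step(conn: sqlite3.Connection, saga_id: str) -> Optional[int]:
    """Index of the last known good step, or None if no step has completed."""
    row = conn.execute(
        "SELECT MAX(step_index) FROM saga_log "
        "WHERE saga_id = ? AND status = 'COMPLETED'",
        (saga_id,),
    ).fetchone()
    return row[0]
```

A recovering service could call `last_completed_step` and resume from the next index, assuming steps complete strictly in order.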

### Timeout firing before a slow-but-valid step completes

A step like `create_shipment` might take up to 15 minutes during peak load, but if your global timeout is 5 minutes, the orchestrator triggers spurious compensation. Make step timeouts configurable per step type; see `references/advanced-patterns.md` for the `TimeoutSagaOrchestrator` implementation and the `STEP_TIMEOUTS` dict pattern.
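
For orientation, a per-step timeout lookup could be as small as the sketch below. It is not the `TimeoutSagaOrchestrator` from the reference file: the timeout values, `DEFAULT_STEP_TIMEOUT`, `StepTimedOut`, and `await_step_reply` are assumptions chosen for the example.

```python
import asyncio
from typing import Any, Awaitable, Dict

# Per-step deadlines in seconds. Values are illustrative assumptions.
STEP_TIMEOUTS: Dict[str, float] = {
    "reserve_inventory": 30,
    "charge_payment": 60,
    "create_shipment": 20 * 60,  # headroom above the 15-minute peak
}
DEFAULT_STEP_TIMEOUT = 60.0


class StepTimedOut(Exception):
    """Raised when a participant does not reply within its step deadline."""


async def await_step_reply(step_name: str, reply: Awaitable[Any],
                           timeouts: Dict[str, float] = STEP_TIMEOUTS) -> Any:
    """Wait for a participant reply using the deadline configured for this step."""
    timeout = timeouts.get(step_name, DEFAULT_STEP_TIMEOUT)
    try:
        return await asyncio.wait_for(reply, timeout=timeout)
    except asyncio.TimeoutError:
        raise StepTimedOut(f"{step_name} exceeded {timeout}s") from None
```

The orchestrator would catch `StepTimedOut` and start compensation, so a slow shipment step no longer shares a deadline with a fast inventory check.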

### Compensation order not matching execution order

When two or more steps complete before a failure is detected, compensation must run in strict reverse order or you leave data in an inconsistent state. Verify that `_compensate()` iterates from `current_step - 1` down to `0`, and add an integration test that deliberately fails at each step index to confirm correct rollback order.
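
The reverse iteration and the fail-at-each-index check can be sketched in a few lines. This is a synchronous, in-process stand-in rather than the skill's orchestrator: `run_saga`, `build_steps`, and `check_rollback_order` are hypothetical names, and only `_compensate` and `current_step` come from the text above.

```python
from typing import Callable, List, Tuple

# (name, action, compensation)
Step = Tuple[str, Callable[[], None], Callable[[], None]]


def run_saga(steps: List[Step]) -> str:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    current_step = 0
    try:
        for current_step, (_name, action, _comp) in enumerate(steps):
            action()
    except Exception:
        _compensate(steps, current_step)
        return "FAILED"
    return "COMPLETED"


def _compensate(steps: List[Step], current_step: int) -> None:
    # Walk from current_step - 1 down to 0 so the newest change is undone first.
    for index in range(current_step - 1, -1, -1):
        _name, _action, compensation = steps[index]
        compensation()


def build_steps(fail_at: int, log: List[str]) -> List[Step]:
    """Three recording steps; the one at fail_at raises instead of succeeding."""
    def make(i: int) -> Step:
        def action() -> None:
            if i == fail_at:
                raise RuntimeError(f"step {i} failed")
            log.append(f"do{i}")

        def compensation() -> None:
            log.append(f"undo{i}")

        return (f"step{i}", action, compensation)

    return [make(i) for i in range(3)]


def check_rollback_order() -> None:
    """Fail at every step index and confirm compensation runs newest-first."""
    for fail_at in range(3):
        log: List[str] = []
        assert run_saga(build_steps(fail_at, log)) == "FAILED"
        undone = [entry for entry in log if entry.startswith("undo")]
        assert undone == [f"undo{i}" for i in range(fail_at - 1, -1, -1)]
```

A real integration test would inject the failure at the participant service instead of inside the action callable, but the assertion on rollback order stays the same.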

---

## Advanced Patterns

The `references/` directory contains production-grade implementations not needed for most sagas:

- **`references/advanced-patterns.md`** — Full `SagaOrchestrator` abstract base class, `TimeoutSagaOrchestrator` with per-step deadlines, detailed bank transfer compensating transaction chain, Prometheus instrumentation, stuck saga PromQL alerts, and DLQ recovery worker.

---

## Related Skills

- `cqrs-implementation` — Pair sagas with CQRS for read-model updates after each step completes
- `event-store-design` — Store saga events in an event store for full audit trail and replay capability
- `workflow-orchestration-patterns` — Higher-level workflow engines (Temporal, Conductor) that build on saga concepts
