Kubernetes Operator
Use when building a Kubernetes Operator — custom controllers that reconcile CRD state. Triggers on "build an operator", "CRD design", "reconcile loop", "controller-runtime", "kubebuilder", "operator-sdk", "metacontroller", "KOPF", "operator capability levels", or "custom resource". Ships CRD validator, reconcile-loop linter, and OperatorHub capability auditor (all stdlib Python), 4 references on the operator pattern + CRD design + reconcile patterns + tooling landscape, and a /operator-audit slash command. NOT a generic k8s skill — specifically the Operator pattern.
Install
Quick install
npx skills add https://github.com/alirezarezvani/claude-skills/tree/main/engineering/kubernetes-operator/skills/kubernetes-operator
npx skills add alirezarezvani/claude-skills --skill kubernetes-operator --agent claude-code
npx skills add alirezarezvani/claude-skills --skill kubernetes-operator --agent cursor
npx skills add alirezarezvani/claude-skills --skill kubernetes-operator --agent codex
npx skills add alirezarezvani/claude-skills --skill kubernetes-operator --agent opencode
npx skills add alirezarezvani/claude-skills --skill kubernetes-operator --agent github-copilot
npx skills add alirezarezvani/claude-skills --skill kubernetes-operator --agent windsurf
More install options
Shorthand, useful for multi-skill repos:
npx skills add alirezarezvani/claude-skills --skill kubernetes-operator
Manual: clone the repo and drop the folder into your agent's skills directory:
git clone https://github.com/alirezarezvani/claude-skills.git
cp -r claude-skills/engineering/kubernetes-operator/skills/kubernetes-operator ~/.claude/skills/
Kubernetes Operator
Build operators that reconcile correctly. Most operator bugs are not Kubernetes bugs — they are reconcile-loop bugs: missing finalizers, blocking calls, no requeue on transient errors, status drift, RBAC over-grants. This skill catches them deterministically before they reach a cluster.
When to use
- Building a new Kubernetes Operator (controller for a CRD)
- Reviewing an existing operator for capability-level gaps
- Auditing a CRD spec for status/conditions/finalizer correctness
- Choosing a framework (controller-runtime / kubebuilder / operator-sdk / metacontroller / KOPF)
- Designing the API surface of a Custom Resource
- Hardening RBAC, leader election, or webhook validation
When NOT to use
- Plain Helm chart packaging → use helm-chart-builder
- Standard kubectl operations / blue-green deploys → use senior-devops
- General k8s security posture → use cloud-security
- "I want to run a workload": that's a Deployment / Job, not an operator
Core principle: an operator is a reconcile loop, not a script
observe(actual) → desired = read(spec) → diff(actual, desired) → act → update(status)
↓
requeue / done
Operators that fail are the ones that:
- Treat reconcile as imperative (do this, then this, then this) instead of declarative (make actual=desired, idempotently)
- Don't requeue transient failures
- Don't use finalizers, leaving orphan resources
- Mutate spec instead of status
- Don't use the status subresource (status updates trigger spec reconciles → loop)
- Block in reconcile (long HTTP calls, locks)
- Forget leader election → split-brain on multi-replica deploys
The 3 tools below catch each of these.
Quick start
SKILL=engineering/kubernetes-operator/skills/kubernetes-operator
# Validate a CRD design
python "$SKILL/scripts/crd_validator.py" --crd config/crd/myapp.yaml
# Lint a Go reconcile function
python "$SKILL/scripts/reconcile_lint.py" --controller controllers/myapp_controller.go
# Score against OperatorHub Capability Levels (1-5)
python "$SKILL/scripts/operator_capability_audit.py" --operator-dir .
The 3 Python tools
All stdlib-only. Run with --help.
crd_validator.py
Validates a CRD YAML against operator-pattern best practices.
python scripts/crd_validator.py --crd config/crd/myapp.yaml
python scripts/crd_validator.py --crd config/crd/ --format json
Checks:
- spec.versions[*].subresources.status is set (status subresource)
- spec.scope is Namespaced (not Cluster) unless explicitly justified
- Singular and listKind defined
- spec.versions[*].schema.openAPIV3Schema has type definitions (no x-kubernetes-preserve-unknown-fields: true at top level)
- A version is marked served: true AND storage: true
- Conditions array is in the schema (allows []metav1.Condition)
- Printer columns include Age and Status/Phase
reconcile_lint.py
Lints a Go controller reconcile function for anti-patterns.
python scripts/reconcile_lint.py --controller controllers/myapp_controller.go
Checks (regex-based heuristics):
- Returns are (ctrl.Result, error) shape
- Errors are returned so the request requeues (return ctrl.Result{}, err; controller-runtime ignores the Result and retries with backoff when err is non-nil)
- client.Update() on the spec object is flagged (controllers should update only status)
- time.Sleep inside reconcile is flagged (use RequeueAfter)
- HTTP calls without context cancellation are flagged
- Missing defer after a finalizer add
- No IsConditionTrue / SetCondition calls when conditions present in CRD
- Reconcile function exceeds 80 lines (extract subroutines)
operator_capability_audit.py
Scores an operator against OperatorHub's 5 Capability Levels.
python scripts/operator_capability_audit.py --operator-dir .
Levels:
- L1 — Basic Install: CRD defined, controller deploys it
- L2 — Seamless Upgrades: PDBs, conversion webhooks, version skew strategy
- L3 — Full Lifecycle: backups, restores, failure recovery
- L4 — Deep Insights: metrics endpoint, Prometheus rules, alerts
- L5 — Auto Pilot: auto-scaling, auto-tuning, anomaly detection
Reports current level + concrete next steps to advance one level.
Tooling landscape
Pick a framework based on language and complexity. See references/tooling_landscape.md.
| Framework | Language | Best for | Maintenance |
|---|---|---|---|
| controller-runtime | Go | Production-grade, low-level control | Active (sig-api-machinery) |
| kubebuilder | Go | Standard scaffolding, opinionated | Active (Kubernetes SIGs) |
| operator-sdk | Go / Helm / Ansible | OpenShift / mixed-paradigm teams | Active (Red Hat) |
| metacontroller | Any (webhook-based) | Polyglot teams, avoiding Go | Less active |
| KOPF | Python | Python shops, async-first | Active (community) |
| java-operator-sdk | Java | JVM shops | Active (Red Hat / Java SIG) |
Decision rules:
- New operator + Go shop → kubebuilder
- New operator + Python shop → KOPF
- New operator + can't pick a language → metacontroller
- OpenShift target → operator-sdk
CRD design principles
See references/crd_design.md for full detail. Quick rules:
- status is the source of truth for the controller's view of the world. Spec is what the user wants; status is what the controller observed.
- Use the status subresource. Without it, status lives in the main resource, so every status write bumps metadata.generation and cannot be filtered out; it re-triggers reconcile (loop).
- Use Conditions. Ready, Reconciling, Degraded. Each carries a reason and message.
- Add finalizers. Without finalizers, deletion races the controller and orphans external resources.
- Version your CRD from day 1. v1alpha1 → v1beta1 → v1. Plan a conversion webhook.
- Validate via OpenAPI v3 schema. Don't rely on the controller for validation that should fail at admission.
- Use additionalPrinterColumns for kubectl get. Show Age, Phase, Ready at minimum.
- Namespace your CRDs unless they manage cluster-scoped resources.
Reconcile loop principles
See references/reconcile_loop.md for full detail. Quick rules:
- Idempotent. Reconciling the same state twice → same result, no additional side effects.
- Read once, decide, act. Don't observe the world repeatedly during reconcile.
- Update status, not spec. Spec belongs to the user.
- Return errors that requeue. Use ctrl.Result{RequeueAfter: ...} for known transient cases.
- Never block. No time.Sleep. No long HTTP calls without context.
- Use the cache. Read via the controller's cached client; only escape the cache for a specific reason.
- Leader-elect when running >1 replica. Otherwise enable single-replica mode.
- Set OwnerReferences. Cascading deletion is the operator pattern's free gift.
Workflows
Workflow 1: Bootstrap a new operator (Go + kubebuilder)
1. Pick a Group/Version/Kind: e.g., apps.example.com/v1alpha1, kind=MyApp
2. kubebuilder init --domain example.com --repo github.com/org/myapp-operator
3. kubebuilder create api --group apps --version v1alpha1 --kind MyApp
4. Run crd_validator.py on config/crd/bases/apps.example.com_myapps.yaml
→ Fix every WARN before writing controller code
5. Implement the reconcile function (Karpathy principle 2: simplest correct version first)
6. Run reconcile_lint.py on controllers/myapp_controller.go
7. Run operator_capability_audit.py --operator-dir . — confirm L1
8. Test in a kind cluster: kubectl apply -f config/samples/
9. Add status conditions; aim for L2 in the same PR
Workflow 2: Audit an existing operator
1. Run operator_capability_audit.py --operator-dir <path>
2. Run crd_validator.py --crd config/crd/
3. Run reconcile_lint.py --controller controllers/
4. Triage findings:
- FAIL → block release; fix before next deploy
- WARN → file an issue; fix in next 30 days
5. Document current capability level in README; commit
6. Plan one capability level advancement per quarter
Workflow 3: Choose a framework
1. Identify primary language constraint (team skill)
2. Identify deployment target (vanilla k8s vs OpenShift)
3. Identify operator complexity (single CRD vs multi-CRD vs cluster-wide)
4. Cross-reference with references/tooling_landscape.md
5. Build a 1-week proof-of-concept before committing
References
- references/operator_pattern.md: what an operator IS, when to use vs alternatives
- references/crd_design.md: CRD design principles, versioning, conversion webhooks
- references/reconcile_loop.md: reconcile patterns, error handling, idempotency
- references/tooling_landscape.md: framework comparison + decision tree
Slash command
/operator-audit — Run all 3 tools on an operator repo and produce a markdown report.
Asset templates
- assets/crd_template.yaml: CRD with status subresource, conditions, finalizer hint, printer columns
- assets/reconcile_skeleton.go: Go controller reconcile function with idempotency, conditions, finalizers, requeue patterns
Anti-patterns
- time.Sleep(30 * time.Second) inside reconcile: blocks a reconcile worker and stalls other reconciles. Use RequeueAfter.
- r.Client.Update(ctx, obj) to set status: use r.Status().Update(ctx, obj) instead.
- No leader election + 2+ replicas: split-brain.
- No finalizer: external resources orphan on deletion.
- CRD without status subresource: status updates trigger spec reconciles (infinite loop).
- Reconcile function > 200 lines: extract reconcileXxx subroutines per condition.
- Setting x-kubernetes-preserve-unknown-fields: true on spec root defeats validation.
- Imperative reconcile: "if creating, do A; if updating, do B; if deleting, do C". Wrong shape. Reconcile = make actual=desired, regardless of how we got here.
Verifiable success
A team using this skill should achieve:
- 100% of new CRDs pass crd_validator.py before merge
- All reconcile functions pass reconcile_lint.py strict mode
- Operators reach OperatorHub Capability Level 3 (Full Lifecycle) before public release
- Mean time to fix a reconcile bug: <1 day (no infinite loops in production)
SKILL.md source
---
name: kubernetes-operator
description: Use when building a Kubernetes Operator — custom controllers that reconcile CRD state. Triggers on "build an operator", "CRD design", "reconcile loop", "controller-runtime", "kubebuilder", "operato...
---
# Kubernetes Operator
Build operators that reconcile correctly. Most operator bugs are not Kubernetes bugs — they are reconcile-loop bugs: missing finalizers, blocking calls, no requeue on transient errors, status drift, RBAC over-grants. This skill catches them deterministically before they reach a cluster.
## When to use
- Building a new Kubernetes Operator (controller for a CRD)
- Reviewing an existing operator for capability-level gaps
- Auditing a CRD spec for status/conditions/finalizer correctness
- Choosing a framework (controller-runtime / kubebuilder / operator-sdk / metacontroller / KOPF)
- Designing the API surface of a Custom Resource
- Hardening RBAC, leader election, or webhook validation
## When NOT to use
- Plain Helm chart packaging → use `helm-chart-builder`
- Standard kubectl operations / blue-green deploys → use `senior-devops`
- General k8s security posture → use `cloud-security`
- "I want to run a workload" — that's a Deployment / Job, not an operator
## Core principle: an operator is a reconcile loop, not a script
```
observe(actual) → desired = read(spec) → diff(actual, desired) → act → update(status)
↓
requeue / done
```
Operators that fail are the ones that:
1. Treat reconcile as imperative (do this, then this, then this) instead of declarative (make actual=desired, idempotently)
2. Don't requeue transient failures
3. Don't use finalizers, leaving orphan resources
4. Mutate spec instead of status
5. Don't use the status subresource (status updates trigger spec reconciles → loop)
6. Block in reconcile (long HTTP calls, locks)
7. Forget leader election → split-brain on multi-replica deploys
The 3 tools below catch each of these.
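The loop in the diagram can be sketched as a toy model in plain Python. This is not the skill's code and not how controller-runtime is implemented; `Resource`, `reconcile`, and the `cluster` dict are hypothetical names used only to show the observe → diff → act → update(status) shape and why a second pass over unchanged state does nothing.

```python
from dataclasses import dataclass, field


@dataclass
class Resource:
    """Toy custom resource: spec is user intent, status is what the controller observed."""
    spec: dict
    status: dict = field(default_factory=dict)


def reconcile(resource: Resource, cluster: dict) -> list:
    """Level-based reconcile: converge `cluster` toward `resource.spec`.

    Returns the actions taken; an empty list means actual already matched desired.
    """
    desired = resource.spec["replicas"]          # desired = read(spec)
    actual = cluster.get("replicas", 0)          # observe(actual)
    actions = []
    if actual != desired:                        # diff(actual, desired)
        cluster["replicas"] = desired            # act
        actions.append(f"scale {actual}->{desired}")
    # update(status): record the observation, never write to spec
    resource.status["observedReplicas"] = cluster["replicas"]
    return actions


cluster = {}                       # stand-in for the real world (assumption)
app = Resource(spec={"replicas": 3})
first = reconcile(app, cluster)    # converges: one action
second = reconcile(app, cluster)   # same state again: no further actions
print(first, second)
```

The function never asks how the object got into its current state; it only compares actual with desired, which is what makes repeated calls safe.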
## Quick start
```bash
SKILL=engineering/kubernetes-operator/skills/kubernetes-operator
# Validate a CRD design
python "$SKILL/scripts/crd_validator.py" --crd config/crd/myapp.yaml
# Lint a Go reconcile function
python "$SKILL/scripts/reconcile_lint.py" --controller controllers/myapp_controller.go
# Score against OperatorHub Capability Levels (1-5)
python "$SKILL/scripts/operator_capability_audit.py" --operator-dir .
```
## The 3 Python tools
All stdlib-only. Run with `--help`.
### `crd_validator.py`
Validates a CRD YAML against operator-pattern best practices.
```bash
python scripts/crd_validator.py --crd config/crd/myapp.yaml
python scripts/crd_validator.py --crd config/crd/ --format json
```
**Checks:**
- `spec.versions[*].subresources.status` is set (status subresource)
- `spec.scope` is `Namespaced` (not `Cluster`) unless explicitly justified
- Singular and listKind defined
- `spec.versions[*].schema.openAPIV3Schema` has type definitions (no `x-kubernetes-preserve-unknown-fields: true` at top level)
- A version is marked `served: true` AND `storage: true`
- Conditions array is in the schema (allows `[]metav1.Condition`)
- Printer columns include `Age` and `Status`/`Phase`
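A few of these checks can be sketched over a CRD that has already been parsed into a dict. This is an illustration only, not the implementation of `crd_validator.py`: the function name `check_crd`, the finding strings, and the FAIL/WARN severity assigned to each check are assumptions.

```python
def check_crd(crd: dict) -> list:
    """Return human-readable findings for a CRD already parsed into a dict."""
    findings = []
    spec = crd.get("spec", {})
    versions = spec.get("versions", [])

    # Scope check: cluster-scoped CRDs need an explicit justification.
    if spec.get("scope") != "Namespaced":
        findings.append("WARN: scope is not Namespaced")

    for v in versions:
        name = v.get("name", "<unnamed>")
        # Status subresource must be declared per version.
        if "status" not in v.get("subresources", {}):
            findings.append(f"FAIL: {name} has no status subresource")
        # Preserving unknown fields at the schema root disables pruning and validation.
        schema = v.get("schema", {}).get("openAPIV3Schema", {})
        if schema.get("x-kubernetes-preserve-unknown-fields") is True:
            findings.append(f"FAIL: {name} preserves unknown fields at the schema root")

    # Exactly one version should be both served and the storage version.
    storage = [v for v in versions if v.get("served") and v.get("storage")]
    if len(storage) != 1:
        findings.append("FAIL: expected exactly one served storage version")
    return findings


crd = {
    "spec": {
        "scope": "Namespaced",
        "versions": [
            {"name": "v1alpha1", "served": True, "storage": True,
             "schema": {"openAPIV3Schema": {"type": "object"}}},
        ],
    }
}
findings = check_crd(crd)
print(findings)
```

Adding `"subresources": {"status": {}}` to the version entry clears the only finding in this example.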
### `reconcile_lint.py`
Lints a Go controller reconcile function for anti-patterns.
```bash
python scripts/reconcile_lint.py --controller controllers/myapp_controller.go
```
**Checks (regex-based heuristics):**
- Returns are `(ctrl.Result, error)` shape
- Errors are returned so the request requeues (`return ctrl.Result{}, err`; controller-runtime ignores the `Result` and retries with backoff when `err` is non-nil)
- `client.Update()` on the spec object is flagged (controllers should update only status)
- `time.Sleep` inside reconcile is flagged (use `RequeueAfter`)
- HTTP calls without context cancellation are flagged
- Missing `defer` after a finalizer add
- No `IsConditionTrue` / `SetCondition` calls when conditions present in CRD
- Reconcile function exceeds 80 lines (extract subroutines)
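To make "regex-based heuristics" concrete, here is a minimal sketch of two such rules applied line by line to Go source. It is not the code of `reconcile_lint.py`; the rule ids, patterns, and messages are hypothetical, and a regex pass like this will miss or over-report cases that a real Go parser would handle.

```python
import re

# Hypothetical rules: (rule id, compiled pattern, message).
RULES = [
    ("sleep-in-reconcile", re.compile(r"\btime\.Sleep\("),
     "use RequeueAfter instead of time.Sleep"),
    ("spec-update", re.compile(r"\br\.(?:Client\.)?Update\("),
     "prefer r.Status().Update for status writes"),
]


def lint(go_source: str) -> list:
    """Return (line number, rule id, message) for each line matching a rule."""
    findings = []
    for lineno, line in enumerate(go_source.splitlines(), start=1):
        if line.strip().startswith("//"):   # ignore full-line comments
            continue
        for rule_id, pattern, message in RULES:
            if pattern.search(line):
                findings.append((lineno, rule_id, message))
    return findings


sample = """func (r *MyAppReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    time.Sleep(30 * time.Second)
    if err := r.Client.Update(ctx, obj); err != nil {
        return ctrl.Result{}, err
    }
    return ctrl.Result{RequeueAfter: time.Minute}, nil
}"""
findings = lint(sample)
for lineno, rule_id, message in findings:
    print(f"line {lineno}: {rule_id}: {message}")
```

Because the rules are heuristics, findings are best read as prompts for review rather than proof of a bug.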
### `operator_capability_audit.py`
Scores an operator against OperatorHub's 5 Capability Levels.
```bash
python scripts/operator_capability_audit.py --operator-dir .
```
**Levels:**
- **L1 — Basic Install:** CRD defined, controller deploys it
- **L2 — Seamless Upgrades:** PDBs, conversion webhooks, version skew strategy
- **L3 — Full Lifecycle:** backups, restores, failure recovery
- **L4 — Deep Insights:** metrics endpoint, Prometheus rules, alerts
- **L5 — Auto Pilot:** auto-scaling, auto-tuning, anomaly detection
Reports current level + concrete next steps to advance one level.
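One way to read "current level + next step" is that levels are cumulative: an operator sits at the highest level N for which levels 1 through N are all satisfied. The sketch below encodes that reading. Cumulative scoring and the function names are assumptions; the actual scoring logic of `operator_capability_audit.py` is not shown in this document.

```python
LEVELS = ["Basic Install", "Seamless Upgrades", "Full Lifecycle", "Deep Insights", "Auto Pilot"]


def capability_level(satisfied: set) -> int:
    """Return the highest level N such that levels 1..N are all satisfied."""
    level = 0
    for n in range(1, len(LEVELS) + 1):
        if n not in satisfied:
            break          # a gap caps the score, even if higher levels pass
        level = n
    return level


def next_step(satisfied: set) -> str:
    """Name the single level to work on next."""
    level = capability_level(satisfied)
    if level == len(LEVELS):
        return "already at L5"
    return f"advance to L{level + 1} ({LEVELS[level]})"


# Metrics (L4) exist, but backup/restore (L3) is missing: the operator scores L2.
print(capability_level({1, 2, 4}), next_step({1, 2, 4}))
```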
## Tooling landscape
Pick a framework based on language and complexity. See `references/tooling_landscape.md`.
| Framework | Language | Best for | Maintenance |
|---|---|---|---|
| **controller-runtime** | Go | Production-grade, low-level control | Active (sig-api-machinery) |
| **kubebuilder** | Go | Standard scaffolding, opinionated | Active (Kubernetes SIGs) |
| **operator-sdk** | Go / Helm / Ansible | OpenShift / mixed-paradigm teams | Active (Red Hat) |
| **metacontroller** | Any (webhook-based) | Polyglot teams, avoiding Go | Less active |
| **KOPF** | Python | Python shops, async-first | Active (community) |
| **java-operator-sdk** | Java | JVM shops | Active (Red Hat / Java SIG) |
Decision rules:
- New operator + Go shop → kubebuilder
- New operator + Python shop → KOPF
- New operator + can't pick a language → metacontroller
- OpenShift target → operator-sdk
## CRD design principles
See `references/crd_design.md` for full detail. Quick rules:
1. **status is the source of truth for the controller's view of the world.** Spec is what the user wants; status is what the controller observed.
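2. **Use the status subresource.** Without it, status lives in the main resource, so every status write bumps `metadata.generation` and cannot be filtered out; it re-triggers reconcile (loop).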
3. **Use Conditions.** `Ready`, `Reconciling`, `Degraded`. Each carries a reason and message.
4. **Add finalizers.** Without finalizers, deletion races the controller and orphans external resources.
5. **Version your CRD from day 1.** `v1alpha1` → `v1beta1` → `v1`. Plan a conversion webhook.
6. **Validate via OpenAPI v3 schema.** Don't rely on the controller for validation that should fail at admission.
7. **Use `additionalPrinterColumns` for `kubectl get`.** Show `Age`, `Phase`, `Ready` at minimum.
8. **Namespace your CRDs unless they manage cluster-scoped resources.**
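For rule 3, the useful property of a condition is that its transition time moves only when the status flips, not on every write. The sketch below models that on plain dicts; it loosely mirrors the behavior of the apimachinery helper `meta.SetStatusCondition` but is not that function, and `set_condition` and the `now` parameter are hypothetical.

```python
from datetime import datetime, timezone


def set_condition(conditions: list, cond_type: str, status: str,
                  reason: str, message: str, now=None) -> bool:
    """Upsert a condition by type; return True if anything changed.

    lastTransitionTime moves only when `status` flips, so it records when the
    state last changed rather than when it was last written.
    """
    now = now or datetime.now(timezone.utc).isoformat()
    for cond in conditions:
        if cond["type"] == cond_type:
            changed = (cond["status"], cond["reason"], cond["message"]) != (status, reason, message)
            if cond["status"] != status:
                cond["lastTransitionTime"] = now
            cond.update(status=status, reason=reason, message=message)
            return changed
    # First time this condition type is reported.
    conditions.append({"type": cond_type, "status": status, "reason": reason,
                       "message": message, "lastTransitionTime": now})
    return True


conds = []
a = set_condition(conds, "Ready", "False", "Provisioning", "creating database", now="t0")
b = set_condition(conds, "Ready", "False", "Provisioning", "creating database", now="t1")
c = set_condition(conds, "Ready", "True", "Available", "database ready", now="t2")
print(a, b, c, conds[0]["lastTransitionTime"])
```

Returning whether anything changed lets the caller skip the status write entirely when a reconcile observed nothing new.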
## Reconcile loop principles
See `references/reconcile_loop.md` for full detail. Quick rules:
1. **Idempotent.** Reconciling the same state twice → same result, no additional side effects.
2. **Read once, decide, act.** Don't observe the world repeatedly during reconcile.
3. **Update status, not spec.** Spec belongs to the user.
4. **Return errors that requeue.** Use `ctrl.Result{RequeueAfter: ...}` for known transient cases.
5. **Never block.** No `time.Sleep`. No long HTTP calls without context.
6. **Use the cache.** Read via the controller's cached client; only escape the cache for a specific reason.
7. **Leader-elect when running >1 replica.** Otherwise enable single-replica mode.
8. **Set OwnerReferences.** Cascading deletion is the operator pattern's free gift.
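Rule 4 separates known transient cases from unexpected failures. A language-neutral sketch of that split follows, using plain Python instead of Go; `TransientError`, `Result`, `reconcile_once`, and the 30-second delay are all hypothetical stand-ins for the `(ctrl.Result, error)` return shape, not controller-runtime APIs.

```python
from dataclasses import dataclass


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying, e.g. a dependency not ready yet."""


@dataclass
class Result:
    requeue_after: float = 0.0   # seconds; 0 means "done until the next watch event"


def reconcile_once(step) -> tuple:
    """Run one reconcile step and translate its outcome into (Result, error)."""
    try:
        step()
    except TransientError:
        # Known transient case: ask to be called again later, report no error.
        return Result(requeue_after=30.0), None
    except Exception as exc:
        # Unexpected failure: surface it and let the work queue back off.
        return Result(), exc
    return Result(), None


def dependency_not_ready():
    raise TransientError("database still provisioning")


result, err = reconcile_once(dependency_not_ready)
print(result, err)
```

The step returns immediately in every branch, so nothing here sleeps or holds a worker while waiting.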
## Workflows
### Workflow 1: Bootstrap a new operator (Go + kubebuilder)
```
1. Pick a Group/Version/Kind: e.g., apps.example.com/v1alpha1, kind=MyApp
2. kubebuilder init --domain example.com --repo github.com/org/myapp-operator
3. kubebuilder create api --group apps --version v1alpha1 --kind MyApp
4. Run crd_validator.py on config/crd/bases/apps.example.com_myapps.yaml
→ Fix every WARN before writing controller code
5. Implement the reconcile function (Karpathy principle 2: simplest correct version first)
6. Run reconcile_lint.py on controllers/myapp_controller.go
7. Run operator_capability_audit.py --operator-dir . — confirm L1
8. Test in a kind cluster: kubectl apply -f config/samples/
9. Add status conditions; aim for L2 in the same PR
```
### Workflow 2: Audit an existing operator
```
1. Run operator_capability_audit.py --operator-dir <path>
2. Run crd_validator.py --crd config/crd/
3. Run reconcile_lint.py --controller controllers/
4. Triage findings:
- FAIL → block release; fix before next deploy
- WARN → file an issue; fix in next 30 days
5. Document current capability level in README; commit
6. Plan one capability level advancement per quarter
```
### Workflow 3: Choose a framework
```
1. Identify primary language constraint (team skill)
2. Identify deployment target (vanilla k8s vs OpenShift)
3. Identify operator complexity (single CRD vs multi-CRD vs cluster-wide)
4. Cross-reference with references/tooling_landscape.md
5. Build a 1-week proof-of-concept before committing
```
## References
- `references/operator_pattern.md` — what an operator IS, when to use vs alternatives
- `references/crd_design.md` — CRD design principles, versioning, conversion webhooks
- `references/reconcile_loop.md` — reconcile patterns, error handling, idempotency
- `references/tooling_landscape.md` — framework comparison + decision tree
## Slash command
`/operator-audit` — Run all 3 tools on an operator repo and produce a markdown report.
## Asset templates
- `assets/crd_template.yaml` — CRD with status subresource, conditions, finalizer hint, printer columns
- `assets/reconcile_skeleton.go` — Go controller reconcile function with idempotency, conditions, finalizers, requeue patterns
## Anti-patterns
- **`time.Sleep(30 * time.Second)` inside reconcile**: blocks a reconcile worker and stalls other reconciles. Use `RequeueAfter`.
- **`r.Client.Update(ctx, obj)` to set status**: use `r.Status().Update(ctx, obj)` instead.
- **No leader election + 2+ replicas**: split-brain.
- **No finalizer**: external resources orphan on deletion.
- **CRD without status subresource**: status updates trigger spec reconciles (infinite loop).
- **Reconcile function > 200 lines**: extract reconcileXxx subroutines per condition.
- **`x-kubernetes-preserve-unknown-fields: true` on spec root**: defeats validation.
- **Imperative reconcile**: "if creating, do A; if updating, do B; if deleting, do C". Wrong shape. Reconcile = make actual=desired, regardless of how we got here.
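The "no finalizer" item is easier to see as a state machine. Below is a toy model on plain dicts, assuming a finalizer named `apps.example.com/cleanup` and an `external` set standing in for resources outside the cluster; both are hypothetical, and a real controller would persist each metadata change through the API server.

```python
FINALIZER = "apps.example.com/cleanup"  # hypothetical finalizer name


def reconcile_deletion(obj: dict, external: set) -> str:
    """Handle the finalizer lifecycle for one object; safe to call repeatedly."""
    meta = obj["metadata"]
    finalizers = meta.setdefault("finalizers", [])

    if meta.get("deletionTimestamp") is None:
        if FINALIZER not in finalizers:
            finalizers.append(FINALIZER)      # register before creating anything external
            return "finalizer-added"
        return "active"

    if FINALIZER in finalizers:
        external.discard(meta["name"])        # idempotent cleanup of the external resource
        finalizers.remove(FINALIZER)          # only now can the object actually be deleted
        return "cleaned-up"
    return "nothing-to-do"


obj = {"metadata": {"name": "db-1"}}
external = {"db-1"}
steps = [reconcile_deletion(obj, external), reconcile_deletion(obj, external)]
obj["metadata"]["deletionTimestamp"] = "2024-01-01T00:00:00Z"   # user ran a delete
steps += [reconcile_deletion(obj, external), reconcile_deletion(obj, external)]
print(steps, external)
```

Each call inspects the current state instead of remembering which phase came before, so the same function covers create, steady state, and delete.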
## Verifiable success
A team using this skill should achieve:
- 100% of new CRDs pass `crd_validator.py` before merge
- All reconcile functions pass `reconcile_lint.py` strict mode
- Operators reach OperatorHub Capability Level 3 (Full Lifecycle) before public release
- Mean time to fix a reconcile bug: <1 day (no infinite loops in production)