keysmith: Kubernetes Operator for Automated Secret Rotation

keysmith is a Kubernetes operator that automates secret rotation on a cron schedule. Declare a SecretRotationPolicy custom resource (CR), and keysmith handles credential generation, secret updates, workload restarts, and audit logging.

[Figure: keysmith architecture]

What it solves

SOC 2, PCI-DSS, and ISO 27001 all require periodic secret rotation. In practice, teams either rotate manually (error-prone, gets skipped) or build one-off CronJobs (no cluster state awareness, no audit trail, no failure handling). keysmith makes rotation a first-class Kubernetes resource with continuous reconciliation.

How it works

A SecretRotationPolicy CRD defines:

On each rotation, the controller:

  1. Creates a RotationRecord in Running phase (audit trail starts before the attempt)
  2. Calls the provider to fetch/generate new credentials
  3. Updates the target Kubernetes Secret
  4. Patches pod template annotations to trigger rolling restarts
  5. Finalizes the RotationRecord as Succeeded or Failed
  6. Dispatches HTTP webhooks on RotationSucceeded, RotationFailed, or RotationSkipped
  7. Prunes old RotationRecords beyond historyLimit
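
The steps above can be sketched with plain Go and no Kubernetes dependencies. This is a minimal illustration, not keysmith's actual controller code: the `Hooks` struct, the `Rotate` function, and the record fields are hypothetical stand-ins for the provider call, the Kubernetes API updates, and the webhook dispatcher. It also assumes that a failed workload restart marks the whole rotation as Failed, and it leaves out the RotationSkipped path.

```go
package main

import (
	"errors"
	"fmt"
)

// Phase mirrors the RotationRecord phases named in the steps above.
type Phase string

const (
	PhaseRunning   Phase = "Running"
	PhaseSucceeded Phase = "Succeeded"
	PhaseFailed    Phase = "Failed"
)

// RotationRecord is a simplified, hypothetical stand-in for the audit record.
type RotationRecord struct {
	Policy string
	Phase  Phase
	Error  string
}

// Hooks bundles the side effects of one rotation. Every field is a
// hypothetical placeholder for the real provider, API client, or webhook.
type Hooks struct {
	Generate     func() (map[string][]byte, error)  // step 2
	UpdateSecret func(data map[string][]byte) error // step 3
	RestartPods  func() error                       // step 4
	Notify       func(event string)                 // step 6
}

// Rotate runs steps 1-7 for one policy and returns the pruned history.
func Rotate(policy string, history []*RotationRecord, limit int, h Hooks) []*RotationRecord {
	// Step 1: record the attempt before doing any work.
	rec := &RotationRecord{Policy: policy, Phase: PhaseRunning}
	history = append(history, rec)

	// Steps 2-4: stop at the first error.
	err := func() error {
		data, err := h.Generate()
		if err != nil {
			return fmt.Errorf("generate credentials: %w", err)
		}
		if err := h.UpdateSecret(data); err != nil {
			return fmt.Errorf("update secret: %w", err)
		}
		if err := h.RestartPods(); err != nil {
			return fmt.Errorf("restart workloads: %w", err)
		}
		return nil
	}()

	// Steps 5 and 6: finalize the record, then notify.
	if err != nil {
		rec.Phase, rec.Error = PhaseFailed, err.Error()
		h.Notify("RotationFailed")
	} else {
		rec.Phase = PhaseSucceeded
		h.Notify("RotationSucceeded")
	}

	// Step 7: keep only the newest `limit` records.
	if limit > 0 && len(history) > limit {
		history = history[len(history)-limit:]
	}
	return history
}

func main() {
	var events []string
	h := Hooks{
		Generate:     func() (map[string][]byte, error) { return nil, errors.New("backend unavailable") },
		UpdateSecret: func(map[string][]byte) error { return nil },
		RestartPods:  func() error { return nil },
		Notify:       func(e string) { events = append(events, e) },
	}
	history := Rotate("db-creds", nil, 5, h)
	fmt.Println(history[0].Phase, history[0].Error, events)
}
```

Because the record is appended before the provider is called, a failure at any later step still leaves an entry in the history, which is the point of starting the audit trail first.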

Manual rotation is supported via annotation: set secrets.keysmith.io/rotate=now and the controller rotates immediately. Policies can be suspended and resumed without deletion.
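
As a rough illustration of the annotation trigger, the check below reads the annotation from an object's metadata and clears it once handled. Only the annotation key and the `now` value come from the text above. The helper names are hypothetical, and clearing the annotation after the rotation is an assumption about how a controller would avoid re-triggering on the next reconcile.

```go
package main

import "fmt"

// RotateAnnotation is the annotation key described above.
const RotateAnnotation = "secrets.keysmith.io/rotate"

// ShouldRotateNow reports whether the annotations request an immediate rotation.
// Reading from a nil map is safe in Go and returns the empty string.
func ShouldRotateNow(annotations map[string]string) bool {
	return annotations[RotateAnnotation] == "now"
}

// ClearRotateRequest removes the annotation so the same request is not
// handled twice (assumed behavior, not confirmed by the project docs).
func ClearRotateRequest(annotations map[string]string) {
	delete(annotations, RotateAnnotation)
}

func main() {
	ann := map[string]string{RotateAnnotation: "now"}
	if ShouldRotateNow(ann) {
		fmt.Println("rotating immediately")
		ClearRotateRequest(ann)
	}
	fmt.Println("pending request:", ShouldRotateNow(ann))
}
```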

Provider interface

The Provider interface has three methods: Validate, FetchSecret, and RotateSecret. Adding a new backend takes about 50 lines of Go: implement the interface, register it in main.go, update the CRD enum, and run make manifests.
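
To give a feel for the size of a backend, here is a hedged sketch of what such an interface and a toy implementation might look like. Only the three method names come from keysmith. The parameter and return types, the `randomProvider` backend, and the `providers` registry map are assumptions made for illustration, and the real signatures may differ.

```go
package main

import (
	"context"
	"crypto/rand"
	"encoding/hex"
	"errors"
	"fmt"
)

// Provider lists the three methods named above. The signatures are
// assumptions; check the keysmith source for the real ones.
type Provider interface {
	Validate(params map[string]string) error
	FetchSecret(ctx context.Context, params map[string]string) (map[string][]byte, error)
	RotateSecret(ctx context.Context, params map[string]string) (map[string][]byte, error)
}

// randomProvider is a hypothetical backend that keeps one random hex token in memory.
type randomProvider struct {
	current map[string][]byte
}

// Validate rejects configurations that do not name the secret key to populate.
func (p *randomProvider) Validate(params map[string]string) error {
	if params["key"] == "" {
		return errors.New(`parameter "key" is required`)
	}
	return nil
}

// FetchSecret returns the most recently generated value without changing it.
func (p *randomProvider) FetchSecret(_ context.Context, _ map[string]string) (map[string][]byte, error) {
	if p.current == nil {
		return nil, errors.New("no secret generated yet")
	}
	return p.current, nil
}

// RotateSecret generates 16 random bytes and stores them hex-encoded.
func (p *randomProvider) RotateSecret(_ context.Context, params map[string]string) (map[string][]byte, error) {
	buf := make([]byte, 16)
	if _, err := rand.Read(buf); err != nil {
		return nil, fmt.Errorf("read random bytes: %w", err)
	}
	p.current = map[string][]byte{params["key"]: []byte(hex.EncodeToString(buf))}
	return p.current, nil
}

// providers is a hypothetical registry keyed by the provider name used in a policy.
var providers = map[string]Provider{"random": &randomProvider{}}

func main() {
	p := providers["random"]
	params := map[string]string{"key": "password"}
	if err := p.Validate(params); err != nil {
		panic(err)
	}
	data, err := p.RotateSecret(context.Background(), params)
	if err != nil {
		panic(err)
	}
	fmt.Printf("generated a %d-character value for %q\n", len(data["password"]), "password")
}
```

Keeping Validate separate from the two secret operations lets configuration problems surface before any credential is touched.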

Built-in providers:

Audit trail

RotationRecord is the compliance-critical piece. Every rotation attempt (success or failure) creates an immutable record capturing:

Records are linked to policies via owner references and auto-pruned per historyLimit. Auditors can query the full rotation history for any secret.
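
Pruning per historyLimit amounts to keeping the newest N records and deleting the rest. The sketch below shows one way to make that split. The `Record` type, its fields, and the `Prune` function are hypothetical, and the sketch assumes records are ordered by start time.

```go
package main

import (
	"fmt"
	"sort"
)

// Record is a hypothetical, trimmed-down RotationRecord.
type Record struct {
	Name      string
	StartedAt int64 // Unix seconds
}

// Prune splits records into the newest `limit` entries to keep and the
// older entries to delete. The input slice is not modified.
func Prune(records []Record, limit int) (keep, remove []Record) {
	sorted := append([]Record(nil), records...)
	// Newest first.
	sort.Slice(sorted, func(i, j int) bool {
		return sorted[i].StartedAt > sorted[j].StartedAt
	})
	if limit < 0 {
		limit = 0
	}
	if len(sorted) <= limit {
		return sorted, nil
	}
	return sorted[:limit], sorted[limit:]
}

func main() {
	records := []Record{
		{Name: "rot-1", StartedAt: 100},
		{Name: "rot-3", StartedAt: 300},
		{Name: "rot-2", StartedAt: 200},
	}
	keep, remove := Prune(records, 2)
	fmt.Println(len(keep), "kept;", len(remove), "pruned; oldest pruned:", remove[0].Name)
}
```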

Admission webhooks

Defaulting and validation webhooks run at the API server:

This catches configuration errors at apply time instead of at rotation time.

Observability

Four Prometheus metrics:

| Metric | Type | Purpose |
|---|---|---|
| keysmith_rotations_total | Counter | Total rotation attempts by policy, provider, result |
| keysmith_rotation_duration_seconds | Histogram | Rotation execution latency |
| keysmith_next_rotation_timestamp_seconds | Gauge | Next scheduled rotation (alert on stale policies) |
| keysmith_active_policies_total | Gauge | Count of non-suspended policies |

OpenTelemetry tracing and structured logging via Zap round out the observability stack.

Why an operator, not a CronJob

A CronJob runs a script and exits. It has no cluster state awareness, no status conditions, no leader election, no event reactivity. If a CronJob fails, you find out from missing logs. If a keysmith rotation fails, you get a Failed phase on the policy, a RotationRecord with the error, a webhook notification, and a Prometheus metric increment.

The operator pattern uses continuous reconciliation: if something drifts, the controller corrects it on the next loop. Leader election via Kubernetes Lease objects ensures only one replica runs rotations at a time.

Testing

The suite uses Ginkgo v2 and envtest (a real kube-apiserver and etcd started locally, without a full cluster). Tests cover:

Coverage: 56%+ for the controller, 87-100% for providers, 91% for the scheduler.

