Cert Rotation
ShoreGuard's CertRotationService proactively rotates mTLS client
certs before they expire. It fulfils the M28 observability promise
operationally: the client-cert metadata and reload_credentials() hook
have existed since v0.31, but until v0.34.0 no scheduler was wiring them
up.
What the service does
When SHOREGUARD_CERT_ROTATION_ENABLED=true (the default), a background
task starts with the service and runs a poll cycle every
SHOREGUARD_CERT_ROTATION_POLL_INTERVAL_S seconds (default 1h, i.e. 3600
seconds). Each cycle:
- Lists every registered gateway.
- For each connected client, reads cert_info.seconds_until_expiry.
- If the remaining validity is below SHOREGUARD_CERT_ROTATION_THRESHOLD_DAYS (default 7), re-reads the credential bytes from the registry and calls reload_credentials() to rebuild the channel.
- Records an audit-log entry gateway.cert_rotated on success, and emits a webhook gateway.cert_rotation_failed when all retries have been exhausted.
The service does not generate new certs. It assumes an external
process (cert-manager, a cron, an operator running
shoreguard gateway register --client-cert …) has landed the fresh
material in the credentials table. ShoreGuard's job is to pick it up.
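The per-client decision in each cycle can be sketched roughly as follows. This is an illustration only, not ShoreGuard's implementation: GatewayClient, rotate_if_due, load_credentials, and the (cert, key) argument shape of reload_credentials are assumptions made for the example, and the retry count of 3 is a placeholder. Only the threshold default and the outcome names come from this page.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

SECONDS_PER_DAY = 86_400


@dataclass
class GatewayClient:
    """Hypothetical stand-in for a connected gateway client."""
    name: str
    seconds_until_expiry: Optional[float]  # None when no client cert is loaded
    # Assumed signature; the real reload_credentials() hook may differ.
    reload_credentials: Callable[[bytes, bytes], None]


def rotate_if_due(
    client: GatewayClient,
    load_credentials: Callable[[str], Tuple[bytes, bytes]],
    threshold_days: int = 7,
    max_attempts: int = 3,  # placeholder retry budget
) -> str:
    """Return an outcome name matching the rotation metric's outcome values."""
    if client.seconds_until_expiry is None:
        return "skipped_no_cert"
    # Rotate only when remaining validity is strictly below the threshold.
    if client.seconds_until_expiry >= threshold_days * SECONDS_PER_DAY:
        return "skipped_not_due"
    for _ in range(max_attempts):
        try:
            # Re-read whatever the external process landed in the registry.
            cert, key = load_credentials(client.name)
            client.reload_credentials(cert, key)
            return "success"
        except Exception:
            continue  # try again until the retry budget is spent
    return "failure"
```

Note that the sketch never creates key material. It only re-reads what the registry already holds, which mirrors the division of labour described above.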
Observability
- Metric: sg_gateway_cert_rotations_total{gateway,outcome}, where outcome is one of success, failure, skipped_not_due, skipped_no_cert.
- Audit: gateway.cert_rotated with {before_seconds_until_expiry, after_seconds_until_expiry, attempts}.
- Webhook (on give-up): gateway.cert_rotation_failed payload:
{
  "gateway": "prod-gw-1",
  "reason": "validate_bundle: NotAfter is in the past",
  "retries": 3,
  "seconds_until_expiry": 259200,
  "next_attempt_at": 1714320000.0
}
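A webhook receiver might turn this payload into a one-line alert along these lines. This is a minimal sketch: the field names are taken from the example payload above, while CertRotationFailed, parse_payload, and summarize are hypothetical helpers. It also assumes next_attempt_at is a Unix timestamp in seconds, which the payload suggests but this page does not state.

```python
import json
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class CertRotationFailed:
    """Typed view of the gateway.cert_rotation_failed payload."""
    gateway: str
    reason: str
    retries: int
    seconds_until_expiry: float
    next_attempt_at: float  # assumed: Unix timestamp in seconds


def parse_payload(body: str) -> CertRotationFailed:
    """Parse the JSON body; raises KeyError if a field is missing."""
    data = json.loads(body)
    return CertRotationFailed(
        gateway=data["gateway"],
        reason=data["reason"],
        retries=int(data["retries"]),
        seconds_until_expiry=float(data["seconds_until_expiry"]),
        next_attempt_at=float(data["next_attempt_at"]),
    )


def summarize(event: CertRotationFailed) -> str:
    """Build a short human-readable alert line."""
    days_left = event.seconds_until_expiry / 86_400
    next_try = datetime.fromtimestamp(event.next_attempt_at, tz=timezone.utc)
    return (
        f"{event.gateway}: rotation failed after {event.retries} retries "
        f"({event.reason}); {days_left:.1f} days of validity left, "
        f"next attempt at {next_try.isoformat()}"
    )
```

Converting seconds_until_expiry to days in the alert makes it easier to compare against SHOREGUARD_CERT_ROTATION_THRESHOLD_DAYS at a glance.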
Runbook: rotation failed
The webhook fires after retries are exhausted in a single poll cycle. The next cycle starts clean, so if nothing changes the alert will repeat.
- Inspect the reason. Most common cause: the registry still holds the expired bytes. Land the fresh cert pair via shoreguard gateway register --client-cert … --client-key … (or your cert-manager equivalent).
- Check sg_gateway_cert_expiry_seconds{gateway} to confirm whether the new bytes got picked up on the next cycle.
- Service disabled? If an operator set SHOREGUARD_CERT_ROTATION_ENABLED=false, rotations are paused; re-enable after the credentials issue is resolved.
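For the metric check, a small script over a scraped metrics dump can list the gateways that are still inside the rotation window. The sketch below assumes the metric is exposed in the Prometheus text format with a gateway label and that its value is the remaining validity in seconds; neither detail is confirmed on this page, and gateways_below_threshold is a hypothetical helper.

```python
import re
from typing import Dict

# Matches lines such as: sg_gateway_cert_expiry_seconds{gateway="prod-gw-1"} 259200
_SAMPLE = re.compile(r"^sg_gateway_cert_expiry_seconds\{([^}]*)\}\s+(\S+)")
_GATEWAY = re.compile(r'gateway="([^"]*)"')


def gateways_below_threshold(metrics_text: str, threshold_days: int = 7) -> Dict[str, float]:
    """Map gateway name to remaining seconds for certs under the threshold."""
    limit = threshold_days * 86_400
    due: Dict[str, float] = {}
    for line in metrics_text.splitlines():
        sample = _SAMPLE.match(line.strip())
        if sample is None:
            continue  # comments, blank lines, and other metrics
        gateway = _GATEWAY.search(sample.group(1))
        if gateway is None:
            continue
        seconds = float(sample.group(2))
        if seconds < limit:
            due[gateway.group(1)] = seconds
    return due
```

If a gateway still shows up here one poll interval after the fresh cert pair was registered, the new bytes were probably not picked up.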
Multi-replica safety
Rotation is idempotent relative to its inputs (no server-side mutation), so every replica rotates its own client pool independently. No advisory lock is needed.