MIR-1227

Runner certificates have no renewal path — fleet-wide expiry cliff at one year

Open public
phinze · Opened Jun 5, 2026 · Updated Jun 11, 2026

Runner certs are issued with ValidFor: 365 days at Join, and nothing ever renews them. Distributed runners are long-lived pets on persistent disks, so every runner in every cluster hits the expiry cliff within a year of joining.

The cliff is sharp because of how refresh authenticates: RefreshCertificate (#848) requires the caller's existing cert to pass VerifyCert, which checks validity. An expired cert can't vouch for its own replacement, so past the cliff the only recovery is manual re-join per runner.

Fix, cheapest first: ensureRunnerCertificate already parses the persisted cert at every runner start; have it also refresh when time.Now() is within some window of NotAfter (say 30 days), while the cert can still authenticate itself. That converts the cliff into routine self-healing for any runner that restarts at least once inside that window, so roughly once a month. Full periodic renewal (a background timer, no restart needed) can come later; the near-expiry check at start is the 90% win. Refreshing also picks up a rotated CA via the returned ca_pem.

Surfaced during review of #848.