MIR-1285

Build saga crash recovery: resumed build can't complete after coordinator restart (registry-DNS race + stale deploy lock)

In Progress public

phinze Opened Jul 1, 2026 Updated Jul 2, 2026

Fix build saga crash recovery so resumed builds complete after a restart

Found while manually testing saga crash recovery on toys (MIR-951).

Saga state recovery itself works. After a SIGKILL of the coordinator mid-build, the executor on auto-restart recovers the in-flight build-from-tar saga from etcd and resumes the build-image action from staged source (StreamRegistry) with no client attached; the client had already errored out with timeout: no recent network activity. The resumed build still fails to complete, for two reasons:

1. Registry-DNS race on resume

The recovered build resumes before in-cluster DNS is ready, so the image push to cluster.local:5000 fails with:

failed to push cluster.local:5000/<app>:... dial tcp: lookup cluster.local on 127.0.0.53:53: no such host

The saga retries build-image ~6× within ~1s (too fast to let DNS come up), then fails and rolls back. A fresh deploy minutes later pushes fine, confirming it's a transient post-restart ordering gap, not a broken registry. Recovered builds should wait for the cluster registry to be reachable (readiness gate or real backoff) rather than exhausting fast retries.
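A minimal sketch of the "readiness gate with real backoff" idea, assuming the gate only needs a successful TCP dial to the registry address before build-image resumes. The names waitForRegistry and backoffDelay and the delay values are hypothetical illustrations, not part of the existing saga executor.

```go
package main

import (
	"context"
	"fmt"
	"net"
	"time"
)

// backoffDelay returns the wait before retry number attempt (0-based):
// base doubled once per attempt, capped at max.
func backoffDelay(attempt int, base, max time.Duration) time.Duration {
	d := base
	for i := 0; i < attempt; i++ {
		d *= 2
		if d >= max {
			return max
		}
	}
	if d > max {
		return max
	}
	return d
}

// waitForRegistry (hypothetical helper) blocks until a TCP connection to addr
// succeeds or ctx is done. A DNS failure such as "no such host" surfaces as a
// dial error and is retried with growing delays instead of burning fast retries.
func waitForRegistry(ctx context.Context, addr string, base, max time.Duration) error {
	var dialer net.Dialer
	for attempt := 0; ; attempt++ {
		dialCtx, cancel := context.WithTimeout(ctx, 2*time.Second)
		conn, err := dialer.DialContext(dialCtx, "tcp", addr)
		cancel()
		if err == nil {
			conn.Close()
			return nil
		}
		select {
		case <-ctx.Done():
			return fmt.Errorf("registry %s not reachable: %w (last error: %v)", addr, ctx.Err(), err)
		case <-time.After(backoffDelay(attempt, base, max)):
		}
	}
}

func main() {
	// Short overall deadline for the demo; a recovered build would likely allow longer.
	ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
	defer cancel()

	err := waitForRegistry(ctx, "cluster.local:5000", 500*time.Millisecond, 30*time.Second)
	if err != nil {
		fmt.Println("gate failed:", err)
		return
	}
	fmt.Println("registry reachable; safe to resume build-image")
}
```

With a 500ms base, six attempts span roughly 15s rather than ~1s, which may be enough for in-cluster DNS to come up after a restart; the right numbers would need to be measured.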

2. Stale deployment lock

The crashed deploy leaves a deployment lock on the app; redeploying under the same name is then blocked (ERROR: deployment blocked by lock). The lock should be reconciled/released on recovery (or carry a TTL).

Repro

Deploy with [build] onbuild = ["sleep 180"] to open a build window. Mid-build, run sudo systemctl kill -s SIGKILL miren.service on the coordinator. Observe the auto-restart and saga recovery (recovering saga ... status: running → executing action ... build-image), followed by the push failure loop and the leftover lock.

Impact

Saga crash recovery resumes correctly, but a recovered build does not yet finish, so the durability win isn't fully realized for builds. No orphan state is left (clean rollback; cluster healthy afterward). Relevant to trusting crash recovery before wider rollout.

Related

MIR-951 (enablement + manual testing, where this surfaced)
MIR-441 (Build Process Saga)
