Build saga crash recovery: resumed build can't complete after coordinator restart (registry-DNS race + stale deploy lock)
Found while manually testing saga crash recovery on toys (MIR-951).
Saga state recovery itself works. After a SIGKILL of the coordinator mid-build, the executor on auto-restart recovers the in-flight build-from-tar saga from etcd and resumes the build-image action from staged source (StreamRegistry). No client is attached at that point; the original client had already errored out with "timeout: no recent network activity". The resumed build still fails to complete, for two reasons:
1. Registry-DNS race on resume
The recovered build resumes before in-cluster DNS is ready, so the image push to cluster.local:5000 fails with:
failed to push cluster.local:5000/<app>:... dial tcp: lookup cluster.local on 127.0.0.53:53: no such host
The saga retries build-image ~6× within ~1s, which is too fast for DNS to come up, then fails and rolls back. A fresh deploy minutes later pushes fine, confirming this is a transient post-restart ordering gap, not a broken registry. Recovered builds should wait for the cluster registry to be reachable (a readiness gate, or a backoff long enough to outlast DNS startup) rather than exhausting fast retries.
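One possible shape for such a readiness gate, as a minimal sketch rather than the coordinator's actual code: poll DNS for the registry host with capped exponential backoff and give up only after an overall deadline. The names `wait_for_registry` and `backoff_delays`, the 0.5s/30s/120s values, and the choice to gate on name resolution alone (not on a successful TCP connect or push) are all assumptions made for illustration.

```python
import socket
import time
from typing import Callable, Iterator


def backoff_delays(base: float = 0.5, factor: float = 2.0, cap: float = 30.0) -> Iterator[float]:
    """Yield exponentially growing delays (base, base*factor, ...), capped at cap."""
    delay = base
    while True:
        yield min(delay, cap)
        delay = min(delay * factor, cap)


def wait_for_registry(
    host: str,
    port: int,
    timeout: float = 120.0,
    resolve: Callable[[str, int], object] = socket.getaddrinfo,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Hypothetical gate: block until host resolves or timeout seconds elapse.

    Returns True once resolution succeeds, False if the deadline passes first.
    resolve, sleep and clock are injectable so the loop can be tested without
    real DNS or real waiting.
    """
    deadline = clock() + timeout
    for delay in backoff_delays():
        try:
            resolve(host, port)
            return True
        except OSError:
            # socket.gaierror ("no such host") is a subclass of OSError.
            pass
        remaining = deadline - clock()
        if remaining <= 0:
            return False
        # Never sleep past the deadline.
        sleep(min(delay, remaining))
    return False
```

In this sketch the recovered saga would call something like `wait_for_registry("cluster.local", 5000)` before re-running build-image, and only count a retry against the action once the gate has passed. With the defaults above, the first six attempts span roughly 15s instead of ~1s.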
2. Stale deployment lock
The crashed deploy leaves a deployment lock on the app; redeploying under the same name is then blocked (ERROR: deployment blocked by lock). The lock should be reconciled/released on recovery (or carry a TTL).
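Both remedies can be illustrated with a small in-memory model. This is a sketch under stated assumptions, not the real lock implementation: `DeployLocks`, its methods, and the 300s TTL are hypothetical names and values. Since saga state already lives in etcd, attaching the lock key to an etcd lease would likely be the natural way to get the same expiry behaviour in practice.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, List, Set


@dataclass
class Lock:
    owner: str         # ID of the deploy that holds the lock
    expires_at: float  # clock value after which the lock no longer blocks


class DeployLocks:
    """Hypothetical per-app deployment locks with a TTL and a recovery sweep."""

    def __init__(self, ttl: float = 300.0, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl
        self._clock = clock
        self._locks: Dict[str, Lock] = {}

    def acquire(self, app: str, owner: str) -> bool:
        """Take the lock unless another owner holds an unexpired one."""
        now = self._clock()
        held = self._locks.get(app)
        if held is not None and held.owner != owner and held.expires_at > now:
            return False
        # Fresh acquire, re-acquire by the same owner, or takeover of an expired lock.
        self._locks[app] = Lock(owner, now + self._ttl)
        return True

    def release(self, app: str, owner: str) -> None:
        """Drop the lock if this owner holds it."""
        held = self._locks.get(app)
        if held is not None and held.owner == owner:
            del self._locks[app]

    def reconcile(self, live_owners: Set[str]) -> List[str]:
        """On recovery, drop locks whose owning deploy is no longer live.

        Returns the sorted list of apps whose locks were released.
        """
        stale = sorted(app for app, lock in self._locks.items()
                       if lock.owner not in live_owners)
        for app in stale:
            del self._locks[app]
        return stale
```

The TTL bounds how long a crashed deploy can block a redeploy even if nothing else runs; the `reconcile` sweep releases the lock immediately once recovery knows which sagas are still live. A long-running build would need to renew its lock (here, by calling `acquire` again as the same owner) before the TTL lapses.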
Repro
1. Deploy with [build] onbuild = ["sleep 180"] to get a build window.
2. Mid-build, run sudo systemctl kill -s SIGKILL miren.service on the coordinator.
3. Observe auto-restart and saga recovery (recovering saga ... status: running → executing action ... build-image).
4. Observe the push failure loop and the leftover lock.
Impact
Saga crash recovery resumes correctly, but a recovered build does not yet finish, so the durability win isn't fully realized for builds. Apart from the stale deployment lock described above, no orphan state is left (clean rollback; cluster healthy afterward). This matters for trusting crash recovery before wider rollout.