MIR-1252

miren deploy should wait for healthy and report the truth

In Progress public

phinze Opened Jun 23, 2026 Updated Jun 23, 2026

miren deploy declares victory too early. The moment it flips the deployment record to active server-side, it prints "All traffic moved to new version" and exits, without ever waiting to see the new instance serve a request (see cli/commands/deploy.go). The activation poller only checks that the active version matches and a pool is assigned, not that anything works. The version pointer moved, but whether anything's answering is anyone's guess. The app might still be booting or crash-looping, and you'd never know from the deploy output.

Make deploy wait for the new version to actually be healthy before it reports, so it can tell the truth: "live and serving," or "came up, but heads up...", or "never became healthy, here's why."

Scope note: this is the consumer side. "Healthy" should be read from an authoritative signal (today: sandbox RUNNING + the network health check), not redefined in the deploy path. The richer, user-configurable definition belongs to the Health Checks item (MIR-1251); build this to read health from app-status so it picks up whatever MIR-1251 lands, rather than inventing its own notion.

This is also the natural home for the loud port-divergence warning from MIR-1246: when an app ignores $PORT and binds elsewhere, we auto-route and record the observed port on the sandbox, but today the only signal is a quiet line in m logs. Once deploy waits for health, it can surface that as a real deploy warning, the way we warn about implicitly mounting a local disk.
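
For illustration, the port-divergence check could be as small as the sketch below. It assumes deploy can see both the port handed to the app via $PORT and the observed port recorded on the sandbox; the function name and the warning wording are made up for this example, not the MIR-1246 messages.

```go
package main

import "fmt"

// PortWarning returns a deploy warning when the app bound a port other than
// the one assigned via $PORT, or "" when there is nothing to report.
// An observed port of 0 is treated as "not recorded yet" (an assumption).
func PortWarning(assigned, observed int) string {
	if observed == 0 || observed == assigned {
		return ""
	}
	return fmt.Sprintf(
		"app ignored $PORT (%d) and is listening on %d; traffic was routed there automatically",
		assigned, observed,
	)
}
```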

Things to work through:

Surface a health/readiness signal through app-status / AppInfo.
Autoscale-to-zero: a scaled-to-zero app has no running instance, so "healthy" can't mean "an instance is up right now"; don't hang waiting for one.
Rollback goes through the same path and should get the same honest treatment.
Failure UX: when a version never goes healthy, say why (reuse MIR-1246's actionable port messages, surface crash logs) instead of a generic timeout.
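
One possible shape for the scale-to-zero and failure-UX items, as a sketch only. Snapshot, NeedsWait, and FailureReport are hypothetical names, and treating "zero desired instances" as "nothing to wait for" is one candidate policy rather than a settled decision.

```go
package main

import (
	"fmt"
	"strings"
)

// Snapshot is a hypothetical view of a version when deploy decides what to print.
type Snapshot struct {
	DesiredInstances int      // 0 means the app is scaled to zero
	Healthy          bool     // read from app-status, not computed here
	PortHint         string   // actionable port message, if one exists
	CrashLogTail     []string // last few log lines from a crashed instance
}

// NeedsWait reports whether deploy should keep polling. A scaled-to-zero app
// has no instance to wait for, so it never blocks the deploy.
func NeedsWait(s Snapshot) bool {
	return s.DesiredInstances > 0 && !s.Healthy
}

// FailureReport builds the message for a version that never went healthy,
// preferring specific causes over a bare timeout.
func FailureReport(version string, s Snapshot) string {
	var b strings.Builder
	fmt.Fprintf(&b, "version %s never became healthy\n", version)
	if s.PortHint != "" {
		fmt.Fprintf(&b, "  hint: %s\n", s.PortHint)
	}
	if len(s.CrashLogTail) > 0 {
		b.WriteString("  last log lines:\n")
		for _, line := range s.CrashLogTail {
			fmt.Fprintf(&b, "    %s\n", line)
		}
	}
	if s.PortHint == "" && len(s.CrashLogTail) == 0 {
		b.WriteString("  no further detail was reported\n")
	}
	return b.String()
}
```

Because rollback goes through the same path, the same two helpers would apply there unchanged.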