MIR-524

All sandboxes killed and recreated on miren restart due to task.Wait() returning stale exit status

Open public
phinze phinze Opened Nov 15, 2025 Updated Apr 2, 2026

Problem

When miren restarts (e.g., during autoupgrade), ALL recovered sandboxes are immediately killed and recreated instead of being reattached to their running containers.

Symptom

On miren-garden during the upgrade from main:165a03e to main:3e24da6 at 2025-11-15 00:02:44:

  • 5 sandboxes were successfully recovered from etcd
  • All were reattached to their containerd containers
  • Within 2 seconds, all reported exit_code: 255, exit_time: 0001-01-01T00:00:00.000Z
  • All were marked as STOPPED, cleaned up, and replaced with new sandboxes

Affected apps:

  • meet (sb-CV8UVhPk92M1xPaBLHY6U → sb-CV8hGx4wtAws11ztxZRLK)
  • rfd (rfd-web-CV8UVmGerikqhtZCAhWH3 → new instances)
  • uptime-kuma (sb-CV8UVgWuWCKGuksvCK5Ra → sb-CV8hGwF5RbikE6MrHug1D)
  • mirendev (both instances replaced)

Root Cause

In controllers/sandbox/sandbox.go:557, when reattaching to existing containers:

exitCh, err := task.Wait(ctx)
if err != nil {
    c.Log.Warn("failed to set up task wait during reattach", "id", containerID, "error", err)
} else {
    go c.monitorTaskExit(sb, containerID, exitCh)
    c.Log.Debug("re-established task exit monitoring", "sandbox", sb.ID, "container", containerID)
}

task.Wait() always returns its exit channel immediately; the bug is that, when reattaching to an existing container, that channel fires right away with a stale/invalid exit status instead of delivering an event only when the task actually exits. The zero timestamp (0001-01-01T00:00:00.000Z) shows the exit status was never actually recorded, and exit code 255 matches containerd's UnknownExitStatus sentinel for an exit status that could not be determined.

Evidence

From journalctl logs at 2025-11-15 00:02:59:

00:02:59.052 [INFO] sandbox.sandbox container process exited │ container: sandbox.sb-CV8UVhPk92M1xPaBLHY6U-app exit_code: 255 exit_time: 0001-01-01T00:00:00.000Z
00:02:59.053 [INFO] sandbox.sandbox container process exited │ container: sandbox.sb-CV8UVhPk92M1xPaBLHY6U_pause exit_code: 255 exit_time: 0001-01-01T00:00:00.000Z

This pattern repeated for all 5 recovered sandboxes immediately after task monitoring was re-established.

Impact

  • App downtime during every miren restart/upgrade
  • Unnecessary sandbox churn
  • Loss of container state/history
  • Defeats the purpose of sandbox recovery

Suggested Fix

The reattachment code needs to:

  1. Verify via task.Status() that the task is actually still running before trusting any exit status
  2. Handle the case where the exit channel returned by task.Wait() fires immediately with stale data after reattach
  3. Possibly check exitStatus.ExitTime() for a zero value before acting on the exit

Location: controllers/sandbox/sandbox.go:557