All sandboxes killed and recreated on miren restart due to task.Wait() returning stale exit status
Problem
When miren restarts (e.g., during autoupgrade), ALL recovered sandboxes are immediately killed and recreated instead of being reattached to their running containers.
Symptom
On miren-garden during the upgrade from main:165a03e to main:3e24da6 at 2025-11-15 00:02:44:
- 5 sandboxes were successfully recovered from etcd
- All were reattached to their containerd containers
- Within 2 seconds, all reported exit_code: 255 with exit_time: 0001-01-01T00:00:00.000Z
- All were marked as STOPPED, cleaned up, and replaced with new sandboxes
Affected apps:
- meet: sb-CV8UVhPk92M1xPaBLHY6U → sb-CV8hGx4wtAws11ztxZRLK
- rfd: rfd-web-CV8UVmGerikqhtZCAhWH3 → new instances
- uptime-kuma: sb-CV8UVgWuWCKGuksvCK5Ra → sb-CV8hGwF5RbikE6MrHug1D
- mirendev: both instances replaced
Root Cause
In controllers/sandbox/sandbox.go:557, when reattaching to existing containers:
exitCh, err := task.Wait(ctx)
if err != nil {
	c.Log.Warn("failed to set up task wait during reattach", "id", containerID, "error", err)
} else {
	go c.monitorTaskExit(sb, containerID, exitCh)
	c.Log.Debug("re-established task exit monitoring", "sandbox", sb.ID, "container", containerID)
}
When reattaching to existing containers, the channel returned by task.Wait() delivers a stale/invalid exit status almost immediately, rather than blocking until the task actually exits. The zero timestamp (0001-01-01T00:00:00.000Z, Go's zero time.Time) indicates the exit status is not valid.
Evidence
From journalctl logs at 2025-11-15 00:02:59:
00:02:59.052 [INFO] sandbox.sandbox container process exited │ container: sandbox.sb-CV8UVhPk92M1xPaBLHY6U-app exit_code: 255 exit_time: 0001-01-01T00:00:00.000Z
00:02:59.053 [INFO] sandbox.sandbox container process exited │ container: sandbox.sb-CV8UVhPk92M1xPaBLHY6U_pause exit_code: 255 exit_time: 0001-01-01T00:00:00.000Z
This pattern repeated for all 5 recovered sandboxes immediately after task monitoring was re-established.
Impact
- App downtime during every miren restart/upgrade
- Unnecessary sandbox churn
- Loss of container state/history
- Defeats the purpose of sandbox recovery
Suggested Fix
The reattachment code needs to:
- Verify the task is actually still running before trusting the exit status
- Handle the case where task.Wait()'s channel delivers immediately with stale data
- Possibly check exitStatus.ExitTime() for a zero value before acting on the exit
Location: controllers/sandbox/sandbox.go:557