Sandbox controller doesn't detect when containerd tasks die unexpectedly
Problem
The sandbox controller does not detect when a containerd task has died while its container still exists. Miren then reports the sandbox as running when it is actually dead, and requests to the app return 502 errors.
Investigation Summary
On the miren-toys server, the conference-app has 3 sandboxes that Miren believes are running:
sandbox/sb-CScZ5eBs1ozeD8M2ovozr
sandbox/sb-CScZ6Av4QAP5g5hz2EzLH
sandbox/sb-CScZ6AvJkKrArHr6xeKeR
However:
- The containers exist in containerd but have no running tasks
- Attempting to start tasks fails with:
    namespace path: lstat /proc/6347/ns/ipc: no such file or directory
- The pause containers (which provide shared namespaces) died at some point, taking the app containers with them
- Miren's reconciliation loop keeps checking but doesn't detect the broken state
Root Cause
In controllers/sandbox/sandbox.go:314, the checkSandbox() function only verifies:
- Container exists
- Version label matches
It never checks whether the container's task is actually running. The controller assumes that if a container exists, it is running.
Impact
- Apps appear healthy in Miren but return 502 errors
- Sandboxes remain in broken state indefinitely
- Manual intervention required to fix
Proposed Fix
The checkSandbox() function needs to also verify task status:
func (c *SandboxController) checkSandbox(ctx context.Context, co *compute.Sandbox, meta *entity.Meta) (int, error) {
	// ... existing container checks ...

	// NEW: Check if task is actually running
	task, err := cont.Task(ctx, nil)
	if err != nil {
		if errdefs.IsNotFound(err) {
			c.Log.Debug("task not found for container, needs recreation")
			return differentVersion, nil // Force recreation
		}
		return 0, err
	}

	status, err := task.Status(ctx)
	if err != nil {
		return 0, err
	}

	if status.Status != containerd.Running {
		c.Log.Debug("task not running", "status", status.Status)
		return differentVersion, nil // Force recreation
	}

	// ... rest of function
}
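The decision made by the check above could also be pulled out into a pure function so it can be unit-tested without a containerd daemon. The following is only a sketch under assumptions: `action`, `keep`, `recreate`, and `decide` are hypothetical names that do not exist in Miren, and the status strings are the values of containerd's `ProcessStatus` constants ("running", "stopped", "paused", and so on).

```go
package main

import "fmt"

// action is a hypothetical stand-in for the controller's result codes.
type action int

const (
	keep     action = iota // sandbox is healthy, leave it alone
	recreate               // sandbox must be torn down and rebuilt
)

// decide mirrors the proposed check: a container without a task, or with a
// task in any state other than "running", is treated as broken.
// The status strings match containerd's ProcessStatus values.
func decide(taskFound bool, status string) action {
	if !taskFound {
		return recreate
	}
	if status != "running" {
		return recreate
	}
	return keep
}

func main() {
	fmt.Println(decide(true, "running") == keep)     // healthy sandbox
	fmt.Println(decide(true, "stopped") == recreate) // task exited
	fmt.Println(decide(false, "") == recreate)       // task missing entirely
}
```

Like the proposed fix, this treats every non-running state (including "paused" and "created") as grounds for recreation. Whether that is too aggressive for sandboxes that are still starting up is a question for the real implementation.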
Additional Improvements
- Add periodic health checks for running sandboxes
- Monitor for pause container deaths specifically
- Add metrics/alerts for sandbox-task state mismatches
- Consider implementing a "liveness probe" mechanism
Workaround
Until fixed, manually delete and recreate affected sandboxes:
m sandbox delete <sandbox-id>
# Miren will recreate them automatically