MIR-375

Sandbox controller doesn't detect when containerd tasks die unexpectedly

Open Bug public
phinze · Opened Sep 2, 2025 · Updated Apr 2, 2026

Problem

The sandbox controller does not detect when a containerd task dies while its container still exists. As a result, Miren reports sandboxes as running when they are actually dead, and requests routed to them return 502 errors.

Investigation Summary

On the miren-toys server, the conference-app has three sandboxes that Miren believes are running:

  • sandbox/sb-CScZ5eBs1ozeD8M2ovozr
  • sandbox/sb-CScZ6Av4QAP5g5hz2EzLH
  • sandbox/sb-CScZ6AvJkKrArHr6xeKeR

However:

  1. The containers exist in containerd but have no running tasks
  2. Attempting to start tasks fails with: namespace path: lstat /proc/6347/ns/ipc: no such file or directory
  3. The pause containers (which provide shared namespaces) died at some point, taking the app containers with them
  4. Miren's reconciliation loop keeps checking but doesn't detect the broken state
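The lstat failure in item 2 suggests a cheap way to notice a dead pause container: check whether its namespace entry under /proc still exists. The following is a minimal sketch of that idea, not code from Miren. The helper names and the procRoot parameter are hypothetical, and PID 6347 is simply the value from the error message above.

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
	"os"
	"path"
	"strconv"
)

// namespacePath builds the path that failed to lstat in the error above,
// e.g. /proc/6347/ns/ipc. procRoot is a parameter only so the helper can be
// pointed at a fake directory tree in tests (hypothetical design choice).
func namespacePath(procRoot string, pid int, ns string) string {
	return path.Join(procRoot, strconv.Itoa(pid), "ns", ns)
}

// namespaceAlive reports whether the namespace entry for pid still exists.
// A missing entry means the process that owned the namespace has exited.
// Any other error (for example, permission denied) is returned to the
// caller because the state is unknown.
func namespaceAlive(procRoot string, pid int, ns string) (bool, error) {
	_, err := os.Lstat(namespacePath(procRoot, pid, ns))
	if err == nil {
		return true, nil
	}
	if errors.Is(err, fs.ErrNotExist) {
		return false, nil
	}
	return false, err
}

func main() {
	// 6347 is the pause container PID from the reported error.
	alive, err := namespaceAlive("/proc", 6347, "ipc")
	fmt.Println("ipc namespace alive:", alive, "err:", err)
}
```

A check like this only covers the pause-container case; it does not replace asking containerd for the task status.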

Root Cause

In controllers/sandbox/sandbox.go:314, the checkSandbox() function only verifies:

  • Container exists
  • Version label matches

It never checks whether the container's task is actually running. The controller assumes that any container that exists is also running.

Impact

  • Apps appear healthy in Miren but return 502 errors
  • Sandboxes remain in broken state indefinitely
  • Manual intervention required to fix

Proposed Fix

The checkSandbox() function needs to also verify task status:

func (c *SandboxController) checkSandbox(ctx context.Context, co *compute.Sandbox, meta *entity.Meta) (int, error) {
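    // ... existing container checks (assumed to load the container as cont) ...

    // NEW: Check if task is actually running.
    // Passing nil for the IO attach function means we do not reattach to the task's IO.
    task, err := cont.Task(ctx, nil)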
    if err != nil {
        if errdefs.IsNotFound(err) {
            c.Log.Debug("task not found for container, needs recreation")
            return differentVersion, nil  // Force recreation
        }
        return 0, err
    }
    
    status, err := task.Status(ctx)
    if err != nil {
        return 0, err
    }
    
    if status.Status != containerd.Running {
        c.Log.Debug("task not running", "status", status.Status)
        return differentVersion, nil  // Force recreation
    }
    
    // ... rest of function
}
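To make the new branch easy to unit test without a live containerd, the decision could be isolated behind a small interface. The sketch below is illustrative only: taskInspector, fakeTask, errTaskNotFound, errUnavailable, and sameVersion are hypothetical stand-ins, and only differentVersion and the "running" status string correspond to names used in the proposed fix and in containerd.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical result codes; only differentVersion appears in the issue.
const (
	sameVersion = iota + 1
	differentVersion
)

// Hypothetical sentinel errors standing in for errdefs.IsNotFound and for
// any other failure when talking to containerd.
var (
	errTaskNotFound = errors.New("task not found")
	errUnavailable  = errors.New("containerd unavailable")
)

// taskInspector is a hypothetical seam over the task lookup. It returns the
// task status string ("running", "stopped", ...) or errTaskNotFound.
type taskInspector interface {
	TaskStatus() (string, error)
}

// checkTask mirrors the branch added in the proposed fix: a missing or
// non-running task forces recreation, other errors are propagated.
func checkTask(t taskInspector) (int, error) {
	status, err := t.TaskStatus()
	if err != nil {
		if errors.Is(err, errTaskNotFound) {
			return differentVersion, nil // no task: force recreation
		}
		return 0, err
	}
	if status != "running" {
		return differentVersion, nil // task exists but is not running
	}
	return sameVersion, nil
}

// fakeTask is a test double returning a fixed status or error.
type fakeTask struct {
	status string
	err    error
}

func (f fakeTask) TaskStatus() (string, error) { return f.status, f.err }

func main() {
	cases := []fakeTask{
		{status: "running"},
		{status: "stopped"},
		{err: errTaskNotFound},
		{err: errUnavailable},
	}
	for _, c := range cases {
		result, err := checkTask(c)
		fmt.Printf("status=%q lookupErr=%v -> result=%d err=%v\n", c.status, c.err, result, err)
	}
}
```

Keeping the decision in a pure function like this lets each of the four cases be covered by a table-driven test.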

Additional Improvements

  1. Add periodic health checks for running sandboxes
  2. Monitor for pause container deaths specifically
  3. Add metrics/alerts for sandbox-task state mismatches
  4. Consider implementing a "liveness probe" mechanism
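Items 1 and 3 could share one mechanism: a periodic sweep that probes every sandbox believed to be running and counts mismatches. The following is a rough sketch under stated assumptions, not a proposal for the actual design. The monitor type, the probe signature, and the sandbox IDs are all hypothetical, and the counter stands in for a real metrics library.

```go
package main

import (
	"context"
	"fmt"
	"sync/atomic"
	"time"
)

// monitor periodically probes sandboxes that are believed to be running.
// probe is a hypothetical liveness check that returns true when the
// sandbox's task is actually running.
type monitor struct {
	probe      func(sandboxID string) bool
	mismatches atomic.Int64 // sandboxes recorded as running but found dead
}

// sweep probes each sandbox once and returns the IDs that failed the probe,
// incrementing the mismatch counter for each of them.
func (m *monitor) sweep(ids []string) []string {
	var dead []string
	for _, id := range ids {
		if !m.probe(id) {
			m.mismatches.Add(1)
			dead = append(dead, id)
		}
	}
	return dead
}

// run calls sweep on every tick until ctx is cancelled, passing each dead
// sandbox ID to onDead (for example, to enqueue it for recreation).
func (m *monitor) run(ctx context.Context, interval time.Duration, list func() []string, onDead func(string)) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			for _, id := range m.sweep(list()) {
				onDead(id)
			}
		}
	}
}

func main() {
	m := &monitor{probe: func(id string) bool { return id != "sb-dead" }}

	// Run briefly for demonstration; a real controller would use a longer
	// interval and its own lifecycle context.
	ctx, cancel := context.WithTimeout(context.Background(), 50*time.Millisecond)
	defer cancel()

	m.run(ctx, 10*time.Millisecond,
		func() []string { return []string{"sb-ok", "sb-dead"} },
		func(id string) { fmt.Println("needs recreation:", id) })

	fmt.Println("mismatches:", m.mismatches.Load())
}
```

The mismatch counter is the value an alert would watch: a nonzero rate means the controller's view and containerd's view have diverged.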

Workaround

Until this is fixed, manually delete each affected sandbox so that Miren recreates it:

m sandbox delete <sandbox-id>
# Miren will recreate them automatically