MIR-375

Sandbox controller doesn't detect when containerd tasks die unexpectedly

Open Bug public

phinze Opened Sep 2, 2025 Updated Apr 2, 2026

Problem

The sandbox controller does not detect when a containerd task dies while its container still exists. As a result, Miren keeps treating dead sandboxes as running, and requests routed to them return 502 errors.

Investigation Summary

On the miren-toys server, conference-app has three sandboxes that Miren believes are running:

sandbox/sb-CScZ5eBs1ozeD8M2ovozr
sandbox/sb-CScZ6Av4QAP5g5hz2EzLH
sandbox/sb-CScZ6AvJkKrArHr6xeKeR

However:

The containers exist in containerd but have no running tasks
Attempting to start tasks fails with: namespace path: lstat /proc/6347/ns/ipc: no such file or directory
The pause containers (which provide shared namespaces) died at some point, taking the app containers with them
Miren's reconciliation loop keeps checking but doesn't detect the broken state

Root Cause

In controllers/sandbox/sandbox.go:314, the checkSandbox() function only verifies:

Container exists
Version label matches

It never checks whether the container's task is actually running. The controller assumes that a container that exists is also running.

Impact

Apps appear healthy in Miren but return 502 errors
Sandboxes remain in broken state indefinitely
Manual intervention required to fix

Proposed Fix

The checkSandbox() function should also verify that the container's task exists and is running:

func (c *SandboxController) checkSandbox(ctx context.Context, co *compute.Sandbox, meta *entity.Meta) (int, error) {
    // ... existing container checks ...
    // cont is assumed to be the containerd container loaded by the checks above.

    // NEW: Look up the container's task. A container record can outlive its task.
    task, err := cont.Task(ctx, nil)
    if err != nil {
        if errdefs.IsNotFound(err) {
            c.Log.Debug("task not found for container, needs recreation")
            return differentVersion, nil // Force recreation
        }
        return 0, err
    }

    // NEW: A task that exists but is not running is also treated as broken.
    status, err := task.Status(ctx)
    if err != nil {
        return 0, err
    }

    if status.Status != containerd.Running {
        c.Log.Debug("task not running", "status", status.Status)
        return differentVersion, nil // Force recreation
    }

    // ... rest of function
}
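The decision made by the proposed check can be isolated into a pure function, which makes it easy to cover with table-driven tests without a containerd daemon. This is a sketch under assumptions: taskObservation, decide, sandboxOK, and needsRecreation are hypothetical names, the result codes stand in for the controller's real return values such as differentVersion, and the status strings are assumed to match containerd's process status values.

```go
package main

import "fmt"

// Hypothetical stand-ins for the controller's result codes.
const (
	sandboxOK = iota
	needsRecreation
)

// taskObservation is a hypothetical summary of what containerd reported
// for one container.
type taskObservation struct {
	TaskFound bool   // false when the task lookup returned NotFound
	Status    string // assumed values: "running", "stopped", "created", ...
}

// decide returns needsRecreation unless a task exists and is running,
// mirroring the two new branches in the proposed fix.
func decide(obs taskObservation) int {
	if !obs.TaskFound {
		return needsRecreation
	}
	if obs.Status != "running" {
		return needsRecreation
	}
	return sandboxOK
}

func main() {
	cases := []taskObservation{
		{TaskFound: false},
		{TaskFound: true, Status: "stopped"},
		{TaskFound: true, Status: "running"},
	}
	for _, c := range cases {
		fmt.Printf("%+v -> %d\n", c, decide(c))
	}
}
```

Like the proposed fix, this treats every non-running status, including a paused task, as grounds for recreation. Whether that is desirable for paused tasks is a separate design question.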

Additional Improvements

Add periodic health checks for running sandboxes
Monitor for pause container deaths specifically
Add metrics/alerts for sandbox-task state mismatches
Consider implementing a "liveness probe" mechanism

Workaround

Until this is fixed, manually delete each affected sandbox:

m sandbox delete <sandbox-id>
# Miren will recreate them automatically