Sandbox health should include network connectivity before marking RUNNING
Problem
On 2025-11-14, rfd.miren.garden experienced 502 errors for over an hour due to a "zombie" sandbox:
- The sandbox had `status: RUNNING` in the entity store
- The containerd task was running
- But the gVisor network had failed: `dial tcp 10.8.48.130:3000: connect: no route to host`
- HTTP ingress kept routing to this sandbox, so all requests failed with 502
The system had no mechanism to detect this failure mode until manual intervention.
Root Cause
The sandbox controller marks a sandbox as RUNNING based solely on containerd task status. It doesn't verify that the application is actually reachable on its network address. This creates a gap where sandboxes can be marked RUNNING but are actually unreachable.
Rejected Approach (PR #371)
We attempted to solve this by:
- Adding a separate periodic health checker goroutine in the sandbox controller
- Having httpingress mark sandboxes as DEAD after consecutive proxy errors
This approach has problems:
- Layering violation: httpingress shouldn't reach into sandbox lifecycle management
- Wrong pattern: Adding another background goroutine instead of using the reconciliation loop
- Reactive: Only detects failures after they cause user-facing errors
Proposed Solution
Integrate network connectivity checks into the sandbox reconciliation loop:
- Before marking RUNNING: Verify that the sandbox is actually reachable on its network address
- Grace period: Don't immediately fail if the port isn't listening yet; apps take time to start up
- During reconciliation: Continue validating network connectivity for RUNNING sandboxes
- Transition to DEAD: If a RUNNING sandbox becomes unreachable, mark it DEAD so it gets replaced
This fits the Kubernetes-style reconciliation pattern: the controller ensures desired state matches actual state, and "RUNNING" should mean "actually serving traffic", not just "containerd says it's running".
Related
- PR #371 (rejected approach)
- Incident: rfd.miren.garden outage 2025-11-14