MIR-528

Sandbox health should include network connectivity before marking RUNNING

Open public
phinze · Opened Nov 18, 2025 · Updated Apr 2, 2026

Problem

On 2025-11-14, rfd.miren.garden experienced 502 errors for over an hour due to a "zombie" sandbox:

  • Sandbox had status: RUNNING in entity store
  • Containerd task was running
  • But the gVisor network had failed: dial tcp 10.8.48.130:3000: connect: no route to host
  • HTTP ingress kept routing to this sandbox, causing all requests to fail with 502

Nothing in the system detected this failure mode; the outage continued until someone intervened manually.

Root Cause

The sandbox controller marks a sandbox as RUNNING based solely on containerd task status. It doesn't verify that the application is actually reachable on its network address. This creates a gap where sandboxes can be marked RUNNING but are actually unreachable.

Rejected Approach (PR #371)

We attempted to solve this by:

  1. Adding a separate periodic health checker goroutine in the sandbox controller
  2. Having httpingress mark sandboxes as DEAD after consecutive proxy errors

This approach has problems:

  • Layering violation: httpingress shouldn't reach into sandbox lifecycle management
  • Wrong pattern: Adding another background goroutine instead of using the reconciliation loop
  • Reactive: Only detects failures after they cause user-facing errors

Proposed Solution

Integrate network connectivity checks into the sandbox reconciliation loop:

  1. Before marking RUNNING: Verify that the sandbox is actually reachable on its network address
  2. Grace period: Don't fail immediately if the port isn't listening yet, since apps take time to start up
  3. During reconciliation: Continue validating network connectivity for RUNNING sandboxes
  4. Transition to DEAD: If a RUNNING sandbox becomes unreachable, mark it DEAD so it gets replaced

This fits the Kubernetes-style reconciliation pattern: the controller ensures desired state matches actual state, and "RUNNING" should mean "actually serving traffic", not just "containerd says it's running".

Related

  • PR #371 (rejected approach)
  • Incident: rfd.miren.garden outage 2025-11-14