Sandbox health should include network connectivity before marking RUNNING
Problem
On 2025-11-14, rfd.miren.garden experienced 502 errors for over an hour due to a "zombie" sandbox:
- The sandbox had `status: RUNNING` in the entity store
- The containerd task was running
- But the gVisor network had failed: `dial tcp 10.8.48.130:3000: connect: no route to host`
- HTTP ingress kept routing to this sandbox, so all requests failed with 502
The system had no mechanism to detect this failure mode until manual intervention.
Root Cause
The sandbox controller marks a sandbox as RUNNING based solely on containerd task status. It doesn't verify that the application is actually reachable on its network address. This creates a gap where sandboxes can be marked RUNNING but are actually unreachable.
Rejected Approach (PR #371)
We attempted to solve this by:
- Adding a separate periodic health checker goroutine in the sandbox controller
- Having httpingress mark sandboxes as DEAD after consecutive proxy errors
This approach has problems:
- Layering violation: httpingress shouldn't reach into sandbox lifecycle management
- Wrong pattern: Adding another background goroutine instead of using the reconciliation loop
- Reactive: Only detects failures after they cause user-facing errors
Proposed Solution
Integrate network connectivity checks into the sandbox reconciliation loop:
- Before marking RUNNING: Verify that the sandbox is actually reachable on its network address
- Grace period: Don't immediately fail if the port isn't listening yet; apps take time to start up
- During reconciliation: Continue validating network connectivity for RUNNING sandboxes
- Transition to DEAD: If a RUNNING sandbox becomes unreachable, mark it DEAD so it gets replaced
This fits the Kubernetes-style reconciliation pattern: the controller ensures desired state matches actual state, and "RUNNING" should mean "actually serving traffic", not just "containerd says it's running".
Related
- PR #371 (rejected approach)
- Incident: rfd.miren.garden outage 2025-11-14