MIR-857

Sandbox pool: add backoff for repeated sandbox failures

Open public
phinze Opened Mar 23, 2026 Updated Apr 2, 2026

Problem

When a sandbox crashes immediately after creation, the pool controller creates a replacement with no delay. If the underlying cause persists (e.g. corrupted data volume, missing config), this produces a rapid accumulation of dead sandboxes and constant churn.

On James's server, postgres couldn't start due to corrupted WAL data. The pool controller created 15+ dead postgres sandboxes in rapid succession, each dying immediately, with new ones spawning as fast as old ones were cleaned up.

The activator's fail-fast check already tracks dead sandbox counts per pool (visible in logs as has_pending_or_running: false with growing dead counts), but this information isn't used to slow down creation.

Expected Behavior

When sandboxes in a pool are repeatedly failing:

  1. Apply exponential backoff to sandbox creation after consecutive failures
  2. Cap the number of dead sandboxes that can accumulate before pausing creation
  3. Transition the pool to a "failing" state that is visible to users (see MIR-{doctor issue})
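The first two items above could be tracked with a small per-pool state object. This is a minimal sketch, not the actual pool controller API: the class name, field names, and default thresholds (1s base delay, 300s max, cap of 5 dead sandboxes) are all hypothetical.

```python
class PoolBackoff:
    """Tracks consecutive sandbox failures per pool and computes an
    exponential creation delay, with a cap on dead-sandbox accumulation.
    All identifiers and defaults here are illustrative assumptions."""

    def __init__(self, base_delay=1.0, max_delay=300.0, dead_cap=5):
        self.base_delay = base_delay
        self.max_delay = max_delay
        self.dead_cap = dead_cap
        self.failures = {}  # pool name -> consecutive failure count

    def record_failure(self, pool):
        self.failures[pool] = self.failures.get(pool, 0) + 1

    def record_success(self, pool):
        # A sandbox survived, so the backoff resets.
        self.failures.pop(pool, None)

    def delay_for(self, pool):
        """Delay before the next creation attempt: 0 when healthy,
        then base * 2^(n-1) after n consecutive failures, capped."""
        n = self.failures.get(pool, 0)
        if n == 0:
            return 0.0
        return min(self.base_delay * (2 ** (n - 1)), self.max_delay)

    def should_pause(self, pool, dead_count):
        """Stop creating replacements once dead sandboxes hit the cap."""
        return dead_count >= self.dead_cap
```

With these defaults the delay sequence after repeated failures would be 1s, 2s, 4s, 8s, ... capped at 300s, and creation pauses entirely once five dead sandboxes accumulate.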

Observed Log Pattern

coordinator.activator fail-fast check │ app: app/gleester ... 
  sandboxes: "[...5 dead sandboxes...]" 
  has_pending_or_running: false 
  increment_pool: true 
  sandbox_count_before: 5

The controller observes 5 dead sandboxes and no pending or running ones, yet still decides to increment the pool.
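Since the fail-fast check already has both signals (dead count and has_pending_or_running), the fix could be a guard at the increment decision. A hypothetical sketch, with an assumed cap of 5:

```python
def should_increment_pool(dead_count, has_pending_or_running, dead_cap=5):
    """Hypothetical guard for the fail-fast check: refuse to create a
    replacement when the pool consists only of dead sandboxes at or
    above the cap. In that case the pool would instead transition to
    a 'failing' state rather than churning."""
    if not has_pending_or_running and dead_count >= dead_cap:
        return False
    return True
```

Under this guard, the logged state above (sandbox_count_before: 5, has_pending_or_running: false) would yield increment_pool: false instead of true.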