MIR-486

Race condition in PR 284 causes duplicate pending sandboxes on boot

Open public
phinze Opened Oct 31, 2025 Updated Apr 2, 2026

Problem

After deploying PR #284 (nested component field indexing), garden has two pending sandboxes instead of one. This is caused by a race condition between pool reconciliation and index migration during service boot.

Root Cause

Timeline on Oct 31 19:20:33 boot:

  1. 19:20:33 - Service boots, containerd reconnects to 3 existing running sandboxes (meet, uptime-kuma, rfd)
    • These sandboxes use the old schema, which has no nested Spec.Version indexes
  2. 19:20:51 - Activator and Pool Manager start simultaneously
    • Activator: Successfully recovers the 3 existing sandboxes
    • Pool Manager: Queries sandboxes by SandboxSpecVersionId to check pool status
    • PROBLEM: Old sandboxes don't have nested component field indexes yet!
    • Pool sees: actual: 0 ready: 0 desired: 1 (can't find sandboxes via new index)
    • Pool creates: NEW duplicate sandboxes sb-CUfpRfsL3eE5FCA1MPX2o and sb-CUfpRhDSn7cT3pxCwwoEG
  3. 19:20:52-53 - Migration finally runs (2 seconds too late!)
    • reconcileSandboxesOnBoot() patches all 5 sandboxes
    • Adds index-migration-v1 label to trigger nested index rebuild
    • Result: migrated_count: 5 skipped_count: 0
  4. 19:21:51 - Pool reconciliation runs again (1 minute later)
    • NOW it can query migrated sandboxes via nested indexes
    • Sees actual: 2 ready: 1 desired: 1 (both old AND new!)
    • Tries to scale down the duplicate
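
The timeline above can be reproduced with a small deterministic model. This is only a sketch: Store, Sandbox, the Indexed flag, and boot are hypothetical names made up for illustration and do not come from the codebase. It also assumes that newly created sandboxes are written with index entries already in place. The point it shows is that an index-only lookup reports zero sandboxes before migration, so reconciliation creates a duplicate that becomes visible next to the original once migration has run.

```go
package main

import "fmt"

// Sandbox is a simplified, hypothetical stand-in for the real entity.
type Sandbox struct {
	ID        string
	VersionID string
	Indexed   bool // true once the nested index entry exists
}

// Store is a hypothetical in-memory model of the entity store.
type Store struct {
	sandboxes []*Sandbox
}

// CountByVersionIndex mimics a lookup through the nested index:
// it only sees sandboxes whose index entry has been built.
func (s *Store) CountByVersionIndex(versionID string) int {
	n := 0
	for _, sb := range s.sandboxes {
		if sb.Indexed && sb.VersionID == versionID {
			n++
		}
	}
	return n
}

// Migrate builds the missing index entry for every sandbox.
func (s *Store) Migrate() {
	for _, sb := range s.sandboxes {
		sb.Indexed = true
	}
}

// ReconcilePool creates sandboxes until the index-visible count
// reaches desired, and returns how many it created.
func (s *Store) ReconcilePool(versionID string, desired int) int {
	actual := s.CountByVersionIndex(versionID)
	created := 0
	for actual+created < desired {
		id := fmt.Sprintf("sb-new-%d", len(s.sandboxes))
		// Assumption: new sandboxes are written with the index in place.
		s.sandboxes = append(s.sandboxes, &Sandbox{ID: id, VersionID: versionID, Indexed: true})
		created++
	}
	return created
}

// boot runs one simulated boot with a single pre-existing, unindexed
// sandbox. It returns how many sandboxes the pool created and how many
// are visible through the index after migration has finished.
func boot(migrateFirst bool) (created, visible int) {
	s := &Store{sandboxes: []*Sandbox{{ID: "sb-old", VersionID: "v1"}}}
	if migrateFirst {
		s.Migrate()
	}
	created = s.ReconcilePool("v1", 1)
	s.Migrate() // in the buggy order, migration only happens here
	return created, s.CountByVersionIndex("v1")
}

func main() {
	c, v := boot(false)
	fmt.Println("pool first:      created", c, "visible", v)
	c, v = boot(true)
	fmt.Println("migration first: created", c, "visible", v)
}
```

In this model, reconciling before migration creates one extra sandbox and leaves two visible for a desired count of one, which matches the actual: 2 desired: 1 state in step 4. Migrating first creates none.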

Technical Details

  • Pool queries at controllers/sandboxpool/manager.go:227 use SandboxSpecVersionId
  • Migration runs in the Sandbox controller's Init() at sandbox.go:545
  • The pool controller starts in its own Init()
  • Nothing synchronizes the two, so migration is not guaranteed to complete before pool operations begin

Solutions

Immediate:

  • Delete duplicate pending sandboxes (sb-CUfpRfsL3eE5FCA1MPX2o and sb-CUfpRhDSn7cT3pxCwwoEG)

Long-term (pick one):

  1. Run migration synchronously before pool controllers start
  2. Block pool queries until migration completes
  3. Remove the migration code entirely once all environments have been migrated (as the code comment suggests)

Related

  • PR #284: Implement nested component field indexing
  • Commit dcd547d: Fix migration code to include db/id in Patch operation