Race condition in PR 284 causes duplicate pending sandboxes on boot
Problem
After deploying PR #284 (nested component field indexing), garden has two pending sandboxes instead of one. This is caused by a race condition between pool reconciliation and index migration during service boot.
Root Cause
Timeline on Oct 31 19:20:33 boot:
- 19:20:33 - Service boots, containerd reconnects to 3 existing running sandboxes (meet, uptime-kuma, rfd)
- These sandboxes have the old schema, without nested Spec.Version indexes
- 19:20:51 - Activator and Pool Manager start simultaneously
- Activator: Successfully recovers the 3 existing sandboxes
- Pool Manager: Queries sandboxes by SandboxSpecVersionId to check pool status
- PROBLEM: Old sandboxes don't have nested component field indexes yet!
- Pool sees: actual: 0 ready: 0 desired: 1 (can't find sandboxes via new index)
- Pool creates: NEW duplicate sandboxes sb-CUfpRfsL3eE5FCA1MPX2o and sb-CUfpRhDSn7cT3pxCwwoEG
- 19:20:52-53 - Migration finally runs (2 seconds too late!)
- reconcileSandboxesOnBoot() patches all 5 sandboxes
- Adds index-migration-v1 label to trigger nested index rebuild
- Result: migrated_count: 5 skipped_count: 0
- 19:21:51 - Pool reconciliation runs again (1 minute later)
- NOW it can query migrated sandboxes via nested indexes
- Sees actual: 2 ready: 1 desired: 1 (both old AND new!)
- Tries to scale down the duplicate
Technical Details
- Pool queries at controllers/sandboxpool/manager.go:227 use SandboxSpecVersionId
- Migration runs in Sandbox controller's Init() at sandbox.go:545
- Pool controller starts at its own Init() time
Init()time - No synchronization ensures migration completes before pool operations
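To make the failure mode concrete, here is a toy model of a store with a secondary index. It is only a sketch: the sandbox and store types, addUnindexed, reindex, and countByVersion are hypothetical names invented for illustration and are not the real entity store API. It shows why a lookup by version ID returns zero for sandboxes written before the index existed, and returns the right count once they are re-written.

```go
package main

import "fmt"

// sandbox is a hypothetical, simplified record; the real schema is not shown in this issue.
type sandbox struct {
	ID            string
	SpecVersionID string
}

// store is a toy model of an entity store with one secondary index.
type store struct {
	all       []sandbox           // every stored sandbox
	byVersion map[string][]string // version ID -> sandbox IDs, indexed sandboxes only
}

func newStore() *store {
	return &store{byVersion: map[string][]string{}}
}

// addUnindexed models a sandbox written under the old schema:
// it is stored, but never entered into the version index.
func (s *store) addUnindexed(sb sandbox) {
	s.all = append(s.all, sb)
}

// reindex models the boot migration: every sandbox is re-written,
// which populates the version index.
func (s *store) reindex() {
	s.byVersion = map[string][]string{}
	for _, sb := range s.all {
		s.byVersion[sb.SpecVersionID] = append(s.byVersion[sb.SpecVersionID], sb.ID)
	}
}

// countByVersion models the pool's "actual" count, answered purely from the index.
func (s *store) countByVersion(versionID string) int {
	return len(s.byVersion[versionID])
}

func main() {
	s := newStore()
	s.addUnindexed(sandbox{ID: "sb-old", SpecVersionID: "v1"})

	// Pool reconciles before migration: the sandbox exists but is invisible.
	fmt.Println("before migration:", s.countByVersion("v1"))

	// Migration runs, then the pool reconciles again.
	s.reindex()
	fmt.Println("after migration:", s.countByVersion("v1"))
}
```

In this model the first query reports 0 even though a sandbox exists, which is the condition under which the pool would create a duplicate.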
Solutions
Immediate:
- Delete duplicate pending sandboxes (sb-CUfpRfsL3eE5FCA1MPX2o and sb-CUfpRhDSn7cT3pxCwwoEG)
Long-term (pick one):
- Run migration synchronously before pool controllers start
- Block pool queries until migration completes
- Remove migration code entirely after all environments have migrated (as the code comment suggests)
Related
- PR #284: Implement nested component field indexing
- Commit dcd547d: Fix migration code to include db/id in Patch operation