MIR-912

Overlay IP allocator assigns duplicate IPs to concurrent sandboxes

Done public

Opened by phinze on Mar 27, 2026; updated Mar 30, 2026

Fix overlay IP allocator assigning duplicate IPs after restart

Bug

The overlay network IP allocator assigned the same IP address (10.8.95.4) to two different running sandboxes, so traffic meant for one app was routed to the other. Requests to meet.miren.garden sporadically returned the websocket-echo test page instead of the meet app.

This is a data-integrity and security issue: traffic intended for one application was delivered to a completely different application.

Timeline

All times UTC, 2026-03-27. Coordinator: miren-garden (running main:1226e9a, later a dev build).

16:05:05 — websocket-echo sandbox CZTsQRFt created, assigned IP 10.8.95.4:3000

dns added sandbox to DNS mapping │ sandbox: websocket-echo-web-CZTsQRFt app: websocket-echo ip: 10.8.95.4

16:37–16:40 — websocket-echo sandbox survived two miren restarts (KillMode=process preserves shims), recovered each time on same IP, confirmed running:

coordinator.activator recovered sandbox │ app: websocket-echo sandbox: CZTsQRFt url: http://10.8.95.4:3000

16:55:00 — New meet deploy, meet sandbox CZTwDFSB created and scheduled to node/miren:

coordinator.sandboxpool created sandbox │ sandbox: meet-web-CZTwDFSB
coordinator.scheduler assigning sandbox to node │ sandbox: meet-web-CZTwDFSB node: node/miren

16:55:01 — Meet sandbox assigned the same IP 10.8.95.4, container starts:

dns added sandbox to DNS mapping │ sandbox: meet-web-CZTwDFSB app: meet ip: 10.8.95.4
runner.sandbox container started │ id: sandbox.meet-web-CZTwDFSB-app

16:55:01 onward — Both sandboxes running on 10.8.95.4:3000. HTTP ingress routes meet.miren.garden to app/meet correctly, but the IP resolves to either sandbox's container depending on timing. Operator observed websocket-echo responses on meet.miren.garden.

16:56:46 — Connection resets as the two containers fight over the same IP:

coordinator.httpingress proxy error │ error: "write tcp 10.8.95.1:37232->10.8.95.4:3000: connection reset by peer" app: meet

Context

This occurred after multiple server restarts during MIR-890 dogfooding. The crash/restart churn may have caused the IP allocator to lose track of which IPs were in use. The websocket-echo sandbox survived restarts via containerd shim preservation (KillMode=process), but the IP allocator may not have accounted for these surviving sandboxes when assigning IPs to new sandboxes.

Impact

Cross-application traffic routing (security issue)
Application instability (connection resets)
Operator confusion (wrong app responding)

Expected behavior

The IP allocator must never assign an IP that is currently in use by a running sandbox. Possible approaches:

Track all allocated IPs in etcd and refuse duplicates
Check containerd for running containers before allocating
Use lease-based allocation tied to the sandbox lifecycle