MIR-1238

Duplicate sandbox IP (10.8.64.2) assigned to two sandboxes on garden runner-1 → chitchat connection flapping

In Progress public
evan · Opened Jun 12, 2026 · Updated Jun 12, 2026

Summary

On the garden cluster, runner-1 (miren-garden-runner-1, us-central1-a) assigned the same sandbox IP 10.8.64.2 to two different sandboxes on the rt0 bridge. The duplicate IP causes ARP conflicts and intermittent connection resets for the affected sandboxes' outbound traffic.

The sandbox IPAM must never hand out the same rt0 address to two live sandboxes.

Impact

Two apps whose sandboxes landed on runner-1 — sportsagent and reviewagent — both egress from 10.8.64.2 and have unstable outbound connectivity. Their long-lived chitchat WebSocket connections to the ingress flap constantly:

  • ~29 chitchat connection lost events per ~15-min window for each app.
  • Errors: read tcp 10.8.64.2:<port>->34.122.229.118:443: read: connection reset by peer and ... i/o timeout, reconnecting ~1s later.
  • User-visible symptom: RPCs to these services intermittently return "empty reply (no responder)" because a request that lands during a reset/reconnect gap gets no responder.

Apps on other runners (calagent, noteagent) have been stably connected for 7+ hours with zero drops, so this is runner-1-specific, not cluster-wide.

Evidence

Both failing sandboxes share the same source IP:

sportsagent:  read tcp 10.8.64.2:60458->34.122.229.118:443: i/o timeout
reviewagent:  read tcp 10.8.64.2:57422->34.122.229.118:443: read: connection reset by peer
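
The shared-source check can be automated across apps. The following is a minimal sketch, assuming log lines are available as (app, line) pairs in the error format quoted above; the function name `shared_source_ips` and the input shape are illustrative and not part of any existing tooling.

```python
import re
from collections import defaultdict

# Local endpoint in errors of the form "read tcp <ip>:<port>-><ip>:<port>: ..."
SRC_RE = re.compile(r"\b(?:read|write) tcp (\d{1,3}(?:\.\d{1,3}){3}):\d+->")


def shared_source_ips(entries):
    """Return {source_ip: sorted app names} for IPs seen from more than one app.

    `entries` is an iterable of (app_name, log_line) pairs (hypothetical shape).
    """
    apps_by_ip = defaultdict(set)
    for app, line in entries:
        match = SRC_RE.search(line)
        if match:
            apps_by_ip[match.group(1)].add(app)
    # Keep only addresses that more than one app egresses from.
    return {ip: sorted(apps) for ip, apps in apps_by_ip.items() if len(apps) > 1}


logs = [
    ("sportsagent", "read tcp 10.8.64.2:60458->34.122.229.118:443: i/o timeout"),
    ("reviewagent", "read tcp 10.8.64.2:57422->34.122.229.118:443: read: connection reset by peer"),
]
print(shared_source_ips(logs))  # {'10.8.64.2': ['reviewagent', 'sportsagent']}
```

Any non-empty result means two apps share a sandbox address and is worth alerting on.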

The kernel logs martian-source warnings for ARP traffic on runner-1 (dmesg):

IPv4: martian source 10.8.64.1 from 10.8.64.2, on dev eth0
ll header: 00000000: ff ff ff ff ff ff ee 45 b1 fe 2d db 08 06   (ethertype 0x0806 = ARP, broadcast)
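
For reference, the two dmesg lines can be decoded mechanically. This is a small sketch with illustrative helper names (`parse_martian`, `decode_ll_header`), assuming the usual 14-byte Ethernet header layout (destination MAC, source MAC, EtherType) and the message format quoted above, in which the first address is the packet's destination and the second its source.

```python
import re

# "martian source <dst> from <src>, on dev <iface>", as logged by the kernel.
MARTIAN_RE = re.compile(
    r"martian source (?P<dst>[\d.]+) from (?P<src>[\d.]+), on dev (?P<dev>\S+)"
)


def parse_martian(line):
    """Return {'dst', 'src', 'dev'} for a martian-source line, or None."""
    match = MARTIAN_RE.search(line)
    return match.groupdict() if match else None


def decode_ll_header(hex_bytes):
    """Decode a 14-byte Ethernet header given as space-separated hex."""
    raw = bytes.fromhex(hex_bytes)
    if len(raw) < 14:
        raise ValueError("need at least 14 bytes for an Ethernet header")

    def mac(chunk):
        return ":".join(f"{byte:02x}" for byte in chunk)

    return {
        "dst_mac": mac(raw[0:6]),
        "src_mac": mac(raw[6:12]),
        "ethertype": int.from_bytes(raw[12:14], "big"),
    }


event = parse_martian("IPv4: martian source 10.8.64.1 from 10.8.64.2, on dev eth0")
header = decode_ll_header("ff ff ff ff ff ff ee 45 b1 fe 2d db 08 06")
print(event["src"], event["dev"])                 # 10.8.64.2 eth0
print(header["dst_mac"], header["src_mac"])       # ff:ff:ff:ff:ff:ff ee:45:b1:fe:2d:db
print(hex(header["ethertype"]))                   # 0x806 (ARP)
```

The source MAC (ee:45:b1:fe:2d:db here) is the useful field for tracing which sandbox interface sent the offending ARP frame.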

The node itself is healthy — host-network outbound to the ingress is flawless, so this is purely the sandbox egress path:

# from runner-1 host netns, 10/10 success:
curl https://chitchat.miren.garden/v1/whoami  ->  401 dns=0.002s conn=0.003s total=0.018s  (x10)

Not a resource-exhaustion issue:

  • conntrack: 179 / 262144 (not full); conntrack -S insert_failed / drop / error all 0.
  • load avg ~0.0, up 23 days.

Network config on runner-1:

  • rt0 bridge = 10.8.64.1/24 (sandbox gateway); SNAT/masquerade rules assign per-/32 sources 10.8.64.2 through 10.8.64.10, i.e. IPAM is supposed to give each sandbox a distinct address.
  • net.ipv4.conf.all.rp_filter = 1 (strict) on all interfaces, including rt0 — so any asymmetry/duplication shows up as martian drops.
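
On why the `all` setting matters: for source validation the kernel uses the maximum of `conf/all/rp_filter` and `conf/<interface>/rp_filter`, so `all = 1` makes every interface at least strict. A minimal sketch of that rule follows; the helper names are illustrative, and the `/proc` path is the standard Linux sysctl location.

```python
from pathlib import Path

RP_FILTER_MODES = {0: "off", 1: "strict", 2: "loose"}


def effective_rp_filter(all_value, iface_value):
    """Value the kernel applies on an interface: max(conf/all, conf/<iface>)."""
    return max(all_value, iface_value)


def read_rp_filter(name, root="/proc/sys/net/ipv4/conf"):
    """Read rp_filter for `name` ("all" or an interface); None if unavailable."""
    try:
        return int((Path(root) / name / "rp_filter").read_text().strip())
    except (OSError, ValueError):
        return None


# With all=1, an interface left at 0 is still validated in strict mode.
print(RP_FILTER_MODES[effective_rp_filter(1, 0)])  # strict
```

On a runner, `read_rp_filter("all")` and `read_rp_filter("rt0")` can be passed to `effective_rp_filter` to confirm the mode actually in force on the bridge.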

Root cause

Sandbox IP allocation on rt0 handed 10.8.64.2 to two concurrent sandboxes on runner-1. With strict rp_filter and a shared L2 bridge, two hosts answering ARP for the same IP corrupt neighbor resolution, so the kernel drops packets as martian and established flows are reset. This produces the recurring connection flapping (~29 drops per ~15-min window, roughly one every 30 s) as ARP entries refresh.

Repro / how observed

  1. Deploy apps such that ≥2 sandboxes land on the same runner (garden runner-1).
  2. Observe in both sandboxes' logs that their traffic egresses from 10.8.64.2.
  3. dmesg on the runner shows martian source 10.8.64.1 from 10.8.64.2.
  4. Their outbound connections (e.g. chitchat WS) flap with connection reset by peer / i/o timeout, while the host's own outbound traffic is fine.

Suggested next steps

  • Audit the rt0 sandbox IPAM / lease allocation for the duplicate-assignment bug (race or leaked lease that lets .2 be reused while still in use). Ensure allocation is atomic and conflict-checked.
  • Consider detecting duplicate rt0 addresses at sandbox start (ARP probe / lease check) and refusing/reallocating.
  • Immediate mitigation on garden: recycle one of the conflicting sandboxes so it gets a fresh, unique IP (or drain/reschedule runner-1's sandboxes).
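
To illustrate what "atomic and conflict-checked" could look like, here is a minimal in-memory sketch. It is not the actual IPAM code: the class `SandboxIPAllocator`, its methods, and the `in_use` parameter (standing in for an external check such as an ARP probe or a scan of rt0) are all hypothetical, and a real allocator would also need durable lease state. The pool bounds are assumed from the SNAT sources listed under Network config.

```python
import ipaddress
import threading


class SandboxIPAllocator:
    """Hypothetical allocator: one lock guards both the check and the write."""

    def __init__(self, first, last):
        start = int(ipaddress.IPv4Address(first))
        end = int(ipaddress.IPv4Address(last))
        self._pool = [ipaddress.IPv4Address(n) for n in range(start, end + 1)]
        self._leases = {}  # address -> sandbox id
        self._lock = threading.Lock()

    def allocate(self, sandbox_id, in_use=frozenset()):
        """Lease the lowest address that is neither leased nor listed in `in_use`."""
        with self._lock:
            for ip in self._pool:
                if ip in self._leases or str(ip) in in_use:
                    continue
                self._leases[ip] = sandbox_id
                return str(ip)
        raise RuntimeError("sandbox IP pool exhausted")

    def release(self, sandbox_id, ip):
        """Free the lease only if `sandbox_id` still owns it (ignores stale releases)."""
        with self._lock:
            addr = ipaddress.IPv4Address(ip)
            if self._leases.get(addr) != sandbox_id:
                return False
            del self._leases[addr]
            return True


ipam = SandboxIPAllocator("10.8.64.2", "10.8.64.10")
print(ipam.allocate("sportsagent"))  # 10.8.64.2
print(ipam.allocate("reviewagent"))  # 10.8.64.3
```

The ownership check in `release` is the part aimed at the leaked-lease hypothesis: a late release from an old sandbox cannot free an address that has since been handed to a new one.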

Environment

  • Cluster: garden · Runner: miren-garden-runner-1 (us-central1-a) · kernel 6.17.0-1016-gcp
  • Observed: 2026-06-11 (UTC evening). Affected apps: sportsagent, reviewagent.