MIR-1279

miren-garden wedged ~14h on unresponsive etcd; root cause is disk pressure + unbounded buildkit cache

Done public

phinze Opened Jun 30, 2026 Updated Jul 1, 2026

miren-development: bump garden data disk to 250GB

On June 30, miren-garden was effectively down for about 14 hours before anyone noticed. Every HTTP route returned errors because the coordinator couldn't look up routes; the underlying cause was etcd becoming unresponsive. Running systemctl restart miren fully recovered it at 15:49 UTC. This issue records what we found so the cleanup and the real fixes don't get lost.

Timeline (UTC, Jun 30)

00:17 — first etcd context deadline exceeded (GC tasks start failing)
~01:00 — etcd MemberList auto-sync starts failing every cycle (~110/hr)
02:00 → 15:49 — sustained outage, 5,600–7,600 route-lookup errors per hour
15:49 — restart, full recovery (0 errors since the first 4s of startup)

What actually happened

Two problems fed each other. First, /var/lib/miren is at 90%, well past the 80% disk-pressure GC threshold. When disk pressure triggered image-GC, the GC failed because it has to list artifacts through etcd, which was already timing out. So the disk never got cleaned, the pressure persisted, and etcd stayed stressed. That's a genuine death spiral, and it's visible in the logs (image GC failed │ trigger: disk_pressure ... context deadline exceeded).
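To make that dependency concrete, here is a minimal Python sketch of the loop and of one way it could be broken. It is not miren's actual GC code: every name in it (list_unreferenced_via_etcd, prune_oldest_cache_files, run_disk_pressure_gc, EtcdTimeout) is hypothetical, and it assumes the fallback only touches build cache, which can be rebuilt, rather than registry blobs, which need reference data to prune safely.

```python
import os
import tempfile
from pathlib import Path


class EtcdTimeout(Exception):
    """Stand-in for the 'context deadline exceeded' error seen in the logs."""


def list_unreferenced_via_etcd(etcd_healthy):
    # Hypothetical primary path: ask etcd which artifacts are unreferenced.
    if not etcd_healthy:
        raise EtcdTimeout("context deadline exceeded")
    return []  # healthy case in this sketch: everything is still referenced


def prune_oldest_cache_files(cache_dir, target_bytes):
    # Hypothetical fallback: uses only the local filesystem, never etcd.
    # Deletes oldest files first until the directory fits in target_bytes.
    entries = [
        (p, p.stat().st_size, p.stat().st_mtime)
        for p in Path(cache_dir).iterdir()
        if p.is_file()
    ]
    entries.sort(key=lambda e: e[2])  # oldest mtime first
    total = sum(size for _, size, _ in entries)
    freed = 0
    for path, size, _ in entries:
        if total - freed <= target_bytes:
            break
        path.unlink()
        freed += size
    return freed


def run_disk_pressure_gc(etcd_healthy, cache_dir, target_bytes):
    # Returns (path_taken, result): the count of unreferenced artifacts on
    # the etcd path, or the bytes freed on the fallback path.
    try:
        return ("etcd", len(list_unreferenced_via_etcd(etcd_healthy)))
    except EtcdTimeout:
        # Without this branch the GC simply fails and the disk stays full,
        # which is the loop described above.
        return ("fallback", prune_oldest_cache_files(cache_dir, target_bytes))
```

The point of the design is only that the fallback has no etcd call anywhere in it, so a wedged etcd can't stop it from freeing space.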

Second, the box is a swapless 15G machine and the miren.service slice peaked near 15G (basically all of RAM). App containers summed to under 1G, so the heavy consumer had to be one of the system components (coordinator / buildkit / containerd). Memory pressure on a swapless host thrashes Go GC and starves goroutines, which would explain why etcd auto-sync (DeadlineExceeded) and the coordinator's internal RPCs (context canceled) both wedged at the same moment.

One forensic gap: etcd's own server logs (the fdatasync / apply took too long lines that would let us separate "disk stall" from "memory starvation") aren't shipped to VictoriaLogs. We only see the coordinator's etcd-client logs.

Disk breakdown (88G used of 98G)

buildkit build cache: 26G — but buildkitd.toml caps it at keepBytes = 10GB. It's 2.6× over its own configured limit, so buildkit's self-GC is not holding the line. This is the prime reclaim target and mostly garbage.
registry/blobs: 26G — legitimately referenced app image versions (image-GC retains all 1716 blobs).
containerd overlayfs + content: ~20G — unpacked snapshots for the ~30 running containers; working set.
data/local/app: 9G — app persistent data.

Important: image-GC ran fine post-restart and reclaimed nothing (229/229 images, 1716/1716 blobs retained). The registry/containerd usage is real working set, not garbage. The disk is undersized for this workload and buildkit cache is leaking past its cap.
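The arithmetic behind the breakdown, as a small Python sketch. All inputs are the figures reported above; the only assumptions are that G and GB are the same unit here and that "~20G" is treated as exactly 20. The function name disk_summary is just for illustration.

```python
def disk_summary():
    # Figures from the breakdown above, in GB.
    disk_total = 98
    disk_used = 88
    usage = {
        "buildkit build cache": 26,
        "registry/blobs": 26,
        "containerd overlayfs + content": 20,  # reported as ~20G
        "data/local/app": 9,
    }
    buildkit_cap = 10          # keepBytes in buildkitd.toml
    gc_threshold_pct = 80      # disk-pressure GC threshold

    accounted = sum(usage.values())
    excess_cache = usage["buildkit build cache"] - buildkit_cap
    used_after_trim = disk_used - excess_cache
    threshold_gb = disk_total * gc_threshold_pct / 100

    return {
        "used_pct": round(100 * disk_used / disk_total, 1),
        "buildkit_over_cap": usage["buildkit build cache"] / buildkit_cap,
        "unattributed_gb": disk_used - accounted,
        "excess_cache_gb": excess_cache,
        "used_pct_after_trim": round(100 * used_after_trim / disk_total, 1),
        "headroom_gb_after_trim": round(threshold_gb - used_after_trim, 1),
    }
```

With these inputs the disk is 89.8% full, the four listed items account for 81G (leaving 7G unattributed), and trimming buildkit back to its cap would free 16G and land at about 73.5%. That is only about 6.4G under the 80% threshold, which is consistent with the disk being undersized even once the cache leak is fixed.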

Follow-up work (to be split into sub-issues)

Reclaim buildkit cache now (no buildctl on the box; needs a matching binary or a restart-triggered GC) and figure out why buildkit GC isn't enforcing its 10GB cap.
Grow the /var/lib/miren disk — 98G is too small for registry + buildkit + ~30 containers.
Add swap to the box; a swapless 15G node running builds + etcd + 30 apps is fragile.
Break the GC death-spiral in runtime — disk-pressure image-GC shouldn't depend solely on a possibly-wedged etcd; add an etcd-independent fallback prune path.
Ship etcd's container logs to VictoriaLogs so next time we can see fsync warnings directly.
Alerting — garden was down 14h before a human noticed. Alert on etcd MemberList failure rate, route-lookup error rate, and disk >85%.
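As a starting point for the alerting item, a minimal Python sketch of the three rules evaluated against the numbers from this incident. Only the disk threshold (85%) comes from this issue; the two rate thresholds and all metric names are placeholder assumptions to be tuned, and this is plain Python rather than any particular alerting system's rule format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertRule:
    metric: str       # hypothetical metric name
    threshold: float  # alert fires when the sample is strictly above this


RULES = [
    AlertRule("etcd_memberlist_failures_per_hour", 50),  # assumed threshold
    AlertRule("route_lookup_errors_per_hour", 1000),     # assumed threshold
    AlertRule("disk_used_percent", 85),                  # from this issue
]


def firing(rules, sample):
    # Returns the metrics whose sampled value exceeds the rule threshold.
    # Metrics missing from the sample are treated as 0 (not firing).
    return [r.metric for r in rules if sample.get(r.metric, 0) > r.threshold]


# Values taken from the timeline and disk breakdown in this issue.
# The 90% disk figure is the level measured during the investigation.
INCIDENT_SAMPLE = {
    "etcd_memberlist_failures_per_hour": 110,
    "route_lookup_errors_per_hour": 5600,
    "disk_used_percent": 90,
}
```

Calling firing(RULES, INCIDENT_SAMPLE) returns all three metrics, so even with fairly loose placeholder thresholds each rule on its own would have paged during the outage.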