MIR-1280

BuildKit GC not enforcing its keepBytes cap — cache grew to 2.6× the configured limit

In Progress runtime Bug public
phinze · Opened Jun 30, 2026 · Updated Jul 1, 2026

What happened

BuildKit's local build cache on miren-garden grew to 27GB against a configured 10GB cap, making it a primary contributor to the MIR-1279 disk-pressure outage. A manual buildctl prune --keep-storage 10000 reclaimed 17GB, confirming the cache was collectable; automatic GC simply never collected it.

Root cause (proven on the box)

Two compounding faults in the generated buildkitd.toml:

  1. No ceiling field was ever set. The single policy used keepBytes, which BuildKit v0.19 deprecates and migrates to reservedSpace — a floor ("never prune below"), not a cap. maxUsedSpace (the actual ceiling) was 0. Confirmed live via buildctl debug workers: {reservedSpace:10GB, maxUsedSpace:0, keepDuration:7d}.
  2. keepDuration pins far more than it appears to. It protects any record used within 7 days, and because deletion can't remove a parent while a child still references it, it transitively protects the entire ancestry of anything touched in that window. On a builder that constantly rebuilds a stable family of base images, that pins nearly the whole cache. Measured: 73 of 73 over-7d records on garden were pinned by a younger descendant; zero were freely deletable.
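
The transitive pinning in the second fault can be sketched with a toy model. This is an illustration only, not BuildKit's code: the Record type, the pinned_by_descendant helper, and the sample records are all hypothetical, and the only rule modelled is that a record used inside the keepDuration window protects every ancestor in its parent chain.

```python
from dataclasses import dataclass
from typing import Optional

DAY = 24 * 3600


@dataclass
class Record:
    id: str
    parent: Optional[str]  # id of the parent record, or None for a root
    age: int               # seconds since the record was last used


def pinned_by_descendant(records, keep_duration):
    """Ids of records older than keep_duration that are still protected
    because some descendant was used within keep_duration."""
    by_id = {r.id: r for r in records}
    protected = set()
    for r in records:
        if r.age <= keep_duration:
            # A recently used record protects itself and its whole ancestry.
            cur = r
            while cur is not None and cur.id not in protected:
                protected.add(cur.id)
                cur = by_id.get(cur.parent) if cur.parent else None
    return {r.id for r in records if r.age > keep_duration and r.id in protected}


# Hypothetical cache: a stable base-image chain rebuilt an hour ago,
# plus one old record with no recent descendant.
records = [
    Record("base", None, 30 * DAY),
    Record("deps", "base", 20 * DAY),
    Record("app", "deps", 3600),
    Record("scratch", None, 30 * DAY),
]

pinned = pinned_by_descendant(records, 7 * DAY)
old = {r.id for r in records if r.age > 7 * DAY}
print(sorted(pinned))        # ['base', 'deps']
print(sorted(old - pinned))  # ['scratch']
```

In this model only "scratch" is freely deletable. On garden the equivalent of that set was empty: all 73 over-7d records sat in the "pinned" set.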

Net: GC ran after every build but freed only what had just aged out (logs show about a dozen runs freeing 0.1–3GB each). The 10GB cap never bound, so the cache grew without limit.

Live proof

Same 2GB prune target, only the age gate toggled: --keep-storage 2000 --keep-duration 168h freed 0B; --keep-storage 2000 freed 7.74GB. The age gate alone makes the cap unenforceable.

Ruled out

  • Not a BuildKit version bug. The prune / calculateKeepBytes logic is byte-identical from v0.19.0 (the box's rc3) through v0.31.1. Upgrading won't fix it.
  • Not the pull-through cache migration (infra#76, Jun 24). All base-version churn predates it on the old periodic-sync cadence; only ~0.06GB of pulls postdate the cutover.

Fix

Replace the single keepDuration policy with a graduated multi-tier policy whose final two tiers are age-less maxUsedSpace ceilings (mirrors BuildKit's own default shape, using the non-deprecated fields and OR-form source filters). Implemented in components/buildkit/buildkit.go; PR to follow.
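
Why an age-less final tier matters can be shown with a simplified GC model. This is a sketch under stated assumptions, not the code in components/buildkit/buildkit.go and not BuildKit's prune algorithm: Tier, run_gc, and the record sizes are hypothetical, each tier is reduced to an age gate plus a size ceiling, and only leaf records (those with no live child) are deletable.

```python
from dataclasses import dataclass
from typing import Optional

GB = 10**9
DAY = 24 * 3600


@dataclass
class Record:
    id: str
    parent: Optional[str]
    size: int  # bytes
    age: int   # seconds since last use


@dataclass
class Tier:
    keep_duration: int   # age gate in seconds; 0 means no age gate
    max_used_space: int  # prune until total size is at or below this


def run_gc(records, tiers):
    """Apply each tier in order and return the surviving records."""
    live = {r.id: r for r in records}
    for tier in tiers:
        while sum(r.size for r in live.values()) > tier.max_used_space:
            parents_in_use = {r.parent for r in live.values() if r.parent}
            candidates = [
                r for r in live.values()
                if r.id not in parents_in_use   # a parent with a live child is pinned
                and r.age > tier.keep_duration  # the age gate
            ]
            if not candidates:
                break  # this tier cannot free anything more
            oldest = max(candidates, key=lambda r: r.age)
            del live[oldest.id]
    return list(live.values())


def total(records):
    return sum(r.size for r in records)


# Hypothetical 27GB cache: one chain whose tip was used an hour ago.
cache = [
    Record("base", None, 4 * GB, 30 * DAY),
    Record("deps", "base", 6 * GB, 20 * DAY),
    Record("app", "deps", 17 * GB, 3600),
]

single = run_gc(cache, [Tier(7 * DAY, 10 * GB)])
tiered = run_gc(cache, [Tier(7 * DAY, 10 * GB), Tier(0, 10 * GB)])
print(total(single) // GB)  # 27
print(total(tiered) // GB)  # 10
```

With only the age-gated tier, nothing is eligible and the cache stays at 27GB, matching the 0B result in the live proof. Adding an age-less tier lets GC reach the ceiling, at the cost of evicting recently used records when nothing older is deletable.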

Follow-ups (separate)

  • pkg/stackbuild: SharedKeyHint is keyed on a per-build temp dir (dead as written); WithMetaResolver applied inconsistently; consider pinning base-image digests — more valuable now under the pull-through cache.
  • Registry retention: miren-oci is 236GB with no GC (filed separately).
  • Optional: a periodic buildctl prune as version-independent insurance, and ship buildctl on the host so there's an on-box way to inspect/prune.