MIR-1249

disk_mount wedges in DM_UNMOUNTING when the unmount syscall fails (EBUSY); never retried

Open runtime Bug public

phinze Opened Jun 22, 2026 Updated Jun 22, 2026

Found while investigating MIR-1245 (disk lease cleanup). This is a separate, pre-existing bug in the disk_mount unmount state machine — the MIR-1245 changes don't touch this path.

What we know (traced from code, not yet observed at runtime)

In components/diskio/disk_mount_controller.go, unmountAndDetach persists ActualState = DM_UNMOUNTING (line ~486) before attempting the unmount. If c.ops.Unmount then fails (lines ~536-540) — e.g. EBUSY because something still holds the mount — it returns the error without calling setMountError, so the entity stays in DM_UNMOUNTING.

On any subsequent reconcile, reconcileMountUnmounted (line ~191) treats DM_UNMOUNTING as "in flight, do nothing" and returns nil. This is true for both the event-driven path and the ReconcileWithEntities resync (which re-reads fresh from the store and hits the same no-op). Net effect: the unmount is never retried; the mount stays wedged in DM_UNMOUNTING.

Contrast: when the detach or cloud-lease-release fails (lines ~576-588), that path calls setMountError, which moves the entity to DM_ERROR. DM_ERROR is in the re-drive list (line ~193), so that failure mode retries to convergence. The unmount-failure case is the odd one out.

Likely root cause: unmountAndDetach is synchronous (no background actor), so observing DM_UNMOUNTING at the start of a reconcile can only mean a prior call set the flag and then errored or died midway, which is exactly when a retry is wanted. The check at line ~191 instead leaves it alone, on the assumption (incorrect here) that something else is finishing the job.
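
To make the traced behavior concrete, here is a toy model of the state machine as described above. It is a sketch, not the controller code: the mount struct, the busy flag, and the DM_MOUNTED / DM_UNMOUNTED names are assumptions for illustration. Only DM_UNMOUNTING, DM_ERROR, unmountAndDetach and reconcileMountUnmounted come from the ticket, and the contents of the real re-drive list beyond DM_ERROR are not known here.

```go
package main

import (
	"errors"
	"fmt"
)

// State mirrors the ActualState values named in the ticket.
type State string

const (
	Mounted    State = "DM_MOUNTED" // assumed name for the steady state
	Unmounting State = "DM_UNMOUNTING"
	Unmounted  State = "DM_UNMOUNTED" // assumed name for the terminal state
	Errored    State = "DM_ERROR"
)

// errBusy stands in for the EBUSY returned by the unmount syscall.
var errBusy = errors.New("unmount: device or resource busy")

// mount is a toy stand-in for the persisted disk_mount entity.
type mount struct {
	actual       State
	busy         bool // true while something still holds the mountpoint
	unmountCalls int  // how many times the unmount was actually attempted
}

// unmountAndDetach models the current behavior: persist DM_UNMOUNTING first,
// then return the unmount error without recording DM_ERROR.
func unmountAndDetach(m *mount) error {
	m.actual = Unmounting
	m.unmountCalls++
	if m.busy {
		return errBusy // state is left at DM_UNMOUNTING
	}
	m.actual = Unmounted
	return nil
}

// reconcileUnmounted models reconcileMountUnmounted: DM_UNMOUNTING is treated
// as "in flight, do nothing". Re-driving every other state is an assumption.
func reconcileUnmounted(m *mount) error {
	switch m.actual {
	case Unmounting, Unmounted:
		return nil
	default:
		return unmountAndDetach(m)
	}
}

func main() {
	m := &mount{actual: Mounted, busy: true}
	fmt.Println(reconcileUnmounted(m), m.actual) // unmount: device or resource busy DM_UNMOUNTING

	m.busy = false // the holder goes away
	for i := 0; i < 3; i++ {
		_ = reconcileUnmounted(m) // every resync hits the no-op
	}
	fmt.Println(m.actual, m.unmountCalls) // DM_UNMOUNTING 1
}
```

In this model the unmount is attempted exactly once, and releasing the holder afterwards changes nothing, which matches the wedge described above.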

What we do NOT know yet

Whether this fires in practice. In most teardown paths the container is gone before the unmount runs (in standalone mode containerd dies with miren; MIR-1245's boot-failure path kills the container first), so the unmount likely succeeds. The case that looks reachable on paper is the graceful path: StopSandbox releases the lease early (sandbox.go:~2629) specifically so the async unmount races container shutdown — if the unmount wins that race, EBUSY → wedge. Not yet reproduced.
Exact requeue semantics of the event-driven controller on a returned error. The resync demonstrably re-drives through the no-op, so the conclusion holds regardless, but we haven't traced the immediate-requeue behavior to ground.

Severity (scoped deliberately small)

This looks like a resource leak, not a correctness/corruption bug. Even when a mount wedges in DM_UNMOUNTING, the stale loop device stays attached, and a future sandbox's FindLoopByBacking (disk_mount_controller.go:~289) adopts that same loop instead of double-attaching — so the single-writer invariant is still enforced at the loop layer. What leaks is the orphaned mountpoint and the stuck DM_UNMOUNTING entity, which appears to persist across restarts (the entity exists, so the orphan-cleanup branch in ReconcileWithEntities doesn't catch it). We have not verified the cross-restart behavior by hand.
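
The adopt-instead-of-double-attach behavior can be pictured with a small in-memory model. This is only a sketch: loopTable, attachOrAdopt, the lowercase findLoopByBacking, and the image path are hypothetical, and the real FindLoopByBacking signature is not shown in this ticket.

```go
package main

import "fmt"

// loopTable maps a backing file path to its attached loop device (toy model).
type loopTable map[string]string

// findLoopByBacking is a hypothetical stand-in for FindLoopByBacking.
func (t loopTable) findLoopByBacking(backing string) (string, bool) {
	dev, ok := t[backing]
	return dev, ok
}

// attachOrAdopt reuses an existing loop for the backing file if one exists,
// and only attaches a new loop device otherwise.
func (t loopTable) attachOrAdopt(backing string) (dev string, adopted bool) {
	if existing, ok := t.findLoopByBacking(backing); ok {
		return existing, true
	}
	dev = fmt.Sprintf("/dev/loop%d", len(t))
	t[backing] = dev
	return dev, false
}

func main() {
	loops := loopTable{}
	backing := "/var/lib/example/disk.img" // hypothetical path

	fmt.Println(loops.attachOrAdopt(backing)) // /dev/loop0 false
	fmt.Println(loops.attachOrAdopt(backing)) // /dev/loop0 true
	fmt.Println(len(loops))                   // 1
}
```

Under this model a stale loop left behind by a wedged mount is reused by the next sandbox, so there is still at most one loop per backing file.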

Possible fix directions (not yet evaluated)

Minimal/consistent: on unmount failure, call setMountError like the detach path, so it lands in the retriable DM_ERROR. Alternative: make DM_UNMOUNTING itself re-drive unmountAndDetach (treat it as "resume"), or don't persist DM_UNMOUNTING until after the unmount succeeds. Each needs a test that an unmount returning EBUSY eventually converges once the holder releases.
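
A sketch of the minimal direction, together with the convergence check it would need, using a toy model rather than the real controller. The mount struct, the busy flag, the toy setMountError signature, and the DM_MOUNTED / DM_UNMOUNTED names are assumptions. The only behavioral change modeled is routing the unmount failure through setMountError.

```go
package main

import (
	"errors"
	"fmt"
)

type State string

const (
	Mounted    State = "DM_MOUNTED" // assumed name
	Unmounting State = "DM_UNMOUNTING"
	Unmounted  State = "DM_UNMOUNTED" // assumed name
	Errored    State = "DM_ERROR"
)

// errBusy stands in for the EBUSY returned by the unmount syscall.
var errBusy = errors.New("unmount: device or resource busy")

type mount struct {
	actual  State
	busy    bool // true while something still holds the mountpoint
	lastErr error
}

// setMountError is a toy version of the helper named in the ticket:
// it records the failure and moves the entity to DM_ERROR.
func setMountError(m *mount, err error) error {
	m.actual = Errored
	m.lastErr = err
	return err
}

func unmountAndDetach(m *mount) error {
	m.actual = Unmounting
	if m.busy {
		return setMountError(m, errBusy) // proposed change: land in DM_ERROR
	}
	m.actual = Unmounted
	m.lastErr = nil
	return nil
}

// reconcileUnmounted keeps the existing "DM_UNMOUNTING is in flight" rule;
// DM_ERROR falls through to the re-drive branch.
func reconcileUnmounted(m *mount) error {
	switch m.actual {
	case Unmounting, Unmounted:
		return nil
	default:
		return unmountAndDetach(m)
	}
}

func main() {
	m := &mount{actual: Mounted, busy: true}
	fmt.Println(reconcileUnmounted(m), m.actual) // unmount: device or resource busy DM_ERROR
	fmt.Println(reconcileUnmounted(m), m.actual) // retried while still busy: same result

	m.busy = false // the holder releases
	fmt.Println(reconcileUnmounted(m), m.actual) // <nil> DM_UNMOUNTED
}
```

The same check (fail with EBUSY, release the holder, reconcile again, expect convergence) would apply unchanged to the "resume from DM_UNMOUNTING" and "persist after success" alternatives.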

(Line numbers are approximate and will drift.)