disk_mount wedges in DM_UNMOUNTING when the unmount syscall fails (EBUSY); never retried
Found while investigating MIR-1245 (disk lease cleanup). This is a separate, pre-existing bug in the disk_mount unmount state machine — the MIR-1245 changes don't touch this path.
What we know (traced from code, not yet observed at runtime)
In components/diskio/disk_mount_controller.go, unmountAndDetach persists ActualState = DM_UNMOUNTING (line ~486) before attempting the unmount. If c.ops.Unmount then fails (lines ~536-540) — e.g. EBUSY because something still holds the mount — it returns the error without calling setMountError, so the entity stays in DM_UNMOUNTING.
On any subsequent reconcile, reconcileMountUnmounted (line ~191) treats DM_UNMOUNTING as "in flight, do nothing" and returns nil. This is true for both the event-driven path and the ReconcileWithEntities resync (which re-reads fresh from the store and hits the same no-op). Net effect: the unmount is never retried; the mount stays wedged in DM_UNMOUNTING.
Contrast: when the detach or cloud-lease-release fails (lines ~576-588), it calls setMountError → DM_ERROR, and DM_ERROR is in the re-drive list (line ~193), so that failure mode retries to convergence. The unmount-failure case is the odd one out.
Likely root cause: unmountAndDetach is synchronous (no background actor), so observing DM_UNMOUNTING at the start of a reconcile can only mean a prior call set the state and then errored or died mid-way, which is exactly when a retry is wanted. The check at line ~191 instead leaves it alone, on the (here incorrect) assumption that something else is finishing the job.
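The sequence above can be modeled in a few lines. This is a minimal sketch, not the controller code: the Mount type, the busyOps fake, the DM_MOUNTED and DM_UNMOUNTED state names, and all function signatures are assumptions made for illustration. Only DM_UNMOUNTING, DM_ERROR, unmountAndDetach, reconcileMountUnmounted and the EBUSY failure come from the trace.

```go
package main

import (
	"errors"
	"fmt"
	"syscall"
)

// State stands in for ActualState. DM_UNMOUNTING and DM_ERROR are named in
// this ticket; the other two values are hypothetical placeholders.
type State string

const (
	DMMounted    State = "DM_MOUNTED"   // hypothetical
	DMUnmounting State = "DM_UNMOUNTING"
	DMError      State = "DM_ERROR"
	DMUnmounted  State = "DM_UNMOUNTED" // hypothetical
)

// Mount is a hypothetical stand-in for the persisted disk_mount entity.
type Mount struct{ Actual State }

// busyOps fakes c.ops: Unmount fails with EBUSY while anything holds the mount.
type busyOps struct{ holders int }

func (o *busyOps) Unmount() error {
	if o.holders > 0 {
		return syscall.EBUSY
	}
	return nil
}

// unmountAndDetach persists DM_UNMOUNTING before the unmount and, on failure,
// returns the error without moving the entity to DM_ERROR.
func unmountAndDetach(m *Mount, ops *busyOps) error {
	m.Actual = DMUnmounting
	if err := ops.Unmount(); err != nil {
		return fmt.Errorf("unmount: %w", err) // no setMountError on this path
	}
	m.Actual = DMUnmounted
	return nil
}

// reconcileMountUnmounted re-drives DM_ERROR but treats DM_UNMOUNTING as
// "in flight, do nothing".
func reconcileMountUnmounted(m *Mount, ops *busyOps) error {
	switch m.Actual {
	case DMUnmounting:
		return nil
	case DMMounted, DMError:
		return unmountAndDetach(m, ops)
	}
	return nil
}

func main() {
	ops := &busyOps{holders: 1}
	m := &Mount{Actual: DMMounted}

	err := reconcileMountUnmounted(m, ops)
	fmt.Println(errors.Is(err, syscall.EBUSY), m.Actual) // true DM_UNMOUNTING

	ops.holders = 0 // the holder goes away, so an unmount would now succeed
	for i := 0; i < 3; i++ {
		_ = reconcileMountUnmounted(m, ops)
	}
	fmt.Println(m.Actual) // DM_UNMOUNTING
}
```

In this model the entity stays in DM_UNMOUNTING even after the holder releases, because no reconcile ever calls Unmount again.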
What we do NOT know yet
- Whether this fires in practice. In most teardown paths the container is gone before the unmount runs (in standalone mode containerd dies with miren; MIR-1245's boot-failure path kills the container first), so the unmount likely succeeds. The case that looks reachable on paper is the graceful path: StopSandbox releases the lease early (sandbox.go:~2629) specifically so the async unmount races container shutdown. If the unmount wins that race, EBUSY → wedge. Not yet reproduced.
- Exact requeue semantics of the event-driven controller on a returned error. The resync demonstrably re-drives through the no-op, so the conclusion holds regardless, but we haven't traced the immediate-requeue behavior to ground.
Severity (scoped deliberately small)
This looks like a resource leak, not a correctness/corruption bug. Even when a mount wedges in DM_UNMOUNTING, the stale loop device stays attached, and a future sandbox's FindLoopByBacking (disk_mount_controller.go:~289) adopts that same loop instead of double-attaching — so the single-writer invariant is still enforced at the loop layer. What leaks is the orphaned mountpoint and the stuck DM_UNMOUNTING entity, which appears to persist across restarts (the entity exists, so the orphan-cleanup branch in ReconcileWithEntities doesn't catch it). We have not verified the cross-restart behavior by hand.
Possible fix directions (not yet evaluated)
- Minimal/consistent: on unmount failure, call setMountError like the detach path, so it lands in the retriable DM_ERROR.
- Alternative: make DM_UNMOUNTING itself re-drive unmountAndDetach (treat it as "resume"), or don't persist DM_UNMOUNTING until after the unmount succeeds.
Each needs a test that an unmount returning EBUSY eventually converges once the holder releases.
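As a rough illustration of the minimal direction and the convergence check it needs, the sketch below routes the unmount failure through a setMountError stand-in. Everything here is a simplified model: the Mount type, the fakeOps holder counter, the DM_MOUNTED and DM_UNMOUNTED names, and the body of setMountError are assumptions, not the controller's real code.

```go
package main

import (
	"fmt"
	"syscall"
)

type State string

const (
	DMMounted    State = "DM_MOUNTED"   // hypothetical
	DMUnmounting State = "DM_UNMOUNTING"
	DMError      State = "DM_ERROR"
	DMUnmounted  State = "DM_UNMOUNTED" // hypothetical
)

// Mount is a hypothetical stand-in for the persisted entity.
type Mount struct {
	Actual  State
	LastErr string
}

// fakeOps fails Unmount with EBUSY while holders > 0.
type fakeOps struct{ holders int }

func (o *fakeOps) Unmount() error {
	if o.holders > 0 {
		return syscall.EBUSY
	}
	return nil
}

// setMountError models the helper named in the ticket (body assumed):
// record the error and move the entity to the retriable DM_ERROR state.
func setMountError(m *Mount, err error) {
	m.Actual = DMError
	m.LastErr = err.Error()
}

func unmountAndDetach(m *Mount, ops *fakeOps) error {
	m.Actual = DMUnmounting
	if err := ops.Unmount(); err != nil {
		setMountError(m, err) // proposed change: same handling as the detach path
		return fmt.Errorf("unmount: %w", err)
	}
	m.Actual = DMUnmounted
	m.LastErr = ""
	return nil
}

// reconcileMountUnmounted keeps DM_UNMOUNTING as a no-op in this variant;
// only DM_ERROR (and the initial mounted state) is re-driven.
func reconcileMountUnmounted(m *Mount, ops *fakeOps) error {
	switch m.Actual {
	case DMMounted, DMError:
		return unmountAndDetach(m, ops)
	default:
		return nil
	}
}

func main() {
	ops := &fakeOps{holders: 1}
	m := &Mount{Actual: DMMounted}
	for attempt := 1; attempt <= 3; attempt++ {
		if attempt == 3 {
			ops.holders = 0 // holder releases before the third reconcile
		}
		err := reconcileMountUnmounted(m, ops)
		fmt.Println(attempt, m.Actual, err)
	}
}
```

In this model the first two reconciles leave the mount in DM_ERROR and the third reaches DM_UNMOUNTED once the holder is gone. The variant does not cover a process that dies between persisting DM_UNMOUNTING and attempting the unmount; the "resume" alternative would be the one to address that case.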
(Line numbers are approximate and will drift.)