MIR-1235

Workload-identity token server orphans already-running sandboxes on restart → permanent 403 "invalid token"

Done public

evan Opened Jun 10, 2026 Updated Jun 10, 2026

sandbox: re-register workload-identity token secrets on controller restart

Summary

A long-running sandbox that authenticates via workload identity permanently loses chitchat (and any multipass-token) connectivity after the host's sandbox controller / token server restarts. The local per-sandbox token server returns 403 {"error":"invalid token"} for the sandbox's MIREN_IDENTITY_TOKEN_SECRET, and the sandbox cannot recover on its own — it requires a restart/redeploy.

Observed

linearagent sandbox AN2 (host miren-garden-runner-1, IP 10.8.64.3) ran fine from 2026-06-07, then began looping every 30s: chitchat connection lost err="token: fetch identity token: token server: 403 Forbidden: {"error":"invalid token"}"
The 403 originates in the runtime token server, surfaced through the service's identity client (hop 1: GET $MIREN_IDENTITY_TOKEN_URL).
Restarting the app fixed it immediately (a fresh sandbox re-registers its secret).
Freshly-started sandboxes are unaffected — which is the tell: this is one lost registration, not a token-server outage.

Root cause — `controllers/sandbox/token_server.go` + `controllers/sandbox/sandbox.go`

The token server validates the caller's bearer (MIREN_IDENTITY_TOKEN_SECRET) against an in-memory registry keyed by source IP (tokenSecretRegistry.byAddr, verify() ~L59-67). No entry for the IP → 403 "invalid token" (handleTokenRequest ~L128-131).
That registry is created empty on controller startup (sandbox.go:492) and is only ever populated on the sandbox start path, where the secret is generated, registered by IP, and injected as an env var (sandbox.go:2135-2141).
Nothing re-registers an already-running sandbox. So when the controller / token server restarts, every live sandbox's IP→secret entry is gone. The running process still holds the original secret in its env (fixed for the life of the process), so its token requests now 403 forever. It only recovers when the sandbox itself is restarted (which regenerates + re-registers + re-injects the secret).
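
The failure mode above can be sketched in a few lines of Go. This is an illustration, not the code in `token_server.go`: the type name `tokenSecretRegistry` and the `byAddr` field come from this ticket, while `newTokenSecretRegistry`, `register`, `statusFor`, and the example secret are hypothetical stand-ins.

```go
package main

import (
	"crypto/subtle"
	"fmt"
	"net/http"
	"sync"
)

// tokenSecretRegistry is a simplified stand-in for the in-memory registry
// described above: source IP -> secret, with no persistence.
type tokenSecretRegistry struct {
	mu     sync.RWMutex
	byAddr map[string]string
}

// newTokenSecretRegistry returns an empty registry, as on controller startup.
func newTokenSecretRegistry() *tokenSecretRegistry {
	return &tokenSecretRegistry{byAddr: make(map[string]string)}
}

// register mirrors the sandbox start path: store the secret under the sandbox IP.
func (r *tokenSecretRegistry) register(ip, secret string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.byAddr[ip] = secret
}

// verify reports whether bearer matches the secret registered for ip.
// An IP with no entry always fails.
func (r *tokenSecretRegistry) verify(ip, bearer string) bool {
	r.mu.RLock()
	want, ok := r.byAddr[ip]
	r.mu.RUnlock()
	if !ok {
		return false
	}
	return subtle.ConstantTimeCompare([]byte(want), []byte(bearer)) == 1
}

// statusFor maps the verification result to the HTTP status the caller sees.
func statusFor(r *tokenSecretRegistry, ip, bearer string) int {
	if !r.verify(ip, bearer) {
		return http.StatusForbidden
	}
	return http.StatusOK
}

func main() {
	const ip, secret = "10.8.64.3", "example-secret"

	reg := newTokenSecretRegistry()
	reg.register(ip, secret)
	fmt.Println("before restart:", statusFor(reg, ip, secret))

	// A controller restart builds a fresh, empty registry. The sandbox
	// process still presents the secret it was started with.
	reg = newTokenSecretRegistry()
	fmt.Println("after restart:", statusFor(reg, ip, secret))
}
```

In this sketch the same IP and secret get 200 before the registry is recreated and 403 afterwards, which matches the observed behaviour.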

Impact

Any long-lived service using workload identity (anything modeled on clusteragent's identity flow) silently loses chitchat connectivity on a controller / token-server restart and cannot self-heal — manual restart required. Services using a static CHITCHAT_API_KEY are unaffected.

Suggested fixes (pick one)

1. Re-register on reconcile: on controller startup, repopulate tokenSecretRegistry for already-running sandboxes (requires the secret to be recoverable; see 3).
2. Persist the registry (IP→secret, sandbox→IP) across controller restarts.
3. Make the secret a refreshable file rather than a boot-time env value: mount it (like the identity token at /var/run/miren/identity-token) and have the token server accept the current file contents, so a re-registered/rotated secret is picked up by the running process without a restart.
4. Consider keying the registry by sandbox identity rather than raw IP, to also close IP-reuse edge cases.
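
A rough sketch of how re-registration on startup and identity keying could fit together, assuming the controller can read back each running sandbox's secret from some persisted record. Everything here (`sandboxRecord`, `registry`, `rebuildRegistry`, the IDs and secrets) is hypothetical and only illustrates the idea; where and how the secret is stored safely is left open.

```go
package main

import (
	"crypto/subtle"
	"fmt"
)

// sandboxRecord is a hypothetical persisted view of a sandbox that survives
// a controller restart.
type sandboxRecord struct {
	ID      string
	IP      string
	Secret  string
	Running bool
}

// registry keys secrets by sandbox ID and resolves callers by IP through a
// separate index, so one IP maps to at most one live sandbox.
type registry struct {
	secretByID map[string]string
	idByAddr   map[string]string
}

// rebuildRegistry repopulates the registry on controller startup from the
// persisted records, skipping sandboxes that are stopped or have no secret.
func rebuildRegistry(records []sandboxRecord) *registry {
	reg := &registry{
		secretByID: make(map[string]string),
		idByAddr:   make(map[string]string),
	}
	for _, rec := range records {
		if !rec.Running || rec.Secret == "" {
			continue
		}
		reg.secretByID[rec.ID] = rec.Secret
		reg.idByAddr[rec.IP] = rec.ID
	}
	return reg
}

// verify resolves the caller IP to a sandbox ID, then compares the bearer
// against that sandbox's secret.
func (r *registry) verify(ip, bearer string) bool {
	id, ok := r.idByAddr[ip]
	if !ok {
		return false
	}
	return subtle.ConstantTimeCompare([]byte(r.secretByID[id]), []byte(bearer)) == 1
}

func main() {
	// The stopped sandbox previously held the IP now used by the live one.
	records := []sandboxRecord{
		{ID: "sb-old", IP: "10.8.64.3", Secret: "old-secret", Running: false},
		{ID: "sb-live", IP: "10.8.64.3", Secret: "live-secret", Running: true},
	}
	reg := rebuildRegistry(records)
	fmt.Println("live sandbox accepted:", reg.verify("10.8.64.3", "live-secret"))
	fmt.Println("stale secret accepted:", reg.verify("10.8.64.3", "old-secret"))
}
```

Because only running sandboxes are restored, a stale secret from a previous holder of the same IP is rejected, which is the IP-reuse case the last suggestion is aimed at.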

Repro

1. Start a sandbox that uses workload identity; confirm it connects.
2. Restart the host's sandbox controller (token server) without restarting the sandbox.
3. Force the sandbox to fetch a fresh token (e.g. reconnect after the 24h multipass token expires) → 403 invalid token; stays broken until the sandbox is restarted.
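
To avoid waiting for the token to expire, the fetch in the last step could be driven by a small probe like the one below. It is a sketch under assumptions: the secret is presumed to be sent as an `Authorization: Bearer` header (the ticket only says "bearer"), and `probeTokenServer` and `fakeTokenServer` are hypothetical helpers. The fake server stands in for the real one so the example runs anywhere; inside a sandbox you would pass the values of MIREN_IDENTITY_TOKEN_URL and MIREN_IDENTITY_TOKEN_SECRET instead.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"strings"
)

// probeTokenServer performs one token fetch and returns the HTTP status and
// response body. The header format is an assumption, not confirmed by the ticket.
func probeTokenServer(url, secret string) (int, string, error) {
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return 0, "", err
	}
	req.Header.Set("Authorization", "Bearer "+secret)
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return 0, "", err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return resp.StatusCode, "", err
	}
	return resp.StatusCode, strings.TrimSpace(string(body)), nil
}

// fakeTokenServer is a test double: it accepts only the secret it "knows".
// An empty known secret models the registry after a controller restart.
func fakeTokenServer(known string) *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if known == "" || r.Header.Get("Authorization") != "Bearer "+known {
			w.WriteHeader(http.StatusForbidden)
			io.WriteString(w, `{"error":"invalid token"}`)
			return
		}
		io.WriteString(w, "ok") // placeholder body, not the real token format
	}))
}

func main() {
	const secret = "example-secret"

	// Before the restart: the server still knows the sandbox's secret.
	before := fakeTokenServer(secret)
	defer before.Close()
	status, body, err := probeTokenServer(before.URL, secret)
	fmt.Println(status, body, err)

	// After the restart: empty registry, same secret in the sandbox env.
	after := fakeTokenServer("")
	defer after.Close()
	status, body, err = probeTokenServer(after.URL, secret)
	fmt.Println(status, body, err)
}
```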

Notes

Surfaced via linearagent, but this is a runtime issue affecting all workload-identity consumers.
Mitigation under consideration in linearagent: fall back to the static CHITCHAT_API_KEY when identity refresh fails, so a registry loss degrades to apikey auth instead of a full disconnect. That hardens the consumer but does not fix the underlying server bug.
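
A minimal sketch of that consumer-side fallback, assuming the client can choose its credential per connection attempt. The names `credential`, `tokenFetcher`, `resolveCredential`, and `errNoCredential` are hypothetical and not taken from linearagent.

```go
package main

import (
	"errors"
	"fmt"
)

// credential is what the client would present to chitchat.
type credential struct {
	Kind  string // "identity" or "apikey"
	Value string
}

// tokenFetcher abstracts the identity-token fetch against the token server.
type tokenFetcher func() (string, error)

var errNoCredential = errors.New("identity refresh failed and no static API key is configured")

// resolveCredential prefers a fresh identity token and falls back to the
// static key (e.g. the value of CHITCHAT_API_KEY) only when the fetch fails.
func resolveCredential(fetch tokenFetcher, staticKey string) (credential, error) {
	tok, err := fetch()
	if err == nil {
		return credential{Kind: "identity", Value: tok}, nil
	}
	if staticKey == "" {
		return credential{}, fmt.Errorf("%w: %v", errNoCredential, err)
	}
	return credential{Kind: "apikey", Value: staticKey}, nil
}

func main() {
	// Simulate the token server rejecting the sandbox's secret.
	failing := func() (string, error) {
		return "", errors.New(`token server: 403 Forbidden: {"error":"invalid token"}`)
	}
	cred, err := resolveCredential(failing, "example-static-key")
	fmt.Println(cred.Kind, err)
}
```

Retrying the identity path on every reconnect, rather than latching onto the API key, would let the service return to workload identity once the registration is restored.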