MIR-1016

Unknown app.toml fields silently drop services instead of erroring or being ignored

Done public

phinze Opened Apr 15, 2026 Updated Jul 2, 2026

Fail builds on app.toml parse errors instead of silently continuing

The cloud app deploy on 2026-04-14 included an [aliases] section in app.toml that the prod runtime (v0.6.1) doesn't understand yet. Instead of erroring or ignoring the unknown field, the deploy silently dropped all non-web services. The CLI reported success and the version activated, but it was stored with services: 1 instead of 3: no valkey pool, no bgtask pool.
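The mismatch described above (three services declared, one stored) is the kind of thing a post-deploy comparison could surface. A minimal sketch, assuming the CLI can obtain both the declared and the stored service names; missingServices and the service-name slices are hypothetical and not taken from the miren codebase:

```go
package main

import (
	"fmt"
	"sort"
)

// missingServices returns the declared service names that are absent from
// the stored version, sorted so the report is stable.
func missingServices(declared, stored []string) []string {
	have := make(map[string]bool, len(stored))
	for _, s := range stored {
		have[s] = true
	}
	var missing []string
	for _, d := range declared {
		if !have[d] {
			missing = append(missing, d)
		}
	}
	sort.Strings(missing)
	return missing
}

func main() {
	// Hypothetical stand-ins for the incident: three services declared in
	// app.toml, only the web service stored on the server.
	declared := []string{"web", "valkey", "bgtask"}
	stored := []string{"web"}

	if m := missingServices(declared, stored); len(m) > 0 {
		fmt.Printf("deploy incomplete: stored %d of %d services, missing %v\n",
			len(stored), len(declared), m)
	}
}
```

A check like this would not fix the parsing bug, but it would turn a silent success into a visible failure.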

The web service limped along without valkey for about 23 hours (the cluster-channel component kept retrying DNS lookups, gradually pushing request durations to 10-17s). When new web sandboxes tried to boot, they crashed immediately on the valkey connection, hit crash cooldown after 19 failures, and took down the whole control plane.

We confirmed the cause by removing [aliases] and redeploying: all three pools came up immediately. We also hit a "panic: send on closed channel" in deploy.go:634 on the first redeploy attempt, while [aliases] was still present; this may be related.

We don't have complete visibility into the degradation timeline (journalctl only goes back to a miren process restart mid-incident, and the CI logs don't capture server-side service creation). More analysis is warranted on exactly how the cascade played out.

The fix here is that unknown app.toml fields should either be ignored or rejected at deploy time. Silently corrupting the service list is the worst outcome since nothing signals that anything went wrong.
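The "reject at deploy time" option could look roughly like the sketch below. It assumes app.toml has already been decoded into a generic map by whatever TOML library the runtime uses. The names rejectUnknownKeys and knownTopLevel, and the "name" and "services" keys, are hypothetical; the real schema is not given in this issue.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// knownTopLevel is a hypothetical allow-list of the app.toml sections this
// runtime version understands; the real set is not given in this issue.
var knownTopLevel = map[string]bool{
	"name":     true,
	"services": true,
}

// rejectUnknownKeys returns an error naming every top-level key that is not
// in the allow-list, so the deploy fails loudly instead of continuing.
func rejectUnknownKeys(decoded map[string]any, known map[string]bool) error {
	var unknown []string
	for key := range decoded {
		if !known[key] {
			unknown = append(unknown, key)
		}
	}
	if len(unknown) == 0 {
		return nil
	}
	sort.Strings(unknown) // deterministic message regardless of map order
	return fmt.Errorf("app.toml: unknown fields: %s", strings.Join(unknown, ", "))
}

func main() {
	// Stand-in for the decoded app.toml from the incident: known sections
	// plus an [aliases] table the runtime does not understand.
	decoded := map[string]any{
		"name":     "cloud",
		"services": map[string]any{"web": nil, "valkey": nil, "bgtask": nil},
		"aliases":  map[string]any{},
	}
	if err := rejectUnknownKeys(decoded, knownTopLevel); err != nil {
		fmt.Println("deploy rejected:", err)
	}
}
```

Rejecting is stricter than ignoring, but it names the offending field at deploy time rather than leaving a version mismatch to show up hours later.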