Submit an issue View all issues Source
MIR-946

Robustify flakey CI tests

Done public
phinze phinze Opened Mar 31, 2026 Updated Apr 3, 2026

Context

Analysis of CI failures over the last 12 days (Mar 20–31, 91 Test workflow runs, 18 failures). After filtering out branch-specific real bugs (frozen hash checks, compile errors, lint-only), there are 7 runs with true flaky test failures across 5 distinct patterns.

Flakey Patterns

P0: Blackbox addon tests — WaitForAppReady fatals on transient crash (3 runs)

Tests: TestAddonDeployWithAppToml, TestMysqlAddonDeployWithAppToml

Symptom: "app crashed while waiting for ready" / "timed out waiting for env var DATABASE_URL"

Root cause: WaitForAppReady in blackbox/harness/app.go:146-147 calls t.Fatalf immediately on seeing health: "crashed". During addon provisioning the app starts before DATABASE_URL is injected, so it crashes on first boot. This is a transient crash that would recover once the env var arrives and the app restarts.

Fix: Treat "crashed" as a non-fatal transient state during polling — log it and keep waiting, only fatal if the overall timeout is exceeded without ever seeing healthy. Also consider increasing the MySQL WaitForEnvVar timeout (currently 5m, which can be tight for cold-start provisioning).

P0: Periodic cleanup timestamp precision race (1 run)

Tests: TestSandbox/cleans_up_dead_sandboxes_older_than_1_hour, TestPeriodicReleasesDiskLeases/releases_disk_leases_before_deleting_dead_sandbox

Symptom: "Should be false" / "dead sandbox sbID1 should have been cleaned up" / "Not equal"

Root cause: Both tests call sbc.EAC.Put() to write a DEAD sandbox entity, then immediately call Periodic(ctx, 0). Periodic checks updatedAt.Before(time.Now()) — since the entity was written nanoseconds before, updatedAt equals or exceeds the cutoff time, so the sandbox is never cleaned up.

  • sandbox_test.go:820: Periodic(ctx, 0) — cutoff = now, updatedAt ≈ now
  • volume_test.go:531: same pattern

Fix: Use a small negative time horizon (e.g., Periodic(ctx, -time.Millisecond)) or add a time.Sleep(time.Millisecond) before calling Periodic.

P1: Container watchdog entity reference race (3 runs)

Test: TestContainerWatchdog/removes_orphaned_containers

Symptom: "attribute dev.miren.compute/key.node references a non-existent entity: not found: node/test-node"

Root cause: Test creates node/test-node, then creates sandbox entities referencing it. Entity server validates Ref fields by checking existence. On occasion, the sandbox Put fires before the node entity is fully committed/visible.

Fix: After putting the node entity, confirm it's readable (get-after-put) before proceeding with sandbox creation.

P2: Etcd auto-restart timeout too tight (1 run)

Test: TestEtcdComponentAutoRestart

Symptom: "Condition never satisfied" (test ran 68.8s)

Root cause: Test kills etcd, then waits only 30s (require.Eventually) for restart + healthy. On loaded CI runners, etcd restart (backoff + WAL recovery) can exceed 30s. Also creates a new etcdClient on every poll iteration, which is expensive.

Fix: Increase require.Eventually timeout to 60-90s. Reuse a single client in the poll loop.

P2: Runner integration test — no test-level timeout (1 run)

Test: components/runner/ integration test

Symptom: Test killed at 600s (Go's default 10-minute timeout)

Root cause: No context.WithTimeout on the test. coord.Start(ctx) blocks indefinitely if etcd is slow. Uses time.Sleep for synchronization instead of readiness polls.

Fix: Add a test-level context timeout (e.g., 2 minutes). Replace time.Sleep barriers with readiness polls.

Summary

Priority Pattern Runs Affected Fix Complexity
P0 WaitForAppReady fatals on transient crash 3 Low
P0 Periodic timestamp precision race 1 Low (one-liner)
P1 Watchdog entity reference race 3 Low
P2 Etcd restart timeout too short 1 Low
P2 Runner test no timeout 1 Low

All fixes are low-complexity and independent of each other — could be done in a single PR or broken up per-pattern.