MIR-681

Server-owned deployment lifecycle

Open public
phinze opened Feb 10, 2026, updated Jul 2, 2026

Problem

The deployment tracking subsystem currently relies on the CLI client to orchestrate the entire deployment lifecycle. In deploy.go, the client:

  1. Creates a deployment record with a "pending-build" placeholder version
  2. Polls for external cancellation
  3. Updates phase to "building", "pushing", "activating" as the build progresses
  4. Updates the app version ID after build completes
  5. Marks the deployment as "active"
  6. On failure, calls UpdateFailedDeployment with error details

This means:

  • The server is a dumb CRUD store — it accepts whatever the client tells it with no way to verify the information is accurate
  • Client crashes leave stale records — the only safety net is a 30-minute lock timeout
  • The deployment lock is racy: CreateDeployment checks for existing in-progress deployments, then creates a new one, in two separate operations with no transactional guarantee
  • listDeploymentsInternal is O(n) over all deployments — it loads every deployment entity ever created and filters in memory, called on every history query, lock check, and activation
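
To make the lock race concrete, here is a minimal sketch of a check-and-create that cannot be interleaved. It is not the existing code: `Store`, `CreateIfIdle`, `Finish`, and `ErrDeployInProgress` are hypothetical names, the lock scope (app + cluster) is an assumption, and the in-memory map stands in for whatever transaction or compare-and-swap the real entity store offers.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// ErrDeployInProgress is a hypothetical error for a held deployment lock.
var ErrDeployInProgress = errors.New("deployment already in progress")

// lockKey is the assumed scope of the lock: one in-progress deployment
// per app and cluster.
type lockKey struct{ App, Cluster string }

// Store is an in-memory stand-in for the real entity store.
type Store struct {
	mu       sync.Mutex
	inFlight map[lockKey]string // lock scope -> in-progress deployment ID
	nextID   int
}

func NewStore() *Store {
	return &Store{inFlight: make(map[lockKey]string)}
}

// CreateIfIdle performs the "is anything in progress?" check and the create
// in one critical section, so two concurrent callers cannot both succeed.
func (s *Store) CreateIfIdle(app, cluster string) (string, error) {
	s.mu.Lock()
	defer s.mu.Unlock()

	k := lockKey{App: app, Cluster: cluster}
	if id, busy := s.inFlight[k]; busy {
		return "", fmt.Errorf("%w: %s", ErrDeployInProgress, id)
	}
	s.nextID++
	id := fmt.Sprintf("deploy-%d", s.nextID)
	s.inFlight[k] = id
	return id, nil
}

// Finish releases the lock once a deployment reaches a terminal phase.
func (s *Store) Finish(app, cluster string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	delete(s.inFlight, lockKey{App: app, Cluster: cluster})
}

func main() {
	s := NewStore()
	var wg sync.WaitGroup
	results := make(chan error, 8)

	// Eight clients race to start a deployment for the same app and cluster.
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			_, err := s.CreateIfIdle("web", "prod")
			results <- err
		}()
	}
	wg.Wait()
	close(results)

	won := 0
	for err := range results {
		if err == nil {
			won++
		}
	}
	fmt.Println("deployments created:", won) // prints "deployments created: 1"
}
```

Keying in-progress deployments by lock scope also makes the lock check a single lookup instead of a scan over every deployment, which is the same concern raised for listDeploymentsInternal.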

Desired state

The deployment record should be a byproduct of the server-side build/deploy process, not something the client creates and babysits:

  • Client sends Deploy(app, tar, git_info) and gets back a progress stream
  • Server creates the deployment record, transitions it through phases as the build actually progresses, and activates it when the image is running
  • Client is just a viewer of server-managed state
  • Rollback is a server-side operation: Rollback(app, cluster, target_version) — no build needed, server has all the context
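
The shape described above could look roughly like the following sketch. All type and method names here (`Server`, `DeployRequest`, `RollbackRequest`, `Progress`, `PhaseOf`) are illustrative assumptions rather than the real RPC definitions, and a Go channel stands in for the progress stream. The point it tries to show is that the server writes the record and walks the phases itself, while the client only reads.

```go
package main

import (
	"fmt"
	"sync"
)

type Phase string

const (
	PhaseBuilding   Phase = "building"
	PhasePushing    Phase = "pushing"
	PhaseActivating Phase = "activating"
	PhaseActive     Phase = "active"
)

// DeployRequest mirrors Deploy(app, tar, git_info).
type DeployRequest struct {
	App     string
	Tar     []byte
	GitInfo string
}

// RollbackRequest mirrors Rollback(app, cluster, target_version).
type RollbackRequest struct {
	App, Cluster, TargetVersion string
}

// Progress is one event on the stream the client watches.
type Progress struct {
	DeploymentID string
	Phase        Phase
}

// Server owns the deployment records; clients never write them.
type Server struct {
	mu      sync.Mutex
	records map[string]Phase
}

func NewServer() *Server { return &Server{records: make(map[string]Phase)} }

func (s *Server) Deploy(req DeployRequest) <-chan Progress {
	return s.run(req.App, []Phase{PhaseBuilding, PhasePushing, PhaseActivating, PhaseActive})
}

// Rollback needs no build, so it starts at activation.
func (s *Server) Rollback(req RollbackRequest) <-chan Progress {
	return s.run(req.App, []Phase{PhaseActivating, PhaseActive})
}

func (s *Server) run(app string, phases []Phase) <-chan Progress {
	s.mu.Lock()
	id := fmt.Sprintf("%s-%d", app, len(s.records)+1)
	s.records[id] = phases[0]
	s.mu.Unlock()

	// Buffered to the number of phases, so a slow or vanished viewer
	// never blocks the server-side lifecycle.
	out := make(chan Progress, len(phases))
	go func() {
		defer close(out)
		for _, p := range phases {
			// Real build, push, or activation work would happen here.
			s.mu.Lock()
			s.records[id] = p
			s.mu.Unlock()
			out <- Progress{DeploymentID: id, Phase: p}
		}
	}()
	return out
}

// PhaseOf reads the server-managed state for a deployment.
func (s *Server) PhaseOf(id string) Phase {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.records[id]
}

func main() {
	s := NewServer()
	for p := range s.Deploy(DeployRequest{App: "web", GitInfo: "main@abc123"}) {
		fmt.Println(p.DeploymentID, p.Phase)
	}
}
```

Because the stream is only a view, a client that crashes mid-deploy leaves nothing to clean up: the record still reaches a terminal phase on the server.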

Specific concerns to address

  1. Move deployment lifecycle server-side — the build service (or a coordinating deploy service) should own the deployment record lifecycle
  2. Implicit state machine — valid transitions are scattered across UpdateDeploymentStatus, CancelDeployment, UpdateFailedDeployment, and the expired-lock cleanup in CreateDeployment. Centralize into a transition(from, to) function
  3. Inconsistent error patterns: CancelDeployment returns errors as result fields (results.SetError()), while other methods return RPC-level errors (cond.ValidationFailure). Pick one pattern
  4. "pending-build" sentinel — app_version_id should be optional on creation and required on activation, rather than using a magic string
  5. Full-scan listing — investigate indexed queries or a compaction/archival strategy for old deployments
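
As a starting point for items 2 and 4, a centralized transition table might look like the sketch below. The phase set is partly an assumption: "building", "pushing", "activating", and "active" come from the current flow, while "pending", "failed", and "cancelled" are guesses at how the initial and terminal states would be named. `Deployment` and `Advance` are hypothetical, and the nil pointer stands in for an optional app_version_id in place of the "pending-build" string.

```go
package main

import (
	"errors"
	"fmt"
)

type Phase string

const (
	Pending    Phase = "pending" // assumed initial phase
	Building   Phase = "building"
	Pushing    Phase = "pushing"
	Activating Phase = "activating"
	Active     Phase = "active"
	Failed     Phase = "failed"    // assumed terminal phase
	Cancelled  Phase = "cancelled" // assumed terminal phase
)

// allowed is the single place where valid transitions are declared.
// Terminal phases have no entry, so nothing can leave them.
var allowed = map[Phase][]Phase{
	Pending:    {Building, Failed, Cancelled},
	Building:   {Pushing, Failed, Cancelled},
	Pushing:    {Activating, Failed, Cancelled},
	Activating: {Active, Failed, Cancelled},
}

// transition reports whether moving from one phase to another is valid.
func transition(from, to Phase) error {
	for _, next := range allowed[from] {
		if next == to {
			return nil
		}
	}
	return fmt.Errorf("invalid transition %q -> %q", from, to)
}

// Deployment keeps AppVersionID optional: nil until the build produces one.
type Deployment struct {
	Phase        Phase
	AppVersionID *string
}

// Advance applies a transition and enforces that activation has a version.
func (d *Deployment) Advance(to Phase) error {
	if err := transition(d.Phase, to); err != nil {
		return err
	}
	if to == Active && d.AppVersionID == nil {
		return errors.New("app_version_id is required on activation")
	}
	d.Phase = to
	return nil
}

func main() {
	d := &Deployment{Phase: Pending}
	fmt.Println(d.Advance(Building)) // <nil>
	fmt.Println(d.Advance(Active))   // invalid transition "building" -> "active"
}
```

Cancel, fail, and expired-lock cleanup would all go through `Advance`, so every path returns the same kind of error, which also gives item 3 a natural single pattern.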

Timing

This ties naturally into the saga work that's reworking the build server. The rollback path (PR 2) will be implemented as a fully server-side RPC, which can serve as the model for how forward-deploy should eventually work.

Related

  • Cluster ID filter bug fix (shipped)
  • app history column improvements (shipped)
  • Rollback PR (next, building on server-side pattern)