Server-owned deployment lifecycle
Problem
The deployment tracking subsystem currently relies on the CLI client to orchestrate the entire deployment lifecycle. In deploy.go, the client:
- Creates a deployment record with a "pending-build" placeholder version
- Polls for external cancellation
- Updates phase to "building", "pushing", "activating" as the build progresses
- Updates the app version ID after build completes
- Marks the deployment as "active"
- On failure, calls UpdateFailedDeployment with error details
This means:
- The server is a dumb CRUD store: it accepts whatever the client tells it and has no way to verify that the information is accurate
- Client crashes leave stale records: the only safety net is a 30-minute lock timeout
- The deployment lock is racy: CreateDeployment checks for existing in-progress deployments and then creates a new one, in two separate operations with no transactional guarantee
- listDeploymentsInternal is O(n) over all deployments: it loads every deployment entity ever created and filters in memory, and it is called on every history query, lock check, and activation
Desired state
The deployment record should be a byproduct of the server-side build/deploy process, not something the client creates and babysits:
- Client sends Deploy(app, tar, git_info) and gets back a progress stream
- Server creates the deployment record, transitions it through phases as the build actually progresses, and activates it when the image is running
- Client is just a viewer of server-managed state
- Rollback is a server-side operation: Rollback(app, cluster, target_version). No build is needed because the server has all the context
Specific concerns to address
- Move deployment lifecycle server-side: the build service (or a coordinating deploy service) should own the deployment record lifecycle
- Implicit state machine: valid transitions are scattered across UpdateDeploymentStatus, CancelDeployment, UpdateFailedDeployment, and the expired-lock cleanup in CreateDeployment. Centralize into a transition(from, to) function
- Inconsistent error patterns: CancelDeployment returns errors as result fields (results.SetError()), other methods return RPC-level errors (cond.ValidationFailure). Pick one pattern
- "pending-build" sentinel: app_version_id should be optional on creation and required on activation, rather than using a magic string
- Full-scan listing: investigate indexed queries or a compaction/archival strategy for old deployments
Timing
This ties naturally into the saga work that's reworking the build server. The rollback path (PR 2) will be implemented as a fully server-side RPC, which can serve as the model for how forward-deploy should eventually work.
Related
- Cluster ID filter bug fix (shipped)
- app history column improvements (shipped)
- Rollback PR (next, building on server-side pattern)