Deployment Acknowledgement via WebSocket #1287
Replies: 3 comments 1 reply
+1 for the approach. Minor questions.
Update: Separating Desired State from Actual State
We had a further discussion regarding how we handle the deployment state in the platform API (participants: @malinthaprasan @renuka-fernando @dushaniw).
The To make this explicit, we're introducing a
This separation makes it easier to answer two distinct questions independently:
When a gateway restarts and requests to fetch the deployments it needs, it can rely on the
Release Steps
The production
Step 1: Database Migration
Additive schema changes only, no code deployed.
Step 2: UI Backward-Compatibility Update
Must be released before Step 5.
Step 3: Code Changes
Platform-API:
Gateway:
Step 4: Release Gateway
Step 5: Release and Deploy Platform-API
Prerequisite: Step 2 UI must already be in production.
Step 6: UI Final Update
Step 7: Switch Timeout Job to Strict Mode
Deployment Acknowledgement via WebSocket
Summary
Add deployment acknowledgement via WebSocket so the control plane accurately tracks whether a deployment actually succeeded on the gateway, rather than optimistically marking it as deployed.
Problem
Today, when the control plane pushes a deployment event (API, LLM Provider, or LLM Proxy) to a gateway via WebSocket, it immediately sets the status to DEPLOYED, before the gateway has actually processed it. There is no feedback loop: the status reads DEPLOYED regardless of what happened on the gateway.
This affects all WebSocket-based deployment types: REST APIs, LLM Providers, and LLM Proxies.
Proposed Solution
New Deployment States
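Introduce intermediate and terminal states for the deployment lifecycle: DEPLOYING and UNDEPLOYING as intermediate states, and FAILED as a terminal state alongside DEPLOYED and UNDEPLOYED.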
Acknowledgement Flow
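1. The control plane sets the status to DEPLOYING and writes performed_at = NOW() to deployment_status.
2. The control plane pushes the event to the gateway (including deploymentId and performedAt).
3. The gateway processes the event and sends a deployment.ack message back through the same WebSocket connection with:
   - deploymentId, artifactId, resourceType (api/llmprovider/llmproxy)
   - performedAt (echoed back unchanged)
   - status (success/failed), errorMessage (on failure)
4. The control plane compares performedAt against the DB; if matching:
   - Failure ack: set FAILED unconditionally with status_reason from errorMessage (overwrites any prior status for this performedAt)
   - Success ack: transition DEPLOYING → DEPLOYED only; discard if status is already FAILED

Scenarios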
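The scenarios below all follow from two mechanisms: the ack-handling rules in the Acknowledgement Flow and the timeout sweep described in scenario 3. The following is a minimal in-memory sketch of both, not the platform's implementation. The names `DeploymentStatus`, `handle_ack`, and `sweep_timeouts` are hypothetical, the row is reduced to three fields, and the UNDEPLOYING → UNDEPLOYED success transition is assumed by analogy with the deploy case.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

IN_PROGRESS = {"DEPLOYING", "UNDEPLOYING"}


@dataclass
class DeploymentStatus:
    """Simplified, hypothetical view of one deployment_status row."""
    status: str
    performed_at: datetime
    status_reason: Optional[str] = None


def handle_ack(row, performed_at, ack_status, error_message=None):
    """Apply a deployment.ack to the row; return True if the ack was applied."""
    if performed_at != row.performed_at:
        return False  # stale or superseded ack: discard
    if ack_status == "failed":
        # Any failure wins, even over an earlier success for this attempt.
        row.status, row.status_reason = "FAILED", error_message
        return True
    if row.status == "DEPLOYING":
        row.status = "DEPLOYED"
        return True
    if row.status == "UNDEPLOYING":  # assumed mirror of the deploy case
        row.status = "UNDEPLOYED"
        return True
    return False  # success after FAILED, or a duplicate success: discard


def sweep_timeouts(rows, now, timeout=timedelta(minutes=5)):
    """Mark in-progress rows older than the timeout as FAILED."""
    for row in rows:
        if row.status in IN_PROGRESS and row.performed_at < now - timeout:
            row.status, row.status_reason = "FAILED", "deployment timed out"


# Three replicas ack the same attempt; one of them fails.
t0 = datetime(2025, 1, 1, 12, 0, 0)
row = DeploymentStatus("DEPLOYING", t0)
handle_ack(row, t0, "success")
handle_ack(row, t0, "failed", "bad cert")
handle_ack(row, t0, "success")
print(row.status, row.status_reason)  # FAILED bad cert
```

Because every decision is keyed on performed_at and the current status, the handler needs no per-replica counters, which matches the reasoning in scenarios 6 to 8.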
1. Successful Deployment
Scenario: User deploys an API/LLM Provider/LLM Proxy to a gateway.
Flow: Control plane sets status to DEPLOYING with performed_at = NOW() → pushes event to gateway via WebSocket → gateway processes the deployment → gateway sends deployment.ack with status=success and the same performedAt → control plane compares performedAt, matches, transitions status to DEPLOYED.

    sequenceDiagram
        actor User
        participant CP as Control Plane
        participant DB as Database
        participant GW as Gateway
        User->>CP: Deploy API
        CP->>DB: SET status=DEPLOYING, performed_at=NOW()
        CP->>GW: (WebsocketEvent) deploy {deploymentId, performedAt, ...}
        GW->>CP: (REST) GET /api/internal/v1/apis/{apiId}
        CP-->>GW: API definition (zip)
        GW->>GW: Apply config & update policy engine
        GW->>CP: (WebsocketEvent) deployment.ack {deploymentId, performedAt, status=success}
        CP->>DB: SELECT performed_at WHERE artifact+gateway
        DB-->>CP: performed_at (matches)
        CP->>DB: SET status=DEPLOYED
        CP-->>User: Status: DEPLOYED

2. Successful Undeployment
Scenario: User removes an API/LLM Provider/LLM Proxy from a gateway.
Flow: Control plane sets status to UNDEPLOYING with performed_at = NOW() → pushes undeployment event → gateway removes the resource → gateway sends deployment.ack with status=success → control plane transitions status to UNDEPLOYED.
3. No Acknowledgement (Gateway Timeout)
Scenario: Deployment pushed (status: DEPLOYING) but the gateway never responds. This could be due to a crash, network partition, or processing hang.
Solution: A background job runs every 1 minute, queries for DEPLOYING/UNDEPLOYING entries older than 5 minutes (configurable), and marks them as FAILED.

    sequenceDiagram
        participant CP as Control Plane
        participant DB as Database
        participant GW as Gateway
        participant BG as Timeout Job
        CP->>DB: SET status=DEPLOYING, performed_at=NOW()
        CP->>GW: WS event {deploymentId, performedAt, ...}
        note over GW: Gateway crashes / hangs / network lost
        GW--xCP: (no ack)
        loop Every 1 minute
            BG->>DB: SELECT WHERE status IN (DEPLOYING, UNDEPLOYING)<br/>AND performed_at < NOW() - 5min
            DB-->>BG: stale entries
            BG->>DB: SET status=FAILED, status_reason="deployment timed out"
        end

4. Failed Deployment with Existing Deployed State
Scenario: A second deployment attempt fails while a previous deployment is already running on the gateway.
Solution: deployment_status shows the latest deployment with status=FAILED. The gateway still runs the previous deployment's configuration since the new attempt never succeeded. The user can:
5. Gateway Disconnect During Deployment
Scenario: Gateway receives the event, starts processing, then disconnects before sending ack.
Solution: The timeout background job catches this and marks the deployment as FAILED after the timeout period. On reconnect, the gateway can optionally reconcile its state (future enhancement).
6. New Deployment While Previous is In-Progress
Scenario: User triggers a new deployment (D2) while a previous one (D1) is still DEPLOYING.
Solution: Allow D2 to proceed. It overwrites the in-progress entry with a new performed_at. When D1's ack arrives carrying the old performed_at, the handler detects a mismatch and discards it. No rejection, no queuing needed.
7. Multiple Gateway Connections (Clustering)
Scenario: A gateway has multiple WebSocket connections (e.g., 3 replicas in a cluster). The deployment event is broadcast to all connections. 1 replica fails, 2 succeed.
Solution: Any failure = FAILED. A failure ack overwrites DEPLOYING or even DEPLOYED to FAILED as long as performed_at matches. A success ack only transitions DEPLOYING → DEPLOYED; if status is already FAILED, the success is discarded. No counters or extra tables are needed. The intermediate states depend on arrival order, but the outcome does not:
- If the failure ack arrives first: status becomes FAILED; the two success acks are discarded
- If a success ack arrives first: status becomes DEPLOYED; the failure ack still overwrites to FAILED
Either way, a single failing replica causes the final status to be FAILED.
8. Stale Acknowledgements
Scenario: Gateway sends an ack for a deployment that has already timed out and been marked FAILED, or for a superseded deployment.
Solution: The ack handler compares performedAt from the ack against the current DB value. A mismatch means the ack is stale and is discarded. Discarded acks are logged for observability.
Scope
- REST APIs (api.deployed, api.undeployed)
- LLM Providers (llmprovider.deployed, llmprovider.undeployed)
- LLM Proxies (llmproxy.deployed, llmproxy.undeployed)
Changes Overview
- Add DEPLOYING, UNDEPLOYING, FAILED to status CHECK constraint; add performed_at and status_reason columns to deployment_status
- readLoop: route deployment.ack to service
- Set DEPLOYING/UNDEPLOYING instead of final status; allow concurrent deployments (new overwrites in-progress)
- HandleAck method: handle acknowledgements, update status
- sendMessage/sendDeploymentAck methods with write mutex
- Add deploymentId and performedAt to undeployment events (deployment events already carry deploymentId; add performedAt to all)
Future Work
Deployment History Tracking
Introduce a deployment_events table to record every state transition (DEPLOYING, DEPLOYED, UNDEPLOYING, UNDEPLOYED, FAILED) with timestamps and error messages. This would provide a full audit trail of all deployment attempts per artifact per gateway, allowing users to query the complete history of deployments.
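As a rough illustration of what such a history table could look like, here is a sketch using an in-memory SQLite database. Only the table name deployment_events and the five states come from the proposal; every column name, the `record_event` and `history` helpers, and the choice of SQLite are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE deployment_events (
        id            INTEGER PRIMARY KEY AUTOINCREMENT,
        artifact_id   TEXT NOT NULL,   -- hypothetical column
        gateway_id    TEXT NOT NULL,   -- hypothetical column
        status        TEXT NOT NULL CHECK (status IN
            ('DEPLOYING', 'DEPLOYED', 'UNDEPLOYING', 'UNDEPLOYED', 'FAILED')),
        error_message TEXT,            -- set only for FAILED transitions
        occurred_at   TEXT NOT NULL    -- ISO 8601 timestamp
    )
    """
)


def record_event(artifact_id, gateway_id, status, occurred_at, error_message=None):
    """Append one state transition; existing rows are never updated."""
    conn.execute(
        "INSERT INTO deployment_events "
        "(artifact_id, gateway_id, status, error_message, occurred_at) "
        "VALUES (?, ?, ?, ?, ?)",
        (artifact_id, gateway_id, status, error_message, occurred_at),
    )


def history(artifact_id, gateway_id):
    """Return all transitions for one artifact on one gateway, oldest first."""
    return conn.execute(
        "SELECT status, occurred_at, error_message FROM deployment_events "
        "WHERE artifact_id = ? AND gateway_id = ? ORDER BY id",
        (artifact_id, gateway_id),
    ).fetchall()


record_event("api-demo", "gw-demo", "DEPLOYING", "2025-01-01T12:00:00Z")
record_event("api-demo", "gw-demo", "DEPLOYED", "2025-01-01T12:00:03Z")
for status, occurred_at, error in history("api-demo", "gw-demo"):
    print(status, occurred_at, error)
```

Keeping the table append-only means deployment_status can continue to hold just the latest attempt, while the history is reconstructed from the events.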