ACR push happens to production slot, not staging slot #276

nmiodice · 2019-09-05T21:42:51Z

Description

As an application developer, I'd like to deploy new instances of my application to a staging slot so that I can validate them before releasing to production and quickly roll back after a bad deployment.

Acceptance Criteria

Reference: [Done-Done Checklist](https://github.com/Microsoft/code-with-engineering-playbook/blob/master/Engineering/BestPractices/DoneDone.md)

Deployment via the ACR webhook should target the staging slot, not the production slot.

nmiodice · 2019-09-24T18:33:29Z

The following issues were identified with the ACR webhook approach, so it will be removed as part of #316.

Failure modes for ACR webhook & Web App CD

Each scenario below assumes the following:

“A” is a web app. In each scenario it starts as the production slot.
“B” is a web app. In each scenario it starts as the staging slot.

Scenario 1 – Failure to roll back application deployments via swap:

“A” is running image with image hash “foo”
“B” is running image with image hash “bar”
“A” and “B” are swapped
“A” (now staging) unexpectedly picks up the latest image (“bar”). I’m not sure why.
Image hash “bar” is determined to have a bug; a rollback is needed and is done by re-swapping the slots
“A” (now production) is incorrectly running “bar”
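
The sequence in Scenario 1 can be modeled with a toy simulation. This is only a sketch, assuming each slot tracks a mutable tag and re-pulls whatever that tag points to when it restarts. The `Registry` and `Slot` classes are hypothetical and do not correspond to any Azure API.

```python
from dataclasses import dataclass


@dataclass
class Registry:
    """Hypothetical stand-in for a registry exposing one mutable tag."""
    latest: str


@dataclass
class Slot:
    """Hypothetical web app; `running` is the image hash it currently serves."""
    name: str
    running: str

    def restart(self, registry: Registry) -> None:
        # Assumption: a restart re-pulls whatever the tag points to now.
        self.running = registry.latest


def scenario_1() -> str:
    registry = Registry(latest="bar")
    a = Slot("A", running="foo")   # starts as production
    b = Slot("B", running="bar")   # starts as staging
    production, staging = a, b

    # Swap: "B" (running "bar") becomes production, "A" becomes staging.
    production, staging = staging, production
    # "A" (now staging) unexpectedly picks up the latest image.
    staging.restart(registry)

    # "bar" turns out to be buggy; roll back by re-swapping the slots.
    production, staging = staging, production
    return production.running


print(scenario_1())  # prints "bar": the rollback did not restore "foo"
```

In this model the rollback fails because nothing pins "A" to the image it was running before the swap.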

Scenario 2 – Production is already down in this case:

“A” is running a misconfigured container that fails to start (i.e., the process dies on startup due to a bug in application code). App Service will retry indefinitely, pulling and deploying the image until it starts up successfully.
ACR push occurs; “B” picks up new image
On a retry attempt, “A” picks up the latest image and it is deployed unexpectedly to production

Scenario 3 – Possibly a very delayed impact:

“A” is running image with image hash “foo”
ACR push occurs; “B” is running image with image hash “bar”
“A” (re)starts for some reason (you can do this via the portal, or perhaps a backend server dies, or perhaps the service plan scales out); “A” picks up the latest image and it is deployed unexpectedly to production.
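
The delayed impact in Scenario 3 can be illustrated by replaying events against a small model. This is a sketch under the same assumption as the scenario text, namely that a restarting slot pulls whatever the tag currently points to. The `replay` function and its event names are hypothetical.

```python
def replay(events):
    """Replay (kind, arg) events and return the image hash each slot runs."""
    latest = "foo"                       # what the mutable tag points to
    running = {"A": "foo", "B": "foo"}   # A = production, B = staging
    for kind, arg in events:
        if kind == "push":
            latest = arg
            running["B"] = latest        # webhook redeploys staging only
        elif kind == "restart":
            # Assumption: a restarting slot pulls whatever the tag points to now.
            running[arg] = latest
    return running


before = replay([("push", "bar")])
after = replay([("push", "bar"), ("restart", "A")])
print(before["A"], after["A"])  # prints "foo bar"
```

Production looks healthy right after the push and only changes once "A" restarts, which may happen much later.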

Scenario 4 (low risk) – The first deployment:

“A” and “B” are newly deployed; no image is running in either slot
“A” and “B” repeatedly try to start (indefinitely?)
ACR push occurs; webhook for “B” fires
On a retry attempt, “A” picks up the latest image and it is deployed unexpectedly to production (note: similar to Scenario 1)

nmiodice added the bug label Sep 5, 2019

ianphil added the userstory-wip and pri-High labels Sep 23, 2019

nmiodice self-assigned this Sep 23, 2019

nmiodice removed the userstory-wip label Sep 23, 2019

nmiodice closed this as completed Sep 24, 2019
