ACR push happens to production slot, not staging slot #276

Closed
1 task
nmiodice opened this issue Sep 5, 2019 · 1 comment
Labels: bug (Something isn't working), pri-High (High priority issue)

Comments

nmiodice commented Sep 5, 2019

Description

As an application developer, I'd like to deploy new instances of my application to a staging slot so that I can validate them before releasing to production and quickly roll back in the case of a bad deployment.

Acceptance Criteria

Reference: [Done-Done Checklist](https://github.com/Microsoft/code-with-engineering-playbook/blob/master/Engineering/BestPractices/DoneDone.md)

  • Deployment via ACR web hook should happen to staging slot
@nmiodice nmiodice added the bug Something isn't working label Sep 5, 2019
@ianphil ianphil added userstory-wip pri-High High priority issue labels Sep 23, 2019
@nmiodice nmiodice self-assigned this Sep 23, 2019
nmiodice (Author) commented

The following issues were identified with the ACR Web Hook approach, so it will be removed as a part of #316.

Failure modes for ACR webhook & Web App CD

Each scenario below assumes the following:

  • “A” is a webapp. In each scenario it starts as the production slot
  • “B” is a webapp. In each scenario it starts as the staging slot

Scenario 1 – Failure to rollback application deployments via swap:

  • “A” is running image with image hash “foo”
  • “B” is running image with image hash “bar”
  • “A” and “B” are swapped
  • “A” (now staging) unexpectedly picks up the latest image (“bar”). I’m not sure why.
  • Image hash “bar” is determined to have a bug; a rollback is needed and is done by re-swapping the slots
  • “A” (now production) is incorrectly running “bar” instead of “foo”, so the rollback has no effect
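
The sequence in Scenario 1 can be replayed with a toy model. This is only a sketch: `Slot`, `swap`, and `repull_latest` are hypothetical names, not App Service APIs, and the re-pull step simply encodes the unexplained behavior observed above.

```python
from dataclasses import dataclass


@dataclass
class Slot:
    """Hypothetical stand-in for a Web App slot; not an Azure API."""
    name: str
    role: str   # "production" or "staging"
    image: str  # image hash the slot is currently running


def swap(a: Slot, b: Slot) -> None:
    """Exchange the production/staging roles of two slots."""
    a.role, b.role = b.role, a.role


def repull_latest(slot: Slot, latest: str) -> None:
    """Assumed behavior from the issue: the slot picks up the newest image."""
    slot.image = latest


def production_image(*slots: Slot) -> str:
    """Return the image hash served by whichever slot is production."""
    return next(s.image for s in slots if s.role == "production")


def scenario_1() -> list:
    latest = "bar"
    a = Slot("A", "production", "foo")
    b = Slot("B", "staging", "bar")
    history = [production_image(a, b)]      # before the release
    swap(a, b)                              # release: "B" becomes production
    repull_latest(a, latest)                # "A" (now staging) unexpectedly pulls "bar"
    history.append(production_image(a, b))
    swap(a, b)                              # rollback: "A" becomes production again
    history.append(production_image(a, b))
    return history


print(scenario_1())  # ['foo', 'bar', 'bar']
```

The last entry is the point of the scenario: after the rollback swap, production still serves “bar” because the old production slot no longer holds “foo”.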

Scenario 2 – Production is already down in this case:

  • “A” is running a misconfigured container that fails to start (i.e., the process dies on startup due to a bug in application code). App Service will retry indefinitely, pulling and deploying until the container starts successfully.
  • ACR push occurs; “B” picks up new image
  • On a retry attempt, “A” picks up the latest image and it is deployed unexpectedly to production
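
A rough sketch of the retry loop in Scenario 2, under stated assumptions: the image names, the attempt on which the push lands, and the cap of five attempts are all made up for illustration (the issue says App Service retries indefinitely), and nothing here calls a real Azure API.

```python
def scenario_2(push_on_attempt: int = 3, max_attempts: int = 5) -> list:
    """Replay the retry loop of production slot "A" and record each image it pulls."""
    latest = {"hash": "broken"}   # hypothetical image whose process dies on startup
    healthy = {"bar"}             # hypothetical set of images that start successfully
    pulled = []

    for attempt in range(1, max_attempts + 1):
        if attempt == push_on_attempt:
            latest["hash"] = "bar"        # ACR push intended for staging slot "B"
        image = latest["hash"]            # "A" pulls whatever is latest on each retry
        pulled.append(image)
        if image in healthy:
            break                         # "A" (production) now runs the new image
    return pulled


print(scenario_2())  # ['broken', 'broken', 'bar']
```

The new image reaches production without any swap, purely because the retry happened to run after the push.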

Scenario 3 – Possibly a very delayed impact:

  • “A” is running image with image hash “foo”
  • ACR push occurs; “B” is running image with image hash “bar”
  • “A” (re)starts for some reason (for example, a manual restart from the portal, a backend server failure, or a service plan scale-out); “A” picks up the latest image, which is deployed unexpectedly to production.
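
One way to picture Scenario 3 is a slot that looks up the newest image every time it starts. The sketch below assumes that mechanism (a mutable `latest` tag resolved on each start); the issue does not confirm it, and `Registry` and `Slot` are hypothetical classes rather than ACR or App Service APIs.

```python
class Registry:
    """Hypothetical registry mapping a mutable tag to the last pushed image hash."""

    def __init__(self):
        self.tags = {}

    def push(self, tag, image_hash):
        self.tags[tag] = image_hash

    def resolve(self, tag):
        return self.tags[tag]


class Slot:
    """Hypothetical slot that is configured with a tag, not a fixed image hash."""

    def __init__(self, registry, tag):
        self.registry = registry
        self.tag = tag
        self.running = None

    def start(self):
        # Assumption: every (re)start resolves the tag again.
        self.running = self.registry.resolve(self.tag)
        return self.running


def scenario_3():
    registry = Registry()
    registry.push("latest", "foo")
    a = Slot(registry, "latest")      # production
    b = Slot(registry, "latest")      # staging
    a.start()                         # "A" runs "foo"
    registry.push("latest", "bar")    # ACR push
    b.start()                         # webhook fires for "B" only
    before_restart = a.running
    a.start()                         # restart, host failure, or scale-out
    return before_restart, a.running, b.running


print(scenario_3())  # ('foo', 'bar', 'bar')
```

The gap between the push and the restart can be arbitrarily long, which is why the impact may show up well after the deployment.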

Scenario 4 (low risk) – The first deployment:

  • “A” and “B” are newly deployed; no image is running in either slot
  • “A” and “B” try to start (indefinitely?) repeatedly
  • ACR push occurs; webhook for “B” fires
  • On a retry attempt, “A” picks up the latest image and it is deployed unexpectedly to production (note: similar to Scenario 1)
