ENG-883 Add ECR support proposal. #78
3. The Lambda function lists all repositories in the ECR registry.
4. For each repository, the function lists all image tags.
5. For each image tag (or a subsection of tags), the function pulls the image using the Docker CLI.
6. The function then pushes the image to a designated S3 bucket in the source account, organizing images by repository and tag for easy retrieval.
Rather than hand-crank this, we'd be better off trying to leverage https://github.com/containers/skopeo?tab=readme-ov-file#syncing-registries or similar. If we can get an s3fs-fuse mount into a container lambda, it should Just Work. We might want to get a spike in to prove it out, but I'd much rather not have to build this bit ourselves.
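As a starting point for such a spike, here is a minimal sketch of how the `skopeo sync` invocation might be assembled from Python. It only builds the argument list; it does not run skopeo. The registry host, repository name, and mount path are hypothetical placeholders, and ECR authentication is deliberately left out. Whether a `dir` destination on an s3fs-fuse mount behaves well inside a container Lambda is exactly what the spike would need to prove.

```python
import shlex
from typing import List


def build_skopeo_sync_command(registry: str, repository: str, dest_dir: str) -> List[str]:
    """Return the argv for syncing one registry repository into a local directory."""
    source = f"{registry}/{repository}"
    return [
        "skopeo", "sync",
        "--src", "docker",  # source transport: a container registry
        "--dest", "dir",    # destination transport: a plain directory
        source,
        dest_dir,
    ]


if __name__ == "__main__":
    cmd = build_skopeo_sync_command(
        "123456789012.dkr.ecr.eu-west-2.amazonaws.com",  # hypothetical registry host
        "my-service",                                    # hypothetical repository
        "/mnt/ecr-backup",                               # hypothetical s3fs-fuse mount point
    )
    # Print the command for inspection rather than executing it.
    print(shlex.join(cmd))
```

Keeping the command construction separate from execution makes it easy to check in a unit test before anything touches a real registry.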
2. The schedule event triggers an AWS Lambda function.
3. The Lambda function lists all repositories in the ECR registry.
4. For each repository, the function lists all image tags.
5. For each image tag (or a subsection of tags), the function pulls the image using the Docker CLI.
How would an efficient "incremental" backup cohort be identified? Would the backup complete within Lambda's 15-minute limit? I don't know how large some people's images are, but there's a 10 GB ephemeral storage limit in Lambda, so for some that might be exceeded, or at least some housekeeping would be needed.
If we use skopeo, this shouldn't be a problem. It'll do the copy in layers (which is the unit of increment we have available).
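To make "layers as the unit of increment" concrete, here is a small sketch of how an incremental cohort could be computed by comparing layer digests. It illustrates the idea only and is not a description of skopeo's internals. The function name and the input shapes (a mapping of tag to layer digests, plus the digests already stored) are assumptions made for this example.

```python
from typing import Dict, Iterable, List


def layers_to_copy(image_layers: Dict[str, List[str]], stored: Iterable[str]) -> List[str]:
    """Return layer digests that are not yet in the backup, each listed once.

    image_layers maps an image tag to its ordered layer digests (hypothetical shape).
    stored is the collection of layer digests already present in the backup.
    """
    seen = set(stored)
    missing: List[str] = []
    for tag in sorted(image_layers):  # sorted for a deterministic result
        for digest in image_layers[tag]:
            if digest not in seen:
                seen.add(digest)      # a layer shared by several tags is copied once
                missing.append(digest)
    return missing


if __name__ == "__main__":
    images = {
        "v1": ["sha256:aaa", "sha256:bbb"],
        "v2": ["sha256:aaa", "sha256:ccc"],
    }
    print(layers_to_copy(images, stored=["sha256:aaa"]))
```

With shared base layers already stored, only the layers unique to new tags would need to move, which is what keeps each run small.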
How do you do this in MESH?
## Step-by-Step Implementation

### Stage 1: ECR to S3 Backup
Just for reference, on MESH, when we push an image to ECR, we also push the tarball to S3 - the build will fail if both aren't completed.
### Considerations

* ECR replication does not provide immutability; images can be deleted or overwritten.
There's https://docs.aws.amazon.com/AmazonECR/latest/userguide/image-tag-mutability.html - but yes, a bad actor with admin access could delete images.
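For reference, tag immutability can be switched on per repository through the ECR `PutImageTagMutability` API. The sketch below assumes a boto3 ECR client is passed in; the function name and the repository list are hypothetical. As noted above, this only stops tags being overwritten and does not stop an admin from deleting images.

```python
from typing import Any, Iterable, List


def enforce_immutable_tags(ecr_client: Any, repository_names: Iterable[str]) -> List[str]:
    """Set imageTagMutability to IMMUTABLE on each repository and return the names updated.

    ecr_client is expected to behave like boto3.client("ecr").
    """
    updated: List[str] = []
    for name in repository_names:
        ecr_client.put_image_tag_mutability(
            repositoryName=name,
            imageTagMutability="IMMUTABLE",
        )
        updated.append(name)
    return updated
```

Usage would look something like `enforce_immutable_tags(boto3.client("ecr"), ["my-service"])`, where `my-service` is a placeholder repository name.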
Why an Immutable ECR Backup?
While Amazon ECR provides image replication, it lacks an immutable, long-term backup solution in a separate security boundary. In a disaster recovery (DR) scenario where a primary AWS account is compromised, standard replication is not sufficient. This solution addresses that by creating an "air-gapped" backup protected by an AWS Backup Vault Lock, which provides a Write-Once-Read-Many (WORM) model.

## Solution Architecture
What about restoration? For reference, on MESH, the tarballs of the images are in the S3 remote immutable backup. For restoration, we fetch the tarball, docker load it, then push to ECR.
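The three restoration steps described here (fetch the tarball, `docker load` it, push to ECR) could be sketched roughly as follows. This is not the MESH implementation; the bucket, key, and image reference are hypothetical, and it assumes the tarball was saved under the same image reference that is being pushed, and that the caller has already authenticated Docker against ECR.

```python
import subprocess
from typing import Callable, List


def build_restore_commands(bucket: str, key: str, tarball_path: str, image_ref: str) -> List[List[str]]:
    """Return the commands, in order, to restore one image from an S3 tarball to ECR."""
    return [
        ["aws", "s3", "cp", f"s3://{bucket}/{key}", tarball_path],  # fetch the tarball
        ["docker", "load", "--input", tarball_path],                # load it into the local daemon
        ["docker", "push", image_ref],                              # push it back to the registry
    ]


def run_restore(commands: List[List[str]], runner: Callable = subprocess.run) -> None:
    """Run each command in order; check=True makes the first failure raise and stop the restore."""
    for argv in commands:
        runner(argv, check=True)


if __name__ == "__main__":
    for argv in build_restore_commands(
        "example-backup-bucket",                                       # hypothetical bucket
        "my-service/1.2.3.tar",                                        # hypothetical key layout
        "/tmp/my-service-1.2.3.tar",
        "123456789012.dkr.ecr.eu-west-2.amazonaws.com/my-service:1.2.3",  # hypothetical image
    ):
        print(" ".join(argv))
```

Passing the runner in as a parameter keeps the sequence testable without Docker or AWS credentials.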
There is an additional proposal for a restoration feature to be built into the blueprint as well.
The proposal doesn't include ECR at the moment, but it gives the framework for how restoration will happen through the blueprint.
https://github.com/NHSDigital/terraform-aws-backup/pull/79/files
Description
The PR is a record of the proposal to support ECR through the blueprint.
Context
ECR backup solutions have been requested and the proposal details how this could be completed.
Type of changes
Checklist
Sensitive Information Declaration
To ensure the utmost confidentiality and protect your and others' privacy, we kindly ask you NOT to include PII (Personal Identifiable Information) / PID (Personal Identifiable Data) or any other sensitive data in this PR (Pull Request) and the codebase changes. We will remove any PR that contains sensitive information. We really appreciate your cooperation in this matter.