Conversation

@mannickutd
Contributor

Description

The PR is a record of the proposal to support ECR through the blueprint.

Context

An ECR backup solution has been requested, and this proposal details how it could be implemented.

Type of changes

  • Refactoring (non-breaking change)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would change existing functionality)
  • Bug fix (non-breaking change which fixes an issue)

Checklist

  • I am familiar with the contributing guidelines
  • I have followed the code style of the project
  • I have added tests to cover my changes
  • I have updated the documentation accordingly
  • This PR is a result of pair or mob programming

Sensitive Information Declaration

To ensure the utmost confidentiality and protect your and others' privacy, we kindly ask you NOT to include PII (Personally Identifiable Information) / PID (Personally Identifiable Data) or any other sensitive data in this PR (Pull Request) and the codebase changes. We will remove any PR that contains sensitive information. We really appreciate your cooperation in this matter.

  • I confirm that neither PII/PID nor sensitive data are included in this PR and the codebase changes.

@mannickutd requested a review from a team as a code owner on September 20, 2025, 13:36
Comment on lines +23 to +26
3. The Lambda function lists all repositories in the ECR registry.
4. For each repository, the function lists all image tags.
5. For each image tag (or a subsection of tags), the function pulls the image using the Docker CLI.
6. The function then pushes the image to a designated S3 bucket in the source account, organizing images by repository and tag for easy retrieval.
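
As a rough sketch of steps 3 to 6, covering only the enumeration and the key layout (the pull and the upload are left out). It assumes a boto3 ECR client is passed in; the `backup_key` helper and its `<repository>/<tag>.tar` layout are hypothetical and not part of the proposal:

```python
def backup_key(repository: str, tag: str) -> str:
    """Hypothetical S3 key layout: one tarball per repository and tag."""
    return f"{repository}/{tag}.tar"


def list_tagged_images(ecr_client):
    """Yield (repository, tag) for every tagged image in the registry."""
    repo_pages = ecr_client.get_paginator("describe_repositories").paginate()
    for repo_page in repo_pages:
        for repo in repo_page["repositories"]:
            name = repo["repositoryName"]
            image_pages = ecr_client.get_paginator("list_images").paginate(
                repositoryName=name
            )
            for image_page in image_pages:
                for image_id in image_page["imageIds"]:
                    # Untagged images carry only a digest, so skip them here.
                    if "imageTag" in image_id:
                        yield name, image_id["imageTag"]
```

Passing the client in keeps the listing logic testable without AWS credentials.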
Contributor

Rather than hand-crank this, we'd be better off trying to leverage https://github.com/containers/skopeo?tab=readme-ov-file#syncing-registries or similar. If we can get an s3fs-fuse mount into a container lambda, it should Just Work. We might want to get a spike in to prove it out, but I'd much rather not have to build this bit ourselves.
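
For the spike, the invocation might look roughly like the sketch below. It only builds the argument list for `skopeo sync` from a registry to a directory; authentication is deliberately left out, and the destination directory (for example an s3fs-fuse mount point such as `/mnt/ecr-backup`) is an assumption, not something proven to work in a container Lambda:

```python
def ecr_registry_host(account_id: str, region: str) -> str:
    """Standard hostname of a private ECR registry."""
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com"


def skopeo_sync_command(account_id, region, repository, dest_dir):
    """Build a `skopeo sync` call that copies one repository to a directory."""
    source = f"{ecr_registry_host(account_id, region)}/{repository}"
    return [
        "skopeo", "sync",
        "--src", "docker",   # source is a container registry
        "--dest", "dir",     # destination is a local directory tree
        source,
        dest_dir,            # assumed to be the mounted S3 bucket
    ]
```

The list can be handed to `subprocess.run(..., check=True)` once registry credentials are sorted out.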

2. The schedule event triggers an AWS Lambda function.
3. The Lambda function lists all repositories in the ECR registry.
4. For each repository, the function lists all image tags.
5. For each image tag (or a subsection of tags), the function pulls the image using the Docker CLI.
Contributor

How would an efficient "incremental" backup cohort be identified? Would the backup complete within Lambda's 15-minute timeout? I don't know how large some people's images are, but Lambda has a 10 GB ephemeral storage limit, so some images might exceed it, or at the very least some housekeeping would be needed between pulls.
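
One way to frame the cohort question, as a sketch only: skip digests that are already backed up and flag anything that could not fit in ephemeral storage. The input is assumed to be the `imageDetails` entries returned by ECR's `DescribeImages`; the `backed_up_digests` set is hypothetical and would need to come from wherever backup state is recorded. ECR reports the compressed size, so the check is a lower bound on what a Docker pull would need:

```python
# Lambda's maximum configurable ephemeral storage is 10,240 MB.
LAMBDA_MAX_EPHEMERAL_BYTES = 10_240 * 1024 * 1024


def plan_backup(image_details, backed_up_digests,
                storage_limit=LAMBDA_MAX_EPHEMERAL_BYTES):
    """Split new images into those that fit in ephemeral storage and those that do not."""
    to_copy, too_large = [], []
    for detail in image_details:
        digest = detail["imageDigest"]
        if digest in backed_up_digests:
            continue  # already in the backup, nothing to do
        if detail["imageSizeInBytes"] > storage_limit:
            too_large.append(digest)
        else:
            to_copy.append(digest)
    return to_copy, too_large
```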

Contributor

If we use skopeo, this shouldn't be a problem. It'll do the copy in layers (which is the unit of increment we have available).
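
To illustrate what "layers as the unit of increment" means (this is the general idea, not skopeo's internals): an image manifest lists a config blob and layer blobs by digest, so only digests missing from the backup need copying. The `stored_digests` set is a hypothetical record of what the backup already holds:

```python
def missing_blobs(manifest: dict, stored_digests: set) -> list:
    """Return the blob digests in an image manifest that are not yet stored."""
    wanted = [manifest["config"]["digest"]]
    wanted += [layer["digest"] for layer in manifest["layers"]]
    return [digest for digest in wanted if digest not in stored_digests]
```

Two images built from the same base share their base layer digests, so the second backup only needs the layers that differ.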

Contributor

How do you do this in MESH?


## Step-by-Step Implementation

### Stage 1: ECR to S3 Backup
Contributor

Just for reference, on MESH, when we push an image to ECR, we also push the tarball to S3. The build fails unless both complete.
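
A minimal sketch of that kind of build step, not the actual MESH pipeline: every command runs with `check=True`, so any failure aborts the step. The S3 key layout and the `publish_image` name are hypothetical:

```python
import subprocess


def publish_image(image_ref, bucket, tarball="image.tar", run=subprocess.run):
    """Save the image, upload the tarball to S3, then push to ECR."""
    commands = [
        ["docker", "save", "--output", tarball, image_ref],
        # Hypothetical key layout: the image reference plus a .tar suffix.
        ["aws", "s3", "cp", tarball, f"s3://{bucket}/{image_ref}.tar"],
        ["docker", "push", image_ref],
    ]
    for command in commands:
        # A non-zero exit raises CalledProcessError and stops the build.
        run(command, check=True)
    return commands
```

The `run` parameter exists only so the sequence can be exercised without Docker or AWS access.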


### Considerations

* ECR replication does not provide immutability; images can be deleted or overwritten.
Contributor

There's https://docs.aws.amazon.com/AmazonECR/latest/userguide/image-tag-mutability.html - but yes, a bad actor with admin access could delete images.
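
For what it's worth, turning tag immutability on for existing repositories could be sketched as below, assuming a boto3 ECR client is passed in. This stops tags being overwritten; it does nothing to stop images being deleted:

```python
def make_tags_immutable(ecr_client, repository_names):
    """Set imageTagMutability to IMMUTABLE on each named repository."""
    for name in repository_names:
        ecr_client.put_image_tag_mutability(
            repositoryName=name,
            imageTagMutability="IMMUTABLE",
        )
```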

Why an Immutable ECR Backup?
While Amazon ECR provides image replication, it lacks an immutable, long-term backup solution in a separate security boundary. In a disaster recovery (DR) scenario where a primary AWS account is compromised, standard replication is not sufficient. This solution addresses that by creating an "air-gapped" backup protected by an AWS Backup Vault Lock, which provides a Write-Once-Read-Many (WORM) model.

## Solution Architecture
Contributor

What about restoration? For reference, on MESH, the tarballs of the images are in the S3 remote immutable backup. For restoration, we fetch the tarball, docker load it, then push to ECR.
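
Roughly, that restore sequence as a sketch. The bucket, key and local tarball name are placeholders, and it assumes the tarball was saved under the same image reference that is being pushed, since `docker load` restores the names stored in the archive:

```python
def restore_commands(bucket, key, image_ref, tarball="image.tar"):
    """Commands to fetch a backed-up tarball, load it, and push it to ECR."""
    return [
        ["aws", "s3", "cp", f"s3://{bucket}/{key}", tarball],
        ["docker", "load", "--input", tarball],
        ["docker", "push", image_ref],
    ]
```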

Contributor Author

There is an additional proposal for a restoration feature to be built into the blueprint as well.
The proposal doesn't include ECR at the moment, but it gives the framework for how restoration will happen through the blueprint.
https://github.com/NHSDigital/terraform-aws-backup/pull/79/files

3 participants