Support syncing Amazon Linux blobstore style repos (e.g. AL2, AL2022) #457

stewartsmith · 2022-05-30T18:24:27Z

The yum repository layout for Amazon Linux repositiors (AL1, AL2,
AL2022, and likely future versions) have a slightly interesting layout.

All Amazon Linux releases have a mirrorlist that points to the
repository, but the repository is not in a fixed location, but instead
under a GUID. This means that content can be synced (and staged) before
the atomic (and fast) operation of writing a new mirrorlist makes this
content visible.

For Amazon Linux 1 style repositories this wasn't a problem as each of
these GUID repos that the mirrorlist pointed to was a complete copy of
the repository, packages and all. This did mean that as the "updates"
repo grew, the amount of time it took to release package updates
increased.

Starting with Amazon Linux 2, instead of having each of these GUID repos
have a full copy of the repository, the repodata contains relative paths
over to a central blob store. Thus the only data that needs to be pushed
to release a package update is only the added packages, along with a new
copy of the repo metadata.

However, as of caf28c4, reposync does
not want to write files outside of the destination directory, and this
broke being able to reposync the Amazon Linux 2 style yum repositories.

The reposync shipped in the AL2 repositories carried a patch that would
just rewrite the path it was saving the RPMs to. This did have the
effect of having the --download-metadata option save metadata that still
had the old paths (to the blobstore) and thus the resultant repository
would not be usable - you would have to run createrepo_c.

So this patch implements the ability to fully mirror an Amazon Linux
style repository. When --blobstore is passed, reposync will sync the
full remote paths into the destination directory, thus preserving the
GUID repo, and the blobstore structure, which means the paths in the
metadata (if downloaded with --download-metadata) will work.

Since the synced repo will be under the GUID and not be predictable if
you want to reliably point to the repo as the repo GUID changes each
time there is an update. So, this is where the
new --save-relative-mirrorlist option comes in. It will save a copy of
the mirror list locally, with the URLs trimmed to be relative of the
destination directory rather than absolute URLs. If used with the
corresponding --save-relative-mirrorlist-prefix, you can construct a
usable mirrorlist if you're using reposync to construct an actual
mirror.

Much like the original patch in Amazon Linux 2's reposync, if none of
the new options are specified, just the packages are fetched, and any
fetched metadata will point to the wrong place.

Looks as though https://bugzilla.redhat.com/show_bug.cgi?id=1898089 / #441 is related, although this patch ends up with a bit more functionality to the reposync of such repositories, such as being able to end up with synced metadata that is functional.

The yum repository layout for Amazon Linux repositiors (AL1, AL2, AL2022, and likely future versions) have a slightly interesting layout. All Amazon Linux releases have a mirrorlist that points to the repository, but the repository is not in a fixed location, but instead under a GUID. This means that content can be synced (and staged) before the atomic (and fast) operation of writing a new mirrorlist makes this content visible. For Amazon Linux 1 style repositories this wasn't a problem as each of these GUID repos that the mirrorlist pointed to was a complete copy of the repository, packages and all. This did mean that as the "updates" repo grew, the amount of time it took to release package updates increased. Starting with Amazon Linux 2, instead of having each of these GUID repos have a full copy of the repository, the repodata contains relative paths over to a central blob store. Thus the only data that needs to be pushed to release a package update is only the added packages, along with a new copy of the repo metadata. However, as of caf28c4, reposync does not want to write files outside of the destination directory, and this broke being able to reposync the Amazon Linux 2 style yum repositories. The reposync shipped in the AL2 repositories carried a patch that would just rewrite the path it was saving the RPMs to. This did have the effect of having the --download-metadata option save metadata that still had the old paths (to the blobstore) and thus the resultant repository would not be usable - you would *have* to run createrepo_c. So this patch implements the ability to fully mirror an Amazon Linux style repository. When --blobstore is passed, reposync will sync the full remote paths into the destination directory, thus preserving the GUID repo, and the blobstore structure, which means the paths in the metadata (if downloaded with --download-metadata) will work. Since the synced repo will be under the GUID and not be predictable if you want to reliably point to the repo as the repo GUID changes each time there is an update. So, this is where the new --save-relative-mirrorlist option comes in. It will save a copy of the mirror list locally, with the URLs trimmed to be relative of the destination directory rather than absolute URLs. If used with the corresponding --save-relative-mirrorlist-prefix, you can construct a usable mirrorlist if you're using reposync to construct an actual mirror. Much like the original patch in Amazon Linux 2's reposync, if none of the new options are specified, just the packages are fetched, and any fetched metadata will point to the wrong place. = changelog = msg: Support syncing Amazon Linux blobstore style repos type: enhancement resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1898089

m-blaha · 2022-06-15T06:48:40Z

plugins/reposync.py

            if self.opts.delete:
                self.delete_old_local_packages(repo, pkglist)
        if not gpgcheck_ok:
            raise dnf.exceptions.Error(_("GPG signature check failed."))

    def repo_target(self, repo):
+        if self.opts.blobstore:
+            mirrors = repo._repo.getMirrors()
+            dest_path = urllib.parse.urlparse(mirrors[0]).path[1:]


Repositories are not required to have mirrorlist. They can have metalink or baseurl set. In such case mirrors is an empty tuple and this line raises an IndexError exception.

We don't (currently at least, and don't have immediate plans to) have a metalink pointer for any of our repos... although I could sync one somewhere, create a metalink and test it that way. I'll give it a go and see if I can suitably test, or at least more gracefully bail out until such a repo exists.

Good point on the baseurl though, as that should probably work here as well, even if not the common path for Amazon Linux repos.

m-blaha · 2022-06-15T06:55:58Z

plugins/reposync.py

+
+        # If we haven't set the blobstore option (i.e. preserve full path)
+        # and we see references to a file outside our destination,
+        # then we should just store the package in the target location


But this changes the behavior of the plugin in case --download-metadata switch is used. According to the documentation of the option: "Download all repository metadata. Downloaded copy is instantly usable as a repository, no need to run createrepo_c on it.".
Now (with changed package download location) this is not true any more and user is not even informed.

Agree that in that case we should probably error out.

Logic should be something like:

if blobstore style repo

and --download-metadata but not --blobstore: error out as downloaded metadata will not be usable

if --blobstore and --download-metadata: download the metadata and things will work

if not blobstore style repo:

preserve existing behavior of --download-metadata or not.
?

m-blaha · 2022-06-15T07:06:22Z

plugins/reposync.py

-                    pkg_download_path, repo_target))
+        repo_target_check = repo_target
+        if self.opts.blobstore:
+            repo_target_check = _pkgdir(pkg.repo.id if not self.opts.norepopath else '', '')


What's the reason for computing repo_target_check value here? Later in code there is no usage for it.
Also these two subsequent conditions if not self.opts.blobstore and if self.opts.blobstore look a bit strange.

m-blaha · 2022-06-15T07:08:31Z

plugins/reposync.py

+                pkg_location = os.path.basename(pkg_location)
+                pkg_download_path = os.path.realpath(
+                    os.path.join(repo_target, pkg_location))
+
        # join() ensures repo_target ends with a path separator (otherwise the
        # check would pass if pkg_download_path was a "sibling" path component
        # of repo_target that has the same prefix).


This commend now lost it's meaning. It should be also moved (together with os.path.join(repo_target, '') move)

m-blaha · 2022-06-15T07:26:04Z

plugins/reposync.py

+                            action='store_true',
+                            help=_('Save a modified copy of the repo mirror list '
+                                   'pointing to the repo data. pointing to the '
+                                   'repo downloaded. Useful with --blobstore '


pointing to the repo data. pointing to the repo downloaded. - probably remnant of previous version of the sentence.

Ahhh whoops, will fix.

m-blaha · 2022-06-15T07:36:45Z

@stewartsmith is it possible to post here a concrete example of AL2 repository config file? Just to be sure I've understood this GUID layout correctly.

stewartsmith · 2022-06-15T18:48:19Z

Sure, I'll write up something that I can post publicly that has a more detailed explanation of how things work today in AL1/AL2/AL2022 and the motivation behind it, along with how the config files look today for each.

kyle-walker · 2022-08-26T14:58:57Z

Just to add a +1 to @m-blaha's line of inquiry. I am not familiar with this style of repository layout, and the behaviour around relative paths were locked down previously due to at least one CVE reported.

What tooling composes those repos, is it custom/internal to Amazon? Is there a design doc that specifies the layout, or is that the intent of the writeup mentioned in #457 (comment) :)

stewartsmith · 2022-08-26T16:49:05Z

The write-up I mention in #457 (comment) is to ensure that a write-up makes it out into the open rather than sitting internally to Amazon, which isn't as useful place for it as sitting, well, literally anywhere else :)

kyle-walker · 2022-09-02T15:33:35Z

What with CVE-2018-10897, I would very much like to avoid accesses in reposync outside of the directory. If you replace the relative paths in the repository with symlinks, you end up with similar outcomes. Without needing to relax the restrictions, or add additional repository-specific flags, on the client side.

After looking at this problem for a while, I was wondering... Is there any leeway for changing the repository layout?

mjaitly-amazon · 2023-03-02T01:57:42Z

#481 @stewartsmith @m-blaha @kyle-walker

Conan-Kudo · 2023-04-02T21:39:52Z

I would also prefer Amazon Linux to change its repository layout for similar reason to @kyle-walker. Deliberately creating a configuration that creates CVEs is no fun. :)

stewartsmith and others added 2 commits May 30, 2022 12:18

Add Stewart Smith to contributors

f9f5a07

stewartsmith force-pushed the mirror-blobstore-repos branch from 465880a to f9f5a07 Compare May 30, 2022 18:52

m-blaha self-assigned this Jun 15, 2022

m-blaha reviewed Jun 15, 2022

View reviewed changes

stewartsmith mentioned this pull request Aug 19, 2022

dnf reposync amazonlinux - ERROR amazonlinux/amazon-linux-2023#165

Closed

Conan-Kudo mentioned this pull request Apr 2, 2023

Allow reposync to download from Amazon Linux repos #481

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Support syncing Amazon Linux blobstore style repos (e.g. AL2, AL2022) #457

Support syncing Amazon Linux blobstore style repos (e.g. AL2, AL2022) #457

stewartsmith commented May 30, 2022

m-blaha Jun 15, 2022

stewartsmith Jun 15, 2022

m-blaha Jun 15, 2022

stewartsmith Jun 15, 2022

m-blaha Jun 15, 2022

m-blaha Jun 15, 2022

m-blaha Jun 15, 2022

stewartsmith Jun 15, 2022

m-blaha commented Jun 15, 2022

stewartsmith commented Jun 15, 2022

kyle-walker commented Aug 26, 2022

stewartsmith commented Aug 26, 2022

kyle-walker commented Sep 2, 2022 •

edited

Loading

mjaitly-amazon commented Mar 2, 2023

Conan-Kudo commented Apr 2, 2023

Support syncing Amazon Linux blobstore style repos (e.g. AL2, AL2022) #457

Are you sure you want to change the base?

Support syncing Amazon Linux blobstore style repos (e.g. AL2, AL2022) #457

Conversation

stewartsmith commented May 30, 2022

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

m-blaha commented Jun 15, 2022

stewartsmith commented Jun 15, 2022

kyle-walker commented Aug 26, 2022

stewartsmith commented Aug 26, 2022

kyle-walker commented Sep 2, 2022 • edited Loading

mjaitly-amazon commented Mar 2, 2023

Conan-Kudo commented Apr 2, 2023

kyle-walker commented Sep 2, 2022 •

edited

Loading