-
Notifications
You must be signed in to change notification settings - Fork 141
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support syncing Amazon Linux blobstore style repos (e.g. AL2, AL2022) #457
base: master
Are you sure you want to change the base?
Support syncing Amazon Linux blobstore style repos (e.g. AL2, AL2022) #457
Conversation
The yum repository layout for Amazon Linux repositiors (AL1, AL2, AL2022, and likely future versions) have a slightly interesting layout. All Amazon Linux releases have a mirrorlist that points to the repository, but the repository is not in a fixed location, but instead under a GUID. This means that content can be synced (and staged) before the atomic (and fast) operation of writing a new mirrorlist makes this content visible. For Amazon Linux 1 style repositories this wasn't a problem as each of these GUID repos that the mirrorlist pointed to was a complete copy of the repository, packages and all. This did mean that as the "updates" repo grew, the amount of time it took to release package updates increased. Starting with Amazon Linux 2, instead of having each of these GUID repos have a full copy of the repository, the repodata contains relative paths over to a central blob store. Thus the only data that needs to be pushed to release a package update is only the added packages, along with a new copy of the repo metadata. However, as of caf28c4, reposync does not want to write files outside of the destination directory, and this broke being able to reposync the Amazon Linux 2 style yum repositories. The reposync shipped in the AL2 repositories carried a patch that would just rewrite the path it was saving the RPMs to. This did have the effect of having the --download-metadata option save metadata that still had the old paths (to the blobstore) and thus the resultant repository would not be usable - you would *have* to run createrepo_c. So this patch implements the ability to fully mirror an Amazon Linux style repository. When --blobstore is passed, reposync will sync the full remote paths into the destination directory, thus preserving the GUID repo, and the blobstore structure, which means the paths in the metadata (if downloaded with --download-metadata) will work. Since the synced repo will be under the GUID and not be predictable if you want to reliably point to the repo as the repo GUID changes each time there is an update. So, this is where the new --save-relative-mirrorlist option comes in. It will save a copy of the mirror list locally, with the URLs trimmed to be relative of the destination directory rather than absolute URLs. If used with the corresponding --save-relative-mirrorlist-prefix, you can construct a usable mirrorlist if you're using reposync to construct an actual mirror. Much like the original patch in Amazon Linux 2's reposync, if none of the new options are specified, just the packages are fetched, and any fetched metadata will point to the wrong place. = changelog = msg: Support syncing Amazon Linux blobstore style repos type: enhancement resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1898089
465880a
to
f9f5a07
Compare
if self.opts.delete: | ||
self.delete_old_local_packages(repo, pkglist) | ||
if not gpgcheck_ok: | ||
raise dnf.exceptions.Error(_("GPG signature check failed.")) | ||
|
||
def repo_target(self, repo): | ||
if self.opts.blobstore: | ||
mirrors = repo._repo.getMirrors() | ||
dest_path = urllib.parse.urlparse(mirrors[0]).path[1:] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Repositories are not required to have mirrorlist. They can have metalink or baseurl set. In such case mirrors is an empty tuple and this line raises an IndexError
exception.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't (currently at least, and don't have immediate plans to) have a metalink pointer for any of our repos... although I could sync one somewhere, create a metalink and test it that way. I'll give it a go and see if I can suitably test, or at least more gracefully bail out until such a repo exists.
Good point on the baseurl
though, as that should probably work here as well, even if not the common path for Amazon Linux repos.
|
||
# If we haven't set the blobstore option (i.e. preserve full path) | ||
# and we see references to a file outside our destination, | ||
# then we should just store the package in the target location |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But this changes the behavior of the plugin in case --download-metadata
switch is used. According to the documentation of the option: "Download all repository metadata. Downloaded copy is instantly usable as a repository, no need to run createrepo_c on it.".
Now (with changed package download location) this is not true any more and user is not even informed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree that in that case we should probably error out.
Logic should be something like:
- if blobstore style repo
- and
--download-metadata
but not--blobstore
: error out as downloaded metadata will not be usable - if
--blobstore
and--download-metadata
: download the metadata and things will work
- and
- if not blobstore style repo:
- preserve existing behavior of
--download-metadata
or not.
?
- preserve existing behavior of
pkg_download_path, repo_target)) | ||
repo_target_check = repo_target | ||
if self.opts.blobstore: | ||
repo_target_check = _pkgdir(pkg.repo.id if not self.opts.norepopath else '', '') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the reason for computing repo_target_check
value here? Later in code there is no usage for it.
Also these two subsequent conditions if not self.opts.blobstore
and if self.opts.blobstore
look a bit strange.
pkg_location = os.path.basename(pkg_location) | ||
pkg_download_path = os.path.realpath( | ||
os.path.join(repo_target, pkg_location)) | ||
|
||
# join() ensures repo_target ends with a path separator (otherwise the | ||
# check would pass if pkg_download_path was a "sibling" path component | ||
# of repo_target that has the same prefix). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This commend now lost it's meaning. It should be also moved (together with os.path.join(repo_target, '')
move)
action='store_true', | ||
help=_('Save a modified copy of the repo mirror list ' | ||
'pointing to the repo data. pointing to the ' | ||
'repo downloaded. Useful with --blobstore ' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pointing to the repo data. pointing to the repo downloaded.
- probably remnant of previous version of the sentence.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ahhh whoops, will fix.
@stewartsmith is it possible to post here a concrete example of AL2 repository config file? Just to be sure I've understood this GUID layout correctly. |
Sure, I'll write up something that I can post publicly that has a more detailed explanation of how things work today in AL1/AL2/AL2022 and the motivation behind it, along with how the config files look today for each. |
Just to add a +1 to @m-blaha's line of inquiry. I am not familiar with this style of repository layout, and the behaviour around relative paths were locked down previously due to at least one CVE reported. What tooling composes those repos, is it custom/internal to Amazon? Is there a design doc that specifies the layout, or is that the intent of the writeup mentioned in #457 (comment) :) |
The write-up I mention in #457 (comment) is to ensure that a write-up makes it out into the open rather than sitting internally to Amazon, which isn't as useful place for it as sitting, well, literally anywhere else :) |
What with CVE-2018-10897, I would very much like to avoid accesses in reposync outside of the directory. If you replace the relative paths in the repository with symlinks, you end up with similar outcomes. Without needing to relax the restrictions, or add additional repository-specific flags, on the client side. After looking at this problem for a while, I was wondering... Is there any leeway for changing the repository layout? |
I would also prefer Amazon Linux to change its repository layout for similar reason to @kyle-walker. Deliberately creating a configuration that creates CVEs is no fun. :) |
The yum repository layout for Amazon Linux repositiors (AL1, AL2,
AL2022, and likely future versions) have a slightly interesting layout.
All Amazon Linux releases have a mirrorlist that points to the
repository, but the repository is not in a fixed location, but instead
under a GUID. This means that content can be synced (and staged) before
the atomic (and fast) operation of writing a new mirrorlist makes this
content visible.
For Amazon Linux 1 style repositories this wasn't a problem as each of
these GUID repos that the mirrorlist pointed to was a complete copy of
the repository, packages and all. This did mean that as the "updates"
repo grew, the amount of time it took to release package updates
increased.
Starting with Amazon Linux 2, instead of having each of these GUID repos
have a full copy of the repository, the repodata contains relative paths
over to a central blob store. Thus the only data that needs to be pushed
to release a package update is only the added packages, along with a new
copy of the repo metadata.
However, as of caf28c4, reposync does
not want to write files outside of the destination directory, and this
broke being able to reposync the Amazon Linux 2 style yum repositories.
The reposync shipped in the AL2 repositories carried a patch that would
just rewrite the path it was saving the RPMs to. This did have the
effect of having the --download-metadata option save metadata that still
had the old paths (to the blobstore) and thus the resultant repository
would not be usable - you would have to run createrepo_c.
So this patch implements the ability to fully mirror an Amazon Linux
style repository. When --blobstore is passed, reposync will sync the
full remote paths into the destination directory, thus preserving the
GUID repo, and the blobstore structure, which means the paths in the
metadata (if downloaded with --download-metadata) will work.
Since the synced repo will be under the GUID and not be predictable if
you want to reliably point to the repo as the repo GUID changes each
time there is an update. So, this is where the
new --save-relative-mirrorlist option comes in. It will save a copy of
the mirror list locally, with the URLs trimmed to be relative of the
destination directory rather than absolute URLs. If used with the
corresponding --save-relative-mirrorlist-prefix, you can construct a
usable mirrorlist if you're using reposync to construct an actual
mirror.
Much like the original patch in Amazon Linux 2's reposync, if none of
the new options are specified, just the packages are fetched, and any
fetched metadata will point to the wrong place.
Looks as though https://bugzilla.redhat.com/show_bug.cgi?id=1898089 / #441 is related, although this patch ends up with a bit more functionality to the reposync of such repositories, such as being able to end up with synced metadata that is functional.