POC: Refactor image artifact storage & execution profiles #285
Closed
These parts will be useful for the upcoming skopeo adapter as well.
"get" is probably clearer than "list", and tacking on "_ids" makes it clearer what the return value is. Also, drop the leading underscore, which is a holdover from the function being in the adapters.docker module.
The minimum Python version of DataLad is new enough that we can assume subprocess.run() is available. It's recommended by the docs, and I like it more, so switch to it. Note that we might want to eventually switch to using WitlessRunner here. The original idea with using the subprocess module directly was that it'd be nice for the docker adapter to be standalone, as nothing in the adapter depended on datalad at the time. That's not the case anymore after the adapters.utils split and the use of datalad.utils within it. (And the upcoming skopeo adapter will make heavier use of datalad for adding URLs to the layers.)
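A minimal sketch of the subprocess.run() pattern this commit describes, assuming a standalone helper (the adapter's actual function names may differ); stderr is captured so it can be surfaced on failure, as a later commit in this series also notes.

```python
import subprocess

def run_command(cmd):
    """Run `cmd`, return stdout; include captured stderr in the error on failure."""
    proc = subprocess.run(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )
    if proc.returncode != 0:
        raise RuntimeError("{} failed (exit {}):\n{}".format(
            cmd[0], proc.returncode, proc.stderr))
    return proc.stdout

# e.g. image_ids = run_command(["docker", "images", "--quiet"]).splitlines()
```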
This logic will get a bit more involved in the next commit, and it will be needed by the skopeo adapter too.
When the adapter is called from the command line (as containers-run does) and datalad gets imported, the level set via the --verbose argument doesn't have an effect and logging happens twice, once through datalad's handler and once through the adapter's. Before 313c4f0 (WIN/Workaround: don't pass gid and uid to docker run call, 2020-11-10), the above was the case when docker.main() was triggered with the documented `python -m datalad_container.adapters ...` invocation, but not when the script path was passed to python. Following that commit, the adapter imports datalad, so datalad's logger is always configured. Adjust setup_logger() to set the log level of loggers under the datalad.containers.adapters namespace so that the adapter's logging level is in effect for command line calls to the adapter. As mentioned above, datalad is now loaded in all cases, so a handler is always configured, but, in case this changes in the future, add a simpler handler if one isn't already configured.
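Roughly what the described setup_logger() adjustment could look like, as a sketch using the standard logging module (not the adapter's exact code): set the level on the adapters' logger namespace, and only attach a handler when none is configured up the hierarchy.

```python
import logging

def setup_logger(level):
    # Level set on the namespace so it takes effect even when datalad has
    # already configured its own handler on the parent "datalad" logger.
    lgr = logging.getLogger("datalad.containers.adapters")
    lgr.setLevel(level)
    if not lgr.hasHandlers():
        # Fallback in case no handler was configured by datalad.
        handler = logging.StreamHandler()
        handler.setFormatter(logging.Formatter("%(levelname)s: %(message)s"))
        lgr.addHandler(handler)
    return lgr
```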
The same handling will be needed in the skopeo adapter. Avoid repeating it.
Some of the subprocess calls capture stderr. Show it to the caller on failure.
In order to be able to track Docker containers in a dataset, we introduced the docker-save-based docker adapter in 68a1462 (Add prototype of a Docker adapter, 2018-05-18). It's not clear how much this has been used, but at least conceptually it seems to be viable. One problem, however, is that ideally we'd be able to assign Docker registry URLs to the image files stored in the dataset (particularly the large non-configuration files). There doesn't seem to be a way to do this with the docker-save archives. Another option for storing the image in a dataset is the Open Container Initiative image format. Skopeo can be used to copy images in Docker registries (and some other destinations) to an OCI-compliant directory. When Docker Hub is used as the source, the resulting layers blobs can be re-obtained via GET /v2/NAME/blobs/ID. Using skopeo/OCI also has the advantage of making it easier to execute via podman in the future. Add an initial skopeo-based OCI adapter. At this point, it has the same functionality as the docker adapter.
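For illustration, a layer blob copied from Docker Hub could later be re-fetched from the registry API mentioned above (GET /v2/NAME/blobs/ID). This is a sketch; the repository name and digest are placeholders.

```python
def hub_blob_url(repository, digest):
    """Return the Docker Hub registry URL for a layer blob."""
    # Official images live under the "library/" namespace on Docker Hub.
    if "/" not in repository:
        repository = "library/" + repository
    return "https://registry-1.docker.io/v2/{}/blobs/{}".format(repository, digest)

# hub_blob_url("busybox", "sha256:abc...") ->
#   "https://registry-1.docker.io/v2/library/busybox/blobs/sha256:abc..."
```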
After running `skopeo copy docker://docker.io/... oci:<dir>`, we can
link up the layer to the Docker registry. However, other digests
aren't preserved. One notable mismatch is between the image ID if you
run
docker pull x
versus
skopeo copy docker://x oci:x && skopeo copy oci:x docker-daemon:x
I haven't really wrapped my head around all the different digests and
when they can change. However, skopeo's issue tracker has a good deal
of discussion about this, and it looks complicated (e.g., issues 11,
469, 949, 1046, and 1097).
The adapter docstring should probably note this, though at this point
I'm not sure I could say something coherent. Anyway, add a to-do
note...
I _think_ containers-storage: is what we'd use for podman-run, but I haven't attempted it.
Prevent skopeo-copy output from being shown, since it's probably confusing to see output under run's "Command start (output follows)" tag for a command that the user didn't explicitly call. However, for large images, this has the downside that the user might want some signs of life, so this may need to be revisited.
We'll need this information in order to add a tag to the oci: destination and to make the entry copied to docker-daemon more informative. I've tried to base the rules on containers/image implementation, which is what skopeo uses underneath.
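A simplified sketch of such reference parsing (loosely modeled on the containers/image rules; not the adapter's exact code), splitting an image reference into repository, tag, and digest.

```python
def parse_reference(ref):
    """Split an image reference into (repository, tag, digest)."""
    digest = None
    tag = None
    if "@" in ref:
        ref, digest = ref.split("@", 1)
    # A colon after the last "/" separates the tag from the repository;
    # a colon followed by a "/" is a registry port, not a tag.
    repo, _, maybe_tag = ref.rpartition(":")
    if repo and "/" not in maybe_tag:
        ref, tag = repo, maybe_tag
    return ref, tag, digest

# parse_reference("docker.io/library/debian:buster-slim")
#   -> ("docker.io/library/debian", "buster-slim", None)
```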
An image stored as an OCI directory can have a tag. If the source has a tag specified, copy it over to the destination. Note that upcoming commits will store the full source specification as an image annotation, so we won't rely on this when copying the image to docker-daemon:, but it still seems nice to have (e.g., when looking at the directory with skopeo-inspect).
These will be used to store the value of the skopeo-copy source and then retrieve it at load time to make the docker-daemon: entry more informative.
The OCI format allows annotations. Add one with the source value (which will be determined by what the caller gives to containers-add) so that we can use this information when copying the information to a docker-daemon: destination.
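Conceptually, recording the source could look like the following sketch, which edits the manifest entry in the OCI directory's index.json (the annotation key used here is illustrative, not necessarily the one the adapter uses).

```python
import json
import os

def annotate_source(oci_dir, source):
    """Store the skopeo-copy source as an annotation on the image's manifest entry."""
    index_path = os.path.join(oci_dir, "index.json")
    with open(index_path) as fh:
        index = json.load(fh)
    manifest = index["manifests"][0]
    manifest.setdefault("annotations", {})["org.datalad.container.source"] = source
    with open(index_path, "w") as fh:
        json.dump(index, fh)
```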
The images copied to the daemon look like this:
$ docker images
REPOSITORY             TAG              IMAGE ID       CREATED       SIZE
datalad-container/bb   sha256-98345e4   98345e418eb7   3 weeks ago   69.2MB
That tag isn't useful because it just repeats the image ID. And the
name after "datalad-container/" is the name of the directory, so with
the default containers-add location it would be an uninformative
"image".
With the last commit, we store the source specification as an
annotation in the OCI directory. Parse it and reuse the original
repository name and tag.
REPOSITORY                 TAG           IMAGE ID       CREATED       SIZE
datalad-container/debian   buster-slim   98345e418eb7   3 weeks ago   69.2MB
If the source has a digest instead of the tag, construct the daemon
tag from that.
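A rough illustration of that tagging rule (names and rules simplified; the adapter's actual logic may differ): reuse the source's repository name and tag for the docker-daemon: destination, falling back to a digest-derived tag.

```python
def daemon_reference(source_repo, tag=None, digest=None):
    """Build the docker-daemon: reference from the recorded source."""
    name = source_repo.rsplit("/", 1)[-1]
    if tag is None and digest:
        # e.g. "sha256:98345e418eb7..." -> "sha256-98345e4"
        tag = digest.replace(":", "-")[:14]
    return "datalad-container/{}:{}".format(name, tag or "latest")

# daemon_reference("docker.io/library/debian", tag="buster-slim")
#   -> "datalad-container/debian:buster-slim"
```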
Add a new oci: scheme. The stacking of the schemes isn't ideal (oci:docker://, oci:docker-daemon:), but it allows for any skopeo transport to be used. Note: I'm not avoiding appending "//" for a conceptual reason (although there might be a valid one), but because I find "oci://docker://" to be ugly. Perhaps the consistency with "shub://" and "dhub://" outweighs that though.
The next commit will use this logic in the oci adapter as well, and it'd be nice (though not strictly necessary) to avoid oci and containers_add importing each other.
TODO: Finalize approach in Datalad for Docker Registry URLs.
* origin/master: (217 commits)
  [DATALAD RUNCMD] Run pre-commit to harmonize code throughout
  Update __version__ to 1.2.6
  [skip ci] Update CHANGELOG
  BF: use setuptools.errors.OptionError instead of now removed import of distutils.DistutilsOptionError
  BF: docbuild - use python 3.9 (not 3.8) and upgrade setuptools
  [DATALAD RUNCMD] Run pre-commit to harmonize code throughout
  rm duplicate .codespellrc and move some of its skips into pyproject.toml
  progress codespell in pre-commit
  Add precommit configuration as in datalad ATM
  [release-action] Autogenerate changelog snippet for PR 268
  MNT: Account for a number of deprecations in core
  Revert linting a target return value for a container
  Fix lint errors other than line length
  upper case CWD acronym
  CI/tools: Add fuse2fs dependency for singularity installation
  Improving documentation for --url parameter
  Update __version__ to 1.2.5
  [skip ci] Update CHANGELOG
  Add changelog entry for isort PR
  [DATALAD RUNCMD] isort all files for consistency
  ...

Conflicts - some were tricky:
  datalad_container/adapters/docker.py
  datalad_container/containers_add.py
  datalad_container/utils.py - both added but merge looked funny
otherwise even singularity does not install
=== Do not change lines below ===
{
"chain": [],
"cmd": "sed -i -e 's,from distutils.spawn import find_executable,from shutil import which,g' -e 's,find_executable(,which(,g' datalad_container/adapters/tests/test_oci_more.py",
"exit": 0,
"extra_inputs": [],
"inputs": [],
"outputs": [],
"pwd": "."
}
^^^ Do not change lines above ^^^
Added comprehensive documentation for Claude Code to work effectively with this codebase, including architecture overview, development commands, and key implementation details.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Extended the OCI adapter to support any container registry without
hardcoding endpoints. The link() function now dynamically constructs
registry API endpoints using the pattern https://{registry}/v2/, with
Docker Hub as the only special case (registry-1.docker.io).
This enables automatic support for registries like:
- quay.io (Quay.io registry)
- gcr.io (Google Container Registry)
- ghcr.io (GitHub Container Registry)
- Any other V2-compatible registry
Changes:
- Removed hardcoded _ENDPOINTS dictionary
- Added dynamic endpoint construction in link() function
- Added unit tests for parsing references from alternative registries
- Added integration tests using real images:
- ghcr.io/astral-sh/uv:latest for ghcr.io testing
- quay.io/linuxserver.io/baseimage-alpine:3.18 for quay.io testing
The link() function will add registry URLs to annexed layer images for
any registry when proper provider configuration is available, enabling
efficient retrieval through git-annex.
All new tests are marked with @pytest.mark.ai_generated as per project
standards.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
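A sketch of the dynamic endpoint construction this commit describes, with Docker Hub as the only special case (the function name is illustrative rather than the module's actual API).

```python
def registry_endpoint(registry):
    """Return the V2 API endpoint for a container registry."""
    if registry in ("docker.io", "registry-1.docker.io"):
        registry = "registry-1.docker.io"
    return "https://{}/v2/".format(registry)

# registry_endpoint("ghcr.io")   -> "https://ghcr.io/v2/"
# registry_endpoint("docker.io") -> "https://registry-1.docker.io/v2/"
```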
Enhanced the parametrized registry test to include:
1. Docker Hub (docker.io) with busybox:1.30 for consistency
2. Verification that annexed blobs exist in the OCI image
3. Check that all annexed files have URLs registered in either the datalad or web remote for efficient retrieval

The test now verifies that `git annex find --not --in datalad --and --not --in web` returns empty, ensuring all blobs are accessible through git-annex remotes.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
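The check described above might look roughly like this in a test (a sketch; the actual test code may differ): every annexed blob should be present in either the "datalad" or the "web" special remote.

```python
import subprocess

def assert_all_blobs_have_remote_urls(dataset_path):
    # List annexed files that are in neither the "datalad" nor the "web" remote.
    out = subprocess.run(
        ["git", "annex", "find",
         "--not", "--in", "datalad", "--and", "--not", "--in", "web"],
        cwd=dataset_path, stdout=subprocess.PIPE, universal_newlines=True)
    assert out.stdout.strip() == "", "blobs without a registered URL:\n" + out.stdout
```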
Enhanced the parametrized registry test to verify the complete drop/get cycle for the entire dataset:
1. Drops all annexed content in the dataset
2. Verifies that files were actually dropped (non-empty results)
3. Gets everything back from remotes
4. Verifies that files were retrieved (non-empty results)

This ensures that the registered URLs in datalad/web remotes are functional and files can be successfully retrieved from the registry.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
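A condensed sketch of that drop/get cycle, assuming the datalad Python API (the actual test may differ).

```python
from datalad.api import Dataset

def roundtrip(dataset_path):
    ds = Dataset(dataset_path)
    dropped = ds.drop(".")   # release local copies of the annexed blobs
    assert dropped           # something was actually dropped
    fetched = ds.get(".")    # re-fetch everything via the registered URLs
    assert fetched           # something was actually retrieved
```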
This fixture ensures that sys.executable's directory is first in PATH for the duration of tests. This is needed when tests spawn subprocesses that need to import modules from the same Python environment that's running pytest, preventing "No module named X" errors.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
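A sketch of such a fixture (names assumed), prepending sys.executable's directory to PATH so subprocesses spawned by tests resolve the same Python environment that runs pytest.

```python
import os
import sys

import pytest

@pytest.fixture(autouse=True)
def prepend_python_to_path(monkeypatch):
    bindir = os.path.dirname(sys.executable)
    monkeypatch.setenv("PATH", bindir + os.pathsep + os.environ.get("PATH", ""))
```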
…er handling

- Add parametrized integration test covering docker.io, gcr.io, and quay.io
- Test container addition, execution, and annexed blob verification
- Add drop/get cycle testing to verify remote retrieval works
- Fix link() to create datalad remote even without provider configuration
- Issue warning instead of skipping when provider not found
- Allows URLs to be registered and files to be retrieved from any registry
- Use pytest tmp_path fixture instead of @with_tempfile decorator

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
* origin/master:
  [release-action] Autogenerate changelog snippet for PR 278
  install libffi7 since otherwise git-annex install fails
  chore: appveyor -- progress Ubuntu to 2204
- Two concepts only: Image (artifact) and Profile (execution recipe)
- Profiles point to images, can extend other profiles
- Child profiles clobber parent values (no merging)
- YAML files instead of .datalad/config for images and profiles
- ReproNim provides base profiles, users extend with their settings

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
- images/<name>/<version>/ for versioned image storage
- sources.yaml consolidates provenance for all versions of an image
- profiles/ for execution profiles referencing image/<version>
- Eliminates separate environments/ directory

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Bumps [actions/checkout](https://github.com/actions/checkout) from 4 to 6.
- [Release notes](https://github.com/actions/checkout/releases)
- [Changelog](https://github.com/actions/checkout/blob/main/CHANGELOG.md)
- [Commits](actions/checkout@v4...v6)

---
updated-dependencies:
- dependency-name: actions/checkout
  dependency-version: '6'
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
- Add parse_registry_url() for docker:// scheme (quay://, ghcr:// commented pending testing)
- Store images at .datalad/containers/images/<name>/<version>/image/
- Version extracted from URL tag, defaults to 'latest'
- Comment out cmdexec handling (TODO Phase 4 backwards compatibility)
- Comment out deprecated schemes: dhub://, oci:, shub:// (TODO Phase 4)
- Add clear error for unsupported URL schemes

BREAKING CHANGES:
- docker:// now stores OCI directory via skopeo (was: singularity build to SIF)
- cmdexec/--call-fmt no longer set by containers-add
- New storage path structure

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
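A sketch of what this commit describes; the real parse_registry_url() in the PR may differ in signature and edge-case handling.

```python
import os

def parse_registry_url(url):
    """Split a docker:// URL into (name, version); only docker:// is handled here."""
    if not url.startswith("docker://"):
        raise ValueError("Unsupported URL scheme: {}".format(url))
    ref = url[len("docker://"):]
    repo, _, tag = ref.rpartition(":")
    if not repo or "/" in tag:          # no tag present (or the ":" was a registry port)
        repo, tag = ref, "latest"
    name = repo.rsplit("/", 1)[-1]
    return name, tag

def image_path(name, version):
    # New storage layout described above.
    return os.path.join(".datalad", "containers", "images", name, version, "image")

# parse_registry_url("docker://alpine:3.18") -> ("alpine", "3.18")
# parse_registry_url("docker://alpine")      -> ("alpine", "latest")
```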
- Call oci.load() after saving and linking image
- Image is immediately usable with docker run
- TODO: add --load flag to make this optional

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
- Use short form (datalad-container/{name}:{tag}) that Docker recognizes
- Add TODOs for: other registries, non-library images, surfacing name in output
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
- Allow name:version in container names (e.g., 'alpine:latest', 'mriqc:23.1.0')
- Parse name:version to determine storage path
- Tag loaded Docker image with meaningful name (datalad-container/name:version)
- docker run datalad-container/alpine:mytag now works after containers-add

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
- New --image flag: specifies container name:version
- New --exec flag: execution template with {img} and {cmd} placeholders
- Resolves image to datalad-container/<name>:<version> for Docker
- Tracks image directory as extra_input for provenance
- Records resolved command (not shim) in run record
- Legacy -n/--container-name path preserved
Usage:
datalad containers-run --image alpine:latest \
--exec "docker run --rm {img} {cmd}" echo hello
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
- {img} = Docker image name (datalad-container/name:version)
- {img_path} = OCI directory path (.datalad/containers/images/.../image)
- Enables: apptainer exec oci:{img_path} {cmd}
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
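Roughly how both placeholders could be filled in when building the command (a sketch with assumed names).

```python
def resolve_command(template, image_name, image_dir, cmd):
    """Expand the execution template with the resolved image name, path, and command."""
    return template.format(img=image_name, img_path=image_dir, cmd=cmd)

# resolve_command("apptainer exec oci:{img_path} {cmd}",
#                 "datalad-container/alpine:latest",
#                 ".datalad/containers/images/alpine/latest/image",
#                 "echo hello")
#   -> "apptainer exec oci:.datalad/containers/images/alpine/latest/image echo hello"
```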
- Add pyyaml dependency to setup.cfg
- New profiles.py module with profile loading, extends resolution (clobber semantics), and image validation (error early)
- New containers-profiles command to list available profiles
- Add --profile flag to containers-run
- CLI args (--image, --exec) override profile fields

Profiles are YAML files in .datalad/containers/profiles/ defining 'image' and 'exec' template. Supports inheritance via 'extends' key.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
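A sketch of the extends resolution with clobber semantics (file layout and key names follow the commit message; the real profiles.py may differ).

```python
import os

import yaml

def load_profile(profiles_dir, name, _seen=None):
    """Load <name>.yaml, resolving 'extends'; child keys clobber parent keys."""
    _seen = _seen or set()
    if name in _seen:
        raise ValueError("circular 'extends' chain at {!r}".format(name))
    _seen.add(name)
    with open(os.path.join(profiles_dir, name + ".yaml")) as fh:
        profile = yaml.safe_load(fh) or {}
    parent = profile.pop("extends", None)
    if parent:
        base = load_profile(profiles_dir, parent, _seen)
        base.update(profile)   # no deep merging: child values replace parent values
        profile = base
    return profile

# A profile without an 'image' key acts as a reusable base template; callers
# must then supply --image on the command line.
```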
Profiles can now omit the 'image' key, making them reusable base templates for different runtimes. When using such profiles, the --image flag is required. This enables shipping docker-default, apptainer-default, podman-default profiles that work with any image.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Added a demo.sh for easy review: ./demo.sh
This PR served its purpose; I'm proceeding on the design doc PR with fewer breaking changes.
POC implementation of #284
The changes were substantial, so I wanted to go ahead and try it out as a sanity check. Some implementation details changed relative to the design doc.
All 3 phases are complete:
- --image and --exec flags for containers-run

Breaking changes from master:
- docker:// now stores as an OCI directory (was: Singularity SIF)
- cmdexec not set by containers-add (commented out, TODO Phase 4)
- dhub://, oci:, shub:// commented out (TODO Phase 4)

New features:
- containers-profiles command to list available profiles
- extends: for inheritance
- CLI args (--image, --exec) override profile fields
- Profiles may omit the image key; such profiles require the --image flag

Can be further improved by:
- {binds} placeholder so profiles can be more composable
- containers-init or similar to install default profiles (docker-default, apptainer-default)

ReproNim/containers integration
To fully replace the singularity_cmd shim, profiles would need additional features:
- pre-run/post-run hooks for dynamic setup/cleanup scripts
- env: block for static environment variables
- {env.VARNAME} placeholder for dynamic env var expansion

These are documented in the design doc as optional upstream RFEs. The current implementation covers the core use case; ReproNim could adopt profiles now and request these extensions as needed.