-
Notifications
You must be signed in to change notification settings - Fork 7.1k
Migrate wheel builds to wanda (x86_64) #59935
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
|
f104bf5 to
288cadb
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request migrates the x86_64 wheel builds from a CLI-based approach to wanda-based container builds. This involves adding new wanda configuration files, Dockerfiles, and updating the Buildkite CI configuration. The changes look good and are a solid step towards modernizing the build process. I have a couple of suggestions to improve maintainability and build efficiency.
| commands: | ||
| - bazel run //ci/ray_ci:build_in_docker -- wheel --python-version {{matrix}} --architecture x86_64 --upload | ||
| # Extract wheel from wanda image (Wanda cache in ECR) | ||
| - wanda_image="$RAYCI_WORK_REPO:$RAYCI_BUILD_ID-ray-wheel-py{{matrix}}" | ||
| - container_id=$(docker create $wanda_image) | ||
| - mkdir -p /tmp/wheels | ||
| - docker cp ${container_id}:/ /tmp/wheels/ | ||
| - docker rm ${container_id} | ||
| - mv /tmp/wheels/*.whl .whl/ | ||
| - ./ci/build/copy_build_artifacts.sh wheel |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This block of commands is nearly identical to the one in the linux_cpp_wheels_upload step (lines 130-138). To improve maintainability and reduce duplication, consider extracting this logic into a reusable script.
For example, you could create a script ci/build/extract_and_upload_wheel.sh that takes the image name suffix as an argument:
#!/bin/bash
set -euo pipefail
IMAGE_NAME_SUFFIX=$1
WANDA_IMAGE="$RAYCI_WORK_REPO:$RAYCI_BUILD_ID-$IMAGE_NAME_SUFFIX"
echo "--- Extracting wheel from $WANDA_IMAGE"
CONTAINER_ID=$(docker create "$WANDA_IMAGE")
# Use trap to ensure the container is removed even if the script fails
trap 'docker rm -f "$CONTAINER_ID"' EXIT
# Clean up previous runs and prepare directories
rm -rf /tmp/wheels
mkdir -p /tmp/wheels .whl
# Copy all files from the root of the container.
docker cp "${CONTAINER_ID}:/." /tmp/wheels/
# Move the wheel to the .whl directory for the upload script
mv /tmp/wheels/*.whl .whl/
# Upload the artifact
./ci/build/copy_build_artifacts.sh wheelThen you could simplify the commands in both steps to a single line:
- For
linux_wheels_upload:commands: - ./ci/build/extract_and_upload_wheel.sh ray-wheel-py{{matrix}} - For
linux_cpp_wheels_upload:commands: - ./ci/build/extract_and_upload_wheel.sh ray-cpp-wheel-py{{matrix}}
.buildkite/build.rayci.yml
Outdated
| # Gating: only on releases/* OR (master AND nightly) | ||
| - label: ":s3: upload: wheel py{{matrix}} (x86_64)" | ||
| key: linux_wheels_upload | ||
| if: build.branch =~ /^releases\// || (build.branch == "master" && build.env("RAYCI_SCHEDULE") == "nightly") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Following logic on ci/build/copy_build_artifacts.sh, it seemed solid to just skip this step entirely when we're not in a postmerge state. I can also remove this if it's unnecessary
https://github.com/ray-project/ray/blob/master/ci/build/copy_build_artifacts.sh#L49-L57
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it already has skip-on-premerge tag, so this if statement is not really needed.
there are cases where people sometimes want to build wheels and upload to the s3 from random branches, and we do allow that.
| @@ -1 +1 @@ | |||
| 0.21.0 | |||
| 0.22.0 | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This bump is necessary to get Wanda's symlink functionality. Covered by #59744
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
understood, let's bump up wanda version in another PR first.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
|
End stack release test running here: https://buildkite.com/ray-project/release/builds/74229/steps/canvas Mostly passing, with failures possibly related to flakes. Re-running those |
164e2e4 to
b4fb3e0
Compare
cd79253 to
7d257e4
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
high-level looks really promising!
maybe split the ray-cpp wheel parts out in a follow up PR? although we are installing ray-cpp wheel in the release image, actually no release tests need ray-cpp wheel in there, so we might be able to just skip it when doing release tests, and we can maybe consider release another ray-cpp image, just to reduce the size of our released ray image.
.buildkite/build.rayci.yml
Outdated
| matrix: | ||
| - "3.10" | ||
| - "3.11" | ||
| - "3.12" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does ray-cpp does not yet have python 3.13?
one thing is that ray-cpp wheel's content is actually exactly the same for different python versions.. we might actually make it python version agnostic since we are here..
.buildkite/build.rayci.yml
Outdated
| - wanda_image="$RAYCI_WORK_REPO:$RAYCI_BUILD_ID-ray-wheel-py{{matrix}}" | ||
| - container_id=$(docker create $wanda_image) | ||
| - mkdir -p /tmp/wheels | ||
| - docker cp ${container_id}:/ /tmp/wheels/ | ||
| - docker rm ${container_id} | ||
| - mkdir -p .whl | ||
| - mv /tmp/wheels/*.whl .whl/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could you make these a bash script in ./ci/build/ and call that with bash?
same with the ray-cpp one.
we try not to write many shell command lines on buildkite test step definitions. this makes the scripts easier to run and debug locally.
| - docker rm ${container_id} | ||
| - mkdir -p .whl | ||
| - mv /tmp/wheels/*.whl .whl/ | ||
| - ./ci/build/copy_build_artifacts.sh wheel |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe add explicit bash here too?
.buildkite/build.rayci.yml
Outdated
| # Gating: only on releases/* OR (master AND nightly) | ||
| - label: ":s3: upload: wheel py{{matrix}} (x86_64)" | ||
| key: linux_wheels_upload | ||
| if: build.branch =~ /^releases\// || (build.branch == "master" && build.env("RAYCI_SCHEDULE") == "nightly") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it already has skip-on-premerge tag, so this if statement is not really needed.
there are cases where people sometimes want to build wheels and upload to the s3 from random branches, and we do allow that.
ci/docker/ray-cpp-wheel.wanda.yaml
Outdated
| - pyproject.toml | ||
| - README.rst | ||
| - ci/build/build-manylinux-wheel.sh | ||
| - python/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I actually don't think ray-cpp wheel building needs most of the files under python/.. because the wheel does not really contain python files, but mostly just cpp headers.
which means this ray-cpp wheel build might be cache-able.
|
also, could you verify that the wheel being built in this way, if from the same source code, has exactly the same content as the wheels that are built in the current/old way? |
7d257e4 to
e93a821
Compare
e93a821 to
8a4bdf5
Compare
|
I've made updates to split out the
Let me get some validation going to confirm the full structure. Then I'll re-mark this as ready for review! |
8a4bdf5 to
a0b03e9
Compare
|
Ran verifications against commit f6c2b5f on branch #59971 ResultsCompared wanda-built wheel against S3 reference wheel:
Files Compared
|
| Property | Remote | Local | Status |
|---|---|---|---|
| Size | 42,215,040 bytes | 42,215,040 bytes | Match |
| Symbols | - | - | No differences |
| Embedded paths | none | none | Clean |
Conclusion: Binary difference is due to non-deterministic compilation (linker ordering, etc.), not code differences. Functionally equivalent.
Expected Differences
_version.py: Different commit hash due to cherry-pick (f6c2b5f7...vsd70dfe31...)METADATA:Requires-Distentries in different order (Python dict iteration order variance) - same dependencies, different serialization order
32347fa to
9074421
Compare
9074421 to
1b2c341
Compare
ci/docker/ray-wheel.Dockerfile
Outdated
| # Verify required artifacts exist before unpacking | ||
| for f in /tmp/ray_pkg.zip /tmp/ray_py_proto.zip /tmp/ray_java_pkg.zip /tmp/dashboard.tar.gz; do | ||
| [[ -f "$f" ]] || { echo "ERROR: missing artifact: $f"; exit 1; } | ||
| done |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these are copied? how will they possibly be missing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. Removing
ci/docker/ray-wheel.Dockerfile
Outdated
| ./ci/build/build-manylinux-wheel.sh "$PY_BIN" | ||
|
|
||
| # Sanity check: ensure wheels exist | ||
| shopt -s nullglob |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what does this line do?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It changes how loops operate when globs don't match anything
Without nullglob:
for f in *.log; do
echo "processing $f"
done
If there are no .log files, the loop runs once with f literally equal to *.log.
With nullglob:
If there are no matches, the loop runs zero times
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, where that is a nice trick, it is changing the global settings. one can append some other script without knowing about this detail.
maybe just use wheels = ($(find . -maxdepth 0 -name '*.whl'))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done!
ci/docker/ray-wheel.Dockerfile
Outdated
| ENV BUILDKITE_COMMIT=${BUILDKITE_COMMIT:-unknown} \ | ||
| PYTHON_VERSION=${PYTHON_VERSION} \ | ||
| SKIP_BAZEL_BUILD=1 \ | ||
| RAY_DISABLE_EXTRA_CPP=1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could you move SKIP_BAZEL_BUILD RAY_DISABLE_EXTRA_CPP into the inlined bash script rather than making it image-wide env, and make it closer to the wheel building script's calling point? the other ones are also not necessary to be defined here I think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ci/docker/ray-wheel.Dockerfile
Outdated
| COPY --chown=2000:100 ci/build/build-manylinux-wheel.sh ci/build/ | ||
| COPY --chown=2000:100 README.rst pyproject.toml ./ | ||
| COPY --chown=2000:100 rllib/ rllib/ | ||
| COPY --chown=2000:100 python/ python/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what does 2000:100 mean? is it also used somewhere else?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was shooting for passing ownership as non-root to the forge user (2000) . There's a similar pattern here, though it may not be necessary after all
ray/ci/ray_ci/builder_container.py
Line 39 in dad1efb
| "chown -R 2000:100 /artifact-mount", |
I'll try running a build without this to see if it succeeds!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tried running without the ownership passing, and since we're running as USER forge, it runs into issues since it cannot create the necessary directory structure
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can however swap up to just chowning to the user. Let me try that for clarity
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ci/build/extract_wanda_wheels.sh
Outdated
|
|
||
| # Image has no default CMD, so provide a dummy command. | ||
| container_id="$(docker create "${WANDA_IMAGE}" /no-such-cmd)" | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can actually use crane export to directly get a tarball of the files, especially for container images that only have one layer with some files that is stored in a container registry, it works pretty reliably.
and you can avoid doing all the docker pull/cp/rm dance.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sweet! So long as the uploader workers have crane, this'll be awesome. Added https://github.com/ray-project/ray/compare/1b2c341fd206b05cd29344f244ac8d7524884d91..05785cb42e44f52e04542fae73c7f4e71cb767ed#diff-236c8a31e5fa459f5409c9ba6f36706170109b0d377f98ba38b3b0d4943a2c97R28
1b2c341 to
05785cb
Compare
ci/docker/ray-wheel.Dockerfile
Outdated
| ./ci/build/build-manylinux-wheel.sh "$PY_BIN" | ||
|
|
||
| # Sanity check: ensure wheels exist | ||
| shopt -s nullglob |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, where that is a nice trick, it is changing the global settings. one can append some other script without knowing about this detail.
maybe just use wheels = ($(find . -maxdepth 0 -name '*.whl'))
ci/docker/ray-wheel.Dockerfile
Outdated
| wheels=(.whl/*.whl) | ||
| if (( ${#wheels[@]} == 0 )); then | ||
| echo "ERROR: No wheels produced in .whl/" | ||
| ls -la .whl || true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: why add the || true?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's for a nit reason. In the case .whl doesn't exist, we return with exit code 2. This makes sure we consistently return exit code 1 on failure
| @@ -1 +1 @@ | |||
| 0.21.0 | |||
| 0.22.0 | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
understood, let's bump up wanda version in another PR first.
This migrates ray wheel builds from CLI-based approach to wanda-based container builds for x86_64. Changes: - Add ray-wheel.wanda.yaml and Dockerfile for wheel builds - Update build.rayci.yml wheel steps to use wanda - Add wheel upload steps that extract from wanda cache Topic: ray-wheel Signed-off-by: andrew <andrew@anyscale.com>
05785cb to
37410f9
Compare
This migrates ray wheel builds from CLI-based approach to wanda-based
container builds for x86_64.
Changes:
Topic: ray-wheel