Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reduce retention and early materialization of VirtualActionInput #24891

Closed
wants to merge 13 commits into from

Conversation

fmeum
Copy link
Collaborator

@fmeum fmeum commented Jan 10, 2025

b82dcdc introduced a way for parameter files to be materialized lazily and in a streaming fashion. For this to be effective, the contents of VirtualActionInputs should not be retained and, ideally, never or only briefly materialized in full.

This is achieved by replacing the digest computations with a streaming approach and avoiding retention of VirtualActionInput in MerkleTrees, which may be cached. All cache protocols are streamlined to accept a Blob, which is a closeable supplier of InputStreams that is automatically closed after the upload has completed (which may require retries).

Along the way, file nodes in MerkleTrees are stored unwrapped to reduce memory usage.

@fmeum fmeum marked this pull request as ready for review January 10, 2025 13:16
@fmeum fmeum requested a review from a team as a code owner January 10, 2025 13:16
@github-actions github-actions bot added team-Performance Issues for Performance teams team-Remote-Exec Issues and PRs for the Execution (Remote) team awaiting-review PR is awaiting review from an assigned reviewer labels Jan 10, 2025
@fmeum
Copy link
Collaborator Author

fmeum commented Jan 10, 2025

@justinhorvitz Context: I'm currently investigating higher memory usage with path mapping enabled and am thus planning to make it compatible with your command line preprocessing logic. While working on that, I noticed that Bazel may not make good use of this optimization as it's RE implementation may retain full param file contents and some other places would at least temporarily materialize the full file just to calculate a digest. This PR is meant to address that. Ignoring the specifics of the integration with Bazel's RE backend, which other folks can review, do you consider it reasonable?

@fmeum fmeum force-pushed the efficient-action-input-writes branch from 61fbbcb to 14916d4 Compare January 10, 2025 14:36
// TODO: Avoid materializing the entire file in memory. This requires changing the
// upload to be driven by an OutputStream rather than consuming an InputStream.
remoteCacheClient.uploadBlob(
context, digest, () -> virtualActionInput.getBytes().newInput());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might virtualActionInput.getBytes() be called multiple times by Chunker? The approach we use for the Google-internal RE is:

  • Materialize the ByteString as late as possible, using double-checked locking, storing it as a field in the "supplier"
  • The "supplier" has a close() method that is called when all chunks have been sent to the CAS. Calling this nulls the ByteString field.

@justinhorvitz
Copy link
Contributor

temporarily materialize the full file just to calculate a digest

This is a good catch. The google-internal RE code does not do this. Instead, it makes use of VirtualActionInput#writeTo, passing an OutputStream that computes the digest on the fly. It looks like that's the same approach you've gone with in this PR.

I left a comment about the way you're passing a supplier into Chunker. I think you have the right approach, just need to make sure that VirtualActionInput#getBytes is only called once.

@fmeum fmeum force-pushed the efficient-action-input-writes branch from 3adf3ed to 560b5b2 Compare January 11, 2025 23:09
@fmeum fmeum marked this pull request as draft January 12, 2025 08:22
@fmeum fmeum force-pushed the efficient-action-input-writes branch 2 times, most recently from ada9cdb to 5555522 Compare January 13, 2025 10:50
@fmeum fmeum marked this pull request as ready for review January 13, 2025 10:52
@fmeum fmeum requested review from justinhorvitz and tjgq and removed request for a team January 13, 2025 10:54
@tjgq tjgq added awaiting-PR-merge PR has been approved by a reviewer and is ready to be merge internally and removed awaiting-review PR is awaiting review from an assigned reviewer labels Jan 14, 2025
@fmeum
Copy link
Collaborator Author

fmeum commented Jan 14, 2025

@bazel-io fork 8.1.0

@iancha1992
Copy link
Member

@fmeum Missing a Javadoc comment in line 62 and 64 from src/main/java/com/google/devtools/build/lib/remote/merkletree/MerkleTree.java

@fmeum fmeum force-pushed the efficient-action-input-writes branch from 5555522 to ed0a1ee Compare January 15, 2025 06:39
@fmeum
Copy link
Collaborator Author

fmeum commented Jan 15, 2025

@fmeum Missing a Javadoc comment in line 62 and 64 from src/main/java/com/google/devtools/build/lib/remote/merkletree/MerkleTree.java

@iancha1992 Done!

@tjgq
Copy link
Contributor

tjgq commented Jan 20, 2025

Note: I had to amend the VirtualActioninput#getBytes signature to throws IOException as we have Google-internal overrides with a throwing signature. (I suspect they all boil down to an in-memory operation that cannot throw, but there's too much indirection for me to easily fix it. Sorry.)

@github-actions github-actions bot removed the awaiting-PR-merge PR has been approved by a reviewer and is ready to be merge internally label Jan 21, 2025
iancha1992 pushed a commit to iancha1992/bazel that referenced this pull request Jan 21, 2025
bazelbuild@b82dcdc introduced a way for parameter files to be materialized lazily and in a streaming fashion. For this to be effective, the contents of `VirtualActionInput`s should not be retained and, ideally, never or only briefly materialized in full.

This is achieved by replacing the digest computations with a streaming approach and avoiding retention of `VirtualActionInput` in `MerkleTree`s, which may be cached. All cache protocols are streamlined to accept a `Blob`, which is a closeable supplier of `InputStream`s that is automatically closed after the upload has completed (which may require retries).

Along the way, file nodes in `MerkleTree`s are stored unwrapped to reduce memory usage.

Closes bazelbuild#24891.

PiperOrigin-RevId: 718042524
Change-Id: I1567cf6ca36cbf5e1146c91df34157c9192bb822
github-merge-queue bot pushed a commit that referenced this pull request Jan 23, 2025
…nput` (#25008)

b82dcdc
introduced a way for parameter files to be materialized lazily and in a
streaming fashion. For this to be effective, the contents of
`VirtualActionInput`s should not be retained and, ideally, never or only
briefly materialized in full.

This is achieved by replacing the digest computations with a streaming
approach and avoiding retention of `VirtualActionInput` in
`MerkleTree`s, which may be cached. All cache protocols are streamlined
to accept a `Blob`, which is a closeable supplier of `InputStream`s that
is automatically closed after the upload has completed (which may
require retries).

Along the way, file nodes in `MerkleTree`s are stored unwrapped to
reduce memory usage.

Closes #24891.

PiperOrigin-RevId: 718042524
Change-Id: I1567cf6ca36cbf5e1146c91df34157c9192bb822

Co-authored-by: Fabian Meumertzheim <fabian@meumertzhe.im>
github-merge-queue bot pushed a commit that referenced this pull request Jan 24, 2025
…nput` (#25008)

b82dcdc
introduced a way for parameter files to be materialized lazily and in a
streaming fashion. For this to be effective, the contents of
`VirtualActionInput`s should not be retained and, ideally, never or only
briefly materialized in full.

This is achieved by replacing the digest computations with a streaming
approach and avoiding retention of `VirtualActionInput` in
`MerkleTree`s, which may be cached. All cache protocols are streamlined
to accept a `Blob`, which is a closeable supplier of `InputStream`s that
is automatically closed after the upload has completed (which may
require retries).

Along the way, file nodes in `MerkleTree`s are stored unwrapped to
reduce memory usage.

Closes #24891.

PiperOrigin-RevId: 718042524
Change-Id: I1567cf6ca36cbf5e1146c91df34157c9192bb822

Co-authored-by: Fabian Meumertzheim <fabian@meumertzhe.im>
@fmeum fmeum deleted the efficient-action-input-writes branch January 24, 2025 08:13
fmeum added a commit to fmeum/bazel that referenced this pull request Jan 29, 2025
bazelbuild@b82dcdc introduced a way for parameter files to be materialized lazily and in a streaming fashion. For this to be effective, the contents of `VirtualActionInput`s should not be retained and, ideally, never or only briefly materialized in full.

This is achieved by replacing the digest computations with a streaming approach and avoiding retention of `VirtualActionInput` in `MerkleTree`s, which may be cached. All cache protocols are streamlined to accept a `Blob`, which is a closeable supplier of `InputStream`s that is automatically closed after the upload has completed (which may require retries).

Along the way, file nodes in `MerkleTree`s are stored unwrapped to reduce memory usage.

Closes bazelbuild#24891.

PiperOrigin-RevId: 718042524
Change-Id: I1567cf6ca36cbf5e1146c91df34157c9192bb822
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
team-Performance Issues for Performance teams team-Remote-Exec Issues and PRs for the Execution (Remote) team
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants