ranger-ross (Member) commented Oct 11, 2025

What does this PR try to resolve?

This is an experiment in adding fine grain locking (at the build unit level) during compilation.
With #15947 merged, we are unblocked to start experimenting with the more granular locking tracked in #4282.

The primary goal of this PR is to evaluate locking schemes and review their trade-offs (e.g. performance, complexity).

Implementation approach / details

The approach is to add a lock file to each build unit dir (build-dir/<profile>/build/<pkg>/<hash>/lock) and acquire an exclusive lock on it during the compilation of that unit, as well as a shared lock on each of its dependencies. These locks are taken using std::fs::File::{lock, lock_shared}.

For this experiment, I found it easier to create the locking from scratch rather than reusing the existing locking systems in Filesystem and CacheLocker, as their interfaces require gctx, which is out of scope during the actual compilation phase passed to Work::new(). (Plumbing gctx into it, while possible, was a bit annoying due to lifetime issues.)

I encapsulated all of the locking logic into CompilationLock in locking.rs.

Note: For now I simply reused the -Zbuild-dir-new-layout flag to enable fine grain locking, though we may want a standalone flag for this in the future.

Benchmarking and experimenting

After verifying that the compilation functionality works, I ran some basic benchmarks with hyperfine on a test crate with ~200 total dependencies, to represent a typical small to medium sized crate. Benchmarks were run on a Fedora Linux x86 machine with a 20-core CPU.

Cargo.toml
```toml
[dependencies]
clap = { version = "4.5.48", features = ["derive"] }
syn = "2.0.106"
tokio = { version = "1", features = ["full"]}
actix-web = "4"
```

(I didn't put a lot of thought into the specific dependencies. I simply grabbed some crates I knew had a good number of transitive dependencies, so I did not need to add a lot of dependencies manually.)

Results:

```text
> hyperfine --runs 10 --prepare 'rm -rf target' '/home/ross/projects/cargo/target/release/cargo build' --prepare 'rm -rf target' '/home/ross/projects/cargo/target/release/cargo -Zbuild-dir-new-layout build'
Benchmark 1: /home/ross/projects/cargo/target/release/cargo build
  Time (mean ± σ):      9.997 s ±  0.078 s    [User: 78.805 s, System: 12.906 s]
  Range (min … max):    9.888 s … 10.122 s    10 runs

Benchmark 2: /home/ross/projects/cargo/target/release/cargo -Zbuild-dir-new-layout build
  Time (mean ± σ):     10.940 s ±  0.167 s    [User: 76.551 s, System: 12.809 s]
  Range (min … max):   10.652 s … 11.157 s    10 runs

Summary
  /home/ross/projects/cargo/target/release/cargo build ran
    1.09 ± 0.02 times faster than /home/ross/projects/cargo/target/release/cargo -Zbuild-dir-new-layout build
```

From the results above, we can see we are taking nearly a 10% performance hit due to the locking overhead, which is quite bad IMO...
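The summary ratio can be checked from the reported means and standard deviations. The sketch below assumes first-order error propagation for a quotient, σ_r = r·√((σ_a/a)² + (σ_b/b)²). Applied to the numbers above, it reproduces hyperfine's "1.09 ± 0.02", and "1.12 ± 0.02" for the larger project further down. `slowdown` is a hypothetical helper name.

```rust
/// Ratio of two mean run times and its propagated uncertainty:
/// sigma_r = r * sqrt((sd_fast / fast)^2 + (sd_slow / slow)^2).
fn slowdown(fast_mean: f64, fast_sd: f64, slow_mean: f64, slow_sd: f64) -> (f64, f64) {
    let ratio = slow_mean / fast_mean;
    let rel = ((fast_sd / fast_mean).powi(2) + (slow_sd / slow_mean).powi(2)).sqrt();
    (ratio, ratio * rel)
}

fn main() {
    // Means and standard deviations (seconds) from the first benchmark above.
    let (r, sd) = slowdown(9.997, 0.078, 10.940, 0.167);
    // prints "1.09 ± 0.02 (9.4% slower)"
    println!("{:.2} ± {:.2} ({:.1}% slower)", r, sd, (r - 1.0) * 100.0);
}
```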

Out of curiosity, I also tried taking the shared locks in parallel using rayon's .par_iter() to see if that would improve the situation.

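Code change:
```rust
// src/cargo/core/compiler/locking.rs
use rayon::prelude::*; // brings `par_iter()` into scope
use std::fs::OpenOptions;

// (inside the method that acquires this unit's locks)
// Take a shared lock on every dependency's lock file in parallel.
// The returned `File`s must be kept alive: dropping one releases its lock.
let dependency_locks = self
    .dependency_units
    .par_iter() // <------- CHANGED THIS
    .map(|d| {
        let f = OpenOptions::new()
            .create(true)
            .write(true)
            .append(true)
            .open(d)
            .unwrap();
        f.lock_shared().unwrap();
        f
    })
    .collect::<Vec<_>>();
```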

```text
> hyperfine --runs 10 --prepare 'rm -rf target' '/home/ross/projects/cargo/target/release/cargo build' --prepare 'rm -rf target' '/home/ross/projects/cargo/target/release/cargo -Zbuild-dir-new-layout build'
Benchmark 1: /home/ross/projects/cargo/target/release/cargo build
  Time (mean ± σ):     10.065 s ±  0.084 s    [User: 78.569 s, System: 12.987 s]
  Range (min … max):    9.945 s … 10.215 s    10 runs

Benchmark 2: /home/ross/projects/cargo/target/release/cargo -Zbuild-dir-new-layout build
  Time (mean ± σ):     10.904 s ±  0.100 s    [User: 75.767 s, System: 12.876 s]
  Range (min … max):   10.758 s … 11.068 s    10 runs

Summary
  /home/ross/projects/cargo/target/release/cargo build ran
    1.08 ± 0.01 times faster than /home/ross/projects/cargo/target/release/cargo -Zbuild-dir-new-layout build
```

However, this did not really improve things by much, if at all.

Another idea I had was to see if taking a lock on the build unit directory (build-dir/<profile>/build/<pkg>/<hash>) directly instead of writing a dedicated lock file would have any effect. However, this also had minimal if any improvement compared to using a standalone file.

```text
> hyperfine --runs 10 --prepare 'rm -rf target' '/home/ross/projects/cargo/target/release/cargo build' --prepare 'rm -rf target' '/home/ross/projects/cargo/target/release/cargo -Zbuild-dir-new-layout build'
Benchmark 1: /home/ross/projects/cargo/target/release/cargo build
  Time (mean ± σ):     10.082 s ±  0.055 s    [User: 78.192 s, System: 12.938 s]
  Range (min … max):    9.984 s … 10.183 s    10 runs

Benchmark 2: /home/ross/projects/cargo/target/release/cargo -Zbuild-dir-new-layout build
  Time (mean ± σ):     10.829 s ±  0.104 s    [User: 76.385 s, System: 12.765 s]
  Range (min … max):   10.613 s … 10.987 s    10 runs

Summary
  /home/ross/projects/cargo/target/release/cargo build ran
    1.07 ± 0.01 times faster than /home/ross/projects/cargo/target/release/cargo -Zbuild-dir-new-layout build
```

I also benchmarked a larger project with ~750 dependencies to see how the changes scale with large projects.
Note: This is without rayon and using the lock file setup from the first benchmark above.

Cargo.toml
```toml
[dependencies]
clap = { version = "4.5.48", features = ["derive"] }
syn = "2.0.106"
tokio = { version = "1", features = ["full"]}
actix-web = "4"
axum = "0.8"
ratatui = "0.29"
aws-sdk-s3 = "1"
aws-sdk-dynamodb = "1"
serde = { version = "1", features = ["derive"] }
rand = "0.9"
sqlx = { version = "0.8", features = ["runtime-tokio-rustls", "postgres", "mysql", "macros"] }
bevy = "0.17"
```

```text
> hyperfine --runs 10 --prepare 'rm -rf target' '/home/ross/projects/cargo/target/release/cargo build' --prepare 'rm -rf target' '/home/ross/projects/cargo/target/release/cargo -Zbuild-dir-new-layout build'
Benchmark 1: /home/ross/projects/cargo/target/release/cargo build
  Time (mean ± σ):     63.624 s ±  0.895 s    [User: 645.249 s, System: 77.388 s]
  Range (min … max):   62.818 s … 65.855 s    10 runs

Benchmark 2: /home/ross/projects/cargo/target/release/cargo -Zbuild-dir-new-layout build
  Time (mean ± σ):     70.956 s ±  0.546 s    [User: 563.547 s, System: 69.584 s]
  Range (min … max):   70.090 s … 71.517 s    10 runs

Summary
  /home/ross/projects/cargo/target/release/cargo build ran
    1.12 ± 0.02 times faster than /home/ross/projects/cargo/target/release/cargo -Zbuild-dir-new-layout build
```

Other observations

  • The penalty appears to scale with project size. For projects with fewer than 30 dependencies, the penalty was generally less than 1%. It also seems to flatten out at around a 10%-15% penalty.

I also ran a baseline to make sure the performance loss was not coming from the layout restructuring (as opposed to the added locking) by running the same benchmark without the locking changes (built from commit 81c3f77).

```text
> hyperfine --runs 10 --prepare 'rm -rf target' '/home/ross/projects/cargo/target/release/cargo build' --prepare 'rm -rf target' '/home/ross/projects/cargo/target/release/cargo -Zbuild-dir-new-layout build'
Benchmark 1: /home/ross/projects/cargo/target/release/cargo build
  Time (mean ± σ):      9.522 s ±  0.099 s    [User: 73.558 s, System: 11.183 s]
  Range (min … max):    9.332 s …  9.676 s    10 runs

Benchmark 2: /home/ross/projects/cargo/target/release/cargo -Zbuild-dir-new-layout build
  Time (mean ± σ):      9.489 s ±  0.104 s    [User: 73.694 s, System: 11.129 s]
  Range (min … max):    9.291 s …  9.668 s    10 runs

Summary
  /home/ross/projects/cargo/target/release/cargo -Zbuild-dir-new-layout build ran
    1.00 ± 0.02 times faster than /home/ross/projects/cargo/target/release/cargo build
```

rustbot added the A-build-execution (Area: anything dealing with executing the compiler) and A-layout (Area: target output directory layout, naming, and organization) labels Oct 11, 2025
ranger-ross (Member, Author) commented Oct 11, 2025

After some more digging, I think a large part of the performance regression here is due to the locking causing jobs to wait for both rmeta AND rlibs to be generated before proceeding.

Below is a trace view to illustrate:

[Image: trace view]

The lock span is the time a job spends waiting to acquire the locks it needs to proceed.

We can see that as soon as the .rmeta is produced, the job queue allows the next job to run, but since the exclusive lock is not released until the crate is fully compiled, the next job waits because it cannot get a shared lock.


We may need to create a more complicated locking mechanism, similar to the crate cache, that would allow us to downgrade to a shared lock or have dedicated lock states like rmeta_produced.
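The "dedicated lock states" idea can be illustrated in-process. In this sketch, a unit publishes its progress, and a dependent that only needs metadata waits for `RmetaProduced` rather than `Complete`. It uses `Mutex` and `Condvar`. The `UnitState` and `UnitProgress` names are hypothetical. A real cross-process version would need file locks (for example, one per state) instead of in-memory synchronization.

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::thread;

/// Hypothetical per-unit progress; the ordering lets waiters ask for "at least" a state.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum UnitState {
    Compiling,
    RmetaProduced,
    Complete,
}

struct UnitProgress {
    state: Mutex<UnitState>,
    changed: Condvar,
}

impl UnitProgress {
    fn new() -> Self {
        UnitProgress { state: Mutex::new(UnitState::Compiling), changed: Condvar::new() }
    }

    /// Called by the compiling job as it advances (e.g. when rustc emits the .rmeta).
    fn advance(&self, to: UnitState) {
        *self.state.lock().unwrap() = to;
        self.changed.notify_all();
    }

    /// Block until the unit has reached at least `needed`.
    fn wait_for(&self, needed: UnitState) -> UnitState {
        let guard = self
            .changed
            .wait_while(self.state.lock().unwrap(), |s| *s < needed)
            .unwrap();
        *guard
    }
}

fn main() {
    let dep = Arc::new(UnitProgress::new());
    let producer = {
        let dep = Arc::clone(&dep);
        thread::spawn(move || {
            dep.advance(UnitState::RmetaProduced); // pipelined dependents may start now
            // ... codegen continues ...
            dep.advance(UnitState::Complete); // dependents needing the rlib may start now
        })
    };
    // A dependent that only needs metadata does not wait for codegen to finish.
    let seen = dep.wait_for(UnitState::RmetaProduced);
    assert!(seen >= UnitState::RmetaProduced);
    producer.join().unwrap();
    assert_eq!(dep.wait_for(UnitState::Complete), UnitState::Complete);
    println!("ok");
}
```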

ehuss (Contributor) commented Oct 11, 2025

How do you plan to handle deadlocks?

EDIT: Though thinking more... Probably not an issue. I was thinking of cycles, but maybe dev-dep cycles will have a different hash?

ranger-ross (Member, Author) commented:
> How do you plan to handle deadlocks?
>
> EDIT: Though thinking more... Probably not an issue. I was thinking of cycles, but maybe dev-dep cycles will have a different hash?

Yeah, my assumption is that there would be no cycles in the unit graph, so if a unit is scheduled to run, all of its dependencies have already been built and their locks have been released.
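That argument rests on the unit graph being a DAG. The sketch below checks this property with Kahn's algorithm, assuming a hypothetical adjacency-list representation where `deps[i]` lists the indices of unit `i`'s dependencies. It returns a build order in which every unit comes after all of its dependencies, or `None` if a cycle exists.

```rust
use std::collections::VecDeque;

/// Kahn's algorithm over a unit graph where `deps[i]` lists the units that `i` depends on.
/// Returns a build order (dependencies first), or `None` if the graph has a cycle.
fn build_order(deps: &[Vec<usize>]) -> Option<Vec<usize>> {
    let n = deps.len();
    // remaining[i] = number of not-yet-built dependencies of unit i
    let mut remaining: Vec<usize> = deps.iter().map(|d| d.len()).collect();
    // dependents[d] = units waiting on d
    let mut dependents = vec![Vec::new(); n];
    for (unit, ds) in deps.iter().enumerate() {
        for &d in ds {
            dependents[d].push(unit);
        }
    }
    let mut ready: VecDeque<usize> = (0..n).filter(|&i| remaining[i] == 0).collect();
    let mut order = Vec::with_capacity(n);
    while let Some(unit) = ready.pop_front() {
        order.push(unit); // "build" the unit; its exclusive lock would be released here
        for &dependent in &dependents[unit] {
            remaining[dependent] -= 1;
            if remaining[dependent] == 0 {
                ready.push_back(dependent);
            }
        }
    }
    // Any unit that never became ready sits on (or behind) a cycle.
    if order.len() == n { Some(order) } else { None }
}

fn main() {
    // unit 1 depends on 0; unit 2 depends on 0 and 1
    println!("{:?}", build_order(&[vec![], vec![0], vec![0, 1]])); // Some([0, 1, 2])
    // units 0 and 1 depend on each other
    println!("{:?}", build_order(&[vec![1], vec![0]])); // None
}
```

Because a unit only becomes ready after all of its dependencies are done, no job can ever wait on a lock held by a job that is itself waiting on it.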


rustbot added the S-waiting-on-author label (Status: The marked PR is awaiting some action (such as code changes) from the PR author.) Oct 13, 2025
ranger-ross force-pushed the experiment-with-fine-grain-locking branch from 64e5bf3 to f221f5e October 14, 2025 14:50
ranger-ross force-pushed the experiment-with-fine-grain-locking branch from f221f5e to 3fc0143 October 16, 2025 11:08
ranger-ross (Member, Author) commented:
I reverted the multiple-locks-per-build-unit approach for now.
I posted a comment on the tracking issue with some design proposals, but we still have not fleshed out the direction we want to go with this.

I plan to discuss the path forward in more detail at the next Cargo team meeting.

ranger-ross (Member, Author) commented:
Closing this in favor of #16155

github-merge-queue bot pushed a commit that referenced this pull request Dec 30, 2025
This PR adds fine grain locking for the build cache using build unit
level locking.
I'd recommend reading the design details in this description and then
reviewing commit by commit.
Part of #4282

Previous attempt: #16089

## Design decisions / rationale

- Still hold `build-dir/<profile>/.cargo-lock`
  - to protect against `cargo clean` (exclusive)
  - changed from exclusive to shared for builds
- Using build unit level locking with a single lock per build unit.
- Before checking fingerprint freshness we take a shared lock. This
  prevents reading a fingerprint while another build is active.
- For units that are dirty, when the job server queues the job we take
  an exclusive lock to prevent others from reading while we compile.
  - This is done by dropping the shared lock and then acquiring an
    exclusive lock, rather than upgrading the lock, to protect against
    deadlock, see
    #16155 (comment)
- After the unit's compilation is complete, we downgrade back to a
  shared lock, allowing other readers.
  - All locks are released at the end of the entire build process.
- artifact-dir was handled in #16307.

For the rationale behind this design, see the discussion [#t-cargo > Build
cache and locking design @
💬](https://rust-lang.zulipchat.com/#narrow/channel/246057-t-cargo/topic/Build.20cache.20and.20locking.20design/near/561677181)

## Open Questions

- [ ] Do we need rlimit checks and dynamic rlimits?
  #16155 (comment)
- [ ] Proper handling of the blocking message
  (#16155 (comment))
  - Update Dec 18 2025: With the updated impl, we now get the blocking message
    when taking the initial shared lock, but we get no message when taking
    the exclusive lock right before compiling.
- [ ] Reduce parallelism when blocking
- [x] How do we want to handle locking on the artifact directory?
  - We could simply continue using coarse grain locking, locking and
    unlocking when files are uplifted.
  - One downside of locking/unlocking multiple times per invocation is
    that artifact-dir is touched many times across the compilation process
    (for example, there is a pre-rustc [clean up
    step](https://github.com/rust-lang/cargo/blob/master/src/cargo/core/compiler/mod.rs#L402)).
    Also we need to take into account other commands like `cargo doc`.
  - Another option would be to only take a lock on the artifact-dir for
    commands that we know will uplift files (e.g. `cargo check` would not
    take a lock on artifact-dir but `cargo build` would). This would mean that
    2 `cargo build` invocations would not run in parallel because one of
    them would hold the lock on artifact-dir (blocking the other). This might
    actually be ideal to avoid 2 instances fighting over the CPU while
    recompiling the same crates.
  - Solved by #16307
- [ ] What should our testing strategy for locking be?
  - My testing strategy thus far has been to run cargo on dummy projects
    to verify the locking.
  - For the max file descriptor testing, I have been using the Zed
    codebase as a testbed, as it has over 1,500 build units, which is more
    than the default ulimit on my Linux system. (I am happy to test this on
    other large codebases that we think would be good to verify against.)
  - It's not immediately obvious to me how to create repeatable unit
    tests for this or what those tests should be testing for.
  - For performance testing, I have been using hyperfine to benchmark
    builds with and without `-Zbuild-dir-new-layout`. With the current
    implementation I am not seeing any perf regression on Linux, but I have
    yet to test on Windows/macOS.

---

<details><summary>Original Design</summary>

- Using build unit level locking instead of a temporary working
  directory.
  - After experimenting with multiple approaches, I am currently leaning
    towards build unit level locking.
  - The working directory approach introduces a fair bit of uplifting
    complexity, and the further along I pushed my prototype, the more I ran into
    unexpected issues.
    - mtime changes in fingerprints due to uplifting/downlifting order
    - tests/benches need to be run before being uplifted OR uplifted and
      locked during execution, which leads to more locking design being needed. (Also,
      running pre-uplift introduces other potential side effects, like the path
      displayed to the user being deleted as it's temporary.)
  - The trade-off here is that with build unit level locks, we need a more
    advanced locking mechanism and we will have more open locks at once.
  - The reason I think this is a worthwhile trade-off is that the locking
    complexity can largely be contained to a single module, whereas the uplifting
    complexity would be spread throughout the cargo codebase anywhere we do
    uplifting. The increased lock count, while unavoidable, can be mitigated
    (see below for more details).
- Risk of too many locks (file descriptors)
  - On Linux 1024 is a fairly common default soft limit. Windows is even
    lower at 256.
  - Having 2 locks per build unit makes it possible to hit this limit with a moderate
    number of dependencies.
  - There are a few mitigations I could think of for this problem (that
    are included in this PR):
    - Increasing the file descriptor limit based on the number of build
      units (if the hard limit is high enough)
    - Sharing file descriptors for shared locks across jobs (within a single
      process) using a virtual lock
      - This could be implemented using reference counting.
    - Falling back to coarse grain locking if some heuristic is not met

### Implementation details

- We have a stateful lock per build unit made up of multiple file locks,
  `primary.lock` and `secondary.lock` (see the `locking.rs` module docs
  for more details on the states)
  - This is needed to enable pipelined builds
- We fall back to coarse grain locking if fine grain locking is
  determined to be unsafe (see `determine_locking_mode()`)
- Fine grain locking continues to take the existing `.cargo-lock` lock
  as RO shared to continue working with older cargo versions while
  allowing multiple newer cargo instances to run in parallel.
- Locking is disabled on network filesystems (keeping the existing behavior
  from #2623).
- `cargo clean` continues to use coarse grain locking for simplicity.
- File descriptors
  - I added functionality to increase the file descriptor limit if cargo
    detects that there will not be enough based on the number of build units
    in the `UnitGraph`.
  - If we aren't able to increase the limit to a threshold (currently `number of build
    units * 10`), we automatically fall back to coarse grain locking and
    display a warning to the user.
    - I picked 10 times the number of build units as a conservative estimate
      for now. I think lowering this number may be reasonable.
    - While testing, I was seeing a peak of ~3,200 open file descriptors
      while compiling Zed. This is approximately 2x the number of build units.
    - Without the `RcFileLock` I was seeing peaks of ~12,000 open fds, which
      I felt was quite high even for a large project like Zed.
- We use a global `FileLockInterner` that holds on to the file
  descriptors (`RcFileLock`) until the end of the process. (We could
  potentially add it to `JobState` if preferred; it would just be a bit
  more plumbing.)
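
The fallback rule above can be written as a pure function: raise the soft limit if the hard limit allows it, otherwise fall back to coarse grain locking. This is a hedged sketch. `LockingMode` and `choose_locking_mode` are hypothetical names, not the PR's `determine_locking_mode()`. In a real implementation the limits would come from `getrlimit(RLIMIT_NOFILE)`. The factor of 10 is the conservative estimate described above.

```rust
/// Hypothetical outcome of the file-descriptor check.
#[derive(Debug, PartialEq)]
enum LockingMode {
    /// Fine grain locking; `Some(n)` means the soft limit must first be raised to `n`.
    Fine { raise_soft_limit_to: Option<u64> },
    /// Not enough descriptors available: use the single `.cargo-lock` and warn.
    Coarse,
}

/// Descriptors to budget per build unit (the conservative estimate from the PR).
const FDS_PER_UNIT: u64 = 10;

fn choose_locking_mode(build_units: u64, soft_limit: u64, hard_limit: u64) -> LockingMode {
    let needed = build_units.saturating_mul(FDS_PER_UNIT);
    if needed <= soft_limit {
        LockingMode::Fine { raise_soft_limit_to: None }
    } else if needed <= hard_limit {
        LockingMode::Fine { raise_soft_limit_to: Some(needed) }
    } else {
        LockingMode::Coarse
    }
}

fn main() {
    // ~1,500 build units (the Zed example), a 1024 soft limit, and a hypothetical roomy hard limit.
    println!("{:?}", choose_locking_mode(1_500, 1_024, 524_288)); // Fine { raise_soft_limit_to: Some(15000) }
    // The same project when the hard limit leaves no headroom.
    println!("{:?}", choose_locking_mode(1_500, 1_024, 4_096)); // Coarse
}
```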

See #16155 (comment)
for the proposal to transition away from this to the current scheme.

</details>