feat: add graceful cancellation and --stop-on-first-error option in lake #13075

Open
marcelolynch wants to merge 10 commits into leanprover:master from marcelolynch:2026/03/StopOnFirstError

@marcelolynch (Contributor) commented Mar 24, 2026

This PR adds a --stop-on-first-error flag to lake build that stops scheduling new build jobs as soon as the first required-target failure is detected, then waits for already-running jobs to drain to completion before reporting failures and exiting.

Previously, Lake had no way to abort a build early on failure: it always waited for every scheduled job to finish. On large workspaces this means waiting for dozens of unrelated compilations to complete after the first error is already known.

The implementation introduces a generic cancellation mechanism: a cancelling? : Option IO.CancelToken field on BuildContext that is always created and threaded through the build. The --stop-on-first-error flag is one consumer that sets it. When the token is set, recBuildWithIndex short-circuits with Job.cancelled instead of scheduling new work, and registerJob skips adding the job to the monitor queue so interrupted pending jobs do not appear as phantom failures in the output. Jobs that are already running complete normally.
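
As a rough illustration of the check-before-scheduling idea described above, here is a minimal sketch. `Outcome`, `ToyContext`, and `schedule` are hypothetical stand-ins, not Lake's actual `Job`, `BuildContext`, or `recBuildWithIndex`; only `IO.CancelToken` and its `new`, `set`, and `isSet` operations come from Lean core.

```lean
/-- Toy outcome of a scheduling request (hypothetical, not Lake's `Job`). -/
inductive Outcome where
  | ran (name : String)
  | cancelled (name : String)
  deriving Repr

/-- Toy build context carrying an optional cancellation token (hypothetical). -/
structure ToyContext where
  cancelling? : Option IO.CancelToken

/-- Check the token before starting new work; skip the work if it is set. -/
def schedule (ctx : ToyContext) (name : String) (work : IO Unit) : IO Outcome := do
  if let some tk := ctx.cancelling? then
    if ← tk.isSet then
      return .cancelled name
  work
  return .ran name

def main : IO Unit := do
  let tk ← IO.CancelToken.new
  let ctx : ToyContext := { cancelling? := some tk }
  -- Scheduled before the token is set, so the work runs.
  let a ← schedule ctx "A" (pure ())
  -- Simulate a first failure being detected elsewhere in the build.
  tk.set
  -- Scheduled after the token is set, so the work is skipped.
  let b ← schedule ctx "B" (pure ())
  IO.println (repr a)
  IO.println (repr b)
```

Work that has already started is not interrupted in this sketch, which mirrors the description: only new scheduling consults the token.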

To support this cleanly, JobResult is promoted from an abbrev over EResult to a standalone inductive type with a third constructor .cancelled. The new constructor makes cancellation explicit and propagates correctly through all job combinators (zipWith, bindM, mapM, etc.), with error taking priority over cancellation in combined results.
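
The priority rule for combined results can be sketched with a toy three-state type. `ToyResult` and `ToyResult.zipWith` are illustrative assumptions, not Lake's `JobResult` or its combinators, and the toy omits the job state that the real type carries.

```lean
/-- Toy three-state result (hypothetical; the real `JobResult` also carries state). -/
inductive ToyResult (α : Type) where
  | ok (a : α)
  | error (msg : String)
  | cancelled
  deriving Repr

/-- Combine two results: an error on either side wins over cancellation. -/
def ToyResult.zipWith {α β γ : Type} (f : α → β → γ) :
    ToyResult α → ToyResult β → ToyResult γ
  | .ok a, .ok b => .ok (f a b)
  | .error m, _ => .error m
  | _, .error m => .error m
  | _, _ => .cancelled

-- Cancelled on the left, error on the right: the error case is returned.
#eval ToyResult.zipWith Nat.add (.cancelled : ToyResult Nat) (.error "boom")
-- Success combined with cancellation: the cancelled case is returned.
#eval ToyResult.zipWith Nat.add (.ok 1) (.cancelled : ToyResult Nat)
```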

Beyond the new flag, this PR establishes a general graceful cancellation primitive for Lake builds that opens up future use cases such as SIGINT/Ctrl-C handling (today, interrupting a Lake build hard-kills the process and orphans running subprocesses) and build timeouts for CI environments.

Closes #13074, addresses #2763

Prepared with Claude Code

@marcelolynch requested a review from tydeu as a code owner March 24, 2026 01:11
@github-actions bot added the toolchain-available label (A toolchain is available for this PR, at leanprover/lean4-pr-releases:pr-release-NNNN) Mar 24, 2026
mathlib-lean-pr-testing bot commented Mar 24, 2026

Mathlib CI status (docs):

  • ❗ Batteries/Mathlib CI will not be attempted unless your PR branches off the nightly-with-mathlib branch. Try git rebase b5036e4d81b399447a5c9f684da2d46d84910854 --onto 4bf7fa7447eea00cecba8327bb9c9e5f4485f0a7. You can force Mathlib CI using the force-mathlib-ci label. (2026-03-24 02:04:10)
  • ❗ Batteries/Mathlib CI will not be attempted unless your PR branches off the nightly-with-mathlib branch. Try git rebase b5036e4d81b399447a5c9f684da2d46d84910854 --onto e6df474dd9c3ad0e21771eaa808c53f66222216d. You can force Mathlib CI using the force-mathlib-ci label. (2026-03-24 17:42:38)

@leanprover-bot
Collaborator

Reference manual CI status:

  • ❗ Reference manual CI will not be attempted unless your PR branches off the nightly-with-manual branch. Try git rebase b5036e4d81b399447a5c9f684da2d46d84910854 --onto cfa8c5a036d6990635c6ec50b02d0e806995cec3. You can force reference manual CI using the force-manual-ci label. (2026-03-24 02:04:12)

Co-authored-by: Eric Wieser <wieser.eric@gmail.com>
match info with
| .target pkg target => do
  if let some tk := (← getBuildContext).cancelling? then
    if ← tk.isSet then return .error -- cancelled: skip scheduling new work
Contributor

Should this be the following?

Suggested change
- if ← tk.isSet then return .error -- cancelled: skip scheduling new work
+ if ← tk.isSet then return .error "Cancelled"

Contributor Author

This is the .error that builds a failed Job with an empty log, not the error that fails the monad.

Contributor Author

Arguably this is a bit shaky: it relies on the monitor's notion of failure being based on the log:

let failed := strictAnd log.hasEntries (maxLv ≥ failLv)

So an empty-log job is invisible to the monitor, and it won't appear in "Some required targets logged failures:".

Maybe this warrants a first-class "cancelled" state in JobResult instead, but that seems to require some deeper refactoring.
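
To make the empty-log behavior concrete, here is a small sketch of a failure check with the same shape as the quoted line. `ToyLevel` and `toyFailed` are hypothetical and simplify the log to a list of levels; `strictAnd` is the Lean core function used in the quoted code.

```lean
/-- Toy log level (hypothetical; Lake has its own `LogLevel`). -/
inductive ToyLevel where
  | info | warning | error
  deriving Repr

def ToyLevel.toNat : ToyLevel → Nat
  | .info => 0
  | .warning => 1
  | .error => 2

/-- A job counts as failed only if it logged something at or above `failLv`. -/
def toyFailed (log : List ToyLevel) (failLv : ToyLevel) : Bool :=
  let maxLv := log.foldl (fun (acc : Nat) (lv : ToyLevel) => max acc lv.toNat) 0
  strictAnd (!log.isEmpty) (decide (maxLv ≥ failLv.toNat))

#eval toyFailed [] .info                  -- false: an empty log is never a failure
#eval toyFailed [.warning, .error] .error -- true
```

Under this rule a job that ends in an error state without logging anything is not reported, which is the coupling the comment above calls shaky.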

Contributor

Perhaps worth a comment noting that the log is left empty for this reason.

Contributor Author

I added a cancelled state. I think the diff looks reasonable and the test looks fine, but it does seem like a bigger change now, and to be honest I don't know the full consequences of changing public abbrev JobResult α := EResult Log.Pos JobState α to a new inductive type.

Contributor Author

@marcelolynch Mar 24, 2026

I'm happy to revert the last commit if the earlier empty-log approach looks reasonable enough. In my (Lean noob) opinion, I'm happier adding it as a first-class citizen to support this and future purposes (Ctrl-C seems like low-hanging fruit).

/-! ## JobTask -/

/-- The result of a Lake job. -/
public abbrev JobResult α := EResult Log.Pos JobState α
Contributor

@eric-wieser Mar 24, 2026

You could do this with less work with:

inductive JobError
  | log : Log.Pos → JobError
  | cancelled

public abbrev JobResult α := EResult JobError JobState α

Contributor Author

This is pretty nice, although now that I've bitten the bullet I don't know if I prefer cancellation being an "error" rather than a distinct state. Does keeping the abbrev have any benefits beyond avoiding all those new adaptation functions?

Contributor

The main advantage is you get to reuse all the EResult machinery, and the code for error propagation (in bind) can pass the JobError object along unchanged without taking it apart and putting it back together each time, which is likely to be faster at runtime.

You could use JobException or JobStatus instead, I'm not attached to the name. But for reference, Python uses asyncio.CancelledError for such a concept, so I don't think treating cancellations as an error is really that strange.
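
A minimal sketch of the propagation point, using core `Except` in place of Lake's `EResult` (an assumption made to keep the example self-contained; `EResult` additionally threads state). `ToyJobError`, `toyBind`, and the `step*` functions are hypothetical.

```lean
/-- Toy error type with a cancellation case (hypothetical names). -/
inductive ToyJobError where
  | log (pos : Nat)
  | cancelled
  deriving Repr

/-- Generic bind over `Except`: it never inspects the error payload. -/
def toyBind {ε α β : Type} (r : Except ε α) (f : α → Except ε β) : Except ε β :=
  match r with
  | .ok a => f a
  | .error e => .error e  -- forwarded unchanged, whether `.log _` or `.cancelled`

def stepA (n : Nat) : Except ToyJobError Nat := .ok (n + 1)
def stepB (_ : Nat) : Except ToyJobError Nat := .error .cancelled
def stepC (n : Nat) : Except ToyJobError Nat := .ok (n * 2)

-- `stepB` reports cancellation, so `stepC` is never applied and the
-- `.cancelled` value reaches the caller as-is.
#eval toyBind (toyBind (stepA 1) stepB) stepC
```

Because the bind is written once for any error type, adding the `cancelled` case needs no new combinator code in this sketch.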

Contributor Author

Sold. Some pattern matching is a bit more awkward, but still pretty okay.

| .error e sb => .error ⟨sa.log.size + e.val⟩ {sa.merge sb with trace := sb.trace}
| .error e sa => return Task.pure (.error e sa)
| .error (.errorLogged e) sb => .error (.errorLogged ⟨sa.log.size + e.val⟩) {sa.merge sb with trace := sb.trace}
| .error e sb => .error e {sa.merge sb with trace := sb.trace}
Contributor

Instead of duplicating the sa.merge sb, you could consider an inner match just to populate the first argument of .error
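
One way the inner-match suggestion could look, sketched on toy types. `ToyError`, `ToyState`, and `rebaseError` are hypothetical simplifications of the types in the diff, and the result is returned as a pair rather than through `.error`.

```lean
/-- Toy versions of the error and state types from the diff (hypothetical). -/
inductive ToyError where
  | errorLogged (pos : Nat)
  | cancelled
  deriving Repr

structure ToyState where
  logSize : Nat
  deriving Repr

def ToyState.merge (sa sb : ToyState) : ToyState :=
  { logSize := sa.logSize + sb.logSize }

/-- Shift a logged position past `sa`'s log; the merged state is written once. -/
def rebaseError (sa : ToyState) (e : ToyError) (sb : ToyState) : ToyError × ToyState :=
  let e' : ToyError := match e with
    | .errorLogged pos => .errorLogged (sa.logSize + pos)
    | other => other
  (e', sa.merge sb)

-- Logged error: position shifted to 3 + 2 = 5, merged log size 3 + 5 = 8.
#eval rebaseError { logSize := 3 } (.errorLogged 2) { logSize := 5 }
-- Cancellation: passed through untouched, same merged state.
#eval rebaseError { logSize := 3 } .cancelled { logSize := 5 }
```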

Comment on lines +95 to +96
| .error (.errorLogged e) s => .error (.errorLogged e) (s.logEntry entry)
| .error .cancelled s => .error .cancelled s
Contributor

Marginal, but perhaps faster as

Suggested change
- | .error (.errorLogged e) s => .error (.errorLogged e) (s.logEntry entry)
- | .error .cancelled s => .error .cancelled s
+ | .error e s => .error e <| match e with | .errorLogged _ => s.logEntry entry | _ => s

Contributor Author

@marcelolynch Mar 25, 2026

It looks a bit cryptic; I'm tempted to leave it as-is.

Co-authored-by: Eric Wieser <wieser.eric@gmail.com>
Development

Successfully merging this pull request may close these issues.

RFC: add --stop-on-first-error flag in lake build

3 participants