feat: add graceful cancellation and --stop-on-first-error option in lake#13075
feat: add graceful cancellation and --stop-on-first-error option in lake#13075marcelolynch wants to merge 10 commits intoleanprover:masterfrom
--stop-on-first-error option in lake#13075Conversation
|
Mathlib CI status (docs):
|
|
Reference manual CI status:
|
Co-authored-by: Eric Wieser <wieser.eric@gmail.com>
src/lake/Lake/Build/Index.lean
Outdated
| match info with | ||
| | .target pkg target => do | ||
| if let some tk := (← getBuildContext).cancelling? then | ||
| if ← tk.isSet then return .error -- cancelled: skip scheduling new work |
There was a problem hiding this comment.
Should this be
| if ← tk.isSet then return .error -- cancelled: skip scheduling new work | |
| if ← tk.isSet then return .error "Cancelled" |
There was a problem hiding this comment.
This is the .error that builds a failed Job with an empty log rather than the error that fails the monad
There was a problem hiding this comment.
Arguably this is a bit shaky, it is tied a bit on the monitor's notion of failure being based on the log:
let failed := strictAnd log.hasEntries (maxLv ≥ failLv)
So an empty-log job is invisible to the monitor, and it won't appear in "Some required targets logged failures:".
Maybe this warrants a first-class "cancelled" state as JobResult instead, but it seems like this requires some deeper refactors
There was a problem hiding this comment.
Perhaps worth a comment noting that the log is left empty for this reason.
There was a problem hiding this comment.
I added a cancelled state. I think the diff looks reasonable and the test looks fine, but it does seem like a bigger change now, and to be honest I don't know the full consequences of changing public abbrev JobResult α := EResult Log.Pos JobState α to a new inductive type.
There was a problem hiding this comment.
I'm happy to revert the last commit if the other thing looks reasonable enough. On my (Lean noob) opinion, I'm happier with adding it as a first-class citizen to support this and future purposes (Ctrl-C seems like low hanging fruit)
| /-! ## JobTask -/ | ||
|
|
||
| /-- The result of a Lake job. -/ | ||
| public abbrev JobResult α := EResult Log.Pos JobState α |
There was a problem hiding this comment.
You could do this with less work with
inductive JobError
| log : Log.Pos → JobError
| cancelled
public abbrev JobResult α := EResult JobError JobState αThere was a problem hiding this comment.
This is pretty nice, although now that I bit the bullet I don't know if I prefer cancellation being an "error" rather than a distinct state. Does keeping the abbrev have some more benefits except for avoiding all those new adaptation functions?
There was a problem hiding this comment.
The main advantage is you get to reuse all the EResult machinery, and the code for error propagation (in bind) can pass the JobError object along unchanged without taking it apart and putting it back together each time, which is likely to be faster at runtime.
You could use JobException or JobStatus instead, I'm not attached to the name. But for reference, Python uses asyncio.CancelledError for such a concept, so I don't think treating cancellations as an error is really that strange.
There was a problem hiding this comment.
Sold. Some pattern matching is a bit more awkward, but still pretty okay.
| | .error e sb => .error ⟨sa.log.size + e.val⟩ {sa.merge sb with trace := sb.trace} | ||
| | .error e sa => return Task.pure (.error e sa) | ||
| | .error (.errorLogged e) sb => .error (.errorLogged ⟨sa.log.size + e.val⟩) {sa.merge sb with trace := sb.trace} | ||
| | .error e sb => .error e {sa.merge sb with trace := sb.trace} |
There was a problem hiding this comment.
Instead of duplicating the sa.merge sb, you could consider an inner match just to populate the first argument of .error
| | .error (.errorLogged e) s => .error (.errorLogged e) (s.logEntry entry) | ||
| | .error .cancelled s => .error .cancelled s |
There was a problem hiding this comment.
Marginal, but perhaps faster as
| | .error (.errorLogged e) s => .error (.errorLogged e) (s.logEntry entry) | |
| | .error .cancelled s => .error .cancelled s | |
| | .error e s => .error e <| match e with | .errorLogged _ => s.logEntry entry | _ => s |
There was a problem hiding this comment.
It looks a bit cryptic, I'm tempted to leave as-is
Co-authored-by: Eric Wieser <wieser.eric@gmail.com>
This PR adds a
--stop-on-first-errorflag tolake buildthat stops scheduling new build jobs as soon as the first required-target failure is detected, then waits for already-running jobs to drain to completion before reporting failures and exiting.Previously, Lake had no way to abort a build early on failure: it always waited for every scheduled job to finish. On large workspaces this means waiting for dozens of unrelated compilations to complete after the first error is already known.
The implementation introduces a generic cancellation mechanism: a
cancelling? : Option IO.CancelTokenfield onBuildContextthat is always created and threaded through the build. The--stop-on-first-errorflag is one consumer that sets it. When the token is set,recBuildWithIndexshort-circuits withJob.cancelledinstead of scheduling new work, andregisterJobskips adding the job to the monitor queue so interrupted pending jobs do not appear as phantom failures in the output. Jobs that are already running complete normally.To support this cleanly,
JobResultis promoted from anabbrevoverEResultto a standalone inductive type with a third constructor.cancelled. The new constructor makes cancellation explicit and propagates correctly through all job combinators (zipWith,bindM,mapM, etc.), with error taking priority over cancellation in combined results.Beyond the new flag, this PR establishes a general graceful cancellation primitive for Lake builds that opens up future use cases such as SIGINT/Ctrl-C handling (today, interrupting a Lake build hard-kills the process and orphans running subprocesses) and build timeouts for CI environments.
Closes #13074, addresses #2763
Prepared with Claude Code