feat: Added number of spot interruptions to Tower/Platform telemetry #6606
Merged
Signed-off-by: munishchouhan <hrma017@gmail.com>
pditommaso reviewed on Dec 15, 2025: plugins/nf-amazon/src/main/nextflow/cloud/aws/batch/AwsBatchTaskHandler.groovy (outdated, resolved)
pditommaso reviewed on Dec 15, 2025: plugins/nf-google/src/main/nextflow/cloud/google/batch/GoogleBatchTaskHandler.groovy (outdated, resolved)
pditommaso reviewed on Dec 15, 2025: plugins/nf-google/src/main/nextflow/cloud/google/batch/GoogleBatchTaskHandler.groovy (outdated, resolved)
This reverts commit dcb8465.
This reverts commit 6c6f153.
pditommaso reviewed on Dec 16, 2025: plugins/nf-google/src/main/nextflow/cloud/google/batch/GoogleBatchTaskHandler.groovy (resolved)
- Use guard clauses in AWS Batch handler for cleaner flow
- Add clarifying comment in Google Batch handler
Signed-off-by: Paolo Di Tommaso <paolo.ditommaso@gmail.com>
pditommaso approved these changes on Dec 16, 2025
fntlnz pushed a commit to fntlnz/nextflow that referenced this pull request on Dec 18, 2025:
Track and report spot/preemptible instance interruptions for cloud batch executors.
Changes:
- Add `numSpotInterruptions` transient field to TraceRecord
- AWS Batch: detect spot interruptions by checking status reason pattern "Host EC2*"
- Google Batch: detect spot preemptions via exit code 50001 in status events
- Tower plugin: send numSpotInterruptions to Seqera Platform telemetry
This enables workflow optimization and cost analysis by tracking how often tasks are retried due to spot instance reclamation.
fntlnz pushed a commit to fntlnz/nextflow that referenced this pull request on Dec 18, 2025: same commit message, with the trailer "(cherry picked from commit eecd816)"
fntlnz pushed a commit to fntlnz/nextflow that referenced this pull request on Dec 18, 2025: same commit message, with the trailers "(cherry picked from commit eecd816)" and "Signed-off-by: Lorenzo Fontana <fontanalorenz@gmail.com>"
fntlnz pushed a commit that referenced this pull request on Dec 18, 2025: same commit message, with the trailers "(cherry picked from commit eecd816)" and "Signed-off-by: Lorenzo Fontana <fontanalorenz@gmail.com>"
Summary
This PR adds tracking and reporting of spot/preemptible instance interruptions for cloud batch executors (AWS Batch and Google Batch). When tasks are retried due to spot instance interruptions, the number of interruptions is now captured and exposed via the `numSpotInterruptions` field in trace records.
Motivation
Spot/preemptible instances can be reclaimed by cloud providers at any time, causing tasks to retry on new instances. Understanding how often this happens is important for workflow optimization and cost analysis.
Changes
Core Framework
- `modules/nextflow/src/main/groovy/nextflow/trace/TraceRecord.groovy`: `numSpotInterruptions` transient field with getter/setter methods
AWS Batch Plugin (`nf-amazon`)
AwsBatchTaskHandler.groovy
- `getNumSpotInterruptions(String jobId)` method that examines job attempts for spot interruption patterns
- A spot interruption is identified when `statusReason` starts with "Host EC2"
- `getTraceRecord()` populates the `numSpotInterruptions` field
Tests (`AwsBatchTaskHandlerTest.groovy`)
- `getNumSpotInterruptions()` with various scenarios
Google Batch Plugin (`nf-google`)
GoogleBatchTaskHandler.groovy
- `getNumSpotInterruptions(String jobId)` method that examines task status events
- `getTraceRecord()` populates the `numSpotInterruptions` field
- `maxSpotAttempts()` helper using FusionConfig defaults when fusion snapshots are enabled
Tests (`GoogleBatchTaskHandlerTest.groovy`)
- `getNumSpotInterruptions()` covering multiple scenarios
Technical Details
Detection Mechanisms
AWS Batch:
- Examines the `JobDetail.attempts()` list
- Counts attempts where `attempt.statusReason()` starts with `"Host EC2"`
- Example: `"Host EC2 (instance i-xxx) terminated."`
Google Batch:
- Examines `TaskStatus.statusEventsList()`
- Counts `exitCode == 50001` in task execution events
Implementation Approach
The `numSpotInterruptions` field is:
- transient, so it is not stored in `.command.trace` files
- populated when `getTraceRecord()` is called
This approach queries the cloud provider's job/task status to detect spot interruptions based on the provider-specific indicators listed above.
The field will be available to trace observers that consume TraceRecord objects, allowing workflows to track and report spot interruption rates.
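The two detection rules can be sketched in isolation as follows. This is a minimal illustration in plain Java, not the code from the PR: the class name `SpotInterruptionSketch` and both method names are hypothetical, and plain lists of status reasons and exit codes stand in for the AWS SDK attempt details and the Google Batch status events. Only the `"Host EC2"` prefix and the exit code 50001 come from the description above.

```java
import java.util.List;

// Illustrative sketch only: strings and integers replace the cloud SDK objects
// that the real task handlers would inspect.
final class SpotInterruptionSketch {

    // Status reason prefix that the PR description associates with AWS spot reclamation.
    static final String AWS_SPOT_PREFIX = "Host EC2";

    // Exit code that the PR description associates with Google Batch preemption.
    static final int GOOGLE_PREEMPTED_EXIT_CODE = 50001;

    // Counts job attempts whose status reason starts with "Host EC2".
    // Attempts without a status reason (null) are ignored.
    static int countAwsInterruptions(List<String> statusReasons) {
        int count = 0;
        for (String reason : statusReasons) {
            if (reason != null && reason.startsWith(AWS_SPOT_PREFIX)) {
                count++;
            }
        }
        return count;
    }

    // Counts status events that carry the preemption exit code 50001.
    // Events without an exit code (null) are ignored.
    static int countGoogleInterruptions(List<Integer> exitCodes) {
        int count = 0;
        for (Integer code : exitCodes) {
            if (code != null && code == GOOGLE_PREEMPTED_EXIT_CODE) {
                count++;
            }
        }
        return count;
    }
}
```

In this sketch a job with attempts `"Host EC2 (instance i-xxx) terminated."` and `"Essential container in task exited"` counts as one interruption, because only the first reason matches the prefix.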
Testing