@munishchouhan munishchouhan commented Nov 24, 2025

Summary

This PR adds tracking and reporting of spot/preemptible instance interruptions for cloud batch executors (AWS Batch and Google Batch). When tasks are retried due to spot instance interruptions, the number of interruptions is now captured and exposed via the numSpotInterruptions field in trace records.

Motivation

Spot/preemptible instances can be reclaimed by cloud providers at any time, causing tasks to retry on new instances. Understanding how often this happens is important for:

  • Workflow optimization and cost analysis
  • Identifying tasks that frequently experience spot interruptions
  • Monitoring the reliability of spot instance usage
  • Debugging workflow issues related to instance interruptions

Changes

Core Framework

  • TraceRecord (modules/nextflow/src/main/groovy/nextflow/trace/TraceRecord.groovy)
    • Added numSpotInterruptions transient field with getter/setter methods
    • Field is accessible in trace records and can be consumed by trace observers

AWS Batch Plugin (nf-amazon)

  • AwsBatchTaskHandler.groovy

    • Added getNumSpotInterruptions(String jobId) method that examines job attempts for spot interruption patterns
    • Detects AWS Batch spot interruptions by checking if statusReason starts with "Host EC2"
    • Returns count of spot interruptions or null if unavailable
    • Updated getTraceRecord() to populate the numSpotInterruptions field
  • Tests (AwsBatchTaskHandlerTest.groovy)

    • Added comprehensive test coverage for getNumSpotInterruptions() with various scenarios:
      • No interruptions (0 attempts, empty attempts)
      • Single interruption
      • Multiple interruptions
      • Mixed with non-spot failures
    • Added test verifying trace record integration

Google Batch Plugin (nf-google)

  • GoogleBatchTaskHandler.groovy

    • Added getNumSpotInterruptions(String jobId) method that examines task status events
    • Detects Google Batch spot preemptions by checking for exit code 50001 in status events
    • Returns count of spot preemptions or null if unavailable
    • Updated getTraceRecord() to populate the numSpotInterruptions field
    • Implemented a maxSpotAttempts() helper that uses FusionConfig defaults when Fusion snapshots are enabled
  • Tests (GoogleBatchTaskHandlerTest.groovy)

    • Added parameterized test for getNumSpotInterruptions() covering multiple scenarios
    • Added test verifying trace record integration
    • Verified count correctly extracted from status events

Technical Details

Detection Mechanisms

AWS Batch:

  • Examines JobDetail.attempts() list
  • Identifies spot reclamations by checking if attempt.statusReason() starts with "Host EC2"
  • Example pattern: "Host EC2 (instance i-xxx) terminated."
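
The AWS Batch check above can be sketched roughly as follows. This is a minimal illustration, not the handler code from this PR: the class name AwsSpotInterruptionSketch is hypothetical, the attempts are reduced to a plain list of status-reason strings (the real handler reads them from JobDetail.attempts()), and returning null for a missing list is an assumption about what "unavailable" means.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the AWS Batch detection logic described above.
// Each list element plays the role of attempt.statusReason() for one attempt.
class AwsSpotInterruptionSketch {

    // Prefix named in the PR description as the spot interruption marker.
    static final String SPOT_PREFIX = "Host EC2";

    // Counts attempts whose status reason starts with "Host EC2".
    // Returns null when no attempt information is available (assumption).
    static Integer countSpotInterruptions(List<String> statusReasons) {
        if (statusReasons == null)
            return null;
        int count = 0;
        for (String reason : statusReasons) {
            if (reason != null && reason.startsWith(SPOT_PREFIX))
                count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // The second reason is an illustrative non-spot failure.
        List<String> reasons = Arrays.asList(
            "Host EC2 (instance i-xxx) terminated.",
            "Essential container in task exited",
            "Host EC2 (instance i-xxx) terminated.");
        System.out.println(countSpotInterruptions(reasons)); // prints 2
    }
}
```

An empty list yields 0 rather than null, which keeps "no interruptions" distinguishable from "count unknown".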

Google Batch:

  • Examines TaskStatus.statusEventsList()
  • Identifies spot preemptions by checking for exitCode == 50001 in task execution events
  • Exit code 50001 is the exit code Google Batch reserves for VM preemption, so each matching event corresponds to one reclaimed spot instance
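
A comparable sketch for the Google Batch check, under the same caveats: GoogleSpotPreemptionSketch and its Event class are hypothetical simplifications of the status events, with a nullable exitCode standing in for events that carry no task execution details.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the Google Batch detection logic described above.
class GoogleSpotPreemptionSketch {

    // Exit code named in the PR description as the spot preemption marker.
    static final int SPOT_PREEMPTION_EXIT_CODE = 50001;

    // Simplified status event; exitCode is null when the event
    // has no task execution attached (assumption).
    static final class Event {
        final String description;
        final Integer exitCode;

        Event(String description, Integer exitCode) {
            this.description = description;
            this.exitCode = exitCode;
        }
    }

    // Counts events whose exit code equals 50001.
    // Returns null when no event list is available (assumption).
    static Integer countSpotPreemptions(List<Event> events) {
        if (events == null)
            return null;
        int count = 0;
        for (Event event : events) {
            if (event.exitCode != null && event.exitCode == SPOT_PREEMPTION_EXIT_CODE)
                count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // Illustrative event sequence: two preemptions, then a successful run.
        List<Event> events = Arrays.asList(
            new Event("task scheduled", null),
            new Event("task preempted", 50001),
            new Event("task preempted", 50001),
            new Event("task succeeded", 0));
        System.out.println(countSpotPreemptions(events)); // prints 2
    }
}
```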

Implementation Approach

The numSpotInterruptions field is:

  1. Stored in TraceRecord as a transient field (not serialized to .command.trace files)
  2. Computed on-demand from cloud provider APIs when getTraceRecord() is called
  3. Available to trace observers for reporting and metrics collection
  4. Set to null if the count cannot be determined (e.g., job not found, API error)

This approach queries the cloud provider's job/task status to detect spot interruptions based on provider-specific indicators:

  • AWS Batch: Status reasons starting with "Host EC2"
  • Google Batch: Status events with exit code 50001

Because the field is carried on TraceRecord, any trace observer that consumes these records can track and report spot interruption rates.
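
The on-demand, null-on-failure behaviour described in this section could look roughly like the sketch below. All names here (OnDemandSpotCountSketch, TraceEntry, fill) are hypothetical, and the provider lookup is abstracted as a Supplier; the actual handlers call the AWS and Google client APIs instead.

```java
import java.util.function.Supplier;

// Hypothetical illustration of computing the count on demand and
// falling back to null when the provider lookup fails.
class OnDemandSpotCountSketch {

    // Simplified stand-in for TraceRecord.
    static final class TraceEntry {
        // transient: excluded from Java serialization, mirroring the
        // "transient field" wording in the PR description (assumption).
        private transient Integer numSpotInterruptions;

        Integer getNumSpotInterruptions() { return numSpotInterruptions; }

        void setNumSpotInterruptions(Integer value) { this.numSpotInterruptions = value; }
    }

    // Runs the provider query and stores its result on the entry;
    // any runtime failure (e.g. job not found) is recorded as null.
    static void fill(TraceEntry entry, Supplier<Integer> providerQuery) {
        Integer count;
        try {
            count = providerQuery.get();
        } catch (RuntimeException e) {
            count = null;
        }
        entry.setNumSpotInterruptions(count);
    }

    public static void main(String[] args) {
        TraceEntry ok = new TraceEntry();
        fill(ok, () -> 3);
        System.out.println(ok.getNumSpotInterruptions()); // prints 3

        TraceEntry failed = new TraceEntry();
        fill(failed, () -> { throw new IllegalStateException("job not found"); });
        System.out.println(failed.getNumSpotInterruptions()); // prints null
    }
}
```

Treating a failed lookup as null rather than 0 lets downstream observers tell "no interruptions" apart from "count could not be determined".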

Testing

  • ✅ All existing tests pass
  • ✅ New unit tests for spot reclamation counting logic
  • ✅ Integration tests for trace record generation
  • ✅ Verified trace file format compatibility

Signed-off-by: munishchouhan <hrma017@gmail.com>
netlify bot commented Nov 24, 2025

Deploy Preview for nextflow-docs-staging canceled.

Name Link
🔨 Latest commit 90fb949
🔍 Latest deploy log https://app.netlify.com/projects/nextflow-docs-staging/deploys/69414be29580ee000858917f

@munishchouhan munishchouhan requested review from stefanoboriero and removed request for fntlnz, jordeu and stefanoboriero November 27, 2025 14:45
@pditommaso pditommaso marked this pull request as draft November 28, 2025 09:22
@munishchouhan munishchouhan marked this pull request as ready for review December 11, 2025 21:21
@munishchouhan munishchouhan changed the title Added number of spot interruptions in TraceRecord for aws and google batch Feat: Added number of spot interruptions in TraceRecord for aws and google batch Dec 15, 2025
@munishchouhan munishchouhan changed the title Feat: Added number of spot interruptions in TraceRecord for aws and google batch feat: Added number of spot interruptions in TraceRecord for aws and google batch Dec 15, 2025
munishchouhan and others added 2 commits December 16, 2025 12:39
- Use guard clauses in AWS Batch handler for cleaner flow
- Add clarifying comment in Google Batch handler

Signed-off-by: Paolo Di Tommaso <paolo.ditommaso@gmail.com>
@pditommaso pditommaso changed the title feat: Added number of spot interruptions in TraceRecord for aws and google batch feat: Added number of spot interruptions to Tower/Platform telemetry Dec 16, 2025
@pditommaso pditommaso merged commit eecd816 into master Dec 16, 2025
14 checks passed
@pditommaso pditommaso deleted the add-num-reclamations-trace branch December 16, 2025 12:41
fntlnz pushed a commit to fntlnz/nextflow that referenced this pull request Dec 18, 2025
Track and report spot/preemptible instance interruptions for cloud batch executors.

Changes:
- Add `numSpotInterruptions` transient field to TraceRecord
- AWS Batch: detect spot interruptions by checking status reason pattern "Host EC2*"
- Google Batch: detect spot preemptions via exit code 50001 in status events
- Tower plugin: send numSpotInterruptions to Seqera Platform telemetry

This enables workflow optimization and cost analysis by tracking how often
tasks are retried due to spot instance reclamation.
fntlnz pushed a commit to fntlnz/nextflow that referenced this pull request Dec 18, 2025

(cherry picked from commit eecd816)
fntlnz pushed a commit to fntlnz/nextflow that referenced this pull request Dec 18, 2025

(cherry picked from commit eecd816)
Signed-off-by: Lorenzo Fontana <fontanalorenz@gmail.com>
fntlnz pushed a commit that referenced this pull request Dec 18, 2025

(cherry picked from commit eecd816)
Signed-off-by: Lorenzo Fontana <fontanalorenz@gmail.com>
6 participants