OAR - HOLD status instead of waiting status #2540

Closed
grafitos opened this issue Jan 6, 2022 · 5 comments

grafitos commented Jan 6, 2022

Bug report

In rare cases, Nextflow may stop because of a job state inconsistency with OAR. OAR appears to set the job status to HOLD for a very short time when the job is created (https://github.com/oar-team/oar/blob/4860ba9b0a592be5682a56635aa22e11bc705d84/sources/core/common-libs/lib/OAR/IO.pm#L1710).

Actual behavior

Nextflow believes that the job is running and looks for its output files, but cannot find the expected files. This stops the program.

Program output

Jan-05 18:58:55.473 [Task monitor] DEBUG nextflow.processor.TaskRun - Unable to dump output of process 'null' -- Cause: java.nio.file.NoSuchFileException: /mnt/scratch/20220105/aa8cf1e86e4b11ec8e730242ac11000b/work/97/d1a3a6f65ebfcdd40b8236af76e50e/.command.out
Jan-05 18:58:55.477 [Task monitor] DEBUG nextflow.processor.TaskRun - Unable to dump error of process 'null' -- Cause: java.nio.file.NoSuchFileException: /mnt/scratch/20220105/aa8cf1e86e4b11ec8e730242ac11000b/work/97/d1a3a6f65ebfcdd40b8236af76e50e/.command.err
Jan-05 18:58:55.480 [Task monitor] DEBUG nextflow.processor.TaskRun - Unable to dump error of process 'null' -- Cause: java.nio.file.NoSuchFileException: /mnt/scratch/20220105/aa8cf1e86e4b11ec8e730242ac11000b/work/97/d1a3a6f65ebfcdd40b8236af76e50e/.command.log
Jan-05 18:58:55.481 [Task monitor] ERROR nextflow.processor.TaskProcessor - Error executing process > 'Slicing (225)'

Environment

  • Nextflow version: All versions with OAR
  • Operating system: Linux

Additional context

Related issue

@pditommaso
Member

It is still not clear why NF should fail. Since HOLD is treated as a running state, NF should keep waiting indefinitely.

https://github.com/nextflow-io/nextflow/blob/master/modules/nextflow/src/main/groovy/nextflow/executor/GridTaskHandler.groovy#L206
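The expectation described here can be sketched as a small polling loop. This is a minimal illustration in Python, not the code in GridTaskHandler.groovy: the state names, the `poll` callback, and the `wait_for_job` and `replay` helpers are all assumptions made for the example.

```python
from typing import Callable, Iterable

# Hypothetical scheduler states; the real OAR and Nextflow names may differ.
ACTIVE_STATES = {"WAITING", "HOLD", "RUNNING"}


def wait_for_job(poll: Callable[[], str], max_polls: int) -> str:
    """Poll until the job leaves every active state or max_polls is reached.

    Returns the last state seen. HOLD counts as active, so a job that is
    briefly held at creation keeps being waited on.
    """
    state = "UNKNOWN"
    for _ in range(max_polls):
        state = poll()
        if state not in ACTIVE_STATES:
            break
    return state


def replay(states: Iterable[str]) -> Callable[[], str]:
    """Build a poll callback that replays a fixed sequence of states."""
    it = iter(states)
    return lambda: next(it)


if __name__ == "__main__":
    # A job that is held briefly at creation and then runs to completion.
    print(wait_for_job(replay(["HOLD", "WAITING", "RUNNING", "DONE"]), max_polls=10))
```

Under this model a transient HOLD alone would not end the wait, which is why the failure reported in this issue looks surprising.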

grafitos commented Jan 6, 2022

I'm going to investigate too. (Note that the task is "completed", and exit is "-")

Could this check be involved?

if( task.exitStatus == Integer.MAX_VALUE )
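The trace reports exit as "-", while the check above compares against Integer.MAX_VALUE. One plausible reading, stated here as an assumption rather than something confirmed from the Nextflow source, is that Integer.MAX_VALUE acts as a sentinel for "exit status never recorded" and is printed as "-". A minimal Python sketch of that pattern, with hypothetical names:

```python
# Value of Java's Integer.MAX_VALUE, assumed here to mean "exit status unknown".
UNKNOWN_EXIT = 2**31 - 1


def format_exit(exit_status: int) -> str:
    """Render an exit status, showing '-' when it was never recorded."""
    return "-" if exit_status == UNKNOWN_EXIT else str(exit_status)


def exit_is_unknown(exit_status: int) -> bool:
    """True when the task ended without ever reporting an exit status."""
    return exit_status == UNKNOWN_EXIT
```

If that reading holds, a task that is marked completed while its exit status is still the sentinel would match the "terminated for an unknown reason" path seen in the trace.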

Here is a more complete trace:

Jan-05 18:58:55.460 [Task monitor] DEBUG n.processor.TaskPollingMonitor - Task completed > TaskHandler[jobId: 2813496; id: 2664; name: Slicing (225); status: COMPLETED; exit: -; error: -; workDir: /mnt/scratch/20220105/aa8cf1e86e4b11ec8e730242ac11000b/work/97/d1a3a6f65ebfcdd40b8236af76e50e started: 1641405240313; exited: -; ]
Jan-05 18:58:55.473 [Task monitor] DEBUG nextflow.processor.TaskRun - Unable to dump output of process 'null' -- Cause: java.nio.file.NoSuchFileException: /mnt/scratch/20220105/aa8cf1e86e4b11ec8e730242ac11000b/work/97/d1a3a6f65ebfcdd40b8236af76e50e/.command.out
Jan-05 18:58:55.477 [Task monitor] DEBUG nextflow.processor.TaskRun - Unable to dump error of process 'null' -- Cause: java.nio.file.NoSuchFileException: /mnt/scratch/20220105/aa8cf1e86e4b11ec8e730242ac11000b/work/97/d1a3a6f65ebfcdd40b8236af76e50e/.command.err
Jan-05 18:58:55.480 [Task monitor] DEBUG nextflow.processor.TaskRun - Unable to dump error of process 'null' -- Cause: java.nio.file.NoSuchFileException: /mnt/scratch/20220105/aa8cf1e86e4b11ec8e730242ac11000b/work/97/d1a3a6f65ebfcdd40b8236af76e50e/.command.log
Jan-05 18:58:55.481 [Task monitor] ERROR nextflow.processor.TaskProcessor - Error executing process > 'Slicing (225)'

Caused by:
  Process Slicing (225) terminated for an unknown reason -- Likely it has been terminated by the external system

Command executed:

  vcfslicer /mnt/scratch/20220105/aa8cf1e86e4b11ec8e730242ac11000b/work/0b/4f190a64eaff3bede4cd79229ed73b/dc904328-db01-439a-8d4a-72c04b54d84f.diag.vcf 5000

Command exit status:
  -

Command output:
  (empty)

Work dir:
  /mnt/scratch/20220105/aa8cf1e86e4b11ec8e730242ac11000b/work/97/d1a3a6f65ebfcdd40b8236af76e50e

Tip: you can try to figure out what's wrong by changing to the process work dir and showing the script file named .command.sh
Jan-05 18:58:55.488 [Task monitor] DEBUG nextflow.Session - Session aborted -- Cause: Process Slicing (225) terminated for an unknown reason -- Likely it has been terminated by the external system
Jan-05 18:58:55.509 [Task monitor] DEBUG nextflow.Session - The following nodes are still active:
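
When scanning a long Nextflow log for this failure, it can help to pull the fields out of the "Task completed" entry and flag tasks that are COMPLETED while exit is still "-". The Python sketch below does that for the entry from the trace above. The helper name and the regular expression are illustrative assumptions about the log layout, not a Nextflow API; note that workDir and started are not separated by a semicolon in the log, so the pattern also splits on whitespace followed by a new "key:".

```python
import re

# A field is "key: value" ended by ";" or by whitespace that precedes the next "key:".
FIELD_RE = re.compile(r"(\w+):\s*(.*?)(?:;|\s(?=\w+:))")


def parse_task_handler(line: str) -> dict:
    """Extract the fields of a 'TaskHandler[...]' log entry into a dict."""
    start = line.index("TaskHandler[") + len("TaskHandler[")
    body = line[start:line.rindex("]")]
    return {key: value.strip() for key, value in FIELD_RE.findall(body)}


# The 'Task completed' entry from the trace, joined onto one logical line.
line = (
    "Jan-05 18:58:55.460 [Task monitor] DEBUG n.processor.TaskPollingMonitor - "
    "Task completed > TaskHandler[jobId: 2813496; id: 2664; name: Slicing (225); "
    "status: COMPLETED; exit: -; error: -; workDir: /mnt/scratch/20220105/"
    "aa8cf1e86e4b11ec8e730242ac11000b/work/97/d1a3a6f65ebfcdd40b8236af76e50e "
    "started: 1641405240313; exited: -; ]"
)

fields = parse_task_handler(line)
# A task reported as completed without any exit status is the case seen in this issue.
suspicious = fields["status"] == "COMPLETED" and fields["exit"] == "-"
```

Collecting the jobId of every suspicious entry gives a list of job ids that could then be cross-checked against the scheduler's own records.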

grafitos commented Jan 6, 2022

Correction: my statement under "Actual behavior" (that Nextflow believes the job is running and stops because it cannot find the expected output files) is not true. The missing files are just a side effect, not the root cause of the program stopping.

stale bot commented Jun 11, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Jun 11, 2022
@stale stale bot removed the stale label Jun 30, 2022

stale bot commented Dec 21, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Dec 21, 2022
@stale stale bot closed this as completed Mar 25, 2023