Process terminated for an unknown reason -- Likely it has been terminated by the external system #2847

Closed
hukai916 opened this issue May 4, 2022 · 15 comments

Comments

@hukai916

hukai916 commented May 4, 2022

Bug report

Expected behavior and actual behavior

I expect the NF engine to correctly capture the exit status of jobs submitted to the LSF scheduler, but it occasionally fails to do so.

Steps to reproduce the problem

I didn't have this problem using the same workflow over the past few months, but started to encounter it recently.
This problem is hard to reproduce because it happens sporadically. I was able to capture one such case, traced it in the .nextflow.log file, and found the following facts:

  1. The job (jobId: 5535024) was actually completed without problem according to its .command.log file:
Started at Tue May  3 09:20:54 2022
Terminated at Tue May  3 18:24:37 2022
Results reported at Tue May  3 18:24:37 2022
  2. The job was submitted to LSF at May-03 09:10:36.762 and was still PENDING 10 minutes later at May-03 09:20:40.795, yet it was considered COMPLETED only a few milliseconds later at May-03 09:20:40.803. The NF engine couldn't locate its output files because they had not yet been generated, so it raised an error, terminated all other processes, and exited.
  3. The relevant log is below:
May-03 09:10:36.762 [Task submitter] DEBUG nextflow.executor.GridTaskHandler - [LSF] submitted process SCATACPIPE:PREPROCESS_DEFAULT:ADD_BARCODE_TO_READS (11) > jobId: 5535024; workDir: /project/umw_cole_haynes/Kai/scATACpipe/work/62/d7fada000d75bd3246b78bb41c0642

May-03 09:20:40.795 [Task monitor] DEBUG nextflow.executor.GridTaskHandler - Failed to get exit status for process TaskHandler[jobId: 5535024; id: 25; name: SCATACPIPE:PREPROCESS_DEFAULT:ADD_BARCODE_TO_READS (11); status: RUNNING; exit: -; error: -; workDir: /project/umw_cole_haynes/Kai/scATACpipe/work/62/d7fada000d75bd3246b78bb41c0642 started: 1651583530790; exited: -; ] -- exitStatusReadTimeoutMillis: 270000; delta: 270000
Current queue status:
>   job: 5535024: PENDING

May-03 09:20:40.803 [Task monitor] DEBUG n.processor.TaskPollingMonitor - Task completed > TaskHandler[jobId: 5535024; id: 25; name: SCATACPIPE:PREPROCESS_DEFAULT:ADD_BARCODE_TO_READS (11); status: COMPLETED; exit: -; error: -; workDir: /project/umw_cole_haynes/Kai/scATACpipe/work/62/d7fada000d75bd3246b78bb41c0642 started: 1651583530790; exited: -; ]
May-03 09:20:40.820 [Task monitor] DEBUG nextflow.processor.TaskRun - Unable to dump output of process 'null' -- Cause: java.nio.file.NoSuchFileException: /project/umw_cole_haynes/Kai/scATACpipe/work/62/d7fada000d75bd3246b78bb41c0642/.command.out
May-03 09:20:40.822 [Task monitor] DEBUG nextflow.processor.TaskRun - Unable to dump error of process 'null' -- Cause: java.nio.file.NoSuchFileException: /project/umw_cole_haynes/Kai/scATACpipe/work/62/d7fada000d75bd3246b78bb41c0642/.command.err
May-03 09:20:40.823 [Task monitor] DEBUG nextflow.processor.TaskRun - Unable to dump error of process 'null' -- Cause: java.nio.file.NoSuchFileException: /project/umw_cole_haynes/Kai/scATACpipe/work/62/d7fada000d75bd3246b78bb41c0642/.command.log
May-03 09:20:40.825 [Task monitor] ERROR nextflow.processor.TaskProcessor - Error executing process > 'SCATACPIPE:PREPROCESS_DEFAULT:ADD_BARCODE_TO_READS (11)'

The question boils down to: why was a job that actually finished at May 3 18:24:37 2022 considered COMPLETED at May-03 09:20:40.803?

I noticed similar issues here (#2540), here (#1045), and here (#1644).

Any input on how to debug this further would be highly appreciated.

Environment

  • Nextflow version: [nextflow version 21.10.6.5660]
  • Java version: [openjdk version "1.8.0_92"]
  • Bash version: (use the command $SHELL --version) [GNU bash, version 4.4.20(1)-release (x86_64-redhat-linux-gnu)]
@bentsherman
Member

Is it possible that your job was preempted by a higher priority user?

@hukai916
Author

hukai916 commented May 6, 2022

Thanks for your reply!

How could that happen, given that I am the only user running the pipeline in my own allocated space? What I don't understand is how a "PENDING" job can be considered "COMPLETED" by the NF engine, as shown in the log file.

@bentsherman
Member

Ah, well if you're the only user then it's definitely not preemption.

I should have taken a closer look at your log -- I see that the task logs generated by Nextflow are not being found. If you're using a shared filesystem then I suspect the issue is related to timestamps getting out of sync between the filesystem, the Nextflow job, and the task jobs. I've encountered similar errors while building code with make on a shared filesystem.

However I'll have to defer to others because I don't remember if there is any troubleshooting advice for Nextflow in this situation.

@stale

stale bot commented Oct 16, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Oct 16, 2022
@stale stale bot closed this as completed Jan 16, 2023
@xz-stjude

@hukai916 Hi, I have run into the exact same problem today. Have you found a solution or identified the cause?

@hukai916
Author

Sadly it is hard to replicate and I don't have a solution yet. However, I configured the pipeline to use "short" queues to minimize the pending time (our cluster prioritizes "short" queues), which seems to reduce the incidence.
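
For reference, a minimal sketch of the kind of configuration I mean (the queue name and time limit are specific to our cluster, so treat them as illustrative only):

  // nextflow.config -- illustrative sketch; queue names and limits are cluster-specific
  process {
      executor = 'lsf'
      queue    = 'short'   // a queue our cluster prioritizes, so jobs spend less time PENDING
      time     = '4h'      // assumed walltime cap for such a queue; adjust to your site
  }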

@TBradley27

TBradley27 commented Nov 13, 2023

Hello,

I am also having this problem. The bot marked this issue as 'stale', but since we are still having this problem, the issue is neither 'stale' nor 'completed'.

@timothy-barry

Similar issue over here.

@dkoppstein

dkoppstein commented Jan 22, 2024

I'm also having a similar issue on LSF. Is it possible to reopen this issue, or to find some way to debug it? I also see that the command finishes successfully in .command.out.

Here is my log output:

Jan-20 07:33:45.479 [Task monitor] ERROR nextflow.processor.TaskProcessor - Error executing process > 'NFCORE_RNASPLICE:RNASPLICE:ALIGN_STAR:STAR_ALIGN (H021-PUC89K_metastasis_2)'

Caused by:
  Process `NFCORE_RNASPLICE:RNASPLICE:ALIGN_STAR:STAR_ALIGN (H021-PUC89K_metastasis_2)` terminated for an unknown reason -- Likely it has been terminated by the external system

Command executed:

  STAR \
      --genomeDir star \
      --readFilesIn input1/H021-PUC89K_metastasis_2_1_val_1.fq.gz input2/H021-PUC89K_metastasis_2_2_val_2.fq.gz \
      --runThreadN 12 \
      --outFileNamePrefix H021-PUC89K_metastasis_2. \
       \
      --sjdbGTFfile gencode.v43.annotation.gtf \
      --outSAMattrRGline 'ID:H021-PUC89K_metastasis_2'  'SM:H021-PUC89K_metastasis_2'  \
      --quantMode TranscriptomeSAM --twopassMode Basic --outSAMtype BAM Unsorted --readFilesCommand gunzip -c --runRNGseed 0 --outFilterMultimapNmax 20 --alignSJDBoverhangMin 1 --outSAMattributes NH HI AS NM MD --quantTranscriptomeBan Singleend



  if [ -f H021-PUC89K_metastasis_2.Unmapped.out.mate1 ]; then
      mv H021-PUC89K_metastasis_2.Unmapped.out.mate1 H021-PUC89K_metastasis_2.unmapped_1.fastq
      gzip H021-PUC89K_metastasis_2.unmapped_1.fastq
  fi
  if [ -f H021-PUC89K_metastasis_2.Unmapped.out.mate2 ]; then
      mv H021-PUC89K_metastasis_2.Unmapped.out.mate2 H021-PUC89K_metastasis_2.unmapped_2.fastq
      gzip H021-PUC89K_metastasis_2.unmapped_2.fastq
  fi

  cat <<-END_VERSIONS > versions.yml
  "NFCORE_RNASPLICE:RNASPLICE:ALIGN_STAR:STAR_ALIGN":
      star: $(STAR --version | sed -e "s/STAR_//g")
      samtools: $(echo $(samtools --version 2>&1) | sed 's/^.*samtools //; s/Using.*$//')
      gawk: $(echo $(gawk --version 2>&1) | sed 's/^.*GNU Awk //; s/, .*$//')
  END_VERSIONS

Command exit status:
  -

I also notice that the job sometimes stays in the queue with STAT UNKWN.

@bentsherman
Member

Lack of exit status suggests that the job was terminated abruptly by your scheduler. Usually schedulers at least give you an exit code if you exceeded your resource limits. Maybe the node was shut down? You'll have to inspect the job logs and status from the scheduler for any underlying cause. If you can find a root cause then we can try to handle it better in Nextflow; otherwise there isn't much we can do.

@kerrycobb

kerrycobb commented Apr 30, 2024

I was having a nearly identical problem using Nextflow with Slurm. Most jobs were not starting and no error was being reported except for a small number of instances where there was a 255 error code. Of the jobs that did start, they were often killed with no error reported.

We found that the Slurm configuration file was not synced across nodes. Updating the permissions and re-synchronizing the configuration file resolved the issue.

@flass

flass commented May 20, 2024

Same issue here with LSF. Our system admin says that LSF tries to kill jobs that exceed their resources using different strategies, and if the OS is not responsive enough it eventually resorts to an aggressive kill (the exit code reported in the .command.log file is 9, suggesting LSF used kill -9), which does not leave Nextflow enough time to capture the exit code (no .exitcode file is produced).
Sadly this throws off our error-handling strategy (see here), which relies on interpreting the exit codes to retry with escalated resources where relevant, so we end up ignoring the failed jobs even though they could succeed if given more resources.
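
For context, a minimal sketch of the kind of exit-code-driven retry strategy I mean (not our actual config; the exit codes, multipliers, and limits are illustrative):

  // illustrative sketch of retrying with escalated resources based on the exit code
  process {
      // 140 is the status we normally see for resource kills; 130 and 137 are further examples
      errorStrategy = { task.exitStatus in [130, 137, 140] ? 'retry' : 'finish' }
      maxRetries    = 3
      memory        = { 4.GB * task.attempt }   // escalate memory on each attempt
      time          = { 4.h  * task.attempt }   // escalate walltime on each attempt
  }

The point being that when no .exitcode file is produced, task.exitStatus never matches these codes, so no retry is triggered.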

our environment is this:
OS: Ubuntu 22.04 / Linux 5.15.0-94-generic #104-Ubuntu SMP x86_64 GNU/Linux
scheduler: IBM Spectrum LSF Version 10.1 Fix Pack 14
Nextflow: 23.10.1-5891
Singularity: 3.11.4

@flass

flass commented May 21, 2024

Just expanding on my case, so that there is material for thinking about a solution:

I run a Nextflow pipeline that sometimes reports something like this when the resource usage of the task process exceeds the requested amount:

[71/c2120a] NOTE: Process `ABUNDANCE_ESTIMATION:BOWTIE2SAMTOOLS (48888_2#90)` terminated for an unknown reason -- Likely it has been terminated by the external system -- Error is ignored

On a test run, it seems that a little over half of the processes that failed for excess memory use report this error and are ignored for further processing, i.e. no retrying.
The rest report the usual exit status 140 (SIGUSR2), which leads to retrying the execution under our error-handling strategy.

looking for the execution report of the failed process example above in the pipeline trace, I can see this:

$ head -1 results/pipeline_info/execution_trace_2024-05-20_12-36-38.txt && grep '71/c2120a' results/pipeline_info/execution_trace_2024-05-20_12-36-38.txt
task_id	hash	native_id	name	status	exit	submit	duration	realtime	%cpu	peak_rss	peak_vmem	rchar	wchar
652	71/c2120a	100777	ABUNDANCE_ESTIMATION:BOWTIE2SAMTOOLS (48888_2#90)	FAILED	-	2024-05-20 12:41:33.456	12m 22s	11m 35s	-	-	-	-	-

The exit code is reported as -, which I suspect corresponds to a null value of the $exitStatus Groovy variable as seen by the master Nextflow process.
But if you go to the task work folder, you see this:

$ cd work/71/c2120acc53e89f36034211b95ce7a9/
$ ls -ltrA
total 198036
-rw-r--r-- 1 pathpipe rvidata       369 May 20 12:41 .command.sh
-rw-r--r-- 1 pathpipe rvidata     12058 May 20 12:41 .command.run
-rw-r--r-- 1 pathpipe rvidata         0 May 20 12:42 .command.trace
-rw-r--r-- 1 pathpipe rvidata         0 May 20 12:42 .command.out
-rw-r--r-- 1 pathpipe rvidata         0 May 20 12:42 .command.err
-rw-r--r-- 1 pathpipe rvidata         0 May 20 12:42 .command.begin
lrwxrwxrwx 1 pathpipe rvidata       175 May 20 12:42 48888_2#90_1_kneaddata_paired_2.fastq -> /lustre/scratch127/pam/projects/rvidata/bait_capture/rvi_prod_bait_dilution_exp-2024_05_16-48888_2/work/37/35a37bae527ea07e72417a491798c9/48888_2#90_1_kneaddata_paired_2.fastq
lrwxrwxrwx 1 pathpipe rvidata       175 May 20 12:42 48888_2#90_1_kneaddata_paired_1.fastq -> /lustre/scratch127/pam/projects/rvidata/bait_capture/rvi_prod_bait_dilution_exp-2024_05_16-48888_2/work/37/35a37bae527ea07e72417a491798c9/48888_2#90_1_kneaddata_paired_1.fastq
-rw-r--r-- 1 pathpipe rvidata 202753086 May 20 12:48 48888_2#90.sorted.bam.tmp.0000.bam
-rw-r--r-- 1 pathpipe rvidata      3825 May 20 12:49 .command.log
$ less .command.log
[... skipping to the bottom ...]
------------------------------------------------------------

TERM_MEMLIMIT: job killed after reaching LSF memory usage limit.
Exited with signal termination: 9.

Resource usage summary:

    CPU time :                                   1528.00 sec.
    Max Memory :                                 4131 MB
    Average Memory :                             2577.74 MB
    Total Requested Memory :                     4096.00 MB
    Delta Memory :                               -35.00 MB
    Max Swap :                                   -
    Max Processes :                              24
    Max Threads :                                49
    Run time :                                   446 sec.
    Turnaround time :                            450 sec.

The output (if any) is above this job summary.

The actual exit status of the job was 9, but it was not written to an .exitcode file, which is missing.

Now the question is: what is the behaviour of the Nextflow master process in this situation? If the $exitStatus Groovy variable is indeed taking a value of null, this is something we could handle.
However, as you can see here, we actually already handle a null value for this variable, and that does not solve the issue.
Instead, from the message in the log, terminated for an unknown reason -- Likely it has been terminated by the external system -- Error is ignored, it sounds like the Nextflow master process is deciding on its own to ignore the error, bypassing the retry strategy directives from our config.
@bentsherman is there a way to make Nextflow use the actual exit code, or at least not force-ignore the error?

any suggestion would be most helpful.

many thanks,

Florent

@flass

flass commented May 22, 2024

Update:
I determined that when the terminated for an unknown reason error occurs, i.e. when no .exitcode file is produced, the value of the $task.exitStatus variable is not set to null but to 2147483647 (the largest possible value of a signed 32-bit integer).
I'll see if I can rely on this to salvage our retry strategy (and how robust this hack can be).
It doesn't solve the issue, which I now think is really something to do with LSF / our HPC, but at least I may have a workaround to keep our Nextflow pipelines afloat.
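
For anyone wanting to try the same workaround, this is roughly what I have in mind (a sketch only; whether retrying with more memory is the right reaction is an assumption about your failure mode):

  // workaround sketch: treat the "no exit code" sentinel as retryable,
  // since in our case it mostly corresponds to jobs killed by LSF for exceeding memory
  process {
      // 2147483647 == Integer.MAX_VALUE, the value reported when no .exitcode file exists
      errorStrategy = { task.exitStatus in [140, 2147483647] ? 'retry' : 'finish' }
      maxRetries    = 2
      memory        = { 4.GB * task.attempt }   // escalate memory on each retry attempt
  }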

@flass

flass commented Jun 3, 2024

Seqera support confirms that:

In instances where the exit code file isn't generated, we utilize "Integer.MAX_VALUE" as an indicator that no exit code was returned.

If Nextflow devs ever want to build a handling mechanism for this, I personally think the solution would be: when the .exitcode file is not found and the executor is lsf, Nextflow could look at the bottom of the .command.log file for the job summary footer that LSF appends when closing the job. That footer features the actual exit code, which could be parsed and then fed to $task.exitStatus, or perhaps more robustly used to create an .exitcode file.
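
To make that concrete, here is a rough Groovy sketch of the parsing step I have in mind (the signal-termination line matches the footer shown above; the plain exit-code variant and the choice of which number to report are my assumptions):

  // rough sketch of recovering a status from the LSF job summary footer in .command.log
  import java.nio.file.Files
  import java.nio.file.Path

  Integer recoverLsfExitStatus(Path workDir) {
      def logFile = workDir.resolve('.command.log')
      if( !Files.exists(logFile) ) return null
      def text = logFile.text
      // footer line seen above: "Exited with signal termination: 9."
      def bySignal = (text =~ /Exited with signal termination: (\d+)\./)
      if( bySignal.find() ) return bySignal.group(1) as Integer   // or 128 + signal, shell-style
      // hypothetical variant for jobs that exit with a plain code rather than a signal
      def byCode = (text =~ /Exited with exit code (\d+)\./)
      if( byCode.find() ) return byCode.group(1) as Integer
      return null
  }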
