Process terminated for an unknown reason -- Likely it has been terminated by the external system #2847
Comments
Is it possible that your job was preempted by a higher-priority user?
Thanks for your reply! How could that happen, as I am the only user running the pipeline in my own allocated space? What I don't understand is how a "PENDING" job can be considered "COMPLETED" by the NF engine, as shown in the log file.
Ah, well if you're the only user then it's definitely not preemption. I should have taken a closer look at your log -- I see that the task logs generated by Nextflow are not being found. If you're using a shared filesystem, then I suspect the issue is related to timestamps getting out of sync between the filesystem, the Nextflow job, and the task jobs. I've encountered similar errors while building code on shared filesystems. However, I'll have to defer to others, because I don't remember if there is any troubleshooting advice for Nextflow in this situation.
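If clock skew is the suspect, a quick sanity check is to compare a shared file's timestamp against each node's clock. A minimal sketch, assuming a shared mount, passwordless SSH, and GNU stat; the mount point and node names below are placeholders:

```bash
SHARED=/shared/scratch          # placeholder: your shared mount point
probe="$SHARED/.skew_probe.$$"
touch "$probe"
for node in node01 node02; do   # placeholder compute node names
    # Each node prints its own clock, then the probe's mtime as it sees it;
    # a large gap between the two suggests skew against the filesystem.
    ssh "$node" "date +%s; stat -c %Y '$probe'"
done
rm -f "$probe"
```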
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
@hukai916 Hi, I have run into the exact same problem today. Have you found a solution or identified the cause?
Sadly it is hard to replicate and I don't have a solution yet. However, I configured the pipeline to use "short" queues to minimize the pending time (our cluster prioritizes "short" queues), which seems to reduce the incidence.
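For what it's worth, routing everything to a shorter queue doesn't require editing the pipeline itself; a small config overlay is enough. A sketch assuming an LSF executor, a queue literally named "short", and a hypothetical main.nf entry script:

```bash
# Write a config overlay that pins every process to the "short" queue
# (the executor and queue name here are site-specific assumptions).
cat > short-queue.config <<'EOF'
process {
    executor = 'lsf'
    queue    = 'short'
}
EOF

# Layer it on top of the pipeline's own configuration.
nextflow run main.nf -c short-queue.config
```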
Hello, I am also having this problem. The bot marked this issue as 'stale', but as we are still having this problem, it is neither 'stale' nor 'completed'.
Similar issue over here. |
I'm also having a similar issue, also on LSF. Is it possible to reopen this issue, or find some way to debug it? I also see that the command finishes successfully in .command.out. Here is my log output:
I also notice that the job sometimes stays in the queue with STAT UNKWN.
Lack of exit status suggests that the job was terminated abruptly by your scheduler. Usually they at least give you an exit code if you exceeded your resource limits. Maybe the node was shut down? You'll have to inspect the job logs and status from the scheduler for any underlying cause. If you can find a root cause then we can try to handle it better in Nextflow; otherwise there isn't much that we can do.
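For LSF in particular, the scheduler's own record of the job usually explains an abrupt kill. A sketch with a placeholder job id (take the real one from .nextflow.log):

```bash
jobid=12345            # placeholder: the LSF job id of the failed task
bjobs -l "$jobid"      # current state and pending reasons, while still known
bhist -l "$jobid"      # full event history, incl. TERM_* termination reasons
bacct -l "$jobid"      # accounting record: exit code, CPU and memory usage
```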
I was having a nearly identical problem using Nextflow with Slurm. Most jobs were not starting and no error was being reported, except for a small number of instances where there was a 255 error code. Of the jobs that did start, they were often killed with no error reported. We found that the Slurm configuration file was not synced across nodes. Updating the permissions and re-synchronizing the configuration file resolved the issue.
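A quick way to catch that kind of drift, sketched with hypothetical node names and a typical slurm.conf path:

```bash
# Compare slurm.conf checksums across nodes; any mismatch means the
# configuration is out of sync, as in the case described above.
for node in node01 node02 node03; do
    ssh "$node" 'md5sum /etc/slurm/slurm.conf'
done

# The controller's view of the configuration it actually loaded.
scontrol show config | head
```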
Same issue here with LSF. Our system admin says that LSF tries to kill jobs that exceed their resources using different strategies, and if the OS is not responsive enough, it eventually ends up using an aggressive kill. The exit code that is reported from our environment is this:
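As a general decoding aid (the status value below is hypothetical, not the code from that environment): an exit status above 128 usually means the job was killed by signal (status - 128), which bash can name directly:

```bash
status=137    # hypothetical exit status reported by the scheduler
if [ "$status" -gt 128 ]; then
    # kill -l maps a signal number to its name (137 -> 9 -> KILL)
    echo "killed by SIG$(kill -l $((status - 128)))"
fi
```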
Just expanding on my case, so that there is material for thinking about a solution: I run a Nextflow pipeline that has sometimes reported something like this when the resource usage of the task process exceeds the requested amount:
On a test run, it seems like a little over half of the processes that failed for excess memory use report this error and are ignored for further processing, i.e. there is no retrying. Looking for the execution report of the failed process example above in the pipeline trace, I can see this:
the exit code is reported as
the actual exit status of the job was

Now the question is: what is the behaviour of the Nextflow master process given this? Any suggestion would be most helpful. Many thanks, Florent
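When the trace shows no usable exit code, the task's work directory often still tells the real story; a sketch with a made-up task hash (take yours from the trace or report):

```bash
cd work/ab/cdef12    # made-up task directory; yours comes from the trace
cat .exitcode        # the exit status Nextflow captured, if it ever did
tail .command.log    # wrapper log, including any scheduler messages
tail .command.err    # the task's own stderr
```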
update: |
Seqera support confirms that:
if Nextflow devs ever want to build a handling mechanism for this, I personally think the solution would be that when the |
Bug report
Expected behavior and actual behavior
I expect the NF engine to correctly capture the status of jobs submitted to the LSF scheduler, but it fails to do so occasionally.
Steps to reproduce the problem
I didn't have this problem using the same workflow over the past few months, but I started to encounter it recently.
This problem is hard to reproduce because it happens sporadically. I was able to capture one such case and traced the .nextflow.log file, finding the following facts: the job was submitted at May-03 09:10:36.762; it was still PENDING 10 minutes later, at May-03 09:20:40.795, but was considered COMPLETED only a few milliseconds after that, at May-03 09:20:40.803, and the NF engine couldn't locate its output files because they had not been generated at that time. Therefore, NF raised an error, terminated all other processes, and exited.

The question boils down to: why was a job that actually finished at May 3 18:24:37 2022 considered COMPLETED at May-03 09:20:40.803?

I noticed similar issues here (#2540), here (#1045), and here (#1644).
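For anyone hitting the same symptom, one way to narrow it down is to put Nextflow's view of the task next to the scheduler's own event history. A sketch with a placeholder job id and node name:

```bash
# Every line Nextflow logged about the suspect task
# ("12345" stands in for the LSF job id from your own log).
grep '12345' .nextflow.log

# LSF's event history for the same job: submit/start/finish times and
# any TERM_* reason, on the scheduler's clock rather than Nextflow's.
bhist -l 12345

# If the two timelines disagree, compare clocks directly
# ("node01" is a hypothetical compute node).
date; ssh node01 date
```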
Any input to further debug will be highly appreciated.
Environment
Bash version: ($SHELL --version) GNU bash, version 4.4.20(1)-release (x86_64-redhat-linux-gnu)