
mhap fails on grid due to final query folder missing #2363

Open · samwichse opened this issue Jan 21, 2025 · 5 comments

@samwichse

This is basically the same issue as: #1191

I have a couple of PacBio datasets I'm attempting to assemble... but any combination of the two fails in the correction stage because the final query folder (the one where the last block is only compared to itself) in correction/1-overlapper/queries is never created. There is always one folder fewer than needed. For instance, my current set creates 192 query folders, but the results folder expects 193 ovb files, and the last job fails.

This seems like a bug... the issue linked above has a workaround (I'm trying it now, but the run hasn't reached this step yet), but it definitely looks like there's an off-by-one error happening here. I'm also running another assembly on the same data locally on a single node, without the grid, to see whether this is something specific to my data, although that probably won't finish for a couple of weeks.
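For reference, here's a quick way to compare the two counts (a sketch, with paths assumed relative to the assembly directory; I'm inferring the expected job count from the per-job .oc files in results, which is an assumption on my part):

# Number of per-block query folders that actually exist:
ls -d correction/1-overlapper/queries/[0-9]* | wc -l

# Number of mhap jobs that results expects output from (assuming one .oc per job):
ls correction/1-overlapper/results/*.oc | wc -l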

I can run PacBio HiFi assemblies with no problems in this same grid setup (using Slurm 23.11.8)...

@skoren (Member)

skoren commented Jan 21, 2025

Is this a single run of canu, or did you re-launch it at least once? Did any parameters change between the first and second run, if it was run twice? What's the exact command you're using, and can you post the full log of the canu run?

@samwichse (Author)

It was a single run of canu... however, I restarted (resumed) it, and that seems to have reset all the various log files? Same parameters.

I got the admins to update to canu 2.3, cancelled the whole thing, cleared the directory, and restarted with the same parameters to see if that solves the problem. Currently it's running the cormhap process on 82 nodes, so we'll see pretty soon if the latest version helps or not.

@samwichse (Author)

Actually, just checking the new process: the queries folder now goes up to 000193, where before it only created up to 000192, so maybe the update did fix it?

@samwichse (Author)

samwichse commented Jan 24, 2025

Whelp, got my hopes up... but when I checked on it yesterday evening, the 000193 directory had disappeared, and of course, when mhap got to that one, it barfed again:

Found perl (from '/usr/bin/perl'):
  /usr/bin/perl
  This is perl 5, version 32, subversion 1 (v5.32.1) built for x86_64-linux-thread-multi

Found java (from '/software/el9/apps/java/17/bin/java'):
  /software/el9/apps/java/17/bin/java
  openjdk version "17.0.11" 2024-04-16

Found canu (from '/software/el9/apps/canu/2.3/bin/canu'):
  /software/el9/apps/canu/2.3/bin/canu
  canu 2.3

-- canu 2.3
--
-- CITATIONS
--
-- For 'standard' assemblies of PacBio or Nanopore reads:
--   Koren S, Walenz BP, Berlin K, Miller JR, Phillippy AM.
--   Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation.
--   Genome Res. 2017 May;27(5):722-736.
--   http://doi.org/10.1101/gr.215087.116
--
-- Read and contig alignments during correction and consensus use:
--   Šošic M, Šikic M.
--   Edlib: a C/C++ library for fast, exact sequence alignment using edit distance.
--   Bioinformatics. 2017 May 1;33(9):1394-1395.
--   http://doi.org/10.1093/bioinformatics/btw753
--
-- Overlaps are generated using:
--   Berlin K, et al.
--   Assembling large genomes with single-molecule sequencing and locality-sensitive hashing.
--   Nat Biotechnol. 2015 Jun;33(6):623-30.
--   http://doi.org/10.1038/nbt.3238
--
--   Myers EW, et al.
--   A Whole-Genome Assembly of Drosophila.
--   Science. 2000 Mar 24;287(5461):2196-204.
--   http://doi.org/10.1126/science.287.5461.2196
--
-- Corrected read consensus sequences are generated using an algorithm derived from FALCON-sense:
--   Chin CS, et al.
--   Phased diploid genome assembly with single-molecule real-time sequencing.
--   Nat Methods. 2016 Dec;13(12):1050-1054.
--   http://doi.org/10.1038/nmeth.4035
--
-- Contig consensus sequences are generated using an algorithm derived from pbdagcon:
--   Chin CS, et al.
--   Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data.
--   Nat Methods. 2013 Jun;10(6):563-9
--   http://doi.org/10.1038/nmeth.2474
--
-- CONFIGURE CANU
--
-- Detected Java(TM) Runtime Environment '17.0.11' (from '/software/el9/apps/java/17/bin/java') without -d64 support.
-- Detected gnuplot version '5.4 patchlevel 8   ' (from 'gnuplot') and image format 'png'.
-- Detected samtools version '1.17' / htslib version '1.17' (from 'samtools').
--
-- Detected 2 CPUs and 384 gigabytes of memory on the local machine.
--
-- Detected Slurm with 'sinfo' binary in /usr/bin/sinfo.
-- Detected Slurm with task IDs up to 10000 allowed.
--
-- Slurm support detected.  Resources available:
--      8 hosts with  96 cores and 1509 GB memory.
--     20 hosts with 256 cores and 2250 GB memory.
--     72 hosts with  96 cores and  371 GB memory.
--      2 hosts with  80 cores and  753 GB memory.
--    100 hosts with  72 cores and  371 GB memory.
--      6 hosts with  80 cores and 1509 GB memory.
--
-- Found PacBio CLR reads in 'new_species.seqStore':
--   Libraries:
--     PacBio CLR:            1
--   Reads:
--     Raw:                   71844069449
--
--
--                         (tag)Threads
--                (tag)Memory         |
--        (tag)             |         |  algorithm
--        -------  ----------  --------  -----------------------------
-- Grid:  meryl     24.000 GB    8 CPUs  (k-mer counting)
-- Grid:  cormhap   30.000 GB    8 CPUs  (overlap detection with mhap)
-- Grid:  obtovl    16.000 GB    8 CPUs  (overlap detection)
-- Grid:  utgovl    16.000 GB    8 CPUs  (overlap detection)
-- Grid:  cor        -.--- GB    4 CPUs  (read correction)
-- Grid:  ovb        4.000 GB    1 CPU   (overlap store bucketizer)
-- Grid:  ovs       16.000 GB    1 CPU   (overlap store sorting)
-- Grid:  red       30.000 GB    8 CPUs  (read error detection)
-- Grid:  oea        8.000 GB    1 CPU   (overlap error adjustment)
-- Grid:  bat      128.000 GB   16 CPUs  (contig construction with bogart)
-- Grid:  cns        -.--- GB    8 CPUs  (consensus)
--
-- Generating assembly 'new_species' in '/path/to/assembly/folder':
--   genomeSize:
--     650000000
--
--   Overlap Generation Limits:
--     corOvlErrorRate 0.2400 ( 24.00%)
--     obtOvlErrorRate 0.0450 (  4.50%)
--     utgOvlErrorRate 0.0450 (  4.50%)
--
--   Overlap Processing Limits:
--     corErrorRate    0.2500 ( 25.00%)
--     obtErrorRate    0.0450 (  4.50%)
--     utgErrorRate    0.0450 (  4.50%)
--     cnsErrorRate    0.0750 (  7.50%)
--
--   Stages to run:
--     correct raw reads.
--     trim corrected reads.
--     assemble corrected and trimmed reads.
--
--
--
--
-- BEGIN CORRECTION
--
-- OVERLAPPER (mhap) (correction) complete, not rewriting scripts.
--
--
-- Mhap overlap jobs failed, tried 2 times, giving up.
--   job correction/1-overlapper/results/000193.ovb FAILED.
--

ABORT:
ABORT: canu 2.3
ABORT: Don't panic, but a mostly harmless error occurred and Canu stopped.
ABORT: Try restarting.  If that doesn't work, ask for help.
ABORT:

Command used to call canu:
canu -pacbio CLR_reads.fastq.gz -d canu_assembly_raw -genomeSize=650M -p new_species > canu_raw.log 2> canu_raw.stderr

The bottom of the mhap results folder looks like this while the last mhap processes are finishing, and then after it fails (the WORKING file never disappears):

-rw-r-----. 1 user.name group  25M Jan 23 13:46 000190.oc
-rw-r-----. 1 user.name group  10G Jan 23 13:46 000190.ovb
-rw-r-----. 1 user.name group  25M Jan 23 13:34 000191.oc
-rw-r-----. 1 user.name group 6.7G Jan 23 13:34 000191.ovb
-rw-r-----. 1 user.name group  25M Jan 23 12:31 000192.oc
-rw-r-----. 1 user.name group 3.3G Jan 23 12:31 000192.ovb
-rw-r-----. 1 user.name group    0 Jan 24 08:36 000193.mcvt.success
-rw-r-----. 1 user.name group    0 Jan 24 08:36 000193.mhap.ovb.WORKING
-rw-r-----. 1 user.name group  25M Jan 24 08:36 000193.oc
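
For anyone reproducing this, a quick way to check the same state after a failure (paths relative to the assembly directory):

# The per-block query folder that job 193 needs; by this point it's gone:
ls -ld correction/1-overlapper/queries/000193

# Leftover state for the failed job; the .WORKING marker never gets renamed away:
ls -l correction/1-overlapper/results/000193.*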

@skoren (Member)

skoren commented Jan 24, 2025

I don't see how the 000193 queries subfolder can be there and then disappear; this folder is not removed until after all jobs have completed. Even if it is removed, the shell script (mhap.sh) will extract the tar version of the folder to re-run if needed. I confirmed this by removing the folder during a run on our cluster, and the last job still completed without issue.
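
Roughly, the recovery logic is a guard like this (a paraphrased sketch, not the verbatim script; the archive path and job number are illustrative):

# If some external cleanup removed the per-job query folder, re-extract it
# from the archived copy before running the job.
if [ ! -d queries/000193 ] ; then
  tar -xf queries.tar queries/000193    # archive name is illustrative
fi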

So something is going on with your cluster that I think is outside of canu's control. Are you running in some kind of temp/scratch space where idle files are removed after a timeout? Can you post the full recursive contents of your 1-overlapper and canu-scripts folders? Also the mhap.sh script, the log from the failed step (mhap.*193.out), and the canu.*.out files in canu-scripts?
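
If it helps for gathering those, something like this from the assembly directory should grab everything in one go (globs follow the filenames above, and the log locations are an assumption; adjust the job number if yours differs):

# Recursive listings of both folders:
find correction/1-overlapper canu-scripts -type f -ls > listing.txt

# The script, the failed job's log, and the canu driver logs:
tar -czf canu-debug.tar.gz \
    correction/1-overlapper/mhap.sh \
    correction/1-overlapper/mhap.*193*.out \
    canu-scripts/canu.*.out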
