anything, anywhere, everywhere
Buildable Source Tarball: wfmash-v0.15.0.tar.gz
Initial experiments in our all-to-all alignment of the draft vertebrate genomes project demonstrated that we were not generating end-to-end alignments for many mashmap3 homology pairs at 70% ANI (wfmash -m -p 70). Exploration showed that our attempts at automatically tuning alignment parameters based on mashmap-estimated identity simply didn't work. The parameter settings we used meant that optimal wflign alignments were often I*D*, or "fully indel-ed", leading to no insight into the homology between the pairs even when internal WFA segments did match.
To avoid this "gotcha" and ensure we obtain an alignment, we set the softest wflign parameters possible that maintain the inequality match < gap-extend < mismatch < gap-open: match=0, mismatch=2, gap-open=3, gap-extend=1. We also use 0,3,4,2,24,1 for our WFA patching parameters, matching minimap2's asm20 setting. These changes lead to a major improvement in runtime and memory usage during alignment: in WFA, where time and memory are on the order of score or score*score, smaller scores mean lower memory and faster runtime.
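To make the "fully indel-ed" failure mode concrete, here is a minimal Python sketch that scores CIGAR strings under gap-affine penalties. It is an illustration only, not wfmash or wflign code: the names Penalties and cigar_penalty are hypothetical, the "harsh" parameter set is invented for contrast (it is not the setting we used before), and it assumes the convention that a gap of length n costs gap-open + n * gap-extend.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Penalties:
    # Defaults are the soft wflign parameters quoted in these notes.
    match: int = 0
    mismatch: int = 2
    gap_open: int = 3
    gap_extend: int = 1

    def is_ordered(self) -> bool:
        # The inequality from the text: match < gap-extend < mismatch < gap-open.
        return self.match < self.gap_extend < self.mismatch < self.gap_open


def cigar_penalty(cigar: str, p: Penalties) -> int:
    """Total penalty of an extended CIGAR (=, X, I, D operations only).

    Assumption: a gap run of length n costs gap_open + n * gap_extend.
    """
    total = 0
    for length, op in re.findall(r"(\d+)([=XID])", cigar):
        n = int(length)
        if op == "=":
            total += n * p.match
        elif op == "X":
            total += n * p.mismatch
        else:  # "I" or "D": one affine gap run
            total += p.gap_open + n * p.gap_extend
    return total


soft = Penalties()
# Hypothetical stricter penalties, chosen only to show the effect.
harsh = Penalties(match=0, mismatch=6, gap_open=8, gap_extend=1)

worst_case_aligned = "100X"   # every column aligned, every column a mismatch
fully_indel_ed = "100I100D"   # the I*D* shape: no column aligned at all

for name, p in (("soft", soft), ("harsh", harsh)):
    print(name,
          cigar_penalty(worst_case_aligned, p),
          cigar_penalty(fully_indel_ed, p))
```

With the soft penalties, replacing n aligned columns by an insertion plus a deletion costs 2n + 6, while aligning them costs at most 2n, so the I*D* shape cannot win over an alignment of the same columns. With the hypothetical harsh penalties the comparison flips once the region is divergent enough, which is the kind of behavior described above.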
We also ran into portability issues. The biggest improvement was to bring back static builds with options to enable generic compatibility with many recent x86 systems. This will allow direct distribution of binaries in these releases.
We also hit some very weird software bugs that led us to drop jemalloc. It was causing very strange problems (such as IOT-like invalid instruction errors and signal 9 allocation errors at 5% RAM usage) and offers no obvious performance advantage in wfmash's current setup. We mention it here because it was a very tricky bug to resolve.
New Features and Enhancements
Breaking Changes
wfmash now requires the query FASTA sequence to be bgzipped and samtools faidx indexed as well as the target sequence. This gives us random access to the query, which improves performance in parallel and high-performance computing settings: we no longer have to spool through very big query files when we're only aligning a very small part of them.
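The following Python sketch shows why a faidx-style index makes this cheap. It is not wfmash code (wfmash fetches sequences through htslib); build_fai and fetch are hypothetical helpers, and for simplicity the sketch works on an uncompressed, in-memory FASTA with uniform line lengths. The index fields mirror the .fai columns: sequence length, byte offset of the first base, bases per line, and bytes per line. For bgzipped files an additional block index maps these uncompressed offsets into the compressed file.

```python
import io


def build_fai(fasta_bytes: bytes) -> dict:
    """Map each sequence name to (length, offset, linebases, linewidth)."""
    index = {}
    pos = 0
    name = None
    for line in io.BytesIO(fasta_bytes):
        if line.startswith(b">"):
            name = line[1:].split()[0].decode()
            # Offset of the first base is the byte right after the header line.
            index[name] = [0, pos + len(line), 0, 0]
        else:
            entry = index[name]
            bases = len(line.rstrip(b"\r\n"))
            if entry[2] == 0:
                # First sequence line defines bases per line and bytes per line.
                entry[2], entry[3] = bases, len(line)
            entry[0] += bases
        pos += len(line)
    return {k: tuple(v) for k, v in index.items()}


def fetch(handle, index: dict, name: str, start: int, end: int) -> str:
    """Return bases [start, end) (0-based) of `name` with a single seek."""
    length, offset, linebases, linewidth = index[name]
    end = min(end, length)
    if start >= end:
        return ""

    def byte_offset(p: int) -> int:
        # Whole lines before p, plus the position within its line.
        return offset + (p // linebases) * linewidth + p % linebases

    handle.seek(byte_offset(start))
    raw = handle.read(byte_offset(end) - byte_offset(start))
    return raw.replace(b"\n", b"").replace(b"\r", b"").decode()


# Toy two-record FASTA, wrapped at 8 bases per line.
fasta = b">chr1 desc\nACGTACGT\nTTGGCCAA\nAC\n>chr2\nGGGG\n"
index = build_fai(fasta)
handle = io.BytesIO(fasta)
print(fetch(handle, index, "chr1", 6, 10))
```

The fetch reads only the bytes covering the requested interval, regardless of how large the rest of the file is, which is the property that avoids spooling through the whole query.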
Publications
- Added a new citation for the biWFA algorithm:
  - Santiago Marco-Sola, Jordan M. Eizenga, Andrea Guarracino, Benedict Paten, Erik Garrison, and Miquel Moreto. "Optimal gap-affine alignment in O(s) space". Bioinformatics, 2023.
Build System
- Configurable Build Options: Introduced new CMake options to make the build process more flexible:
  - BUILD_STATIC: Option to build a static binary.
  - BUILD_DEPS: Option to build external dependencies (htslib, gsl, libdeflate) from source.
  - BUILD_RETARGETABLE: Option to build a retargetable binary without machine-specific optimizations.
- Static Compilation: Improved support for static compilation, including the ability to build static binaries and handle external dependencies more flexibly.
- OpenMP Support: Added OpenMP support for parallel processing.
- Improved Documentation: Updated the README to provide detailed instructions for building from source, including static and retargetable binaries.
Performance and Optimization
- Optimized Compilation Flags: Adjusted compilation flags for better performance and compatibility across different systems.
- Memory Management: Improved memory management by reducing the number of sketches kept in memory during large alignments.
- Query Sequence Handling: Enhanced the handling of query sequences to support random access, reducing memory usage and improving performance.
Bug Fixes
- Memory Access Errors: Fixed potential memory access errors by adding bounds checks for sequence indices.
- Thread Safety: Ensured thread safety by using a single faidx_t object for sequence fetching, shared among multiple threads.
- Alignment Filtering: Disabled low-identity filtering by default to ensure all alignments are kept for post-processing.
Miscellaneous
- Nix and Guix Support: Added support for building wfmash using Nix and Guix, including Docker image generation.
- Test Cases: Added a script to generate test cases for wflign, facilitating easier testing and validation.
Detailed Changes
Commit Highlights
- Commit 577c3de: Added biWFA citation to the README.
- Commit 1d142d9: Merged changes for Stampede3 build configuration.
- Commit d55cfe7: Made the build configurable and documented how to use the new options.
- Commit 18e33b0: Fixed the path for libdeflate in the CMake configuration.
- Commit 9ff0452: Merged updates for scoring parameter optimizations.
- Commit 609082b: Updated build to use Clang and removed jemalloc dependency.
- Commit e6f1824: Restored micromamba/anaconda support.
- Commit debeff7: Debugged build on TACC's Stampede3 cluster.
- Commit 75a6631: Improved build process for Stampede3 cluster.
- Commit 719381c: Avoided -march=native for broader compatibility.
- Commit 081213c: Fixed memory management issues in alignment code.
- Commit fb4c6d0: Used generic modern optimizations, avoiding processor-specific flags.
- Commit c04088e: Ensured zero-termination of sequence data fetched with faidx_fetch_seq64.
- Commit ea76722: Added validation for mashmap input rows.
- Commit fedad55: Reduced the number of sketches kept in memory during large alignments.
- Commit a7aa342: Improved queue behavior and memory management in alignment code.
- Commit 581b364: Disabled low-identity filtering by default.
- Commit 3f1f7af: Corrected documentation of queues.
- Commit acd7fdc: Updated atomic queue definition for better single-producer multi-consumer behavior.
- Commit 2b91145: Avoided deadlock on empty input files.
- Commit edcd281: Used a single faidx_t object for sequence fetching to save memory.
- Commit a04908b: Fixed scoring parameters for diverse alignment problems.
- Commit bb0d43d: Merged updates for forcibly using biWFA alignment.
- Commit 3e434f8: Added a script to create test cases for wflign.
- Commit cc89be6: Added option to force global biWFA alignment.
- Commit 35194d8: Merged updates for random access to queries during alignment.
- Commit c738c1d: Stopped sorting the input mapping file for better performance.
- Commit 70e896a: Removed redundant query sequence processing.
- Commit 6fcddc2: Enabled random access of query subsequences in alignment.
- Commit d9a0880: Limited to one query sequence file for simplicity.
- Commit 729d9d7: Merged updates for static build reimplementation.
- Commit 8477337: Corrected debugging build with PNG and TSV support.
- Commit efc8f04: Added libdeflate as a dependency.
- Commit deb1472: Updated minimum CMake versions.
- Commit d1588e6: Described static compilation options in the README.
- Commit 2713899: Defaulted to non-static builds.
- Commit a96919e: Reimplemented static builds.
- Commit 899e154: Bumped Nix build configuration.
- Commit 7376468: Reverted removal of flake.lock.
- Commit 526995a: Removed flake.lock.
- Commit f94b7ee: Updated Nix build configuration.
- Commit e2df9c8: Moved to Nix flake.
- Commit cbedc8f: Locked the Nix flake.
- Commit b0d0ada: Added Nix flake configuration.
Happy whole-genome-aligning! 🔬🧬📊