
chunking and gliding while head tail global patching

@ekg ekg released this 26 Aug 17:22
064fafa

Buildable source tarball: wfmash-v0.20.0.tar.gz

Major Changes

  1. New Global Alignment Approach:

    • Replaced the previous head and tail patching with a comprehensive global alignment strategy.
    • Implemented erode_head and erode_tail functions to remove small, potentially spurious matches at alignment boundaries.
    • The alignment now aims to include the entire query sequence, which is crucial when using the -P option to chunk mappings.
    • This ensures continuity across the entire sequence, which is especially important when mappings are broken into smaller pieces for easier alignment.
    • Switched from a semi-global approach (pinned at one end) to a fully global alignment, improving accuracy across the entire sequence length.
  2. Improved Chaining Algorithm:

    • Introduced an axis-weighted Euclidean distance function for more accurate chaining of mappings.
    • This new function helps break mappings when encountering large indels, which can be computationally expensive to align.
    • Improves detection of large structural variations directly from the mapping stage.
    • Reduces spurious chaining in satellite repetitive sequences by considering the diagonal nature of true matches.
    • The weighting maintains the original chain gap threshold for on-diagonal matches while effectively shortening the allowed distance for off-diagonal matches.
  3. Mapping and Alignment Improvements:

    • Modified the logic for determining cuttable positions in long alignments to avoid breaking alignments in the middle of structural variations (SVs).
    • Adjusted the merging of consecutive mappings to be more selective, prioritizing the preservation of potential SV signals.
    • Enhanced the handling of complex genomic structures by improving coordination between mapping and alignment stages.
  4. Performance Optimization:

    • Temporarily disabled multi-threaded FASTA input processing due to thread safety issues with the samtools faidx reader.
    • This change addresses memory efficiency concerns and prevents potential errors in multi-threaded environments.
    • Future updates may reintroduce multi-threaded processing with improved memory management.
    • Optimized the mapping process when not splitting sequences.
    • Improved efficiency of long mapping handling, particularly when max mapping length is set to infinity.
  5. Default Changes:

    • Changed the default maximum mapping length (-P/--max-mapping-length) to infinity, allowing for longer continuous alignments when appropriate.
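
The axis-weighted distance described under "Improved Chaining Algorithm" can be illustrated with a small sketch. This is not taken from the wfmash source: the function names, the weight w = 0.5, and the exact form of the weighting are assumptions chosen only to reproduce the stated behaviour, namely that on-diagonal displacements keep their plain Euclidean distance while off-diagonal ones are penalised.

```python
import math


def axis_weighted_distance(dx, dy, w=0.5):
    """Euclidean distance, inflated for off-diagonal displacements.

    dx, dy: gap between two mappings on the query and target axes.
    w: hypothetical penalty weight (an assumption, not a wfmash default).
    """
    adx, ady = abs(dx), abs(dy)
    if adx + ady == 0:
        return 0.0
    euclidean = math.hypot(adx, ady)
    # 0 when |dx| == |dy| (on the diagonal), 1 when the move is along one axis only.
    axis_factor = 1.0 - 2.0 * min(adx, ady) / (adx + ady)
    return euclidean * (1.0 + w * axis_factor)


def can_chain(dx, dy, chain_gap, w=0.5):
    """True if two mappings are close enough to be chained under this sketch."""
    return axis_weighted_distance(dx, dy, w) <= chain_gap
```

With this form, a gap of (7, 7) is judged against the chain gap by its unmodified Euclidean length, whereas a gap of (10, 0), which looks like a large indel, is treated as if it were 1.5 times longer and is therefore broken sooner.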

Minor Improvements and Bug Fixes

  • Enhanced error handling and validation throughout the alignment process.
  • Improved coordinate calculations, especially in edge cases involving sequence boundaries and large structural variations.
  • Added additional PAF output fields, including a chain identifier for merged mappings.
  • Adjusted parameters for more robust alignment in complex regions.

This release significantly improves wfmash's efficiency when handling complex genomic structures (e.g. centromeres) and large-scale variations, particularly when using the -P option to chunk mappings for more efficient alignment. This option is unset by default (the maximum mapping length is infinite), but we strongly recommend trying it if your alignment times are very slow. A good setting in testing has been -P50k.
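
To make the effect of -P concrete, here is a minimal sketch of splitting one long mapping into pieces no longer than a maximum length. It is a simplification under stated assumptions: the function name chunk_mapping is hypothetical, the split is even, and it ignores the cuttable-position logic described above that avoids cutting inside structural variations. Passing None stands in for the default of infinity.

```python
def chunk_mapping(query_start, query_end, max_len=50_000):
    """Split [query_start, query_end) into near-equal pieces of at most max_len.

    max_len=None models an infinite maximum mapping length (no chunking).
    """
    length = query_end - query_start
    if length <= 0:
        return []
    if max_len is None:
        return [(query_start, query_end)]
    n_pieces = -(-length // max_len)  # ceiling division
    base, extra = divmod(length, n_pieces)
    pieces = []
    pos = query_start
    for i in range(n_pieces):
        # Spread any remainder over the first `extra` pieces.
        size = base + (1 if i < extra else 0)
        pieces.append((pos, pos + size))
        pos += size
    return pieces
```

Because the pieces are contiguous and together cover the whole interval, each one can be aligned on its own without leaving gaps between chunks, which is the continuity that the global alignment approach relies on.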