Skip to content

Latest commit

 

History

History
107 lines (95 loc) · 6.09 KB

File metadata and controls

107 lines (95 loc) · 6.09 KB

Code Update Log

Step 1: Verify Code Functionality (Complete)

  • All 11 existing tests passed
  • Baseline time: 57 seconds (5 synthetic sequences of 100KB each)

Step 2: Tier 1 Sliding Window Conversion (Complete)

  • Change: 4^k motif enumeration -> O(n) sliding window scan
  • Result: Test time reduced from 57s to 15s (3.7x improvement)
  • Accuracy: 11/11 tests passed, stress test 100% sensitivity 99.1% precision
  • Key modifications:
    • Direct detection of period-k repeats using text[i] == text[i+k] comparison
    • Allow short seeds (2 copies) -> extend with extend_with_mismatches then apply threshold
    • Fall back to seed region when extension fails

Step 3: C Extension to Accelerate Tier 1 Run Detection (Complete)

  • src/c_extensions/tier1_scan.c: Implemented period-k run detection in C
  • Python fallback maintained (automatic switch if C build fails)
  • 11/11 tests passed, stress test 100% sensitivity / 99.1% precision

Step 4: bwtandem vs TRF Comparison (Complete)

  • Comparison on 5 synthetic sequences (44 repeats):
    • bwtandem: 100% sensitivity, 97.8% precision
    • TRF: 97.7% sensitivity, 100% precision
  • bwtandem outperformed TRF on adjacent repeats (11/11 vs 10/11)

Step 5: Chr4 Execution (Complete)

  • 18.5MB sequence, total runtime 7 min 52 sec (472 sec)
  • BWT construction: 172s, Tier 1: 11s, Tier 2: 245s, Tier 3: 40s
  • Repeats found: 6,625
  • Tier 1 speed: previous 10-15 min -> 11s (~80x improvement)

Step 6: Tool Comparison (Complete)

Synthetic Sequence Comparison (44 ground truth repeats)

Tool Sensitivity Precision F1
bwtandem 100.0% 97.8% 98.9%
TRF 97.7% 100.0% 98.9%

Chr4 Execution Comparison

Tool Results Runtime
bwtandem 6,625 7 min 52 sec
TRF 4,549 34 sec
mreps 84,502 48 sec
ULTRA 23,145 24 min 11 sec

Conclusions

  • bwtandem: Highest sensitivity (100%), detects 45% more repeats than TRF
  • TRF: Fastest (34 sec), high precision
  • mreps: Over-detection (84K), filtering required
  • ULTRA: Slowest (24 min), 3.5x more results than bwtandem

Step 7: Tier 2 C Extension Acceleration (Complete)

  • Change: C extension for smallest_period_str + coverage-skip optimization + stride tuning
  • Key modifications:
    • src/c_extensions/tier2_accel.c: C implementations of smallest_period_str, smallest_period_str_approx, hamming_distance, batch_process_lcp_candidates
    • Python smallest_period_str calls replaced with C ctypes calls in hot loops
    • O(n²) dedup check replaced with O(1) coverage mask check
    • Phase B stride increased from min 3 to min 5
    • Coverage-based early skip for overlapping candidates
  • Result: Tier 2 reduced from 245s to 117s (2.1x speedup)
  • Overall: Chr4 total 472s → 317s (1.5x speedup, 5 min 17 sec)
  • Accuracy: 11/11 tests passed, 100% sensitivity / 100% precision maintained

Step 8: Comprehensive C Extension Acceleration (Complete)

  • Change: C extensions for align_repeat_region, BWT backward_search, Kasai LCP, removed unused code
  • Key modifications:
    • src/c_extensions/align_accel.c: Full DP alignment loop in C (banded semi-global DP per copy + consensus updates). 40x faster than Python loop (0.095ms vs 3.8ms per call)
    • src/c_extensions/bwt_accel.c: C implementations of count_equal_range, backward_search, kasai_lcp, batch_backward_search. backward_search 10x faster, Kasai LCP 50x+ faster
    • src/bwt_core.py: Integrated C rank queries, backward_search, and Kasai LCP. Added flat checkpoint data preparation for zero-overhead C access
    • src/motif_utils.py: C-accelerated align_repeat_region with text pointer caching (avoids 72s from_buffer_copy). C-accelerated smallest_period_str and smallest_period_str_approx
    • Removed unused _build_kmer_hash (28s) and _sample_suffix_array (5s) from BWTCore init
  • Result: Chr4 total 317s → 244s (1.3x speedup, 4 min 4 sec)
    • BWT construction: 172s → 157s (sampled_sa removed, kmer_hash restored for accuracy)
    • Tier 2: 117s → 65s (C batch processing + BWT backward search)
    • Kasai LCP: 49s → <1s (C, 50x+)
    • Tier 3: 40s → 5s (C acceleration)
    • Test suite: 15.2s → 9.8s (1.5x faster)
  • Accuracy: 29/29 tests passed, all ground truth tests maintain 100% sensitivity/precision
  • Note: kmer_hash build restored (removing it caused adjacent repeat sensitivity regression)

Step 9: Satellite DNA Scanner for Centromere Detection (Complete)

  • Change: Post-processing satellite DNA scanner using autocorrelation-based period detection
  • Key modifications:
    • src/finder.py: Added _fill_satellite_gaps() method to TandemRepeatFinder
      • Builds numpy coverage mask from existing repeats
      • Finds uncovered blocks (≥300bp, ≤100kb) using numpy transitions
      • Proximity filter: only scans blocks within 50kb of existing satellite detections
      • Multi-window autocorrelation: scans start, middle, and end of each block
      • Threshold: ≥0.55 autocorrelation identity, periods 100-300bp
      • Creates repeats directly from autocorrelation (skips expensive refine_repeat)
    • src/finder.py: Improved _merge_adjacent_repeats() for satellite DNA
      • Fuzzy canonical motif matching (≤10% hamming distance for motifs ≥50bp)
      • Vectorized gap autocorrelation check using numpy
      • Skip _recompute_stats for merged regions >50kb
      • Large gap allowance for satellite DNA (period × 100)
    • src/motif_utils.py: Added _detect_satellite_period() for highly divergent repeats
      • Autocorrelation-based primitive period detection (≥60% identity threshold)
      • Sub-period search for compound satellite motifs
  • Result: ColCEN full genome centromere benchmark
    • CEN180 unit coverage: Chr1 95.5%, Chr2 99.7%, Chr3 99.4%, Chr4 98.0%, Chr5 98.6%, Overall 98.2%
    • bwtandem dramatically outperforms all other tools (mreps 30.9%, ULTRA 0.1%, TRF 0.0%)
    • Runtime: 27 min for full 132MB ColCEN genome (5 chromosomes + organelles)
    • Satellite scanner adds <15s per chromosome
  • Accuracy: 29/29 tests passed, all ground truth tests maintain 100% sensitivity/precision