- All 11 existing tests passed
- Baseline time: 57 seconds (5 synthetic sequences of 100KB each)
- Change: 4^k motif enumeration -> O(n) sliding window scan
- Result: Test time reduced from 57s to 15s (3.7x improvement)
- Accuracy: 11/11 tests passed, stress test 100% sensitivity 99.1% precision
- Key modifications:
- Direct detection of period-k repeats using
text[i] == text[i+k]comparison - Allow short seeds (2 copies) -> extend with extend_with_mismatches then apply threshold
- Fall back to seed region when extension fails
- Direct detection of period-k repeats using
src/c_extensions/tier1_scan.c: Implemented period-k run detection in C- Python fallback maintained (automatic switch if C build fails)
- 11/11 tests passed, stress test 100% sensitivity / 99.1% precision
- Comparison on 5 synthetic sequences (44 repeats):
- bwtandem: 100% sensitivity, 97.8% precision
- TRF: 97.7% sensitivity, 100% precision
- bwtandem outperformed TRF on adjacent repeats (11/11 vs 10/11)
- 18.5MB sequence, total runtime 7 min 52 sec (472 sec)
- BWT construction: 172s, Tier 1: 11s, Tier 2: 245s, Tier 3: 40s
- Repeats found: 6,625
- Tier 1 speed: previous 10-15 min -> 11s (~80x improvement)
| Tool | Sensitivity | Precision | F1 |
|---|---|---|---|
| bwtandem | 100.0% | 97.8% | 98.9% |
| TRF | 97.7% | 100.0% | 98.9% |
| Tool | Results | Runtime |
|---|---|---|
| bwtandem | 6,625 | 7 min 52 sec |
| TRF | 4,549 | 34 sec |
| mreps | 84,502 | 48 sec |
| ULTRA | 23,145 | 24 min 11 sec |
- bwtandem: Highest sensitivity (100%), detects 45% more repeats than TRF
- TRF: Fastest (34 sec), high precision
- mreps: Over-detection (84K), filtering required
- ULTRA: Slowest (24 min), 3.5x more results than bwtandem
- Change: C extension for
smallest_period_str+ coverage-skip optimization + stride tuning - Key modifications:
src/c_extensions/tier2_accel.c: C implementations ofsmallest_period_str,smallest_period_str_approx,hamming_distance,batch_process_lcp_candidates- Python
smallest_period_strcalls replaced with C ctypes calls in hot loops - O(n²) dedup check replaced with O(1) coverage mask check
- Phase B stride increased from min 3 to min 5
- Coverage-based early skip for overlapping candidates
- Result: Tier 2 reduced from 245s to 117s (2.1x speedup)
- Overall: Chr4 total 472s → 317s (1.5x speedup, 5 min 17 sec)
- Accuracy: 11/11 tests passed, 100% sensitivity / 100% precision maintained
- Change: C extensions for align_repeat_region, BWT backward_search, Kasai LCP, removed unused code
- Key modifications:
src/c_extensions/align_accel.c: Full DP alignment loop in C (banded semi-global DP per copy + consensus updates). 40x faster than Python loop (0.095ms vs 3.8ms per call)src/c_extensions/bwt_accel.c: C implementations ofcount_equal_range,backward_search,kasai_lcp,batch_backward_search. backward_search 10x faster, Kasai LCP 50x+ fastersrc/bwt_core.py: Integrated C rank queries, backward_search, and Kasai LCP. Added flat checkpoint data preparation for zero-overhead C accesssrc/motif_utils.py: C-acceleratedalign_repeat_regionwith text pointer caching (avoids 72s from_buffer_copy). C-acceleratedsmallest_period_strandsmallest_period_str_approx- Removed unused
_build_kmer_hash(28s) and_sample_suffix_array(5s) from BWTCore init
- Result: Chr4 total 317s → 244s (1.3x speedup, 4 min 4 sec)
- BWT construction: 172s → 157s (sampled_sa removed, kmer_hash restored for accuracy)
- Tier 2: 117s → 65s (C batch processing + BWT backward search)
- Kasai LCP: 49s → <1s (C, 50x+)
- Tier 3: 40s → 5s (C acceleration)
- Test suite: 15.2s → 9.8s (1.5x faster)
- Accuracy: 29/29 tests passed, all ground truth tests maintain 100% sensitivity/precision
- Note: kmer_hash build restored (removing it caused adjacent repeat sensitivity regression)
- Change: Post-processing satellite DNA scanner using autocorrelation-based period detection
- Key modifications:
src/finder.py: Added_fill_satellite_gaps()method toTandemRepeatFinder- Builds numpy coverage mask from existing repeats
- Finds uncovered blocks (≥300bp, ≤100kb) using numpy transitions
- Proximity filter: only scans blocks within 50kb of existing satellite detections
- Multi-window autocorrelation: scans start, middle, and end of each block
- Threshold: ≥0.55 autocorrelation identity, periods 100-300bp
- Creates repeats directly from autocorrelation (skips expensive refine_repeat)
src/finder.py: Improved_merge_adjacent_repeats()for satellite DNA- Fuzzy canonical motif matching (≤10% hamming distance for motifs ≥50bp)
- Vectorized gap autocorrelation check using numpy
- Skip
_recompute_statsfor merged regions >50kb - Large gap allowance for satellite DNA (period × 100)
src/motif_utils.py: Added_detect_satellite_period()for highly divergent repeats- Autocorrelation-based primitive period detection (≥60% identity threshold)
- Sub-period search for compound satellite motifs
- Result: ColCEN full genome centromere benchmark
- CEN180 unit coverage: Chr1 95.5%, Chr2 99.7%, Chr3 99.4%, Chr4 98.0%, Chr5 98.6%, Overall 98.2%
- bwtandem dramatically outperforms all other tools (mreps 30.9%, ULTRA 0.1%, TRF 0.0%)
- Runtime: 27 min for full 132MB ColCEN genome (5 chromosomes + organelles)
- Satellite scanner adds <15s per chromosome
- Accuracy: 29/29 tests passed, all ground truth tests maintain 100% sensitivity/precision