Code Update Log

Step 1: Verify Code Functionality (Complete)

All 11 existing tests passed
Baseline time: 57 seconds (5 synthetic sequences of 100KB each)

Step 2: Tier 1 Sliding Window Conversion (Complete)

Change: 4^k motif enumeration -> O(n) sliding window scan
Result: Test time reduced from 57s to 15s (3.7x improvement)
Accuracy: 11/11 tests passed, stress test 100% sensitivity 99.1% precision
Key modifications:
- Direct detection of period-k repeats using text[i] == text[i+k] comparison
- Allow short seeds (2 copies) -> extend with extend_with_mismatches then apply threshold
- Fall back to seed region when extension fails

Step 3: C Extension to Accelerate Tier 1 Run Detection (Complete)

src/c_extensions/tier1_scan.c: Implemented period-k run detection in C
Python fallback maintained (automatic switch if C build fails)
11/11 tests passed, stress test 100% sensitivity / 99.1% precision

Step 4: bwtandem vs TRF Comparison (Complete)

Comparison on 5 synthetic sequences (44 repeats):
- bwtandem: 100% sensitivity, 97.8% precision
- TRF: 97.7% sensitivity, 100% precision
bwtandem outperformed TRF on adjacent repeats (11/11 vs 10/11)

Step 5: Chr4 Execution (Complete)

18.5MB sequence, total runtime 7 min 52 sec (472 sec)
BWT construction: 172s, Tier 1: 11s, Tier 2: 245s, Tier 3: 40s
Repeats found: 6,625
Tier 1 speed: previous 10-15 min -> 11s (~80x improvement)

Step 6: Tool Comparison (Complete)

Synthetic Sequence Comparison (44 ground truth repeats)

Tool	Sensitivity	Precision	F1
bwtandem	100.0%	97.8%	98.9%
TRF	97.7%	100.0%	98.9%

Chr4 Execution Comparison

Tool	Results	Runtime
bwtandem	6,625	7 min 52 sec
TRF	4,549	34 sec
mreps	84,502	48 sec
ULTRA	23,145	24 min 11 sec

Conclusions

bwtandem: Highest sensitivity (100%), detects 45% more repeats than TRF
TRF: Fastest (34 sec), high precision
mreps: Over-detection (84K), filtering required
ULTRA: Slowest (24 min), 3.5x more results than bwtandem

Step 7: Tier 2 C Extension Acceleration (Complete)

Change: C extension for smallest_period_str + coverage-skip optimization + stride tuning
Key modifications:
- src/c_extensions/tier2_accel.c: C implementations of smallest_period_str, smallest_period_str_approx, hamming_distance, batch_process_lcp_candidates
- Python smallest_period_str calls replaced with C ctypes calls in hot loops
- O(n²) dedup check replaced with O(1) coverage mask check
- Phase B stride increased from min 3 to min 5
- Coverage-based early skip for overlapping candidates
Result: Tier 2 reduced from 245s to 117s (2.1x speedup)
Overall: Chr4 total 472s → 317s (1.5x speedup, 5 min 17 sec)
Accuracy: 11/11 tests passed, 100% sensitivity / 100% precision maintained

Step 8: Comprehensive C Extension Acceleration (Complete)

Change: C extensions for align_repeat_region, BWT backward_search, Kasai LCP, removed unused code
Key modifications:
- src/c_extensions/align_accel.c: Full DP alignment loop in C (banded semi-global DP per copy + consensus updates). 40x faster than Python loop (0.095ms vs 3.8ms per call)
- src/c_extensions/bwt_accel.c: C implementations of count_equal_range, backward_search, kasai_lcp, batch_backward_search. backward_search 10x faster, Kasai LCP 50x+ faster
- src/bwt_core.py: Integrated C rank queries, backward_search, and Kasai LCP. Added flat checkpoint data preparation for zero-overhead C access
- src/motif_utils.py: C-accelerated align_repeat_region with text pointer caching (avoids 72s from_buffer_copy). C-accelerated smallest_period_str and smallest_period_str_approx
- Removed unused _build_kmer_hash (28s) and _sample_suffix_array (5s) from BWTCore init
Result: Chr4 total 317s → 244s (1.3x speedup, 4 min 4 sec)
- BWT construction: 172s → 157s (sampled_sa removed, kmer_hash restored for accuracy)
- Tier 2: 117s → 65s (C batch processing + BWT backward search)
- Kasai LCP: 49s → <1s (C, 50x+)
- Tier 3: 40s → 5s (C acceleration)
- Test suite: 15.2s → 9.8s (1.5x faster)
Accuracy: 29/29 tests passed, all ground truth tests maintain 100% sensitivity/precision
Note: kmer_hash build restored (removing it caused adjacent repeat sensitivity regression)

Step 9: Satellite DNA Scanner for Centromere Detection (Complete)

Change: Post-processing satellite DNA scanner using autocorrelation-based period detection
Key modifications:
- src/finder.py: Added _fill_satellite_gaps() method to TandemRepeatFinder
  - Builds numpy coverage mask from existing repeats
  - Finds uncovered blocks (≥300bp, ≤100kb) using numpy transitions
  - Proximity filter: only scans blocks within 50kb of existing satellite detections
  - Multi-window autocorrelation: scans start, middle, and end of each block
  - Threshold: ≥0.55 autocorrelation identity, periods 100-300bp
  - Creates repeats directly from autocorrelation (skips expensive refine_repeat)
- src/finder.py: Improved _merge_adjacent_repeats() for satellite DNA
  - Fuzzy canonical motif matching (≤10% hamming distance for motifs ≥50bp)
  - Vectorized gap autocorrelation check using numpy
  - Skip _recompute_stats for merged regions >50kb
  - Large gap allowance for satellite DNA (period × 100)
- src/motif_utils.py: Added _detect_satellite_period() for highly divergent repeats
  - Autocorrelation-based primitive period detection (≥60% identity threshold)
  - Sub-period search for compound satellite motifs
Result: ColCEN full genome centromere benchmark
- CEN180 unit coverage: Chr1 95.5%, Chr2 99.7%, Chr3 99.4%, Chr4 98.0%, Chr5 98.6%, Overall 98.2%
- bwtandem dramatically outperforms all other tools (mreps 30.9%, ULTRA 0.1%, TRF 0.0%)
- Runtime: 27 min for full 132MB ColCEN genome (5 chromosomes + organelles)
- Satellite scanner adds <15s per chromosome
Accuracy: 29/29 tests passed, all ground truth tests maintain 100% sensitivity/precision

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Code Update Log

Step 1: Verify Code Functionality (Complete)

Step 2: Tier 1 Sliding Window Conversion (Complete)

Step 3: C Extension to Accelerate Tier 1 Run Detection (Complete)

Step 4: bwtandem vs TRF Comparison (Complete)

Step 5: Chr4 Execution (Complete)

Step 6: Tool Comparison (Complete)

Synthetic Sequence Comparison (44 ground truth repeats)

Chr4 Execution Comparison

Conclusions

Step 7: Tier 2 C Extension Acceleration (Complete)

Step 8: Comprehensive C Extension Acceleration (Complete)

Step 9: Satellite DNA Scanner for Centromere Detection (Complete)

FilesExpand file tree

code_update.md

Latest commit

History

code_update.md

File metadata and controls

Code Update Log

Step 1: Verify Code Functionality (Complete)

Step 2: Tier 1 Sliding Window Conversion (Complete)

Step 3: C Extension to Accelerate Tier 1 Run Detection (Complete)

Step 4: bwtandem vs TRF Comparison (Complete)

Step 5: Chr4 Execution (Complete)

Step 6: Tool Comparison (Complete)

Synthetic Sequence Comparison (44 ground truth repeats)

Chr4 Execution Comparison

Conclusions

Step 7: Tier 2 C Extension Acceleration (Complete)

Step 8: Comprehensive C Extension Acceleration (Complete)

Step 9: Satellite DNA Scanner for Centromere Detection (Complete)