BWT-based Tandem Repeat Finder

How BWT Finds Tandem Repeats

Project Overview

BWT Tandem Repeat Finder is a tool for detecting tandem repeats in genomic FASTA files. It uses the Burrows-Wheeler Transform (BWT) and FM-index as its core algorithms, and comprehensively detects everything from short perfect repeats (STR/microsatellites) to ultra-long repeats spanning hundreds of kilobases through a 3-tier pipeline.

Why is this tool needed?

Tandem repeats play important roles in genome instability, genetic disorders, and evolutionary research. Existing tools (such as TRF) have limitations in detecting long repeats or imperfect repeats. This tool has the following characteristics:

FM-index-based fast search: Efficiently processes large chromosome sequences using suffix arrays and BWT
3-Tier pipeline: Achieves both detection rate and speed through a 3-stage detection structure optimized for each repeat type
Multiple output formats: Supports BED, VCF, TRF .dat, and STRfinder .csv
Adaptive parameters: Automatically adjusts Tier 3 parameters based on sequence length, GC content, and coverage
Optional acceleration: Enables acceleration of core operations through Cython extensions and Numba JIT compilation

Installation

Micromamba Environment Setup (Recommended)

Using Micromamba, you can manage all dependencies in an isolated environment.

# 1. Create environment (Python 3.11 + main dependencies)
micromamba create -n bwtandem python=3.11 numpy cython pytest numba setuptools -c conda-forge -y

# 2. Install pydivsufsort (not on conda-forge, so install via pip; --no-build-isolation required)
micromamba run -n bwtandem pip install pydivsufsort --no-build-isolation

# 3. Build Cython extensions (requires -std=c99 on GCC 4.x environments)
micromamba run -n bwtandem python3 -c "
from setuptools import setup, Extension
from Cython.Build import cythonize
import numpy as np
ext_modules = [Extension('src._accelerators', ['src/_accelerators.pyx'],
               include_dirs=[np.get_include()], extra_compile_args=['-std=c99'])]
setup(script_args=['build_ext', '--inplace'],
      ext_modules=cythonize(ext_modules, compiler_directives={'language_level': '3'}))
"

# 4. Run tests
micromamba run -n bwtandem python3 -m pytest tests/ -v

# 5. Run the tool
micromamba run -n bwtandem python3 -m src.main input.fa --format bed -v

Direct Python Dependency Installation

You can also install dependencies directly without Micromamba.

# Required dependencies
pip install numpy pydivsufsort

# Optional dependencies (performance improvement)
pip install numba Cython

Package	Type	Description
`numpy`	Required	Array operations
`pydivsufsort`	Required (recommended)	Fast suffix array construction; falls back to NumPy prefix-doubling if unavailable
`numba`	Optional	JIT acceleration for rank queries and LCP computation
`Cython`	Optional	Compiling `_accelerators.pyx` to accelerate critical paths

Building Cython Extensions

_accelerators.pyx provides performance-critical paths including Hamming distance computation, mismatch extension, LCP candidate detection, and DP alignment. Without compilation, pure-Python fallbacks are used, and some code paths return empty results.

# Run from the project root
python3 -c "
from setuptools import setup, Extension
from Cython.Build import cythonize
import numpy as np
ext_modules = [Extension('src._accelerators', ['src/_accelerators.pyx'], include_dirs=[np.get_include()])]
setup(script_args=['build_ext', '--inplace'], ext_modules=cythonize(ext_modules, compiler_directives={'language_level': '3'}))
"

On successful build, a src/_accelerators.*.so file will be generated.

Singularity Container

You can build a container that includes all dependencies and Cython extensions.

# Build container (requires admin privileges)
sudo singularity build bwtandem.sif Singularity

# Run with the container
singularity exec bwtandem.sif python3 -m src.main input.fa --format bed

The container image is based on Ubuntu 22.04 with numpy, numba, pydivsufsort, and Cython pre-installed.

Quick Start

# Most basic execution (defaults: all Tiers, BED output)
python3 -m src.main input.fa

# View results
cat input.bed

The above command generates an input.bed file. The BED format includes columns for chrom, start, end, motif, copies, tier, mismatch_rate, and strand.

CLI Options in Detail

python3 -m src.main <fasta_file> [options]

Option	Default	Description
`fasta_file`	(required)	Path to input FASTA file
`--min-period INT`	`1`	Minimum repeat unit length to detect (bp)
`--max-period INT`	`2000`	Maximum repeat unit length to detect (bp)
`--min-array-bp INT`	None	Minimum repeat array length filter for results (bp)
`--max-array-bp INT`	None	Maximum repeat array length filter for results (bp)
`--tiers TIERS`	`tier1,tier2,tier3`	List of Tiers to run (comma-separated, or `all`)
`--tier3-mode MODE`	`balanced`	Tier 3 speed/accuracy preset: `fast`, `balanced`, `sensitive`
`--format FORMAT`	`bed`	Output format: `bed`, `vcf`, `trf`, `strfinder`
`-o, --output PREFIX`	Input filename	Output file prefix
`-t, --threads INT`	`1`	Number of parallel processes (parallel processing per chromosome)
`--mask MODE`	`none`	Masking mode: `none`, `soft`, `hard`, `both`
`-v, --verbose`	`False`	Print progress information
`--profile`	`False`	Profile execution time with cProfile and output top hotspots

Detailed Option Descriptions

--min-period / --max-period Specifies the length range of repeat units (motifs). Tier 1 handles 1-9 bp, Tier 2 handles 10 bp and above, and Tier 3 handles 100 bp and above, so setting --max-period 100 will limit the effectiveness of Tier 3.

--min-array-bp / --max-array-bp Filters results based on the total repeat array length (repeat unit x copy number). Example: --min-array-bp 30 removes short arrays less than 30 bp.

--tiers Used to run only specific Tiers. You can combine tier1, tier2, tier3, or all.

--tier3-mode

fast: Large k-mer size, wide stride for fast scanning (reduced sensitivity)
balanced: Default, balances speed and accuracy
sensitive: Small k-mer size, narrow stride for detailed scanning (reduced speed)

-t, --threads Processes multiple chromosomes simultaneously using multiple processes. Since each chromosome independently builds a BWT index and detects repeat sequences, linear speedup can be expected with multi-chromosome FASTA files. Has no effect on single-chromosome files.

# Parallel processing with 4 processes
python3 -m src.main genome.fa -t 4 -v

# Set to match the number of CPU cores (using nproc)
python3 -m src.main genome.fa -t $(nproc) -v

--mask Specifies how to handle soft mask (lowercase) and hard mask (N) regions in genomic FASTA files.

Mode	Lowercase (acgt)	N characters	Description
`none`	Convert to uppercase	Keep	Default: ignore masks, analyze all sequences
`soft`	Replace with N	Keep	Skip soft-masked regions (repetitive elements, etc.)
`hard`	Convert to uppercase	Keep	Skip only hard-masked regions (gaps, unspecified regions)
`both`	Replace with N	Keep	Skip both soft and hard masked regions

Soft masking follows the convention where tools like RepeatMasker mark repetitive elements (LINE, SINE, transposons, etc.) in lowercase. Use --mask soft when you want to analyze only tandem repeats while excluding known interspersed repeats.

# Analyze excluding soft-masked regions
python3 -m src.main masked_genome.fa --mask soft -v

# Exclude both soft and hard masked regions
python3 -m src.main masked_genome.fa --mask both -v

3-Tier Detection Pipeline

The TandemRepeatFinder coordinator builds a BWTCore FM-index once per chromosome, then runs the enabled Tiers sequentially. Each Tier receives information about regions already detected by previous Tiers to avoid redundant work.

Input FASTA
    |
    v
BWTCore FM-index construction (suffix array, BWT, occurrence array)
    |
    +---> Tier 1: Short perfect repeats (1-9 bp)
    |       |
    +---> Tier 2: Medium/imperfect repeats (>=10 bp)
    |       |   [Excluding Tier 1 result regions]
    |
    +---> Tier 3: Long repeats (100 bp - 100 kbp)
    |       [Excluding Tier 1 + Tier 2 result regions]
    |
    v
Post-processing: Sort -> Merge adjacent -> Filter overlaps -> Filter by length
    |
    v
Output (BED / VCF / TRF / STRfinder)

Tier 1: Short Perfect Repeats (1-9 bp)

Coverage: Motif length 1-9 bp, perfect (or near-perfect) repeats with 3 or more copies.

How it works:

Sequence length < 10 Mbp: Enumerates all canonical motifs via FM-index backward search and finds tandem repeat positions.
Sequence length >= 10 Mbp: Switches to a sliding window scanner with adaptive step size.

This is the primary detection tier for STRs (Short Tandem Repeats, microsatellites). It performs O(1) short k-mer lookup using an 8-mer hash.

Tier 2: Medium/Imperfect Repeats (>=10 bp)

Coverage: Repeats with motif length 10 bp or longer. Detects minisatellites and medium-length imperfect repeats.

Two sub-phases:

Long-unit strict (unit >=20 bp): Computes periods from adjacent suffix pairs in the LCP array built using Kasai's algorithm, and extends with mismatch tolerance.
General scanning (unit 10-50 bp): Uses BWT k-mer seed scanning (bwt_seed.py) to detect periodic runs (arithmetic progressions) in FM-index occurrence positions and extends candidates.

Allows up to 20% mismatches and 10% indels.

Tier 3: Long Repeats (100 bp - 100 kbp)

Coverage: Repeat units 100 bp-100,000 bp. Detects satellite DNA, centromeric repeats, transposon-derived repeats, etc.

How it works:

BWT k-mer seed scanning (large k-mers, sparse stride=100 by default)
Detects arithmetic progressions in occurrence positions, then determines candidate regions
Ultra-long arrays (>100 copies or >10 kb): Saves full DP alignment costs via anchor-based boundary verification (anchor_scan_boundaries)
Regular arrays: Determines precise boundaries via full DP alignment (refine_repeat)

Adaptive parameters: Automatically computes k-mer size, stride, max_occurrences, etc., based on sequence length, GC content, and previously-detected coverage ratio (see section below).

Output Formats

BED Format (Default)

Tab-delimited 8-column format. Coordinates are 0-based (BED standard).

chr1    100    145    AT    22.5    1    0.022    +
chr1    500    620    AATGG    24.0    2    0.083    +
chr1    1000   5400   ATCGATCG    550.0    3    0.150    -

Column	Description
1 chrom	Chromosome name
2 start	Start position (0-based)
3 end	End position
4 motif	Repeat unit sequence
5 copies	Copy number
6 tier	Detection Tier (1/2/3)
7 mismatch_rate	Mismatch rate (0.0-1.0)
8 strand	Strand (+/-)

VCF Format

VCFv4.2 format. Coordinates are 1-based (VCF standard). INFO field includes MOTIF, COPIES, TIER, etc.

##fileformat=VCFv4.2
##INFO=<ID=END,Number=1,Type=Integer,Description="End position of the repeat">
#CHROM  POS   ID  REF  ALT   QUAL  FILTER  INFO
chr1    101   .   A    <STR>  .     .       END=145;MOTIF=AT;CONS_MOTIF=AT;COPIES=22.5;TIER=1;CONF=0.98;MM_RATE=0.022;...

TRF .dat Format

A format compatible with TRF (Tandem Repeats Finder). Space-delimited.

100 145 2 22.5 2 97 0 44 22 39 30 9 1.85 AT ATATATATATATATATATATAT...

Column order: Start End Period CopyNumber ConsensusSize PercentMatches PercentIndels Score A C G T Entropy ConsensusPattern Sequence

STRfinder .csv Format

A CSV format compatible with the STRfinder tool. Coordinates are 1-based.

STR_marker,STR_position,STR_motif,STR_genotype_structure,STR_genotype,STR_core_seq,Allele_coverage,Alleles_ratio,Reads_Distribution,STR_depth,Full_seq,Variations
STR_chr1_100,chr1:101-145,[AT]n,2[AT]22,22,ATATATATAT...,97%,-,22:22,22,ATATATATAT...,-

Tier 3 Adaptive Parameters

Tier 3 automatically adjusts search parameters based on input characteristics through the compute_adaptive_params() function.

Relationship Between Input Characteristics and Parameters

Input Characteristic	Affected Parameters	Effect
Sequence length (seq_len)	`kmer_size`, `stride`, `max_occurrences`, `scan_backward`, `scan_forward`	Longer sequence = larger k-mer, larger stride
GC content (gc_content)	`allowed_mismatch_rate`	Farther from 0.5 = higher mismatch tolerance
Previously-detected coverage ratio (coverage_ratio)	`stride`, `anchor_match_pct`	Higher coverage = smaller stride (more detailed)
Maximum period (max_period)	`tolerance_ratio`	Longer period = greater tolerance

Parameter Details

Parameter	Range	Description
`kmer_size`	12-28	FM-index seed k-mer length. Longer = higher specificity
`stride`	20-300	k-mer sampling interval. Larger = faster but lower sensitivity
`allowed_mismatch_rate`	0.15-0.20	Allowed mismatch ratio
`tolerance_ratio`	0.02-0.04	Period estimation error tolerance ratio
`max_occurrences`	200-1500	Maximum k-mer occurrence count (excludes frequent k-mers)
`anchor_match_pct`	0.70-0.80	Minimum match percentage for anchor-based boundary verification
`scan_backward`	20-80	Number of periods to scan backward from anchor
`scan_forward`	200-800	Number of periods to scan forward from anchor

Preset Descriptions

The --tier3-mode option proportionally adjusts speed-related parameters through speed_factor.

Preset	speed_weight	Characteristics
`fast`	0.8	Increased k-mer size, increased stride, decreased max_occurrences. Suitable for fast screening of large genomes
`balanced`	0.5	Default. Balances speed and sensitivity
`sensitive`	0.2	Decreased k-mer size, decreased stride, increased max_occurrences. Suitable for short sequences or detailed analysis

Special Handling by Sequence Length

Over 100 Mbp (large chromosome mode): Forced adjustment to stride >= 150, kmer_size >= 20, max_occurrences <= 500
Under 100 kbp (small sequence mode): Minimum sensitivity guaranteed with stride >= 20, kmer_size >= 12

Usage Examples

Example 1: Basic Execution

# Run all Tiers, BED output (default)
python3 -m src.main arabadopsis_chrs/chr1.fa -v

# Output: chr1.bed

Example 2: Running Specific Tiers Only

# Run Tier 1 only (STR/microsatellites only)
python3 -m src.main input.fa --tiers tier1 --format bed -o str_only -v

# Run Tier 1 and Tier 2 only, VCF output
python3 -m src.main input.fa --tiers tier1,tier2 --format vcf -o output -v

# Limit period range (1-50 bp only)
python3 -m src.main input.fa --tiers tier1,tier2 --min-period 1 --max-period 50 -o short_repeats

Example 3: Processing Large Genomes with Tier 3 Fast Mode

# Fast search for long repeats only in large genomes
python3 -m src.main large_genome.fa \
    --tiers tier3 \
    --tier3-mode fast \
    --min-period 100 \
    --min-array-bp 500 \
    --format bed \
    -o long_repeats \
    -v

# Detailed analysis of small sequences with sensitive mode
python3 -m src.main small_region.fa \
    --tier3-mode sensitive \
    --format trf \
    -o detailed_analysis

Example 4: Multi-threading and Masking

# Parallel processing of entire genome with 4 processes
python3 -m src.main genome.fa -t 4 --format bed -o genome_repeats -v

# Analyze excluding interspersed repeats from RepeatMasker output (soft mask)
python3 -m src.main masked_genome.fa --mask soft -t 8 -v

# Analyze excluding only hard-masked regions (N regions)
python3 -m src.main genome.fa --mask hard -v

# Exclude both soft and hard masked regions
python3 -m src.main masked_genome.fa --mask both -t $(nproc) -v

Example 5: Performance Analysis with Profiling

# Execution time profiling (outputs top 20 hotspots)
python3 -m src.main input.fa --tiers tier1 --format trf --profile -v

# Profile results are also saved to input.tier2_profile.prof
# Additional analysis possible with: python -m pstats input.tier2_profile.prof

Running Tests

# Run all tests (when using micromamba environment)
micromamba run -n bwtandem python3 -m pytest tests/ -v

# Or activate the environment and run directly
micromamba activate bwtandem
pytest tests/ -v

# Run specific test files
pytest tests/test_adaptive_params.py -v   # Tier 3 adaptive parameters
pytest tests/test_anchor_scan.py -v       # Anchor-based boundary scanning
pytest tests/test_tier3_wiring.py -v      # Tier 3 integration wiring
pytest tests/test_ground_truth.py -v -s   # Regression tests (detailed output)

Test Composition (29 tests + stress tests)

Test File	Test Count	Description	Cython Required
`test_adaptive_params.py`	10	Unit tests for `compute_adaptive_params()`: per-preset behavior, GC/coverage impact, sequence length mode switching	No
`test_anchor_scan.py`	5	Unit tests for `anchor_scan_boundaries()`: perfect repeats, flanking sequences, imperfect repeats, single copy	No
`test_tier3_wiring.py`	3	Tier 3 mode parameter passing verification: initialization, defaults, seed scan wiring	No
`test_ground_truth.py`	11	Regression tests with synthetic sequences: per-Tier sensitivity/precision verification	Tier 2/3 tests only
`test_random_stress.py`	--	Stress test with 30 random sequences (run with `python3 -m tests.test_random_stress`)	Yes

Regression Tests (Ground Truth)

The tests/fixtures/ directory contains 100KB synthetic sequences (FASTA) and ground truth files (BED). Each sequence has repeats of known motifs inserted at known positions, and the tests automatically compute sensitivity and precision by comparing the tool's detection results against the ground truth.

Synthetic Test Data

File	Repeat Count	Description
`synth_tier1.fa` / `synth_tier1_truth.bed`	8	Tier 1 repeats: 1-6 bp motifs, perfect to 5% mismatch
`synth_tier2.fa` / `synth_tier2_truth.bed`	8	Tier 2 repeats: 12-50 bp motifs, perfect to 8% mismatch+indels
`synth_tier3.fa` / `synth_tier3_truth.bed`	8	Tier 3 repeats: 100-1000 bp motifs, perfect to 10% divergence
`synth_mixed.fa` / `synth_mixed_truth.bed`	9	All Tiers mixed (Tier 1/2/3 repeats coexisting in a single sequence)
`synth_adjacent.fa` / `synth_adjacent_truth.bed`	11	Edge cases: adjacent repeats, touching repeats, compound repeats

Synthetic data regeneration:

python3 tests/fixtures/generate_synthetic.py
# Reproducible with seed=42

Matching Logic

The following criteria are applied when matching ground truth to predictions:

Overlap ratio >= 50%: Overlap length between two intervals / length of the larger interval
Motif compatibility: Canonical motifs match, or primitive period lengths are compatible (integer multiple or within +/-20%)

For imperfect repeats, the tool may report a consensus motif different from the original motif, so period compatibility checks prevent correctly detected repeats from being classified as false negatives.

Test Results (2026-03-30)

Regression Tests (Synthetic Sequences, 11 tests)

Test	Sensitivity	Precision	F1	Threshold
Tier 1	100.0%	100.0%	100.0%	Sensitivity >= 95%, Precision >= 90%
Tier 2	100.0%	100.0%	100.0%	Sensitivity >= 90%, Precision >= 90%
Tier 3	100.0%	100.0%	100.0%	Sensitivity >= 90%, Precision >= 90%
Mixed (Tier 1)	100.0%	--	--	Sensitivity >= 95%
Mixed (Tier 2)	100.0%	--	--	Sensitivity >= 70%
Mixed (Tier 3)	100.0%	--	--	Sensitivity >= 70%
Adjacent	100.0%	100.0%	100.0%	Sensitivity >= 95%, Precision >= 90%

Stress Tests (30 Random Sequences, 108 repeats)

Item	Sensitivity	Precision	F1	Ground Truth Count
Overall	100.0%	93.1%	96.4%	108
Tier 1	100.0%	--	--	54
Tier 2	100.0%	--	--	31
Tier 3	100.0%	--	--	23

The stress test validates the robustness of the detector by inserting 2-5 repeats into 30 random 50KB sequences. It uses reproducible seeds (1000-1029).

Note: Tier 2/3/Mixed/Adjacent tests require the Cython extension (_accelerators.so) to be built. If run without Cython, those tests are automatically skipped.

Test Data

The arabadopsis_chrs/ directory contains Arabidopsis chromosome FASTA files and small test sequences (test_seq1.fa through test_seq5.fa).

# Quick functionality check with test sequences
python3 -m src.main arabadopsis_chrs/test_seq1.fa -v

Utility Scripts

# Generate mutated sequences for mismatch tolerance testing
python3 scripts/mutate_fasta.py input.fa --mutation-rate 0.05
# Creates input.fa.bak backup file

# Convert TRF results to BED format
python3 scripts/trf_to_bed.py input.dat -o output.bed

Project Structure

bwtandem/
├── src/
│   ├── main.py             # CLI entry point, FASTA parsing, output writing
│   ├── finder.py           # TandemRepeatFinder: 3-Tier pipeline coordinator
│   ├── bwt_core.py         # BWTCore: FM-index construction (suffix array, BWT, occurrence array)
│   ├── bwt_seed.py         # Shared BWT k-mer seed scanning (used by Tier 2/3)
│   ├── tier1.py            # Tier1STRFinder: Short perfect repeats (1-9 bp)
│   ├── tier2.py            # Tier2LCPFinder: Medium/imperfect repeats (>=10 bp)
│   ├── tier3.py            # Tier3LongReadFinder: Long repeats + adaptive parameters
│   ├── motif_utils.py      # MotifUtils: Canonical motifs, primitive period detection, DP alignment, statistics
│   ├── models.py           # Data classes: TandemRepeat, RefinedRepeat, etc. + output formatters
│   ├── accelerators.py     # Transparent Cython extension loader (falls back to Python if unavailable)
│   ├── _accelerators.pyx   # Cython source: Hamming distance, LCP detection, DP alignment, etc.
│   └── utils.py            # Common utilities
├── tests/
│   ├── test_adaptive_params.py  # Tier 3 adaptive parameter unit tests (10)
│   ├── test_anchor_scan.py      # Anchor-based boundary verification tests (5)
│   ├── test_tier3_wiring.py     # Tier 3 integration wiring tests (3)
│   ├── test_ground_truth.py     # Synthetic sequence regression tests (11)
│   ├── test_random_stress.py    # Stress test with 30 random sequences
│   └── fixtures/
│         ├── generate_synthetic.py    # Synthetic data generation script (seed=42)
│         ├── synth_tier1.fa           # Tier 1 synthetic sequence (100KB)
│         ├── synth_tier1_truth.bed    # Tier 1 ground truth BED
│         ├── synth_tier2.fa           # Tier 2 synthetic sequence (100KB)
│         ├── synth_tier2_truth.bed    # Tier 2 ground truth BED
│         ├── synth_tier3.fa           # Tier 3 synthetic sequence (100KB)
│         ├── synth_tier3_truth.bed    # Tier 3 ground truth BED
│         ├── synth_mixed.fa           # Mixed Tier synthetic sequence (100KB)
│         ├── synth_mixed_truth.bed    # Mixed Tier ground truth BED
│         ├── synth_adjacent.fa        # Adjacent repeat edge cases (100KB)
│         └── synth_adjacent_truth.bed # Adjacent repeat ground truth BED
├── scripts/
│   ├── mutate_fasta.py     # Random point mutation introduction (for mismatch tolerance testing)
│   ├── run_trf.py          # TRF execution wrapper
│   └── trf_to_bed.py       # TRF .dat to BED converter
├── arabadopsis_chrs/       # Arabidopsis test data (FASTA)
├── docs/                   # Additional documentation
├── results/                # Analysis results directory
├── Singularity             # Singularity container definition file
└── CLAUDE.md               # Development guide

Core Module Dependencies

main.py
  └── finder.py (TandemRepeatFinder)
        ├── bwt_core.py (BWTCore)         <- FM-index shared by all Tiers
        ├── tier1.py
        ├── tier2.py
        │     └── bwt_seed.py
        └── tier3.py
              ├── bwt_seed.py             <- Shared with Tier 2
              └── accelerators.py         <- Cython acceleration (optional)

Benchmarks

Synthetic Sequence Accuracy (44 ground truth repeats)

Comparison on 5 synthetic test sequences containing 44 planted repeats with known positions and motifs (periods 1 bp–1000 bp, including adjacent/edge cases):

Tool	Sensitivity	Precision	F1
bwtandem	100.0%	100.0%	100.0%
TRF 4.10	97.7%	100.0%	98.9%
ULTRA 1.2.1	72.7%	71.1%	71.9%
mreps 2.6	68.2%	7.6%	13.6%

bwtandem: Detected all 44 repeats with zero false positives, including all 11 adjacent/edge-case repeats
TRF: High precision but missed 1 adjacent repeat (97.7% sensitivity)
ULTRA: Good at short repeats (Tier 1/2: 100%) but weak on long repeats (Tier 3: 12.5%, max period=100 by default)
mreps: High over-detection (366 false positives across 5 sequences), low sensitivity on long repeats (25%)

Arabidopsis Chr4 (18.5 Mbp)

Tool	Repeats Found	Runtime
bwtandem	4,826	3 min 32 sec
TRF	4,549	34 sec
mreps	84,502	48 sec
ULTRA	23,145	24 min 11 sec

Key Observations

bwtandem achieves the highest sensitivity and precision (100%/100% on synthetic data) and detects 6% more repeats than TRF on Chr4
TRF is the fastest tool (34 sec) with high precision
mreps reports 84K results (heavy over-detection, filtering required)
ULTRA is the slowest (24 min) with 4.8x more results than bwtandem

Centromere Satellite DNA Detection (ColCEN Assembly)

Benchmark on the Arabidopsis Col-CEN genome assembly using CEN180 satellite repeat annotations (GFF3) as ground truth. CEN180 is a ~178 bp satellite repeat constituting centromeric regions, with 20-30% inter-copy divergence.

CEN180 Annotated Unit Coverage (per-base coverage of individual annotated CEN180 units):

Tool	Chr1	Chr2	Chr3	Chr4	Chr5	Overall
bwtandem	95.5%	99.7%	99.4%	98.0%	98.6%	98.2%
mreps	33.6%	34.0%	29.9%	32.1%	24.4%	30.9%
ULTRA	0.0%	0.2%	0.1%	0.2%	0.1%	0.1%
TRF	N/A	N/A	N/A	N/A	N/A	N/A

Key Findings

bwtandem is the only tool that effectively detects CEN180 satellite repeats, achieving 98.2% per-base coverage of GFF3-annotated CEN180 units across all 5 chromosomes
TRF fails to complete on the ColCEN assembly — it hangs on centromeric sequences because its sliding-window DP alignment exhaustively tries all period/alignment combinations across highly repetitive regions, causing combinatorial explosion. bwtandem avoids this by using BWT/FM-index to target only positions where repeats actually exist, rather than scanning every window
Other tools fail because CEN180 has ~25% inter-copy divergence, exceeding their mismatch tolerance thresholds
The remaining ~1.8% uncovered CEN180 units are dispersed copies with autocorrelation at random level (~0.30), lacking detectable tandem repeat periodicity at the sequence level
bwtandem's satellite DNA scanner uses autocorrelation-based period detection that tolerates high divergence
Note: CEN180 units occupy 46–71% of the centromere span (the rest is retrotransposons and other non-repetitive DNA), so centromere-wide coverage metrics can be misleading

Telomere Repeat Detection (ColCEN Assembly)

bwtandem's Tier 1 (short tandem repeats, periods 1–9 bp) automatically detects telomeric repeats (AAACCCT, 7 bp period) at chromosome ends and interstitial telomeric sequences.

Telomere Detection Summary (5 nuclear chromosomes):

Chr	Telomere Repeats	Period 7	Period 8	5' End	3' End
Chr1	421	421 (100%)	0	✓	✓
Chr2	9	9 (100%)	0	— (rDNA)	✓
Chr3	45	42 (93%)	3	✓	✓
Chr4	112	108 (96%)	2	— (rDNA)	✓
Chr5	48	45 (94%)	3	✓	✓

All detected motifs are rotations of the canonical telomere repeat AAACCCT: CCTAAAC, AACCCTA, CCCTAAA, TTAGGGT, GGTTTAG, GGGTTTA, etc.
Chr1 has 421 telomeric repeats, mostly interstitial telomeric sequences (ITS) within the centromere region (17.1M–17.6M), reflecting retrotransposon-mediated telomere insertion into centromeric DNA
Chr2 and Chr4 start with rDNA (not telomeric sequence) in the ColCEN assembly
Mismatch rates in the BED output indicate SNPs relative to the consensus motif (e.g., 3.7% = ~9 SNPs per 251 bp region)

Cross-Species Validation: Oropetium thomaeum (245 Mb)

To validate bwtandem beyond Arabidopsis, we ran it on the Oropetium thomaeum v1.0 genome assembly (Phytozome V12, 243 Mb, 625 contigs). The published analysis (VanBuren et al., 2015) used TRF with parameters 1 1 2 80 5 200 2000 -d -h and identified 155 bp centromeric repeats (including 310 bp dimers and 465 bp trimers) and 18 telomeric sequences with the monomer AAACCCT.

bwtandem results (8 threads, all tiers, 11 min 30 sec):

Feature	Published (TRF)	bwtandem	Match
Total repeats	Not reported	56,854	—
Telomere (AAACCCT, 7 bp)	18 sequences	362 repeats / 18 contigs	Exact match
Centromere (155 bp monomer)	Detected	357 repeats across 53 contigs	Detected
Centromere (310 bp dimer)	Detected	13 repeats	Detected
Centromere (465 bp trimer)	Detected	3 repeats	Detected

Key findings:

Telomere: bwtandem found AAACCCT telomeric repeats in exactly 18 contigs, matching the published count. These are located at contig ends, consistent with telomeric positions.
Centromere: The 155 bp base centromeric repeat was detected across 53 contigs, with the largest arrays on contig 003 (up to 65.7 copies, ~10 kb arrays in the 6.0–6.3 Mb region), consistent with the paper's description of contig 003 as one of the three largest centromeric arrays (>400 kb).
Higher-order repeats: 310 bp dimers and 465 bp trimers were also detected, confirming bwtandem's ability to identify higher-order repeat organization.
Runtime: 11.5 minutes for a 243 Mb draft genome with 625 contigs — comparable to the Arabidopsis Col-CEN (132 Mb) benchmarks when accounting for genome size.

This demonstrates that bwtandem generalizes beyond Arabidopsis to other plant genomes with different centromeric repeat families and assembly qualities.

Tool Overlap (Chr4 Venn Diagram)

Region overlap analysis using 500bp genomic bins on Arabidopsis Chr4:

Comparison	Shared Bins	bwtandem-only	Other-only
bwtandem vs TRF	2,879	2,473	990
bwtandem vs ULTRA	4,663	689	12,503
bwtandem vs mreps	3,288	2,064	12,435
All 4 tools	1,891	—	—

bwtandem shares 74% of TRF's detected regions while finding 2,473 additional regions that TRF misses. ULTRA and mreps each detect many regions not found by other tools, likely due to different sensitivity/specificity trade-offs.

Design Principles

Internal coordinates: 0-based. Converted to 1-based for VCF and STRfinder output
Motif canonicalization: Uses the lexicographically smallest rotation among all cyclic rotations of both strands (forward + reverse complement) as the canonical motif
Primitive period reduction: refine_repeat() always reduces to the primitive period (e.g., ATAT -> AT). Performs both exact and approximate (<=2% error) periodicity tests
Sentinel character: $ is appended to the end of sequences for BWT construction and is excluded from repeat detection

Name		Name	Last commit message	Last commit date
Latest commit History 63 Commits
.vscode		.vscode
arabadopsis_chrs		arabadopsis_chrs
figures		figures
results		results
scripts		scripts
src		src
tests		tests
.gitattributes		.gitattributes
.gitignore		.gitignore
CLAUDE.md		CLAUDE.md
README.md		README.md
Singularity		Singularity
code_update.md		code_update.md
environment.yml		environment.yml
setup.md		setup.md
software_comparison.md		software_comparison.md

Folders and files

Latest commit

History

Repository files navigation

BWT-based Tandem Repeat Finder

How BWT Finds Tandem Repeats

Project Overview

Why is this tool needed?

Installation

Micromamba Environment Setup (Recommended)

Direct Python Dependency Installation

Building Cython Extensions

Singularity Container

Quick Start

CLI Options in Detail

Detailed Option Descriptions

3-Tier Detection Pipeline

Tier 1: Short Perfect Repeats (1-9 bp)

Tier 2: Medium/Imperfect Repeats (>=10 bp)

Tier 3: Long Repeats (100 bp - 100 kbp)

Output Formats

BED Format (Default)

VCF Format

TRF .dat Format

STRfinder .csv Format

Tier 3 Adaptive Parameters

Relationship Between Input Characteristics and Parameters

Parameter Details

Preset Descriptions

Special Handling by Sequence Length

Usage Examples

Example 1: Basic Execution

Example 2: Running Specific Tiers Only

Example 3: Processing Large Genomes with Tier 3 Fast Mode

Example 4: Multi-threading and Masking

Example 5: Performance Analysis with Profiling

Running Tests

Test Composition (29 tests + stress tests)

Regression Tests (Ground Truth)

Synthetic Test Data

Matching Logic

Test Results (2026-03-30)

Test Data

Utility Scripts

Project Structure

Core Module Dependencies

Benchmarks

Synthetic Sequence Accuracy (44 ground truth repeats)

Arabidopsis Chr4 (18.5 Mbp)

Key Observations

Centromere Satellite DNA Detection (ColCEN Assembly)

Key Findings

Telomere Repeat Detection (ColCEN Assembly)

Cross-Species Validation: Oropetium thomaeum (245 Mb)

Tool Overlap (Chr4 Venn Diagram)

Design Principles

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages