Skip to content

Latest commit

 

History

History
62 lines (43 loc) · 7.52 KB

File metadata and controls

62 lines (43 loc) · 7.52 KB

1. Tandem Repeats Finder (TRF)

  • Algorithm: TRF avoids full-scale alignment matrices (which are slow, scaling $$O(n^2)$$ ) by using a $$k$$-tuple matching algorithm. It scans the sequence for matching $$k$$-tuples separated by a common distance $$d$$, without needing prior knowledge of the repeat pattern or size.
  • Probabilistic Model: It treats the alignment of two tandem copies as a sequence of independent Bernoulli trials. Matches are considered "heads," and mutations/indels are considered "tails."
  • Statistical Filtering: Once $$k$$-tuples establish candidate regions, TRF uses statistical distributions (like the Random Walk Distribution and Waiting Time Distribution) to filter out false positives based on the expected behavior of true repeats.
  • Consensus Alignment: Finally, the candidates undergo wraparound dynamic programming (Smith-Waterman style) to generate a consensus sequence and an alignment score.

2. mreps

Developed in 2003 (Kolpakov et al.), mreps relies on combinatorial algorithms rather than probabilistic heuristics. It was built for speed and flexibility, resolving the problem of identifying both exact and "fuzzy" repeats in a single rapid pass.

  • Algorithm: It uses the Kolpakov-Kucherov algorithm, which processes the sequence in linear time, $$O(n)$$, to find exact maximal repetitions. This completely bypasses traditional alignment matrix computations.
  • Fuzzy Repeats: To handle biological mutations, it introduces a "resolution parameter." Instead of enforcing strict edit-distance limits, this parameter defines how much "error" (mismatches/indels) is tolerated before a contiguous run of repeats is technically considered broken.
  • Combinatorial Search: It identifies maximal repetitions (runs that cannot be extended to the left or right without losing the repeat property) and then merges them based on the resolution parameter to find approximate tandem repeats.

3. ULTRA (ULTRA Locates Tandemly Repetitive Areas)

Released in 2019, ULTRA is a modern, model-based tool designed to find highly degenerate (heavily mutated) tandem repeats that older tools like TRF might miss due to strict alignment scoring thresholds.

  • HMM Structure: ULTRA is built on a specialized Hidden Markov Model (HMM). The model consists of a single state for non-repetitive background sequence and a collection of context-sensitive repetitive states.
  • Context Sensitivity: In a repetitive state of period $$p$$, the probability of emitting a specific nucleotide depends heavily on the nucleotide observed $$p$$ positions prior.
  • Indel Handling: Modeling all possible complex insertions and deletions would make the HMM intractably slow. To maintain speed, ULTRA simplifies this by modeling only up to a set number of consecutive insertions and deletions.
  • Statistical Scoring: It outputs an empirically derived P-value for each region, avoiding arbitrary alignment score cutoffs and allowing for rigorous statistical evaluation of deeply decayed repeats.

4. NCRF (Noise-Cancelling Repeat Finder)

Developed specifically for the era of third-generation long-read sequencing (PacBio, Oxford Nanopore), NCRF (2019) targets raw reads with high, unequal error rates where insertions and deletions occur at vastly different frequencies.

  • Noise Cancellation: Traditional tools fail on raw long reads because they assume symmetric mismatch/indel penalties. NCRF explicitly models the asymmetric error profiles of long-read sequencers.
  • Algorithm: It operates by searching for user-specified tandem repeat motifs directly in the noisy reads. It utilizes a specialized dynamic programming alignment matrix that heavily penalizes mismatch types that are unlikely for the specific sequencing chemistry, while forgiving known systematic errors.
  • Output: It excels at capturing the full length of long tandem repeat arrays (like centromeric repeats) that would be artificially fragmented by tools like TRF.

5. tantan

Developed by Martin C. Frith (2011), tantan is an incredibly fast tool primarily designed for "masking" simple and tandem repeats to prevent false-positive homology predictions during sequence alignment, rather than generating detailed biological annotations of the repeats themselves.

  • Algorithm: Like ULTRA, tantan uses a Hidden Markov Model. However, instead of finding discrete boundaries of a repeat, it uses a Forward-Backward algorithm to calculate the absolute probability that any single given letter in a sequence is part of a repeat.
  • Gentle Masking: Instead of "hard masking" (replacing repeats with 'N's, which destroys data), tantan introduced the concept of "gentle masking." It lowers the alignment scoring matrix values for highly probable repetitive regions, allowing true homologous alignments to bridge across repeats without artificial inflation.
  • Sequence Composition: It is uniquely optimized to handle highly AT-rich genomes (like Plasmodium), which notoriously trigger false positives in traditional repeat finders due to naturally low sequence complexity.

Pros and Cons Table

Software Pros Cons
TRF Industry standard benchmark; no prior knowledge of motifs required; highly reliable on high-quality assemblies. Struggles with highly degenerate repeats; computationally expensive for very long reads; assumes uniform error rates.
mreps Extremely fast linear time complexity for exact repeats; resolution parameter allows flexible discovery of fuzzy repeats. Output can be fragmented; less sensitive to complex indels compared to HMM-based models; older codebase.
ULTRA Exceptional at finding highly degraded/mutated repeats; provides statistically rigorous P-values; HMM avoids arbitrary scoring matrices. Slower than combinatorial methods (like mreps) on massive datasets; uses simplistic indel modeling to save time.
NCRF Unmatched for raw, noisy long-read sequencing data (PacBio/ONT); mathematically accounts for asymmetric sequencing errors. Requires a specified motif (not de novodiscovery); not optimized for short, high-accuracy Illumina reads.
tantan Blazing fast; provides base-by-base probabilities; excellent for AT-rich genomes and pre-alignment masking. Does not extract consensus motifs or annotate repeat structures by default; indels can easily fragment its repeat calls.

Technical Feature Comparison

Feature TRF mreps ULTRA NCRF tantan Your BWT Tool (Theoretical)
Primary Algorithm $$k$$-tuple matching + Dynamic Programming Kolpakov-Kucherov (Combinatorial) Hidden Markov Model (HMM) Asymmetric Dynamic Programming Hidden Markov Model (Forward-Backward) Burrows-Wheeler Transform (BWT)
Time Complexity $$O(n \log n)$$heuristic $$O(n)$$ for exact repeats $$O(n \times \text{max\_period})$$ $$O(n \times m)$$ $$O(n \times \text{max\_period})$$ Typically $$O(n)$$ with FM-index
Requires Prior Motif? No (de novo) No (de novo) No (de novo) Yes (Targeted) No (de novo) No (de novo)
Best Use Case General purpose, high-quality reference genomes Rapid scanning of whole genomes Highly mutated/ancient repeat arrays Raw, uncorrected long-reads Fast repeat masking before homology search Fast, large-scale repeat discovery
Indel Tolerance Moderate (symmetric scoring) Moderate (via resolution parameter) High (up to consecutive indels) Extremely High (asymmetric) Low to Moderate (often breaks regions) Variable (depends on search depth)