Longdust identifies long highly repetitive STRs, VNTRs, satellite DNA and other low-complexity regions (LCRs) in a genome. It is motivated by SDUST and follows a similar rationale. Unlike SDUST, which is limited to short windows, longdust can find centromeric satellites and VNTRs with long repeat units.
Longdust also overlaps in functionality with tandem repeat finders (e.g. TRF, TANTAN and ULTRA). Nonetheless, it is not tuned for tandem repeats with two or three copies, and it may report low-complexity regions without clear tandem structure. Longdust complements TRF and similar tools to some extent.
If you use longdust, please cite:
Li H and Li B (2025) Finding low-complexity DNA sequences with longdust. arXiv:2509.07357
Longdust finds 277.1Mb of LCRs from the T2T-CHM13 analysis set. 226.5Mb of them overlap with satellites (plus ~5Mb flanking) annotated by the T2T consortium, 32.8Mb of the remainder (50.6Mb) overlap with TRF (2 7 7 80 10 50 500 -l12), and 14.7Mb of the rest (17.8Mb) with SDUST (-t30). Only 3.0Mb is left. Most longdust LCRs are found by other tools collectively.
In comparison, TRF finds 244.0Mb of TRs with four or more copies, and 97.9% of them are identified by longdust as well. In contrast, of the 30.6Mb of TRs with fewer than four copies, only 14.8% overlap with longdust LCRs: longdust is not tuned for TRs with low copy numbers by default. With 349Mb, TANTAN (-w500 -s.85) finds the most TRs. 70.3Mb of them do not overlap with the union of T2T satellite, TRF, SDUST and longdust. Even if we reduce the TRF score threshold from 50 to 30, 63.6Mb is still left. TANTAN seems to be finding distinct TRs.
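The overlap figures above come from intersecting and subtracting sets of intervals in sequence (LCRs vs. satellites, then the remainder vs. TRF, and so on). The following is a minimal Python sketch of that bookkeeping, not the procedure actually used for the numbers above. It assumes 0-based half-open intervals on a single sequence; the function names and the toy coordinates are hypothetical.

```python
def merge(intervals):
    """Merge overlapping or touching half-open intervals [start, end)."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(iv) for iv in merged]


def total_bp(intervals):
    """Number of bases covered by an interval set."""
    return sum(end - start for start, end in merge(intervals))


def overlap_bp(a, b):
    """Number of bases covered by both interval sets."""
    a, b = merge(a), merge(b)
    i = j = total = 0
    while i < len(a) and j < len(b):
        lo = max(a[i][0], b[j][0])
        hi = min(a[i][1], b[j][1])
        if lo < hi:
            total += hi - lo
        # advance whichever interval ends first
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return total


def subtract(a, b):
    """Parts of interval set a that are not covered by interval set b."""
    a, b = merge(a), merge(b)
    out = []
    j = 0
    for start, end in a:
        cur = start
        # skip intervals of b that end before this interval of a begins
        while j < len(b) and b[j][1] <= cur:
            j += 1
        k = j
        while k < len(b) and b[k][0] < end:
            if b[k][0] > cur:
                out.append((cur, b[k][0]))
            cur = max(cur, b[k][1])
            k += 1
        if cur < end:
            out.append((cur, end))
    return out


# Toy coordinates, for illustration only
lcr = [(0, 100), (200, 300)]
sat = [(50, 120)]
trf = [(0, 10), (250, 260)]

in_sat = overlap_bp(lcr, sat)      # 50 bases of LCRs overlap satellites
rest = subtract(lcr, sat)          # [(0, 50), (200, 300)]
in_trf = overlap_bp(rest, trf)     # 20 bases of the remainder overlap TRF
left = subtract(rest, trf)         # what neither annotation explains
print(in_sat, in_trf, total_bp(left))
```

Each step works on the remainder of the previous one, which is why the reported numbers add up along the chain rather than to the original total.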
On performance, longdust ran for 63 minutes. TANTAN was faster at ~35 minutes; SDUST was the fastest at only 4 minutes. They all used less than 1.5GB of memory. TRF took nearly 13 hours for T2T-CHM13. Setting "2 5 7 80 10 30 2000 -l20" took 19 hours and 12GB; reducing -l to 12 did not yield any output in 40 hours. TRF was much faster on GRCh38 due to the lack of long satellite arrays.
Let
Suppose we are working with one long genome string. We use closed interval
Given a sequence
SDUST scores the complexity of
It hardcodes
A sequence of length
Longdust scores complexity with
where
is calculated numerically. The first term in
Longdust identifies a good interval
In the code, longdust implements a few strategies to speed up the search without changing the output. It also uses a BLAST-like X-drop heuristic to break at long non-LCR intervals. Due to these heuristics, longdust may generate slightly different output on the reverse complement of the input sequence. To achieve strand symmetry like SDUST, longdust takes the union of intervals identified from both strands.
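The strand-symmetry step can be illustrated with a short Python sketch. This is not longdust's implementation: `toy_finder` is a deliberately strand-asymmetric stand-in for the real search (it only reports runs of four or more `A`), and the sketch assumes 0-based half-open coordinates rather than the closed intervals used in the description above. It shows how intervals found on the reverse complement are mapped back to forward coordinates and merged with the forward-strand intervals.

```python
import re

COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def revcomp(seq):
    """Reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def union(intervals):
    """Merge overlapping or touching half-open intervals [start, end)."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(iv) for iv in merged]


def strand_symmetric(seq, finder):
    """Run finder on both strands and return the union in forward coordinates."""
    n = len(seq)
    fwd = finder(seq)
    # [s, e) on the reverse complement corresponds to [n - e, n - s) on the forward strand
    rev = [(n - e, n - s) for s, e in finder(revcomp(seq))]
    return union(fwd + rev)


def toy_finder(seq):
    """Hypothetical strand-asymmetric finder: runs of four or more A."""
    return [m.span() for m in re.finditer(r"A{4,}", seq)]


seq = "AAAAACGTTTTTT"
print(toy_finder(seq))                    # [(0, 5)]: the T run is missed
print(strand_symmetric(seq, toy_finder))  # [(0, 5), (7, 13)]
```

By construction, running `strand_symmetric` on `seq` and on `revcomp(seq)` yields the same regions up to the coordinate flip, whatever the underlying finder does.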