Skip to content

giant leaps

Compare
Choose a tag to compare
@ekg ekg released this 23 Oct 10:58
· 17059 commits to master since this release

Notable improvements in this release:

  1. MSGA -- a kind of alignment-driven overlap-based assembler that can align large genomes; vg's answer to cactus.
  2. call -- variant calling on the graph
  3. index -- index huge graphs!
  4. map -- map against huge graphs!

Indexing improvements

Assuming you've started with a big graph (and that node ids 1000000000 and 1000000001 are outside of your node space), let's start by reducing the complexity of the graph so as to bound memory requirements during GCSA2 construction. We break the graph where we cross more than (approximately) 3 bifurcating edges in 16bp, and when fragments are less than 100bp in total sequence we remove them. This effectively masks out high-complexity regions of the graph. Despite the complexity reduction, the resulting system is sufficient for seeding global mapping:

vg mod -pl 16 -t 16 -e 3 $chr.vg \
    | vg mod -S -l 32 - >$chr.prune.vg
vg kmers -gB -k 16 -F -H 1000000000 -T 1000000001 -t 16 \
    $chr.prune.vg >$chr.graph

Now, we feed the resulting graphs, which binary formats for de Bruijn-like graphs, into GCSA2's build_gcsa tool:

build_gcsa -d 1 -o wg $((seq 1 22; echo X; echo Y) | parallel -j 1 "echo -n ' {}.graph'")

The resulting GCSA index is wg.gcsa.

Now, we build an xg index:

vg index -x wg.xg $((seq 1 22; echo X; echo Y) | parallel -j 1 "echo -n ' {}.vg'")

We can then drive our mapper using these indexes, and it only requires 20G of RAM!

zcat NA12878.x10.13502_7.fq.gz | head -4000000 | vg map -x wg.xg -g wg.k16e3l100d1.gcsa -if - -t 32 -k 27 -GFX 1.5 -T 50 -l 12 -S 5 >aln.gam

I'll generate a more-detailed writeup for the wiki, but hopefully this satisfies the very-curious.

MSGA

You can build graphs of sets of sequences by feeding them in in fasta format using vg msga.

vg msga -f ref.fa >ref.vg

Fun!