✩✭✮✫✯✨
Since v1.0.0:
🖬 111606B of patches.
- multi mapping (thanks to @adamnovak )
- gcsa2 integration (nod to @jltsiren for helping me debug my many buggy attempts)
vg
is now capable of generating GCSA and xg indexes of graphs! By default it will index the forward and reverse complement of any variation graph you give it, and allow you to construct queries of up to . At present, you can query the indexes using vg find
, but stay tuned because mapping using the new indexes is top priority for v1.2.0.
You can try it out with this executable and the tarball (or your own copy of the repo):
➜ test git:(master) ✗ time vg construct -r 1mb1kgp/z.fa -v 1mb1kgp/z.vcf.gz -m 50 >z.vg -p
loading variants for z [==============================================]100.0%
parsing variants [==============================================]100.0%
enforcing node size limit [==============================================]100.0%
planning construction [==============================================]100.0%
constructing graph [==============================================]100.0%
merging remaining graphs [==============================================]100.0%
joining graphs [==============================================]100.0%
topologically sorting [==============================================]100.0%
compacting ids [==============================================]100.0%
saving graph [==============================================]100.0%
vg construct -r 1mb1kgp/z.fa -v 1mb1kgp/z.vcf.gz -m 50 -p > z.vg 5.99s user 0.32s system 123% cpu 5.114 total
And then we'll index it with xg and gcsa2, using an initial kmer size of 16 and 3 doubling steps, which will yield a GCSA index that lets us find the positions of sequences of up to 16 * 2^3 = 128
in the graph. Meanwhile the xg
index provides near-optimal query performance over elements of the graph (nodes, edges, paths, and path-relative positions) using minimal space.
➜ test git:(master) ✗ time vg index -x z.vg.xg -d z.vg.gcsa -g -k 16 -X 3 -V z.vg
GCSA::prefixDoubling(): Initial path length 16
mergePaths(): 3370686 -> 3265575 paths
mergePaths(): 2988126 unique, 277449 unsorted, 0 nondeterministic paths
GCSA::prefixDoubling(): Step 1 (path length 16 -> 32)
joinPaths(): 3265575 -> 3756204 paths (8280486 ranks)
mergePaths(): 3756204 -> 3741775 paths
mergePaths(): 3607232 unique, 113082 unsorted, 21461 nondeterministic paths
GCSA::prefixDoubling(): Step 2 (path length 32 -> 64)
joinPaths(): 3741775 -> 6328908 paths (21039927 ranks)
mergePaths(): 6328908 -> 6314604 paths
mergePaths(): 6080484 unique, 61311 unsorted, 172809 nondeterministic paths
GCSA::prefixDoubling(): Step 3 (path length 64 -> 128)
joinPaths(): 6314604 -> 8253191 paths (36664975 ranks)
mergePaths(): 8253191 -> 8252375 paths
mergePaths(): 7786268 unique, 39280 unsorted, 426827 nondeterministic paths
GCSA::mergeByLabel(): 8252375 -> 8017312 paths
GCSA::build(): 10511607 edges
GCSA::sample(): 2558469 samples at 2376049 positions
Index verification complete.
vg index -x z.vg.xg -d z.vg.gcsa -g -k 16 -X 3 -V z.vg 234.44s user 1.80s system 310% cpu 1:15.97 total
We can find things in it using a few features of vg find
. Let's simulate a perfect read using vg sim and then look it up in the index:
➜ test git:(master) ✗ vg sim -f -n 1 -l 64 z.vg
TCACTTAACAATGTCACTTCCAGTGGGTGTCTCCCACCTGAGGAGGGTGATACATGTACAAGGC
➜ test git:(master) ✗ vg find -g z.vg.gcsa -S TCACTTAACAATGTCACTTCCAGTGGGTGTCTCCCACCTGAGGAGGGTGATACATGTACAAGGC
58533:0
Also, we can also pick up pieces of the graph from the xg index:
➜ test git:(master) ✗ vg find -x z.vg.xg -p z:200-300 | vg mod -o - | vg view - | head
H HVN:Z:1.0
S 25 CCACATTTCCATTCACCCATCCTC
P 25 z + 24M
L 25 + 26 + 0M
S 26 C
P 26 z + 1M
L 26 + 28 + 0M
S 28 CATCCATCAACCCTCCAATCC
P 28 z + 21M
L 28 + 29 + 0M
Indexing takes about the same amount of time as using the old strategy:
➜ test git:(master) ✗ time vg index -s -k 27 -e 7 z.vg
vg index -s -k 27 -e 7 z.vg 140.62s user 1.22s system 222% cpu 1:03.76 total
But uses less disk and RAM.
1.2M z.vg # the graph itself
32M z.vg.index # old index
2.6M z.vg.xg # xg index
18M z.vg.gcsa # 128mer gcsa
And the comparison isn't apples-to-apples. We've made something altogether much more powerful. We can query the graph for not just 27-mers on the forward strand but all 128-mers, both on the forward and reverse strands!
Both the .gcsa
and .xg
indexes use approximately the same amount of memory on disk and in ram, so we'd need about 21M to load them into ram. This is about 17 times the size of the input graph. Extrapolating, we should expect to use about 50G of for the whole genome.
Now, if you're itching to try to build such indexes for a human whole-genome variation graph, you will need to mask out some small regions of apparent high complexity (likely misalignments or highly divergent haplotypes):
(NB: I haven't tried this. But, when I do, this is how I will 😉)
# for each chromosome, reduce the graph complexity by masking
# complex regions yielding a lot of short subgraphs
# and retaining the original id space
for chr in $(seq 1 22; echo X; echo Y);
do
vg mod -pl 16 -e 6 $chr.vg | vg mod -S -l 50 - >>in.vg
done
# now use these graphs as input to the indexer
# using only 2 doubling steps, and indexing only the forward strand
time vg index -x wg.xg -d wg.gcsa -g -k 16 -X 2 -V -F in.vg
Reboot!