Releases: vgteam/vg
vg 1.8.0 - Vallata
Docker image: quay.io/vgteam/vg:v1.8.0-0-g3cb239e2-t176-run
This release includes:
- Improvements to haplotype-aware mapping
- The
vg viz
native graph rendering tool - Improvements to plotting scripts
- Improvements to the build system
- Improvements to CI infrastructure error handling
- Bugfixes
Breaking Changes
- This release generates XG version 9. It can read previous versions, but new indexes cannot be read by previous versions.
New Dependencies
Now vg depends on Cairo, which vg viz
uses for drawing.
New and Updated Submodules
The gbwt
and raptor
submodules have been updated.
Resubmission
This release roughly coincides with resubmission of the vg paper. There have been a lot of changes since v1.6.0 and it would be difficult to provide a specific summary, but several items deserve mention.
- We improve vg's mapping runtime,
- integrate the GBWT,
- and refresh some critical components of the indexing process (pruning and kmer generation).
- We explore haplotype-based alignment scoring using the GBWT,
- and enable using it as the base thread storage system for all the tools in vg.
- We greatly improved the accuracy of the long read alignment algorithm in vg map.
Submission
This checkpoints the version that was used to generate experimental results in the preprint of our manuscript. (It will be linked here when made public.)
Attached is a stripped static binary that is portable across all 64-bit Linux systems that we have access to.
Paper freeze
Recent developments in vg have focused on improving the mapper's performance via simulations.
As a result of this work, our simulations show that when aligning reads simulated from the 1000G variation graph, vg
is equivalent to bwa mem
in performance on the linear reference, and bests it considerably when aligning against the pangenome:
Experiments are completed with scripts/map-sim
. The above example can be reproduced by:
vg/scripts/map-sim human.10M hs37d5.fa ref/index pan/index 32 "-s 571 -n 500000 -e 0.01 -i 0.002 -l 150 -p 500 -v 50" " "
Runtime is between 5 and 10x that of bwa mem
.
Sendai
There have been 286 commits since v1.3.0, so it is hard to summarize everything that's happened. Here are some of the highlights:
mapping ranks
Since v1.2.0:
- Improvements in MSGA allow construction of the GA4GH bakeoff regions (however, problems with alignment are breaking the paths that are included).
- Mappings now include ranks, so that it's easy to obtain subsets of paths and merge subsets of graphs without needing to keep the entire path around.
- Better validation of paths by using ranks on mappings.
The changes in format necessitated changes in xg, and any indexes from older versions will be invalidated with this update.
giant leaps
Notable improvements in this release:
- MSGA -- a kind of alignment-driven overlap-based assembler that can align large genomes; vg's answer to cactus.
- call -- variant calling on the graph
- index -- index huge graphs!
- map -- map against huge graphs!
Indexing improvements
Assuming you've started with a big graph (and that node ids 1000000000 and 1000000001 are outside of your node space), let's start by reducing the complexity of the graph so as to bound memory requirements during GCSA2 construction. We break the graph where we cross more than (approximately) 3 bifurcating edges in 16bp, and when fragments are less than 100bp in total sequence we remove them. This effectively masks out high-complexity regions of the graph. Despite the complexity reduction, the resulting system is sufficient for seeding global mapping:
vg mod -pl 16 -t 16 -e 3 $chr.vg \
| vg mod -S -l 32 - >$chr.prune.vg
vg kmers -gB -k 16 -F -H 1000000000 -T 1000000001 -t 16 \
$chr.prune.vg >$chr.graph
Now, we feed the resulting graphs, which binary formats for de Bruijn-like graphs, into GCSA2's build_gcsa tool:
build_gcsa -d 1 -o wg $((seq 1 22; echo X; echo Y) | parallel -j 1 "echo -n ' {}.graph'")
The resulting GCSA index is wg.gcsa
.
Now, we build an xg index:
vg index -x wg.xg $((seq 1 22; echo X; echo Y) | parallel -j 1 "echo -n ' {}.vg'")
We can then drive our mapper using these indexes, and it only requires 20G of RAM!
zcat NA12878.x10.13502_7.fq.gz | head -4000000 | vg map -x wg.xg -g wg.k16e3l100d1.gcsa -if - -t 32 -k 27 -GFX 1.5 -T 50 -l 12 -S 5 >aln.gam
I'll generate a more-detailed writeup for the wiki, but hopefully this satisfies the very-curious.
MSGA
You can build graphs of sets of sequences by feeding them in in fasta format using vg msga.
vg msga -f ref.fa >ref.vg
Fun!
✩✭✮✫✯✨
Since v1.0.0:
🖬 111606B of patches.
- multi mapping (thanks to @adamnovak )
- gcsa2 integration (nod to @jltsiren for helping me debug my many buggy attempts)
vg
is now capable of generating GCSA and xg indexes of graphs! By default it will index the forward and reverse complement of any variation graph you give it, and allow you to construct queries of up to . At present, you can query the indexes using vg find
, but stay tuned because mapping using the new indexes is top priority for v1.2.0.
You can try it out with this executable and the tarball (or your own copy of the repo):
➜ test git:(master) ✗ time vg construct -r 1mb1kgp/z.fa -v 1mb1kgp/z.vcf.gz -m 50 >z.vg -p
loading variants for z [==============================================]100.0%
parsing variants [==============================================]100.0%
enforcing node size limit [==============================================]100.0%
planning construction [==============================================]100.0%
constructing graph [==============================================]100.0%
merging remaining graphs [==============================================]100.0%
joining graphs [==============================================]100.0%
topologically sorting [==============================================]100.0%
compacting ids [==============================================]100.0%
saving graph [==============================================]100.0%
vg construct -r 1mb1kgp/z.fa -v 1mb1kgp/z.vcf.gz -m 50 -p > z.vg 5.99s user 0.32s system 123% cpu 5.114 total
And then we'll index it with xg and gcsa2, using an initial kmer size of 16 and 3 doubling steps, which will yield a GCSA index that lets us find the positions of sequences of up to 16 * 2^3 = 128
in the graph. Meanwhile the xg
index provides near-optimal query performance over elements of the graph (nodes, edges, paths, and path-relative positions) using minimal space.
➜ test git:(master) ✗ time vg index -x z.vg.xg -d z.vg.gcsa -g -k 16 -X 3 -V z.vg
GCSA::prefixDoubling(): Initial path length 16
mergePaths(): 3370686 -> 3265575 paths
mergePaths(): 2988126 unique, 277449 unsorted, 0 nondeterministic paths
GCSA::prefixDoubling(): Step 1 (path length 16 -> 32)
joinPaths(): 3265575 -> 3756204 paths (8280486 ranks)
mergePaths(): 3756204 -> 3741775 paths
mergePaths(): 3607232 unique, 113082 unsorted, 21461 nondeterministic paths
GCSA::prefixDoubling(): Step 2 (path length 32 -> 64)
joinPaths(): 3741775 -> 6328908 paths (21039927 ranks)
mergePaths(): 6328908 -> 6314604 paths
mergePaths(): 6080484 unique, 61311 unsorted, 172809 nondeterministic paths
GCSA::prefixDoubling(): Step 3 (path length 64 -> 128)
joinPaths(): 6314604 -> 8253191 paths (36664975 ranks)
mergePaths(): 8253191 -> 8252375 paths
mergePaths(): 7786268 unique, 39280 unsorted, 426827 nondeterministic paths
GCSA::mergeByLabel(): 8252375 -> 8017312 paths
GCSA::build(): 10511607 edges
GCSA::sample(): 2558469 samples at 2376049 positions
Index verification complete.
vg index -x z.vg.xg -d z.vg.gcsa -g -k 16 -X 3 -V z.vg 234.44s user 1.80s system 310% cpu 1:15.97 total
We can find things in it using a few features of vg find
. Let's simulate a perfect read using vg sim and then look it up in the index:
➜ test git:(master) ✗ vg sim -f -n 1 -l 64 z.vg
TCACTTAACAATGTCACTTCCAGTGGGTGTCTCCCACCTGAGGAGGGTGATACATGTACAAGGC
➜ test git:(master) ✗ vg find -g z.vg.gcsa -S TCACTTAACAATGTCACTTCCAGTGGGTGTCTCCCACCTGAGGAGGGTGATACATGTACAAGGC
58533:0
Also, we can also pick up pieces of the graph from the xg index:
➜ test git:(master) ✗ vg find -x z.vg.xg -p z:200-300 | vg mod -o - | vg view - | head
H HVN:Z:1.0
S 25 CCACATTTCCATTCACCCATCCTC
P 25 z + 24M
L 25 + 26 + 0M
S 26 C
P 26 z + 1M
L 26 + 28 + 0M
S 28 CATCCATCAACCCTCCAATCC
P 28 z + 21M
L 28 + 29 + 0M
Indexing takes about the same amount of time as using the old strategy:
➜ test git:(master) ✗ time vg index -s -k 27 -e 7 z.vg
vg index -s -k 27 -e 7 z.vg 140.62s user 1.22s system 222% cpu 1:03.76 total
But uses less disk and RAM.
1.2M z.vg # the graph itself
32M z.vg.index # old index
2.6M z.vg.xg # xg index
18M z.vg.gcsa # 128mer gcsa
And the comparison isn't apples-to-apples. We've made something altogether much more powerful. We can query the graph for not just 27-mers on the forward strand but all 128-mers, both on the forward and reverse strands!
Both the .gcsa
and .xg
indexes use approximately the same amount of memory on disk and in ram, so we'd need about 21M to load them into ram. This is about 17 times the size of the input graph. Extrapolating, we should expect to use about 50G of for the whole genome.
Now, if you're itching to try to build such indexes for a human whole-genome variation graph, you will need to mask out some small regions of apparent high complexity (likely misalignments or highly divergent haplotypes):
(NB: I haven't tried this. But, when I do, this is how I will 😉)
# for each chromosome, reduce the graph complexity by masking
# complex regions yielding a lot of short subgraphs
# and retaining the original id space
for chr in $(seq 1 22; echo X; echo Y);
do
vg mod -pl 16 -e 6 $chr.vg | vg mod -S -l 50 - >>in.vg
done
# now use these graphs as input to the indexer
# using only 2 doubling steps, and indexing only the forward strand
time vg index -x wg.xg -d wg.gcsa -g -k 16 -X 2 -V -F in.vg
Reboot!
first binary distribution
Let's call this version 1, if for no particular reason other than that I have a static linux 64-bit binary to distribute. Linux users should be able to download and run the attached executable after a quick chmod +x ./vg
.