Tachyon is an open source C++ software library for reading, writing, and manipulating sequence variant data in a lossless and bit-exact representation. It is completely compatible with BCF/VCF. It was developed with a focus on enabling fast experimentation and storage of population-scaled datasets.
Tachyon stores data in a format that optimize query execution (column store). Additionally, this data layout generally results in considerable gains in compression as similar data are stored together separately. Tachyon can be considered the equivalent of what CRAM is for SAM/BAM but for sequence variant data (VCF/BCF).
The following tests were run on the first release of Haplotype Reference Consortium (HRC) data. There are ~39 million phased SNPs in 32,488 samples. Left panel: Filesizes for chromosomes 1-22. Right panel: We generated a yon archive for this dataset (left) and compared file sizes for both uncompressed (ubcf and uyon) and compressed data (bcf and yon).
Compression Ratio / Chromosome | Compression Ratio |
---|---|
The following tests were run on the 1000 Genomes Phase 3 (1KGP3) data. There are ~84.4 million phased SNPs in 2,504 samples from 26 distinct populations.
Compression Ratio / Chromosome | Compression Ratio |
---|---|
ubcf: uncompressed bcf; uyon: uncompressed yon; 1 GB = 1000 * 1000 * 1000 b
The references system used was a server running Linux Ubuntu, with an Intel Xeon E5-2697 v3 processor, 64GB of DDR4-2133 RAM, and a pair of Intel SSE 750 NVMe drives running in RAID-0.
The following tests were run to benchmark the processing time of various yon
archives. For these tests we use three distinct datasets: 1) 1000 Genomes Phase 3 (1KGP3) chromosome 11; 2) Haplotype Reference Consortium (HRC) chromosome 11; and 3) Human Genome Diversity Project (HGDP) chromosome 10.
Dataset | Variants | #INFO | #FORMAT | ubcf | bcf | uyon | yon |
---|---|---|---|---|---|---|---|
1kgp3-chr11 | 4,045,628 | 24 | 1 | 20.60 GB | 633.70 MB | 670.29 MB | 157.28 MB |
HRC-chr11 | 1,936,990 | 6 | 1 | 125.90 GB | 3.48 GB | 1.47 GB | 461.96 MB |
HGDP-chr10 | 3,766,673 | 24 | 9 | 73.93 GB | 19.07 GB | 67.76 GB | 14.40 GB |
#INFO: number of INFO fields; #FORMAT: number of FORMAT fields; ubcf: uncompressed bcf; uyon: uncompressed yon; 1 MB = 1000 * 1000 b
Interested in contributing? Fork and submit a pull request and it will be reviewed.
We are actively developing Tachyon and are always interested in improving its quality. If you run into an issue, please report the problem on our Issue tracker. Be sure to add enough detail to your report that we can reproduce the problem and address it. We have not reached version 1.0 and as such the specification and/or the API interfaces may change.
This is Tachyon 0.6.1. Tachyon follows semantic versioning.
Tachyon grew out of the Tomahawk project for calculating genome-wide linkage-disequilibrium.
Marcus D. R. Klarqvist (mk819@cam.ac.uk)
Department of Genetics, University of Cambridge
Wellcome Trust Sanger Institute
James Bonfield, Wellcome Trust Sanger Institute
Petr Daněček, Wellcome Trust Sanger Institute
Richard Durbin, Wellcome Trust Sanger Institute, and Department of Genetics, University of Cambridge
Tachyon is licensed under MIT