Skip to content

Latest commit

 

History

History
357 lines (280 loc) · 15.7 KB

Lightning-Data.md

File metadata and controls

357 lines (280 loc) · 15.7 KB

Lightning Data

Overview

Lightning is a system that allows for efficient access to large large scale population genomic data with a focus on clinical and research use. The primary use of Lightning is for populations human genomic data but this could be extended to encompass other organisms. Genomes are stored in a compressed format that is a compromise between size and accessibility. Additional data data sources, such as phenotype data from the Harvard Personal Genome Project and variant data from the ClinVar database, are added for practical use.

The Lightning system is a combination of a conceptual way to think about genomes (genomic tiling), the internal representation of genomes for efficient access (the compact genome format and auxiliary data) and the software that manages access to the data.

This document will be focusing on the data files and formats used in the Lightning system. Please refer to Lightning Concepts for the description of the tiling system and Lightning Architecture for references to the engineering architecture.

Data Overview

Here is a list of the main data components:

  • Compact genome format files (CGF), one for each data genomic set
  • Tile library, holding the underlying sequence
  • Tagset library, holding the tile sequences
  • Assembly data, holding the mapping of reference genomes to tile boundaries
  • Tile span information

There are some auxiliary data files and formats that are used for conversion but aren't used for Lightning server queries:

  • FastJ
  • Simple genome library format

In addition, for the Lightning prototype, there are the ClinVar database and the untap database provided:

  • ClinVar database, holding ClinVar variants and their tile mappings
  • untap database, holding phenotype information for individuals from the Harvard Personal Genome Project and their mapping to the CGF datasets above

Compact Genome Format Files

See the CGF schema for a description of the CGF files. The library and command line tools provided in cgf allow for inspection and creation of CGF files.

The current 651 CGF files (433 Thousand Genomes Project, 217 Harvard Personal Genome Project and hg19) average at about 27MiB each. Most of that information is fine grained no-call information needed to reconstruct the original sequence. The current CGF files can be found at:

Tile Library Format

The sequence information is stored on a per tile path basis. All tile variants at a particular tile position are converted into their twobit format and packed into a tar archive grouped by tile path. The tar archive is indexed using tarindexer and the tar archive is then bgzip'd, creating a bgzip index file in the process.

For example, here are some of the first files in the resulting tile library format directory:

0000.tar.gz      007b.tar.gz.gzi  00f6.tar.tai     0172.tar.gz      01ed.tar.gz.gzi  0268.tar.tai     02e4.tar.gz
0000.tar.gz.gzi  007b.tar.tai     00f7.tar.gz      0172.tar.gz.gzi  01ed.tar.tai     0269.tar.gz      02e4.tar.gz.gzi
0000.tar.tai     007c.tar.gz      00f7.tar.gz.gzi  0172.tar.tai     01ee.tar.gz      0269.tar.gz.gzi  02e4.tar.tai
0001.tar.gz      007c.tar.gz.gzi  00f7.tar.tai     0173.tar.gz      01ee.tar.gz.gzi  0269.tar.tai     02e5.tar.gz
0001.tar.gz.gzi  007c.tar.tai     00f8.tar.gz      0173.tar.gz.gzi  01ee.tar.tai     026a.tar.gz      02e5.tar.gz.gzi
0001.tar.tai     007d.tar.gz      00f8.tar.gz.gzi  0173.tar.tai     01ef.tar.gz      026a.tar.gz.gzi  02e5.tar.tai
0002.tar.gz      007d.tar.gz.gzi  00f8.tar.tai     0174.tar.gz      01ef.tar.gz.gzi  026a.tar.tai     02e6.tar.gz
...

The combination of twobit representation and bgzip allows for a relatively size compact tile library.

The combination of the tar archive and bgzip archive allow for fast random access to individual tile positions.

Here is an example of getting all tile variants for tile position 01ef.00.1d2e:

$ grep 01ef.00.1d2e 01ef.tar.tai
01ef.00.1d2e.2bit 25138176 230
$ twoBitGulp -i <( bgzip -b 25138176 -s 230 01ef.tar.gz )
>01ef.00.1d2e.000
attagaggtagtgctctacaattttctgttaagttaagatggttgataat
gtcaatatcaaatcttctgtatccttactcatttttaagttggtctatca
attattgaagaaattgtgttaaatcctctaactttcactgtaaatttatt
tctccctatatttctctcagtttttggcaaatggattttgacacattgtt
gttatgattgttacaccttcctgatgcattggctttgttgtcattgtaa

>01ef.00.1d2e.001
attagaggtagtgctctacaattttctgttaagatggttgataatgtcaa
tatcaaatcttctgtatccttactcatttttaagttggtctatcaattat
tgaagaaattgtgttaaatcctctaactttcactgtaaatttatttctcc
ctatatttctctcagtttttggcaaatggattttgacacattgttgttat
gattgttacaccttcctgatgcattggctttgttgtcattgtaa

For the 651 samples, 433 thousand genome datasets, 217 Harvard Personal Genome Project datasets and the hg19 reference, the tile library totals 9GiB. The latest tile library can be found at:

Tagset

Lightning tile tag sequences are stored in a 2bit files. Each sequence name is identified by it's four hex path string followed by a period and the 2 digit hex string tile library version (e..g 01ef.00).

The tags for a path are concatenated into a long sequence. The tag length, currently 24, along with the tile position indicates how much of the sequence should be skipped to find the relevant tag. The first 24 bases in a tile path represent the ending tag of the first tile. The last 24 bases in a tile path represent the starting tag sequence of the last tile in the tile path. A tile path can be empty and have no tags in them.

For example, here is an illustrative example of how to find the beginning and end tag for tile position 01ef.00.1d2e:

$ twoBitGulp -i tagset.2bit -name 01ef.00 -w 24 | egrep -v '^>' | head -n 7470 | tail -n 2
atctatggatatttggatttctaa
attagaggtagtgctctacaattt

There are roughly 10M tile positions. The current tagset.2bit is ~61MiB in size. The latest tagset can be found at:

Assembly

The mapping between a reference genome and which tile position they map to are kept in a compressed and indexed text file, The text file consists of a header followed by fixed with data. The header is a greater than symbol (>) followed by a triple of the reference name, chromosome name and 4 digit ASCII hex tile path, delimited by a colon (:). The subsequent data is a tilestep followed by a non-inclusive, 0 index, ending base position of the tile. That is, the reported base position is one past the inclusive end of the 0 reference start of the next tile, where it exists. Each tilestep and ending position are on individual lines with a tab (\t) immediately following the 4 (ASCII) hex digit tile step and spaces left padding the (base 10) ending tile position, padding to make a total of 10 characters.

For example, here is a portion of the assembly.00.hg19.fw.gz file:

...
202c      34599696
202d      34600000
>hg19:chr1:000c
0000      34600237
0001      34600462
0002      34600715
0003      34600949
...
5240      40099773
5241      40100000
>hg19:chr1:000d
0000      40100225
0001      40100457
0002      40100682
0003      40100907
0004      40101132
...

To calculate the beginning of a tile path, the chromosome and reference position needs to be stored from the previous tile path. If there is no previous tile path or the tile path has a different chromosome, the beginning reference position is 0. If the previous tile path has the same chromosome, then one past the reported value of the last tile step is the beginning of the current tile path.

In the above example, tile path 0x000d starts at chr1, reference position 40100001.

Tile steps can be skipped, in which case the tile that the end tag falls in is reported. Tile step 0 is always the first tile step in a tile path. To calculate the span of a tile, the previous tile step needs to be stored. If the tile step increment is greater than one, the tile step one past the previous tile step is the anchor tile and the difference between the reported tile step and previously reported tile step is the span.

For example:

...
00b3      59032812
00b4      59373566
>human_g1k_v37:MT:035e
0000           244
0001           467
0002           959
0003          1394
0004          1767
0005          2225
0008          3338
0009          3563
000a          3788
000b          9760
000c          9985
...

For tile path 0x35e, tile step 0x0008 is reported with end reference position 3338 with the previously reported tile step of 0x0005 and reference position 2225. The next tile after tile step 0x0005 is 0x0006 and has span of 3 (8-5=3). In this case, the anchor tile is tile step 0x0006 with span of 3 with the next tile after 0x0006 is 0x0009. The tile 0x0006 starts at reference position 2201 (2225-24) and ends (non-inclusive, 0 reference) on 3338.

In addition to the fixed width file, a fixed width index file, reminiscent of a FASTA index file, is used as an index file. The format is a tab delimited sequence name, size (in bytes), start (excluding header, in bytes), line length (in bytes, excluding ending newline) and total line length (including newline, in bytes).

For example:

$ head  assembly.00.hg19.fw.fwi
hg19:chr1:0000  86928   16      15      16
hg19:chr1:0001  185360  86960   15      16
hg19:chr1:0002  113792  272336  15      16
hg19:chr1:0003  120800  386144  15      16
hg19:chr1:0004  209504  506960  15      16
hg19:chr1:0005  160976  716480  15      16
hg19:chr1:0006  241776  877472  15      16
hg19:chr1:0007  211392  1119264 15      16
hg19:chr1:0008  237408  1330672 15      16
hg19:chr1:0009  117776  1568096 15      16

The fixed width file is compressed with bgzip and indexed, for a total of 3 files per reference tile assembly build. In theory multiple tile reference assemblies could be provided in the same file but the current version has each reference tile assembly in distinct files, with three files for hg19 and three files for human_g1k_v37.

Note that the chromosome naming convention is used from the source reference sequence. For example, chrM for hg19 and MT for the human_g1k_v37 reference.

The current fixed width files total ~56MiB. The most current assembly files can be found at:

Span

When variants from an input sequence fall on tags, the resulting tile is extended until the sequence matches an unaltered tile tag. The information for how many tags these variant skip is stored in the span data file.

The span file consists of ASCII CSV file (, delimited), with a 4, 2, 4, 3 hex tile ID, followed by the span number (in decimal). Only tile span information for tiles that are other than span of 1 are held.

For example:

$ zcat span.gz | head
0000.00.0004.42b,2
0000.00.0004.42c,2
0000.00.0004.42d,2
0000.00.0004.42e,2
0000.00.0005.0da,2
0000.00.0005.0db,2
0000.00.0005.0dc,2
0000.00.0005.0dd,2
0000.00.0005.0de,2
0000.00.0005.0df,2

The current compressed span.gz file is ~49MiB. The most current version can be found at:

Simple Genome Library Format

The simple genome library format (SGLF) is used as a verbose, intermediate format for help in conversion to the more size compact cousins.

The SGLF files are compressed CSV files, one per tile path. The columns are the 4, 2, 4, 3 formatted tile id followed by a + and span information (in decimal), followed by the md5sum of the tile sequence, followed by the sequence itself.

As an example:

$ zcat 01ef.sglf.gz  | head
01ef.00.0000.000+1,57f5909d01b7f8edd3f8652e7e610709,ggtgaatgttggctgtggagaatgaatccgaatcacttaggtcaaaagatgactaattccaaacacttttgctctatgcctgtttttatggtggccactctttgctctcaaacagggctcagaagaagagtgccaacaagtttctccacagaggggcactggctggcatccctgtaatacgcggtttgtagagaatgaaagcagctttggttttcttttgtacga
01ef.00.0000.001+1,1fbe5a4c30ceb0054ff6bf995bb6f2f1,ggtgaatgttggctgtggagaatgaatccgaatcacttaggtcaaaagatggctaattccaaacacttttgctctatgcctgtttttatggtggccactctttgctctcaaacagggctcagaagaagagtgccaacaagtttctccacagaggggcactggctggcatccctgtaatacgcggtttgtagagaatgaaagcagctttggttttcttttgtacga
01ef.00.0000.002+1,944d1b3b958a2c8ff4105c816c8b8044,ggtgaatgttggctgtggagaatgaatccgaatcacttaggtcaaaagatgactaattccaaacacttttgctctatgcctgtttttatggtggccactctttgctctcaaacagggctcagaaggagagtgccaacaagtttctccacagaggggcactggctggcatccctgtaatacgcggtttgtagagaatgaaagcagctttggttttcttttgtacga
01ef.00.0000.003+1,9188190557002b3f28d847d707464100,ggtgaatgttggctgtggagaatgaatccgaatcacttaggtcaaaagatgactaactccaaacacttttgctctatgcctgtttttatggtggccactctttgctctcaaacagggctcagaagaagagtgccaacaagtttctccacagaggggcactggctggcatccctgtaatacgcggtttgtagagaatgaaagcagctttggttttcttttgtacga
01ef.00.0001.000+1,c4c6db1a6f52cca5aafc43616246e4c9,cagctttggttttcttttgtacgagtgcacccagttaccggcatgacactatggtttcctcgctcggctttgaaatatagtaaactcacaaaagctactgtcgagttcagaacaaacacagggggtatgtttagttatttgtttttgtagtaaaataaaatatttattatacgcagacacctactgtgaacagagggggaggccactgtgttttatcttgccggtgtcatgtttgacttccattaggga
01ef.00.0001.001+1,d9303363bff612ec0e65108e7b0f189a,cagctttggttttcttttgtacgagtgcacccagttaccagcatgacactatggtttcctcgctcggctttgaaatatagtaaactcacaaaagctactgtcgagttcagaacaaacacagggggtatgtttagttatttgtttttgtagtaaaataaaatatttattatacgcagacacctactgtgaacagagggggaggccactgtgttttatcttgccggtgtcatgtttgacttccattaggga
01ef.00.0001.002+1,0a2d0c1ba8227f4daf56e303955d0f3c,cagctttggttttcttttgtacgagtgcacccagttaccggcatgacactatggtttcctcgctcggctttgaaatatagtaaactcacaaaagctactgtcgagttcagaacaaacacagggggtatgtttagttatttgtttttgtagtaaaataaaatatttattatacacagacacctactgtgaacagagggggaggccactgtgttttatcttgccggtgtcatgtttgacttccattaggga
01ef.00.0001.003+1,0b8c1f117c4467f3734f5d15c716641d,cagctttggttttcttttgtacgagtgcacccagttaccggcatgacactatggtttcctcgctcggctttgaaatatagtaaactcacaaaagctactgtcgagttcagaacagacacagggggtatgtttagttatttgtttttgtagtaaaataaaatatttattatacgcagacacctactgtgaacagagggggaggccactgtgttttatcttgccggtgtcatgtttgacttccattaggga
01ef.00.0001.004+1,75daad328a534dd860f91e219ce158c6,cagctttggttttcttttgtacgagtgcacccagttaccggcatgacactatggtttcctcgctcggctttgaaatatagtaaactcacaaaagctactgtcgagttcagaacaaacacagggggtatgtttagttatttgtttttgtagtaaaataaaatatttattatacgcagacacctactgtgaacagagcgggaggccactgtgttttatcttgccggtgtcatgtttgacttccattaggga
01ef.00.0001.005+1,7719b0d2005abd6955e0afb1c5665d46,cagctttggttttcttttgtacgagtgcacccagttaccggcatgacactatggtttcctcgctcggctttgaaatatagtaaactcacaaaagctactgtcgagttcagaacaaacacagggggtatgtttagttatttgtttttgtagtaaaataaaatatttattatacgcagacacctactgtgaacagagggggaggccactgtgttttatcttgctggtgtcatgtttgacttccattaggga

The tile library itself does not hold nocall information. In construction, a heuristic was used to populate bases with what was thought to be the most common base found, defaulting to a if the entire population was no-call.

The current size is ~24GiB. The current SGLF can be found at:

FastJ

FastJ is reminiscent of the FASTA file except the header line is replaced by a JSON string. See the FastJ for details.

Since FastJ is so verbose, it's used as an intermediate format to construct the tile library and compact genome format files.

Untap

See the untap project for details.

The most current version of the untap database can be found at: