Commit da05e00

nicolaasuni committed
Add "how to cite" and update description
1 parent e29d42f commit da05e00

1 file changed, +11 -2 lines changed

README.md

Lines changed: 11 additions & 2 deletions
@@ -11,6 +11,14 @@
 * **license** MIT (see LICENSE)
 * **link** https://github.com/Genomicsplc/variantkey

+-----------------------------------------------------------------
+
+**How to cite**
+
+Nicola Asuni. [VariantKey - A Reversible Numerical Representation of Human Genetic Variants](https://www.biorxiv.org/content/early/2018/11/19/473744.1). bioRxiv 473744; doi: https://doi.org/10.1101/473744
+
+
+-----------------------------------------------------------------

 ## TOC

@@ -34,19 +42,20 @@
 * [R Module](#rlib)
 * [Javascript library](#jslib)

+-----------------------------------------------------------------

 <a name="description"></a>
 ## Description

-A genetic variant is often referred as a single entity but, for a given genome assembly, it is usually represented as a set of four components with variable length: *chromosome*, *position*, *reference* and *alternate* alleles. There is no guarantee that these components are represented in a consistent way across different data sources. The numerical *dbSNP* reference record representation (*rs#*) only covers a subset of all possible variants and it is not bijective. Processing variant-based data can be really inefficient due to the necessity to perform four different comparison operations for each variant, three of which are string comparisons. Working with strings, in contrast of numbers, poses extra challenges on memory allocation and data-representation.
+Human genetic variants are usually represented by four values with variable length: chromosome, position, reference and alternate alleles. There is no guarantee that these components are represented in a consistent way across different data sources, and processing variant-based data can be inefficient because four different comparison operations are needed for each variant, three of which are string comparisons. Working with strings, in contrast to numbers, poses extra challenges for computer memory allocation and data representation. Existing variant identifiers do not typically represent every possible variant we may be interested in, nor are they directly reversible.

 **VariantKey**, a novel reversible numerical encoding schema for human genetic variants, overcomes these limitations by allowing variants to be processed as single 64-bit numeric entities while preserving the ability to be searched and sorted per chromosome and position.

 The individual components of short variants (up to 11 bases between `REF` and `ALT` alleles) can be directly read back from the VariantKey, while long variants require a lookup table to retrieve the reference and alternate allele strings.

 The [VariantKey Format](#vkformat) doesn't represent universal codes; it only encodes `CHROM`, `POS`, `REF` and `ALT`, so each code is unique for a given reference genome. A direct comparison of two VariantKeys makes sense only if they both refer to the same genome reference.

-This software library can be used to generate and reverse VariantKeys.
+This software library can be used to generate and reverse [VariantKey](#vkformat)s and [RegionKey](#regionkey)s.



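The description in this diff says that chromosome, position and alleles are packed into one 64-bit number that still sorts by chromosome and position, and that short variants (up to 11 bases across `REF` and `ALT`) can be read back directly. The Python sketch below illustrates that idea only. The bit layout (5 bits for the chromosome, 28 for the position, 31 for the alleles), the numeric chromosome codes, and the function names `encode_refalt`, `decode_refalt`, `pack` and `unpack` are assumptions made for illustration; they are not taken from this commit and should not be read as the library's API or exact format. Long variants, which need a lookup table according to the description, are not handled.

```python
# Illustrative sketch only: the bit widths below are assumptions (5 + 28 + 31 = 64).
CHROM_BITS, POS_BITS, REFALT_BITS = 5, 28, 31
MAX_BASES = 11                      # short-variant limit named in the description
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASE_CHAR = "ACGT"


def encode_refalt(ref: str, alt: str) -> int:
    """Pack REF and ALT (at most 11 bases in total) into a 31-bit field."""
    total = len(ref) + len(alt)
    if total > MAX_BASES:
        raise ValueError("long variants need a lookup table; not handled here")
    code = (len(ref) << 4) | len(alt)          # two 4-bit length fields
    for base in ref + alt:
        code = (code << 2) | BASE_CODE[base]   # 2 bits per base
    code <<= 2 * (MAX_BASES - total)           # pad unused base slots with zeros
    return code << 1                           # lowest bit left at 0 in this sketch


def decode_refalt(code: int) -> tuple[str, str]:
    """Reverse encode_refalt."""
    code >>= 1
    bases = code & ((1 << (2 * MAX_BASES)) - 1)
    lengths = code >> (2 * MAX_BASES)
    ref_len, alt_len = lengths >> 4, lengths & 0xF
    seq = "".join(
        BASE_CHAR[(bases >> (2 * (MAX_BASES - 1 - i))) & 3]
        for i in range(ref_len + alt_len)
    )
    return seq[:ref_len], seq[ref_len:]


def pack(chrom: int, pos: int, ref: str, alt: str) -> int:
    """Combine the four components into a single 64-bit integer key."""
    if not 0 <= chrom < (1 << CHROM_BITS) or not 0 <= pos < (1 << POS_BITS):
        raise ValueError("chromosome code or position out of range")
    return (
        (chrom << (POS_BITS + REFALT_BITS))
        | (pos << REFALT_BITS)
        | encode_refalt(ref, alt)
    )


def unpack(key: int) -> tuple[int, int, str, str]:
    """Read the four components back from a key produced by pack."""
    chrom = key >> (POS_BITS + REFALT_BITS)
    pos = (key >> REFALT_BITS) & ((1 << POS_BITS) - 1)
    ref, alt = decode_refalt(key & ((1 << REFALT_BITS) - 1))
    return chrom, pos, ref, alt
```

Because the chromosome occupies the highest bits and the position the next ones, comparing two such keys as plain integers orders them by chromosome and then by position, which is the property the description highlights. As the README text notes, such a comparison is only meaningful when both keys refer to the same reference genome.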