
Commit c6dbf1e

chore: Update documentation to match current CLI/usage (#595)
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
1 parent faffad6 · commit c6dbf1e

9 files changed (+245, -224 lines)

README.md

Lines changed: 18 additions & 197 deletions

@@ -7,7 +7,7 @@
 
 # Mehari
 
-<img align="right" width="200" height="200" src="misc/camel.jpeg">
+<img style="float: right" width="200" height="200" src="misc/camel.jpeg" alt="a camel">
 
 Mehari is a software package for annotating VCF files with variant effect/consequence.
 The program uses [hgvs-rs](https://crates.io/crates/hgvs) for projecting genomic variants to transcripts and proteins and thus has high prediction quality.

@@ -17,201 +17,22 @@ Other popular tools offering variant effect/consequence prediction include:
 - [SnpEff](http://pcingola.github.io/SnpEff/)
 - [VEP (Variant Effect Predictor)](https://www.ensembl.org/info/docs/tools/vep/index.html)
 
-Mehari offers predictions that aim to mirror VariantValidator, the gold standard for HGVS variant descriptions.
+Mehari offers HGVS predictions that aim to mirror VariantValidator, the gold standard for HGVS variant descriptions, and consequence predictions compatible with VEP.
 Further, it is written in the Rust programming language and can be used as a library for users' Rust software.
 
-## Supported Sequence Variant Frequency Databases
-
-Mehari can import public sequence variant frequency databases.
-The supported set slightly differs between import for GRCh37 and GRCh38.
-
-**GRCh37**
-
-- gnomAD r2.1.1 Exomes [`gnomad.exomes.r2.1.1.sites.vcf.bgz`](https://gnomad.broadinstitute.org/downloads#v2)
-- gnomAD r2.1.1 Genomes [`gnomad.genomes.r2.1.1.sites.vcf.bgz`](https://gnomad.broadinstitute.org/downloads#v2)
-- gnomAD v3.1 mtDNA [`gnomad.genomes.v3.1.sites.chrM.vcf.bgz`](https://gnomad.broadinstitute.org/downloads#v3-mitochondrial-dna)
-- HelixMTdb `HelixMTdb_20200327.tsv`
-
-**GRCh38**
-
-- gnomAD r2.1.1 lift-over Exomes [`gnomad.exomes.r2.1.1.sites.liftover_grch38.vcf.bgz`](https://gnomad.broadinstitute.org/downloads#v2)
-- gnomAD v3.1 Genomes [`gnomad.genomes.v3.1.2.sites.$CHROM.vcf.bgz`](https://gnomad.broadinstitute.org/downloads#v3)
-- gnomAD v3.1 mtDNA [`gnomad.genomes.v3.1.sites.chrM.vcf.bgz`](https://gnomad.broadinstitute.org/downloads#v3-mitochondrial-dna)
-- HelixMTdb `HelixMTdb_20200327.tsv`
-
-## Building from scratch
-To reduce compile times, we recommend using a pre-built version of `rocksdb`, either from the system package manager or e.g. via `conda`:
-
-```bash
-# Ubuntu
-sudo apt-get install librocksdb-dev
-
-# Conda
-conda install -c conda-forge rocksdb
-```
-
-In either case, either add
-```toml
-[env]
-ROCKSDB_LIB_DIR = "/usr/lib/" # in case of the system package manager, adjust the path accordingly for conda
-SNAPPY_LIB_DIR = "/usr/lib/" # same as above
-```
-to `.cargo/config.toml` or set the environment variables `ROCKSDB_LIB_DIR` and `SNAPPY_LIB_DIR` to the appropriate paths:
-
-```bash
-export ROCKSDB_LIB_DIR=/usr/lib/
-export SNAPPY_LIB_DIR=/usr/lib/
-```
-
-By default, the environment variables are defined in the `.cargo/config.toml` as described above, i.e. may need adjustments if not using the system package manager.
-
-To build the project, run:
-```bash
-cargo build --release
-```
-
-To install the project locally, run:
-```bash
-cargo install --path .
-```
-## Internal Notes
-
-```
-rm -rf /tmp/out ; cargo run -- db create seqvar-freqs --path-output-db /tmp/out --genome-release grch38 --path-helix-mtdb ~/Downloads/HelixMTdb_20200327.vcf.gz --path-gnomad-mtdna ~/Downloads/gnomad.genomes.v3.1.sites.chrM.vcf.bgz --path-gnomad-exomes-xy tests/data/db/create/seqvar_freqs/xy-38/gnomad.exomes.r2.1.1.sites.chrX.vcf --path-gnomad-exomes-xy tests/data/db/create/seqvar_freqs/xy-38/gnomad.exomes.r2.1.1.sites.chrY.vcf --path-gnomad-genomes-xy tests/data/db/create/seqvar_freqs/xy-38/gnomad.genomes.r3.1.1.sites.chrX.vcf --path-gnomad-genomes-xy tests/data/db/create/seqvar_freqs/xy-38/gnomad.genomes.r3.1.1.sites.chrY.vcf --path-gnomad-exomes-auto tests/data/db/create/seqvar_freqs/12-38/gnomad.exomes.r2.1.1.sites.chr1.vcf --path-gnomad-exomes-auto tests/data/db/create/seqvar_freqs/12-38/gnomad.exomes.r2.1.1.sites.chr2.vcf --path-gnomad-genomes-auto tests/data/db/create/seqvar_freqs/12-38/gnomad.genomes.r3.1.1.sites.chr1.vcf --path-gnomad-genomes-auto tests/data/db/create/seqvar_freqs/12-38/gnomad.genomes.r3.1.1.sites.chr2.vcf
-
-rm -rf /tmp/out ; cargo run -- db create seqvar-freqs --path-output-db /tmp/out --genome-release grch37 --path-gnomad-mtdna ~/Downloads/gnomad.genomes.v3.1.sites.chrM.vcf.bgz --path-gnomad-exomes-xy tests/data/db/create/seqvar_freqs/xy-37/gnomad.exomes.r2.1.1.sites.chrX.vcf --path-gnomad-exomes-xy tests/data/db/create/seqvar_freqs/xy-37/gnomad.exomes.r2.1.1.sites.chrY.vcf --path-gnomad-genomes-xy tests/data/db/create/seqvar_freqs/xy-37/gnomad.genomes.r2.1.1.sites.chrX.vcf --path-gnomad-exomes-auto tests/data/db/create/seqvar_freqs/12-37/gnomad.exomes.r2.1.1.sites.chr1.vcf --path-gnomad-exomes-auto tests/data/db/create/seqvar_freqs/12-37/gnomad.exomes.r2.1.1.sites.chr2.vcf --path-gnomad-genomes-auto tests/data/db/create/seqvar_freqs/12-37/gnomad.genomes.r2.1.1.sites.chr1.vcf --path-gnomad-genomes-auto tests/data/db/create/seqvar_freqs/12-37/gnomad.genomes.r2.1.1.sites.chr2
-```
-
-```
-prepare()
-{
-    in=$1
-    out=$2
-
-    zcat $in \
-    | head -n 5000 \
-    | grep ^# \
-    > $out
-
-    zcat $in \
-    | grep -v ^# \
-    | head -n 3 \
-    >> $out
-}
-
-base=/data/sshfs/data/gpfs-1/groups/cubi/work/projects/2021-07-20_varfish-db-downloader-holtgrewe/varfish-db-downloader/
-
-mkdir -p tests/data/db/create/seqvar_freqs/{12,xy}-{37,38}
-
-## 37 exomes
-
-prepare \
-    $base/GRCh37/gnomAD_exomes/r2.1.1/download/gnomad.exomes.r2.1.1.sites.chr1.vcf.bgz \
-    tests/data/db/create/seqvar_freqs/12-37/gnomad.exomes.r2.1.1.sites.chr1.vcf
-prepare \
-    $base/GRCh37/gnomAD_exomes/r2.1.1/download/gnomad.exomes.r2.1.1.sites.chr2.vcf.bgz \
-    tests/data/db/create/seqvar_freqs/12-37/gnomad.exomes.r2.1.1.sites.chr2.vcf
-prepare \
-    $base/GRCh37/gnomAD_exomes/r2.1.1/download/gnomad.exomes.r2.1.1.sites.chrX.vcf.bgz \
-    tests/data/db/create/seqvar_freqs/xy-37/gnomad.exomes.r2.1.1.sites.chrX.vcf
-prepare \
-    $base/GRCh37/gnomAD_exomes/r2.1.1/download/gnomad.exomes.r2.1.1.sites.chrY.vcf.bgz \
-    tests/data/db/create/seqvar_freqs/xy-37/gnomad.exomes.r2.1.1.sites.chrY.vcf
-
-## 37 genomes
-
-prepare \
-    $base/GRCh37/gnomAD_genomes/r2.1.1/download/gnomad.genomes.r2.1.1.sites.chr1.vcf.bgz \
-    tests/data/db/create/seqvar_freqs/12-37/gnomad.genomes.r2.1.1.sites.chr1.vcf
-prepare \
-    $base/GRCh37/gnomAD_genomes/r2.1.1/download/gnomad.genomes.r2.1.1.sites.chr2.vcf.bgz \
-    tests/data/db/create/seqvar_freqs/12-37/gnomad.genomes.r2.1.1.sites.chr2.vcf
-prepare \
-    $base/GRCh37/gnomAD_genomes/r2.1.1/download/gnomad.genomes.r2.1.1.sites.chrX.vcf.bgz \
-    tests/data/db/create/seqvar_freqs/xy-37/gnomad.genomes.r2.1.1.sites.chrX.vcf
-
-## 38 exomes
-
-prepare \
-    $base/GRCh38/gnomAD_exomes/r2.1.1/download/gnomad.exomes.r2.1.1.sites.chr1.vcf.bgz \
-    tests/data/db/create/seqvar_freqs/12-38/gnomad.exomes.r2.1.1.sites.chr1.vcf
-prepare \
-    $base/GRCh38/gnomAD_exomes/r2.1.1/download/gnomad.exomes.r2.1.1.sites.chr2.vcf.bgz \
-    tests/data/db/create/seqvar_freqs/12-38/gnomad.exomes.r2.1.1.sites.chr2.vcf
-prepare \
-    $base/GRCh38/gnomAD_exomes/r2.1.1/download/gnomad.exomes.r2.1.1.sites.chrX.vcf.bgz \
-    tests/data/db/create/seqvar_freqs/xy-38/gnomad.exomes.r2.1.1.sites.chrX.vcf
-prepare \
-    $base/GRCh38/gnomAD_exomes/r2.1.1/download/gnomad.exomes.r2.1.1.sites.chrY.vcf.bgz \
-    tests/data/db/create/seqvar_freqs/xy-38/gnomad.exomes.r2.1.1.sites.chrY.vcf
-
-## 38 genomes
-
-prepare \
-    $base/GRCh38/gnomAD_genomes/r3.1.1/download/gnomad.genomes.r3.1.1.sites.chr1.vcf.bgz \
-    tests/data/db/create/seqvar_freqs/12-38/gnomad.genomes.r3.1.1.sites.chr1.vcf
-prepare \
-    $base/GRCh38/gnomAD_genomes/r3.1.1/download/gnomad.genomes.r3.1.1.sites.chr2.vcf.bgz \
-    tests/data/db/create/seqvar_freqs/12-38/gnomad.genomes.r3.1.1.sites.chr2.vcf
-prepare \
-    $base/GRCh38/gnomAD_genomes/r3.1.1/download/gnomad.genomes.r3.1.1.sites.chrX.vcf.bgz \
-    tests/data/db/create/seqvar_freqs/xy-38/gnomad.genomes.r3.1.1.sites.chrX.vcf
-prepare \
-    $base/GRCh38/gnomAD_genomes/r3.1.1/download/gnomad.genomes.r3.1.1.sites.chrY.vcf.bgz \
-    tests/data/db/create/seqvar_freqs/xy-38/gnomad.genomes.r3.1.1.sites.chrY.vcf
-```
-
-Building tx database
-
-
-```
-cd hgvs-rs-data
-
-seqrepo --root-directory seqrepo-data/master init
-
-mkdir -p mirror/ftp.ncbi.nih.gov/refseq/H_sapiens/mRNA_Prot
-cd !$
-wget https://ftp.ncbi.nih.gov/refseq/H_sapiens/mRNA_Prot/human.files.installed
-parallel -j 16 'wget https://ftp.ncbi.nih.gov/refseq/H_sapiens/mRNA_Prot/{}' ::: $(cut -f 2 human.files.installed | grep fna)
-cd -
-
-mkdir -p mirror/ftp.ensembl.org/pub/release-108/fasta/homo_sapiens/cdna
-cd !$
-wget https://ftp.ensembl.org/pub/release-108/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh38.cdna.all.fa.gz
-cd -
-mkdir -p mirror/ftp.ensembl.org/pub/release-108/fasta/homo_sapiens/ncrna
-cd !$
-wget https://ftp.ensembl.org/pub/release-109/fasta/homo_sapiens/ncrna/Homo_sapiens.GRCh38.ncrna.fa.gz
-cd -
-mkdir -p mirror/ftp.ensembl.org/pub/grch37/release-108/fasta/homo_sapiens/cdna/
-cd !$
-wget https://ftp.ensembl.org/pub/grch37/release-108/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh37.cdna.all.fa.gz
-cd -
-mkdir -p mirror/ftp.ensembl.org/pub/grch37/release-108/fasta/homo_sapiens/ncrna/
-cd !$
-wget https://ftp.ensembl.org/pub/grch37/release-108/fasta/homo_sapiens/ncrna/Homo_sapiens.GRCh37.ncrna.fa.gz
-cd -
-
-seqrepo --root-directory seqrepo-data/master load -n NCBI $(find mirror/ftp.ncbi.nih.gov -name '*.fna.gz' | sort)
-seqrepo --root-directory seqrepo-data/master load -n ENSEMBL $(find mirror/ftp.ensembl.org -name '*.fa.gz' | sort)
-
-cd ../mehari
-
-cargo run --release -- \
-    -v \
-    db create txs \
-    --path-out /tmp/txs-out.bin.zst \
-    --path-lable-tsv PATH_TO_MANE_LABEL.tsv \
-    --path-cdot-json ../cdot-0.2.21.ensembl.grch37_grch38.json.gz \
-    --path-cdot-json ../cdot-0.2.21.refseq.grch37_grch38.json.gz \
-    --path-seqrepo-instance ../hgvs-rs-data/seqrepo-data/master/master
-```
-
-## Development Setup
-
-You will need a recent version of protoc, e.g.:
-
-```
-# bash utils/install-protoc.sh
-# export PATH=$PATH:$HOME/.local/share/protoc/bin
-```
+## Usage
+To annotate variant consequences, gnomAD frequencies and clinVar information for sequence variants:
+```sh
+mehari annotate seqvars \
+    --transcripts resources/transcript_db \
+    --frequencies resources/gnomad_db \
+    --clinvar resources/clinvar_db \
+    --path-input-vcf input.vcf \
+    --path-output-vcf output.vcf
+```
+The corresponding database builds can be obtained from:
+- transcripts: [github.com/varfish-org/mehari-data-tx/releases](https://github.com/varfish-org/mehari-data-tx/releases)
+- gnomAD frequencies: TODO
+- clinVar: [github.com/varfish-org/annonars-data-clinvar/releases](https://github.com/varfish-org/annonars-data-clinvar/releases)
+
+See [Getting Started](docs/getting_started.md) for more information on usage, and [Development Setup](docs/development.md) for more information on how to build mehari and its databases from scratch.
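The new `## Usage` section points at GitHub release pages for the prebuilt databases. For orientation, release assets can be fetched directly with `wget`; the tag and asset names below are placeholders to be taken from the respective release page:

```sh
# Placeholders: pick a concrete tag and asset name from
# https://github.com/varfish-org/mehari-data-tx/releases
# (and analogously from the annonars-data-clinvar releases for ClinVar).
wget "https://github.com/varfish-org/mehari-data-tx/releases/download/<TAG>/<ASSET>"
```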

docs/anno_seqvars.md

Lines changed: 5 additions & 4 deletions

@@ -22,16 +22,17 @@ Currently, Mehari will annotate variants using:
 
 - The predicted impact on gene transcripts and the corresponding protein sequence (in the case of coding genes).
 - Their frequency in the gnomAD exomes and genomes databases as well as the HelixMtDb database in the case of mitochondrial databases.
+- Variant information from ClinVar, if any
 
 ## Command Line Invocation
 
-You can invoke Mehari like this to annotate a VCF file `IN.vcf` to an output file `OUT.vcf` using the built (or downloaded) database as `path/to/db`.
+You can invoke Mehari to annotate a VCF file `IN.vcf` creating an output file `OUT.vcf` using the built (or downloaded) databases – for example the transcript database as follows:
 
 ```text
 $ mehari annotate seqvars \
-    --path-db path/to/db \
-    --input-vcf IN.vcf \
-    --output-vcf OUT.vcf
+    --transcripts path/to/transcripts-db \
+    --path-input-vcf IN.vcf \
+    --path-output-vcf OUT.vcf
 ```
 
 Note that the input and output files can optionally be gzip/bgzip compressed VCF files with suffixes (`.gz` or `.bgz`) or BCF files with suffix `.bcf`.
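As the final context line notes, compressed input and output are selected purely by file suffix. A minimal sketch using the updated flags (paths are placeholders):

```sh
# Read a bgzip-compressed VCF and write a bgzip-compressed VCF;
# a `.bcf` suffix on the output path would likewise select BCF output.
mehari annotate seqvars \
    --transcripts path/to/transcripts-db \
    --path-input-vcf IN.vcf.gz \
    --path-output-vcf OUT.vcf.gz
```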

docs/development.md

Lines changed: 42 additions & 0 deletions

@@ -0,0 +1,42 @@
+## Building from scratch
+To reduce compile times, we recommend using a pre-built version of `rocksdb`, either from the system package manager or e.g. via `conda`:
+
+```bash
+# Ubuntu
+sudo apt-get install librocksdb-dev
+
+# Conda
+conda install -c conda-forge rocksdb
+```
+
+In either case, either add
+```toml
+[env]
+ROCKSDB_LIB_DIR = "/usr/lib/" # in case of the system package manager, adjust the path accordingly for conda
+SNAPPY_LIB_DIR = "/usr/lib/" # same as above
+```
+to `.cargo/config.toml` or set the environment variables `ROCKSDB_LIB_DIR` and `SNAPPY_LIB_DIR` to the appropriate paths:
+
+```bash
+export ROCKSDB_LIB_DIR=/usr/lib/
+export SNAPPY_LIB_DIR=/usr/lib/
+```
+
+By default, the environment variables are defined in the `.cargo/config.toml` as described above, i.e. may need adjustments if not using the system package manager.
+
+You will need a recent version of protoc, e.g.:
+
+```bash
+bash utils/install-protoc.sh
+export PATH=$PATH:$HOME/.local/share/protoc/bin
+```
+
+To build the project, run:
+```bash
+cargo build --release
+```
+
+To install the project locally, run:
+```bash
+cargo install --path .
+```
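The new page documents the system-package paths and leaves the conda case to the reader ("adjust the path accordingly for conda"). One way to point the build at a conda environment, assuming the usual layout where conda-forge libraries land in `$CONDA_PREFIX/lib`:

```bash
# With the target conda environment activated, rocksdb and snappy from
# conda-forge live under the environment's lib/ directory.
export ROCKSDB_LIB_DIR="$CONDA_PREFIX/lib"
export SNAPPY_LIB_DIR="$CONDA_PREFIX/lib"
```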

docs/getting_started.md

Lines changed: 22 additions & 17 deletions

@@ -2,42 +2,47 @@ Getting Started.
 
 # Installation
 
-You most likely want to install via bioconda.
+## via bioconda
 As a prerequisite, [follow the bioconda getting started guide](http://bioconda.github.io/#usage).
 
-Then, create a new environment (use the `mamba` if you are as impatient as us).
+Then, create a new environment;
 
-```text
-$ mamba create -y mehari mehari
-$ conda activate mehari
+```sh
+conda create -n mehari -y mehari
+conda activate mehari
 ```
 
-The `mehari` executable is now available:
+The `mehari` executable is now available from within the activated `mehari` conda environment:
 
+```sh
+mehari --help
 ```
-$ mehari --help
-```
+
+## via docker
+Docker images of mehari are available from ghcr.io, see [ghcr.io/varfish-org/mehari](https://github.com/varfish-org/mehari/pkgs/container/mehari).
+
 
 # Downloading Prebuilt Databases
 
-TODO: not yet available
+- transcript database releases: https://github.com/varfish-org/mehari-data-tx/releases
+- gnomAD frequency database releases: TODO
+- clinVar database releases: https://github.com/varfish-org/annonars-data-clinvar/releases
 
 # Annotating Example VCF Files
 
 You can obtain an example file like this:
 
-```text
-$ wget https://raw.githubusercontent.com/varfish-org/mehari/main/tests/data/db/create/seqvar_freqs/db-rs1263393206/input.vcf \
-    -O example.vcf
+```sh
+wget https://raw.githubusercontent.com/varfish-org/mehari/main/tests/data/db/create/seqvar_freqs/db-rs1263393206/input.vcf -O example.vcf
 ```
 
 Now, annotate it using Mehari:
 
-```text
-$ mehari annotate seqvars \
-    --path-db path/to/mehari-db/b37 \
+```sh
+mehari annotate seqvars \
+    --transcripts path/to/mehari-transcript-db \
+    --frequencies path/to/mehari-frequency-db \
+    --clinvar path/to/mehari-clinvar-db \
     --path-input-vcf example.vcf \
     --path-output-vcf example.out.vcf
-$ grep -v ^# example.out.vcf
-TODO: output line
 ```
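The new docker section only links the container image. A sketch of running the same annotation through the container, assuming the image exposes the `mehari` CLI and that the databases and working directory are bind-mounted (the image tag and host paths are placeholders):

```sh
docker run --rm \
    -v /path/to/mehari-databases:/data \
    -v "$PWD":/work \
    ghcr.io/varfish-org/mehari:<TAG> \
    mehari annotate seqvars \
        --transcripts /data/mehari-transcript-db \
        --frequencies /data/mehari-frequency-db \
        --clinvar /data/mehari-clinvar-db \
        --path-input-vcf /work/example.vcf \
        --path-output-vcf /work/example.out.vcf
```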

docs/implementation_notes.md

Lines changed: 1 addition & 1 deletion

@@ -1,4 +1,4 @@
-Implementation notes.
+# Implementation notes
 
 ## Frequency Databases
 
docs/index.md

Lines changed: 3 additions & 2 deletions

@@ -10,7 +10,8 @@ Why another software package?
   library.
 The latter serves as the basis for [VariantValidator.org](https://variantvalidator.org/) which is the gold standard for HGVS variant description generation and validation.
 - Mehari is written in the Rust programming language which allows it to work fast, with low memory consumption (as a C++ program would) and being memory safe at the same time (as a Java/Python/Perl program would).
-- As a Rust program, it can be embedded into the backend of the [VarFish](https://github.com/varfish-org/varfish-server) variant analysis platform.
+- It can be used as a rust library, as is the case for e.g. the backend of the [VarFish](https://github.com/varfish-org/varfish-server) variant analysis platform
+- Provides a REST API for sequence variant annotation (see `mehari server run --help`)
 
 ## What's Next?
 

@@ -27,4 +28,4 @@ We recommend to read the Mehari end-user documentation in the following order:
 Since Mehari is written in the Rust programming language, we host the documentation on `docs.rs` written as Rust online documentation.
 This has the advantage that the documentation is bundle with the program source code (and thus always up to date) and the latest documentation is always available at <https://docs.rs/mehari>.
 
-The drawback is that the formatting of this may not be as end-user friendly as it could be but you will manage.
+The drawback is that the formatting of this may not be as end-user friendly as it could be, but you will manage.
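The new REST API bullet defers to the built-in help rather than documenting the server options on this page; they can be listed with:

```sh
# Show the options of the built-in annotation server.
mehari server run --help
```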
