Updated docs.
pfeiferd committed Sep 23, 2024
1 parent c15afe3 commit b897560
Showing 1 changed file with 14 additions and 12 deletions.
26 changes: 14 additions & 12 deletions README.md
@@ -18,7 +18,7 @@ will generate 9 databases (and indexes) of different sizes. The generation proce
Generating the bacterial databases is particularly time consuming.

Your machine should have:
-* 650 GB of free disk space - mainly for downloading genomes from [NCBI](https://www.ncbi.nlm.nih.gov/),
+* 1 TB of free disk space - mainly for downloading genomes from [NCBI](https://www.ncbi.nlm.nih.gov/),
* at least 8 cores - the more the better (some phases of the database generation keep 32 cores 100% busy),
* 48 GB of main memory,
* a high bandwidth Internet connection.
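
The requirements listed above can be checked before starting a long generation run. The following is a minimal sketch, not part of the repository: the function names `check_requirements` and `probe_machine` are hypothetical, the thresholds use the 1 TB figure introduced by this commit, and interpreting "GB"/"TB" as decimal units is an assumption. The memory probe relies on `os.sysconf`, which is only available on POSIX systems.

```python
import os
import shutil

# Thresholds taken from the requirements list; decimal units are an assumption.
MIN_FREE_DISK_BYTES = 10**12    # 1 TB of free disk space
MIN_CORES = 8                   # at least 8 cores
MIN_MEMORY_BYTES = 48 * 10**9   # 48 GB of main memory


def check_requirements(free_disk_bytes, cores, memory_bytes):
    """Return a list of human-readable problems; the list is empty if all checks pass."""
    problems = []
    if free_disk_bytes < MIN_FREE_DISK_BYTES:
        problems.append(f"only {free_disk_bytes / 10**9:.0f} GB of free disk space, 1 TB recommended")
    if cores < MIN_CORES:
        problems.append(f"only {cores} cores, at least {MIN_CORES} recommended")
    if memory_bytes < MIN_MEMORY_BYTES:
        problems.append(f"only {memory_bytes / 10**9:.0f} GB of main memory, 48 GB recommended")
    return problems


def probe_machine(path="."):
    """Measure free disk space at `path`, the core count and the physical memory (POSIX only)."""
    free_disk_bytes = shutil.disk_usage(path).free
    cores = os.cpu_count() or 0
    memory_bytes = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
    return free_disk_bytes, cores, memory_bytes
```

Calling `check_requirements(*probe_machine())` from the directory that will hold the downloads returns the list of shortfalls for the current machine. Network bandwidth is not covered by this sketch.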
@@ -27,23 +27,25 @@ The databases are based on and compatible with [Genestrip v1.4](https://github.c

## The databases

-All databases are purely genomic.
+All databases are genomic or based on total RNA.

| Name | Category |Description | Database disk size | Sources and references |
| ----------- | -----|------ | ----------- | ---------------------- |
-| `babesia` | `protozoa` | Babesia species from the [RefSeq](https://ftp.ncbi.nlm.nih.gov/refseq/release/) and [Genbank](https://ftp.ncbi.nlm.nih.gov/genomes/genbank/) which are potentially pathogenic for humans | 936 MB | General knowledge |
-| `borrelia` | `bacteria` | Borrelia species from the [RefSeq](https://ftp.ncbi.nlm.nih.gov/refseq/release/) which are potentially pathogenic for humans | 844 MB | General knowledge |
-| `borrelia_plasmid` | `plasmid` | Borrelia species from the [RefSeq](https://ftp.ncbi.nlm.nih.gov/refseq/release/) which are potentially pathogenic for humans | 205 MB | General knowledge |
-| `chronicb` | `bacteria` | Potentially tick-borne infections which are potentially pathogenic for humans and may lead to chronic diseases | 4.34 GB | Collected from [Armin Labs](https://arminlabs.com/en/services) |
+| `babesia` | `protozoa` | Babesia species from the [RefSeq](https://ftp.ncbi.nlm.nih.gov/refseq/release/) and [Genbank](https://ftp.ncbi.nlm.nih.gov/genomes/genbank/) which are potentially pathogenic for humans | 1.1 G | General knowledge |
+| `borrelia` | `bacteria` | Borrelia species from the [RefSeq](https://ftp.ncbi.nlm.nih.gov/refseq/release/) which are potentially pathogenic for humans | 850 MB | General knowledge |
+| `borrelia_plasmid` | `plasmid` | Borrelia species from the [RefSeq](https://ftp.ncbi.nlm.nih.gov/refseq/release/) which are potentially pathogenic for humans | 219 MB | General knowledge |
+| `chronicb` | `bacteria` | Potentially tick-borne infections which are potentially pathogenic for humans and may lead to chronic diseases | 2.8 GB | Collected from [Armin Labs](https://arminlabs.com/en/services) |
+| `chronicb-rna` | `bacteria` | Same as `chronicb` but based on total RNA. | 1.1 M | |
| `human_virus2` | `viral` | Viruses from the [RefSeq](https://ftp.ncbi.nlm.nih.gov/refseq/release/) and [Genbank](https://ftp.ncbi.nlm.nih.gov/genomes/genbank/) which are potentially pathogenic for humans |89 MB | Extracted from the [Viral Zone](https://viralzone.expasy.org/678) |
-| `parasites` | `invertebrate` | Parasitic invertebrate animals from the [RefSeq](https://ftp.ncbi.nlm.nih.gov/refseq/release/) which are potentially pathogenic for humans | 20.26 GB | Collected from the book ["Die Parasiten des Menschen"](https://link.springer.com/book/10.1007/978-3-662-65315-9) by Heinz Mehlhorn |
-| `protozoa` | `protozoa` | Protozoan parasites from the [RefSeq](https://ftp.ncbi.nlm.nih.gov/refseq/release/) which are potentially pathogenic for humans | 14.46 GB | Collected from the German book ["Die Parasiten des Menschen"](https://link.springer.com/book/10.1007/978-3-662-65315-9) by Heinz Mehlhorn |
-| `vineyard` | `fungi` | Fungal infections of grapevine taken from the [RefSeq](https://ftp.ncbi.nlm.nih.gov/refseq/release/) | 5.6 GB | Collected from the German book ["Rebschutz"](https://books.google.de/books/about/Rebschutz_Taschenbuch.html?id=ov1JAAAAYAAJ&redir_esc=y) by Walter Hildebrand, Dieter Lorenz and Friedrich Louis |
-| `plasmopara` | `plant` | Peronosporales as infections of grapevine taken from the [RefSeq](https://ftp.ncbi.nlm.nih.gov/refseq/release/) | 5.7 GB | Collected from the German book ["Rebschutz"](https://books.google.de/books/about/Rebschutz_Taschenbuch.html?id=ov1JAAAAYAAJ&redir_esc=y) by Walter Hildebrand, Dieter Lorenz and Friedrich Louis |
+| `parasites` | `invertebrate` | Parasitic invertebrate animals from the [RefSeq](https://ftp.ncbi.nlm.nih.gov/refseq/release/) which are potentially pathogenic for humans | 20 GB | Collected from the book ["Die Parasiten des Menschen"](https://link.springer.com/book/10.1007/978-3-662-65315-9) by Heinz Mehlhorn |
+| `protozoa` | `protozoa` | Protozoan parasites from the [RefSeq](https://ftp.ncbi.nlm.nih.gov/refseq/release/) which are potentially pathogenic for humans | 17 GB | Collected from the German book ["Die Parasiten des Menschen"](https://link.springer.com/book/10.1007/978-3-662-65315-9) by Heinz Mehlhorn |
+| `protozoa-rna` | `protozoa` | Same as `protozoa` but based on total RNA | 8.5 GB | |
+| `vineyard` | `fungi` | Fungal infections of grapevine taken from the [RefSeq](https://ftp.ncbi.nlm.nih.gov/refseq/release/) | 4.7 GB | Collected from the German book ["Rebschutz"](https://books.google.de/books/about/Rebschutz_Taschenbuch.html?id=ov1JAAAAYAAJ&redir_esc=y) by Walter Hildebrand, Dieter Lorenz and Friedrich Louis |
+| `plasmopara` | `plant` | Peronosporales as infections of grapevine taken from the [RefSeq](https://ftp.ncbi.nlm.nih.gov/refseq/release/) | 5.6 GB | Collected from the German book ["Rebschutz"](https://books.google.de/books/about/Rebschutz_Taschenbuch.html?id=ov1JAAAAYAAJ&redir_esc=y) by Walter Hildebrand, Dieter Lorenz and Friedrich Louis |


Note that Genestrip's [`updateddb`](https://github.com/pfeiferd/genestrip/blob/master/Goals.md)-phase accounts for unspecific *k*-mers and largely avoids false positive counts during `match`es.
-To further reduce false positives, all databases (except for `vineyard`) are built such that *k*-mers also occurring in the [human genome](https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.29_GRCh38.p14/GCA_000001405.29_GRCh38.p14_genomic.fna.gz)
+To further reduce false positives, all databases except for `vineyard`, `chronicb-rna` and `protozoa-rna` are built such that *k*-mers also occurring in the [human genome](https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.29_GRCh38.p14/GCA_000001405.29_GRCh38.p14_genomic.fna.gz)
are [pushed to the least common ancestor](https://github.com/pfeiferd/genestrip/blob/master/README.md#manually-adding-fasta-files).
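
The idea of pushing a shared *k*-mer to the least common ancestor can be illustrated with a small sketch. This is not Genestrip's implementation: the taxonomy below is a heavily simplified toy tree, and the names `lineage` and `least_common_ancestor` are hypothetical. It only shows why a *k*-mer that also occurs in the human genome ends up at a node too general to count as evidence for a specific pathogen.

```python
# Toy taxonomy (child -> parent); the root is its own parent.
# Real taxonomies have many more ranks; this tree is an illustrative assumption.
PARENT = {
    "root": "root",
    "Bacteria": "root",
    "Eukaryota": "root",
    "Borrelia": "Bacteria",
    "Borrelia burgdorferi": "Borrelia",
    "Borrelia afzelii": "Borrelia",
    "Homo sapiens": "Eukaryota",
}


def lineage(taxon, parent):
    """Return the path from `taxon` up to the root, both inclusive."""
    path = [taxon]
    while parent[taxon] != taxon:
        taxon = parent[taxon]
        path.append(taxon)
    return path


def least_common_ancestor(a, b, parent):
    """Return the deepest node that lies on the lineages of both `a` and `b`."""
    ancestors_of_a = set(lineage(a, parent))
    for node in lineage(b, parent):  # walks upward, so the first hit is the deepest
        if node in ancestors_of_a:
            return node
    raise ValueError("taxa do not share a root")


# A k-mer shared by two Borrelia species stays at genus level ...
print(least_common_ancestor("Borrelia burgdorferi", "Borrelia afzelii", PARENT))
# ... while a k-mer also found in the human genome moves up to the root.
print(least_common_ancestor("Borrelia burgdorferi", "Homo sapiens", PARENT))
```

In this toy tree, the second call returns the root, so a match on that *k*-mer no longer points to any Borrelia species.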

## Testing the databases `borrelia`, `borrelia_plasmid` and `chronicb`
@@ -56,7 +58,7 @@ As expected, Genestrip finds DNA from borrelia and other tick-borne infections a

If you don't want to generate them yourself, the databases and indexes can also be [downloaded from genestrip.org](https://www.genestrip.org/files/data).
The [`projects` folder](https://www.genestrip.org/files/data/projects/) corresponds
-to the [`projects` folder's](https://github.com/pfeiferd/genestrip-db/tree/master/data/projects) state of this project, after the scripts `bin/makedbs.sh` and `bin/matchticks.sh` have run successfully.
+to the [`projects` folder's](https://github.com/pfeiferd/genestrip-db/tree/master/data/projects) state of this project, after the scripts `bin/makedbs.sh` and `bin/matchticks.sh` have run successfully on the [RefSeq Release 226](https://ftp.ncbi.nlm.nih.gov/refseq/release/RELEASE_NUMBER).


