diff --git a/README.md b/README.md
index 8107ca7..68b5388 100644
--- a/README.md
+++ b/README.md
@@ -18,7 +18,7 @@ will generate 9 databases (and indexes) of different sizes. The generation proce
 Generating the bacterial databases is particularly time consuming.
 Your machine should have:
-* 650 GB of free disk space - mainly for downloading genomes from [NCBI](https://www.ncbi.nlm.nih.gov/),
+* 1 TB of free disk space - mainly for downloading genomes from [NCBI](https://www.ncbi.nlm.nih.gov/),
 * at least 8 cores - the more the better (some phases of the database generation keep 32 cores 100% busy),
 * 48 GB of main memory,
 * a high bandwidth Internet connection.
@@ -27,23 +27,25 @@ The databases are based on and compatible with [Genestrip v1.4](https://github.c
 ## The databases
-All databases are purely genomic.
+All databases are genomic or based on total RNA.
 | Name | Category |Description | Database disk size | Sources and references |
 | ----------- | -----|------ | ----------- | ---------------------- |
-| `babesia` | `protozoa` | Babesia species from the [RefSeq](https://ftp.ncbi.nlm.nih.gov/refseq/release/) and [Genbank](https://ftp.ncbi.nlm.nih.gov/genomes/genbank/) which are potentially pathogenic for humans | 936 MB | General knowledge |
-| `borrelia` | `bacteria` | Borrelia species from the [RefSeq](https://ftp.ncbi.nlm.nih.gov/refseq/release/) which are potentially pathogenic for humans | 844 MB | General knowledge |
-| `borrelia_plasmid` | `plasmid` | Borrelia species from the [RefSeq](https://ftp.ncbi.nlm.nih.gov/refseq/release/) which are potentially pathogenic for humans | 205 MB | General knowledge |
-| `chronicb` | `bacteria` | Potentially tick-borne infections which are potentially pathogenic for humans and may lead to chronic diseases | 4.34 GB | Collected from [Armin Labs](https://arminlabs.com/en/services) |
+| `babesia` | `protozoa` | Babesia species from the [RefSeq](https://ftp.ncbi.nlm.nih.gov/refseq/release/) and [Genbank](https://ftp.ncbi.nlm.nih.gov/genomes/genbank/) which are potentially pathogenic for humans | 1.1 G | General knowledge |
+| `borrelia` | `bacteria` | Borrelia species from the [RefSeq](https://ftp.ncbi.nlm.nih.gov/refseq/release/) which are potentially pathogenic for humans | 850 MB | General knowledge |
+| `borrelia_plasmid` | `plasmid` | Borrelia species from the [RefSeq](https://ftp.ncbi.nlm.nih.gov/refseq/release/) which are potentially pathogenic for humans | 219 MB | General knowledge |
+| `chronicb` | `bacteria` | Potentially tick-borne infections which are potentially pathogenic for humans and may lead to chronic diseases | 2.8 GB | Collected from [Armin Labs](https://arminlabs.com/en/services) |
+| `chronicb-rna` | `bacteria` | Same as `chronicb` but based on total RNA. | 1.1 M | |
 | `human_virus2` | `viral` | Viruses from the [RefSeq](https://ftp.ncbi.nlm.nih.gov/refseq/release/) and [Genbank](https://ftp.ncbi.nlm.nih.gov/genomes/genbank/) which are potentially pathogenic for humans |89 MB | Extracted from the [Viral Zone](https://viralzone.expasy.org/678) |
-| `parasites` | `invertebrate` | Parasitic invertebrate animals from the [RefSeq](https://ftp.ncbi.nlm.nih.gov/refseq/release/) which are potentially pathogenic for humans | 20.26 GB | Collected from the book ["Die Parasiten des Menschen"](https://link.springer.com/book/10.1007/978-3-662-65315-9) by Heinz Mehlhorn |
-| `protozoa` | `protozoa` | Protozoan parasites from the [RefSeq](https://ftp.ncbi.nlm.nih.gov/refseq/release/) which are potentially pathogenic for humans | 14.46 GB | Collected from the German book ["Die Parasiten des Menschen"](https://link.springer.com/book/10.1007/978-3-662-65315-9) by Heinz Mehlhorn |
-| `vineyard` | `fungi` | Fungal infections of grapevine taken from the [RefSeq](https://ftp.ncbi.nlm.nih.gov/refseq/release/) | 5.6 GB | Collected from the German book ["Rebschutz"](https://books.google.de/books/about/Rebschutz_Taschenbuch.html?id=ov1JAAAAYAAJ&redir_esc=y) by Walter Hildebrand, Dieter Lorenz and Friedrich Louis |
-| `plasmopara` | `plant` | Peronosporales as infections of grapevine taken from the [RefSeq](https://ftp.ncbi.nlm.nih.gov/refseq/release/) | 5.7 GB | Collected from the German book ["Rebschutz"](https://books.google.de/books/about/Rebschutz_Taschenbuch.html?id=ov1JAAAAYAAJ&redir_esc=y) by Walter Hildebrand, Dieter Lorenz and Friedrich Louis |
+| `parasites` | `invertebrate` | Parasitic invertebrate animals from the [RefSeq](https://ftp.ncbi.nlm.nih.gov/refseq/release/) which are potentially pathogenic for humans | 20 GB | Collected from the book ["Die Parasiten des Menschen"](https://link.springer.com/book/10.1007/978-3-662-65315-9) by Heinz Mehlhorn |
+| `protozoa` | `protozoa` | Protozoan parasites from the [RefSeq](https://ftp.ncbi.nlm.nih.gov/refseq/release/) which are potentially pathogenic for humans | 17 GB | Collected from the German book ["Die Parasiten des Menschen"](https://link.springer.com/book/10.1007/978-3-662-65315-9) by Heinz Mehlhorn |
+| `protozoa-rna` | `protozoa` | Same as `protozoa` but based on total RNA | 8.5 GB | |
+| `vineyard` | `fungi` | Fungal infections of grapevine taken from the [RefSeq](https://ftp.ncbi.nlm.nih.gov/refseq/release/) | 4.7 GB | Collected from the German book ["Rebschutz"](https://books.google.de/books/about/Rebschutz_Taschenbuch.html?id=ov1JAAAAYAAJ&redir_esc=y) by Walter Hildebrand, Dieter Lorenz and Friedrich Louis |
+| `plasmopara` | `plant` | Peronosporales as infections of grapevine taken from the [RefSeq](https://ftp.ncbi.nlm.nih.gov/refseq/release/) | 5.6 GB | Collected from the German book ["Rebschutz"](https://books.google.de/books/about/Rebschutz_Taschenbuch.html?id=ov1JAAAAYAAJ&redir_esc=y) by Walter Hildebrand, Dieter Lorenz and Friedrich Louis |
 Note that Genestrip's [`updateddb`](https://github.com/pfeiferd/genestrip/blob/master/Goals.md)-phase accounts for unspecific *k*-mers and largely avoids false positive counts during `match`es.
-To further reduce false positives, all databases (except for `vineyard`) are built such that *k*-mers also occurring in the [human genome](https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.29_GRCh38.p14/GCA_000001405.29_GRCh38.p14_genomic.fna.gz)
+To further reduce false positives, all databases except for `vineyard`, `chronicb-rna` and `protozoa-rna` are built such that *k*-mers also occurring in the [human genome](https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.29_GRCh38.p14/GCA_000001405.29_GRCh38.p14_genomic.fna.gz)
 are [pushed to the least common ancestor](https://github.com/pfeiferd/genestrip/blob/master/README.md#manually-adding-fasta-files).
 ## Testing the databases `borrelia`, `borrelia_plasmid` and `chronicb`
@@ -56,7 +58,7 @@ As expected, Genestrip finds DNA from borrelia and other tick-borne infections a
 If you don't want to generate them yourself, the databases and indexes can also be [downloaded from genestrip.org](https://www.genestrip.org/files/data).
 The [`projects` folder](https://www.genestrip.org/files/data/projects/) corresponds
-to the [`projects` folder's](https://github.com/pfeiferd/genestrip-db/tree/master/data/projects) state of this project, after the scripts `bin/makedbs.sh` and `bin/matchticks.sh` have run successfully.
+to the [`projects` folder's](https://github.com/pfeiferd/genestrip-db/tree/master/data/projects) state of this project, after the scripts `bin/makedbs.sh` and `bin/matchticks.sh` have run successfully on the [RefSeq Release 226](https://ftp.ncbi.nlm.nih.gov/refseq/release/RELEASE_NUMBER).