|
1 | 1 | # MIRACUM-Pipe-docker
|
2 |
| -application using the dockerized version of MIRACUM-Pipe |
3 | 2 |
|
4 |
| -Currently under heavy development |
| 3 | +This repo offers a framework to easily work with the dockerized version of [MIRACUM-Pipe](https://github.com/AG-Boerries/MIRACUM-Pipe) |
| 4 | + |
| 5 | +## Setup and installation |
| 6 | + |
| 7 | +To run the MIRACUM pipeline, you need to set up tools and databases that we are not allowed to ship due to license restrictions. |
| 8 | +We prepared this project in a way that allows you to easily add these components to the pipeline. |
| 9 | +Before running the setup script, some components need to be installed manually: |
| 10 | + |
| 11 | +- tools |
| 12 | + - [annovar](http://download.openbioinformatics.org/annovar_download_form.php) |
| 13 | + - required additional database for annovar |
| 14 | + - create a database for the latest COSMIC release (according to the [annovar manual](http://annovar.openbioinformatics.org/en/latest/user-guide/filter/#cosmic-annotations)) |
| 15 | + - Download [prepare_annovar_user.pl](http://www.openbioinformatics.org/annovar/download/prepare_annovar_user.pl) and add to annovar folder |
| 16 | + - Register at [COSMIC](https://cancer.sanger.ac.uk/cosmic) |
| 17 | + - Download the latest release for GRCh37 (as of October 2019 the latest release is v90): |
| 18 | + - VCF/CosmicCodingMuts.vcf.gz |
| 19 | + - VCF/CosmicNonCodingVariants.vcf.gz |
| 20 | + - CosmicMutantExport.tsv.gz |
| 21 | + - CosmicNCV.tsv.gz |
| 22 | + - unzip all archives |
| 23 | + - commands to build the annovar database |
| 24 | + |
| 25 | + ```bash |
| 26 | + prepare_annovar_user.pl -dbtype cosmic CosmicMutantExport.tsv -vcf CosmicCodingMuts.vcf > hg19_cosmic_coding.txt |
| 27 | + prepare_annovar_user.pl -dbtype cosmic CosmicNCV.tsv -vcf CosmicNonCodingVariants.vcf > hg19_cosmic_noncoding.txt |
| 28 | + ``` |
| 29 | + |
| 30 | + - Move both created files to the annovar/humandb folder. |
| 31 | + |
| 32 | +- databases |
| 33 | + - [hallmarks of cancer](http://software.broadinstitute.org/gsea/msigdb/) |
| 34 | + - h.all.v7.0.entrez.gmt |
| 35 | + - [condel score](http://bbglab.irbbarcelona.org) |
| 36 | + - fannsdb.tsv.gz |
| 37 | + - fannsdb.tsv.gz.tbi |
| 38 | + |
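The COSMIC steps listed under annovar above can be collected into one helper. This is a minimal sketch, assuming the four downloaded archives and `prepare_annovar_user.pl` already sit in the annovar folder; the function name `build_cosmic_db` is illustrative and not part of `setup.sh`.

```bash
#!/usr/bin/env bash
# Sketch: build the two ANNOVAR COSMIC tables from the downloaded archives.
# Assumption: all four *.gz files and prepare_annovar_user.pl are located in
# the annovar folder passed as the first argument.
build_cosmic_db() {
  local annovar_dir="$1"
  (
    cd "$annovar_dir" || exit 1
    chmod +x prepare_annovar_user.pl || exit 1
    # unzip all archives (-f overwrites leftovers from an earlier run)
    gunzip -f CosmicCodingMuts.vcf.gz CosmicNonCodingVariants.vcf.gz \
              CosmicMutantExport.tsv.gz CosmicNCV.tsv.gz || exit 1
    # the same two commands as in the list above
    ./prepare_annovar_user.pl -dbtype cosmic CosmicMutantExport.tsv \
      -vcf CosmicCodingMuts.vcf > hg19_cosmic_coding.txt || exit 1
    ./prepare_annovar_user.pl -dbtype cosmic CosmicNCV.tsv \
      -vcf CosmicNonCodingVariants.vcf > hg19_cosmic_noncoding.txt || exit 1
    # move both created files to the humandb folder
    mkdir -p humandb
    mv hg19_cosmic_coding.txt hg19_cosmic_noncoding.txt humandb/
  )
}
```

For example, `build_cosmic_db tools/annovar` would apply if annovar was extracted into `tools/annovar` (that path is an assumption).
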
| 39 | +For annovar you need a download link. Follow the URL above and request the link by filling out the form; it will be sent to you by email. |
| 40 | +While `setup.sh` is running you'll be asked to enter this download link. Alternatively, you can install annovar by manually extracting it into the folder `tools`. |
| 41 | +To install the databases, follow the links, register and download the listed files. Then place them into the folder `databases` of your cloned project. |
| 42 | +
|
| 43 | +Next, run the setup script. We recommend installing everything, which does **not** include the example and reference data. There are also options to install and set up individual parts: |
| 44 | +
|
| 45 | +```bash |
| 46 | +./setup.sh -t all |
| 47 | +``` |
| 48 | +
|
| 49 | +See `setup.sh -h` for the available options. By default, neither the reference gene nor our example is installed. If you want to install them, run: |
| 50 | +
|
| 51 | +```bash |
| 52 | +# download and setup reference gene |
| 53 | +./setup.sh -t ref |
| 54 | +
|
| 55 | +# download and setup example data |
| 56 | +./setup.sh -t example |
| 57 | +``` |
| 58 | +
|
| 59 | +## How to configure and run it |
| 60 | +
|
| 61 | +The project structure is as follows: |
| 62 | +
|
| 63 | +```shell |
| 64 | +. |
| 65 | +├── conf |
| 66 | +│ └── custom.yaml |
| 67 | +├── databases |
| 68 | +├── input |
| 69 | +├── output |
| 70 | +├── references |
| 71 | +├── tools |
| 72 | +├── LICENSE |
| 73 | +├── miracum_pipe.sh |
| 74 | +├── README.md |
| 75 | +└── setup.sh |
| 76 | +``` |
| 77 | +
|
| 78 | +There are three levels of configuration: |
| 79 | +
|
| 80 | +- the Docker image ships with [default.yaml](https://github.com/AG-Boerries/MIRACUM-Pipe/blob/master/conf/default.yaml), which contains the default config parameters |
| 81 | +- `conf/custom.yaml` contains settings for the entire runtime environment and overwrites `default.yaml`'s values |
| 82 | +- in each patient directory, a `patient.yaml` can be created in which every setting of the other two configs can be overwritten |
| 83 | + |
| 84 | +### Setting up a patient |
| 85 | + |
| 86 | +The intended setup is one folder per patient in `input`, each containing a `patient.yaml`. We recommend defining at least the following parameters in it: |
| 87 | + |
| 88 | +```yaml |
| 89 | +sex: XX # or XY |
| 90 | +annotation: |
| 91 | + germline: yes # default is no |
| 92 | +``` |
| 93 | + |
| 94 | +Place the germline files (R1 and R2) as well as the tumor files (R1 and R2) into the folder. Either name them `germline_R{1/2}.fastq.gz` and `tumor_R{1/2}.fastq.gz` or adjust your `patient.yaml` accordingly: |
| 95 | + |
| 96 | +```yaml |
| 97 | +[..] |
| 98 | +common: |
| 99 | + files: |
| 100 | + tumor: tumor_R |
| 101 | + germline: germline_R |
| 102 | +``` |
| 103 | + |
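The patient setup described above can be scripted. The following is a minimal sketch, assuming the default file names and that it is run from the project root; the function name `init_patient` is hypothetical and not shipped with this project.

```bash
#!/usr/bin/env bash
# Sketch: create input/<name>/patient.yaml with the recommended parameters
# and report which of the expected FASTQ files are still missing.
# Assumption: default file prefixes (germline_R, tumor_R) are used.
init_patient() {
  local name="$1" sex="${2:-XX}"
  local dir="input/$name"
  mkdir -p "$dir"
  cat > "$dir/patient.yaml" <<EOF
sex: $sex # XX or XY
annotation:
  germline: yes # default is no
EOF
  local f
  for f in germline_R1 germline_R2 tumor_R1 tumor_R2; do
    [ -f "$dir/$f.fastq.gz" ] || echo "missing: $dir/$f.fastq.gz"
  done
}
```

Calling `init_patient patient_01 XY` writes the config and lists every input file that still has to be copied into `input/patient_01`.
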
| 104 | +### Run the pipeline |
| 105 | + |
| 106 | +There are multiple possibilities to run the pipeline: |
| 107 | + |
| 108 | +- run complete pipeline on one patient |
| 109 | + |
| 110 | + ```bash |
| 111 | + ./run-pipeline -d rel_patient_folder |
| 112 | + ``` |
| 113 | + |
| 114 | +- run a specific task on a given patient |
| 115 | + |
| 116 | + ```bash |
| 117 | + ./run-pipeline -d rel_patient_folder -t task |
| 118 | + ``` |
| 119 | + |
| 120 | +- run all unprocessed (no .processed file in the dir) patients |
| 121 | + |
| 122 | + ```bash |
| 123 | + ./run-pipeline |
| 124 | + ``` |
| 125 | + |
| 126 | +For more information, see the help of the command: |
| 127 | + |
| 128 | +```bash |
| 129 | +./run-pipeline -h |
| 130 | +``` |
| 131 | + |
| 132 | +### Parallel computation |
| 133 | + |
| 134 | +The MIRACUM-Pipe consists of five major steps (tasks), several of which can be computed in parallel: |
| 135 | + |
| 136 | +- `td` and `gd` |
| 137 | +- `vc` and `cnv` |
| 138 | +- `report`, which is the last task and builds on the results of the four prior tasks |
| 139 | + |
| 140 | +After the pipeline finishes successfully, it creates the file `.processed` in the patient's directory. By default, processed patients are skipped. |
| 141 | +The flag `-f` forces a recomputation and ignores that file. Sometimes it is also necessary to rerun a single task; use the flag `-t` for that. |
| 142 | +
|
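To make the task ordering concrete, the stages can be written out by hand with the `-t` flag. This is only an illustration of the dependency order, assuming that each stage needs the previous one to finish and that tasks may be started individually in this way; the pipeline handles this scheduling itself, and the helper names `run_stage` and `run_patient` are hypothetical.

```bash
#!/usr/bin/env bash
# Sketch: run the five tasks in three stages, parallel within each stage.
# Assumption: the entry point is ./run-pipeline as used in the examples above.
RUN_PIPELINE="${RUN_PIPELINE:-./run-pipeline}"

# Start every given task in the background, then wait for all of them.
run_stage() {
  local patient="$1"; shift
  local pids=() task pid status=0
  for task in "$@"; do
    "$RUN_PIPELINE" -d "$patient" -t "$task" &
    pids+=("$!")
  done
  for pid in "${pids[@]}"; do
    wait "$pid" || status=1
  done
  return "$status"
}

# A later stage starts only if the previous stage succeeded.
run_patient() {
  local patient="$1"
  run_stage "$patient" td gd &&
  run_stage "$patient" vc cnv &&
  run_stage "$patient" report
}
```

`run_patient rel_patient_folder` would then stop at the first failing stage, which makes it easy to see which single task has to be rerun with `-t`.
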
| 143 | +## Logging |
| 144 | +
|
| 145 | +MIRACUM-Pipe writes its logfiles into `output/<patient_name>/log`. A separate logfile is created for each task in the pipeline. These logfiles let you monitor the current status of the pipeline process. |
| 146 | +
|
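A quick way to check on a running patient is to print the tail of every logfile in that folder. This is a minimal sketch, assuming it is run from the project root; the function name `log_status` is hypothetical, and no particular logfile names are assumed.

```bash
#!/usr/bin/env bash
# Sketch: print the last three lines of each task logfile of one patient.
log_status() {
  local patient="$1" f
  for f in "output/$patient/log"/*; do
    [ -f "$f" ] || continue   # skip if the folder is empty or missing
    printf '== %s ==\n' "$(basename "$f")"
    tail -n 3 "$f"
  done
}
```

For continuous monitoring, `tail -f output/<patient_name>/log/*` follows all logfiles at once.
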
| 147 | +## Parallel & sequential computing
| 148 | +
|
| 149 | +In `conf/custom.yaml` you can set resource parameters such as CPU cores and memory. Unless the pipeline is intentionally run single-threaded (sequentially), several tasks compute in parallel. The resources are divided among them, so you can enter the full amount of resources you want to offer to all pipeline processes together. Single-threaded mode is intended for limited hardware resources or very large input files.
| 150 | +
|
| 151 | +**BEWARE**: if you set tmp to be a tmpfs (in RAM), take this into account when deciding on the process resources.
| 152 | +
|
| 153 | +## License |
| 154 | +
|
| 155 | +This work is licensed under [GNU Affero General Public License version 3](https://opensource.org/licenses/AGPL-3.0). |