Commit ed4f81e

Author: Szymon Szyszkowski
chore: updated docs about the harmonisation
1 parent 528ff2a commit ed4f81e

2 files changed: +38 -9 lines changed

docs/README.md

Lines changed: 1 addition & 1 deletion
@@ -2,7 +2,7 @@
22

33
This catalog describes how the orchestration works in the current state
44

5-
### How to generate dag svg files
5+
## How to generate dag svg files
66

77
1. Locate your global `airflow.cfg` file and set `dags_folder` in its `[core]` section to the `src` directory of the orchestration repository, or set the `AIRFLOW__CORE__DAGS_FOLDER` environment variable instead.
88

docs/datasources/gwas_catalog_data/README.md

Lines changed: 37 additions & 8 deletions
@@ -13,7 +13,7 @@ Data stored under 4 buckets:
1313

1414
Bucket `gs://gwas_catalog_inputs` contains:
1515

16-
```
16+
```bash
1717
gs://gwas_catalog_inputs/gwas_catalog_associations_ontology_annotated.tsv
1818
gs://gwas_catalog_inputs/gwas_catalog_download_ancestries.tsv
1919
gs://gwas_catalog_inputs/gwas_catalog_download_studies.tsv
@@ -61,7 +61,7 @@ as failing.
6161
<details>
6262
<summary>Expand to see the example of manifest file</summary>
6363

64-
```
64+
```bash
6565
rawSumstatPath,study,harmonisedSumstatPath,isHarmonised,qcPath,qcPerformed
6666
gs://gwas_catalog_inputs/raw_summary_statistics/GCST000001-GCST001000/GCST000028/harmonised/17463246-GCST000028-EFO_0001360.h.tsv.gz,GCST000028,gs://gwas_catalog_inputs/harmonised_summary_statistics/GCST000028/,True,gs://gwas_catalog_inputs/summary_statistics_qc/GCST000028/,True
6767
```
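
As a rough illustration of how a manifest in this shape could be consumed, the sketch below parses the example row with Python's standard `csv` module and lists the studies that still need harmonisation. The `ManifestRow` class and the `read_manifest` and `pending_harmonisation` helpers are hypothetical names that do not come from the repository, and treating `True`/`False` as the only boolean spellings is an assumption based on the single example row.

```python
import csv
import io
from dataclasses import dataclass

# Example content copied from the manifest shown above.
MANIFEST_TEXT = """\
rawSumstatPath,study,harmonisedSumstatPath,isHarmonised,qcPath,qcPerformed
gs://gwas_catalog_inputs/raw_summary_statistics/GCST000001-GCST001000/GCST000028/harmonised/17463246-GCST000028-EFO_0001360.h.tsv.gz,GCST000028,gs://gwas_catalog_inputs/harmonised_summary_statistics/GCST000028/,True,gs://gwas_catalog_inputs/summary_statistics_qc/GCST000028/,True
"""


@dataclass(frozen=True)
class ManifestRow:
    """One manifest line (hypothetical container, not part of the repository)."""

    raw_sumstat_path: str
    study: str
    harmonised_sumstat_path: str
    is_harmonised: bool
    qc_path: str
    qc_performed: bool


def _to_bool(value: str) -> bool:
    # Assumption: booleans are serialised as the strings "True" / "False".
    return value.strip().lower() == "true"


def read_manifest(text: str) -> list[ManifestRow]:
    """Parse manifest CSV text into typed rows."""
    reader = csv.DictReader(io.StringIO(text))
    return [
        ManifestRow(
            raw_sumstat_path=record["rawSumstatPath"],
            study=record["study"],
            harmonised_sumstat_path=record["harmonisedSumstatPath"],
            is_harmonised=_to_bool(record["isHarmonised"]),
            qc_path=record["qcPath"],
            qc_performed=_to_bool(record["qcPerformed"]),
        )
        for record in reader
    ]


def pending_harmonisation(rows: list[ManifestRow]) -> list[str]:
    """Return the study identifiers whose isHarmonised flag is not set."""
    return [row.study for row in rows if not row.is_harmonised]


rows = read_manifest(MANIFEST_TEXT)
print(rows[0].study, pending_harmonisation(rows))
```

Because the only example row is already harmonised, the pending list is empty here. A real manifest would be read from the bucket rather than from an inline string.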
@@ -77,7 +77,7 @@ This is the dataset containing meta information about the status of finemapping.
7777

7878
The files are stored in a per-study directory, laid out as follows:
7979

80-
```
80+
```bash
8181
gs://gwas_catalog_inputs/harmonisation_summary/GCST90077749/202410141529/harmonisation.csv
8282
gs://gwas_catalog_inputs/harmonisation_summary/GCST90077749/202410141529/harmonisation.log
8383
gs://gwas_catalog_inputs/harmonisation_summary/GCST90077749/latest/harmonisation.csv
@@ -102,7 +102,7 @@ The file reports following metrics:
102102
<details>
103103
<summary>Expand to see the example</summary>
104104

105-
```
105+
```bash
106106
study,harmonisationExitCode,qcExitCode,rawSumstatFile,rawSumstatFileSize,rawUnzippedSumstatFileSize
107107
GCST90077749,0,1,gs://gwas_catalog_inputs/raw_summary_statistics/GCST90077001-GCST90078000/GCST90077749/harmonised/34662886-GCST90077749-EFO_1001919.h.tsv.gz,18M,62M
108108
```
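
To make the exit-code columns concrete, here is a minimal sketch that reads a `harmonisation.csv` in the shape shown above and assigns each study a coarse status. It assumes the usual convention that exit code `0` means success and any other value means failure, which this example does not itself confirm. The `classify` and `read_summary` helpers and the status labels are hypothetical.

```python
import csv
import io

# Example content copied from the harmonisation.csv shown above.
SUMMARY_TEXT = """\
study,harmonisationExitCode,qcExitCode,rawSumstatFile,rawSumstatFileSize,rawUnzippedSumstatFileSize
GCST90077749,0,1,gs://gwas_catalog_inputs/raw_summary_statistics/GCST90077001-GCST90078000/GCST90077749/harmonised/34662886-GCST90077749-EFO_1001919.h.tsv.gz,18M,62M
"""


def classify(record: dict[str, str]) -> str:
    """Return a coarse status label for one harmonisation.csv row."""
    # Assumption: exit code 0 means success, anything else means failure.
    if int(record["harmonisationExitCode"]) != 0:
        return "harmonisation_failed"
    if int(record["qcExitCode"]) != 0:
        return "qc_failed"
    return "ok"


def read_summary(text: str) -> dict[str, str]:
    """Map each study identifier to its status label."""
    reader = csv.DictReader(io.StringIO(text))
    return {record["study"]: classify(record) for record in reader}


statuses = read_summary(SUMMARY_TEXT)
print(statuses)
```

Under that assumption, the example row describes a study whose harmonisation succeeded and whose QC step failed.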
@@ -116,7 +116,7 @@ This file contains logs from the harmonisation script collected during it's exec
116116
<details>
117117
<summary>Expand to see the example</summary>
118118

119-
```
119+
```bash
120120
[2024.10.14 15:33] Copying raw summary statistics from gs://gwas_catalog_inputs/raw_summary_statistics/GCST90078001-GCST90079000/GCST90079000/harmonised/GCST90079000.h.tsv.gz to GCST90079000.h.tsv.gz
121121
[2024.10.14 15:34] Raw file size 17M
122122
[2024.10.14 15:34] Unzipping GCST90079000.h.tsv.gz to GCST90079000.h.tsv
@@ -179,11 +179,40 @@ datasets: {}
179179

180180
This directory contains various analyses performed on the harmonisation results.
181181

182+
## GWAS Catalog harmonisation & QC dag
183+
184+
The `gwas_catalog_harmonisation` dag harmonises the raw summary statistics and runs quality checks on them. Its configuration and topology are defined in the `gwas_catalog_harmonisation.yaml` file under the config directory. Because the task is computationally expensive, it runs in parallel through the Google Batch operators. The dag contains 2 steps:
185+
186+
1. Harmonisation done by [gwas_catalog_sumstat_preprocess](https://opentargets.github.io/gentropy/python_api/steps/gwas_catalog_sumstat_preprocess/)
187+
2. Quality Control of the harmonised summary statistics done by [sumstat_qc_step](https://opentargets.github.io/gentropy/python_api/steps/summary_statistics_qc/)
188+
189+
To run the dag, one needs to prepare the input files and the overwritten gentropy docker image.
190+
191+
### Gentropy overwritten docker image
192+
193+
The image defined in `/images/gentropy/Dockerfile` is based on the [gentropy image](https://github.com/opentargets/gentropy/blob/dev/Dockerfile). Additional packages are added to make it compatible with the Open Targets infrastructure in Google Cloud. These include:
194+
195+
- google cloud sdk (with gsutil)
196+
- bash script to run the gentropy harmonisation pipeline
197+
198+
> [!WARNING]
199+
> Before running the harmonisation pipeline (`gwas_catalog_harmonisation` dag), it is necessary to update the base docker container to reflect the changes in the `gentropy` image. This is done by running the `make build-gentropy-gcs-image` command in the root of the repository.
200+
201+
## Gentropy image
202+
203+
The image in this directory is based on the [gentropy image](https://github.com/opentargets/gentropy/blob/dev/Dockerfile). Additional packages are added to make it compatible with the Open Targets Platform. These include:
204+
205+
- google cloud sdk (with gsutil)
206+
- bash script to run the gentropy harmonisation pipeline
207+
208+
> [!WARNING]
209+
> Before running the harmonisation pipeline (`gwas_catalog_harmonisation` dag), it is necessary to update the base docker container to reflect the changes in the `gentropy` image. This is done by running the `make build-gentropy-gcs-image` command in the root of the repository.
210+
182211
## GWAS Catalog top hits
183212

184213
Bucket `gs://gwas_catalog_top_hits` contains:
185214

186-
```
215+
```bash
187216
gs://gwas_catalog_top_hits/credible_sets/
188217
gs://gwas_catalog_top_hits/study_index/
189218
gs://gwas_catalog_top_hits/study_locus_ld_clumped/
@@ -218,7 +247,7 @@ The step that performs [PICS finemapping](https://opentargets.github.io/gentropy
218247

219248
Bucket `gs://gwas_catalog_sumstats_pics` contains:
220249

221-
```
250+
```bash
222251
gs://gwas_catalog_sumstats_pics/credible_sets/
223252
gs://gwas_catalog_sumstats_pics/study_index/
224253
gs://gwas_catalog_sumstats_pics/study_locus_ld_clumped/
@@ -258,7 +287,7 @@ The step that performs [PICS finemapping](https://opentargets.github.io/gentropy
258287

259288
Bucket `gs://gwas_catalog_sumstats_susie` contains:
260289

261-
```
290+
```bash
262291
gs://gwas_catalog_sumstats_susie/credible_set_datasets/
263292
gs://gwas_catalog_sumstats_susie/credible_sets_clean/
264293
gs://gwas_catalog_sumstats_susie/finemapping_logs/
