feat(gwas_catalog): sumstat harmonisation dag (#55)
* feat(gentropy_image): gentropy image overloaded by orchestration config
* feat(gwas_catalog): harmonisation dag
* chore: cleanup
* feat: add pandas as dependency
* chore: format yaml files
* chore: update dag name
* chore: updated docs about the harmonisation
* chore: typos
* build: run image build on changed file only
* chore: bump gentropy base image
* test: command to prepare the test environment for harmonisation test
* feat: batch operator update
* feat: integration test approach
* feat: test and prod envs
* chore: drop test cleanup
* fix: ensure all variables are not unbound
* chore: format
* fix: fix path to the full artifact manifest
* chore: typos
---------
Co-authored-by: Szymon Szyszkowski <ss60@mib117351s.internal.sanger.ac.uk>
Co-authored-by: project-defiant <szymonszyszkowski@gmail.com>
docs/README.md
1 addition & 1 deletion
@@ -2,7 +2,7 @@

 This catalog describes how the orchestration works in the current state

-###How to generate dag svg files
+## How to generate dag svg files

 1. Locate your global `airflow.cfg` file and update the [core] dag_folder in `airflow.cfg` to point to the `src` directory of the orchestration repository or set the `AIRFLOW__CORE__DAGS_FOLDER` environment variable.
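
A minimal sketch of the environment-variable route from step 1, assuming the repository is checked out at a hypothetical `$HOME/orchestration` path (`ORCHESTRATION_REPO` is an illustrative name, not something defined in this commit). Note that Airflow spells the option `dags_folder` in the `[core]` section. The check at the end runs only if the `airflow` CLI is installed.

```shell
# Hypothetical checkout location; adjust to wherever the orchestration repository lives.
ORCHESTRATION_REPO="${ORCHESTRATION_REPO:-$HOME/orchestration}"

# Airflow reads AIRFLOW__CORE__DAGS_FOLDER in preference to [core] dags_folder in airflow.cfg.
export AIRFLOW__CORE__DAGS_FOLDER="$ORCHESTRATION_REPO/src"

# Print the value Airflow resolves, but only when the airflow CLI is available.
if command -v airflow >/dev/null 2>&1; then
    airflow config get-value core dags_folder
fi
```

Setting the variable keeps the global `airflow.cfg` untouched, which may be convenient when the same Airflow installation serves several repositories.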
@@ -116,7 +116,7 @@ This file contains logs from the harmonisation script collected during it's exec
 <details>
 <summary>Expand to see the example</summary>

-```
+```bash
 [2024.10.14 15:33] Copying raw summary statistics from gs://gwas_catalog_inputs/raw_summary_statistics/GCST90078001-GCST90079000/GCST90079000/harmonised/GCST90079000.h.tsv.gz to GCST90079000.h.tsv.gz
 [2024.10.14 15:34] Raw file size 17M
 [2024.10.14 15:34] Unzipping GCST90079000.h.tsv.gz to GCST90079000.h.tsv
@@ -179,11 +179,40 @@ datasets: {}

 This directory contains various analysis performed on harmonisation results.

+## Gwas catalog harmonisation & qc dag
+
+The `gwas_catalog_harmonisation` dag is used to perform the harmonisation and quality checks on the raw summary statistics. The dag configuration and topology can be found in `gwas_catalog_harmonisation.yaml` file under the config directory. Since this task is computationally expensive, it is run in parallel by the google batch operators. The dag contains 2 steps:
+
+1. Harmonisation done by [gwas_catalog_sumstat_preprocess](https://opentargets.github.io/gentropy/python_api/steps/gwas_catalog_sumstat_preprocess/)
+2. Quality Control of the harmonised summary statistics done by [sumstat_qc_step](https://opentargets.github.io/gentropy/python_api/steps/summary_statistics_qc/)
+
+To run the dag, one need to prepare the input files and gentropy overwritten docker image.
+
+### Gentropy overwritten docker image
+
+The image in the `/images/gentropy/Dockerfile` is based on the [gentropy image](https://github.com/opentargets/gentropy/blob/dev/Dockerfile). The additional packages are added to the image to make it compatible with Open Targets infrastructure in google cloud, that include:
+
+- google cloud sdk (with gsutil)
+- bash script to run the gentropy harmonisation pipeline
+
+> [!WARNING]
+> Before running the harmonisation pipeline (`gwas_catalog_harmonisation` dag) it is necessary to update the base docker container to reflect the changes in the `gentropy` image. This is done by running the `make build-gentropy-gcs-image` command run in the root of the repository.
+
+## Gentropy image
+
+The image in this directory is based on the [gentropy image](https://github.com/opentargets/gentropy/blob/dev/Dockerfile). The additional packages are added to the image to make it compatible with the Open Targets Platform, that include:
+
+- google cloud sdk (with gsutil)
+- bash script to run the gentropy harmonisation pipeline
+
+> [!WARNING]
+> Before running the harmonisation pipeline (`gwas_catalog_harmonisation` dag) it is necessary to update the base docker container to reflect the changes in the `gentropy` image. This is done by running the `make build-gentropy-gcs-image` command run in the root of the repository.
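
The ordering in the warning above (rebuild the image first, then run the dag) could be enforced with a small wrapper. This is only a sketch under assumptions: the commit does not say how the dag is started, so triggering it with the Airflow CLI is a guess, and `REPO_ROOT` and `run_harmonisation` are hypothetical names. The make target and the dag id are the ones named in the added docs.

```shell
# REPO_ROOT is a hypothetical variable pointing at the repository root (defaults to the current directory).
REPO_ROOT="${REPO_ROOT:-.}"

run_harmonisation() {
    # Rebuild the overwritten gentropy image first; the docs say this target runs from the repository root.
    make -C "$REPO_ROOT" build-gentropy-gcs-image || {
        echo "image build failed; not triggering the dag" >&2
        return 1
    }
    # Assumption: the dag is started through the Airflow CLI rather than the web UI.
    airflow dags trigger gwas_catalog_harmonisation
}
```

Calling `run_harmonisation` from the repository root then starts the dag only when the image build exits successfully.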