feat: gwas catalog top-hit + study step #808

d0choa · 2024-10-02T14:27:25Z

A step to run the top-hits ingestion in isolation from everything else.

The GWASCatalogTopHitIngestionStep included here ran on Dataproc in 17 minutes.

Subsequent PRs will handle the GWAS Catalog + Summary Statistics + PICS route, which might result in removing some steps.

d0choa · 2024-10-02T15:02:46Z

@DSuveges, this creates the top-hits logic. It should be ready. Next: Sumstats + PICS (and cleanup).

d0choa · 2024-10-02T16:34:35Z

src/gentropy/datasource/gwas_catalog/study_index.py

@@ -332,7 +330,6 @@ def from_source(
        return (
            cls._parse_study_table(catalog_studies)
            .annotate_ancestries(ancestry_file)
-            .annotate_sumstats_info(sumstats_lut)


This change can be problematic: not for top-hits or sumstats processing, but for the generation of the curation table. @DSuveges we can discuss, but I think @addramir was suggesting not to worry too much about curation for now.

d0choa · 2024-10-02T18:58:49Z

I'm having second thoughts. Because the 2 DAGs have interdependencies in the business logic, I want to see both working before we review this.

I'm reverting to draft

d0choa · 2024-10-16T19:04:44Z

Example of how to run the step:

poetry run gentropy step=gwas_catalog_study_index \
step.catalog_study_files="[ '/Users/ochoa/Datasets/gwas_catalog_download_studies.tsv' ]" \
step.catalog_ancestry_files="[ '/Users/ochoa/Datasets/gwas_catalog_download_ancestries.tsv' ]" \
step.study_index_path=/Users/ochoa/Datasets/study_index_annotated \
step.gwas_catalog_study_curation_file=/Users/ochoa/Datasets/20241004_output_curation.tsv \
step.sumstats_qc_path=/Users/ochoa/Datasets/20241015_GwasCatQCLogs

Copy of the outputs

gs://ot-team/dochoa/study_index_annotated

Breakdown of GWAS Catalog flags

+----------+-----------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----+
|studyType |analysisFlags          |qualityControls                                                                                                                                                                                                                                    |count|
+----------+-----------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----+
|gwas      |[]                     |[GWAS Catalog study has not been curated by Open Targets., Harmonized summary statistics are not available or empty.]                                                                                                                              |54168|
|gwas      |[]                     |[]                                                                                                                                                                                                                                                 |16842|
|pQTL      |[]                     |[]                                                                                                                                                                                                                                                 |16533|
|gwas      |[Metabolite]           |[]                                                                                                                                                                                                                                                 |12064|
|gwas      |[ExWAS]                |[Harmonized summary statistics are not available or empty.]                                                                                                                                                                                        |3695 |
|gwas      |[Metabolite]           |[Harmonized summary statistics are not available or empty.]                                                                                                                                                                                        |2849 |
|gwas      |[]                     |[GWAS Catalog study has not been curated by Open Targets.]                                                                                                                                                                                         |2145 |
|Microbiome|[]                     |[]                                                                                                                                                                                                                                                 |1128 |
|gwas      |[Metabolite]           |[The number of SNPs in the study is below the expected threshold.]                                                                                                                                                                                 |1033 |
|gwas      |[]                     |[The mean beta QC check value is not within the expected range.]                                                                                                                                                                                   |793  |
|gwas      |[]                     |[Harmonized summary statistics are not available or empty.]                                                                                                                                                                                        |604  |
|Microbiome|[]                     |[The mean beta QC check value is not within the expected range.]                                                                                                                                                                                   |414  |
|gwas      |[ExWAS]                |[The number of SNPs in the study is below the expected threshold.]                                                                                                                                                                                 |294  |
|gwas      |[]                     |[The PZ QC check values are not within the expected range.]                                                                                                                                                                                        |267  |
|gwas      |[]                     |[The number of SNPs in the study is below the expected threshold.]                                                                                                                                                                                 |255  |
|gwas      |[Multivariate analysis]|[Harmonized summary statistics are not available or empty.]                                                                                                                                                                                        |248  |
|pQTL      |[]                     |[The number of SNPs in the study is below the expected threshold.]                                                                                                                                                                                 |200  |
|Microbiome|[]                     |[Harmonized summary statistics are not available or empty.]                                                                                                                                                                                        |138  |
|gwas      |[GxE]                  |[Harmonized summary statistics are not available or empty.]                                                                                                                                                                                        |73   |
|gwas      |[Case-case study]      |[]                                                                                                                                                                                                                                                 |68   |
|gwas      |[GxE]                  |[]                                                                                                                                                                                                                                                 |51   |
|gwas      |[Non-additive model]   |[]                                                                                                                                                                                                                                                 |45   |
|gwas      |[]                     |[GWAS Catalog study has not been curated by Open Targets., The number of SNPs in the study is below the expected threshold.]                                                                                                                       |41   |
|gwas      |[GxG]                  |[Harmonized summary statistics are not available or empty.]                                                                                                                                                                                        |29   |
|gwas      |[]                     |[The GC lambda value is not within the expected range., The number of SNPs in the study is below the expected threshold.]                                                                                                                          |27   |
|gwas      |[Non-additive model]   |[The mean beta QC check value is not within the expected range., The PZ QC check values are not within the expected range., The GC lambda value is not within the expected range.]                                                                 |25   |
|gwas      |[]                     |[GWAS Catalog study has not been curated by Open Targets., The PZ QC check values are not within the expected range., The number of SNPs in the study is below the expected threshold.]                                                            |21   |
|gwas      |[GxG]                  |[]                                                                                                                                                                                                                                                 |21   |
|Microbiome|[]                     |[The PZ QC check values are not within the expected range.]                                                                                                                                                                                        |20   |
|gwas      |[]                     |[The mean beta QC check value is not within the expected range., The PZ QC check values are not within the expected range.]                                                                                                                        |16   |
|gwas      |[GxE]                  |[The number of SNPs in the study is below the expected threshold.]                                                                                                                                                                                 |16   |
|gwas      |[Multivariate analysis]|[]                                                                                                                                                                                                                                                 |15   |
|gwas      |[]                     |[The mean beta QC check value is not within the expected range., The GC lambda value is not within the expected range., The number of SNPs in the study is below the expected threshold.]                                                          |11   |
|gwas      |[Non-additive model]   |[The PZ QC check values are not within the expected range.]                                                                                                                                                                                        |11   |
|gwas      |[]                     |[The mean beta QC check value is not within the expected range., The PZ QC check values are not within the expected range., The GC lambda value is not within the expected range.]                                                                 |10   |
|gwas      |[]                     |[GWAS Catalog study has not been curated by Open Targets., The PZ QC check values are not within the expected range.]                                                                                                                              |9    |
|gwas      |[]                     |[The PZ QC check values are not within the expected range., The number of SNPs in the study is below the expected threshold.]                                                                                                                      |9    |
|gwas      |[]                     |[GWAS Catalog study has not been curated by Open Targets., The mean beta QC check value is not within the expected range., The PZ QC check values are not within the expected range., The GC lambda value is not within the expected range.]       |7    |
|gwas      |[]                     |[The GC lambda value is not within the expected range.]                                                                                                                                                                                            |7    |
|gwas      |[Case-case study]      |[Harmonized summary statistics are not available or empty.]                                                                                                                                                                                        |7    |
|gwas      |[Non-additive model]   |[The mean beta QC check value is not within the expected range., The PZ QC check values are not within the expected range.]                                                                                                                        |6    |
|gwas      |[]                     |[The mean beta QC check value is not within the expected range., The number of SNPs in the study is below the expected threshold.]                                                                                                                 |6    |
|gwas      |[]                     |[The mean beta QC check value is not within the expected range., The GC lambda value is not within the expected range.]                                                                                                                            |5    |
|pQTL      |[]                     |[The PZ QC check values are not within the expected range.]                                                                                                                                                                                        |3    |
|gwas      |[Metabolite]           |[The PZ QC check values are not within the expected range.]                                                                                                                                                                                        |3    |
|Microbiome|[]                     |[The PZ QC check values are not within the expected range., The number of SNPs in the study is below the expected threshold.]                                                                                                                      |3    |
|gwas      |[]                     |[GWAS Catalog study has not been curated by Open Targets., The GC lambda value is not within the expected range., The number of SNPs in the study is below the expected threshold.]                                                                |2    |
|gwas      |[ExWAS]                |[]                                                                                                                                                                                                                                                 |2    |
|gwas      |[]                     |[GWAS Catalog study has not been curated by Open Targets., The mean beta QC check value is not within the expected range., The PZ QC check values are not within the expected range.]                                                              |2    |
|gwas      |[]                     |[GWAS Catalog study has not been curated by Open Targets., The mean beta QC check value is not within the expected range.]                                                                                                                         |2    |
|gwas      |[Case-case study]      |[The PZ QC check values are not within the expected range.]                                                                                                                                                                                        |2    |
|gwas      |[]                     |[GWAS Catalog study has not been curated by Open Targets., The mean beta QC check value is not within the expected range., The number of SNPs in the study is below the expected threshold.]                                                       |2    |
|gwas      |[]                     |[GWAS Catalog study has not been curated by Open Targets., The PZ QC check values are not within the expected range., The GC lambda value is not within the expected range.]                                                                       |1    |
|gwas      |[]                     |[The PZ QC check values are not within the expected range., The GC lambda value is not within the expected range.]                                                                                                                                 |1    |
|gwas      |[]                     |[The mean beta QC check value is not within the expected range., The PZ QC check values are not within the expected range., The number of SNPs in the study is below the expected threshold.]                                                      |1    |
|gwas      |[Non-additive model]   |[The PZ QC check values are not within the expected range., The GC lambda value is not within the expected range.]                                                                                                                                 |1    |
|gwas      |[]                     |[GWAS Catalog study has not been curated by Open Targets., The mean beta QC check value is not within the expected range., The GC lambda value is not within the expected range., The number of SNPs in the study is below the expected threshold.]|1    |
+----------+-----------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----+
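A breakdown like the one above is a grouped count over three study-index columns. In Spark that is typically a `groupBy(...).count()` sorted by descending count; what follows is a minimal plain-Python sketch of the same tally, assuming each study is available as a dict with `studyType`, `analysisFlags` and `qualityControls` keys. The `flag_breakdown` helper and the sample rows are hypothetical and are not taken from the PR.

```python
from collections import Counter


def flag_breakdown(studies):
    """Count studies per (studyType, analysisFlags, qualityControls) combination."""
    counts = Counter(
        (
            study["studyType"],
            tuple(study["analysisFlags"]),  # lists are unhashable, so use tuples as keys
            tuple(study["qualityControls"]),
        )
        for study in studies
    )
    # most_common() sorts by descending count, like the table above.
    return counts.most_common()


# Hypothetical sample rows, for illustration only.
sample = [
    {"studyType": "gwas", "analysisFlags": [], "qualityControls": []},
    {"studyType": "gwas", "analysisFlags": [], "qualityControls": []},
    {"studyType": "pQTL", "analysisFlags": [], "qualityControls": []},
    {"studyType": "gwas", "analysisFlags": ["Metabolite"], "qualityControls": []},
]
breakdown = flag_breakdown(sample)
```

Each entry of `breakdown` pairs one flag combination with the number of studies that carry it.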

project-defiant

Business logic looks good. There is a minor bug that might break the step when reading the curation file (I suspect this is a Copilot issue).

The logic of this step (and others) raises a single concern. The VariantIndex is calculated at the end of the genetics_etl. Yet it is required for this step (also for some other ingestions like ukb_ppp).

src/gentropy/datasource/gwas_catalog/associations.py

project-defiant · 2024-10-17T13:23:20Z

src/gentropy/gwas_catalog_study_curation.py

+                gwas_catalog_study_curation = StudyIndexGWASCatalogOTCuration.from_csv(
+                    session, gwas_catalog_study_curation_file
+                )
+            elif gwas_catalog_study_curation_file.startswith("http"):


Swap the order: an http URL can end with .csv as well, in which case the first branch matches and the URL is never handled by the http branch.

project-defiant · 2024-10-17T13:26:43Z

src/gentropy/gwas_catalog_study_curation.py

-        )
+
+        if gwas_catalog_study_curation_file:
+            if gwas_catalog_study_curation_file.endswith(".csv"):


Should it not be a .tsv, as in the GWASCatalogStudyIndexGenerationStep?

project-defiant · 2024-10-17T15:00:39Z

src/gentropy/gwas_catalog_study_index.py

+        if gwas_catalog_study_curation_file:
+            if gwas_catalog_study_curation_file.endswith(
+                ".tsv"
+            ) | gwas_catalog_study_curation_file.endswith(".tsv"):
+                gwas_catalog_study_curation = StudyIndexGWASCatalogOTCuration.from_csv(
+                    session, gwas_catalog_study_curation_file
+                )
+            elif gwas_catalog_study_curation_file.startswith("http"):
+                gwas_catalog_study_curation = StudyIndexGWASCatalogOTCuration.from_url(
+                    session, gwas_catalog_study_curation_file
+                )


I am not sure what the code was meant to do here, although I suspect it was meant to infer whether the file is http-based or not and then, based on that, whether the file is a csv or a tsv?
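One way to read the review comments on this branch logic: decide first whether the path is a URL, and only then look at the extension, so that a URL ending in `.csv` or `.tsv` is not mistaken for a local file. A minimal sketch under that assumption follows. `choose_curation_loader` is a hypothetical helper that only returns the name of the loader to call (`from_url` or `from_csv`, as named in the diff); it does not touch the real gentropy classes, and raising on unrecognised input is an assumption of this sketch, since the diff has no such branch.

```python
from typing import Optional


def choose_curation_loader(curation_file: Optional[str]) -> Optional[str]:
    """Return which loader to use for a curation file, or None if no file is given."""
    if not curation_file:
        return None
    # Check for URLs first: a URL may also end in .csv or .tsv.
    if curation_file.startswith(("http://", "https://")):
        return "from_url"
    # str.endswith accepts a tuple, so both extensions are covered in one call.
    if curation_file.endswith((".csv", ".tsv")):
        return "from_csv"
    raise ValueError(f"Unsupported curation file: {curation_file}")
```

Passing a tuple to `endswith` also avoids repeating the same extension twice, as happens in the quoted condition.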

project-defiant · 2024-10-17T15:01:22Z

src/gentropy/gwas_catalog_study_index.py

+
+        # Annotate with sumstats QC if provided:
+        if sumstats_qc_path:
+            schema = StructType(


New Dataset?

project-defiant · 2024-10-17T15:02:41Z

src/gentropy/gwas_catalog_top_hits.py

-
-    !!! note This step currently only processes the GWAS Catalog curated list of top hits.
-    """
+class GWASCatalogTopHitIngestionStep:


NIT: not consistent; everywhere else we say these are top hits.

project-defiant · 2024-10-17T15:04:43Z

src/gentropy/gwas_catalog_top_hits.py

        """
        # Extract
-        gnomad_variants = VariantIndex.from_parquet(session, gnomad_variant_path)
+        gnomad_variants = VariantIndex.from_parquet(session, variant_annotation_path)


In this particular case, which VariantIndex dataset should be used to generate the top hits correctly?

In theory, it should be the gnomAD variant annotation dataset in the schema of a VariantIndex. We haven't run it for a while, so I wonder if everything works fine there after the ETL VariantIndex work. This DAG should have no dependency on the ETL VariantIndex.

addramir · 2024-10-17T22:33:34Z

Added changes in ancestry mapping in gwas_population_2_LD_panel_map.json since we are going to remake the study index. These changes add CSA (central south asians) from UKBB to the list of ancestries.

d0choa · 2024-10-18T08:50:05Z

src/gentropy/assets/data/gwas_population_2_LD_panel_map.json

@addramir are you sure this won't break PICS if LDIndex does not contain csa?
Can you please double-check?

What will happen with PICS if the major ancestry code from the study (e.g. csa) is not in the LD index for PICS, @DSuveges @vivienho?

@addramir In that case R will be 0 and the variant should be dropped before PICS.

In this case we need to revert the mapping change. I will add a quick temporary fix in susie, but in the future we have to change that.

@addramir @d0choa can I run the test from the branch to clump the sumstats for susie already?
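The question in this thread, whether every ancestry the mapping points to is actually present in the LD index, can be checked before PICS runs. A minimal sketch, assuming the mapping has been reduced to a plain dict from a GWAS Catalog population label to an LD population code, and that the LD index populations are available as a set. Both structures, the sample values, and the helper name are hypothetical and do not reflect the real layout of gwas_population_2_LD_panel_map.json.

```python
def missing_ld_populations(population_map, ld_index_populations):
    """Return mapped LD population codes that are absent from the LD index, sorted."""
    mapped = set(population_map.values())
    return sorted(mapped - set(ld_index_populations))


# Hypothetical values, for illustration only.
population_map = {
    "European": "nfe",
    "South Asian": "csa",
}
ld_index_populations = {"nfe", "afr"}

missing = missing_ld_populations(population_map, ld_index_populations)
```

A non-empty result would flag codes such as csa before they reach PICS, rather than relying on the affected variants being dropped downstream.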

DSuveges

I'm approving the PR assuming there might be some surprises along the way when it comes to curation.

d0choa added 6 commits October 1, 2024 14:40

fix: wrong step parameter

a479920

fix: persist va_subset

3950602

fix: remove broadcasts

9512771

feat: new gwas_catalog_top_hits step

0a2f309

docs: new step added to documentation

eab6bc9

fix: incorrect target

b09f6cd

d0choa requested a review from DSuveges October 2, 2024 14:27

github-actions bot added documentation Improvements or additions to documentation size-M Step Feature Datasource labels Oct 2, 2024

d0choa marked this pull request as draft October 2, 2024 14:29

d0choa and others added 3 commits October 2, 2024 15:36

fix: failing tests

9308c4d

fix: extra argument

0de744e

Merge branch 'dev' into dsdo_top_hits_step

8988954

d0choa marked this pull request as ready for review October 2, 2024 14:57

d0choa mentioned this pull request Oct 2, 2024

feat: GWAS catalog top-hit DAG opentargets/orchestration#33

Merged

d0choa commented Oct 2, 2024

fix: select does not require hasSumstats anymore

8c9c891

d0choa marked this pull request as draft October 2, 2024 18:59

feat: study inclusion step repurposed into study index step

8663f1d

github-actions bot added size-L and removed size-M labels Oct 3, 2024

d0choa added 3 commits October 3, 2024 16:59

docs: fix path for documentation

fa062b2

feat: remove GWASCatalogIngestionStep as it will no longer be necessary

5023e3a

fix: gwas catalog study curation step

040ae97

d0choa changed the title ~~feat: gwas catalog top-hit step~~ feat: gwas catalog top-hit + study step Oct 3, 2024

DSuveges and others added 9 commits October 8, 2024 16:57

Merge branch 'dev' into dsdo_top_hits_step

aad3799

Merge branch 'dev' into dsdo_top_hits_step

7954957

Merge branch 'dev' into dsdo_top_hits_step

ccefcc9

fix: drop duplicate rows after ingesting associations

056376d

Merge branch 'dev' into dsdo_top_hits_step

014a663

Merge branch 'dev' into dsdo_top_hits_step

e105519

fix: fix in study index ingestion

afdc442

fix: v1

0f8911d

feat: working study index with sumstats qc and curation

97900f3

github-actions bot added the Dataset label Oct 16, 2024

d0choa and others added 4 commits October 16, 2024 20:06

test: deprecate obsoleted testt

5fda7ec

test: remove colon causing tests to fail

8ba7f7d

test: curation quality controls no longer

50a0724

Merge branch 'dev' into dsdo_top_hits_step

c7f4bad

d0choa requested review from project-defiant and removed request for DSuveges October 17, 2024 10:50

d0choa marked this pull request as ready for review October 17, 2024 10:50

project-defiant reviewed Oct 17, 2024

project-defiant and others added 2 commits October 17, 2024 17:09

Merge branch 'dev' into dsdo_top_hits_step

766da20

fix: changing mapping for ancestries adding CSA

1bbc5d9

d0choa commented Oct 18, 2024

vivienho and others added 4 commits October 18, 2024 11:40

Merge branch 'dev' into dsdo_top_hits_step

bbe2234

fix: revert changes in mapping

b874c2e

Merge branch 'dev' into dsdo_top_hits_step

7ccdfae

Merge branch 'dev' into dsdo_top_hits_step

37c9df1

DSuveges approved these changes Oct 22, 2024

DSuveges merged commit df220e9 into dev Oct 22, 2024
5 checks passed

DSuveges deleted the dsdo_top_hits_step branch October 22, 2024 10:25
