feat: gwas catalog top-hit + study step #808

Merged
DSuveges merged 38 commits into dev from dsdo_top_hits_step
Oct 22, 2024

Conversation

d0choa
Collaborator

@d0choa d0choa commented Oct 2, 2024

Step to run the top-hits in isolation from everything else.

The GWASCatalogTopHitIngestionStep included here was run in dataproc in 17 minutes.

Subsequent PRs will handle GWAS Catalog + Summary Statistics + PICS road which might result in removing some steps.

@d0choa d0choa requested a review from DSuveges October 2, 2024 14:27
@github-actions github-actions bot added documentation Improvements or additions to documentation size-M Step Feature Datasource labels Oct 2, 2024
@d0choa d0choa marked this pull request as draft October 2, 2024 14:29
@d0choa d0choa marked this pull request as ready for review October 2, 2024 14:57
@d0choa
Collaborator Author

d0choa commented Oct 2, 2024

@DSuveges, this creates the top-hits logic. It should be ready. Next Sumstats + PICS (and cleanup)

@@ -332,7 +330,6 @@ def from_source(
return (
cls._parse_study_table(catalog_studies)
.annotate_ancestries(ancestry_file)
.annotate_sumstats_info(sumstats_lut)
Collaborator Author

This change can be problematic: not for top-hits or sumstats processing, but for the generation of the curation table. @DSuveges we can discuss it, but I think @addramir was suggesting not to care too much about curation for now.

@d0choa
Collaborator Author

d0choa commented Oct 2, 2024

I'm having second thoughts. Because the 2 DAGs have interdependencies in the business logic, I want to see both working before we review this.

I'm reverting to draft

@d0choa d0choa marked this pull request as draft October 2, 2024 18:59
@github-actions github-actions bot added size-L and removed size-M labels Oct 3, 2024
@d0choa d0choa changed the title feat: gwas catalog top-hit step feat: gwas catalog top-hit + study step Oct 3, 2024
@d0choa
Collaborator Author

d0choa commented Oct 16, 2024

Example of how to run the step:

poetry run gentropy step=gwas_catalog_study_index \
step.catalog_study_files="[ '/Users/ochoa/Datasets/gwas_catalog_download_studies.tsv' ]" \
step.catalog_ancestry_files="[ '/Users/ochoa/Datasets/gwas_catalog_download_ancestries.tsv' ]" \
step.study_index_path=/Users/ochoa/Datasets/study_index_annotated \
step.gwas_catalog_study_curation_file=/Users/ochoa/Datasets/20241004_output_curation.tsv \
step.sumstats_qc_path=/Users/ochoa/Datasets/20241015_GwasCatQCLogs

Copy of the outputs

gs://ot-team/dochoa/study_index_annotated

Breakdown of GWAS Catalog flags

+----------+-----------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----+
|studyType |analysisFlags          |qualityControls                                                                                                                                                                                                                                    |count|
+----------+-----------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----+
|gwas      |[]                     |[GWAS Catalog study has not been curated by Open Targets., Harmonized summary statistics are not available or empty.]                                                                                                                              |54168|
|gwas      |[]                     |[]                                                                                                                                                                                                                                                 |16842|
|pQTL      |[]                     |[]                                                                                                                                                                                                                                                 |16533|
|gwas      |[Metabolite]           |[]                                                                                                                                                                                                                                                 |12064|
|gwas      |[ExWAS]                |[Harmonized summary statistics are not available or empty.]                                                                                                                                                                                        |3695 |
|gwas      |[Metabolite]           |[Harmonized summary statistics are not available or empty.]                                                                                                                                                                                        |2849 |
|gwas      |[]                     |[GWAS Catalog study has not been curated by Open Targets.]                                                                                                                                                                                         |2145 |
|Microbiome|[]                     |[]                                                                                                                                                                                                                                                 |1128 |
|gwas      |[Metabolite]           |[The number of SNPs in the study is below the expected threshold.]                                                                                                                                                                                 |1033 |
|gwas      |[]                     |[The mean beta QC check value is not within the expected range.]                                                                                                                                                                                   |793  |
|gwas      |[]                     |[Harmonized summary statistics are not available or empty.]                                                                                                                                                                                        |604  |
|Microbiome|[]                     |[The mean beta QC check value is not within the expected range.]                                                                                                                                                                                   |414  |
|gwas      |[ExWAS]                |[The number of SNPs in the study is below the expected threshold.]                                                                                                                                                                                 |294  |
|gwas      |[]                     |[The PZ QC check values are not within the expected range.]                                                                                                                                                                                        |267  |
|gwas      |[]                     |[The number of SNPs in the study is below the expected threshold.]                                                                                                                                                                                 |255  |
|gwas      |[Multivariate analysis]|[Harmonized summary statistics are not available or empty.]                                                                                                                                                                                        |248  |
|pQTL      |[]                     |[The number of SNPs in the study is below the expected threshold.]                                                                                                                                                                                 |200  |
|Microbiome|[]                     |[Harmonized summary statistics are not available or empty.]                                                                                                                                                                                        |138  |
|gwas      |[GxE]                  |[Harmonized summary statistics are not available or empty.]                                                                                                                                                                                        |73   |
|gwas      |[Case-case study]      |[]                                                                                                                                                                                                                                                 |68   |
|gwas      |[GxE]                  |[]                                                                                                                                                                                                                                                 |51   |
|gwas      |[Non-additive model]   |[]                                                                                                                                                                                                                                                 |45   |
|gwas      |[]                     |[GWAS Catalog study has not been curated by Open Targets., The number of SNPs in the study is below the expected threshold.]                                                                                                                       |41   |
|gwas      |[GxG]                  |[Harmonized summary statistics are not available or empty.]                                                                                                                                                                                        |29   |
|gwas      |[]                     |[The GC lambda value is not within the expected range., The number of SNPs in the study is below the expected threshold.]                                                                                                                          |27   |
|gwas      |[Non-additive model]   |[The mean beta QC check value is not within the expected range., The PZ QC check values are not within the expected range., The GC lambda value is not within the expected range.]                                                                 |25   |
|gwas      |[]                     |[GWAS Catalog study has not been curated by Open Targets., The PZ QC check values are not within the expected range., The number of SNPs in the study is below the expected threshold.]                                                            |21   |
|gwas      |[GxG]                  |[]                                                                                                                                                                                                                                                 |21   |
|Microbiome|[]                     |[The PZ QC check values are not within the expected range.]                                                                                                                                                                                        |20   |
|gwas      |[]                     |[The mean beta QC check value is not within the expected range., The PZ QC check values are not within the expected range.]                                                                                                                        |16   |
|gwas      |[GxE]                  |[The number of SNPs in the study is below the expected threshold.]                                                                                                                                                                                 |16   |
|gwas      |[Multivariate analysis]|[]                                                                                                                                                                                                                                                 |15   |
|gwas      |[]                     |[The mean beta QC check value is not within the expected range., The GC lambda value is not within the expected range., The number of SNPs in the study is below the expected threshold.]                                                          |11   |
|gwas      |[Non-additive model]   |[The PZ QC check values are not within the expected range.]                                                                                                                                                                                        |11   |
|gwas      |[]                     |[The mean beta QC check value is not within the expected range., The PZ QC check values are not within the expected range., The GC lambda value is not within the expected range.]                                                                 |10   |
|gwas      |[]                     |[GWAS Catalog study has not been curated by Open Targets., The PZ QC check values are not within the expected range.]                                                                                                                              |9    |
|gwas      |[]                     |[The PZ QC check values are not within the expected range., The number of SNPs in the study is below the expected threshold.]                                                                                                                      |9    |
|gwas      |[]                     |[GWAS Catalog study has not been curated by Open Targets., The mean beta QC check value is not within the expected range., The PZ QC check values are not within the expected range., The GC lambda value is not within the expected range.]       |7    |
|gwas      |[]                     |[The GC lambda value is not within the expected range.]                                                                                                                                                                                            |7    |
|gwas      |[Case-case study]      |[Harmonized summary statistics are not available or empty.]                                                                                                                                                                                        |7    |
|gwas      |[Non-additive model]   |[The mean beta QC check value is not within the expected range., The PZ QC check values are not within the expected range.]                                                                                                                        |6    |
|gwas      |[]                     |[The mean beta QC check value is not within the expected range., The number of SNPs in the study is below the expected threshold.]                                                                                                                 |6    |
|gwas      |[]                     |[The mean beta QC check value is not within the expected range., The GC lambda value is not within the expected range.]                                                                                                                            |5    |
|pQTL      |[]                     |[The PZ QC check values are not within the expected range.]                                                                                                                                                                                        |3    |
|gwas      |[Metabolite]           |[The PZ QC check values are not within the expected range.]                                                                                                                                                                                        |3    |
|Microbiome|[]                     |[The PZ QC check values are not within the expected range., The number of SNPs in the study is below the expected threshold.]                                                                                                                      |3    |
|gwas      |[]                     |[GWAS Catalog study has not been curated by Open Targets., The GC lambda value is not within the expected range., The number of SNPs in the study is below the expected threshold.]                                                                |2    |
|gwas      |[ExWAS]                |[]                                                                                                                                                                                                                                                 |2    |
|gwas      |[]                     |[GWAS Catalog study has not been curated by Open Targets., The mean beta QC check value is not within the expected range., The PZ QC check values are not within the expected range.]                                                              |2    |
|gwas      |[]                     |[GWAS Catalog study has not been curated by Open Targets., The mean beta QC check value is not within the expected range.]                                                                                                                         |2    |
|gwas      |[Case-case study]      |[The PZ QC check values are not within the expected range.]                                                                                                                                                                                        |2    |
|gwas      |[]                     |[GWAS Catalog study has not been curated by Open Targets., The mean beta QC check value is not within the expected range., The number of SNPs in the study is below the expected threshold.]                                                       |2    |
|gwas      |[]                     |[GWAS Catalog study has not been curated by Open Targets., The PZ QC check values are not within the expected range., The GC lambda value is not within the expected range.]                                                                       |1    |
|gwas      |[]                     |[The PZ QC check values are not within the expected range., The GC lambda value is not within the expected range.]                                                                                                                                 |1    |
|gwas      |[]                     |[The mean beta QC check value is not within the expected range., The PZ QC check values are not within the expected range., The number of SNPs in the study is below the expected threshold.]                                                      |1    |
|gwas      |[Non-additive model]   |[The PZ QC check values are not within the expected range., The GC lambda value is not within the expected range.]                                                                                                                                 |1    |
|gwas      |[]                     |[GWAS Catalog study has not been curated by Open Targets., The mean beta QC check value is not within the expected range., The GC lambda value is not within the expected range., The number of SNPs in the study is below the expected threshold.]|1    |
+----------+-----------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----+
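
A breakdown like the table above is a group-by count over the three columns `studyType`, `analysisFlags` and `qualityControls`. The original was presumably produced with Spark; the following is only a dependency-free sketch of the same aggregation, and the toy records and the `flag_breakdown` name are invented for illustration:

```python
from collections import Counter


def flag_breakdown(studies):
    """Count studies per (studyType, analysisFlags, qualityControls) combination."""
    counts = Counter(
        # Lists are not hashable, so convert them to tuples for use as keys.
        (s["studyType"], tuple(s["analysisFlags"]), tuple(s["qualityControls"]))
        for s in studies
    )
    # Most frequent combinations first, as in the table.
    return counts.most_common()


# Toy records for illustration only; these are not real study-index rows.
toy_studies = [
    {"studyType": "gwas", "analysisFlags": [], "qualityControls": []},
    {"studyType": "gwas", "analysisFlags": [], "qualityControls": []},
    {"studyType": "pQTL", "analysisFlags": [], "qualityControls": []},
    {"studyType": "gwas", "analysisFlags": ["Metabolite"], "qualityControls": []},
]

for key, count in flag_breakdown(toy_studies):
    print(key, count)
```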

@d0choa d0choa requested review from project-defiant and removed request for DSuveges October 17, 2024 10:50
@d0choa d0choa marked this pull request as ready for review October 17, 2024 10:50
Contributor

@project-defiant project-defiant left a comment

Business logic looks good. There is a minor bug that might break the step when reading the curation file (I suspect this is a Copilot issue).

The logic of this step (and others) raises a single concern. The VariantIndex is calculated at the end of the genetics_etl. Yet it is required for this step (also for some other ingestions like ukb_ppp).

gwas_catalog_study_curation = StudyIndexGWASCatalogOTCuration.from_csv(
session, gwas_catalog_study_curation_file
)
elif gwas_catalog_study_curation_file.startswith("http"):
Contributor

Swap the order of the checks: an http URL can also end with .csv, in which case the URL branch is never reached and loading will not work.

)

if gwas_catalog_study_curation_file:
if gwas_catalog_study_curation_file.endswith(".csv"):
Contributor

Should it not be a .tsv, as in the GWASCatalogStudyIndexGenerationStep?

Comment on lines +57 to +67
if gwas_catalog_study_curation_file:
    if gwas_catalog_study_curation_file.endswith(
        ".tsv"
    ) | gwas_catalog_study_curation_file.endswith(".tsv"):
        gwas_catalog_study_curation = StudyIndexGWASCatalogOTCuration.from_csv(
            session, gwas_catalog_study_curation_file
        )
    elif gwas_catalog_study_curation_file.startswith("http"):
        gwas_catalog_study_curation = StudyIndexGWASCatalogOTCuration.from_url(
            session, gwas_catalog_study_curation_file
        )
Contributor

I am not sure what the code was meant to do here. I suspect it was meant to infer whether the file is http-based or not, and then, based on that, whether the file is a csv or a tsv?
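
The ordering discussed in these comments can be sketched as follows: check for a URL first, then fall back to the file extension. This is only an illustration, not the code that was merged; `pick_curation_loader` is a hypothetical helper that returns the name of the loader to call (`from_url` or `from_csv`, the two methods visible in the diff) rather than calling it.

```python
from urllib.parse import urlparse


def pick_curation_loader(path: str) -> str:
    """Return which loader name to use for a curation file location."""
    # Check the scheme first: a URL may also end in .csv or .tsv.
    if urlparse(path).scheme in ("http", "https"):
        return "from_url"
    # Local or bucket paths: accept both tabular extensions.
    if path.endswith((".tsv", ".csv")):
        return "from_csv"
    raise ValueError(f"Unsupported curation file location: {path}")


print(pick_curation_loader("https://example.org/curation.csv"))
print(pick_curation_loader("/data/curation.tsv"))
```

With this ordering a URL ending in `.csv` is routed to the URL loader, which is the case the review comment on swapping the checks was concerned about.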


# Annotate with sumstats QC if provided:
if sumstats_qc_path:
schema = StructType(
Contributor

New Dataset ?


!!! note This step currently only processes the GWAS Catalog curated list of top hits.
"""
class GWASCatalogTopHitIngestionStep:
Contributor

NIT, not consistent, everywhere we say these are top hits

"""
# Extract
gnomad_variants = VariantIndex.from_parquet(session, gnomad_variant_path)
gnomad_variants = VariantIndex.from_parquet(session, variant_annotation_path)
Contributor

In this particular case, which VariantIndex dataset should be used to generate the top hits correctly?

Collaborator Author

In theory, it should be the gnomAD variant annotation dataset in the schema of a variantIndex. We haven't run it for a while so I wonder if everything works fine there after the ETL VariantIndex work. This DAG should have no dependency on the ETL VariantIndex

@addramir
Contributor

Added changes in ancestry mapping in gwas_population_2_LD_panel_map.json since we are going to remake the study index. These changes add CSA (central south asians) from UKBB to the list of ancestries.

Collaborator Author

@addramir are you sure this won't break PICS if LDIndex does not contain csa?
Can you please double-check?

Contributor

What will happen with PICS if the major ancestry code from the study (e.g. csa) is not in the LD index for PICS, @DSuveges @vivienho?

Contributor

@addramir In that case R will be 0 and the variant should be dropped before PICS.
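
One way to double-check this before regenerating the study index is to compare the LD panel codes referenced by the mapping with the populations actually present in the LD index. A minimal sketch, assuming a simplified mapping shape (ancestry label to a list of panel codes) that may not match the real gwas_population_2_LD_panel_map.json; the function name and the toy values are hypothetical:

```python
def unmapped_ld_populations(panel_map, ld_index_populations):
    """Return panel codes referenced by the mapping but absent from the LD index."""
    referenced = {code for codes in panel_map.values() for code in codes}
    return sorted(referenced - set(ld_index_populations))


# Toy inputs for illustration only; not the real mapping or LD index content.
toy_panel_map = {
    "European": ["nfe"],
    "South Asian": ["csa"],
}
toy_ld_index_populations = {"nfe", "afr"}

missing = unmapped_ld_populations(toy_panel_map, toy_ld_index_populations)
print(missing)
```

Any code reported here would have no LD information, so, per the comment above, the affected variants would be dropped before PICS.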

Contributor

In this case we need to revert the mapping change. I will add a quick temporary fix in susie, but in the future we have to change that.

Contributor

Reverted

Contributor

@addramir @d0choa can I run the test from the branch to clump the sumstats for susie already?

Contributor

@DSuveges DSuveges left a comment

I'm approving the PR assuming there might be some surprises along the way when it comes to curation.

@DSuveges DSuveges merged commit df220e9 into dev Oct 22, 2024
5 checks passed
@DSuveges DSuveges deleted the dsdo_top_hits_step branch October 22, 2024 10:25
5 participants