-
Notifications
You must be signed in to change notification settings - Fork 9
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: gwas catalog top-hit + study step #808
Conversation
@DSuveges, this creates the top-hits logic. It should be ready. Next Sumstats + PICS (and cleanup) |
@@ -332,7 +330,6 @@ def from_source( | |||
return ( | |||
cls._parse_study_table(catalog_studies) | |||
.annotate_ancestries(ancestry_file) | |||
.annotate_sumstats_info(sumstats_lut) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm having second thoughts. Because the 2 DAGs have interdependencies in the business logic, I want to see both working before we review this. I'm reverting to draft |
Example how to run the step:poetry run gentropy step=gwas_catalog_study_index \
step.catalog_study_files="[ '/Users/ochoa/Datasets/gwas_catalog_download_studies.tsv' ]" \
step.catalog_ancestry_files="[ '/Users/ochoa/Datasets/gwas_catalog_download_ancestries.tsv' ]" \
step.study_index_path=/Users/ochoa/Datasets/study_index_annotated \
step.gwas_catalog_study_curation_file=/Users/ochoa/Datasets/20241004_output_curation.tsv \
step.sumstats_qc_path=/Users/ochoa/Datasets/20241015_GwasCatQCLogs Copy of the outputs
Breakdown of GWAS Catalog flags
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Business logic looks good. There is minor bug might break the step when reading the curation file. (Suspect that this is copilot issue).
The logic of this step (and others) raises a single concern. The VariantIndex is calculated at the end of the genetics_etl
. Yet it is required for this step (also for some other ingestions like ukb_ppp).
gwas_catalog_study_curation = StudyIndexGWASCatalogOTCuration.from_csv( | ||
session, gwas_catalog_study_curation_file | ||
) | ||
elif gwas_catalog_study_curation_file.startswith("http"): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Swap the order, the http
can end with .csv
as well. Then it will not work.
) | ||
|
||
if gwas_catalog_study_curation_file: | ||
if gwas_catalog_study_curation_file.endswith(".csv"): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should it not be a .tsv
as in the GWASCatalogStudyIndexGenerationStep
if gwas_catalog_study_curation_file: | ||
if gwas_catalog_study_curation_file.endswith( | ||
".tsv" | ||
) | gwas_catalog_study_curation_file.endswith(".tsv"): | ||
gwas_catalog_study_curation = StudyIndexGWASCatalogOTCuration.from_csv( | ||
session, gwas_catalog_study_curation_file | ||
) | ||
elif gwas_catalog_study_curation_file.startswith("http"): | ||
gwas_catalog_study_curation = StudyIndexGWASCatalogOTCuration.from_url( | ||
session, gwas_catalog_study_curation_file | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure what the code was meant to do here, although I suspect that it could have inferr if the file is http based or not, then based on that if the file is a csv or tsv ??
|
||
# Annotate with sumstats QC if provided: | ||
if sumstats_qc_path: | ||
schema = StructType( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
New Dataset ?
|
||
!!! note This step currently only processes the GWAS Catalog curated list of top hits. | ||
""" | ||
class GWASCatalogTopHitIngestionStep: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NIT, not consistent, everywhere we say these are top hits
""" | ||
# Extract | ||
gnomad_variants = VariantIndex.from_parquet(session, gnomad_variant_path) | ||
gnomad_variants = VariantIndex.from_parquet(session, variant_annotation_path) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this particullar case. Which VariantIndex
dataset should be used to generate the top hits correctly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In theory, it should be the gnomAD variant annotation dataset in the schema of a variantIndex. We haven't run it for a while so I wonder if everything works fine there after the ETL VariantIndex work. This DAG should have no dependency on the ETL VariantIndex
Added changes in ancestry mapping in gwas_population_2_LD_panel_map.json since we are going to remake the study index. These changes add CSA (central south asians) from UKBB to the list of ancestries. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@addramir are you sure this won't break PICS if LDIndex does not contain csa
?
Can you please double-check?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@addramir In that case R will be 0 and the variant should be dropped before PICS.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this case we need to revert the change with changing mapping. I will add a quick temporary fix in susie but in the future we have to change that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reverted
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm approving the PR assuming there might be some surprises along the way when it comes to curation.
Step to run the top-hits in isolation of everything else.
The
GWASCatalogTopHitIngestionStep
included here was run in dataproc in 17 minutes.Subsequent PRs will handle GWAS Catalog + Summary Statistics + PICS road which might result in removing some steps.