refactor: remove gene_index step #946
Very thorough refactoring of this huge codebase! The gene index is phased out in favour of the target index, which is now shared between the platform and gentropy. Luckily, not much logic needed refactoring.
@@ -28,16 +27,16 @@ Available options:
 Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
 ```

-As indicated, you can run a step by specifying the step's name with the `step` argument. For example, to run the `gene_index` step, you can run:
+As indicated, you can run a step by specifying the step's name with the `step` argument. For example, to run the `gwas_catalog_sumstat_preprocess` step, you can run:
Thanks for updating the documentation! I keep forgetting about it.
def mock_target_index(spark: SparkSession) -> TargetIndex:
    """Mock target index dataset."""
    ti_schema = TargetIndex.get_schema()
Theoretical question: wouldn't it make more sense to use an actual data sample from the target index?
Probably? There is a `sample_target_index` right above it in the code, but it doesn't seem to be used.
✨ Context
Gentropy uses a `gene_index` that is generated from the target index in the platform etl via the `gene_index` step. The `gene_index` step is redundant, and we now want to use the target index from the platform etl directly.

This PR is also connected to these PRs in the orchestration and platform-etl-backend repos.
Note: After this PR is merged, gentropy pipelines will depend on a `target_index` dataset generated by the platform etl that has a `tss` column, so the gentropy pipeline can no longer be run on its own. In case the gentropy pipelines do need to be run on their own, I have generated a patched `target_index` dataset from `gs://open-targets-pre-data-releases/24.12-uo_test-3/output/etl/parquet/targets` with the `tss` column added. The patched `target_index` can be found here: `gs://ot-team/vivien/gentropy_patched_datasets/target_index_with_tss_column`
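Because the pipelines now rely on the `tss` column being present, a pre-flight column check can catch an unpatched dataset early. The following is only a minimal sketch, not code from this PR: `REQUIRED_TARGET_INDEX_COLUMNS` and `missing_columns` are hypothetical names, and the column names are assumed to be available as a plain list (for example from a Spark DataFrame's `columns` attribute).

```python
from collections.abc import Iterable

# Assumption: `tss` is the only column this PR names as required;
# extend the tuple if other columns turn out to be needed.
REQUIRED_TARGET_INDEX_COLUMNS = ("tss",)


def missing_columns(
    available: Iterable[str],
    required: Iterable[str] = REQUIRED_TARGET_INDEX_COLUMNS,
) -> list[str]:
    """Return the required column names that are absent from `available`."""
    available_set = set(available)
    return [name for name in required if name not in available_set]
```

A caller could raise an error when the returned list is non-empty, before any step reads the dataset.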
🛠 What does this PR implement
- Removes the `gene_index` step.
- References to `gene_index` are renamed to `target_index`.
- The `gene_index` essentially comprises a subset of columns from the `target_index`, but some field names differ, so the PR renames fields to be compatible with the `target_index` where necessary.
- The `gene_index` json schema has been replaced with the `target_index` schema derived from a `target_index` dataset generated by the platform etl.
- The `gene_index` step was used as an example in the docs here and here. The `gwas_catalog_sumstat_preprocess` step is now used as the example, as it has inputs of similar complexity.

🙈 Missing
🚦 Before submitting
- … `dev` branch?
- … (`make test`)?
- … (`poetry run pre-commit run --all-files`)?