Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor: remove gene_index step #946

Merged
merged 19 commits into from
Jan 22, 2025
Merged

Conversation

vivienho
Copy link
Contributor

@vivienho vivienho commented Dec 12, 2024

✨ Context

Gentropy uses a gene_index that is generated from the target index in the platform etl via the gene_index step. The gene_index step is redundant, and we now want to use the target index from the platform etl directly.

This PR is also connected to these PRs in the orchestration and platform-etl-backend repos.

Note: After this PR is merged, gentropy pipelines will depend on a target_index dataset generated by the platform etl that has a tss column and thus the gentropy pipeline cannot be run on its own. I have generated a patched target_index dataset using gs://open-targets-pre-data-releases/24.12-uo_test-3/output/etl/parquet/targets patched with the tss column, just in case the gentropy pipelines need to be run on their own. The patched target_index can be found here: gs://ot-team/vivien/gentropy_patched_datasets/target_index_with_tss_column

🛠 What does this PR implement

  • The PR removes all files/code related to the gene_index step.
  • All files/code with gene_index are renamed to target_index.
  • Although the gene_index essentially comprises of a subset of columns from the target_index, some field names differ, so the PR renames field names to be compatible with the target_index where necessary.
  • The gene_index json schema has been replaced with the target_index schema derived from a target_index dataset generated by the platform etl.
  • The gene_index step was used an example in the docs here and here. The gwas_catalog_sumstat_preprocess step is now used as the example as it has inputs of similar complexity.

🙈 Missing

🚦 Before submitting

  • Do these changes cover one single feature (one change at a time)?
  • Did you read the contributor guideline?
  • Did you make sure to update the documentation with your changes?
  • Did you make sure there is no commented out code in this PR?
  • Did you follow conventional commits standards in PR title and commit messages?
  • Did you make sure the branch is up-to-date with the dev branch?
  • Did you write any new necessary tests?
  • Did you make sure the changes pass local tests (make test)?
  • Did you make sure the changes pass pre-commit rules (e.g poetry run pre-commit run --all-files)?

@github-actions github-actions bot added size-XL and removed size-M labels Jan 20, 2025
@vivienho vivienho marked this pull request as ready for review January 20, 2025 16:12
@vivienho vivienho requested a review from DSuveges January 20, 2025 16:12
Copy link
Contributor

@DSuveges DSuveges left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very thorough refactoring of this huge codebase! The target index is phased out in favour of the target index, which is now shared between the platform and gentropy. Luckily there was not much logic refactoring.

@@ -28,16 +27,16 @@ Available options:
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
```

As indicated, you can run a step by specifying the step's name with the `step` argument. For example, to run the `gene_index` step, you can run:
As indicated, you can run a step by specifying the step's name with the `step` argument. For example, to run the `gwas_catalog_sumstat_preprocess` step, you can run:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for updating the documentation! I keep forgetting about it.

Comment on lines +572 to +574
def mock_target_index(spark: SparkSession) -> TargetIndex:
"""Mock target index dataset."""
ti_schema = TargetIndex.get_schema()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Theoretical question: wouldn't make more sense to use an actual data sample from the target index?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably? There is a sample_target_index right above it in the code, but it doesn't seem to be used.

@DSuveges DSuveges merged commit c847921 into dev Jan 22, 2025
5 checks passed
@DSuveges DSuveges deleted the vh-gene-index-to-target-index branch January 22, 2025 14:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants