feat: add interval logic for l2g features #812

Open
wants to merge 43 commits into
base: dev

@xyg123 xyg123 commented Oct 3, 2024

✨ Context

Adding interval-based features to the L2G model, based on the feature list (opentargets/issues#3521).
opentargets/issues#3512

🛠 What does this PR implement

  • Implementation of PCHIC-based interval features for the L2G gene prediction model.
  • Added back interval processing steps into the L2G feature generation step.

🙈 Missing

More features from the Andersson and Thurman interval sources.

🚦 Before submitting

  • Do these changes cover one single feature (one change at a time)?
  • Did you read the contributor guideline?
  • Did you make sure to update the documentation with your changes?
  • Did you make sure there is no commented out code in this PR?
  • Did you follow conventional commits standards in PR title and commit messages?
  • Did you make sure the branch is up-to-date with the dev branch?
  • Did you write any new necessary tests?
  • Did you make sure the changes pass local tests (make test)?
  • Did you make sure the changes pass pre-commit rules (e.g. poetry run pre-commit run --all-files)?

@addramir addramir requested a review from ireneisdoomed October 5, 2024 07:21
# feature will be the same for any gene associated with a studyLocus)
local_max.withColumn(
"regional_maximum",
f.max(local_feature_name).over(Window.partitionBy("studyLocusId")),

Why is this the maximum? According to the table and what we discussed, shouldn't it be the mean?
https://docs.google.com/spreadsheets/d/1wUs1AprRCCGItZmgDhc1fF5BtwCSosdzFv4NQ8V6Dtg/edit?gid=452826388#gid=452826388
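
To make the question concrete, here is a minimal pure-Python sketch of the two regional aggregates over all genes that share a `studyLocusId`. It is not the PR's Spark code: the helper name `regional_aggregates`, the row layout, and the scores are hypothetical and only illustrate how the choice between maximum and mean changes the value each gene is compared against.

```python
from collections import defaultdict
from statistics import mean


def regional_aggregates(rows, feature):
    """Return {studyLocusId: (regional_max, regional_mean)} for one feature.

    Hypothetical helper: it mirrors the idea of aggregating a local feature
    over every gene in the same studyLocus, without using Spark.
    """
    by_locus = defaultdict(list)
    for row in rows:
        by_locus[row["studyLocusId"]].append(row[feature])
    return {locus: (max(vals), mean(vals)) for locus, vals in by_locus.items()}


# Made-up scores for illustration only.
rows = [
    {"studyLocusId": "sl1", "geneId": "g1", "pchicMean": 0.75},
    {"studyLocusId": "sl1", "geneId": "g2", "pchicMean": 0.25},
    {"studyLocusId": "sl2", "geneId": "g3", "pchicMean": 0.5},
]
aggregates = regional_aggregates(rows, "pchicMean")
regional_max, regional_mean = aggregates["sl1"]
```

With the maximum, the top-scoring gene in a locus is compared against its own score; with the mean, it is compared against the average of all genes in the locus.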

@ireneisdoomed ireneisdoomed left a comment

Thank you for the changes Jack!!!

The logic to build the features looks good! Please see my comments, but they are more along the lines of how we process the interval data in the L2G step.
I suggested processing all interval sources together to make the process simpler. However, since the code is already set up to take source names and paths individually, and changing that would be messy, it's also fine to leave it as is, as long as the interval_paths parameter is correctly configured.

The implemented changes wouldn't run, because an Interval dataset is created with a mismatching schema. I would encourage you to:

  • Add any features you add to the test_l2g_feature_matrix.py suite, to make sure that the code doesn't crash
  • In the same file, add a semantic test for the common logic
  • Update the documentation pages
  • Pull dev branch to bring the changes to the feature matrix step

@github-actions github-actions bot added size-L and removed size-M labels Oct 17, 2024
xyg123 commented Dec 16, 2024

We have investigated the intervals-only V2G dataset. The problem is that, within a single interval source, one variant can carry interval information for multiple genes (up to 200). In addition, the intervals-only V2G dataset is too big: it can potentially be 4x the size of the variant index.

Therefore, it is not feasible to include interval data inside the variant index. Instead, the "processed interval" dataset (which requires the gene index) will be generated each release, and the feature generation step will intersect the variants found within each studyLocus with the "processed interval" dataset, using an overlap approach, to generate the scores needed for the interval features.
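
As a rough illustration of the overlap approach described above, the sketch below matches variant positions against interval records and returns one score per variant and gene. Everything in it is an assumption made for the example: the `Interval` fields, the inclusive coordinates, the tuple layout for variants, and the simple per-chromosome loop, which stands in for whatever join the real step performs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    """Hypothetical record of the "processed interval" dataset."""

    chromosome: str
    start: int
    end: int
    gene_id: str
    score: float


def overlap_scores(variants, intervals):
    """Return (variant_id, gene_id, score) for every variant inside an interval.

    `variants` is an iterable of (variant_id, chromosome, position) tuples.
    Coordinates are treated as inclusive on both ends (an assumption).
    """
    by_chrom = {}
    for iv in intervals:
        by_chrom.setdefault(iv.chromosome, []).append(iv)

    hits = []
    for variant_id, chrom, pos in variants:
        for iv in by_chrom.get(chrom, []):
            if iv.start <= pos <= iv.end:
                hits.append((variant_id, iv.gene_id, iv.score))
    return hits


# Made-up data: one variant overlaps two genes' intervals, one overlaps none.
intervals = [
    Interval("1", 100, 200, "ENSG_A", 0.8),
    Interval("1", 150, 300, "ENSG_B", 0.4),
    Interval("2", 100, 200, "ENSG_C", 0.9),
]
variants = [("1_160_A_G", "1", 160), ("1_50_C_T", "1", 50)]
hits = overlap_scores(variants, intervals)
```

A single variant can produce several rows, one per overlapping gene, which is the one-to-many relationship behind the size problem described above.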

@github-actions github-actions bot added the documentation Improvements or additions to documentation label Dec 19, 2024
@xyg123 xyg123 requested a review from ireneisdoomed December 19, 2024 12:22
@ireneisdoomed ireneisdoomed left a comment

Thank you for your additions! It looks like the code is not yet ready so please look at my comments.

In general, I think that before integrating anything into dev, we should retrain the model and make a more informed decision about what and how we want to add it.

Like, it's a naive q, but is there a way to aggregate this data so that the feature represents if the credible set is part of any regulatory region that we know is physically close to the gene? As opposed to having separate features per source.
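
One possible reading of this idea, sketched in plain Python: collapse the per-source scores into a single flag per (studyLocusId, geneId) pair that says whether any source links the credible set to the gene. The function name, the input layout, and the threshold are hypothetical, and this is only one of several ways the sources could be combined.

```python
def any_interval_support(scores_by_source, threshold=0.0):
    """Collapse per-source scores into {(studyLocusId, geneId): 0 or 1}.

    `scores_by_source` maps a source name to {(studyLocusId, geneId): score}.
    A pair gets 1 if at least one source scores it above `threshold`.
    """
    combined = {}
    for source_scores in scores_by_source.values():
        for key, score in source_scores.items():
            hit = 1 if score > threshold else 0
            combined[key] = max(combined.get(key, 0), hit)
    return combined


# Made-up scores; the source names are placeholders.
scores_by_source = {
    "source_a": {("sl1", "g1"): 0.7, ("sl1", "g2"): 0.0},
    "source_b": {("sl1", "g2"): 0.0, ("sl2", "g3"): 0.2},
}
support = any_interval_support(scores_by_source)
```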

Based on the code, I guess you have agreed that Intervals will be written to disk and then the features will be extracted in the FM generation.

Correct me if I'm wrong, but based on your comment (#812 (comment)) the heavy part of processing interval data is not so much parsing the raw data, but getting the V2G relationships because of the high data volume. Have you considered doing the Interval parsing as part of the FM generation step?

In essence, I'd like to see the impact of these features before making a decision.

processed_interval_path: str,
liftover_max_length_difference: int = 100,
) -> None:
"""Run Variant-to-gene (V2G) step.

update

Please write the name of the module in lower case

The step is in gentropy's source root. Datasource is the folder where we store data parsers.


gene_index_path: str = MISSING
liftover_chain_file_path: str = MISSING
max_distance: int = 250_000

Why not 500kb?

# intervals
"pchicMean",
"pchicMeanNeighbourhood",
"enhTssCorrelationMean",

I'd like more readable feature names to pick up what they represent easily

"pchicMean",
"pchicMeanNeighbourhood",
"enhTssCorrelationMean",
"enhTssCorrelationMeanNeighbourhood",

I'd like more readable feature names to pick up what they represent easily

class EnhTssCorrelationMeanFeature(L2GFeature):
"""Average weighted Enhancer-TSS correlation between studylocus and gene TSS."""

fill_na_value = 0 # would be 0 if implemented

same as above

class DhsPmtrCorrelationMeanFeature(L2GFeature):
"""Average weighted DHS-promoter correlation between studylocus and gene TSS."""

fill_na_value = 0 # would be 0 if implemented

same as above

Compared to the Mean weighted DHS-promoter correlation for all genes in the vicinity.
"""

fill_na_value = 0 # would be 0 if implemented

same as above

@@ -33,7 +35,7 @@ def __init__(
self,
session: Session,
*,
- features_list: list[str],
+ features_list: list[str] = LocusToGeneFeatureMatrixConfig().features_list,

Instantiating LocusToGeneFeatureMatrixConfig will call the session class by default. We tend to avoid calling step configs
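
For context, Python evaluates default argument values once, when the function is defined, so a default like the one above would build the config object as soon as the module is imported. A common alternative is a `None` sentinel resolved inside `__init__`. The sketch below assumes a hypothetical module-level `DEFAULT_FEATURES` constant and a stand-in class name; it is not the step's actual signature.

```python
from __future__ import annotations

# Hypothetical constant holding the default feature names.
DEFAULT_FEATURES = ("pchicMean", "pchicMeanNeighbourhood")


class FeatureMatrixStepSketch:
    """Stand-in for the step class; only the default handling is shown."""

    def __init__(self, *, features_list: list[str] | None = None) -> None:
        # Resolve the default at call time, so nothing is instantiated
        # when the module is imported.
        if features_list is None:
            features_list = list(DEFAULT_FEATURES)
        self.features_list = list(features_list)


default_step = FeatureMatrixStepSketch()
custom_step = FeatureMatrixStepSketch(features_list=["enhTssCorrelationMean"])
```

Copying the list in `__init__` also keeps callers from sharing one mutable default.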

{
"metadata": {},
"name": "geneId",
- "nullable": false,
+ "nullable": true,

Why nullable?

3 participants