feat: add interval logic for l2g features #812

xyg123 · 2024-10-03T13:02:08Z

✨ Context

Adding interval-based features to the L2G model, based on the feature list (opentargets/issues#3521).
opentargets/issues#3512

🛠 What does this PR implement

Implementation of PCHIC-based interval features for the L2G gene prediction model.
Added back interval processing steps into the L2G feature generation step.

🙈 Missing

More features from Andersson + Thurman.

🚦 Before submitting

Do these changes cover one single feature (one change at a time)?
Did you read the contributor guideline?
Did you make sure to update the documentation with your changes?
Did you make sure there is no commented out code in this PR?
Did you follow conventional commits standards in PR title and commit messages?
Did you make sure the branch is up-to-date with the dev branch?
Did you write any new necessary tests?
Did you make sure the changes pass local tests (make test)?
Did you make sure the changes pass pre-commit rules (e.g. poetry run pre-commit run --all-files)?


addramir · 2024-10-07T11:30:13Z

src/gentropy/dataset/l2g_features/intervals.py

+        # feature will be the same for any gene associated with a studyLocus)
+        local_max.withColumn(
+            "regional_maximum",
+            f.max(local_feature_name).over(Window.partitionBy("studyLocusId")),


Why is it the maximum? According to the table and what we discussed, it should be the mean.
https://docs.google.com/spreadsheets/d/1wUs1AprRCCGItZmgDhc1fF5BtwCSosdzFv4NQ8V6Dtg/edit?gid=452826388#gid=452826388
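To make the difference concrete, here is a minimal plain-Python sketch of the two aggregations over the same partition key (studyLocusId). It is an illustration only, not the PR's PySpark code: the function name `regional_aggregates`, the `score` column, and the sample rows are hypothetical. In PySpark the analogous change would presumably be swapping `f.max` for `f.mean` over the same window.

```python
from collections import defaultdict
from statistics import mean


def regional_aggregates(rows):
    """Attach the per-studyLocus maximum and mean of the local score to every row.

    Hypothetical helper: mimics an aggregate over a window partitioned by studyLocusId.
    """
    scores_by_locus = defaultdict(list)
    for row in rows:
        scores_by_locus[row["studyLocusId"]].append(row["score"])

    enriched = []
    for row in rows:
        scores = scores_by_locus[row["studyLocusId"]]
        enriched.append(
            {
                **row,
                "regional_maximum": max(scores),  # what the diff computes
                "regional_mean": mean(scores),  # what the review asks for
            }
        )
    return enriched


# Made-up rows: two genes in one locus, one gene in another.
rows = [
    {"studyLocusId": "sl1", "geneId": "g1", "score": 1.0},
    {"studyLocusId": "sl1", "geneId": "g2", "score": 0.5},
    {"studyLocusId": "sl2", "geneId": "g3", "score": 0.25},
]
result = regional_aggregates(rows)
```

Both values are constant within a studyLocus, so either one gives the same regional value to every gene in the locus; they differ only in how strongly a single high-scoring gene dominates it.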


ireneisdoomed

Thank you for the changes Jack!!!

The logic to build the features looks good! Please see my comments; they are mostly about how we process the interval data in the L2G step.
I suggested processing all interval sources to make the process simpler, but since the code is set up to take source names and paths individually and changing that would be messy, it's also fine to leave it as it is, as long as the interval_paths parameter is correctly configured.

The implemented changes wouldn't run, because an Interval dataset is created with a mismatching schema. I would encourage you to:

Add any features you add to the test_l2g_feature_matrix.py suite, to make sure that the code doesn't crash
In the same file, add a semantic test for the common logic
Update the documentation pages
Pull the dev branch to bring in the changes to the feature matrix step

src/gentropy/config.py

src/gentropy/l2g.py

src/gentropy/method/l2g/feature_factory.py


ireneisdoomed

Thank you for your additions! It looks like the code is not ready yet, so please look at my comments.

In general, I think that before integrating anything into dev, we should retrain the model and make a more informed decision about what and how we want to add it.

It may be a naive question, but is there a way to aggregate this data so that the feature represents whether the credible set is part of any regulatory region that we know is physically close to the gene, as opposed to having separate features per source?

Based on the code, I guess you have agreed that Intervals will be written to disk and then the features will be extracted in the FM generation.

Correct me if I'm wrong, but based on your comment (#812 (comment)) the heavy part of processing interval data is not so much parsing the raw data, but getting the V2G relationships because of the high data volume. Have you considered doing the Interval parsing as part of the FM generation step?

In essence, I'd like to see the impact of these features before making a decision.
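One way to picture the source-agnostic feature asked about above is a single flag per (credible set, gene) pair that is set when any variant falls inside any interval linked to that gene, whatever the source. The sketch below is plain Python under stated assumptions: the function name, the field names (`geneId`, `chromosome`, `start`, `end`, `source`), and the sample data are all hypothetical and do not reflect gentropy's schema or the PR's implementation.

```python
def overlaps_any_interval(variant_positions, intervals, gene_id):
    """Return 1.0 if any credible-set variant lies in any interval linked to gene_id.

    The interval source is deliberately ignored, so all sources collapse into one feature.
    """
    for chromosome, position in variant_positions:
        for interval in intervals:
            if (
                interval["geneId"] == gene_id
                and interval["chromosome"] == chromosome
                and interval["start"] <= position <= interval["end"]
            ):
                return 1.0
    return 0.0


# Made-up intervals from two hypothetical sources.
intervals = [
    {"source": "sourceA", "geneId": "gene1", "chromosome": "1", "start": 100, "end": 200},
    {"source": "sourceB", "geneId": "gene2", "chromosome": "1", "start": 500, "end": 600},
]
# Made-up credible set: (chromosome, position) per variant.
credible_set = [("1", 150), ("1", 900)]

flag_gene1 = overlaps_any_interval(credible_set, intervals, "gene1")
flag_gene2 = overlaps_any_interval(credible_set, intervals, "gene2")
```

A binary flag like this discards the per-source scores, so whether it helps the model is exactly the kind of question retraining would have to answer.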

src/gentropy/datasource/intervals/Intervals.py

ireneisdoomed · 2025-01-08T10:24:25Z

src/gentropy/datasource/intervals/Intervals.py

Please write the name of the module in lower case.

Steps live in gentropy's source root. Datasource is the folder where we store data parsers.

src/gentropy/config.py

ireneisdoomed · 2025-01-08T12:01:34Z

src/gentropy/config.py

+            # intervals
+            "pchicMean",
+            "pchicMeanNeighbourhood",
+            "enhTssCorrelationMean",


I'd like more readable feature names to pick up what they represent easily

ireneisdoomed · 2025-01-08T12:01:41Z

src/gentropy/config.py

+            "pchicMean",
+            "pchicMeanNeighbourhood",
+            "enhTssCorrelationMean",
+            "enhTssCorrelationMeanNeighbourhood",


I'd like more readable feature names to pick up what they represent easily

ireneisdoomed · 2025-01-08T14:53:12Z

src/gentropy/dataset/l2g_features/intervals.py

+class EnhTssCorrelationMeanFeature(L2GFeature):
+    """Average weighted Enhancer-TSS correlation between studylocus and gene TSS."""
+
+    fill_na_value = 0  # would be 0 if implemented


same as above

ireneisdoomed · 2025-01-08T14:53:57Z

src/gentropy/dataset/l2g_features/intervals.py

+class DhsPmtrCorrelationMeanFeature(L2GFeature):
+    """Average weighted DHS-promoter correlation between studylocus and gene TSS."""
+
+    fill_na_value = 0  # would be 0 if implemented


same as above

ireneisdoomed · 2025-01-08T14:54:30Z

src/gentropy/dataset/l2g_features/intervals.py

+    Compared to the Mean weighted DHS-promoter correlation for all genes in the vicinity.
+    """
+
+    fill_na_value = 0  # would be 0 if implemented


same as above

ireneisdoomed · 2025-01-08T14:57:29Z

src/gentropy/l2g.py

@@ -33,7 +35,7 @@ def __init__(
        self,
        session: Session,
        *,
-        features_list: list[str],
+        features_list: list[str] = LocusToGeneFeatureMatrixConfig().features_list,


Instantiating LocusToGeneFeatureMatrixConfig will call the session class by default. We tend to avoid instantiating step configs.
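A minimal sketch of the underlying Python behaviour: a default argument expression is evaluated once, when the function is defined, so a config object used as a default is built at import time even if the step is never run. The classes below are hypothetical stand-ins, not gentropy's real config or step classes.

```python
from dataclasses import dataclass, field

calls = []  # records every instantiation of the hypothetical config


@dataclass
class FeatureMatrixConfig:
    """Hypothetical stand-in for a step config; not gentropy's real class."""

    features_list: list[str] = field(default_factory=lambda: ["pchicMean"])

    def __post_init__(self) -> None:
        # Stands in for any side effect of building the config.
        calls.append("config instantiated")


class StepWithEagerDefault:
    # The default expression below runs once, when this def statement is executed.
    def __init__(
        self, *, features_list: list[str] = FeatureMatrixConfig().features_list
    ) -> None:
        self.features_list = features_list


class StepWithExplicitArgument:
    # No default: the caller (or the config system) must pass the list in.
    def __init__(self, *, features_list: list[str]) -> None:
        self.features_list = features_list


# No step has been created yet, but the config was already built once.
calls_after_definition = len(calls)

step = StepWithExplicitArgument(features_list=["pchicMean"])
```

Keeping `features_list` as a required keyword argument, as in the line removed by the diff, avoids that side effect and leaves the config system as the single place where defaults are resolved.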

ireneisdoomed · 2025-01-08T16:10:05Z

src/gentropy/assets/schemas/intervals.json

    {
      "metadata": {},
      "name": "geneId",
-      "nullable": false,
+      "nullable": true,


Why nullable?


feat: add interval logic for l2g features

9c31f43

github-actions bot added size-M Method Dataset Step Feature labels Oct 3, 2024

xyg123 added 3 commits October 3, 2024 14:29

chore: fix docstrings

330b79e

chore: fix attribute errors

183c827

Merge branch 'dev' of https://github.com/opentargets/gentropy into xg…

500bae8

…1_l2g_intervals

addramir requested a review from ireneisdoomed October 5, 2024 07:21

xyg123 added 2 commits October 7, 2024 11:14

Merge branch 'dev' of https://github.com/opentargets/gentropy into xg…

7cb4b5f

…1_l2g_intervals

fix: multiple input lines from merge

2035a52

addramir reviewed Oct 7, 2024


xyg123 added 3 commits October 7, 2024 16:02

fix: change to mean comparison, add additional interval features

985a901

fix: change to mean comparison, add additional interval features

b01b4e8

Merge branch 'dev' of https://github.com/opentargets/gentropy into xg…

688c73a

…1_l2g_intervals

ireneisdoomed requested changes Oct 11, 2024


xyg123 added 4 commits October 15, 2024 10:21

Merge branch 'dev' of https://github.com/opentargets/gentropy into xg…

6837df3

…1_l2g_intervals

Merge branch 'dev' of https://github.com/opentargets/gentropy into xg…

f194098

…1_l2g_intervals

fix: change interval schema, reorganise interval processing, begin ad…

a9c0f6b

…d test for interval features

Merge branch 'dev' of https://github.com/opentargets/gentropy into xg…

63d6db6

…1_l2g_intervals

github-actions bot added size-L and removed size-M labels Oct 17, 2024

xyg123 and others added 7 commits October 18, 2024 13:20

Merge branch 'dev' of https://github.com/opentargets/gentropy into xg…

374a7c3

…1_l2g_intervals

Merge branch 'dev' of https://github.com/opentargets/gentropy into xg…

29ad08b

…1_l2g_intervals

Merge branch 'dev' of https://github.com/opentargets/gentropy into xg…

42e4ce9

…1_l2g_intervals

fix: schema fixes

55f947f

Merge branch 'dev' of https://github.com/opentargets/gentropy into xg…

1de5fcf

…1_l2g_intervals

Added working tests for interval + nbh features

c332d93

chore: pre-commit auto fixes [...]

ee8c4f2

xyg123 added 2 commits December 17, 2024 11:59

type hint issue

24dc8c3

add datasource step to process intervals

9aeb302

github-actions bot added the Datasource label Dec 17, 2024

pre-commit-ci bot and others added 5 commits December 17, 2024 12:25

chore: pre-commit auto fixes [...]

155fcdb

add interval doc .md

53a6ff3

Merge branch 'dev' of https://github.com/opentargets/gentropy into xg…

b8914a7

…1_l2g_intervals

changes to config

78f661b

Merge branch 'xg1_l2g_intervals' of https://github.com/opentargets/ge…

cf8b260

…ntropy into xg1_l2g_intervals

github-actions bot added the documentation Improvements or additions to documentation label Dec 19, 2024

pre-commit-ci bot and others added 3 commits December 19, 2024 11:39

chore: pre-commit auto fixes [...]

880cacf

address feature name comments and tests

c076e17

Merge branch 'xg1_l2g_intervals' of https://github.com/opentargets/ge…

b074bc4

…ntropy into xg1_l2g_intervals

xyg123 requested a review from ireneisdoomed December 19, 2024 12:22

ireneisdoomed requested changes Jan 8, 2025


xyg123 added 16 commits January 22, 2025 15:18

Merge branch 'dev' of https://github.com/opentargets/gentropy into xg…

af54441

…1_l2g_intervals

fix: adress comments and accomodate target index

0dacd58

fix: add interval path to feature matrix step

1f35a8e

Merge branch 'dev' of https://github.com/opentargets/gentropy into xg…

23aa66a

…1_l2g_intervals

test: testing fm generation (no intervals)

2fce3d6

Merge branches 'xg1_l2g_intervals' and 'dev' of https://github.com/op…

8cc1f2c

…entargets/gentropy into xg1_l2g_intervals

Merge branch 'dev' of https://github.com/opentargets/gentropy into xg…

3f66d87

…1_l2g_intervals

test: join credsets using variantId instead of overlap

a268312

test: enable interval features in fm generation

9c412b8

test: join based on variantId

5f9b417

test: join based on variantId

56c3c5c

test: config interval path training

01409cb

fix: feature names in feature matrix

468695c

test: config for interval test training

d3862df

fix: undo gene count features

439c2c3

test: run l2g train with no CV

d03844d
