Include noise removal for feature selection #21

PaulaLlanos · 2024-12-05T17:51:19Z

Overview from mini grant:

Test noise_removal for feature selection, because we have no equivalent step in the profiling recipe. Without a feature selection step that filters on feature relevance, we may end up with many noisy features. Scenario 1 in https://github.com/carpenter-singh-lab/2023_Arevalo_BatchCorrection would be ideal for testing this. Use mAP (and possibly other metrics) to report whether baseline performance improves when this step is added to the preprocessing workflow. See carpenter-singh-lab/2023_Arevalo_NatComm_BatchCorrection#4 for more details about preprocessing. Also note this paper, which tests feature selection methods: Siegismund, D., Fassler, M., Heyse, S. & Steigele, S. Benchmarking feature selection methods for compressing image information in high-content screening. SLAS Technol 27, 85–93 (2022). It was summarized in a review as "AutoML (automated machine learning), enable the most informative features from Cell Painting datasets to be identified faster and more accurately".

Steps discussed by email:

Add noise_removal to the feature selection steps in the jump profiling recipe to see whether it improves results compared with running the recipe without it.

You will need to figure out how to include it in this step: https://github.com/broadinstitute/jump-profiling-recipe/blob/main/preprocessing/feature_selection.py

This is the function: https://github.com/cytomining/pycytominer/blob/08f3a043fd22e8f86de5488fdfc2d21814a491fb/pycytominer/operations/noise_removal.py#L8

Test:
https://github.com/cytomining/pycytominer/blob/08f3a043fd22e8f86de5488fdfc2d21814a491fb/tests/test_feature_select.py#L76
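For readers unfamiliar with the step, here is a minimal pandas sketch of the idea behind noise removal as I understand it: compute each feature's standard deviation within every perturbation group, average those values across groups, and flag features whose average exceeds a threshold. This is not pycytominer's implementation. The function name `noisy_features`, the column names, and the toy values are all hypothetical, and the 0.8 default is only chosen to match one of the thresholds mentioned in this thread.

```python
import pandas as pd


def noisy_features(df, group_col, feature_cols, stdev_thresh=0.8):
    """Return features whose mean within-group standard deviation exceeds the threshold.

    Hypothetical helper for illustration; not the pycytominer function.
    """
    # Standard deviation of every feature inside each perturbation group
    group_std = df.groupby(group_col)[feature_cols].std()
    # Average those per-group values across all groups (NaN groups are skipped)
    mean_std = group_std.mean(axis=0)
    return mean_std[mean_std > stdev_thresh].index.tolist()


# Toy profiles: two perturbations with three replicates each (made-up values)
profiles = pd.DataFrame(
    {
        "Metadata_JCP2022": ["A", "A", "A", "B", "B", "B"],
        "feat_stable": [1.0, 1.1, 0.9, -1.0, -1.1, -0.9],
        "feat_noisy": [3.0, -3.0, 0.0, 2.0, -2.0, 0.0],
    }
)

to_drop = noisy_features(
    profiles, "Metadata_JCP2022", ["feat_stable", "feat_noisy"]
)
cleaned = profiles.drop(columns=to_drop)
```

In this toy example only `feat_noisy` varies widely among replicates of the same perturbation, so it is the only feature flagged. In the recipe, a step like this would presumably run on normalized profiles, since the threshold is only meaningful when features share a comparable scale.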

PaulaLlanos · 2024-12-05T17:54:32Z

@shntnu
Scenario_1 uses the mad_drop_int_featselect and mad_int_featselect pipelines on Source_6.

Since there is a new version of jump-profiling-recipe, is it still worth running the Scenario_1 pipeline?
I would like to double-check with you, because it may be more convenient to test this with the new version of the "orf.json" pipeline. The orf pipeline currently has the best performance according to the latest updates.

PaulaLlanos · 2024-12-09T04:50:36Z

@shntnu @johnarevalo

The analysis was conducted using the code provided in the repository.

I have tested the recipe by applying noise removal to Scenario 1 and Scenario 4:

Scenario 1: Includes "source 4" on "target2".
Scenario 4: Includes "source_1", "source_2", "source_3", "source_5", "source_6", "source_7", "source_8", "source_9", "source_10", "source_11" on "target2".

Observations

Scenario 1: The addition of the noise removal function did not show significant differences in the results.
Scenario 4: Following John's suggestion, I tested this pipeline in Scenario 4, which includes the majority of the sources.

Metrics:

For both scenarios, the mAP negcon and mAP nonrep metrics were collected.

Results: Number of Compounds with Significant Activity

Scenario 1: Significant Activity using stdev of noise removal 0.5

True mAP negcon: 289

True mAP negcon with noise removal: 293

True mAP nonrep: 293

True mAP nonrep with noise removal: 293

Scenario 4: Significant Activity using stdev of noise removal 0.8

True mAP negcon: 186

True mAP negcon with noise removal: 161

True mAP nonrep: 270

True mAP nonrep with noise removal: 190
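To make the comparison easier to read, the counts above can be turned into absolute and relative changes. The sketch below uses only the numbers reported in this comment; the helper name `summarize` is hypothetical. Note that the two scenarios used different stdev thresholds (0.5 and 0.8), so the changes are not directly comparable across scenarios.

```python
# Compounds with significant activity: (baseline, with noise removal),
# copied from the counts reported above.
counts = {
    ("Scenario 1", "mAP negcon"): (289, 293),
    ("Scenario 1", "mAP nonrep"): (293, 293),
    ("Scenario 4", "mAP negcon"): (186, 161),
    ("Scenario 4", "mAP nonrep"): (270, 190),
}


def summarize(baseline, with_noise_removal):
    """Return the absolute change and the percent change relative to baseline."""
    delta = with_noise_removal - baseline
    pct = 100.0 * delta / baseline
    return delta, round(pct, 1)


summary = {key: summarize(*pair) for key, pair in counts.items()}

for (scenario, metric), (delta, pct) in summary.items():
    print(f"{scenario} {metric}: {delta:+d} compounds ({pct:+.1f}%)")
```

Read this way, Scenario 1 is essentially unchanged, while Scenario 4 loses 25 compounds on mAP negcon and 80 on mAP nonrep with noise removal.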

Attached are the plots comparing mAP negcon and mAP nonrep metrics with and without noise removal.

[Plots: Scenario 1, mAP nonrep and mAP negcon; Scenario 4, mAP nonrep and mAP negcon]

johnarevalo · 2024-12-09T17:19:24Z

Thanks, Paula, for putting these results together!

Could you please include the number of features dropped by the "noise removal" step in both scenarios? Also, how many features do the two scenarios have in common for the noise removal pipeline after the "featselect" step?

Maybe after removing the noisy features, feature selection yields the same feature set in both scenarios, which would be a nice result.
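One possible way to report what is asked here, as a sketch: count the features removed by the noise removal step, then compare the final feature lists of the two scenarios with set operations. The feature names and the helper names `n_dropped` and `compare_feature_sets` are hypothetical; in practice the lists would come from the column names of each pipeline's output.

```python
def n_dropped(features_before, features_after):
    """Number of features present before a step but absent after it."""
    return len(set(features_before) - set(features_after))


def compare_feature_sets(features_a, features_b):
    """Summarize the overlap between two selected feature lists."""
    a, b = set(features_a), set(features_b)
    shared, union = a & b, a | b
    return {
        "shared": sorted(shared),
        "only_a": sorted(a - b),
        "only_b": sorted(b - a),
        # Jaccard index: 1.0 means both scenarios selected identical features
        "jaccard": len(shared) / len(union) if union else 1.0,
    }


# Made-up feature names, for illustration only
before_noise_removal = ["Cells_f1", "Cells_f2", "Nuclei_f1", "Cytoplasm_f1"]
after_noise_removal = ["Cells_f1", "Nuclei_f1", "Cytoplasm_f1"]

scenario_1_final = ["Cells_f1", "Nuclei_f1", "Cytoplasm_f1"]
scenario_4_final = ["Cells_f1", "Nuclei_f1"]

dropped = n_dropped(before_noise_removal, after_noise_removal)
overlap = compare_feature_sets(scenario_1_final, scenario_4_final)
```

A Jaccard index close to 1.0 would support the idea that, once noisy features are removed, feature selection converges on the same feature set in both scenarios.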

shntnu · 2025-02-12T15:39:26Z

> Thanks, Paula, for putting these results together!
>
> Could you please include the number of features dropped by the "noise removal" step in both scenarios? Also, how many features do the two scenarios have in common for the noise removal pipeline after the "featselect" step?
>
> Maybe after removing the noisy features, feature selection yields the same feature set in both scenarios, which would be a nice result.

@PaulaLlanos - please do ^^^

@johnarevalo - could you help drive this towards a conclusion about whether or not we should use noise removal? If the results are inconclusive, please note what additional experiments will be needed in the future.
