Include noise removal for feature selection #21

Open
PaulaLlanos opened this issue Dec 5, 2024 · 4 comments

Comments

@PaulaLlanos

Overview from mini grant:

Test noise_removal for feature selection because we have no equivalent step in the profiling recipe. Without a feature selection step that filters based on feature relevance, we may end up with many noisy features. Scenario 1 in https://github.com/carpenter-singh-lab/2023_Arevalo_BatchCorrection would be ideal for testing this. Use mAP (and possibly other metrics) to report whether baseline performance improves when this step is added to the preprocessing workflow. See carpenter-singh-lab/2023_Arevalo_NatComm_BatchCorrection#4 for more details about preprocessing. Also note this paper, which tests feature selection methods: Siegismund, D., Fassler, M., Heyse, S. & Steigele, S. Benchmarking feature selection methods for compressing image information in high-content screening. SLAS Technol 27, 85–93 (2022). A review summarized it as "AutoML (automated machine learning), enable the most informative features from Cell Painting datasets to be identified faster and more accurately".

Steps discussed by email:

Add noise_removal to the feature selection steps in the jump profiling recipe to see if it improves results (compared to without)

You will need to figure out how to include it in this step: https://github.com/broadinstitute/jump-profiling-recipe/blob/main/preprocessing/feature_selection.py

This is the function: https://github.com/cytomining/pycytominer/blob/08f3a043fd22e8f86de5488fdfc2d21814a491fb/pycytominer/operations/noise_removal.py#L8

Test:
https://github.com/cytomining/pycytominer/blob/08f3a043fd22e8f86de5488fdfc2d21814a491fb/tests/test_feature_select.py#L76
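For readers unfamiliar with the step, here is a minimal, dependency-free sketch of the idea behind noise removal as I understand it: compute each feature's standard deviation within every perturbation group, average those values across groups, and drop features whose average exceeds a cutoff. This is not the pycytominer implementation; the function name `noisy_features`, the column names, and the use of the population standard deviation are assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean, pstdev


def noisy_features(rows, group_key, features, stdev_cutoff=0.8):
    """Return features whose mean within-group stdev exceeds the cutoff.

    rows: list of dicts, one per profile (hypothetical layout).
    group_key: metadata column identifying the perturbation group.
    """
    # Collect the profiles that belong to each perturbation group.
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(row)

    to_drop = []
    for feat in features:
        # Spread of this feature among replicates of each perturbation.
        per_group_std = [
            pstdev([r[feat] for r in members]) for members in groups.values()
        ]
        # A feature that varies a lot among replicates is treated as noisy.
        if mean(per_group_std) > stdev_cutoff:
            to_drop.append(feat)
    return to_drop


# Toy data: "f_stable" is consistent within groups, "f_noisy" is not.
rows = [
    {"pert": "A", "f_stable": 1.0, "f_noisy": -2.0},
    {"pert": "A", "f_stable": 1.1, "f_noisy": 2.0},
    {"pert": "B", "f_stable": -1.0, "f_noisy": 3.0},
    {"pert": "B", "f_stable": -0.9, "f_noisy": -1.0},
]
dropped = noisy_features(rows, "pert", ["f_stable", "f_noisy"], stdev_cutoff=0.8)
print(dropped)
```

The cutoff only has a consistent meaning if features are on a comparable scale, so a step like this would presumably sit after normalization in the recipe.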

@PaulaLlanos
Author

@shntnu
Scenario_1 uses the mad_drop_int_featselect and mad_int_featselect pipelines on Source_6.

Since there is a new version of jump-profiling-recipe, is it still worthwhile to run the Scenario_1 pipeline?
I would like to double-check with you, because it may be more convenient to test this with the new version of the "orf.json" pipeline. The orf pipeline currently has the best performance according to the latest updates.

@PaulaLlanos
Author

PaulaLlanos commented Dec 9, 2024

@shntnu @johnarevalo

The analysis was conducted using the code provided in the repository.

I have tested the recipe by applying noise removal to Scenario 1 and Scenario 4:

Scenario 1: Includes "source 4" on "target2".
Scenario 4: Includes "source_1", "source_2", "source_3", "source_5", "source_6", "source_7", "source_8", "source_9", "source_10", "source_11" on "target2".

Observations

Scenario 1: The addition of the noise removal function did not show significant differences in the results.
Scenario 4: Following John's suggestion, I tested this pipeline in Scenario 4, which includes the majority of the sources.

Metrics:

For both scenarios, the metrics mAP negcon and mAP nonrep were collected.
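As a reminder of what these metrics measure, the sketch below computes textbook average precision for a ranked list and averages it across perturbations. My understanding is that for mAP negcon the ranked list mixes a perturbation's replicates (relevant) with negative controls, and for mAP nonrep it mixes them with non-replicate perturbations. This is an illustration only; the evaluation code used for the results here may differ in details, and the function names are hypothetical.

```python
def average_precision(relevance):
    """Average precision for one ranked list.

    relevance: sequence of 0/1 flags ordered from most to least similar,
    where 1 marks a true replicate of the query perturbation.
    """
    hits = 0
    precisions = []
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            # Precision at the rank where a relevant item is retrieved.
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_average_precision(ranked_lists):
    """Mean of the per-query average precision values."""
    return sum(average_precision(r) for r in ranked_lists) / len(ranked_lists)


# Two toy queries: replicates ranked 1st and 3rd, then a replicate ranked 2nd.
example_lists = [[1, 0, 1, 0], [0, 1]]
map_value = mean_average_precision(example_lists)
print(round(map_value, 4))
```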

Results: Number of Compounds with Significant Activity

Scenario 1: Significant activity, using a noise removal stdev cutoff of 0.5

True mAP negcon: 289

True mAP negcon with noise removal: 293

True mAP nonrep: 293

True mAP nonrep with noise removal: 293

Scenario 4: Significant activity, using a noise removal stdev cutoff of 0.8

True mAP negcon: 186

True mAP negcon with noise removal: 161

True mAP nonrep: 270

True mAP nonrep with noise removal: 190
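To make the comparison easier to read, the counts reported above can be turned into absolute and relative changes. The snippet only restates the numbers from this comment; the dictionary layout and the helper name `summarize` are illustrative choices.

```python
# (baseline, with noise removal) counts of compounds with significant activity,
# copied from the results listed above.
counts = {
    ("Scenario 1", "mAP negcon"): (289, 293),
    ("Scenario 1", "mAP nonrep"): (293, 293),
    ("Scenario 4", "mAP negcon"): (186, 161),
    ("Scenario 4", "mAP nonrep"): (270, 190),
}


def summarize(baseline, with_noise_removal):
    """Return the absolute change and the percent change versus baseline."""
    diff = with_noise_removal - baseline
    return diff, round(100.0 * diff / baseline, 1)


summary = {key: summarize(*pair) for key, pair in counts.items()}
for (scenario, metric), (diff, pct) in summary.items():
    print(f"{scenario} {metric}: {diff:+d} compounds ({pct:+.1f}%)")
```

By this arithmetic, Scenario 1 changes by +4 (negcon) and 0 (nonrep), while Scenario 4 changes by -25 (negcon) and -80 (nonrep). The two scenarios were run with different cutoff values, so the comparison between them is not like-for-like.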

Attached are the plots comparing mAP negcon and mAP nonrep metrics with and without noise removal.

[Figures: Scenario 1, mAP nonrep and mAP negcon; Scenario 4, mAP nonrep and mAP negcon. Screenshots not preserved.]

@johnarevalo
Collaborator

Thanks Paula for putting these results together!

Could you please include the number of features dropped by the "noise removal" step in both scenarios? Also, how many features are in common between both scenarios for the noise removal pipeline after the "featselect" step?

Maybe after removing the noisy features, the feature selection yields the same feature set, which would be a nice thing.
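One way to produce the requested numbers is to compare feature lists as sets. The sketch below assumes the feature names before and after each step are available as plain lists; the feature names and the helper names `dropped_count` and `overlap` are hypothetical and not part of the recipe.

```python
def dropped_count(before, after):
    """Number of features present before a step but absent after it."""
    return len(set(before) - set(after))


def overlap(features_a, features_b):
    """Return the shared features and the Jaccard similarity of two sets."""
    a, b = set(features_a), set(features_b)
    shared = a & b
    union = a | b
    jaccard = len(shared) / len(union) if union else 1.0
    return shared, jaccard


# Hypothetical feature lists for illustration only.
all_features = ["Cells_Area", "Cells_Texture", "Nuclei_Intensity", "Cyto_Granularity"]
scenario_1_after_noise_removal = ["Cells_Area", "Cells_Texture", "Nuclei_Intensity"]
scenario_4_after_noise_removal = ["Cells_Area", "Nuclei_Intensity"]

n_dropped_s1 = dropped_count(all_features, scenario_1_after_noise_removal)
n_dropped_s4 = dropped_count(all_features, scenario_4_after_noise_removal)
shared, jaccard = overlap(scenario_1_after_noise_removal, scenario_4_after_noise_removal)
print(n_dropped_s1, n_dropped_s4, sorted(shared), round(jaccard, 3))
```

The same `overlap` call applied to the final feature sets after the "featselect" step would show whether both scenarios converge on the same features.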

@shntnu
Contributor

shntnu commented Feb 12, 2025

@PaulaLlanos - please do ^^^

@johnarevalo - could you help drive this towards a conclusion about whether or not we should use noise removal? If it is inconclusive, please note what additional experiments will be needed in the future.
