Do we need to correct the batch effects of given datasets #43

Open
HelloWorldLTY opened this issue Aug 30, 2024 · 14 comments

Comments

@HelloWorldLTY

Hi, thanks for your great work. I wonder if we need to correct the batch effects of these spatial transcriptomic data or not. Thanks a lot!

@guillaumejaume
Collaborator

guillaumejaume commented Aug 30, 2024

Hi, it depends on what you want to do with HEST data. What's your use case?

@HelloWorldLTY
Author

HelloWorldLTY commented Aug 31, 2024

I am interested in the Visium data only. Thanks.

@guillaumejaume
Collaborator

The Visium data integrated into HEST-1k are very diverse: 2 species (mouse and human), multiple diseases, and multiple organs. Batch effect correction should be applied whenever there is some guarantee that it won't significantly alter the biological signal.

To give a better answer, I need a better understanding of your problem statement, e.g., multimodal representation learning, ST prediction from H&E, characterization of morphological correlates of expression changes, etc.

If you want to explore batch effect, we implemented 2 core functions:

  • Batch effect visualization, here, which produces a UMAP visualization of the expression of housekeeping genes (i.e., stable genes) in the stromal region. The function takes as input the set of Visium samples that you want to compare.
  • Batch effect correction, here, which can correct batch effects using MNN, Harmony, or ComBat. The output of each method is different, e.g., Harmony creates a new latent space, so its output can no longer be interpreted as gene counts (this may or may not be an issue for your problem statement).
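
To make the idea of a correction concrete, here is a minimal sketch of the simplest possible adjustment: per-batch mean-centering of log-normalized expression. This is not MNN, Harmony, or ComBat, and it is not the HEST function linked above; the name `center_per_batch` and the toy data are hypothetical and only illustrate what "removing a per-sample shift" means.

```python
from collections import defaultdict
from statistics import fmean


def center_per_batch(expr, batches):
    """Subtract each batch's per-gene mean, then add back the global per-gene mean.

    expr:    list of spots, each a list of log-normalized values (one per gene)
    batches: list of batch labels (e.g., sample IDs), same length as expr
    """
    if len(expr) != len(batches):
        raise ValueError("expr and batches must have the same length")
    n_genes = len(expr[0])

    # Group spot indices by batch label.
    rows_by_batch = defaultdict(list)
    for i, label in enumerate(batches):
        rows_by_batch[label].append(i)

    # Global per-gene mean, added back so values stay on the original scale.
    global_mean = [fmean(row[g] for row in expr) for g in range(n_genes)]

    corrected = [None] * len(expr)
    for rows in rows_by_batch.values():
        batch_mean = [fmean(expr[i][g] for i in rows) for g in range(n_genes)]
        for i in rows:
            corrected[i] = [
                expr[i][g] - batch_mean[g] + global_mean[g] for g in range(n_genes)
            ]
    return corrected


# Toy data: two samples, two genes; sample "B" is shifted by +2 on gene 0.
expr = [[1.0, 5.0], [3.0, 5.0], [3.0, 5.0], [5.0, 5.0]]
batches = ["A", "A", "B", "B"]
corrected = center_per_batch(expr, batches)
print(corrected)
```

Even this simple adjustment produces values that can be non-integer or negative, so they are no longer counts. It also removes any real biological difference in mean expression between samples, which is the risk to weigh before correcting.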

@HelloWorldLTY
Author

Thanks! I will take a look at it!

@guillaumejaume
Collaborator

@HelloWorldLTY, feel free to document any findings on this GitHub issue.

@skambha6

skambha6 commented Oct 1, 2024

Related to this, I am noticing fairly strong batch effects by sample-of-origin for the H&E patch embeddings from Visium data even from the same tissue and disease. Is this to be expected or am I missing a key pre-processing step? I am loading in the patches using a H5HESTDataset object and applying only the model-specific eval_transforms (which generally appear to be resizing and ImageNet Normalization).
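
One way to put a number on this kind of observation is to check how often a patch's nearest neighbors in embedding space come from the same sample. The sketch below is only an illustration under assumptions: `same_sample_neighbor_fraction` is a hypothetical helper, not part of HEST, and the toy 2-D points stand in for real patch embeddings.

```python
import math


def same_sample_neighbor_fraction(embeddings, sample_ids, k=5):
    """Mean fraction of each patch's k nearest neighbors that share its sample ID."""
    n = len(embeddings)
    if n != len(sample_ids):
        raise ValueError("embeddings and sample_ids must have the same length")
    if not 0 < k < n:
        raise ValueError("k must be between 1 and n - 1")

    total = 0.0
    for i in range(n):
        # Euclidean distance from patch i to every other patch, nearest first.
        dists = sorted(
            (math.dist(embeddings[i], embeddings[j]), j) for j in range(n) if j != i
        )
        neighbors = [j for _, j in dists[:k]]
        total += sum(sample_ids[j] == sample_ids[i] for j in neighbors) / k
    return total / n


# Toy embeddings: two tight clusters of three points each.
points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]

# Case 1: each cluster is one sample (strong sample-of-origin structure).
separated = same_sample_neighbor_fraction(points, ["A", "A", "A", "B", "B", "B"], k=2)

# Case 2: samples are interleaved across both clusters.
mixed = same_sample_neighbor_fraction(points, ["A", "B", "A", "B", "A", "B"], k=2)

print(separated, mixed)
```

A value close to 1 means neighborhoods are dominated by a single sample. That can reflect staining or scanner differences, but also genuine tissue differences between samples, so it is a diagnostic rather than proof of a technical artifact.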

@guillaumejaume
Collaborator

Batch effects in the H&E images exist. Which patch encoder are you using?

@skambha6

skambha6 commented Oct 2, 2024

I see this with both the Gigapath and UNI encoders.

@guillaumejaume
Collaborator

In my experience, CONCH is less sensitive to staining variations. Also, keep in mind that the image latent space can express staining variations while still encoding all the relevant biological signal. Depending on the downstream task, this may not be critical.

@skambha6

skambha6 commented Oct 2, 2024

I see. Are there any ways to correct for the staining variations with preprocessing/normalization? It seems that Harmony can remove some of the image batch effects from the embeddings, but not all.

@guillaumejaume
Collaborator

Many approaches exist for stain normalization in computational pathology, e.g., Macenko or Vahadane normalization. However, these can also alter the biological signal from the image. I'd need to better understand your problem statement to provide a more informed answer.
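
For readers unfamiliar with these methods, the sketch below shows only the first step that Macenko-style normalization relies on: converting RGB intensities to optical density and discarding near-background pixels. The full method goes on to estimate stain vectors from the remaining pixels, which is not shown here. The function names are hypothetical, and the background intensity of 240 and threshold of 0.15 are assumptions taken from defaults seen in common open-source implementations, not values prescribed by HEST.

```python
import math


def rgb_to_optical_density(pixel, background=240):
    """Convert one RGB pixel (0-255 per channel) to per-channel optical density.

    Uses OD = -ln((I + 1) / background); the +1 avoids log(0) for black pixels.
    """
    return [-math.log((channel + 1) / background) for channel in pixel]


def is_tissue(pixel, beta=0.15, background=240):
    """Keep a pixel only if every channel has optical density >= beta."""
    return all(od >= beta for od in rgb_to_optical_density(pixel, background))


white_background = (255, 255, 255)   # bright glass, almost no stain
stained_pixel = (80, 50, 140)        # illustrative purple-ish pixel

print(is_tissue(white_background), is_tissue(stained_pixel))
```

Because later steps rescale stain concentrations toward a reference, real differences in staining intensity between tissues can be flattened along with the technical ones, which is why normalization can alter biological signal.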

@skambha6

skambha6 commented Oct 2, 2024

Got it! We were interested in predicting gene expression from the patch embeddings, but it seems from what you're saying that batch effect correction can hurt more than help for this task.

@guillaumejaume
Collaborator

In HEST-Benchmark we didn't apply additional corrections. I'm sure that performance can be improved. But the big unknown becomes how to ensure good generalization.

@skambha6

skambha6 commented Oct 2, 2024

Okay got it, thank you for the information!
