Welcome to the official GitHub repository of the HEST-Library introduced in "HEST-1k: A Dataset for Spatial Transcriptomics and Histology Image Analysis", NeurIPS Spotlight, 2024. This project was developed by the Mahmood Lab at Harvard Medical School and Brigham and Women's Hospital.
- HEST-1k: Free access to HEST-1k, a dataset of 1,229 paired Spatial Transcriptomics samples with H&E-stained whole-slide images
- HEST-Library: A series of helpers to assemble new ST samples (ST, Visium, Visium HD, Xenium) and work with HEST-1k (ST analysis, batch effect viz and correction, etc.)
- HEST-Benchmark: A new benchmark to assess the predictive performance of foundation models for histology in predicting gene expression from morphology
HEST-1k, HEST-Library, and HEST-Benchmark are released under the Attribution-NonCommercial-ShareAlike 4.0 International license.
- 21.10.24: HEST has been accepted to NeurIPS 2024 as a Spotlight! We will be in Vancouver from Dec 10th to 15th. Send us a message if you want to learn more about HEST (gjaume@bwh.harvard.edu).
- 23.09.24: 121 new samples released, including 27 Xenium and 7 Visium HD! We also make the aligned Xenium transcripts and the aligned DAPI-segmented cells/nuclei public.
- 30.08.24: HEST-Benchmark results updated. Includes H-Optimus-0, Virchow 2, Virchow, and GigaPath. New COAD task based on 4 Xenium samples. The Hugging Face benchmark data have been updated.
- 28.08.24: New set of helpers for batch effect visualization and correction. See the tutorial 5-Batch-effect-visualization.ipynb.
To download/query HEST-1k, follow the tutorial 1-Downloading-HEST-1k.ipynb or follow instructions on Hugging Face.
NOTE: The entire dataset is larger than 1 TB, but you can download a subset by querying by id, organ, species, and other metadata.
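A minimal sketch of a subset download by sample id. It assumes that the dataset lives on Hugging Face under the repository id `MahmoodLab/hest` and that each file name contains its sample id followed by `_` or `.`; both are assumptions, so defer to the tutorial and the Hugging Face instructions where they differ. The helpers `build_patterns` and `download_subset` are hypothetical names; `snapshot_download` and its `allow_patterns` argument come from the `huggingface_hub` package.

```python
from fnmatch import fnmatch


def build_patterns(sample_ids):
    """Build one glob pattern per sample id (hypothetical helper)."""
    # Assumes files are named '<folder>/<id>.<ext>' or '<folder>/<id>_<suffix>'.
    return [f"*{sample_id}[_.]*" for sample_id in sample_ids]


def download_subset(sample_ids, local_dir="hest_data"):
    """Download only the files whose names match the requested ids."""
    # Imported lazily so build_patterns works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id="MahmoodLab/hest",  # assumed repository id
        repo_type="dataset",
        local_dir=local_dir,
        allow_patterns=build_patterns(sample_ids),
    )


if __name__ == "__main__":
    patterns = build_patterns(["TENX95", "TENX99"])
    # Dry run on illustrative file names: no network access is needed here.
    for name in ["st/TENX95.h5ad", "wsis/TENX99.tif", "st/TENX950.h5ad"]:
        print(name, any(fnmatch(name, p) for p in patterns))
```

Requiring `_` or `.` right after the id keeps a query for `TENX95` from also selecting a longer id such as `TENX950`.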
git clone https://github.com/mahmoodlab/HEST.git
cd HEST
conda create -n "hest" python=3.9
conda activate hest
pip install -e .
sudo apt install libvips libvips-dev openslide-tools
If a GPU is available on your machine, we recommend installing cucim in your conda environment (hest was tested with cucim-cu12==24.4.0 and CUDA 12.1):
pip install \
--extra-index-url=https://pypi.nvidia.com \
cudf-cu12==24.6.* dask-cudf-cu12==24.6.* cucim-cu12==24.6.* \
raft-dask-cu12==24.6.*
NOTE: HEST-Library was only tested on Linux/macOS machines. Please report any bugs in the GitHub issues.
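To confirm that the installation steps took effect, a small check like the one below can report whether the `hest` package and the optional `cucim` package are discoverable in the active environment. This is only a sketch: `has_module` is a hypothetical helper built on the standard library's `importlib.util.find_spec`, and it checks that a module can be found, not that it works with your GPU.

```python
import importlib.util


def has_module(name):
    """Return True if a top-level module can be found without importing it."""
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    for module in ("hest", "cucim"):
        status = "found" if has_module(module) else "missing"
        print(f"{module}: {status}")
```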
You can then view the dataset as follows:
from hest import iter_hest
# Iterate over the downloaded samples, restricted to the requested ids
for st in iter_hest('../hest_data', id_list=['TENX95']):
    print(st)
The HEST-Library allows assembling new samples in the HEST format and interacting with HEST-1k. We provide three tutorials:
- 2-Interacting-with-HEST-1k.ipynb: Playing around with HEST data for loading patches. Includes a detailed description of each scanpy object.
- 3-Assembling-HEST-Data.ipynb: Walkthrough to transform a Visium sample into HEST.
- 5-Batch-effect-visualization.ipynb: Batch effect visualization and correction (MNN, Harmony, ComBat).
In addition, we provide complete documentation.
The HEST-Benchmark was designed to assess 11 foundation models for pathology under a new, diverse, and challenging benchmark. HEST-Benchmark includes nine tasks for gene expression prediction (50 highly variable genes) from morphology (112 x 112 um regions at 0.5 um/px) in nine different organs and eight cancer types. We provide a step-by-step tutorial to run HEST-Benchmark and reproduce our results in 4-Running-HEST-Benchmark.ipynb.
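As a quick worked example of the region size quoted above, a 112 um edge at 0.5 um/px corresponds to a 224-pixel edge, so each region is a 224 x 224 px patch. The helper name `region_to_pixels` is hypothetical and only illustrates the arithmetic.

```python
def region_to_pixels(region_um, um_per_px):
    """Convert a physical edge length in micrometers to pixels."""
    return round(region_um / um_per_px)


edge_px = region_to_pixels(112, 0.5)
print(f"{edge_px} x {edge_px} px")  # 224 x 224 px
```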
HEST-Benchmark was used to assess 11 publicly available models. Reported results are based on ridge regression applied to PCA-reduced embeddings (256 factors). Because ridge regression unfairly penalizes models with larger embedding dimensions, we apply PCA reduction to ensure a fair and objective comparison between models. Model performance is measured with Pearson correlation. Best is bold, second best is underlined. Additional results based on Random Forest and XGBoost regression are provided in the paper.
Model | IDC | PRAD | PAAD | SKCM | COAD | READ | ccRCC | LUAD | LYMPH IDC | Average |
---|---|---|---|---|---|---|---|---|---|---|
Resnet50 | 0.4741 | 0.3075 | 0.3889 | 0.4822 | 0.2528 | 0.0812 | 0.2231 | 0.4917 | 0.2322 | 0.326 |
CTransPath | 0.511 | 0.3427 | 0.4378 | 0.5106 | 0.2285 | 0.11 | 0.2279 | 0.4985 | 0.2353 | 0.3447 |
Phikon | 0.5327 | 0.342 | 0.4432 | 0.5355 | 0.2585 | 0.1517 | 0.2423 | 0.5468 | 0.2373 | 0.3656 |
CONCH | 0.5363 | 0.3548 | 0.4475 | 0.5791 | 0.2533 | 0.1674 | 0.2179 | 0.5312 | 0.2507 | 0.3709 |
Remedis | 0.529 | 0.3471 | 0.4644 | 0.5818 | 0.2856 | 0.1145 | 0.2647 | 0.5336 | 0.2473 | 0.3742 |
Gigapath | 0.5508 | 0.3708 | 0.4768 | 0.5538 | 0.301 | 0.186 | 0.2391 | 0.5399 | 0.2493 | 0.3853 |
UNI | 0.5702 | 0.314 | 0.4764 | 0.6254 | 0.263 | 0.1762 | 0.2427 | 0.5511 | 0.2565 | 0.3862 |
Virchow | 0.5702 | 0.3309 | 0.4875 | 0.6088 | 0.311 | 0.2019 | 0.2637 | 0.5459 | 0.2594 | 0.3977 |
Virchow2 | 0.5922 | 0.3465 | 0.4661 | 0.6174 | 0.2578 | 0.2084 | 0.2788 | 0.5605 | 0.2582 | 0.3984 |
UNIv1.5 | 0.5989 | 0.3645 | 0.4902 | 0.6401 | 0.2925 | 0.2240 | 0.2522 | 0.5586 | 0.2597 | 0.4090 |
Hoptimus0 | 0.5982 | 0.385 | 0.4932 | 0.6432 | 0.2991 | 0.2292 | 0.2654 | 0.5582 | 0.2595 | 0.4146 |
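The Average column appears to be the unweighted mean of the nine per-task scores. That reading is inferred from the table values rather than stated explicitly, and the short check below reproduces it for two rows copied from the table.

```python
from statistics import mean

# Per-task scores copied from the table, in column order:
# IDC, PRAD, PAAD, SKCM, COAD, READ, ccRCC, LUAD, LYMPH IDC
ROWS = {
    "Resnet50": [0.4741, 0.3075, 0.3889, 0.4822, 0.2528, 0.0812, 0.2231, 0.4917, 0.2322],
    "Hoptimus0": [0.5982, 0.385, 0.4932, 0.6432, 0.2991, 0.2292, 0.2654, 0.5582, 0.2595],
}


def average_score(scores):
    """Unweighted mean over tasks, rounded to four decimals."""
    return round(mean(scores), 4)


for model, scores in ROWS.items():
    print(model, average_score(scores))
# Resnet50 0.326
# Hoptimus0 0.4146
```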
Our tutorial in 4-Running-HEST-Benchmark.ipynb will guide users interested in benchmarking their own model on HEST-Benchmark.
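For orientation, the evaluation protocol described above (PCA reduction, ridge regression, Pearson correlation) can be sketched with NumPy on synthetic data. This is an illustrative sketch under stated assumptions, not the benchmark's implementation: the function names, the regularization strength `alpha=1.0`, the per-gene averaging of Pearson correlation, and the synthetic data generator are all choices made here, and the demo uses 16 components rather than 256 because the synthetic embeddings are only 64-dimensional.

```python
import numpy as np


def fit_pca(X, n_components):
    """Return the feature mean and the top principal axes of X."""
    mean = X.mean(axis=0)
    # Rows of vt are principal axes, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:n_components]


def fit_ridge(Z, Y, alpha):
    """Closed-form ridge regression with an unpenalized intercept."""
    z_mean, y_mean = Z.mean(axis=0), Y.mean(axis=0)
    Zc, Yc = Z - z_mean, Y - y_mean
    gram = Zc.T @ Zc + alpha * np.eye(Z.shape[1])
    weights = np.linalg.solve(gram, Zc.T @ Yc)
    intercept = y_mean - z_mean @ weights
    return weights, intercept


def mean_pearson(Y_true, Y_pred):
    """Pearson correlation per gene (column), averaged over genes."""
    yt = Y_true - Y_true.mean(axis=0)
    yp = Y_pred - Y_pred.mean(axis=0)
    denom = np.sqrt((yt ** 2).sum(axis=0) * (yp ** 2).sum(axis=0))
    return float(((yt * yp).sum(axis=0) / denom).mean())


def evaluate(X_train, Y_train, X_test, Y_test, n_components=256, alpha=1.0):
    """PCA-reduce embeddings, fit ridge on the train split, score on test."""
    mean, axes = fit_pca(X_train, n_components)
    Z_train = (X_train - mean) @ axes.T
    Z_test = (X_test - mean) @ axes.T
    weights, intercept = fit_ridge(Z_train, Y_train, alpha)
    return mean_pearson(Y_test, Z_test @ weights + intercept)


def make_synthetic(n_patches=600, n_latent=8, emb_dim=64, n_genes=5, seed=0):
    """Synthetic embeddings and expression sharing a low-rank latent signal."""
    rng = np.random.default_rng(seed)
    latent = rng.normal(size=(n_patches, n_latent))
    X = latent @ rng.normal(size=(n_latent, emb_dim))
    X += 0.1 * rng.normal(size=X.shape)
    Y = latent @ rng.normal(size=(n_latent, n_genes))
    Y += 0.1 * rng.normal(size=Y.shape)
    return X, Y


if __name__ == "__main__":
    X, Y = make_synthetic()
    score = evaluate(X[:400], Y[:400], X[400:], Y[400:], n_components=16)
    print(f"mean Pearson on held-out patches: {score:.3f}")
```

Fitting PCA on the training split only, and then projecting the test split with the same axes, keeps test information out of the fitted model. Reducing every model to the same number of components gives the ridge penalty the same number of coefficients to shrink regardless of the original embedding size.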
Note: Spontaneous contributions are encouraged if researchers from the community want to include new models. To do so, simply create a Pull Request.
- The preferred mode of communication is via GitHub issues.
- If GitHub issues are inappropriate, email gjaume@bwh.harvard.edu (and cc homedoucetpaul@gmail.com).
- Immediate response to minor issues may not be available.
If you find our work useful in your research, please consider citing:
Jaume, G., Doucet, P., Song, A. H., Lu, M. Y., Almagro-Perez, C., Wagner, S. J., Vaidya, A. J., Chen, R. J., Williamson, D. F. K., Kim, A., & Mahmood, F. HEST-1k: A Dataset for Spatial Transcriptomics and Histology Image Analysis. Advances in Neural Information Processing Systems, December 2024.
@inproceedings{jaume2024hest,
author = {Guillaume Jaume and Paul Doucet and Andrew H. Song and Ming Y. Lu and Cristina Almagro-Perez and Sophia J. Wagner and Anurag J. Vaidya and Richard J. Chen and Drew F. K. Williamson and Ahrong Kim and Faisal Mahmood},
title = {HEST-1k: A Dataset for Spatial Transcriptomics and Histology Image Analysis},
booktitle = {Advances in Neural Information Processing Systems},
year = {2024},
month = dec,
}