scPRINT is a large transformer model built for the inference of gene networks (connections between genes explaining the cell's expression profile) from scRNAseq data.
It uses novel encoding and decoding of the cell expression profile and new pre-training methodologies to learn a cell model.
scPRINT can be used to perform the following analyses:
- expression denoising: increase the resolution of your scRNAseq data
- cell embedding: generate a low-dimensional representation of your dataset
- label prediction: predict the cell type, disease, sequencer, sex, and ethnicity of your cells
- gene network inference: generate a gene network from any cell or cell cluster in your scRNAseq dataset
Read the paper if you would like to know more about scPRINT.
For now, scPRINT has been tested on macOS and Linux (Ubuntu 20.04) with Python 3.10.
If you want to use flashattention2, note that it currently only supports the MLIR version of triton 2.0 and torch==2.0.0.
conda create -n "[whatever]" python==3.10
git clone https://github.com/jkobject/scPRINT
git clone https://github.com/jkobject/GRnnData
git clone https://github.com/jkobject/benGRN
cd scPRINT
git submodule init
git submodule update
pip install 'lamindb[jupyter,bionty]'
pip install -e scDataloader
pip install -e ../GRnnData/
pip install -e ../benGRN/
pip install torch==2.0.0 torchvision==0.15.1 torchaudio==2.0.1
# install the dev tooling if you need it too
pip install -e ".[dev]"
pip install -r requirements-dev.txt
pip install triton==2.0.0.dev20221202 --no-deps # only if you have a compatible gpu (e.g. not available for apple GPUs for now, see https://github.com/triton-lang/triton?tab=readme-ov-file#compatibility)
# install triton as mentioned in .toml if you want to
mkdocs serve # to view the dev documentation
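Once the install has finished, it can help to confirm that the environment really holds the versions pinned above, since flashattention2 only works with that torch/triton pair. The snippet below is a minimal sketch, not part of scPRINT: `PINS` and `find_mismatches` are hypothetical names, and the version strings are copied from the pip commands above.

```python
from importlib import metadata

# Pins copied from the install commands above.
PINS = {
    "torch": "2.0.0",
    "torchvision": "0.15.1",
    "torchaudio": "2.0.1",
    "triton": "2.0.0.dev20221202",
}


def find_mismatches(pins, get_version=metadata.version):
    """Return {package: (expected, found)} for every pin that is not satisfied.

    `found` is None when the package is not installed.
    """
    mismatches = {}
    for name, expected in pins.items():
        try:
            found = get_version(name)
        except metadata.PackageNotFoundError:
            found = None
        # Ignore local build tags such as "+cu117" when comparing.
        if found is None or found.split("+")[0] != expected:
            mismatches[name] = (expected, found)
    return mismatches


# Report anything that differs from the pinned versions.
for name, (expected, found) in find_mismatches(PINS).items():
    print(f"{name}: expected {expected}, found {found or 'not installed'}")
```

A missing triton is expected on machines without a compatible GPU; in that case only the CPU path described further down applies.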
We make use of some additional packages we developed alongside scPRINT.
Please refer to their documentation for more information:
- scDataLoader: a dataloader for training large cell models.
- GRnnData: a package to work with gene networks from single cell data.
- benGRN: a package to benchmark gene network inference methods from single cell data.
In that case, sign in to lamin.ai with Google or GitHub, then be sure to log in before running anything (or before starting a notebook): lamin login <email> --key <API-key>. Follow the instructions on their website.
(Work In Progress)
This is the most minimal example of how scPRINT works:
from lightning.pytorch import Trainer
from scprint import scPrint
from scdataloader import DataModule
datamodule = DataModule(...)
model = scPrint(...)
trainer = Trainer(...)
trainer.fit(model, datamodule=datamodule)
...
or, from a bash command line
$ scprint fit/train/predict/test --config config/[medium|large|vlarge] ...
If you do not have triton installed you will not be able to take advantage of GPU acceleration, but you can still use the model on the CPU.
In that case, if loading from a checkpoint that was trained with flashattention, you will need to specify transformer="normal" in the load_from_checkpoint function like so:
model = scPrint.load_from_checkpoint(
'../data/temp/last.ckpt', precpt_gene_emb=None,
transformer="normal")
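If the same script has to run both on machines with and without triton, the override can be chosen at run time. This is a minimal sketch under the assumption that an importable `triton` module is a good enough proxy for flashattention being usable; `checkpoint_overrides` is a hypothetical helper, not part of scPRINT.

```python
import importlib.util


def checkpoint_overrides(module_name="triton"):
    """Return extra keyword arguments for `load_from_checkpoint`.

    If `module_name` cannot be imported, request the "normal" transformer
    as described above; otherwise return no override and keep whatever
    the checkpoint was trained with.
    """
    if importlib.util.find_spec(module_name) is None:
        return {"transformer": "normal"}
    return {}


# Usage (not executed here, since it needs scPRINT and a checkpoint file):
# model = scPrint.load_from_checkpoint(
#     "../data/temp/last.ckpt", precpt_gene_emb=None, **checkpoint_overrides()
# )
```

Returning an empty dict when triton is present avoids guessing the name of the flashattention setting; the checkpoint's own value is simply left untouched.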
We now explore the different usages of scPRINT:
-> Refer to the section 1. gene network inference in this notebook.
-> More examples in this notebook ./notebooks/assessments/bench_omni.ipynb.
-> Refer to the embeddings and cell annotations section in this notebook.
-> Refer to the Denoising of B-cell section in this notebook.
-> More examples in our benchmark notebook ./notebooks/assessments/bench_denoising.ipynb.
-> Refer to the notebook nice_umap.ipynb.
/!\ WIP /!\
Model weights are available on Hugging Face.
Read the CONTRIBUTING.md file.
Read the training runs document to know more about how training was performed and the results there.
Acknowledgements: python template, laminDB, lightning.
Awesome Large Cell Model created by Jeremie Kalfon.