This repository has been archived by the owner on Sep 19, 2024. It is now read-only.

Commit

improving docs
jkobject committed Aug 7, 2024
1 parent a04b4ad commit 46f3e13
Showing 12 changed files with 132 additions and 115 deletions.
58 changes: 40 additions & 18 deletions README.md
Expand Up @@ -2,13 +2,12 @@
# scPRINT: Large Cell Model for scRNAseq data

[![PyPI version](https://badge.fury.io/py/scprint.svg)](https://badge.fury.io/py/scprint)
[![Documentation Status](https://readthedocs.org/projects/scprint/badge/?version=latest)](https://scprint.readthedocs.io/en/latest/?badge=latest)
[![Downloads](https://pepy.tech/badge/scprint)](https://pepy.tech/project/scprint)
[![Downloads](https://pepy.tech/badge/scprint/month)](https://pepy.tech/project/scprint)
[![Downloads](https://pepy.tech/badge/scprint/week)](https://pepy.tech/project/scprint)
[![GitHub issues](https://img.shields.io/github/issues/jkobject/scPRINT)](https://img.shields.io/github/issues/jkobject/scPRINT)
[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
[![DOI](https://zenodo.org/badge/391909874.svg)]()
[![DOI](https://zenodo.org/badge/391909874.svg)](https://doi.org/10.1101/2024.07.29.605556)

![logo](docs/logo.png)

Expand All @@ -23,7 +22,7 @@ scPRINT can be used to perform the following analyses:
- __label prediction__: predict the cell type, disease, sequencer, sex, and ethnicity of your cells
- __gene network inference__: generate a gene network from any cell or cell cluster in your scRNAseq dataset

[Read the paper!](https://www.biorxiv.org/content/10.1101/2024.07.29.605556v1) if you would like to know more about scPRINT.
[Read the manuscript!](https://www.biorxiv.org/content/10.1101/2024.07.29.605556v1) if you would like to know more about scPRINT.

![figure1](docs/figure1.png)

Expand All @@ -36,16 +35,21 @@ scPRINT can be used to perform the following analyses:
- [Usage](#usage)
- [scPRINT's basic commands](#scprints-basic-commands)
- [Notes on GPU/CPU usage with triton](#notes-on-gpucpu-usage-with-triton)
- [FAQ](#faq)
- [I want to generate gene networks from scRNAseq data:](#i-want-to-generate-gene-networks-from-scrnaseq-data)
- [I want to generate cell embeddings and cell label predictions from scRNAseq data:](#i-want-to-generate-cell-embeddings-and-cell-label-predictions-from-scrnaseq-data)
- [I want to denoising my scRNAseq dataset:](#i-want-to-denoising-my-scrnaseq-dataset)
- [I want to denoise my scRNAseq dataset:](#i-want-to-denoise-my-scrnaseq-dataset)
- [I want to generate an atlas-level embedding](#i-want-to-generate-an-atlas-level-embedding)
- [I need to generate gene tokens using pLLMs](#i-need-to-generate-gene-tokens-using-pllms)
- [I want to pre-train scPRINT from scratch on my own data](#i-want-to-pre-train-scprint-from-scratch-on-my-own-data)
- [Documentation](#documentation)
- [Model Weights](#model-weights)
- [how can I find if scPRINT was trained on my data?](#how-can-i-find-if-scprint-was-trained-on-my-data)
- [can I use scPRINT on other organisms rather than human?](#can-i-use-scprint-on-other-organisms-rather-than-human)
- [how long does scPRINT takes? what kind of resources do I need? (or in alternative: can i run scPRINT locally?)](#how-long-does-scprint-takes-what-kind-of-resources-do-i-need-or-in-alternative-can-i-run-scprint-locally)
- [I have different scRNASeq batches. Should I integrate my data before running scPRINT?](#i-have-different-scrnaseq-batches-should-i-integrate-my-data-before-running-scprint)
- [Documentation](#documentation)
- [Model Weights](#model-weights)
- [Development](#development)
- [Work in progress:](#work-in-progress)
- [Work in progress (PR welcomed):](#work-in-progress-pr-welcomed)


## Install `scPRINT`
Expand All @@ -54,15 +58,15 @@ For the moment scPRINT has been tested on MacOS and Linux (Ubuntu 20.04) with Py

If you want to be using flashattention2, know that it only supports triton 2.0 MLIR's version and torch==2.0.0 for now.

```python
```bash
conda create -n "[whatever]" python==3.10
git clone https://github.com/jkobject/scPRINT
#one of
pip install scPRINT # OR
pip install scPRINT[dev] # for the dev dependencies (building etc..) AND/OR [dev,flash]
pip install scPRINT[flash] && pip install -e "git+https:/
pip install scprint # OR
pip install scprint[dev] # for the dev dependencies (building etc..) OR
pip install scprint[flash] && pip install -e "git+https://github.com/triton-lang/triton.git@legacy-backend#egg=triton&subdirectory=python" # to use flashattention2, you will need to install triton 2.0.0.dev20221202 specifically, working on removing this dependency # only if you have a compatible gpu (e.g. not available for apple GPUs for now, see https://github.com/triton-lang/triton?tab=readme-ov-file#compatibility)
#OR pip install scPRINT[dev,flash]
```

We make use of some additional packages we developed alongside scPRINT.
Expand Down Expand Up @@ -126,6 +130,8 @@ model = scPrint.load_from_checkpoint(

We now explore the different usages of scPRINT:

## FAQ

### I want to generate gene networks from scRNAseq data:

-> Refer to the gene network inference section in [this notebook](./docs/notebooks/cancer_usecase.ipynb#).
Expand All @@ -136,7 +142,7 @@ We now explore the different usages of scPRINT:

-> Refer to the embeddings and cell annotations section in [this notebook](./docs/notebooks/cancer_usecase.ipynb#).

### I want to denoising my scRNAseq dataset:
### I want to denoise my scRNAseq dataset:

-> Refer to the Denoising of B-cell section in [this notebook](./docs/notebooks/cancer_usecase.ipynb).

Expand All @@ -156,11 +162,27 @@ To run scPRINT, you can use the option to define the gene tokens using protein l

-> Refer to the documentation page [pretrain scprint](docs/pretrain.md)

### Documentation
### how can I find if scPRINT was trained on my data?

If your data is available in cellxgene, scPRINT was likely trained on it. However, some cells and datasets were dropped due to low-quality data, and some were randomly held out as part of the validation / test sets.

### can I use scPRINT on other organisms rather than human?

scPRINT has been pretrained on both human and mouse data and can be used on any organism with a similar gene set. If you want to use scPRINT on very different organisms, you will need to generate gene embeddings for that organism and re-train scPRINT.

### how long does scPRINT takes? what kind of resources do I need? (or in alternative: can i run scPRINT locally?)

Please look at the supplementary tables in our [manuscript](https://www.biorxiv.org/content/10.1101/2024.07.29.605556v1).

### I have different scRNASeq batches. Should I integrate my data before running scPRINT?

scPRINT takes raw counts as input, so please don't use integrated data. Just give the raw counts to scPRINT and it will take care of the rest.
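
As a quick sanity check before running scPRINT, one can verify that an expression matrix still looks like raw counts, i.e. that it holds only non-negative whole numbers; normalized, log-transformed, or batch-integrated values are usually fractional and sometimes negative. The sketch below is plain Python and is not part of scPRINT: `looks_like_raw_counts` is a hypothetical helper name, and it assumes the matrix is available as rows of numbers (for an AnnData object, an equivalent check would typically be run on the values of `adata.X`).

```python
from typing import Iterable


def looks_like_raw_counts(rows: Iterable[Iterable[float]]) -> bool:
    """Return True if every value is a non-negative whole number.

    Hypothetical helper (not a scPRINT function): raw UMI/read counts are
    non-negative integers, whereas normalized, log-transformed or
    batch-corrected values are usually fractional and sometimes negative.
    """
    for row in rows:
        for value in row:
            # NaN and infinity are rejected too: is_integer() is False for both.
            if value < 0 or not float(value).is_integer():
                return False
    return True


raw = [[0, 3, 12], [1, 0, 7]]
log_normalized = [[0.0, 1.386, 2.565], [0.693, 0.0, 2.079]]
print(looks_like_raw_counts(raw))             # True
print(looks_like_raw_counts(log_normalized))  # False
```

Passing this check does not prove that the data are raw counts (rounded values would pass as well), but failing it is a strong hint that the matrix was transformed upstream.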

## Documentation

For more information on usage please see the documentation in [https://www.jkobject.com/scPrint/](https://www.jkobject.com/scPrint/)
For more information on usage please see the documentation in [https://www.jkobject.com/scPRINT/](https://www.jkobject.com/scPRINT/)

### Model Weights
## Model Weights

Model weights are available on [hugging face](https://huggingface.co/jkobject/scPRINT/).

Expand All @@ -175,12 +197,12 @@ Acknowledgement:
[laminDB](https://lamin.ai/)
[lightning](https://lightning.ai/)

## Work in progress:
## Work in progress (PR welcomed):

1. remove the triton dependencies
2. add version with additional labels (tissues, age) and organisms (mouse, zebrafish) and more datasets from cellxgene
3. version with separate transformer blocks for the encoding part of the bottleneck learning and for the cell embeddings
4. improve classifier to output uncertainties and topK predictions when unsure
5.
5. setup latest lamindb version

Awesome Large Cell Model created by Jeremie Kalfon.
32 changes: 24 additions & 8 deletions docs/index.md
Expand Up @@ -36,17 +36,15 @@ If you want to be using flashattention2, know that it only supports triton 2.0 M

```python
conda create -n "[whatever]" python==3.10
git clone https://github.com/jkobject/scPRINT
#one of
pip install scPRINT # OR
pip install scPRINT[dev] # for the dev dependencies (building etc..) AND/OR [dev,flash]
pip install scPRINT[flash] && pip install -e "git+https:/
pip install scprint # OR
pip install scprint[dev] # for the dev dependencies (building etc..) AND/OR [dev,flash]
pip install scprint[flash] && pip install -e "git+https://github.com/triton-lang/triton.git@legacy-backend#egg=triton&subdirectory=python" # to use flashattention2, you will need to install triton 2.0.0.dev20221202 specifically, working on removing this dependency # only if you have a compatible gpu (e.g. not available for apple GPUs for now, see https://github.com/triton-lang/triton?tab=readme-ov-file#compatibility)
```

We make use of some additional packages we developed alongside scPRINT.

Please refer to their documentation for more information:

- [scDataLoader](https://github.com/jkobject/scDataLoader): a dataloader for training large cell models.
Expand Down Expand Up @@ -106,6 +104,8 @@ model = scPrint.load_from_checkpoint(

We now explore the different usages of scPRINT:

## FAQ

### I want to generate gene networks from scRNAseq data:

-> Refer to the gene network inference section in [this notebook](./notebooks/cancer_usecase.ipynb#).
Expand Down Expand Up @@ -136,11 +136,27 @@ To run scPRINT, you can use the option to define the gene tokens using protein l

-> Refer to the documentation page [pretrain scprint](pretrain.md)

### Documentation
### how can I find if scPRINT was trained on my data?

If your data is available in cellxgene, scPRINT was likely trained on it. However, some cells and datasets were dropped due to low-quality data, and some were randomly held out as part of the validation / test sets.

### can I use scPRINT on other organisms rather than human?

scPRINT has been pretrained on both human and mouse data and can be used on any organism with a similar gene set. If you want to use scPRINT on very different organisms, you will need to generate gene embeddings for that organism and re-train scPRINT.

### how long does scPRINT takes? what kind of resources do I need? (or in alternative: can i run scPRINT locally?)

Please look at the supplementary tables in our [manuscript](https://www.biorxiv.org/content/10.1101/2024.07.29.605556v1).

### I have different scRNASeq batches. Should I integrate my data before running scPRINT?

scPRINT takes raw counts as input, so please don't use integrated data. Just give the raw counts to scPRINT and it will take care of the rest.

## Documentation

For more information on usage please see the documentation in [https://www.jkobject.com/scPrint/](https://www.jkobject.com/scPrint/)
For more information on usage please see the documentation in [https://www.jkobject.com/scPRINT/](https://www.jkobject.com/scPRINT/)

### Model Weights
## Model Weights

Model weights are available on [hugging face](https://huggingface.co/jkobject/scPRINT/).

Expand Down
46 changes: 31 additions & 15 deletions docs/notebooks/cancer_usecase.ipynb
Expand Up @@ -4,19 +4,25 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"# Cancer usecase\n",
"# scPRINT use case on BPH\n",
"\n",
"In this use-case, also presented in Figure 5 of our [manuscript](https://www.biorxiv.org/content/10.1101/2024.07.29.605556v1), we perform an extensive analysis of a multi studies dataset of benign prostatic hyperplasia. \n",
"\n",
"We would want to know if in this pre-cancerous state of the prostate, there exist cells that start to resemble cancerous ones. In those cells we would want to know which genes create might be implicated in these cell state changes. Providing us with potentially novel targets in the treatment of prostate cancer and BPH.\n",
"Our biological question is whether there exist pre-cancerous cells that exhibit behaviors of mature cancer cells at this early stage of the disease. \n",
"\n",
"We start with a fresh datasets coming from the cellxgene database and representing [2 studies of BPH](https://pathsocjournals.onlinelibrary.wiley.com/doi/10.1002/path.5751).\n",
"In those cells, we want to know which genes might be implicated in cell state changes, and explore potentially novel targets in the treatment of prostate cancer and BPH.\n",
"\n",
"From these dataset we will ask many questions, like what are the cell types, what are the cell distributions, what sequencers were used, etc. \n",
"We will start with a fresh dataset coming from the [cellXgene database](https://cellxgene.cziscience.com/) and representing [2 studies of BPH](https://pathsocjournals.onlinelibrary.wiley.com/doi/10.1002/path.5751).\n",
"\n",
"We might want to find novel targets or confirm existing ones. \n",
"We will first explore this dataset to understand:\n",
"\n",
"Finally we might want to know how these targets interact with each other and form pathways. \n",
"- what are the cell types that are present in the data\n",
"- what are the cell distributions (cell distributions? what are they?)\n",
"- what sequencers were used, etc.\n",
"\n",
"We also want to confirm existing targets in prostate cancer through precancerous lesion analysis, and find potentially novel ones that could serve as less invasive BPH treatments than current ones.\n",
"\n",
"Finally, we want to know how these targets interact and are involved in biological pathways.\n",
"\n",
"We now showcase how to use scPRINT across its different functionalities to answer some of these questions.\n",
"\n",
Expand Down Expand Up @@ -109,7 +115,9 @@
"\n",
"We then use scDataLoader's preprocessing method. This method is quite extensive and does a few things; find out more about it [in its documentation](https://www.jkobject.com/scDataLoader/).\n",
"\n",
"On our end we are using it mostly to make sure that the data is raw count and that there is enough genes expressed and enough counts per cells in the dataset. It will also increase the size of the expression matrix to be a fixed set of genes defined by the latest version of ensembl."
"On our end we are using the preprocessor to make sure that the gene expression values we have are raw counts and that we have enough information to use scPRINT (i.e., enough genes expressed and enough counts per cell across the dataset). \n",
" \n",
"Finally, the preprocessor will also increase the size of the expression matrix to be a fixed set of genes defined by the latest version of Ensembl."
]
},
{
Expand Down Expand Up @@ -200,7 +208,7 @@
"source": [
"## Embedding and annotations\n",
"\n",
"We now start to load a large version of scprint from a specific checkpoint. Please [download](https://huggingface.co/jkobject/scPRINT/tree/main) the checkpoints following the instructions in the README.\n",
"We now start to load a large version of scPRINT from a specific checkpoint. Please [download](https://huggingface.co/jkobject/scPRINT/tree/main) the checkpoints following the instructions in the README.\n",
"\n",
"We will then use our Embedder class to embed the data and annotate the cells. These classes are how we parametrize and access the different functions of `scPRINT`. Find out more about its parameters in our [documentation](https://www.jkobject.com/scPrint/).\n",
"\n"
Expand Down Expand Up @@ -369,11 +377,15 @@
"source": [
"## Annotation cleanup\n",
"\n",
"Since scPRINT generates predictions over hundreds of possible labels and do it per cell. it is often nice to cleanup the predictions. Here we use the most straightforward approach to remove any annotations that appear a small number of times.\n",
"scPRINT generates predictions over hundreds of possible labels for each cell. \n",
"\n",
"It is often advised to \"clean up\" the predictions, e.g. by making sure to remove low-frequency cells and mislabelings. \n",
"\n",
"But a better approach would be majority voting over some cell cluster.\n",
"Here, we use the most straightforward approach which is to remove any annotations that appear a small number of times.\n",
"\n",
"We will also have a look at the embeddings of `scPRINT` by plotting its umap visualization."
"A better approach would be majority voting over cell clusters, as it would aggregate and smooth out the predictions over multiple cells. It would also remove most of the low-frequency mistakes in the predictions.\n",
"\n",
"We will also have a look at the embeddings of `scPRINT` by plotting its UMAP visualization.\n"
]
},
{
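
The two cleanup strategies described in the "Annotation cleanup" cell above can be sketched in plain Python. This is an illustration only, not the notebook's code: the function names, the `min_count` threshold, and the `"unknown"` placeholder are assumptions made for the example.

```python
from collections import Counter, defaultdict


def drop_rare_labels(labels, min_count=3, placeholder="unknown"):
    """Replace labels seen fewer than `min_count` times with `placeholder`."""
    counts = Counter(labels)
    return [lab if counts[lab] >= min_count else placeholder for lab in labels]


def majority_vote(labels, clusters):
    """Give every cell the most common predicted label of its cluster."""
    by_cluster = defaultdict(list)
    for lab, clu in zip(labels, clusters):
        by_cluster[clu].append(lab)
    # Most frequent label per cluster (the example data contains no ties).
    winner = {
        clu: Counter(labs).most_common(1)[0][0]
        for clu, labs in by_cluster.items()
    }
    return [winner[clu] for clu in clusters]


# Per-cell predicted labels and the cluster each cell belongs to (toy data).
predicted = ["B cell", "B cell", "B cell", "T cell",
             "B cell", "T cell", "T cell", "neuron"]
clusters = [0, 0, 0, 0, 1, 1, 1, 1]

print(drop_rare_labels(predicted, min_count=2))
print(majority_vote(predicted, clusters))
```

The frequency filter only masks rare labels, whereas the majority vote also overwrites isolated disagreements inside a cluster, which is why it smooths predictions more aggressively.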
Expand Down Expand Up @@ -589,7 +601,9 @@
"source": [
"## Clustering and differential expression\n",
"\n",
"We will now cluster using the louvain algorithm on a kNN graph. Once we detect a cluster of interest we will perform differential expression analysis on it. Here taking as example some B-cell clusters. We will use scanpy's implementation of rank_gene_groups for our differential expression"
"We will now cluster using the Louvain algorithm on a kNN graph. \n",
"\n",
"Once we detect a cluster of interest, we will perform differential expression analysis on it. Taking some B-cell clusters as an example, we will use scanpy's implementation of rank_genes_groups for our differential expression."
]
},
{
Expand Down Expand Up @@ -891,11 +905,13 @@
"source": [
"## Denoising and differential expression\n",
"\n",
"What we found out from our previous analysis is that there is not a lot of normal B-cells in our cluster. If we wanted to compare BPH B-cells to normal B-cells we might be very underpowered...\n",
"What we found out from our previous analysis is that there are not many normal (i.e. healthy) B-cells in our cluster; most of them are BPH-associated. In this case, if we wanted to compare BPH B-cells to normal B-cells, we might be very underpowered...\n",
"\n",
"Instead of going to look for some other dataset, let's use `scPRINT` to increase the depth of the expression profile of the cells, virtually adding more signal to our dataset.\n",
"\n",
"We will use another class from scPRINT, the `Denoiser` (see more about the class in our [documentation](https://www.jkobject.com/scPrint/)). We will then show the results of differential expression analysis before and after denoising."
"We will use the `Denoiser` class (see more about the class in our [documentation](https://www.jkobject.com/scPrint/)), in a similar way to how `Trainer` is used in PyTorch Lightning, to denoise the expression profile of the cells.\n",
"\n",
"We will then show the results of differential expression analysis before and after denoising."
]
},
{
Expand Down Expand Up @@ -1052,7 +1068,7 @@
"\n",
"Finally we will use scPRINT to infer gene networks on another cell of interest, the fibroblasts, in both normal and BPH conditions.\n",
"\n",
"We will use the `GRNfer` scPRINT class to infer gene networks. see the cancer_usecase_part2.ipynb for more details on how to analyse the gene networks."
"We will use the `GRNfer` class to infer gene networks. (_see the [cancer_usecase_part2.ipynb](./cancer_usecase_part2.ipynb) for more details on how to analyse the gene networks._)"
]
},
{
Expand Down
2 changes: 1 addition & 1 deletion docs/notebooks/cancer_usecase_part2.ipynb
Expand Up @@ -4,7 +4,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"# Cancer usecase (part 2)\n",
"# scPRINT use case on BPH (part 2, GN analysis)\n",
"\n",
"In this use-case, some results of which are presented in Figure 5 of our [manuscript](https://www.biorxiv.org/content/10.1101/2024.07.29.605556v1), we perform an extensive analysis of gene networks generated by scPRINT in our previous [notebook](./cancer_usecase.ipynb) for fibroblasts of the prostate in both normal and benign prostatic hyperplasia states. \n",
"\n",
Expand Down
1 change: 0 additions & 1 deletion docs/structure.md
@@ -1,6 +1,5 @@
# structure


## gene embedders

Function to get embeddings from a set of genes, given their ensembl ids. For now use 2 different models:
Expand Down
2 changes: 1 addition & 1 deletion notebooks/bench_omni.ipynb
Expand Up @@ -25858,7 +25858,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.0"
"version": "3.10.14"
},
"papermill": {
"default_parameters": {},
Expand Down
4 changes: 2 additions & 2 deletions notebooks/bench_perturbseq.ipynb
Expand Up @@ -14,7 +14,7 @@
"tags": []
},
"source": [
"# grn bench perturb seq\n"
"# grn bench on genome wide perturb seq\n"
]
},
{
Expand Down Expand Up @@ -4510,7 +4510,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.0"
"version": "3.10.14"
},
"papermill": {
"default_parameters": {},
Expand Down

0 comments on commit 46f3e13
