improving docs

jkobject · Aug 7, 2024 · 46f3e13
1 parent a04b4ad
commit 46f3e13

Showing 12 changed files with 132 additions and 115 deletions.
diff --git a/README.md b/README.md
@@ -2,13 +2,12 @@
 # scPRINT: Large Cell Model for scRNAseq data
 
 [![PyPI version](https://badge.fury.io/py/scprint.svg)](https://badge.fury.io/py/scprint)
-[![Documentation Status](https://readthedocs.org/projects/scprint/badge/?version=latest)](https://scprint.readthedocs.io/en/latest/?badge=latest)
 [![Downloads](https://pepy.tech/badge/scprint)](https://pepy.tech/project/scprint)
 [![Downloads](https://pepy.tech/badge/scprint/month)](https://pepy.tech/project/scprint)
 [![Downloads](https://pepy.tech/badge/scprint/week)](https://pepy.tech/project/scprint)
 [![GitHub issues](https://img.shields.io/github/issues/jkobject/scPRINT)](https://img.shields.io/github/issues/jkobject/scPRINT)
 [![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
-[![DOI](https://zenodo.org/badge/391909874.svg)]()
+[![DOI](https://zenodo.org/badge/391909874.svg)](https://doi.org/10.1101/2024.07.29.605556)
 
 ![logo](docs/logo.png)
 
@@ -23,7 +22,7 @@ scPRINT can be used to perform the following analyses:
 - __label prediction__: predict the cell type, disease, sequencer, sex, and ethnicity of your cells
 - __gene network inference__: generate a gene network from any cell or cell cluster in your scRNAseq dataset
 
-[Read the paper!](https://www.biorxiv.org/content/10.1101/2024.07.29.605556v1) if you would like to know more about scPRINT.
+[Read the manuscript!](https://www.biorxiv.org/content/10.1101/2024.07.29.605556v1) if you would like to know more about scPRINT.
 
 ![figure1](docs/figure1.png)
 
@@ -36,16 +35,21 @@ scPRINT can be used to perform the following analyses:
   - [Usage](#usage)
     - [scPRINT's basic commands](#scprints-basic-commands)
     - [Notes on GPU/CPU usage with triton](#notes-on-gpucpu-usage-with-triton)
+  - [FAQ](#faq)
     - [I want to generate gene networks from scRNAseq data:](#i-want-to-generate-gene-networks-from-scrnaseq-data)
     - [I want to generate cell embeddings and cell label predictions from scRNAseq data:](#i-want-to-generate-cell-embeddings-and-cell-label-predictions-from-scrnaseq-data)
-    - [I want to denoising my scRNAseq dataset:](#i-want-to-denoising-my-scrnaseq-dataset)
+    - [I want to denoise my scRNAseq dataset:](#i-want-to-denoise-my-scrnaseq-dataset)
     - [I want to generate an atlas-level embedding](#i-want-to-generate-an-atlas-level-embedding)
     - [I need to generate gene tokens using pLLMs](#i-need-to-generate-gene-tokens-using-pllms)
     - [I want to pre-train scPRINT from scratch on my own data](#i-want-to-pre-train-scprint-from-scratch-on-my-own-data)
-    - [Documentation](#documentation)
-    - [Model Weights](#model-weights)
+    - [how can I find if scPRINT was trained on my data?](#how-can-i-find-if-scprint-was-trained-on-my-data)
+    - [can I use scPRINT on other organisms rather than human?](#can-i-use-scprint-on-other-organisms-rather-than-human)
+    - [how long does scPRINT takes? what kind of resources do I need? (or in alternative: can i run scPRINT locally?)](#how-long-does-scprint-takes-what-kind-of-resources-do-i-need-or-in-alternative-can-i-run-scprint-locally)
+    - [I have different scRNASeq batches. Should I integrate my data before running scPRINT?](#i-have-different-scrnaseq-batches-should-i-integrate-my-data-before-running-scprint)
+  - [Documentation](#documentation)
+  - [Model Weights](#model-weights)
   - [Development](#development)
-  - [Work in progress:](#work-in-progress)
+  - [Work in progress (PR welcomed):](#work-in-progress-pr-welcomed)
 
 
 ## Install `scPRINT`
@@ -54,15 +58,15 @@ For the moment scPRINT has been tested on MacOS and Linux (Ubuntu 20.04) with Py
 
 If you want to be using flashattention2, know that it only supports triton 2.0 MLIR's version and torch==2.0.0 for now.
 
-```python
+```bash
 conda create -n "[whatever]" python==3.10
-git clone https://github.com/jkobject/scPRINT
 #one of
-pip install scPRINT # OR
-pip install scPRINT[dev] # for the dev dependencies (building etc..) AND/OR [dev,flash]
-pip install scPRINT[flash] && pip install -e "git+https:/
+pip install scprint # OR
+pip install scprint[dev] # for the dev dependencies (building etc..) OR
+pip install scprint[flash] && pip install -e "git+https:/
 /github.com/triton-lang/triton.git@legacy-backend
 #egg=triton&subdirectory=python" # to use flashattention2, you will need to install triton 2.0.0.dev20221202 specifically, working on removing this dependency # only if you have a compatible gpu (e.g. not available for apple GPUs for now, see https://github.com/triton-lang/triton?tab=readme-ov-file#compatibility)
+#OR pip install scPRINT[dev,flash]
 ```
 
 We make use of some additional packages we developed alongside scPRint.
@@ -126,6 +130,8 @@ model = scPrint.load_from_checkpoint(
 
 We now explore the different usages of scPRINT:
 
+## FAQ
+
 ### I want to generate gene networks from scRNAseq data:
 
 -> Refer to the section . gene network inference in [this notebook](./docs/notebooks/cancer_usecase.ipynb#).
@@ -136,7 +142,7 @@ We now explore the different usages of scPRINT:
 
 -> Refer to the embeddings and cell annotations section in [this notebook](./docs/notebooks/cancer_usecase.ipynb#).
 
-### I want to denoising my scRNAseq dataset:
+### I want to denoise my scRNAseq dataset:
 
 -> Refer to the Denoising of B-cell section in [this notebook](./docs/notebooks/cancer_usecase.ipynb).
 
@@ -156,11 +162,27 @@ To run scPRINT, you can use the option to define the gene tokens using protein l
 
 -> Refer to the documentation page [pretrain scprint](docs/pretrain.md)
 
-### Documentation
+### how can I find if scPRINT was trained on my data?
+
+If your data is available in cellxgene, scPRINT was likely trained on it. However, some cells and datasets were dropped because of low data quality, and some were randomly held out as part of the validation and test sets.
+
+### can I use scPRINT on other organisms rather than human?
+
+scPRINT has been pretrained on both human and mouse data, and can be used on any organism with a similar gene set. If you want to use scPRINT on very different organisms, you will need to generate gene embeddings for that organism and re-train scPRINT.
+
+### how long does scPRINT takes? what kind of resources do I need? (or in alternative: can i run scPRINT locally?)
+
+Please see the supplementary tables in the [manuscript](https://www.biorxiv.org/content/10.1101/2024.07.29.605556v1).
+
+### I have different scRNASeq batches. Should I integrate my data before running scPRINT?
+
+scPRINT takes raw counts as input, so please don't use integrated data. Just give the raw counts to scPRINT and it will take care of the rest.
+
+## Documentation
 
-For more information on usage please see the documentation in [https://www.jkobject.com/scPrint/](https://www.jkobject.com/scPrint/)
+For more information on usage please see the documentation in [https://www.jkobject.com/scPRINT/](https://www.jkobject.com/scPRINT/)
 
-### Model Weights
+## Model Weights
 
 Model weights are available on [hugging face](https://huggingface.co/jkobject/scPRINT/).
 
@@ -175,12 +197,12 @@ Acknowledgement:
 [laminDB](https://lamin.ai/)
 [lightning](https://lightning.ai/)
 
-## Work in progress:
+## Work in progress (PR welcomed):
 
 1. remove the triton dependencies
 2. add version with additional labels (tissues, age) and organisms (mouse, zebrafish) and more datasets from cellxgene
 3. version with separate transformer blocks for the encoding part of the bottleneck learning and for the cell embeddings
 4. improve classifier to output uncertainties and topK predictions when unsure
-5. 
+5. set up the latest lamindb version
 
 Awesome Large Cell Model created by Jeremie Kalfon.
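
Review note on the new FAQ entry about batches: the README now says scPRINT expects raw counts rather than integrated data. A minimal, dependency-free sketch of a sanity check a user could run before calling the model is given below. The helper name `looks_like_raw_counts` is hypothetical and not part of scPRINT, and the check is only a heuristic (non-negative whole numbers); in practice the matrix would come from something like `adata.X`.

```python
def looks_like_raw_counts(matrix):
    """Heuristic: True if every entry is a non-negative whole number.

    `matrix` is any iterable of rows of numbers (cells x genes).
    Normalized, log-transformed or integrated values usually fail this test.
    """
    return all(
        value >= 0 and float(value).is_integer()
        for row in matrix
        for value in row
    )


# Toy matrices with made-up values, for illustration only.
raw = [[0, 3, 1], [12, 0, 0]]
normalized = [[0.0, 1.39, 0.69], [2.56, 0.0, 0.0]]

print(looks_like_raw_counts(raw))
print(looks_like_raw_counts(normalized))
```

Passing this check does not prove the data is untouched, but failing it is a strong hint that the matrix was normalized or integrated upstream.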
diff --git a/docs/index.md b/docs/index.md
@@ -36,17 +36,15 @@ If you want to be using flashattention2, know that it only supports triton 2.0 M
 
 ```python
 conda create -n "[whatever]" python==3.10
-git clone https://github.com/jkobject/scPRINT
 #one of
-pip install scPRINT # OR
-pip install scPRINT[dev] # for the dev dependencies (building etc..) AND/OR [dev,flash]
-pip install scPRINT[flash] && pip install -e "git+https:/
+pip install scprint # OR
+pip install scprint[dev] # for the dev dependencies (building etc..) AND/OR [dev,flash]
+pip install scprint[flash] && pip install -e "git+https:/
 /github.com/triton-lang/triton.git@legacy-backend
 #egg=triton&subdirectory=python" # to use flashattention2, you will need to install triton 2.0.0.dev20221202 specifically, working on removing this dependency # only if you have a compatible gpu (e.g. not available for apple GPUs for now, see https://github.com/triton-lang/triton?tab=readme-ov-file#compatibility)
 ```
 
 We make use of some additional packages we developed alongside scPRint.
-
 Please refer to their documentation for more information:
 
 - [scDataLoader](https://github.com/jkobject/scDataLoader): a dataloader for training large cell models.
@@ -106,6 +104,8 @@ model = scPrint.load_from_checkpoint(
 
 We now explore the different usages of scPRINT:
 
+## FAQ
+
 ### I want to generate gene networks from scRNAseq data:
 
 -> Refer to the section . gene network inference in [this notebook](./notebooks/cancer_usecase.ipynb#).
@@ -136,11 +136,27 @@ To run scPRINT, you can use the option to define the gene tokens using protein l
 
 -> Refer to the documentation page [pretrain scprint](pretrain.md)
 
-### Documentation
+### how can I find if scPRINT was trained on my data?
+
+If your data is available in cellxgene, scPRINT was likely trained on it. However, some cells and datasets were dropped because of low data quality, and some were randomly held out as part of the validation and test sets.
+
+### can I use scPRINT on other organisms rather than human?
+
+scPRINT has been pretrained on both human and mouse data, and can be used on any organism with a similar gene set. If you want to use scPRINT on very different organisms, you will need to generate gene embeddings for that organism and re-train scPRINT.
+
+### how long does scPRINT takes? what kind of resources do I need? (or in alternative: can i run scPRINT locally?)
+
+Please see the supplementary tables in the [manuscript](https://www.biorxiv.org/content/10.1101/2024.07.29.605556v1).
+
+### I have different scRNASeq batches. Should I integrate my data before running scPRINT?
+
+scPRINT takes raw counts as input, so please don't use integrated data. Just give the raw counts to scPRINT and it will take care of the rest.
+
+## Documentation
 
-For more information on usage please see the documentation in [https://www.jkobject.com/scPrint/](https://www.jkobject.com/scPrint/)
+For more information on usage please see the documentation in [https://www.jkobject.com/scPRINT/](https://www.jkobject.com/scPRINT/)
 
-### Model Weights
+## Model Weights
 
 Model weights are available on [hugging face](https://huggingface.co/jkobject/scPRINT/).
 

diff --git a/docs/notebooks/cancer_usecase.ipynb b/docs/notebooks/cancer_usecase.ipynb
@@ -4,19 +4,25 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "# Cancer usecase\n",
+    "# scPRINT use case on BPH\n",
     "\n",
     "In this use-case, also presented in Figure 5 of our [manuscript](https://www.biorxiv.org/content/10.1101/2024.07.29.605556v1), we perform an extensive analysis of a multi studies dataset of benign prostatic hyperplasia. \n",
     "\n",
-    "We would want to know if in this pre-cancerous state of the prostate, there exist cells that start to resemble cancerous ones. In those cells we would want to know which genes create might be implicated in these cell state changes. Providing us with potentially novel targets in the treatment of prostate cancer and BPH.\n",
+    "Our biological question is whether there exist pre-cancerous cells that exhibit behaviors of mature cancer cells at this early stage of the disease. \n",
     "\n",
-    "We start with a fresh datasets coming from the cellxgene database and representing [2 studies of BPH](https://pathsocjournals.onlinelibrary.wiley.com/doi/10.1002/path.5751).\n",
+    "In those cells, we want to know which genes might be implicated in cell state changes, and explore potentially novel targets in the treatment of prostate cancer and BPH.\n",
     "\n",
-    "From these dataset we will ask many questions, like what are the cell types, what are the cell distributions, what sequencers were used, etc. \n",
+    "We will start with a fresh dataset coming from the [cellXgene database](https://cellxgene.cziscience.com/) and representing [2 studies of BPH](https://pathsocjournals.onlinelibrary.wiley.com/doi/10.1002/path.5751).\n",
     "\n",
-    "We might want to find novel targets or confirm existing ones. \n",
+    "We will first explore this dataset to understand:\n",
     "\n",
-    "Finally we might want to know how these targets interact with each other and form pathways. \n",
+    "- what are the cell types that are present in the data\n",
+    "- what are the cell distributions (cell distributions? what are they?)\n",
+    "- what sequencers were used, etc.\n",
+    "\n",
+    "We also want to confirm existing targets in prostate cancer through precancerous lesion analysis, and find potentially novel ones that would serve as less invasive BPH treatments than current ones.\n",
+    "\n",
+    "Finally we want to know how these targets interact and are involved in biological pathways.\n",
     "\n",
     "We now showcase how to use scPRINT across its different functionalities to answer some of these questions.\n",
     "\n",
@@ -109,7 +115,9 @@
     "\n",
     "We then use scDataloader's preprocessing method. This method is quite extensive and does a few things.. find our more about it [on its documentation](https://www.jkobject.com/scDataLoader/).\n",
     "\n",
-    "On our end we are using it mostly to make sure that the data is raw count and that there is enough genes expressed and enough counts per cells in the dataset. It will also increase the size of the expression matrix to be a fixed set of genes defined by the latest version of ensembl."
+    "On our end we are using the preprocessor to make sure that the gene expression values we have are raw counts and that we have enough information to use scPRINT (i.e., enough genes expressed and enough counts per cell across the dataset). \n",
+    " \n",
+    "Finally, the preprocessor will also increase the size of the expression matrix to be a fixed set of genes defined by the latest version of Ensembl."
    ]
   },
   {
@@ -200,7 +208,7 @@
    "source": [
     "## Embedding and annotations\n",
     "\n",
-    "We now start to load a large version of scprint from a specific checkpoint. Please [download](https://huggingface.co/jkobject/scPRINT/tree/main) the checkpoints following the instructions in the README.\n",
+    "We now start to load a large version of scPRINT from a specific checkpoint. Please [download](https://huggingface.co/jkobject/scPRINT/tree/main) the checkpoints following the instructions in the README.\n",
     "\n",
     "We will then use out Embedder class to embed the data and annotate the cells. These classes are how we parametrize and access the different functions of `scPRINT`. Find out more about its parameters in our [documentation](https://www.jkobject.com/scPrint/).\n",
     "\n"
@@ -369,11 +377,15 @@
    "source": [
     "## Annotation cleanup\n",
     "\n",
-    "Since scPRINT generates predictions over hundreds of possible labels and do it per cell. it is often nice to cleanup the predictions. Here we use the most straightforward approach to remove any annotations that appear a small number of times.\n",
+    "scPRINT generates predictions over hundreds of possible labels for each cell. \n",
+    "\n",
+    "It is often advised to \"clean up\" the predictions, e.g. making sure to remove low-frequency cells and mislabelings. \n",
     "\n",
-    "But a better approach would be majority voting over some cell cluster.\n",
+    "Here, we use the most straightforward approach which is to remove any annotations that appear a small number of times.\n",
     "\n",
-    "We will also have a look at the embeddings of `scPRINT` by plotting its umap visualization."
+    "A better approach would be majority voting over cell clusters, as it would aggregate and smooth out the predictions over multiple cells. It would also remove most of the low-frequency mistakes in the predictions.\n",
+    "\n",
+    "We will also have a look at the embeddings of `scPRINT` by plotting its UMAP visualization.\n"
    ]
   },
   {
@@ -589,7 +601,9 @@
    "source": [
     "## Clustering and differential expression\n",
     "\n",
-    "We will now cluster using the louvain algorithm on a kNN graph. Once we detect a cluster of interest we will perform differential expression analysis on it. Here taking as example some B-cell clusters. We will use scanpy's implementation of rank_gene_groups for our differential expression"
+    "We will now cluster using the Louvain algorithm on a kNN graph. \n",
+    "\n",
+    "Once we detect a cluster of interest we will perform differential expression analysis on it. Taking as example some B-cell clusters, we will use scanpy's implementation of `rank_genes_groups` for our differential expression."
    ]
   },
   {
@@ -891,11 +905,13 @@
    "source": [
     "## Denoising and differential expression\n",
     "\n",
-    "What we found out from our previous analysis is that there is not a lot of normal B-cells in our cluster. If we wanted to compare BPH B-cells to normal B-cells we might be very underpowered...\n",
+    "What we found out from our previous analysis is that there are not many normal (i.e. healthy) B-cells in our cluster; most of them are BPH-associated. In this case, if we wanted to compare BPH B-cells to normal B-cells we might be very underpowered...\n",
     "\n",
     "Instead of going to look for some other dataset, let's use `scPRINT` to increase the depth of the expression profile of the cells, virtually adding more signal to our dataset.\n",
     "\n",
-    "We will use another class from scPRINT, the `Denoiser` (see more about the class in our [documentation](https://www.jkobject.com/scPrint/)). We will then show the results of differential expression analysis before and after denoising."
+    "We will use the `Denoiser` class (see more about the class in our [documentation](https://www.jkobject.com/scPrint/)) to denoise the expression profile of the cells, in a similar way to how `Trainer` is used in PyTorch Lightning.\n",
+    "\n",
+    "We will then show the results of differential expression analysis before and after denoising."
    ]
   },
   {
@@ -1052,7 +1068,7 @@
     "\n",
     "Finally we will use scPRINT to infer gene networks on another cell of interest, the fibroblasts, in both normal and BPH conditions.\n",
     "\n",
-    "We will  use the `GRNfer` scPRINT class to infer gene networks. see the cancer_usecase_part2.ipynb for more details on how to analyse the gene networks."
+    "We will use the `GRNfer` class to infer gene networks. (_See [cancer_usecase_part2.ipynb](./cancer_usecase_part2.ipynb) for more details on how to analyse the gene networks._)"
    ]
   },
   {
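
Review note on the "Annotation cleanup" cell above: the reworded text names two strategies, dropping labels that appear only a few times and majority voting over cell clusters. A minimal sketch of both is given below, kept dependency-free so it runs anywhere. The function names, the `min_count` threshold and the `"unknown"` fallback are assumptions made for illustration and are not taken from the notebook; in practice the inputs would be per-cell columns such as the predicted labels and cluster ids stored alongside the expression data.

```python
from collections import Counter


def drop_rare_labels(labels, min_count=2, fallback="unknown"):
    """Replace labels seen fewer than `min_count` times with `fallback`."""
    counts = Counter(labels)
    return [lab if counts[lab] >= min_count else fallback for lab in labels]


def majority_vote(labels, clusters):
    """Assign every cell the most frequent label within its cluster."""
    per_cluster = {}
    for lab, clu in zip(labels, clusters):
        per_cluster.setdefault(clu, Counter())[lab] += 1
    # Pick the top label of each cluster (ties go to the label seen first).
    winner = {clu: c.most_common(1)[0][0] for clu, c in per_cluster.items()}
    return [winner[clu] for clu in clusters]


# Toy per-cell predictions and cluster assignments (made-up values).
predicted = ["B cell", "B cell", "T cell", "B cell",
             "fibroblast", "fibroblast", "B cell"]
clusters = [0, 0, 0, 0, 1, 1, 1]

print(drop_rare_labels(predicted))
print(majority_vote(predicted, clusters))
```

The first function only removes isolated labels, while the second also overrides a well-represented label when it is a minority inside its cluster, which is why the notebook describes it as smoothing the predictions.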

diff --git a/docs/notebooks/cancer_usecase_part2.ipynb b/docs/notebooks/cancer_usecase_part2.ipynb
@@ -4,7 +4,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "# Cancer usecase (part 2)\n",
+    "# scPRINT use case on BPH (part 2, GN analysis)\n",
     "\n",
     "In this use-case, which some of the results are presented in Figure 5 of our [manuscript](https://www.biorxiv.org/content/10.1101/2024.07.29.605556v1), we perform an extensive analysis of gene networks generated by scPRINT in our previous [notebook](./cancer_usecase.ipynb) for fibroblasts of the prostate in both normal and benign prostatic hyperplasia states. \n",
     "\n",

diff --git a/docs/structure.md b/docs/structure.md
@@ -1,6 +1,5 @@
 # structure
 
-
 ## gene embedders
 
 Function to get embeddings from a set of genes, given their ensembl ids. For now use 2 different models:

diff --git a/notebooks/bench_omni.ipynb b/notebooks/bench_omni.ipynb
@@ -25858,7 +25858,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.10.0"
+   "version": "3.10.14"
   },
   "papermill": {
    "default_parameters": {},

diff --git a/notebooks/bench_perturbseq.ipynb b/notebooks/bench_perturbseq.ipynb
@@ -14,7 +14,7 @@
     "tags": []
    },
    "source": [
-    "# grn bench perturb seq\n"
+    "# grn bench on genome wide perturb seq\n"
    ]
   },
   {
@@ -4510,7 +4510,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.10.0"
+   "version": "3.10.14"
   },
   "papermill": {
    "default_parameters": {},