misc doc + prepare staining based tissue segmentation

gustaveroussy · Oct 16, 2024 · ea94902 · ea94902
1 parent 76c3f3d
commit ea94902
Show file tree

Hide file tree

Showing 8 changed files with 121 additions and 78 deletions.
diff --git a/README.md b/README.md
@@ -13,7 +13,7 @@
 
 Built on top of [SpatialData](https://github.com/scverse/spatialdata), Sopa enables processing and analyses of spatial omics data with single-cell resolution (spatial transcriptomics or multiplex imaging data) using a standard data structure and output. We currently support the following technologies: Xenium, Visium HD, MERSCOPE, CosMX, PhenoCycler, MACSima, Hyperion. Sopa was designed for generability and low memory consumption on large images (scales to `1TB+` images).
 
-The pipeline outputs contain: (i) Xenium Explorer files for interactive visualization, (ii) an HTML report for quick quality controls, and (iii) a SpatialData `.zarr` directory for further analyses.
+🎉 `sopa==2.0.0` is out! It introduces many new API features; check [our migration guide](https://github.com/gustaveroussy/sopa/discussions/138) to smoothly update your code base.
 
 # Documentation
 
@@ -32,7 +32,7 @@ The following illustration describes the main steps of `sopa`:
 # Installation
 
 ### PyPI installation
-Sopa can be installed via `PyPI` on all operating systems. The preferred Python version is `python==3.10`, but we also support `3.9` to `3.11`. On a new environment, run the following command:
+Sopa can be installed via `PyPI` on all operating systems, with the only requirement being Python (`>=3.9` and `<=3.11`). On a new environment, run the following command:
 ```
 pip install sopa
 ```
@@ -42,21 +42,35 @@ To install extras (for example, if you want to use `snakemake`/`cellpose`/`bayso
 pip install 'sopa[snakemake,cellpose,baysor,tangram]'
 ```
 
-Important: even though `pip install 'sopa[baysor]'` will install some dependencies related to baysor, you still have to install the `baysor` command line (see the [official repository](https://github.com/kharchenkolab/Baysor)) if you want to use it.
+**Important**: even though `pip install 'sopa[baysor]'` will install some dependencies related to baysor, you still have to install the `baysor` command line (see the [official repository](https://github.com/kharchenkolab/Baysor)) if you want to use it.
 
 ### Other installation modes
 
 You can clone the repository and run one of these command lines at the root of `sopa`:
-```
-pip install -e . # dev mode installation
+```sh
+pip install -e .  # dev mode installation
 poetry install    # poetry installation
 ```
 
 # Features
 Sopa comes in three different flavours, each corresponding to a different use case:
+- `API`: use directly `sopa` as a Python package for complete flexibility and customization
 - `Snakemake pipeline`: choose a config, and run our pipeline on your spatial data in a couple of minutes
 - `CLI`: use our command-line-interface for prototyping quickly your own pipeline
-- `API`: use directly `sopa` as a Python package for complete flexibility and customization
+
+### API
+
+Below is an example of a minimal API usage. For a complete API description, please refer to the [documentation](https://gustaveroussy.github.io/sopa).
+
+```python
+import sopa
+
+sdata = sopa.io.xenium("path/to/data") # reading Xenium data
+
+sopa.make_image_patches(sdata) # creating overlapping patches
+sopa.segmentation.cellpose(sdata, "DAPI", diameter=30) # running cellpose segmentation
+sopa.aggregate(sdata) # counting the transcripts inside the cells
+```
 
 ### Snakemake pipeline
 
@@ -71,7 +85,7 @@ For more details on `snakemake` configuration and how to properly setup your env
 
 ### CLI
 
-Below are examples of commands that can be run with the `sopa` CLI:
+Below are examples of commands that can be run with the `sopa` CLI. For a complete description of the CLI, please refer to the [documentation](https://gustaveroussy.github.io/sopa/cli).
 
 ```bash
 > sopa --help # show command names and arguments
@@ -83,18 +97,6 @@ Below are examples of commands that can be run with the `sopa` CLI:
 > sopa explorer write merscope_directory.zarr # convert for interactive vizualisation
 ```
 
-For a complete description of the CLI, please refer to the [documentation](https://gustaveroussy.github.io/sopa/cli).
-
-### API
-
-```python
-import sopa
-
-# use the 'sopa' python package
-```
-
-For a complete API description, please refer to the [documentation](https://gustaveroussy.github.io/sopa).
-
 # Cite us
 Our article is published in [Nature Communications](https://www.nature.com/articles/s41467-024-48981-z). You can cite our paper as below:
 

diff --git a/docs/api/misc.md b/docs/api/misc.md
@@ -25,19 +25,19 @@ sopa.settings.parallelization_backend = None # no backend (i.e., sequential)
 
 ## Xenium Explorer
 
-::: sopa.io.write_xenium_explorer
+::: sopa.io.explorer.write
 
-::: sopa.io.align
+::: sopa.io.explorer.align
 
-::: sopa.io.add_xenium_explorer_selection
+::: sopa.io.explorer.add_explorer_selection
 
-::: sopa.io.int_cell_id
+::: sopa.io.explorer.int_cell_id
 
-::: sopa.io.str_cell_id
+::: sopa.io.explorer.str_cell_id
 
-::: sopa.io.write_image
+::: sopa.io.explorer.write_image
 
-::: sopa.io.save_column_csv
+::: sopa.io.explorer.save_column_csv
 
 ## Report
 

diff --git a/docs/api/readers.md b/docs/api/readers.md
@@ -1,10 +1,9 @@
 # Readers
 
-!!! notes
-    Due to many updates in the data format provided by the different companies, you might have issues loading your data. In this case, consider [opening an issue](https://github.com/gustaveroussy/sopa/issues) detailing the version of the machine you used and the error log, as well as an example of file names that you are trying to read.
+The readers below are used to read your raw spatial data into a `SpatialData` object. Choose the right function below, according to your technology of interest.
 
-!!! notes "Related to `spatialdata-io`"
-    A library called [`spatialdata-io`](https://spatialdata.scverse.org/projects/io/en/latest/) already contains a lot of readers. Here, we add minor functionnalities on top of the readers from `spatialdata-io`, and add a few others.
+!!! warning
+    Due to many updates in the data format provided by the different companies, you might have issues loading your data. In this case, consider [opening an issue](https://github.com/gustaveroussy/sopa/issues) detailing the version of the machine you used and the error log, as well as an example of file names that you are trying to read.
 
 ::: sopa.io.xenium
 

diff --git a/docs/index.md b/docs/index.md
@@ -6,7 +6,7 @@
 
 Built on top of [SpatialData](https://github.com/scverse/spatialdata), Sopa enables processing and analyses of image-based spatial omics using a standard data structure and output. We currently support the following technologies: Xenium, MERSCOPE, CosMX, PhenoCycler, MACSIMA, Hyperion. Sopa was designed for generability and low memory consumption on large images (scales to `1TB+` images).
 
-The pipeline outputs contain: (i) Xenium Explorer files for interactive visualization, (ii) an HTML report for quick quality controls, and (iii) a SpatialData `.zarr` directory for further analyses.
+🎉 `sopa==2.0.0` is out! It introduces many new API features; check [our migration guide](https://github.com/gustaveroussy/sopa/discussions/138) to smoothly update your code base.
 
 ## Overview
 

diff --git a/sopa/patches/cluster.py b/sopa/patches/cluster.py
@@ -32,7 +32,7 @@ def cluster_embeddings(
     key_added: str = "cluster",
     **method_kwargs: str,
 ) -> gpd.GeoDataFrame:
-    """Cluster the patches embeddings using a clustering method
+    """Create clusters of the patches embeddings (obtained from [sopa.patches.compute_embeddings][]).
 
     Args:
         sdata: A `SpatialData` object

diff --git a/sopa/patches/infer.py b/sopa/patches/infer.py
@@ -37,7 +37,7 @@ def compute_embeddings(
     batch_size: int = 32,
     device: str = None,
 ) -> DataArray:
-    """Create an image made of patch based predictions of an image (useful for WSI images notably).
+    """It creates patches, runs a computer vision model on each patch, and store the embeddings of each all patches as an image. This is mostly useful for WSI images.
 
     !!! info
         The image will be saved into the `SpatialData` object with the key `sopa_{model_name}` (see the argument below).
@@ -54,7 +54,7 @@ def compute_embeddings(
         device: Device used for the computer vision model.
 
     Returns:
-        The `DataArray` of shape `(C,Y,X)` containing the model predictions (they are also saved in the `SpatialData` object).
+        The `DataArray` of shape `(C,Y,X)` containing the model predictions (also added to the `SpatialData` object).
     """
     try:
         import torch

diff --git a/sopa/patches/patches.py b/sopa/patches/patches.py
@@ -33,6 +33,14 @@
 def make_image_patches(
     sdata: SpatialData, patch_width: int = 2000, patch_overlap: int = 50, image_key: str | None = None
 ):
+    """Create overlapping patches on an image. This can be used for image-based segmentation methods such as Cellpose, which will run on each patch.
+
+    Args:
+        sdata: A `SpatialData` object.
+        patch_width: Width of the patches, in pixels.
+        patch_overlap: Number of pixels of overlap between patches.
+        image_key: Optional key of the image on which the patches will be made. If not provided, it is found automatically.
+    """
     image_key, _ = get_spatial_image(sdata, key=image_key, return_key=True)
     patches = Patches2D(sdata, image_key, patch_width=patch_width, patch_overlap=patch_overlap)
 
@@ -45,7 +53,15 @@ def make_transcript_patches(
     patch_overlap: int = 50,
     points_key: str | None = None,
     **kwargs: int,
-) -> list[int]:
+):
+    """Create overlapping patches on a transcripts dataframe, and save it in a cache. This can be used for trancript-based segmentation methods such as Baysor.
+
+    Args:
+        sdata: A `SpatialData` object.
+        patch_width: Width of the patches, in microns.
+        patch_overlap: Number of microns of overlap between patches.
+        points_key: Optional key of the points on which the patches will be made. If not provided, it is found automatically.
+    """
     points_key, _ = get_spatial_element(
         sdata.points, key=points_key or sdata.attrs.get(SopaAttrs.TRANSCRIPTS), return_key=True
     )

diff --git a/sopa/segmentation/tissue.py b/sopa/segmentation/tissue.py
@@ -18,14 +18,7 @@
 
 log = logging.getLogger(__name__)
 
-
-def hsv_otsu(*args, **kwargs):
-    warnings.warn(
-        "This function is deprecated and will be removed in late 2024. Use `sopa.tissue_segmentation` instead.",
-        DeprecationWarning,
-        stacklevel=2,
-    )
-    tissue_segmentation(*args, **kwargs)
+AVAILABLE_MODES = ["hsv_otsu", "staining"]
 
 
 def tissue_segmentation(
@@ -36,6 +29,7 @@ def tissue_segmentation(
     open_k: int = 5,
     close_k: int = 5,
     drop_threshold: int = 0.01,
+    mode: str | None = None,
     key_added: str = SopaKeys.ROI,
 ):
     """Perform WSI tissue segmentation. The resulting regions-of-interests (ROI) are saved as shapes.
@@ -57,8 +51,11 @@ def tissue_segmentation(
         open_k: The kernel size of the morphological openning operation
         close_k: The kernel size of the morphological closing operation
         drop_threshold: Segments that cover less area than `drop_threshold`*100% of the number of pixels of the image will be removed
+        mode: Two modes are available: `hsv_otsu` (for H&E data) and `staining`. By default, `hsv_otsu` is used only if there are exactly 3 channels.
         key_added: Name of the spatial element that will be added, containing the segmented tissue polygons.
     """
+    assert mode is None or mode in AVAILABLE_MODES, f"`mode` argument should be one of {AVAILABLE_MODES}"
+
     if key_added in sdata.shapes:
         log.warning(f"sdata['{key_added}'] was already existing, but tissue segmentation is run on top")
 
@@ -72,50 +69,79 @@ def tissue_segmentation(
         level_keys = list(image.keys())
         image: DataArray = next(iter(image[level_keys[level]].values()))
 
-    geo_df = _get_polygons(image, blur_k, open_k, close_k, drop_threshold)
+    geo_df = TissueSegmentation(image, blur_k, open_k, close_k, drop_threshold).get_polygons(mode)
 
-    assert len(
-        geo_df
-    ), "No polygon has been found after tissue segmentation. Check that there is some tissue in the image, or consider updating the segmentation parameters."
+    if not len(geo_df):
+        log.warning(
+            "No polygon has been found after tissue segmentation. "
+            "Check that there is some tissue in the image, or consider updating the function parameters."
+        )
+        return
 
     geo_df = to_valid_polygons(geo_df)
     geo_df = ShapesModel.parse(geo_df, transformations=get_transformation(image, get_all=True).copy())
 
     add_spatial_element(sdata, key_added, geo_df)
 
 
-def _get_polygons(image: DataArray, blur_k: int, open_k: int, close_k: int, drop_threshold: int) -> gpd.GeoDataFrame:
-    import cv2
+class TissueSegmentation:
+    def __init__(self, image: DataArray, blur_k: int, open_k: int, close_k: int, drop_threshold: int):
+        self.image = image
+        self.blur_k = blur_k
+        self.open_k = open_k
+        self.close_k = close_k
+        self.drop_threshold = drop_threshold
 
-    thumbnail = np.array(image.transpose("y", "x", "c"))
+    def _get_default_mode(self) -> str:
+        return "hsv_otsu" if self.image.sizes["c"] == 3 else "staining"
 
-    assert thumbnail.shape[2] == 3, "The image should be in RGB color space"
+    def get_polygons(self, mode: str | None) -> gpd.GeoDataFrame:
+        return getattr(self, mode or self._get_default_mode())()
 
-    if thumbnail.shape[0] * thumbnail.shape[1] > 1e8:
-        log.warning(
-            "Tissue segmentation is computationally expensive for large images. Consider using a smaller image, or set the `level` parameter."
-        )
+    def hsv_otsu(self) -> gpd.GeoDataFrame:
+        import cv2
+
+        thumbnail = np.array(self.image.transpose("y", "x", "c"))
+
+        assert thumbnail.shape[2] == 3, "The image should be in RGB color space"
 
-    thumbnail_hsv = cv2.cvtColor(thumbnail, cv2.COLOR_RGB2HSV)
-    thumbnail_hsv_blurred = cv2.medianBlur(thumbnail_hsv[:, :, 1], blur_k)
-    _, mask = cv2.threshold(thumbnail_hsv_blurred, 0, 255, cv2.THRESH_OTSU + cv2.THRESH_BINARY)
-
-    mask_open = cv2.morphologyEx(mask, cv2.MORPH_OPEN, np.ones((open_k, open_k), np.uint8))
-    mask_open_close = cv2.morphologyEx(mask_open, cv2.MORPH_CLOSE, np.ones((close_k, close_k), np.uint8))
-
-    num_labels, labels, stats, _ = cv2.connectedComponentsWithStats(mask_open_close, 4, cv2.CV_32S)
-
-    contours = []
-    for i in range(1, num_labels):
-        if stats[i, 4] > drop_threshold * np.prod(mask_open_close.shape):
-            cc = cv2.findContours(
-                np.array(labels == i, dtype="uint8"),
-                cv2.RETR_TREE,
-                cv2.CHAIN_APPROX_NONE,
-            )[
-                0
-            ][0]
-            c_closed = np.array(list(cc) + [cc[0]])
-            contours.extend([c_closed.squeeze()])
-
-    return gpd.GeoDataFrame(geometry=[Polygon(contour) for contour in contours])
+        if thumbnail.shape[0] * thumbnail.shape[1] > 1e8:
+            log.warning(
+                "Tissue segmentation is computationally expensive for large images. Consider using a smaller image, or set the `level` parameter."
+            )
+
+        thumbnail_hsv = cv2.cvtColor(thumbnail, cv2.COLOR_RGB2HSV)
+        thumbnail_hsv_blurred = cv2.medianBlur(thumbnail_hsv[:, :, 1], self.blur_k)
+        _, mask = cv2.threshold(thumbnail_hsv_blurred, 0, 255, cv2.THRESH_OTSU + cv2.THRESH_BINARY)
+
+        mask_open = cv2.morphologyEx(mask, cv2.MORPH_OPEN, np.ones((self.open_k, self.open_k), np.uint8))
+        mask_open_close = cv2.morphologyEx(mask_open, cv2.MORPH_CLOSE, np.ones((self.close_k, self.close_k), np.uint8))
+
+        num_labels, labels, stats, _ = cv2.connectedComponentsWithStats(mask_open_close, 4, cv2.CV_32S)
+
+        contours = []
+        for i in range(1, num_labels):
+            if stats[i, 4] > self.drop_threshold * np.prod(mask_open_close.shape):
+                cc = cv2.findContours(
+                    np.array(labels == i, dtype="uint8"),
+                    cv2.RETR_TREE,
+                    cv2.CHAIN_APPROX_NONE,
+                )[
+                    0
+                ][0]
+                c_closed = np.array(list(cc) + [cc[0]])
+                contours.extend([c_closed.squeeze()])
+
+        return gpd.GeoDataFrame(geometry=[Polygon(contour) for contour in contours])
+
+    def staining(self) -> gpd.GeoDataFrame:
+        raise NotImplementedError
+
+
+def hsv_otsu(*args, **kwargs):
+    warnings.warn(
+        "This function is deprecated and will be removed in late 2024. Use `sopa.tissue_segmentation` instead.",
+        DeprecationWarning,
+        stacklevel=2,
+    )
+    tissue_segmentation(*args, **kwargs)