Documentation for the scBank data structure #112

subercui · 2023-11-08T03:55:11Z

subercui
Nov 8, 2023
Maintainer

The Data Structure for Large-Scale Computing

The data structure is built with large-scale computing in mind and supports accessing and grouping cells across studies.

Key Features for 10+ Million Data

fast full slicing and indexing across studies
data streaming
easy appending of new data and removal of studies, without constraints on the gene dimensions
runtime data objects backed by hybrid memory and disk storage
tracking, synchronizing, and versioning of data changes
maximal interpretability when saving as JSON: the on-disk directory and files are largely self-explanatory
efficient compression and loading when saving as Parquet
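
As a rough illustration of the data streaming feature, the sketch below reads rows lazily and groups them into batches, so the full table is never held in memory. It is not the scBank API: `stream_rows` and `batched` are hypothetical names, and it assumes every line of a `.jsonl` file is one valid JSON object shaped like the datatable rows described under Data Schema.

```python
import io
import json
from typing import Dict, Iterable, Iterator, List


def stream_rows(lines: Iterable[str]) -> Iterator[Dict]:
    """Yield one parsed row per non-empty line, without loading the whole file."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def batched(rows: Iterable[Dict], batch_size: int) -> Iterator[List[Dict]]:
    """Group a row stream into lists of at most batch_size rows."""
    batch: List[Dict] = []
    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # emit the final, possibly shorter, batch
        yield batch


# Stand-in for an open file handle; a real file object iterates the same way.
demo_file = io.StringIO(
    '{"id": "c1", "genes": [1, 3], "expressions": [0.5, 2.0]}\n'
    '{"id": "c2", "genes": [2], "expressions": [1.0]}\n'
    '{"id": "c3", "genes": [1], "expressions": [4.0]}\n'
)
batches = list(batched(stream_rows(demo_file), batch_size=2))
```

Because both helpers are generators, memory use is bounded by the batch size rather than by the number of cells.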

Data Schema

The key structure of scBank is the datatable. Each datatable contains rows of data, one row per cell. There is always a main datatable. It is no different from the other datatables, except that its name is given by the main_data field in the manifest.json file.

example_main.datatable.jsonl:
```
{
  "id": "cell_id", // required
  "genes": [gene_id_1, gene_id_3, ...],  // used if data is sparse
  "expressions": [value_1, value_3, ...],
},
...
```
We support additional cell-specific content, such as normalized expressions. Each additional kind of data is stored in a separate datatable.

An example of data stored as consecutive keys and values, which is typically used for sparse cell-gene expressions:

normalized_expression.datatable.jsonl:
```
{
  "id": "cell_id",  // required
  "genes": [gene_id_1, gene_id_3, ...],  // used if data is sparse
  "expressions": [value_1, value_3, ...],
},
...
```
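
To make the sparse layout concrete, here is a minimal sketch of expanding one such row into a dense vector aligned to a chosen gene order. The function name `densify` is hypothetical, and treating genes absent from the row as zero is an assumption of this sketch, not something the schema above states.

```python
from typing import Dict, List


def densify(row: Dict, gene_order: List[int], fill: float = 0.0) -> List[float]:
    """Expand a sparse row into a dense vector aligned to gene_order.

    Assumption: genes missing from the row take the value `fill`.
    """
    lookup = dict(zip(row["genes"], row["expressions"]))
    return [lookup.get(gene_id, fill) for gene_id in gene_order]


row = {"id": "cell_1", "genes": [1, 3], "expressions": [0.5, 2.0]}
dense = densify(row, gene_order=[1, 2, 3])
```

Since each row carries its own gene ids, cells from studies with different gene sets can be aligned to any shared gene order at load time.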
An example of data containing only dense values. This layout assumes all cells have the same number of dimensions/columns, as with UMAP coordinates or latent embeddings. The dimension/column names can be specified in the study table:

some_dense_data.datatable.jsonl:
```
{
  "id": "cell_id",  // required
  "row_name": [value_1, value_2, ...], // find column keys in study table
},
...
```
Note: the difference between the two types of datatable is the number of top-level fields. scBank uses this to load and maintain the data correctly, so the top-level fields should always be id, an optional custom key name (present only for sparse data), and a custom value name.
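
The field-count rule can be sketched as a small check. This is an illustration only: `datatable_kind` is a hypothetical helper, not a function from the scBank API.

```python
from typing import Dict


def datatable_kind(row: Dict) -> str:
    """Classify a row as 'sparse' (id + keys + values) or 'dense' (id + values)."""
    if "id" not in row:
        raise ValueError("every datatable row needs an 'id' field")
    n_fields = len(row)
    if n_fields == 3:
        return "sparse"
    if n_fields == 2:
        return "dense"
    raise ValueError(f"expected 2 or 3 top-level fields, got {n_fields}")


sparse_kind = datatable_kind({"id": "c1", "genes": [1, 3], "expressions": [0.5, 2.0]})
dense_kind = datatable_kind({"id": "c1", "umap": [0.1, -1.2]})
```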

The cell metatable stores cell-specific information, such as cell type.

cellmeta.jsonl:
```
{
  "id": "cell_id",  // required
  "meta": {  // required
    "study": "study_id",  // required
    "cell_type": "cell_type",
    "cell_line": "cell_line",
    "disease": "disease",
    "tissue": "tissue",
    "age": "age",
  },
},
...
```
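
As an illustration of how the cell metatable supports grouping cells across studies, the sketch below selects cell ids by metadata values. `select_cells` and the sample values are hypothetical, and the rows are assumed to be already parsed into Python dicts.

```python
from typing import Dict, Iterable, List


def select_cells(cellmeta: Iterable[Dict], **criteria: str) -> List[str]:
    """Return ids of cells whose meta matches every key=value criterion."""
    return [
        row["id"]
        for row in cellmeta
        if all(row["meta"].get(key) == value for key, value in criteria.items())
    ]


# Hypothetical sample rows following the cellmeta layout above.
cellmeta = [
    {"id": "c1", "meta": {"study": "s1", "cell_type": "T cell", "tissue": "blood"}},
    {"id": "c2", "meta": {"study": "s2", "cell_type": "T cell", "tissue": "lung"}},
    {"id": "c3", "meta": {"study": "s2", "cell_type": "B cell", "tissue": "lung"}},
]
t_cells = select_cells(cellmeta, cell_type="T cell")
lung_t_cells = select_cells(cellmeta, cell_type="T cell", tissue="lung")
```

The returned ids can then be used to look up the matching rows in any datatable, regardless of which study each cell came from.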

The study table is like a set of study cards. Each study card holds the study metadata and the ids of the cells that belong to the study. Study metadata includes copy numbers, HVGs (highly variable genes), the cell type set, etc.

studytable.jsonl:
```
{  // a study card
  "id": "study_id",  // required
  "cells": [  // required
    "cell_id_1",
    "cell_id_2",
    ...
  ],
  "meta": {
    "cell_types": ["cell_type_1", "cell_type_2", ...],
    "hvgs": ["gene_id_1", "gene_id_3", ...],
    "copy_number": {
      "gene_id_1": copy_number_value_1,
      "gene_id_3": copy_number_value_3,
      ...
    },
  },
  "key_map": {  // optional, the column keys for dense datatables
    "some_dense_data": [gene_id_1, gene_id_2, ...],
    ...
  },
},
...
```
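
To show how key_map relates to dense datatables, here is a minimal sketch that pairs the values of a dense row with the column keys recorded on its study card. `label_dense_row` and the `umap` table name are hypothetical, and the sketch assumes a dense row has exactly one field besides id.

```python
from typing import Dict, List


def label_dense_row(row: Dict, study_card: Dict, table_name: str) -> Dict:
    """Map column keys from the study card's key_map onto a dense row's values."""
    keys: List = study_card["key_map"][table_name]
    # Assumption: the only non-id field of a dense row holds the value list.
    values = next(value for name, value in row.items() if name != "id")
    if len(keys) != len(values):
        raise ValueError("key_map length does not match the number of values")
    return dict(zip(keys, values))


study_card = {
    "id": "study_1",
    "cells": ["cell_1"],
    "key_map": {"umap": ["umap_1", "umap_2"]},
}
dense_row = {"id": "cell_1", "umap": [0.1, -1.2]}
labeled = label_dense_row(dense_row, study_card, "umap")
```

Keeping the column keys on the study card means they are stored once per study instead of being repeated in every cell row.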

The paired gene vocabulary links each gene_id to a gene_name. Note: we can also have a cell type vocabulary to make sure cell types are represented by shared ids across studies.

gene.vocab.json:
```
{
  "1": "gene_name_1",  // required
  "2": "gene_name_2",
  ...
}
```
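
A short sketch of using the vocabulary to translate gene ids into gene names follows. `gene_names` is a hypothetical helper. JSON object keys are always strings, so the sketch converts ids with `str()` before the lookup; whether datatables store ids as integers or strings is an assumption here.

```python
import json
from typing import Dict, Iterable, List

# Stand-in for the contents of gene.vocab.json.
vocab_text = '{"1": "gene_name_1", "2": "gene_name_2", "3": "gene_name_3"}'
vocab: Dict[str, str] = json.loads(vocab_text)


def gene_names(gene_ids: Iterable, vocab: Dict[str, str]) -> List[str]:
    """Translate gene ids to names; raises KeyError for ids not in the vocabulary."""
    return [vocab[str(gene_id)] for gene_id in gene_ids]


names = gene_names([1, 3], vocab)
```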

Gene annotation table. In principle, some gene annotations do not need to be associated with any study.

gene.annotation.jsonl:
```
{
  "id": "gene_id", // required
  "function": "function_1",
  "total_variance": total_variance_1,
  "alias": ["alias_1", "alias_2", ...],
  ...
},
...
```

The manifest stores MD5 checksums of the data files, particularly of the gene vocabulary.

manifest.json:
```
{
  "gene_vocab_file_name": "md5_checksum_of_gene_vocab",  // required
  "main_data": "example_main",  // required, the name of the main datatable
  ...
}
```
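
The checksum idea can be sketched with Python's standard `hashlib`. This is an illustration under stated assumptions, not the scBank implementation: the helper names are hypothetical, and the manifest is assumed to map the vocabulary file name (here `gene.vocab.json`) to its MD5 hex digest.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def md5_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def vocab_is_unchanged(data_dir: Path, vocab_name: str = "gene.vocab.json") -> bool:
    """Compare the checksum recorded in manifest.json with the file on disk."""
    manifest = json.loads((data_dir / "manifest.json").read_text())
    return manifest.get(vocab_name) == md5_of_file(data_dir / vocab_name)


# Demo in a temporary directory: record a checksum, then modify the vocabulary.
with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp)
    vocab_path = data_dir / "gene.vocab.json"
    vocab_path.write_text('{"1": "gene_name_1"}')
    manifest = {"gene.vocab.json": md5_of_file(vocab_path), "main_data": "example_main"}
    (data_dir / "manifest.json").write_text(json.dumps(manifest))
    ok_before = vocab_is_unchanged(data_dir)
    vocab_path.write_text('{"1": "renamed_gene"}')
    ok_after = vocab_is_unchanged(data_dir)
```

A check like this detects a vocabulary that was edited outside the API, which matters because every gene id in the datatables is interpreted through that file.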

Overall, the data can be stored in JSONL format, or you can set up an actual MongoDB database. All 6+ files are stored in a single directory, and the file metadata is stored in the MD5 manifest file. Note: the data directory should be considered a protected data structure of scBank. Only the scBank API should be used to access and edit the data files.

Compared to AnnData:

X goes to the main datatable content
obs (cell type, tissue, ...) goes to the cell metadata
global var (gene_name, function, total variance, ...) goes to the gene annotation table
study-specific var (copy numbers, ...) goes to the study table metadata
uns (cell_types, hvgs, ...) goes to the study table metadata
obsm (umap, pca, ...) goes to the additional datatables
global varm goes to the gene annotation table
layers (normalized expressions, ...) goes to the additional datatables

obsp and varp will need future support, because that requires properly supporting custom annotation dimensions beyond cells or genes.
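
As a hedged illustration of the first part of this mapping, the sketch below turns a dense cell-by-gene matrix (the role played by X) into sparse main-datatable rows. It uses plain Python lists instead of an AnnData object to stay self-contained. `matrix_to_rows` is a hypothetical helper, and dropping zero entries is an assumption about how sparsity would be encoded.

```python
from typing import Dict, List, Sequence


def matrix_to_rows(
    cell_ids: Sequence[str],
    gene_ids: Sequence[int],
    matrix: Sequence[Sequence[float]],
) -> List[Dict]:
    """Convert a dense cells-by-genes matrix into sparse datatable rows."""
    rows = []
    for cell_id, values in zip(cell_ids, matrix):
        # Keep only nonzero entries (assumption: zeros are implicit).
        nonzero = [(g, v) for g, v in zip(gene_ids, values) if v != 0]
        rows.append(
            {
                "id": cell_id,
                "genes": [g for g, _ in nonzero],
                "expressions": [v for _, v in nonzero],
            }
        )
    return rows


rows = matrix_to_rows(
    cell_ids=["c1", "c2"],
    gene_ids=[1, 2, 3],
    matrix=[[0, 2.5, 0], [1.0, 0, 3.0]],
)
```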

YubinXie · 2024-08-08T22:42:49Z


Hi @subercui, thanks for the write-up. Is this scBank, or is there another document for the scBank used in scGPT? I would love to know more about how to build a loader for huge scRNA datasets. Loading everything into memory is really killing the memory, and parquet via dask does not allow me to use row indexing. Thanks!
