Documentation for the scBank data structure #112
subercui announced in Announcements
Hi @subercui, thanks for the writeup. Is this the scBank used in scGPT, or is there another document for that? I would love to know more about how to build a loader for huge scRNA datasets. Loading everything into memory is really killing the memory, and parquet via dask does not allow me to use row indexing. Thanks!
The Data Structure for Large-Scale Computing
The data structure is built with large-scale computing in mind and supports accessing and grouping cells across studies.
Key Features for 10+ Million Data
Data Schema
The key structure of scBank is the datatable. Each datatable contains rows of data, one row per cell. There will always be a main datatable. It is no different from the other datatables, except that its name is given by the main_data field in the manifest.json file (for example, example_main.datatable.jsonl).
We support additional cell-specific content such as normalized expressions. Each additional piece of data is stored in a separate datatable.
A datatable can store data as consecutive keys and values, which is usually used for sparse cell-gene expressions (for example, normalized_expression.datatable.jsonl).
A datatable can also contain only dense values. This assumes that all cells have the same number of dimensions/columns, as with umap coordinates or latent embeddings. The dimension/column names can be specified in the study table (for example, some_dense_data.datatable.jsonl).
Note: the difference between the two types of datatable is the number of fields. scBank uses this to load and maintain the data correctly, so the top-level fields should always be id, [custom key name], custom value name, where the bracketed key field appears only in key-value datatables.
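The two row shapes described in the note can be sketched as follows. This is only an illustration under stated assumptions: the field names genes, expressions and umap are hypothetical, as are the helper functions. Only the id field and the rule that the number of top-level fields separates the two types come from the text above.

```python
import json

# Hypothetical rows: "genes", "expressions" and "umap" are made-up field names.
sparse_row = {"id": 0, "genes": [3, 17, 42], "expressions": [1.0, 2.5, 0.7]}
dense_row = {"id": 0, "umap": [0.12, -1.4]}


def datatable_kind(row: dict) -> str:
    """Classify a row by its number of top-level fields (assumed convention)."""
    if "id" not in row:
        raise ValueError("every datatable row needs an 'id' field")
    if len(row) == 3:  # id, custom key name, custom value name
        return "key-value"
    if len(row) == 2:  # id, custom value name
        return "dense"
    raise ValueError(f"unexpected number of top-level fields: {len(row)}")


def to_sparse_row(cell_id: int, counts: list) -> dict:
    """Turn a dense count vector into a key-value row, keeping nonzero entries."""
    keys = [j for j, value in enumerate(counts) if value != 0]
    return {"id": cell_id, "genes": keys, "expressions": [counts[j] for j in keys]}


# One cell becomes one line of a .jsonl datatable.
line = json.dumps(to_sparse_row(7, [0, 3, 0, 0, 1]))
```

Storing keys and values as two parallel lists keeps each cell on one line while skipping the zeros that dominate cell-gene matrices.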
The cell metatable (cellmeta.jsonl) stores cell-specific information, such as cell type.
The study table (studytable.jsonl) is like a group of study cards. Each study card holds the study metadata and the ids of the cells that belong to the study. Study metadata includes copy numbers, hvgs, the cell type set, etc.
The paired gene vocabulary (gene.vocab.json) links each gene_id to its gene_name. Note: we can also have a cell type vocabulary to make sure cell types are represented by shared ids across studies.
The gene annotation table (gene.annotation.jsonl) stores gene annotations. In theory, some gene annotations do not need to be associated with the studies.
The manifest (manifest.json) stores md5 checksums of the data, particularly for the gene vocabulary.
Overall, the data can be stored in jsonl format, or you can set up a MongoDB database instead. All 6+ files are stored in one directory, and the file metadata is stored in the md5 manifest file. Note: the data directory should be considered a protected data structure of scBank. Only the scBank API should be used to access and edit the data files.
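To make the directory layout concrete, here is a minimal sketch of writing jsonl files and recording their md5 checksums in a manifest. It is not the scBank API: the manifest layout (a main_data entry plus an md5 mapping), the helper names and the sample contents are all assumptions for illustration. Only the file names and the main_data field are taken from the description above.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def write_jsonl(path: Path, rows: list) -> None:
    """Write one JSON object per line."""
    with path.open("w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")


def md5_of(path: Path) -> str:
    return hashlib.md5(path.read_bytes()).hexdigest()


def build_manifest(data_dir: Path, main_data: str) -> dict:
    """Record a checksum for every data file (hypothetical manifest layout)."""
    files = sorted(p for p in data_dir.iterdir() if p.name != "manifest.json")
    manifest = {"main_data": main_data, "md5": {p.name: md5_of(p) for p in files}}
    (data_dir / "manifest.json").write_text(json.dumps(manifest, indent=2))
    return manifest


def verify(data_dir: Path) -> bool:
    """Return True when every file still matches its recorded checksum."""
    manifest = json.loads((data_dir / "manifest.json").read_text())
    return all(md5_of(data_dir / name) == digest
               for name, digest in manifest["md5"].items())


data_dir = Path(tempfile.mkdtemp())
write_jsonl(data_dir / "example_main.datatable.jsonl",
            [{"id": 0, "genes": [1, 4], "counts": [3, 1]}])  # made-up row
(data_dir / "gene.vocab.json").write_text(json.dumps({"GENE_A": 0, "GENE_B": 1}))

manifest = build_manifest(data_dir, main_data="example_main")
ok_before = verify(data_dir)

# Editing a file outside the API makes the checksum mismatch detectable.
with (data_dir / "gene.vocab.json").open("a", encoding="utf-8") as f:
    f.write(" ")
ok_after = verify(data_dir)
```

A checksum per file is one way to detect edits made outside the API, which may be why the directory is treated as protected.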
Compared to AnnData:
X goes to the main table content,
obs (like celltype, tissue ...) goes to the cell metadata,
global var (like gene_name, function, all variance ...) goes to the gene annotation table,
study-specific var (like copy numbers ...) goes to the study table metadata,
uns (like cell_types, hvgs ...) goes to the study table metadata,
obsm (like umap, pca ...) goes to the additional data tables,
global varm goes to the gene annotation table,
layers (like normalized expressions ...) goes to the additional data tables.
obsp and varp will need future support, since we need to nicely support custom annotation dimensions beyond cells or genes.
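The correspondence with AnnData can be written down as a small lookup table. This is only a restatement of the mapping in code. The dictionary keys, the table labels and the destination helper are hypothetical names, not part of scBank or anndata.

```python
from collections import defaultdict

# AnnData slot -> scBank table, as described in the text (labels are informal).
ANNDATA_TO_SCBANK = {
    "X": "main datatable",
    "obs": "cell metatable",
    "var (global)": "gene annotation table",
    "var (study-specific)": "study table metadata",
    "uns": "study table metadata",
    "obsm": "additional datatables",
    "varm (global)": "gene annotation table",
    "layers": "additional datatables",
}

# Slots the text lists as needing future support.
NOT_YET_SUPPORTED = {"obsp", "varp"}


def destination(slot: str) -> str:
    """Look up where an AnnData slot would be stored."""
    if slot in NOT_YET_SUPPORTED:
        raise NotImplementedError(f"{slot} is not supported yet")
    return ANNDATA_TO_SCBANK[slot]


# Group slots by destination to see what each table collects.
by_table = defaultdict(list)
for slot, table in ANNDATA_TO_SCBANK.items():
    by_table[table].append(slot)
```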