Cellarium Nexus

A comprehensive platform for managing, processing, and administering single-cell RNA sequencing (scRNA-seq) data.

Project Overview

Cellarium Nexus is a monorepo containing the codebase responsible for:

  1. Data ingestion into TileDB SOMA (primary) or BigQuery (legacy)
  2. Data extraction into AnnData files
  3. Processing and validation of omics datasets
  4. Orchestration of data workflows through Kubeflow pipelines

System Architecture

Cellarium Nexus is structured around these key components:

Backend

  • Django-based API server with admin dashboard for system interaction
  • Manages metadata, users, and job coordination
  • Provides an intuitive interface for monitoring and controlling pipelines
  • Modules for cell management, ingest management, and curriculum management

Coordinator

  • Bridges backend services with data operations
  • Handles data ingestion and extraction workflows
  • Manages validation of AnnData files

Omics Datastore

  • Pluggable storage backends for omics datasets
  • TileDB SOMA is the primary storage backend
  • BigQuery is supported as a legacy/alternative backend
  • Protocol-driven design allows transparent backend selection per dataset
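
The protocol-driven selection described above can be sketched in a few lines of plain Python. The class and method names below are hypothetical stand-ins, not the actual Nexus API; the point is only that callers depend on the protocol, never on a concrete backend:

```python
from typing import Protocol


class OmicsDatastore(Protocol):
    """Minimal protocol a storage backend must satisfy (hypothetical sketch)."""

    def ingest(self, path: str) -> str: ...
    def extract(self, curriculum: str) -> str: ...


class SomaDatastore:
    """Stand-in for the TileDB SOMA backend."""

    def ingest(self, path: str) -> str:
        return f"soma:ingested:{path}"

    def extract(self, curriculum: str) -> str:
        return f"soma:extracted:{curriculum}"


class BigQueryDatastore:
    """Stand-in for the legacy BigQuery backend."""

    def ingest(self, path: str) -> str:
        return f"bq:ingested:{path}"

    def extract(self, curriculum: str) -> str:
        return f"bq:extracted:{curriculum}"


def get_datastore(backend: str) -> OmicsDatastore:
    """Select a backend by name; callers only ever see the protocol."""
    return SomaDatastore() if backend == "soma" else BigQueryDatastore()


result = get_datastore("soma").ingest("cells.h5ad")
```

Because `Protocol` uses structural typing, neither backend class needs to inherit from `OmicsDatastore` — any class with matching method signatures satisfies it.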

Clients

  • REST client library for communicating with the backend API
  • Pydantic schemas for typed request/response handling
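
As a rough illustration of the typed request/response pattern, the sketch below validates a raw payload into a Pydantic model. The schema fields, endpoint, and class names are invented for the example, and the HTTP call is stubbed out with a canned payload:

```python
from pydantic import BaseModel


class IngestJob(BaseModel):
    """Hypothetical response schema for an ingest job."""

    id: int
    status: str


class NexusClient:
    """Tiny sketch of a typed REST client; method and field names are illustrative."""

    def __init__(self, base_url: str) -> None:
        self.base_url = base_url

    def get_ingest_job(self, job_id: int) -> IngestJob:
        # A real client would issue an HTTP GET to the backend API;
        # here we validate a canned payload to keep the sketch runnable.
        payload = {"id": job_id, "status": "succeeded"}
        return IngestJob.model_validate(payload)


job = NexusClient("http://localhost:8000").get_ingest_job(42)
```

Validation at the client boundary means malformed responses fail loudly with a `ValidationError` instead of propagating untyped dicts through the caller.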

Workflows

  • Kubeflow pipelines for scalable data processing
  • Components for data ingestion, extraction, and validation
  • Parallelised execution for high-throughput data operations
  • Pipelines can be run locally or submitted to Vertex AI Pipelines

Key Workflows

Data Ingestion

  1. ingest_data_pipeline: Orchestrates the ingestion process
    • Validates and prepares input AnnData files in parallel
    • Ingests prepared data into the configured omics datastore (SOMA or BigQuery)
    • Handles multiple input files concurrently
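
Outside Kubeflow, the fan-out/fan-in shape of this pipeline can be approximated with `concurrent.futures`. The two functions here are deliberately trivial stand-ins for the real validation and ingestion components:

```python
from concurrent.futures import ThreadPoolExecutor


def validate_and_prepare(path: str) -> str:
    """Stand-in for the per-file validation/preparation component."""
    if not path.endswith(".h5ad"):
        raise ValueError(f"unsupported input: {path}")
    return f"prepared/{path}"


def ingest(prepared_paths: list[str]) -> int:
    """Stand-in for the datastore ingestion step; returns the file count."""
    return len(prepared_paths)


inputs = ["a.h5ad", "b.h5ad", "c.h5ad"]

# Fan out: each input file is validated/prepared independently.
with ThreadPoolExecutor(max_workers=4) as pool:
    prepared = list(pool.map(validate_and_prepare, inputs))

# Fan in: all prepared files flow into a single ingestion step.
count = ingest(prepared)
```

In the real pipeline the per-file steps run as separate Kubeflow components rather than threads, but the dependency structure is the same.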

Data Extraction

  1. extract_data_pipeline: Manages the extraction process
    • Prepares extraction metadata with specified features and filters
    • Exports data to AnnData files in parallel
    • Marks curriculum as finished upon completion
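
A comparable stand-alone sketch of the extraction flow — metadata preparation, parallel export, then the completion marker — again with hypothetical stand-in functions rather than the real components:

```python
from concurrent.futures import ThreadPoolExecutor

finished: set[str] = set()


def prepare_metadata(features: list[str], bins: int) -> list[dict[str, object]]:
    """Stand-in for metadata preparation: one plan entry per output file."""
    return [{"bin": i, "features": features} for i in range(bins)]


def export_bin(entry: dict[str, object]) -> str:
    """Stand-in for exporting one bin of cells to an AnnData file."""
    return f"extract_{entry['bin']}.h5ad"


def mark_finished(curriculum: str) -> None:
    """Stand-in for the final backend call flagging the curriculum done."""
    finished.add(curriculum)


plan = prepare_metadata(["ENSG1", "ENSG2"], bins=3)

# Exports are independent per bin, so they run in parallel.
with ThreadPoolExecutor() as pool:
    files = list(pool.map(export_bin, plan))

mark_finished("demo-curriculum")
```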

Development

Project Structure

cellarium/
└── nexus/
    ├── backend/              # Django-based API server and admin dashboard
    │   ├── application/      # Django settings, URLs, WSGI/ASGI entry points
    │   ├── cell_management/  # Dataset and feature metadata management
    │   ├── ingest_management/# Data ingestion tracking and validation
    │   ├── curriculum/       # Extract job and training data management
    │   └── core/             # Shared backend utilities and admin helpers
    ├── clients/              # REST client library for backend communication
    ├── coordinator/          # Orchestrates data operations across services
    ├── omics_datastore/      # Storage implementations (TileDB SOMA + BigQuery)
    │   ├── soma_ops/         # TileDB SOMA operations (primary backend)
    │   └── bq_ops/           # BigQuery operations (legacy backend)
    ├── shared/               # Common utilities and Pydantic schemas
    └── workflows/            # Kubeflow pipeline definitions
        └── kubeflow/
            ├── components/   # Reusable pipeline component definitions
            ├── pipelines/    # Complete workflow orchestrations
            └── utils/        # Pipeline utilities and constants
conf/                         # Local configuration and .env files
deploy/                       # Dockerfiles and deployment scripts
tests/                        # Test suite mirroring source structure

Working with Kubeflow Pipelines

The system uses Kubeflow pipelines to orchestrate data workflows:

  1. components/: Define individual processing steps for ingestion, extraction, and validation
  2. pipelines/: Orchestrate components into end-to-end workflows
    • ingest_data_pipeline: Manages the complete ingestion process
    • extract_data_pipeline: Handles the full extraction workflow
  3. conf.py: Defines runtime configuration passed to each component
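
The actual contents of conf.py are not reproduced here; as an assumption-laden sketch, a frozen dataclass is one natural shape for the configuration handed to each component (field names and values below are illustrative):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PipelineConfig:
    """Illustrative shape of per-component runtime configuration."""

    gcp_project_id: str
    base_image: str
    pipeline_root: str


cfg = PipelineConfig(
    gcp_project_id="your-gcp-project",
    base_image="your-registry/nexus-workflows:tag",
    pipeline_root="gs://your-bucket/pipeline-root",
)
```

Freezing the dataclass keeps component runs reproducible: a component cannot silently mutate configuration seen by later steps.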

Environment Configuration

The application reads environment variables from conf/.env (and optionally conf/.env.local for local overrides). This file is not checked into version control — create it from the template below.

# conf/.env

# Core
SECRET_KEY=your-secret-key
ENVIRONMENT=local           # local | development | production | test
MAIN_HOST_ALLOWED=localhost
SITE_URL=http://localhost:8000

# PostgreSQL database
DB_NAME=nexus
DB_USER=nexus
DB_PASSWORD=nexus
DB_HOST=localhost            # not used in production (Cloud SQL socket)
DB_PORT=5432
DB_INSTANCE_CONNECTION_NAME= # required in production: project:region:instance

# Cloud Storage
GCP_PROJECT_ID=your-gcp-project
GCP_APPLICATION_BILLING_LABEL=your-billing-label
BUCKET_NAME_PRIVATE=your-private-bucket
BUCKET_NAME_PUBLIC=your-public-bucket

# Pipelines
PIPELINE_BASE_IMAGE=your-registry/nexus-workflows:tag
PIPELINE_SERVICE_ACCOUNT=   # optional: SA used to run pipeline jobs
PIPELINE_ROOT_PATH=         # optional: GCS path used as pipeline root

The ENVIRONMENT variable controls which Django settings module is loaded:

  • local: loads settings/local.py (local filesystem static files, development DB)
  • development: loads settings/development.py
  • production: loads settings/production.py (Cloud SQL, GCS static files)
  • test: loads settings/test.py (SQLite, dummy credentials)
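
Assuming a conventional mapping (the module paths below are illustrative, not copied from the codebase), the selection logic amounts to a dict lookup that feeds DJANGO_SETTINGS_MODULE:

```python
import os

# Hypothetical mapping from ENVIRONMENT values to Django settings modules.
SETTINGS_BY_ENV = {
    "local": "application.settings.local",
    "development": "application.settings.development",
    "production": "application.settings.production",
    "test": "application.settings.test",
}


def settings_module(environment: str) -> str:
    """Resolve the settings module for an ENVIRONMENT value."""
    try:
        return SETTINGS_BY_ENV[environment]
    except KeyError:
        raise ValueError(f"unknown ENVIRONMENT: {environment!r}") from None


os.environ.setdefault("DJANGO_SETTINGS_MODULE", settings_module("local"))
```

Failing fast on an unknown ENVIRONMENT value is preferable to silently falling back to a default, which could start a production process with development settings.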

Docker

Building images locally

Pinned requirements must be exported before building (see Export pinned requirements for Docker below).

# Backend image
docker build -f deploy/backend/Dockerfile -t nexus-backend:local .

# Workflows image
docker build -f deploy/workflows/Dockerfile -t nexus-workflows:local .

Triggering a remote build via GitHub Actions

The docker-update-workflow.yaml workflow builds and pushes images to the artifact registry. Trigger it manually with the GitHub CLI:

gh workflow run docker-update-workflow.yaml \
  -R cellarium-ai/cellarium-nexus \
  --ref <branch-or-tag> \
  -f image-types=<backend|workflows|both> \
  -f image-tag=<semver-tag> \
  -f add-latest-tag=<true|false> \
  -f skip_tests=<true|false>

Input parameters:

  • image-types: which image(s) to build — backend, workflows, or both
  • image-tag: tag pushed to the registry (defaults to the short commit hash)
  • add-latest-tag: also push a latest tag when true
  • skip_tests: skip lint and tests and only build Docker images

The workflow also runs automatically on pushes to any branch and on git tags, but it builds and pushes images only for tagged commits.

Environment and Tooling (Poetry)

This repository uses Poetry for dependency management, virtual environments, tasks, and packaging.

Install tooling

Using pip (in a virtual environment)

python3 -m venv .venv-poetry
source .venv-poetry/bin/activate
pip install poetry
poetry self add "poetry-dynamic-versioning[plugin]"
poetry self add poetry-plugin-export
poetry --version

Create the environment and install dependencies

poetry install --with dev,test,backend

# Optional: keep the venv inside the repo
poetry config virtualenvs.in-project true

Run tests

# All tests
poetry run poe test

# Subsets (pytest markers defined under tests/)
poetry run poe unit
poetry run poe integration

Lint and format

# Lint check (Ruff + Black --check)
poetry run poe lint

# Auto-format (Ruff fixes + Black)
poetry run poe format

Pre-commit hooks

This repository includes a .pre-commit-config.yaml that runs a local hook to lint via poetry run poe lint.

# Install Git hooks
poetry run pre-commit install

# Run all hooks on the entire codebase
poetry run pre-commit run --all-files

Export pinned requirements for Docker

poetry run poe export-backend-reqs
poetry run poe export-workflows-reqs

Dockerfiles consume the exported files under deploy/requirements/ and install the package with pip install . for reproducible builds.

Contributing

When contributing to this repository, please follow these guidelines:

  1. Use built-in type annotations for all function signatures
  2. Write docstrings in imperative mood and reST format
  3. Include proper error documentation with :raise: sections
  4. Use absolute imports throughout the codebase
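
A function following all four guidelines might look like this (the function itself is a made-up example, not part of the codebase):

```python
def load_anndata_summary(path: str) -> dict[str, int]:
    """
    Load summary statistics for an AnnData file.

    :param path: Path to the ``.h5ad`` file.

    :raise ValueError: If ``path`` does not point to an ``.h5ad`` file.

    :return: Mapping of summary statistic names to values.
    """
    if not path.endswith(".h5ad"):
        raise ValueError(f"expected an .h5ad file, got {path!r}")
    # A real implementation would open the file; the sketch returns a placeholder.
    return {"n_cells": 0}
```

Note the built-in `dict[str, int]` annotation (no `typing.Dict`), the imperative-mood summary line, and the explicit `:raise:` section documenting the error path.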
