A comprehensive platform for managing, processing, and administering single-cell RNA sequencing (scRNA-seq) omics data.
Cellarium Nexus is a monorepository containing the codebase responsible for:
- Data ingestion into TileDB SOMA (primary) or BigQuery (legacy)
- Data extraction into AnnData files
- Processing and validation of omics datasets
- Orchestration of data workflows through Kubeflow pipelines
Cellarium Nexus is structured around these key components:
- Backend (`cellarium/nexus/backend/`): Django-based API server with admin dashboard for system interaction
  - Manages metadata, users, and job coordination
  - Provides an intuitive interface for monitoring and controlling pipelines
  - Modules for cell management, ingest management, and curriculum management
- Coordinator (`cellarium/nexus/coordinator/`): bridges backend services with data operations
  - Handles data ingestion and extraction workflows
  - Manages validation of AnnData files
- Omics datastore (`cellarium/nexus/omics_datastore/`): pluggable storage backends for omics datasets
  - TileDB SOMA is the primary storage backend
  - BigQuery is supported as a legacy/alternative backend
  - Protocol-driven design allows transparent backend selection per dataset
- Clients (`cellarium/nexus/clients/`): REST client library for communicating with the backend API
  - Pydantic schemas for typed request/response handling
- Workflows (`cellarium/nexus/workflows/`): Kubeflow pipelines for scalable data processing
  - Components for data ingestion, extraction, and validation
  - Parallelised execution for high-throughput data operations
  - Pipelines can be run locally or submitted to Vertex AI Pipelines
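The protocol-driven backend selection described above can be sketched with `typing.Protocol`. The class and method names below are illustrative assumptions, not the actual Nexus API:

```python
from typing import Protocol


class OmicsDatastore(Protocol):
    """Minimal interface both backends are assumed to implement."""

    def ingest(self, anndata_path: str) -> int: ...
    def extract(self, query: str) -> list[str]: ...


class SomaDatastore:
    """TileDB SOMA backend (primary) -- illustrative stub."""

    def ingest(self, anndata_path: str) -> int:
        return 1  # pretend one file was ingested

    def extract(self, query: str) -> list[str]:
        return [f"soma://{query}"]


class BigQueryDatastore:
    """BigQuery backend (legacy) -- illustrative stub."""

    def ingest(self, anndata_path: str) -> int:
        return 1

    def extract(self, query: str) -> list[str]:
        return [f"bq://{query}"]


def datastore_for(backend: str) -> OmicsDatastore:
    """Select a backend per dataset; callers only see the protocol."""
    return SomaDatastore() if backend == "soma" else BigQueryDatastore()
```

Because both classes structurally satisfy `OmicsDatastore`, calling code never needs to know which backend a dataset lives in.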
`ingest_data_pipeline`: Orchestrates the ingestion process
- Validates and prepares input AnnData files in parallel
- Ingests prepared data into the configured omics datastore (SOMA or BigQuery)
- Handles multiple input files concurrently
`extract_data_pipeline`: Manages the extraction process
- Prepares extraction metadata with specified features and filters
- Exports data to AnnData files in parallel
- Marks curriculum as finished upon completion
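The parallel validate-then-ingest flow above can be illustrated with standard-library concurrency. This is a hand-rolled sketch of the fan-out/fan-in pattern, not the Kubeflow implementation, and the helper names are invented:

```python
from concurrent.futures import ThreadPoolExecutor


def validate_and_prepare(path: str) -> str:
    """Stand-in for per-file validation and preparation."""
    if not path.endswith(".h5ad"):
        raise ValueError(f"not an AnnData file: {path}")
    return path


def ingest(prepared: list[str]) -> int:
    """Stand-in for the ingestion step; returns the number of files ingested."""
    return len(prepared)


def ingest_data_pipeline(paths: list[str]) -> int:
    # Fan out: validate/prepare every input file concurrently.
    with ThreadPoolExecutor() as pool:
        prepared = list(pool.map(validate_and_prepare, paths))
    # Fan in: ingest all prepared files into the datastore.
    return ingest(prepared)
```

In the real pipelines, each step runs as a Kubeflow component rather than a thread, but the dependency structure is the same.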
```
cellarium/
└── nexus/
    ├── backend/              # Django-based API server and admin dashboard
    │   ├── application/      # Django settings, URLs, WSGI/ASGI entry points
    │   ├── cell_management/  # Dataset and feature metadata management
    │   ├── ingest_management/ # Data ingestion tracking and validation
    │   ├── curriculum/       # Extract job and training data management
    │   └── core/             # Shared backend utilities and admin helpers
    ├── clients/              # REST client library for backend communication
    ├── coordinator/          # Orchestrates data operations across services
    ├── omics_datastore/      # Storage implementations (TileDB SOMA + BigQuery)
    │   ├── soma_ops/         # TileDB SOMA operations (primary backend)
    │   └── bq_ops/           # BigQuery operations (legacy backend)
    ├── shared/               # Common utilities and Pydantic schemas
    └── workflows/            # Kubeflow pipeline definitions
        └── kubeflow/
            ├── components/   # Reusable pipeline component definitions
            ├── pipelines/    # Complete workflow orchestrations
            └── utils/        # Pipeline utilities and constants
conf/                         # Local configuration and .env files
deploy/                       # Dockerfiles and deployment scripts
tests/                        # Test suite mirroring source structure
```
The system uses Kubeflow pipelines to orchestrate data workflows:
- `components/`: Define individual processing steps for ingestion, extraction, and validation
- `pipelines/`: Orchestrate components into end-to-end workflows
  - `ingest_data_pipeline`: Manages the complete ingestion process
  - `extract_data_pipeline`: Handles the full extraction workflow
- `conf.py`: Defines runtime configuration passed to each component
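A plausible shape for such a configuration object, using hypothetical field names (the real `conf.py` may differ):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ComponentConfig:
    """Runtime settings handed to each pipeline component (illustrative)."""

    gcp_project_id: str
    pipeline_base_image: str
    max_workers: int = 4
```

Passing a single frozen object keeps every component's inputs explicit and immutable across the pipeline run.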
The application reads environment variables from conf/.env (and optionally conf/.env.local for local overrides). This file is not checked into version control — create it from the template below.
```
# conf/.env

# Core
SECRET_KEY=your-secret-key
ENVIRONMENT=local            # local | development | production | test
MAIN_HOST_ALLOWED=localhost
SITE_URL=http://localhost:8000

# PostgreSQL database
DB_NAME=nexus
DB_USER=nexus
DB_PASSWORD=nexus
DB_HOST=localhost            # not used in production (Cloud SQL socket)
DB_PORT=5432
DB_INSTANCE_CONNECTION_NAME= # required in production: project:region:instance

# Cloud Storage
GCP_PROJECT_ID=your-gcp-project
GCP_APPLICATION_BILLING_LABEL=your-billing-label
BUCKET_NAME_PRIVATE=your-private-bucket
BUCKET_NAME_PUBLIC=your-public-bucket

# Pipelines
PIPELINE_BASE_IMAGE=your-registry/nexus-workflows:tag
PIPELINE_SERVICE_ACCOUNT=    # optional: SA used to run pipeline jobs
PIPELINE_ROOT_PATH=          # optional: GCS path used as pipeline root
```

The `ENVIRONMENT` variable controls which Django settings module is loaded:
- `local`: loads `settings/local.py` (local filesystem static files, development DB)
- `development`: loads `settings/development.py`
- `production`: loads `settings/production.py` (Cloud SQL, GCS static files)
- `test`: loads `settings/test.py` (SQLite, dummy credentials)
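The selection presumably reduces to string dispatch on `ENVIRONMENT`; a minimal sketch with an assumed module layout (the project's actual entry point may differ):

```python
import os

# Assumed mapping; module paths mirror the settings list above.
SETTINGS_MODULES = {
    "local": "application.settings.local",
    "development": "application.settings.development",
    "production": "application.settings.production",
    "test": "application.settings.test",
}


def settings_module(default: str = "local") -> str:
    """Resolve the Django settings module from the ENVIRONMENT variable."""
    env = os.environ.get("ENVIRONMENT", default)
    try:
        return SETTINGS_MODULES[env]
    except KeyError:
        raise ValueError(f"unknown ENVIRONMENT: {env!r}") from None
```

Failing loudly on an unknown value is preferable to silently falling back to development settings in production.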
Pinned requirements must be exported before building (see Export pinned requirements for Docker below).
```
# Backend image
docker build -f deploy/backend/Dockerfile -t nexus-backend:local .

# Workflows image
docker build -f deploy/workflows/Dockerfile -t nexus-workflows:local .
```

The `docker-update-workflow.yaml` workflow builds and pushes images to the artifact registry. Trigger it manually with the GitHub CLI:
```
gh workflow run docker-update-workflow.yaml \
  -R cellarium-ai/cellarium-nexus \
  --ref <branch-or-tag> \
  -f image-types=<backend|workflows|both> \
  -f image-tag=<semver-tag> \
  -f add-latest-tag=<true|false> \
  -f skip_tests=<true|false>
```

Input parameters:
- `image-types`: which image(s) to build (`backend`, `workflows`, or `both`)
- `image-tag`: tag pushed to the registry (defaults to the short commit hash)
- `add-latest-tag`: also push a `latest` tag when `true`
- `skip_tests`: skip lint and tests and only build Docker images
The workflow also runs automatically on pushes to any branch and on git tags, building and pushing images only for tagged commits.
This repository uses Poetry for dependency management, virtual environments, tasks, and packaging.
```
python3 -m venv .venv-poetry
source .venv-poetry/bin/activate
pip install poetry
poetry self add "poetry-dynamic-versioning[plugin]"
poetry self add poetry-plugin-export
poetry --version
```

```
poetry install --with dev,test,backend

# Optional: keep the venv inside the repo
poetry config virtualenvs.in-project true
```

```
# All tests
poetry run poe test

# Subsets (pytest markers defined under tests/)
poetry run poe unit
poetry run poe integration
```

```
# Lint check (Ruff + Black --check)
poetry run poe lint

# Auto-format (Ruff fixes + Black)
poetry run poe format
```

This repository includes a `.pre-commit-config.yaml` that runs a local hook to lint via `poetry run poe lint`.
```
# Install Git hooks
poetry run pre-commit install

# Run all hooks on the entire codebase
poetry run pre-commit run --all-files
```

```
poetry run poe export-backend-reqs
poetry run poe export-workflows-reqs
```

Dockerfiles consume the exported files under `deploy/requirements/` and install the package with `pip install .` for reproducible builds.
When contributing to this repository, please follow these guidelines:
- Use built-in type annotations for all function signatures
- Write docstrings in imperative mood and reST format
- Include proper error documentation with `:raise:` sections
- Use absolute imports throughout the codebase
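A hypothetical function following these conventions (built-in type annotations, imperative reST docstring, documented exceptions):

```python
def count_cells(path: str, min_genes: int = 0) -> int:
    """
    Count cells in an AnnData file that pass a minimum-gene filter.

    :param path: Path to the ``.h5ad`` file.
    :param min_genes: Minimum number of detected genes per cell.

    :raise ValueError: If ``min_genes`` is negative.

    :return: Number of cells passing the filter.
    """
    if min_genes < 0:
        raise ValueError("min_genes must be non-negative")
    # Illustrative stub; a real implementation would read the file.
    return 0
```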