Cellarium Nexus

A comprehensive platform for managing, processing, and administering single-cell RNA sequencing (scRNA-seq) data.

Project Overview

Cellarium Nexus is a monorepo containing the codebase responsible for:

  1. Data ingestion into TileDB SOMA (primary) or BigQuery (legacy)
  2. Data extraction into AnnData files
  3. Processing and validation of omics datasets
  4. Orchestration of data workflows through Kubeflow pipelines

System Architecture

Cellarium Nexus is structured around these key components:

Backend

  • Django-based API server with admin dashboard for system interaction
  • Manages metadata, users, and job coordination
  • Provides an intuitive interface for monitoring and controlling pipelines
  • Modules for cell management, ingest management, and curriculum management

Coordinator

  • Bridges backend services with data operations
  • Handles data ingestion and extraction workflows
  • Manages validation of AnnData files

Omics Datastore

  • Pluggable storage backends for omics datasets
  • TileDB SOMA is the primary storage backend
  • BigQuery is supported as a legacy/alternative backend
  • Protocol-driven design allows transparent backend selection per dataset
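
The protocol-driven selection described above can be sketched in a few lines of plain Python. The class and method names below are hypothetical stand-ins, not the actual Nexus API; the point is only that callers depend on the protocol, never on a concrete backend:

```python
from typing import Protocol


class OmicsDatastore(Protocol):
    """Minimal protocol a storage backend must satisfy (hypothetical sketch)."""

    def ingest(self, path: str) -> str: ...
    def extract(self, curriculum: str) -> str: ...


class SomaDatastore:
    """Stand-in for the TileDB SOMA backend."""

    def ingest(self, path: str) -> str:
        return f"soma:ingested:{path}"

    def extract(self, curriculum: str) -> str:
        return f"soma:extracted:{curriculum}"


class BigQueryDatastore:
    """Stand-in for the legacy BigQuery backend."""

    def ingest(self, path: str) -> str:
        return f"bq:ingested:{path}"

    def extract(self, curriculum: str) -> str:
        return f"bq:extracted:{curriculum}"


def get_datastore(backend: str) -> OmicsDatastore:
    """Select a backend by name; callers only ever see the protocol."""
    return SomaDatastore() if backend == "soma" else BigQueryDatastore()


result = get_datastore("soma").ingest("cells.h5ad")
```

Because `Protocol` uses structural typing, neither backend class needs to inherit from `OmicsDatastore` — any class with matching method signatures satisfies it.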

Clients

  • REST client library for communicating with the backend API
  • Pydantic schemas for typed request/response handling
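
As a rough illustration of the typed request/response pattern, the sketch below validates a raw payload into a Pydantic model. The schema fields, endpoint, and class names are invented for the example, and the HTTP call is stubbed out with a canned payload:

```python
from pydantic import BaseModel


class IngestJob(BaseModel):
    """Hypothetical response schema for an ingest job."""

    id: int
    status: str


class NexusClient:
    """Tiny sketch of a typed REST client; method and field names are illustrative."""

    def __init__(self, base_url: str) -> None:
        self.base_url = base_url

    def get_ingest_job(self, job_id: int) -> IngestJob:
        # A real client would issue an HTTP GET to the backend API;
        # here we validate a canned payload to keep the sketch runnable.
        payload = {"id": job_id, "status": "succeeded"}
        return IngestJob.model_validate(payload)


job = NexusClient("http://localhost:8000").get_ingest_job(42)
```

Validation at the client boundary means malformed responses fail loudly with a `ValidationError` instead of propagating untyped dicts through the caller.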

Workflows

  • Kubeflow pipelines for scalable data processing
  • Components for data ingestion, extraction, and validation
  • Parallelised execution for high-throughput data operations
  • Pipelines can be run locally or submitted to Vertex AI Pipelines

Key Workflows

Data Ingestion

  1. ingest_data_pipeline: Orchestrates the ingestion process
    • Validates and prepares input AnnData files in parallel
    • Ingests prepared data into the configured omics datastore (SOMA or BigQuery)
    • Handles multiple input files concurrently
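
Outside Kubeflow, the fan-out/fan-in shape of this pipeline can be approximated with `concurrent.futures`. The two functions here are deliberately trivial stand-ins for the real validation and ingestion components:

```python
from concurrent.futures import ThreadPoolExecutor


def validate_and_prepare(path: str) -> str:
    """Stand-in for the per-file validation/preparation component."""
    if not path.endswith(".h5ad"):
        raise ValueError(f"unsupported input: {path}")
    return f"prepared/{path}"


def ingest(prepared_paths: list[str]) -> int:
    """Stand-in for the datastore ingestion step; returns the file count."""
    return len(prepared_paths)


inputs = ["a.h5ad", "b.h5ad", "c.h5ad"]

# Fan out: each input file is validated/prepared independently.
with ThreadPoolExecutor(max_workers=4) as pool:
    prepared = list(pool.map(validate_and_prepare, inputs))

# Fan in: all prepared files flow into a single ingestion step.
count = ingest(prepared)
```

In the real pipeline the per-file steps run as separate Kubeflow components rather than threads, but the dependency structure is the same.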

Data Extraction

  1. extract_data_pipeline: Manages the extraction process
    • Prepares extraction metadata with specified features and filters
    • Exports data to AnnData files in parallel
    • Marks curriculum as finished upon completion
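
A comparable stand-alone sketch of the extraction flow — metadata preparation, parallel export, then the completion marker — again with hypothetical stand-in functions rather than the real components:

```python
from concurrent.futures import ThreadPoolExecutor

finished: set[str] = set()


def prepare_metadata(features: list[str], bins: int) -> list[dict[str, object]]:
    """Stand-in for metadata preparation: one plan entry per output file."""
    return [{"bin": i, "features": features} for i in range(bins)]


def export_bin(entry: dict[str, object]) -> str:
    """Stand-in for exporting one bin of cells to an AnnData file."""
    return f"extract_{entry['bin']}.h5ad"


def mark_finished(curriculum: str) -> None:
    """Stand-in for the final backend call flagging the curriculum done."""
    finished.add(curriculum)


plan = prepare_metadata(["ENSG1", "ENSG2"], bins=3)

# Exports are independent per bin, so they run in parallel.
with ThreadPoolExecutor() as pool:
    files = list(pool.map(export_bin, plan))

mark_finished("demo-curriculum")
```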

Development

Project Structure

cellarium/
└── nexus/
    ├── backend/              # Django-based API server and admin dashboard
    │   ├── application/      # Django settings, URLs, WSGI/ASGI entry points
    │   ├── cell_management/  # Dataset and feature metadata management
    │   ├── ingest_management/# Data ingestion tracking and validation
    │   ├── curriculum/       # Extract job and training data management
    │   └── core/             # Shared backend utilities and admin helpers
    ├── clients/              # REST client library for backend communication
    ├── coordinator/          # Orchestrates data operations across services
    ├── omics_datastore/      # Storage implementations (TileDB SOMA + BigQuery)
    │   ├── soma_ops/         # TileDB SOMA operations (primary backend)
    │   └── bq_ops/           # BigQuery operations (legacy backend)
    ├── shared/               # Common utilities and Pydantic schemas
    └── workflows/            # Kubeflow pipeline definitions
        └── kubeflow/
            ├── components/   # Reusable pipeline component definitions
            ├── pipelines/    # Complete workflow orchestrations
            └── utils/        # Pipeline utilities and constants
conf/                         # Local configuration and .env files
deploy/                       # Dockerfiles and deployment scripts
tests/                        # Test suite mirroring source structure

Working with Kubeflow Pipelines

The system uses Kubeflow pipelines to orchestrate data workflows:

  1. components/: Define individual processing steps for ingestion, extraction, and validation
  2. pipelines/: Orchestrate components into end-to-end workflows
    • ingest_data_pipeline: Manages the complete ingestion process
    • extract_data_pipeline: Handles the full extraction workflow
  3. conf.py: Defines runtime configuration passed to each component
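
The actual contents of conf.py are not reproduced here; as an assumption-laden sketch, a frozen dataclass is one natural shape for the configuration handed to each component (field names and values below are illustrative):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PipelineConfig:
    """Illustrative shape of per-component runtime configuration."""

    gcp_project_id: str
    base_image: str
    pipeline_root: str


cfg = PipelineConfig(
    gcp_project_id="your-gcp-project",
    base_image="your-registry/nexus-workflows:tag",
    pipeline_root="gs://your-bucket/pipeline-root",
)
```

Freezing the dataclass keeps component runs reproducible: a component cannot silently mutate configuration seen by later steps.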

Environment Configuration

The application reads environment variables from conf/.env (and optionally conf/.env.local for local overrides). This file is not checked into version control — create it from the template below.

# conf/.env

# Core
SECRET_KEY=your-secret-key
ENVIRONMENT=local           # local | development | production | test
MAIN_HOST_ALLOWED=localhost
SITE_URL=http://localhost:8000

# PostgreSQL database
DB_NAME=nexus
DB_USER=nexus
DB_PASSWORD=nexus
DB_HOST=localhost            # not used in production (Cloud SQL socket)
DB_PORT=5432
DB_INSTANCE_CONNECTION_NAME= # required in production: project:region:instance

# Cloud Storage
GCP_PROJECT_ID=your-gcp-project
GCP_APPLICATION_BILLING_LABEL=your-billing-label
BUCKET_NAME_PRIVATE=your-private-bucket
BUCKET_NAME_PUBLIC=your-public-bucket

# Pipelines
PIPELINE_BASE_IMAGE=your-registry/nexus-workflows:tag
PIPELINE_SERVICE_ACCOUNT=   # optional: SA used to run pipeline jobs
PIPELINE_ROOT_PATH=         # optional: GCS path used as pipeline root

The ENVIRONMENT variable controls which Django settings module is loaded:

  • local: loads settings/local.py (local filesystem static files, development DB)
  • development: loads settings/development.py
  • production: loads settings/production.py (Cloud SQL, GCS static files)
  • test: loads settings/test.py (SQLite, dummy credentials)
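
Assuming a conventional mapping (the module paths below are illustrative, not copied from the codebase), the selection logic amounts to a dict lookup that feeds DJANGO_SETTINGS_MODULE:

```python
import os

# Hypothetical mapping from ENVIRONMENT values to Django settings modules.
SETTINGS_BY_ENV = {
    "local": "application.settings.local",
    "development": "application.settings.development",
    "production": "application.settings.production",
    "test": "application.settings.test",
}


def settings_module(environment: str) -> str:
    """Resolve the settings module for an ENVIRONMENT value."""
    try:
        return SETTINGS_BY_ENV[environment]
    except KeyError:
        raise ValueError(f"unknown ENVIRONMENT: {environment!r}") from None


os.environ.setdefault("DJANGO_SETTINGS_MODULE", settings_module("local"))
```

Failing fast on an unknown ENVIRONMENT value is preferable to silently falling back to a default, which could start a production process with development settings.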

Docker

Building images locally

Pinned requirements must be exported before building (see Export pinned requirements for Docker below).

# Backend image
docker build -f deploy/backend/Dockerfile -t nexus-backend:local .

# Workflows image
docker build -f deploy/workflows/Dockerfile -t nexus-workflows:local .

Triggering a remote build via GitHub Actions

The docker-update-workflow.yaml workflow builds and pushes images to the artifact registry. Trigger it manually with the GitHub CLI:

gh workflow run docker-update-workflow.yaml \
  -R cellarium-ai/cellarium-nexus \
  --ref <branch-or-tag> \
  -f image-types=<backend|workflows|both> \
  -f image-tag=<semver-tag> \
  -f add-latest-tag=<true|false> \
  -f skip_tests=<true|false>

Input parameters:

  • image-types: which image(s) to build — backend, workflows, or both
  • image-tag: tag pushed to the registry (defaults to the short commit hash)
  • add-latest-tag: also push a latest tag when true
  • skip_tests: skip lint and tests and only build Docker images

The workflow also runs automatically on pushes to any branch and on git tags, but it builds and pushes images only for tagged commits.

Environment and Tooling (Poetry)

This repository uses Poetry for dependency management, virtual environments, tasks, and packaging.

Install tooling

Using pip (in a virtual environment)

python3 -m venv .venv-poetry
source .venv-poetry/bin/activate
pip install poetry
poetry self add "poetry-dynamic-versioning[plugin]"
poetry self add poetry-plugin-export
poetry --version

Create the environment and install dependencies

poetry install --with dev,test,backend

# Optional: keep the venv inside the repo
poetry config virtualenvs.in-project true

Run tests

# All tests
poetry run poe test

# Subsets (pytest markers defined under tests/)
poetry run poe unit
poetry run poe integration

Lint and format

# Lint check (Ruff + Black --check)
poetry run poe lint

# Auto-format (Ruff fixes + Black)
poetry run poe format

Pre-commit hooks

This repository includes a .pre-commit-config.yaml that runs a local hook to lint via poetry run poe lint.

# Install Git hooks
poetry run pre-commit install

# Run all hooks on the entire codebase
poetry run pre-commit run --all-files

Export pinned requirements for Docker

poetry run poe export-backend-reqs
poetry run poe export-workflows-reqs

Dockerfiles consume the exported files under deploy/requirements/ and install the package with pip install . for reproducible builds.

Contributing

When contributing to this repository, please follow these guidelines:

  1. Use built-in type annotations for all function signatures
  2. Write docstrings in imperative mood and reST format
  3. Include proper error documentation with :raise: sections
  4. Use absolute imports throughout the codebase
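
A function following all four guidelines might look like this (the function itself is a made-up example, not part of the codebase):

```python
def load_anndata_summary(path: str) -> dict[str, int]:
    """
    Load summary statistics for an AnnData file.

    :param path: Path to the ``.h5ad`` file.

    :raise ValueError: If ``path`` does not point to an ``.h5ad`` file.

    :return: Mapping of summary statistic names to values.
    """
    if not path.endswith(".h5ad"):
        raise ValueError(f"expected an .h5ad file, got {path!r}")
    # A real implementation would open the file; the sketch returns a placeholder.
    return {"n_cells": 0}
```

Note the built-in `dict[str, int]` annotation (no `typing.Dict`), the imperative-mood summary line, and the explicit `:raise:` section documenting the error path.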
