bigbio/sdrf-cellline-metadata-db

SDRF Cell Line Metadata Database

This repository provides tools to create and manage a cell line metadata database for annotating SDRF (Sample and Data Relationship Format) files in proteomics studies. The primary use case is improving annotation consistency for quantms.org datasets. The scripts integrate multiple ontologies and natural language processing (NLP) methods to standardize cell line metadata.

You can query the database directly on GitHub for cell line metadata, including the organism, tissue, disease, and other relevant fields. The database is designed to be easily extensible and can be updated with new cell line information.

NOTE: This is NOT an ontology of cell lines. It is a registry/table where users can find the required information about a cell line in a standardized format for SDRF annotation.


Table of Contents

  1. Motivation
  2. Metadata Sources
  3. Ontologies
  4. Database Structure
  5. Features
  6. Requirements
  7. Installation
  8. Usage
  9. Contribution
  10. License

Motivation

Cell lines are essential in biological research but often lack standardized metadata, leading to inconsistencies. This repository aims to:

  • Create a centralized database for cell line metadata.
  • Facilitate annotation and validation of cell line SDRFs, particularly in proteomics datasets.

Cell line metadata sources

We integrate metadata from three main sources and additional curation efforts:

  1. Cellosaurus:
    The primary metadata source.

  2. Cell Model Passports:
    A collection of cell lines from various sources.

  3. Expression Atlas (EA):
    Cell line metadata curated from RNA experiments over more than 10 years.

    • Collected data: Stored in the ea/ folder.
    • Script: ea_db.py processes this source.

Additional Curation: Manual annotation is performed using data from:


Ontologies

The following ontologies are used for annotation:

  1. MONDO:
    Used to annotate the disease associated with a cell line.

  2. BTO:
    Provides additional references for cell line IDs.


Database Structure

The database is stored as a tab-separated values (TSV) file and contains the following key fields:

| Field Name | Description |
|---|---|
| cell line | Cell line code. |
| cellosaurus name | Name as annotated in Cellosaurus (ID field). |
| cellosaurus accession | Accession from Cellosaurus (AC field). |
| bto cell line | Name as annotated in BTO. |
| organism | Organism species (from Cellosaurus). |
| organism part | Annotated from supplementary sources. |
| sampling site | Sampling site of the cell line. |
| age | Age of the cell line (from Cellosaurus or additional sources). |
| developmental stage | Developmental stage (inferred from age if missing). |
| sex | Sex information (from Cellosaurus). |
| ancestry category | Ancestry classification (from Cellosaurus or supplementary sources). |
| disease | Agreed-upon disease annotation across sources. |
| cell type | Agreed-upon cell type annotation across sources. |
| material type | Agreed-upon material classification. |
| synonyms | Consolidated synonyms and accessions from all sources. |
| curated | Curation status: _not curated_, _AI curated_, or _manual curated_. |

Note: The final database is provided as a tab-delimited file for easy integration. It can be loaded into tools like Pandas or viewed directly via GitHub's table renderer.
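
As a minimal sketch of loading and querying the table with pandas: the example below reads a small in-memory excerpt rather than the real file, so the rows and the `CVCL_XXXX`-style accessions are placeholders, and the helper names `load_db` and `find_cell_line` are hypothetical (they are not part of this repository). Only the column names come from the table above.

```python
import io

import pandas as pd

# Hypothetical two-row excerpt using column names from the field table;
# the values are illustrative placeholders, not real database entries.
SAMPLE_TSV = (
    "cell line\tcellosaurus name\tcellosaurus accession\torganism\tdisease\tcurated\n"
    "EXAMPLE-1\tEXAMPLE-1\tCVCL_XXXX\tHomo sapiens\tnot available\tnot curated\n"
    "EXAMPLE-2\tEXAMPLE-2\tCVCL_YYYY\tMus musculus\tnot available\tmanual curated\n"
)


def load_db(source):
    """Read the tab-delimited database, keeping every field as a string."""
    return pd.read_csv(source, sep="\t", dtype=str, keep_default_na=False)


def find_cell_line(db, name):
    """Return rows whose 'cell line' column equals `name`, ignoring case."""
    mask = db["cell line"].str.lower() == name.lower()
    return db[mask]


db = load_db(io.StringIO(SAMPLE_TSV))
hits = find_cell_line(db, "example-1")
print(hits[["cell line", "organism", "curated"]].to_string(index=False))
```

To query the real database, pass the path of the TSV file (for example the `cl-annotations-db.tsv` file used in the annotator command) to `load_db` instead of the in-memory excerpt.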


Features

  • Standardizes metadata from multiple sources.
  • Uses ontologies to annotate diseases and tissue information.
  • Supports AI-based curation and manual validation for accuracy.
  • Provides easy-to-query tab-delimited outputs.

SDRF Cell Line Annotator

This script annotates the cell lines in an SDRF (Sample and Data Relationship Format) file with information from a provided cell line metadata database. It matches cell line names from the SDRF against entries in the database, using exact matches on the cell line, cellosaurus name, and cellosaurus accession fields, as well as partial matches against synonyms. If a match is found, the corresponding metadata (e.g., organism, disease, age) is returned. If no match is found, the fields are populated with "not available" and a warning is logged.

python annotator.py --sdrf-file MSV000085836.sdrf.tsv --db-file cl-annotations-db.tsv --output-file suggested-terms.tsv

Key Features:

  • Database Matching: Matches cell line names from the SDRF file against a cell line database with multiple matching criteria (exact and synonym-based).
  • Synonym Handling: Synonyms in the database are split by semicolon and compared to the cell line names, ensuring flexible matching.
  • Logging and Error Handling: Warnings are logged for any unmatched cell lines, and errors are gracefully handled.
  • TSV Output: Annotates and outputs the results to a new TSV file, maintaining structured data for downstream analysis.
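
The matching behaviour described above can be sketched roughly as follows. This is not the code of annotator.py: the function names `find_match` and `annotate`, the case-insensitive comparison, the choice of output fields, and the sample row are all assumptions made for illustration. The sketch also compares each name against whole synonyms, which is a simplification of the partial synonym matching mentioned above.

```python
import csv
import io
import logging

logging.basicConfig(level=logging.WARNING)
logger = logging.getLogger("annotator_sketch")

# Columns checked for an exact match, as listed in the description above.
EXACT_FIELDS = ("cell line", "cellosaurus name", "cellosaurus accession")
# Subset of metadata fields copied to the output (chosen for brevity).
OUTPUT_FIELDS = ("organism", "disease")

# Hypothetical one-row database excerpt; values are placeholders.
DB_TSV = (
    "cell line\tcellosaurus name\tcellosaurus accession\torganism\tdisease\tsynonyms\n"
    "EXAMPLE-1\tEXAMPLE-1\tCVCL_XXXX\tHomo sapiens\tnot available\tEX1; Example 1\n"
)


def find_match(name, db_rows):
    """Return the first database row matching `name`, or None."""
    key = name.strip().lower()
    # Pass 1: exact match on the identifying columns.
    for row in db_rows:
        if any(row[field].strip().lower() == key for field in EXACT_FIELDS):
            return row
    # Pass 2: compare against the semicolon-separated synonyms.
    for row in db_rows:
        synonyms = [s.strip().lower() for s in row["synonyms"].split(";")]
        if key in synonyms:
            return row
    return None


def annotate(names, db_rows):
    """Build one output record per SDRF cell line name."""
    results = []
    for name in names:
        row = find_match(name, db_rows)
        if row is None:
            # Unmatched names are reported and filled with a placeholder.
            logger.warning("No match for cell line %r", name)
            annotation = {field: "not available" for field in OUTPUT_FIELDS}
        else:
            annotation = {field: row[field] for field in OUTPUT_FIELDS}
        results.append({"sdrf cell line": name, **annotation})
    return results


db_rows = list(csv.DictReader(io.StringIO(DB_TSV), delimiter="\t"))
annotated = annotate(["EX1", "unknown-line"], db_rows)
for record in annotated:
    print(record)
```

Running exact matching before synonym matching means an identifier that appears both as a primary name and as another entry's synonym resolves to the primary entry first.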

Requirements

To use the scripts, ensure the following are installed:

  • Python 3.x
  • Required libraries:
    • pandas
    • numpy
    • spacy
  • The en_core_web_lg model for spaCy: python -m spacy download en_core_web_lg

Code of Conduct

We strive to foster a welcoming, inclusive, and respectful community where everyone feels encouraged to participate and contribute. As contributors and maintainers, we are committed to upholding ethical standards to prevent conflicts, harassment, and discrimination. We ask all participants to communicate respectfully, avoid personal attacks, and be constructive in their feedback. Contributions should be made with honesty, empathy, and respect for differing perspectives. Read the full Code of Conduct.

Commenting and contributing

We welcome contributions from the community. If you would like to contribute, please open an issue or a pull request. We will review your contribution and provide feedback. We aim to be inclusive and collaborative, and we welcome all contributions that are in line with our goals.

  • If you want to contribute to the manuscript, please do the following:
    • Fork the repository
    • Edit the content of manuscript.md
    • Submit a pull request
    • We will review your contribution and provide feedback
  • If you want to discuss a topic, please open an issue.

NOTE: If, based on your contribution, you would like to be added as a co-author, please open an issue and provide your name and affiliation and a short description of your contribution or a link to the relevant issue and pull request.


Contributors