Ensembl/ensembl-cdm-docs

Ensembl Core Data Model (CDM)

Introduction

The Ensembl Core Data Model (CDM) describes how the central concepts of Ensembl, such as features (genes, transcripts etc.), genomic locations, external references and metadata, are presented through APIs and services to internal clients (e.g. the new Ensembl website) and external clients (e.g. those who interact with Ensembl APIs). The model is intended to support Ensembl's future development efforts and to be compatible with known strategic directions, including pan-genomes.

Key concepts

The key concepts in the model can be broken into the following categories:

  • Features (genes, transcripts etc.)
  • Localisation concepts, assemblies, species and organisms
  • External references
  • Metadata

Features and localisation

Features

In the model, Feature is defined as an abstract concept for modelling areas of interest in a genomic coordinate space. Concrete representations of features typically have a stable_id for identification. The term stable_id refers to a publicly available identifier (e.g. ENST00000380152.8) that is assigned by a project or institute and can be considered unique within an Assembly.
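The idea of an abstract Feature with a stable_id can be sketched in Python as follows. This is an illustration only, not the CDM schema: the `Transcript` class, its `assembly` field and the `unversioned_stable_id` helper are hypothetical names, and treating a trailing `.<digits>` suffix as a version is an assumption.

```python
from abc import ABC
from dataclasses import dataclass


@dataclass(frozen=True)
class Feature(ABC):
    """Abstract area of interest in a genomic coordinate space (sketch only)."""

    stable_id: str  # publicly available identifier, e.g. "ENST00000380152.8"

    def unversioned_stable_id(self) -> str:
        # Assumption: a trailing ".<digits>" suffix is a version number.
        base, sep, suffix = self.stable_id.rpartition(".")
        return base if sep and suffix.isdigit() else self.stable_id


@dataclass(frozen=True)
class Transcript(Feature):
    """Hypothetical concrete feature; the real model defines its own fields."""

    assembly: str = "unknown"  # hypothetical: the Assembly that scopes the ID


# stable_id is unique within an Assembly, so the pair can serve as a lookup key.
transcript = Transcript("ENST00000380152.8", assembly="GRCh38")
index = {(transcript.assembly, transcript.stable_id): transcript}
```

Keying the lookup on both values reflects the statement that a stable_id is only guaranteed unique within an Assembly.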

Features in the model are:

"Feature-like" entities are also included in the model. These are not considered to be features because they do not exist in the genomic coordinate space in the same way that a feature such as a gene does. These concepts include:

The transcription and translation event is captured in the ProductGeneratingContext (PGC). The PGC describes the product it is making through the type field, and identifies the features and feature-like entities (gene, transcripts, phased exons, UTRs, CDS and cDNA) involved in the process. It also links to the eventual product (if one exists).

PhasedExons in PGCs allow for identifiable exons to be present in multiple PGCs with additional phase information. SplicedExons describe the location of identifiable exons in one or more transcripts.
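The relationship between a PGC and its phased exons can be sketched as below. This is a minimal illustration under stated assumptions: all field names other than the `type` concept are hypothetical, the identifiers are placeholders, and the phase values (-1 for non-coding, otherwise 0, 1 or 2) are an assumed convention rather than something this document specifies.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class PhasedExon:
    """An identifiable exon plus the phase it has in one particular PGC."""

    exon_stable_id: str
    start_phase: int  # assumption: -1 (non-coding), 0, 1 or 2
    end_phase: int


@dataclass
class ProductGeneratingContext:
    """Sketch of a PGC: what is being made and which entities are involved."""

    product_type: str  # stands in for the model's "type" field
    gene_id: str
    transcript_id: str
    phased_exons: List[PhasedExon] = field(default_factory=list)
    product_id: Optional[str] = None  # None when no product exists


# The same exon (placeholder ID) appears in two PGCs with different phases.
pgc_a = ProductGeneratingContext(
    "protein", "GENE_1", "TRANSCRIPT_1",
    [PhasedExon("EXON_1", start_phase=0, end_phase=2)],
    product_id="PROTEIN_1",
)
pgc_b = ProductGeneratingContext(
    "protein", "GENE_1", "TRANSCRIPT_2",
    [PhasedExon("EXON_1", start_phase=-1, end_phase=1)],
)
```

Keeping the phase on the PhasedExon rather than on the exon itself is what lets one identifiable exon take part in several PGCs.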

Products (Protein and RNA) are included in this Feature section as they are "Feature-like" in many ways. However, as they do not occupy the genomic coordinate space, they do not inherit from Feature. RNA products are not currently available in Ensembl and so have not been fully modelled.

Sequence, localisation concepts, assemblies, species and organisms

Sequence information for Features and "Feature-like" entities is managed in two different ways.

Features with genomic locations have a Slice. Slice is the mechanism used to link together Region (a contig or chromosome), Location (coordinates and length) and Strand. Sequence is obtained via Region.
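As a rough sketch of how a Slice ties Region, Location and Strand together, consider the toy Python below. The field names, the in-memory `sequence` string on Region and the 1-based inclusive coordinates are assumptions made for illustration; the real model may differ on all of them.

```python
from dataclasses import dataclass
from enum import Enum

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


class Strand(Enum):
    FORWARD = 1
    REVERSE = -1


@dataclass(frozen=True)
class Region:
    """A contig or chromosome; here it simply holds a toy sequence string."""

    name: str
    sequence: str


@dataclass(frozen=True)
class Location:
    start: int  # assumption: 1-based, inclusive
    end: int    # assumption: inclusive

    @property
    def length(self) -> int:
        return self.end - self.start + 1


@dataclass(frozen=True)
class Slice:
    region: Region
    location: Location
    strand: Strand

    def sequence(self) -> str:
        # Sequence is obtained via the Region, then oriented by Strand.
        bases = self.region.sequence[self.location.start - 1:self.location.end]
        if self.strand is Strand.REVERSE:
            bases = bases.translate(_COMPLEMENT)[::-1]
        return bases


toy_region = Region("toy_contig", "AACCGGTT")
forward = Slice(toy_region, Location(1, 4), Strand.FORWARD)
reverse = Slice(toy_region, Location(1, 4), Strand.REVERSE)
```

Here `forward.sequence()` reads the bases directly from the region, while `reverse.sequence()` returns their reverse complement.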

For "Feature-like" entities which have a sequence but do not have a slice or genomic location (e.g. cDNA, Product etc.) the Sequence object is associated with them directly.

Sequence allows for integration with RefGet instances via the checksum attribute. Region provides the link between an Assembly and its sequence.
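To illustrate how a checksum attribute can act as a lookup key for a refget-style service, the sketch below computes a digest for a sequence. Which digest scheme the CDM actually stores is not stated here; using the GA4GH `sha512t24u` truncated digest (first 24 bytes of SHA-512, base64url encoded) and upper-casing the bases first are assumptions, and `from_bases` is a hypothetical helper.

```python
import base64
import hashlib
from dataclasses import dataclass


def sha512t24u(data: bytes) -> str:
    """Base64url encoding of the first 24 bytes of the SHA-512 digest."""
    digest = hashlib.sha512(data).digest()
    return base64.urlsafe_b64encode(digest[:24]).decode("ascii")


@dataclass(frozen=True)
class Sequence:
    """Sketch of a Sequence identified by its checksum, not its content."""

    checksum: str
    length: int

    @classmethod
    def from_bases(cls, bases: str) -> "Sequence":
        # Assumption: bases are normalised to upper case before hashing.
        normalised = bases.upper()
        return cls(sha512t24u(normalised.encode("ascii")), len(normalised))


cdna_sequence = Sequence.from_bases("acgtacgt")
```

Because the checksum depends only on the sequence content, the same sequence yields the same key wherever it is computed, which is what makes it usable for retrieving the sequence from an external store.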

Species and Organism are the entities involved in specifying the source of the assembly. Species provides the taxonomic detail (e.g. Triticum aestivum), while Organism provides more granular information about an individual or cultivar (e.g. Jagger, a wheat cultivar). This granularity is provided through OrganismGroup, a concept shared with the Metadata Schema.

External references

ExternalReference represents a reference to a database outside of Ensembl.

Metadata

Metadata is used to capture information which supports or extends the data held within the model.

CDM separates Metadata into two types:

ExternalReferenceMetadata

ExternalReferenceMetadata is metadata which comes from a source external to Ensembl. An example of this would be the name of a Gene, which could come from VGNC.

ValueSetMetadata

ValueSetMetadata is more akin to a well structured controlled vocabulary. Examples of this would be MANE Select or MANE Plus Clinical.
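The distinction between the two metadata types can be sketched as follows. This is an assumption-laden illustration: the field names, the `TRANSCRIPT_DESIGNATIONS` vocabulary constant and the validation behaviour are hypothetical, and the gene name is a placeholder rather than real data.

```python
from dataclasses import dataclass

# Hypothetical controlled vocabulary built from the examples in the text.
TRANSCRIPT_DESIGNATIONS = frozenset({"MANE Select", "MANE Plus Clinical"})


@dataclass(frozen=True)
class ExternalReferenceMetadata:
    """Metadata whose value comes from a source outside Ensembl."""

    value: str
    source: str  # e.g. "VGNC"


@dataclass(frozen=True)
class ValueSetMetadata:
    """Metadata whose value must belong to a controlled vocabulary."""

    value: str
    allowed_values: frozenset = TRANSCRIPT_DESIGNATIONS

    def __post_init__(self) -> None:
        if self.value not in self.allowed_values:
            raise ValueError(f"{self.value!r} is not in the value set")


gene_name = ExternalReferenceMetadata("example-gene-name", source="VGNC")
designation = ValueSetMetadata("MANE Select")
```

In this sketch the external reference value is free text attributed to its source, whereas constructing a ValueSetMetadata with a value outside the vocabulary raises a `ValueError`.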

Further information

Development of the CDM

The development of the model has been conducted by a group of staff from the Ensembl project with specialisms in a variety of areas, including:

  • Variation genomics
  • Comparative genomics
  • Genome annotation
  • Software & API development

The model has been developed iteratively over a number of months and this work has been influenced by the Ensembl website redevelopment. The requirements from the website and related services have been fed into the CDM group's regular meetings where they have been discussed, designed, documented and reviewed.

The CDM has been used as the basis for our new API development. Building on a shared model helps ensure that the concepts used are well understood, reduces divergence and creates a common nomenclature for key ideas across the project.

Versioning

The CDM uses semantic versioning. This is managed through Git tags.

Issues

Issues with the CDM are tracked using GitHub issues.

Contact us

The Ensembl team can be contacted using the "contact us" link on the Ensembl website if there are any questions relating to the model.
