SEP | 10 |
---|---|
Title | simplify description of sequence features and sub-parts |
Authors | Raik Gruenberg <raik.gruenberg at gmail com> |
Editor | James McLaughlin |
Type | Data Model |
SBOL Version | 3.0 |
Replaces | |
Status | Accepted |
Created | 20-Sep-2016 |
Last modified | 31-Aug-2019 |
Issue | #25 |
There are two very different types of 'part annotation': (1) Part composition relationships -- these always point to an existing (and presumably re-usable) sub-component. Sequence location, or indeed any sequence information, may or may not be available. (2) Classic sequence feature annotations -- as known from the GenBank format, these only apply to clearly specified sequence regions but often do not point to meaningful sub-parts.
Currently, `Component` alone is sufficient to describe sub-part relations without any sequence information. This, however, is the exception in synthetic biology practice. Both `SequenceAnnotation` and `Component` are needed for SBOL encoding of actual genetic designs with parts and sub-parts (because `Component` lacks a `location` field). Conversely, simple sequence features can be described using `SequenceAnnotation` alone (as of SBOL 2.0), but this possibility is not widely known and additional `Component` and `ComponentDefinition` records are often created instead.
We propose to modify `Component` and `SequenceAnnotation` such that `Component` is solely responsible for the description of part - subpart relationships (with or without sequence) and `SequenceAnnotation` is solely responsible for the description of GenBank-style sequence features. `SequenceAnnotation` should be renamed to `SequenceFeature`.
- 1. Rationale
  - 1.1 Current situation
  - 1.2 Goals of the proposal
- 2. Specification
  - 2.1 Add `location` field to `Component`
  - 2.2 Rename `SequenceAnnotation` to `SequenceFeature`
  - 2.3 Restrict `SequenceFeature` to sequence feature annotation
  - 2.4 Let `SequenceConstraint.object` point to `SequenceFeature`
- 3. Example or Use Case
- 4. Backwards Compatibility
  - 4.1 Suggested transition path
  - 4.2 Conversion of SBOL 2.1 records to 3.0
  - 4.3 Backwards conversion of SBOL 3.x to 2.1 records
- 5. Discussion
- 6. Competing SEPs
- References
- Copyright
The current `SequenceAnnotation` class has a dual purpose:
(1) Its primary role is to specify the location of "sub-parts" within the sequence of a parent `ComponentDefinition`. To this end, `SequenceAnnotation` links one or more `Location` records with a `Component`. The `Component`, in turn, refers to a `ComponentDefinition` (via its `definition` field). This `ComponentDefinition` is the description of the actual sub-part. Actual physical composition is therefore defined like this:
`ComponentDefinition` -[`sequenceAnnotation`]-> `SequenceAnnotation` -[`component`]-> `Component` -[`definition`]-> `ComponentDefinition`
The directionality (which one is the parent and which one is the sub-part) is frequently confused. Moreover, the parent `ComponentDefinition` also directly links to the sub-part `Component` via a `component` field. This is necessary so that composition can be described before any sequences (and thus sequence locations) are known. An additional chain of references is therefore needed, in parallel to the one shown above:
`ComponentDefinition` -[`component`]-> `Component` -[`definition`]-> `ComponentDefinition`
Current SBOL 2.1 part - subpart relations are summarized in the following figure:
Adding to this redundancy, both `Component` and `SequenceAnnotation` may have `role` properties that diverge from the `role` (functional classification) of the target `ComponentDefinition`. Whether a diverging `role` is attached to `Component` or `SequenceAnnotation` is an arbitrary choice. This invites conflicting implementations and interpretations of this field.
Evidently, SBOL makes the description of "undefined", "loose bag" part composition without any sequence information relatively easy. By contrast, the description of actual genetic designs with actual sequences is surprisingly complex and redundant. This is unfortunate because the latter is by far the predominant use case of SBOL. It also hinders adoption by sequence-level designers and tool developers.
(2) The secondary role of `SequenceAnnotation` is to simply annotate regions of interest within a given sequence. Arguably, this should be its primary role (hence the name) as it is a very common use case in practice. A `SequenceAnnotation` without `component` can be created and linked to a region of, e.g., DNA. `SequenceAnnotation` inherits `name` and `description` fields from `Identified` and is therefore sufficient for the description of "flat" sequence features. In practice, however, most tools mix `SequenceAnnotation` and `Component` even for simple sequence features:
(1) Restrict the use of `SequenceAnnotation` to annotations of features which do not fall into the part - subpart category. As a welcome side effect, this should make it much easier to move back and forth between SBOL and large bodies of existing GenBank-formatted information and related software.
(2) Simplify the part - subpart relationship via `Component` so that it no longer requires `SequenceAnnotation`.
(3) Create a syntactic parallel between sequence/physical and functional part - subpart relations in SBOL -- the `Component` class will be equivalent in syntax and meaning to the existing `Participation` class. For programmers, the pattern `ComponentDefinition -> Component(role) -> ComponentDefinition` will look and feel like the already established pattern `Interaction -> Participation(role) -> ComponentDefinition`.
(4) Remove ambiguity as to how things can / should be expressed at the sequence layer to aid meaningful data exchange.
Add the following optional field to `Component`:
- [0..n] `location` pointing to a `Location` on the parent `ComponentDefinition` sequence; if `location` is missing, this indicates a part / sub-part relationship for which sequence details have not (yet) been determined.
The `Location` record(s) specified by a `Component` are subject to the same restrictions currently in place for `SequenceAnnotation` `Location` records. Concretely, two `Location` records attached to the same `Component` MUST NOT overlap in their range, as the meaning of such an overlap would be unclear. The `Location` records of two separate `Component`s may overlap.
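
The non-overlap rule can be checked mechanically. The following is a minimal sketch and not part of the specification: `Range`, `Component` and `locations_are_valid` are hypothetical names, and locations are assumed to be simple 1-based inclusive ranges (cuts and generic locations are ignored).

```python
from dataclasses import dataclass, field
from itertools import combinations
from typing import List


@dataclass(frozen=True)
class Range:
    """Hypothetical stand-in for an SBOL range location (1-based, inclusive)."""
    start: int
    end: int


@dataclass
class Component:
    """Hypothetical sub-part record carrying zero or more locations."""
    name: str
    locations: List[Range] = field(default_factory=list)


def ranges_overlap(a: Range, b: Range) -> bool:
    """Two inclusive ranges overlap if they share at least one position."""
    return a.start <= b.end and b.start <= a.end


def locations_are_valid(component: Component) -> bool:
    """Check that no two locations of the same Component overlap."""
    return not any(
        ranges_overlap(a, b) for a, b in combinations(component.locations, 2)
    )
```

A `Component` without any location passes the check trivially, which matches the case of a sub-part whose sequence details are not yet known. Overlaps between the locations of two different `Component`s are deliberately not tested.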
- rename class `SequenceAnnotation` to `SequenceFeature`
- rename `sequenceAnnotation` field of `ComponentDefinition` to `sequenceFeature`
Remove the following fields from `SequenceFeature` (formerly `SequenceAnnotation`):
- `component` -- `SequenceAnnotation` is no longer used for part - subpart relations
- `roleIntegration` -- there is no sub-part/definition that `role` fields may be in conflict with
Update the specification to clarify usage of existing fields:
- [0..n] `role` pointing to a Sequence Ontology term (optional), corresponds to GenBank type field
- [0..1] `name` corresponding to GenBank name field (optional but now RECOMMENDED)
- [0..1] `description` corresponds to GenBank description field (optional)
Moreover, a validation rule is needed: `SequenceFeature` can only be used if an actual sequence record is specified for the parent `ComponentDefinition`.
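
As an illustration only, such a validation rule could be sketched as follows. The class and function names are hypothetical, the sequence is simplified to an optional string, and the error message is invented for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SequenceFeature:
    """Hypothetical flat feature annotation (no sub-part reference)."""
    name: str


@dataclass
class ComponentDefinition:
    """Hypothetical parent record; `sequence` is None if not yet specified."""
    name: str
    sequence: Optional[str] = None
    sequence_features: List[SequenceFeature] = field(default_factory=list)


def validate_features(definition: ComponentDefinition) -> List[str]:
    """Return validation errors; an empty list means the record is valid."""
    errors = []
    # Features are only meaningful relative to a concrete sequence.
    if definition.sequence_features and not definition.sequence:
        errors.append(
            f"{definition.name}: SequenceFeature requires a sequence "
            "on the parent ComponentDefinition"
        )
    return errors
```

This sketch only tests for the presence of a sequence; it does not check that feature locations fall within the sequence length.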
`SequenceConstraint.object` and `SequenceConstraint.subject` can point to either `ComponentInstance` derivatives (as before) or to `SequenceFeature`.
This change makes it possible to anchor constraints on sequence regions that are not actually sub-parts. Examples may be start / stop codons, transcription start sites or specific mutations.
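
A rough sketch of what this could look like for a tool developer is given below. All names are hypothetical, both record types are reduced to a name plus a single start/end range, and `precedes` is interpreted as "the subject starts before the object", which is an assumption made for this example rather than a statement of the specification.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Component:
    """Hypothetical localized sub-part."""
    name: str
    start: int
    end: int


@dataclass
class SequenceFeature:
    """Hypothetical flat sequence feature that is not a sub-part."""
    name: str
    start: int
    end: int


# Either kind of record may appear on both ends of a constraint.
Located = Union[Component, SequenceFeature]


@dataclass
class SequenceConstraint:
    restriction: str
    subject: Located
    object: Located


def is_satisfied(constraint: SequenceConstraint) -> bool:
    """Evaluate a constraint; only 'precedes' is handled in this sketch."""
    if constraint.restriction != "precedes":
        raise NotImplementedError(constraint.restriction)
    # Assumed reading: the subject must start before the object.
    return constraint.subject.start < constraint.object.start
```

For example, a promoter `Component` can be constrained to precede a start codon that is only recorded as a `SequenceFeature`, without creating a `Component` and `ComponentDefinition` for the codon.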
Example use cases for the modified `SequenceAnnotation` (now `SequenceFeature`) are feature annotations such as START or STOP codons, mutations, highlighted regions referred to in a paper, sequence conflicts, etc., all mainly intended for human consumption. Over the evolution of a design, sequence features may later be formalized into re-usable sub-parts (i.e. `Component`s). It is therefore conceivable that a sequence editor reads in a GenBank file with many sequence features and offers the user an easy conversion of some of those features into sub-parts. This, in fact, is a workflow already used and supported by the Benchling Sequence editor (http://benchling.com).
Implement all changes at once in SBOL v 3.0.
- remove intermediate `SequenceAnnotation` and move `SequenceAnnotation.location` to `Component`
- optionally, try to flatten `SequenceAnnotation` - `Component` - `ComponentDefinition` chains of trivial annotations into `SequenceFeature` records
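
The first of these conversion steps can be sketched as follows. This is a hedged illustration, not a reference converter: records are modelled as plain dictionaries with invented keys (`id`, `components`, `sequenceAnnotations`, `component`, `locations`), and the optional flattening step is not attempted.

```python
from copy import deepcopy
from typing import Any, Dict


def convert_2_1_to_3_0(definition: Dict[str, Any]) -> Dict[str, Any]:
    """Move annotation locations onto their Components (hypothetical layout)."""
    old = deepcopy(definition)  # leave the caller's record untouched
    components = {
        c["id"]: {**c, "locations": []} for c in old.get("components", [])
    }
    features = []
    for annotation in old.get("sequenceAnnotations", []):
        target = annotation.get("component")
        if target is not None:
            # Part - subpart annotation: keep only its locations.
            components[target]["locations"].extend(
                annotation.get("locations", [])
            )
        else:
            # Component-less annotation: becomes a SequenceFeature as is.
            features.append(annotation)
    result = {
        key: value
        for key, value in old.items()
        if key not in ("components", "sequenceAnnotations")
    }
    result["components"] = list(components.values())
    result["sequenceFeatures"] = features
    return result
```

Note that this sketch silently drops any other fields (such as a diverging `role`) stored on an annotation that points to a `Component`; a real converter would need a policy for those.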
- conversion of localized `Component`:
  (1) create `SequenceAnnotation` record pointing to subpart `Component`
  (2) move `location` from `Component` to `SequenceAnnotation`
- conversion of non-localized `Component`:
  no change required
- conversion of `SequenceFeature`:
  (1) rename `SequenceFeature` to `SequenceAnnotation`
  (2) rename `ComponentDefinition.sequenceFeature` field to `sequenceAnnotation`
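
The backwards conversion steps listed above can be sketched in the same spirit. Again this is only an illustration under assumptions: records are plain dictionaries with invented keys, and the `_annotation` suffix used for the identifiers of newly created annotations is a made-up convention.

```python
from copy import deepcopy
from typing import Any, Dict


def convert_3_x_to_2_1(definition: Dict[str, Any]) -> Dict[str, Any]:
    """Re-create SequenceAnnotation records from localized Components."""
    new = deepcopy(definition)  # leave the caller's record untouched
    components = []
    annotations = []
    for component in new.get("components", []):
        locations = component.pop("locations", [])
        components.append(component)
        if locations:
            # Localized Component: wrap its locations in a new annotation.
            annotations.append({
                "id": component["id"] + "_annotation",  # hypothetical naming
                "component": component["id"],
                "locations": locations,
            })
        # Non-localized Component: no change required.
    # SequenceFeature records only need to be renamed.
    annotations.extend(new.get("sequenceFeatures", []))
    result = {
        key: value
        for key, value in new.items()
        if key not in ("components", "sequenceFeatures")
    }
    result["components"] = components
    result["sequenceAnnotations"] = annotations
    return result
```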
As an added benefit, the proposed change creates a symmetry between the sequence and the functional layer of SBOL: `Component` is now the equivalent of `Participation`. The former describes a physical part - subpart relation whereas the latter describes a functional part - subpart relation. Both specify one or more `role` properties, both point to a (sub)`ComponentDefinition`. Currently, this parallel is obfuscated by the multiple direct and indirect references between parent and sub-part `ComponentDefinition`.
- It was pointed out that `SequenceAnnotation` already can have its own `name` and `description` fields as it is derived from `Identified`. The SEP was changed accordingly.
- Renaming `SequenceAnnotation` to `SequenceFeature` was universally considered a good idea (for symmetry with GenBank and bioinformatics practice, and in order to avoid confusion with "Annotation" in SBOL and SBML).
- At COMBINE, it was suggested that `SequenceConstraint` should also be allowed to point to `SequenceFeature`. This would avoid construction of `Component` - `ComponentDefinition` chains for, e.g., mutations or other simple features that are not sub-parts but nevertheless restrict/orient the positioning of other Components. This change has been incorporated into the SEP.
- Originally, this `SequenceConstraint` -> `SequenceFeature` link was restricted to the `SequenceConstraint.object` field. This was meant to enforce that `Component`s (sub-parts) can be anchored to sequence features but not the other way round. However, the types of constraints allowed assume that the directionality of a `SequenceConstraint` can be freely chosen. We can say that part A `precedes` part B but we cannot say that part A "follows" part B. In SBOL, the latter is expressed as "part B `precedes` part A" (i.e. `subject` and `object` of the `SequenceConstraint` are reversed). For this reason, both `object` and `subject` of the constraint need to be allowed to point to `SequenceFeature`.
- At COMBINE, it was suggested to rename `SequenceConstraint` to `ComponentConstraint` -- this should be put into a separate SEP.
- It was suggested to put additional restrictions on the use of `Component.location` so that a fully specified sequence can more easily be pieced together from Component sequences. This raises an important issue with the current data model, which does not allow an easy distinction between partially defined and fully specified sequences. However, the editors consider this an orthogonal problem which should be addressed separately.
- The original SEP (see github history) suggested a step-wise introduction starting with SBOL 2.2. This would have led to a hybrid data model where both usage patterns could co-exist and was eventually considered too complex. Instead, the SEP is considered as a clean backward-incompatible change for SBOL v 3.0.
The following SEPs make complementary suggestions for further simplification of the SBOL data model:
- SEP 15 (Issue) -- rename Component -> SubPart and ComponentDefinition -> Component
- SEP 25 (Issue) -- merge Module(Definition) with Component(Definition) and remove FunctionalComponent
None.
To the extent possible under law, SBOL developers has waived all copyright and related or neighboring rights to SEP 010. This work is published from: United States.