Make comment[technical replicate] and characteristics[biological replicate] columns optional in SDRF-Proteomics 1.1.0 #775

@ypriverol

Background

The current SDRF-Proteomics specification requires two mandatory columns for replicate annotation:

  • comment[technical replicate]
  • characteristics[biological replicate]

These columns were introduced during the original PSI meeting and the development of SDRF 1.0.0 to ensure explicit annotation of the experimental design. However, after several years of community usage, we have gathered significant feedback suggesting that this requirement creates unnecessary friction and confusion for data submitters, and that most of the time this information can be inferred. Let me explain the problem.

Problem Statement

1. User friction and poor data quality

Many users struggle with these mandatory columns, leading to:

  • Frequent use of "not available" as a value, which defeats the purpose of the annotation
  • User complaints and confusion about how to properly assign replicate numbers
  • Submission delays and increased support burden
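To put a number on the first point for a given file, a small check like the following could be used. It is only a sketch: the function name `not_available_rate` is hypothetical, and it assumes the SDRF rows have already been read into dictionaries keyed by column name (for example with `csv.DictReader` and a tab delimiter).

```python
def not_available_rate(rows, column):
    """Return the fraction of rows whose value in `column` is "not available".

    `rows` is a list of dicts keyed by SDRF column name (an assumption of this
    sketch). Rows that lack the column are ignored. Matching is
    case-insensitive and ignores surrounding whitespace.
    """
    values = [row[column].strip().lower() for row in rows if column in row]
    if not values:
        return 0.0
    return sum(value == "not available" for value in values) / len(values)
```

A rate close to 1.0 for a replicate column would suggest the column carries little information in that file.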

2. Redundant information

The replicate information can often be inferred from the source name (sample accession) following these logical rules:

| Scenario | Interpretation |
|---|---|
| Different source name | Different biological replicate (different sample) |
| Same source name, different file | Technical replicate of the same sample |
| Same source name, same file, different fraction | Fractionation of the same technical replicate |

This means the explicit columns often duplicate information that is already implicitly encoded in the experimental design.
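The three scenarios above can be expressed as a pairwise comparison of two rows. This is a minimal sketch, not a reference implementation: the function name `relate_rows` is hypothetical, `comment[data file]` is assumed to be the column holding the file name, and the final "same row" case is an addition of this sketch that the scenarios do not cover.

```python
def relate_rows(a, b):
    """Classify how two SDRF rows relate, following the scenarios above.

    `a` and `b` are dicts keyed by column name. The column names used here
    are assumptions of this sketch.
    """
    if a["source name"] != b["source name"]:
        return "different biological replicate"
    if a["comment[data file]"] != b["comment[data file]"]:
        return "technical replicate of the same sample"
    if a["comment[fraction identifier]"] != b["comment[fraction identifier]"]:
        return "fraction of the same technical replicate"
    # Not covered by the scenarios: identical in all three columns.
    return "same row"
```

The checks run from the coarsest distinction (sample) to the finest (fraction), so the first difference found decides the interpretation.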

3. Cognitive overhead

For straightforward experimental designs, requiring users to manually number replicates adds cognitive overhead without proportional benefit. Users must think through numbering schemes that tools could automatically derive.

Proposed Solution

Make the comment[technical replicate] and characteristics[biological replicate] columns OPTIONAL, starting with version 1.1.0 of the SDRF-Proteomics specification and its base template.

Specification changes

  1. Move both columns from the "mandatory" to the "optional" section
  2. Add clear documentation on how SDRF readers/parsers should interpret data when these columns are absent
  3. Maintain backward compatibility (files with explicit replicate columns remain valid)
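For validators, the change could look roughly like the sketch below. The names `ILLUSTRATIVE_REQUIRED` and `missing_required` are hypothetical, and the required set shown is a deliberately small stand-in, not the full list of mandatory columns in the specification.

```python
# The two columns this proposal would make optional from version 1.1.0.
OPTIONAL_SINCE_1_1_0 = {
    "comment[technical replicate]",
    "characteristics[biological replicate]",
}

# Illustrative subset only; the real specification requires more columns.
ILLUSTRATIVE_REQUIRED = {
    "source name",
    "comment[data file]",
    "comment[fraction identifier]",
}


def missing_required(header, spec_version):
    """Return the sorted list of required columns absent from `header`.

    `spec_version` is a tuple such as (1, 0, 0) or (1, 1, 0). Before 1.1.0
    the two replicate columns count as required; from 1.1.0 they do not.
    """
    required = set(ILLUSTRATIVE_REQUIRED)
    if spec_version < (1, 1, 0):
        required |= OPTIONAL_SINCE_1_1_0
    return sorted(required - set(header))
```

A file that still carries both replicate columns passes under either version, which is the backward compatibility described in point 3.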

Interpretation rules for parsers (when columns are absent)

The specification should document the following interpretation logic for SDRF readers:

When `characteristics[biological replicate]` column is ABSENT:
  - Each unique `source name` represents a distinct biological replicate
  - Rows sharing the same `source name` belong to the same biological sample

When `comment[technical replicate]` column is ABSENT:
  - Rows with the same `source name` but different raw files represent technical replicates
  - Rows with the same `source name` AND same raw file but different `comment[fraction identifier]` 
    represent fractions of the same technical replicate
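A reader could apply these rules along the following lines. This is a sketch under stated assumptions: `annotate_replicates` is a hypothetical name, `comment[data file]` is assumed to identify the raw file, and numbering replicates from 1 in order of first appearance is a choice made here, not something the proposal prescribes.

```python
BIO = "characteristics[biological replicate]"
TECH = "comment[technical replicate]"


def annotate_replicates(rows):
    """Return copies of `rows` with replicate columns filled in when absent.

    `rows` is a list of dicts keyed by SDRF column name. Explicit values are
    never overwritten; inferred values are written as strings, since SDRF
    cells are text.
    """
    bio_numbers = {}   # source name -> inferred biological replicate number
    tech_numbers = {}  # source name -> {data file -> technical replicate number}
    annotated = []
    for row in rows:
        source = row["source name"]
        data_file = row["comment[data file]"]
        # Each unique source name is a distinct biological replicate.
        inferred_bio = bio_numbers.setdefault(source, len(bio_numbers) + 1)
        # Within one source name, each distinct raw file is a technical replicate.
        files = tech_numbers.setdefault(source, {})
        inferred_tech = files.setdefault(data_file, len(files) + 1)
        new_row = dict(row)
        new_row.setdefault(BIO, str(inferred_bio))
        new_row.setdefault(TECH, str(inferred_tech))
        annotated.append(new_row)
    return annotated
```

Rows that share both source name and raw file receive the same technical replicate number, so rows differing only in `comment[fraction identifier]` are treated as fractions of one technical replicate, as the second rule states.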

When explicit columns are still recommended

The specification should note that explicit replicate columns remain recommended for:

  • Complex experimental designs where inference rules may be ambiguous
  • Datasets where the same biological sample was processed on different days/batches
  • Studies where explicit replicate tracking is important for downstream analysis
  • Cases where submitters want to ensure unambiguous interpretation

Benefits

  1. Reduced submission friction: Users with straightforward designs can submit without struggling with replicate numbering
  2. Better data quality: Fewer "not available" entries; implicit encoding via source names is more reliable than confused manual annotation
  3. Maintained flexibility: Users who need explicit control can still provide the columns
  4. Clear semantics: Documented inference rules ensure consistent interpretation across tools

Considerations and Discussion Points

Potential concerns

  1. Loss of explicit information: Some may argue that implicit inference is less reliable than explicit annotation. However, confused or "not available" annotations provide even less value.

  2. Parser complexity: SDRF readers will need to implement inference logic. However, this is straightforward and likely already implemented in many tools that handle missing values.

  3. Edge cases: Are there experimental designs where the inference rules would produce incorrect results? We should identify and document these.
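One way to start looking for such edge cases is to scan existing files that already carry explicit annotations and flag places where they contradict the inference rules. The sketch below covers a single check and uses a hypothetical function name: it reports source names annotated with more than one biological replicate value, which the rule "same source name, same biological sample" could not reproduce.

```python
from collections import defaultdict


def sources_with_conflicting_replicates(
    rows, column="characteristics[biological replicate]"
):
    """Map each source name to its explicit replicate values when there are several.

    `rows` is a list of dicts keyed by SDRF column name and is assumed to
    contain `column`. Source names with a single value are omitted.
    """
    seen = defaultdict(set)
    for row in rows:
        seen[row["source name"]].add(row[column])
    return {
        source: sorted(values)
        for source, values in seen.items()
        if len(values) > 1
    }
```

An empty result means this particular check found no contradiction; it does not show that inference is safe for the file in general.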

Questions for the community

Please comment on how this change would affect your workflow. If you are in favor of making the columns optional, comment "Agreed" and, if you wish, explain why; if you would rather keep them mandatory, please let us know why.

We welcome community feedback on this proposal. Please share your experiences with the current mandatory requirement and any concerns about the proposed change.

@bigbio/sdrf-community @bigbio/collaborators

Labels: PSI-Discussion, Specification, enhancement
