Background
The current SDRF-Proteomics specification requires two mandatory columns for replicate annotation:
- `comment[technical replicate]`
- `characteristics[biological replicate]`

These columns were introduced during the original PSI meeting and the development of SDRF 1.0.0 to ensure explicit annotation of the experimental design. However, after several years of community usage, we have gathered significant feedback suggesting that this requirement creates unnecessary friction and confusion for data submitters, and that most of the time this information can be inferred. Let me explain the problem.
Problem Statement
1. User friction and poor data quality
Many users struggle with these mandatory columns, leading to:
- Frequent use of "not available" as a value, which defeats the purpose of the annotation
- User complaints and confusion about how to properly assign replicate numbers
- Submission delays and support burden
2. Redundant information
The replicate information can often be inferred from the source name (sample accession) following these logical rules:
| Scenario | Interpretation |
|---|---|
| Different source name | Different biological replicate (different sample) |
| Same source name, different file | Technical replicate of the same sample |
| Same source name, same file, different fraction | Fractionation of the same technical replicate |
This means the explicit columns often duplicate information that is already implicitly encoded in the experimental design.
3. Cognitive overhead
For straightforward experimental designs, requiring users to manually number replicates adds cognitive overhead without proportional benefit. Users must think through numbering schemes that tools could automatically derive.
Proposed Solution
Make the `comment[technical replicate]` and `characteristics[biological replicate]` columns OPTIONAL in the SDRF-Proteomics specification and the base template, starting with version 1.1.0.
Specification changes
- Move both columns from the "mandatory" to the "optional" section
- Add clear documentation on how SDRF readers/parsers should interpret data when these columns are absent
- Maintain backward compatibility (files with explicit replicate columns remain valid)
Interpretation rules for parsers (when columns are absent)
The specification should document the following interpretation logic for SDRF readers:
When `characteristics[biological replicate]` column is ABSENT:
- Each unique `source name` represents a distinct biological replicate
- Rows sharing the same `source name` belong to the same biological sample
When `comment[technical replicate]` column is ABSENT:
- Rows with the same `source name` but different raw files represent technical replicates
- Rows with the same `source name` AND same raw file but different `comment[fraction identifier]` represent fractions of the same technical replicate
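The interpretation rules above could be sketched in a reader roughly as follows. This is a minimal illustration, not a reference implementation. It assumes the raw file is recorded in `comment[data file]`. The function name `infer_replicates` is hypothetical, and so is the choice to number replicates in order of first appearance. Explicit columns are left untouched when present, which keeps existing files valid.

```python
import csv
import io

SOURCE = "source name"
DATA_FILE = "comment[data file]"  # assumption: column holding the raw file name
BIO_REP = "characteristics[biological replicate]"
TECH_REP = "comment[technical replicate]"


def infer_replicates(rows):
    """Return copies of the rows with replicate columns filled in when absent."""
    bio_ids = {}   # source name -> biological replicate number
    tech_ids = {}  # source name -> {data file -> technical replicate number}
    result = []
    for row in rows:
        row = dict(row)  # do not modify the caller's rows
        source = row[SOURCE]
        if BIO_REP not in row:
            # Each unique source name is a distinct biological replicate.
            row[BIO_REP] = str(bio_ids.setdefault(source, len(bio_ids) + 1))
        if TECH_REP not in row:
            # Same source name, different raw file: new technical replicate.
            # Same source name and same raw file (fractions): same replicate.
            files = tech_ids.setdefault(source, {})
            row[TECH_REP] = str(files.setdefault(row[DATA_FILE], len(files) + 1))
        result.append(row)
    return result


# Illustrative SDRF fragment (tab-separated) without replicate columns.
sdrf = (
    "source name\tcomment[data file]\tcomment[fraction identifier]\n"
    "sample 1\trun_a.raw\t1\n"
    "sample 1\trun_a.raw\t2\n"
    "sample 1\trun_b.raw\t1\n"
    "sample 2\trun_c.raw\t1\n"
)
rows = list(csv.DictReader(io.StringIO(sdrf), delimiter="\t"))
for row in infer_replicates(rows):
    print(row[SOURCE], row[DATA_FILE], row[BIO_REP], row[TECH_REP])
# sample 1 run_a.raw 1 1
# sample 1 run_a.raw 1 1
# sample 1 run_b.raw 1 2
# sample 2 run_c.raw 2 1
```

The inferred numbers are only labels that group rows. A tool that needs replicate numbers restarting within each condition would have to renumber them.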
When explicit columns are still recommended
The specification should note that explicit replicate columns remain recommended for:
- Complex experimental designs where inference rules may be ambiguous
- Datasets where the same biological sample was processed on different days/batches
- Studies where explicit replicate tracking is important for downstream analysis
- Cases where submitters want to ensure unambiguous interpretation
Benefits
- Reduced submission friction: Users with straightforward designs can submit without struggling with replicate numbering
- Better data quality: Fewer "not available" entries; implicit encoding via source names is more reliable than confused manual annotation
- Maintained flexibility: Users who need explicit control can still provide the columns
- Clear semantics: Documented inference rules ensure consistent interpretation across tools
Considerations and Discussion Points
Potential concerns
- Loss of explicit information: Some may argue that implicit inference is less reliable than explicit annotation. However, confused or "not available" annotations provide even less value.
- Parser complexity: SDRF readers will need to implement inference logic. However, this is straightforward and likely already implemented in many tools that handle missing values.
- Edge cases: Are there experimental designs where the inference rules would produce incorrect results? We should identify and document these.
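As a starting point for collecting edge cases, a validator could flag designs where the inference rules are least certain. The sketch below is an illustrative heuristic, not part of the proposed rules. It reports source names whose rows differ in both raw file and fraction identifier, because the rules as written would read such rows as technical replicates even if each file actually holds one fraction. The function name `find_ambiguous_sources` is hypothetical, and `comment[data file]` is assumed to hold the raw file name.

```python
from collections import defaultdict

SOURCE = "source name"
DATA_FILE = "comment[data file]"  # assumption: column holding the raw file name
FRACTION = "comment[fraction identifier]"


def find_ambiguous_sources(rows):
    """Return source names whose rows span several files AND several fractions."""
    files = defaultdict(set)
    fractions = defaultdict(set)
    for row in rows:
        files[row[SOURCE]].add(row[DATA_FILE])
        fractions[row[SOURCE]].add(row[FRACTION])
    return sorted(
        source for source in files
        if len(files[source]) > 1 and len(fractions[source]) > 1
    )


example = [
    {SOURCE: "sample 1", DATA_FILE: "f1.raw", FRACTION: "1"},
    {SOURCE: "sample 1", DATA_FILE: "f2.raw", FRACTION: "2"},
    {SOURCE: "sample 2", DATA_FILE: "r1.raw", FRACTION: "1"},
    {SOURCE: "sample 2", DATA_FILE: "r2.raw", FRACTION: "1"},
]
print(find_ambiguous_sources(example))  # ['sample 1']
```

Designs flagged this way are candidates for keeping the explicit replicate columns.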
Questions for the community
Please comment on how this change would affect your workflow. If you are in favor of making the columns optional, please comment "Agreed" and, if you wish, tell us why. If you would rather keep them mandatory, please let us know why.
We welcome community feedback on this proposal. Please share your experiences with the current mandatory requirement and any concerns about the proposed change.
@bigbio/sdrf-community @bigbio/collaborators