Make comment[technical replicate] and characteristics[biological replicate] columns optional in SDRF-Proteomics 1.1.0

## Background

The current SDRF-Proteomics specification requires two mandatory columns for replicate annotation:
- `comment[technical replicate]`
- `characteristics[biological replicate]`

These columns were introduced during the original PSI meeting and the development of SDRF 1.0.0 to ensure explicit annotation of experimental design. However, after several years of community usage, we have gathered significant feedback suggesting that this requirement creates unnecessary friction and confusion for data submitters, and that in most cases the information can be inferred.

## Problem Statement

### 1. User friction and poor data quality

Many users struggle with these mandatory columns, leading to:
- Frequent use of "not available" as a value, which defeats the purpose of the annotation
- User complaints and confusion about how to properly assign replicate numbers
- Submission delays and support burden

### 2. Redundant information

The replicate information can often be **inferred from the `source name`** (sample accession) following these logical rules:

| Scenario | Interpretation |
|----------|----------------|
| Different `source name` | Different biological replicate (different sample) |
| Same `source name`, different file | Technical replicate of the same sample |
| Same `source name`, same file, different fraction | Fractionation of the same technical replicate |

This means the explicit columns often duplicate information that is already implicitly encoded in the experimental design.

### 3. Cognitive overhead

For straightforward experimental designs, requiring users to manually number replicates adds cognitive overhead without proportional benefit. Users must think through numbering schemes that tools could automatically derive.

## Proposed Solution

**Make the `comment[technical replicate]` and `characteristics[biological replicate]` columns OPTIONAL in the SDRF-Proteomics specification and the base template, starting in version 1.1.0.**

### Specification changes

1. Move both columns from the "mandatory" to the "optional" section
2. Add clear documentation on how SDRF readers/parsers should interpret data when these columns are absent
3. Maintain backward compatibility (files with explicit replicate columns remain valid)
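
As a rough illustration of point 3, a reader could first check which replicate columns a file actually provides and only fall back to inference for the missing ones. The sketch below is not taken from any existing SDRF tool; the function name `absent_replicate_columns` is hypothetical, and it assumes a tab-separated header line with case-insensitive column names.

```python
# Hypothetical helper: report which replicate columns are missing from an SDRF header.
REPLICATE_COLUMNS = (
    "comment[technical replicate]",
    "characteristics[biological replicate]",
)


def absent_replicate_columns(header_line):
    """Return the replicate columns that do not appear in a tab-separated header."""
    # Assumption: column names are compared case-insensitively after trimming whitespace.
    present = {name.strip().lower() for name in header_line.rstrip("\n").split("\t")}
    return [column for column in REPLICATE_COLUMNS if column not in present]


header = "source name\tcharacteristics[biological replicate]\tcomment[data file]"
missing = absent_replicate_columns(header)
print(missing)
```

A file that still carries both columns yields an empty list, so existing files would be read exactly as they are today.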

### Interpretation rules for parsers (when columns are absent)

The specification should document the following interpretation logic for SDRF readers:

```
When `characteristics[biological replicate]` column is ABSENT:
  - Each unique `source name` represents a distinct biological replicate
  - Rows sharing the same `source name` belong to the same biological sample

When `comment[technical replicate]` column is ABSENT:
  - Rows with the same `source name` but different raw files represent technical replicates
  - Rows with the same `source name` AND same raw file but different `comment[fraction identifier]` 
    represent fractions of the same technical replicate
```
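
The rules above could be implemented along the following lines. This is a minimal sketch rather than a reference implementation: the function name `infer_replicates` is hypothetical, the raw file is assumed to be given in a `comment[data file]` column, and replicate numbers are assigned in order of first appearance, which the proposal does not prescribe.

```python
import csv
import io
from collections import defaultdict

SOURCE = "source name"
DATA_FILE = "comment[data file]"  # assumption: the column that names the raw file
FRACTION = "comment[fraction identifier]"


def infer_replicates(sdrf_text):
    """Infer replicate numbers for SDRF rows that lack explicit replicate columns."""
    rows = csv.DictReader(io.StringIO(sdrf_text), delimiter="\t")
    bio_ids = {}                  # source name -> biological replicate number
    tech_ids = defaultdict(dict)  # source name -> {raw file -> technical replicate number}
    inferred = []
    for row in rows:
        source, data_file = row[SOURCE], row[DATA_FILE]
        # Each new source name is a new biological replicate.
        bio = bio_ids.setdefault(source, len(bio_ids) + 1)
        # Each new raw file under the same source name is a new technical replicate.
        files = tech_ids[source]
        tech = files.setdefault(data_file, len(files) + 1)
        inferred.append({
            "source name": source,
            "data file": data_file,
            "fraction": row.get(FRACTION, "1"),  # assumption: a missing column means one fraction
            "biological replicate": bio,
            "technical replicate": tech,
        })
    return inferred


example = (
    "source name\tcomment[data file]\n"
    "sample 1\trun_a.raw\n"
    "sample 1\trun_b.raw\n"
    "sample 2\trun_c.raw\n"
)
for entry in infer_replicates(example):
    print(entry["source name"], entry["biological replicate"], entry["technical replicate"])
```

In the example, the two files for `sample 1` come out as technical replicates 1 and 2 of biological replicate 1, and `sample 2` becomes biological replicate 2.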

### When explicit columns are still recommended

The specification should note that explicit replicate columns remain **recommended** for:
- Complex experimental designs where inference rules may be ambiguous
- Datasets where the same biological sample was processed on different days/batches
- Studies where explicit replicate tracking is important for downstream analysis
- Cases where submitters want to ensure unambiguous interpretation

## Benefits

1. **Reduced submission friction**: Users with straightforward designs can submit without struggling with replicate numbering
2. **Better data quality**: Fewer "not available" entries; implicit encoding via source names is more reliable than confused manual annotation
3. **Maintained flexibility**: Users who need explicit control can still provide the columns
4. **Clear semantics**: Documented inference rules ensure consistent interpretation across tools

## Considerations and Discussion Points

### Potential concerns

1. **Loss of explicit information**: Some may argue that implicit inference is less reliable than explicit annotation. However, confused or "not available" annotations provide even less value.

2. **Parser complexity**: SDRF readers will need to implement inference logic. However, this is straightforward and likely already implemented in many tools that handle missing values.

3. **Edge cases**: Are there experimental designs where the inference rules would produce incorrect results? We should identify and document these.
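
To help collect such cases, a small check could flag files where the rules may not be enough. One candidate, offered here as an assumption for discussion rather than a confirmed problem, is a design in which each fraction is stored in its own raw file: the same `source name` then has several files and several fraction identifiers, and a parser cannot tell fractions from technical replicates by file name alone. The function name `ambiguous_sources` and the `comment[data file]` column are assumptions of this sketch.

```python
from collections import defaultdict


def ambiguous_sources(rows):
    """Return source names that have both multiple raw files and multiple fractions.

    `rows` is an iterable of dicts keyed by SDRF column name.
    """
    seen = defaultdict(set)  # source name -> {(raw file, fraction identifier)}
    for row in rows:
        fraction = row.get("comment[fraction identifier]", "1")
        seen[row["source name"]].add((row["comment[data file]"], fraction))
    flagged = []
    for source, pairs in seen.items():
        files = {data_file for data_file, _ in pairs}
        fractions = {fraction for _, fraction in pairs}
        # Several files AND several fractions: replicates and fractions cannot be separated.
        if len(files) > 1 and len(fractions) > 1:
            flagged.append(source)
    return sorted(flagged)


rows = [
    {"source name": "s1", "comment[data file]": "f1.raw", "comment[fraction identifier]": "1"},
    {"source name": "s1", "comment[data file]": "f2.raw", "comment[fraction identifier]": "2"},
    {"source name": "s2", "comment[data file]": "a.raw", "comment[fraction identifier]": "1"},
    {"source name": "s2", "comment[data file]": "b.raw", "comment[fraction identifier]": "1"},
]
print(ambiguous_sources(rows))
```

Sources flagged this way are the ones for which explicit replicate columns would presumably remain worth recommending.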

### Questions for the community

**Please comment on how this change would affect your workflow. If you are in favor of making the columns optional, comment "Agreed" and, if you wish, explain why. If you would prefer to keep them mandatory, please tell us why.**

**We welcome community feedback on this proposal.** Please share your experiences with the current mandatory requirement and any concerns about the proposed change.

@bigbio/sdrf-community @bigbio/collaborators 
