[ENH] Update digest to support latest Nipoppy processing status files #166

alyssadai · 2024-12-13T18:12:41Z

Changes proposed in this pull request:

- Rename READMEs, schemas, and example input files to remove "bagel" to avoid confusion with files used/generated by Neurobagel / Nipoppy (input files are now just called digest files or inputs)
  - Note: this was done for all user-facing files, but the code internally still uses "bagel" in a lot of places to refer to the input data. These will be updated in Rename functions and variables to deprecate "bagel" and reflect non-imaging data #164
- Update imaging digest schema to align with latest Nipoppy processing status file schema
  - Rename identifier columns to match Nipoppy
  - Remove columns no longer needed: PHASE__, STAGE__, HAS_DATATYPE__, HAS_IMAGE, has_mri_data
  - Make pipeline_starttime optional
  - Add columns: pipeline_step, status
  - Remove no-longer-needed IsPrefixedColumn, MissingValue column properties
  - The digest still has a few additional columns for now that may be useful in processing status files in future
  - See also Consider removing column categories in digest schemas #163
- Use combination of pipeline-version-step instead of pipeline-version for tracking completion
  - Each pipeline-version-step combination now has distinct plots
- Accept TSV instead of CSV inputs (Nipoppy now only produces TSVs)
- Regenerate reference example inputs as TSVs, with column names updated according to latest schema
  - Reference example imaging digest file now based on https://github.com/neurobagel/neurobagel_examples/blob/main/data-upload/nipoppy_proc_status_synthetic.tsv
- Regenerate test example files as TSVs, with column names updated according to latest schema
- Update tests (mostly renaming column references)
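
To illustrate what tracking completion per pipeline-version-step combination can look like, here is a rough sketch that counts statuses for each combination. It is not the digest's actual code: `pipeline_step` and `status` come from the schema changes listed above, while the `pipeline_name` and `pipeline_version` column names, the example rows, and the helper `count_status_per_step` are assumptions made for illustration.

```python
import io

import pandas as pd

# Hypothetical processing status rows (tab-separated); all values are made up.
EXAMPLE_TSV = (
    "participant_id\tsession_id\tpipeline_name\tpipeline_version\tpipeline_step\tstatus\n"
    "01\t01\tfmriprep\t23.1.3\tdefault\tSUCCESS\n"
    "02\t01\tfmriprep\t23.1.3\tdefault\tFAIL\n"
    "01\t01\tfreesurfer\t7.3.2\tdefault\tSUCCESS\n"
)


def count_status_per_step(df: pd.DataFrame) -> pd.DataFrame:
    """Count rows per status for each pipeline-version-step combination."""
    keys = ["pipeline_name", "pipeline_version", "pipeline_step"]
    # One row per combination, one column per status, zero where a status is absent.
    return df.groupby(keys + ["status"]).size().unstack(fill_value=0)


# dtype=str keeps IDs such as "01" from being parsed as integers.
status_df = pd.read_csv(io.StringIO(EXAMPLE_TSV), sep="\t", dtype=str)
counts = count_status_per_step(status_df)
print(counts)
```

Each row of `counts` corresponds to one combination, which is the unit that would get its own plot.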

TODO: Update pre-generated QPN file paths

For reviewer: you can test out the changes in the digest using either of these files:

Checklist

This section is for the PR reviewer

PR has an interpretable title with a prefix ([ENH], [FIX], [REF], [TST], [CI], [MNT], [INF], [MODEL], [DOC]) (see our Contributing Guidelines for more info)
PR links to GitHub issue with mention Closes #XXXX
Tests pass
Checks pass

For new features:

Tests have been added

For bug fixes:

There is at least one test that would fail under the original bug conditions.


alyssadai · 2024-12-13T19:18:40Z

Hey @michellewang, requesting your review here on just the digest schema changes as well as text including READMEs and status descriptions. Will tag you on the relevant files/sections under files changed!

README.md

schemas/README.md

alyssadai · 2024-12-13T19:19:46Z

schemas/imaging_digest_schema.json

please review @michellewang

alyssadai · 2024-12-13T19:19:56Z

schemas/phenotypic_digest_schema.json

please review @michellewang

digest/utility.py

michellewang

@alyssadai let me know if I missed anything!

README.md

michellewang · 2024-12-19T22:15:06Z

README.md

I can't comment on the actual line because it hasn't been changed in this PR, but at one point this README refers to https://github.com/neurodatascience/nipoppy-qpn, which gives me a 404 error.

digest/utility.py

schemas/README.md

michellewang · 2024-12-19T22:28:17Z

schemas/imaging_digest_schema.json

+            "IsRequired": false
+        },
+        "session_id": {
+            "Description": "Participant session identifier.", 


Is "Participant session" on purpose, or should it be only "Session"? For bids_session_id it just says "session".

Suggested change

-            "Description": "Participant session identifier.",
+            "Description": "Session identifier.",

michellewang · 2024-12-20T15:04:46Z

schemas/imaging_digest_schema.json

+            "IsRequired": true
+        },
+        "status": {
+            "Description": "Completion status of the pipeline run or step for the subject-session pair. 'SUCCESS': All output files are present. 'FAIL': At least one output file is missing. 'INCOMPLETE': Parent pipeline has not been run for the subject session. 'UNAVAILABLE': Relevant MRI modality for pipeline not available for subject session.",


INCOMPLETE should/will be about the pipeline itself, not a parent pipeline. What you have in digest/utility.py is good I think

Suggested change

-            "Description": "Completion status of the pipeline run or step for the subject-session pair. 'SUCCESS': All output files are present. 'FAIL': At least one output file is missing. 'INCOMPLETE': Parent pipeline has not been run for the subject session. 'UNAVAILABLE': Relevant MRI modality for pipeline not available for subject session.",
+            "Description": "Completion status of the pipeline run or step for the subject-session pair. 'SUCCESS': All output files are present. 'FAIL': At least one output file is missing. 'INCOMPLETE': Pipeline has not been run for the subject session. 'UNAVAILABLE': Relevant MRI modality for pipeline not available for subject session.",

michellewang · 2024-12-20T15:09:07Z

schemas/imaging_digest_schema.json

+            "IsRequired": true
+        },
+        "bids_participant_id": {
+            "Description": "BIDS-compliant participant identifier.",


Maybe we could explicitly say that this should include the sub- prefix?

michellewang · 2024-12-20T15:09:32Z

schemas/imaging_digest_schema.json

+            "IsRequired": true
+        },
+        "bids_session_id": {
+            "Description": "BIDS-compliant session identifier.", 


Same comment as above but for ses- prefix
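
The prefix expectation raised in the two comments above could be checked along these lines. This is only a sketch: the helper `has_bids_prefix` and the sample rows are hypothetical, and the schema itself does not currently enforce this.

```python
def has_bids_prefix(value: str, prefix: str) -> bool:
    """Check that an ID starts with the given prefix and has a label after it."""
    return value.startswith(prefix) and len(value) > len(prefix)


# Made-up rows: the second one is missing the sub- prefix.
rows = [
    {"bids_participant_id": "sub-01", "bids_session_id": "ses-01"},
    {"bids_participant_id": "01", "bids_session_id": "ses-02"},
]

for row in rows:
    ok = has_bids_prefix(row["bids_participant_id"], "sub-") and has_bids_prefix(
        row["bids_session_id"], "ses-"
    )
    print(row, "OK" if ok else "missing prefix")
```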

michellewang · 2024-12-20T15:09:55Z

schemas/phenotypic_digest_schema.json

-        "bids_id": {
-            "Description": "BIDS dataset identifier for a participant, if available/different from the participant_id.", 
+        "bids_participant_id": {
+            "Description": "BIDS-compliant participant identifier.", 


Same comment as for imaging bagel: maybe we could explicitly say that this should include the sub- prefix?

michellewang · 2024-12-20T15:10:57Z

schemas/phenotypic_digest_schema.json

General question: is it on purpose that this file doesn't have bids_session_id? I guess that column is not very important since the pheno file cannot be used as input to other Neurobagel things

surchs

Thanks @alyssadai. Changes look good, I did flag one thing I believe is an accidental change/bug with pd.read_csv to pd.read_tsv. I haven't tried this locally, but I'm pretty sure that doesn't work. Even if I'm wrong about that, I'd say add a test that covers the load_file_from_path route in utility, because it doesn't seem tested yet.

surchs · 2025-01-07T21:39:31Z

digest/utility.py

    if not file_path.exists():
        return None, "File not found."

-    bagel = pd.read_csv(file_path)
+    bagel = pd.read_tsv(file_path, sep="\t")


👀 I don't think read_tsv exists in pandas. Was this an accidental replace-all?

sidenote: using rename symbol in vscode can be a safer way to do rename-refactors: https://code.visualstudio.com/docs/editor/refactoring#_rename-symbol

You're likely not catching this with tests because you are only testing the load_file_from_contents route. At some point it would make sense to use an e2e test library like cypress or playwright to user-test the whole thing. But even now it should be possible to do a simple smoke test with a temp .tsv file here
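
A minimal sketch of the smoke test suggested above, written as pytest-style functions and using only the standard library plus pandas. The `load_file_from_path` body here is a simplified stand-in reconstructed from the diff hunk, not the real function in digest/utility.py, which may do more; in the actual test suite it would be imported rather than redefined. The column names and values in the temp file are made up.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_file_from_path(file_path: Path):
    """Simplified stand-in for the function in the diff; returns (dataframe, error)."""
    if not file_path.exists():
        return None, "File not found."
    # pandas has no read_tsv; read_csv with a tab separator reads TSV files.
    bagel = pd.read_csv(file_path, sep="\t")
    return bagel, None


def test_load_file_from_path_reads_tsv():
    """Smoke test: write a tiny TSV to a temp directory and load it back."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        tsv_path = Path(tmp_dir) / "example.tsv"
        tsv_path.write_text("bids_participant_id\tstatus\nsub-01\tSUCCESS\n")
        bagel, error = load_file_from_path(tsv_path)
    assert error is None
    assert list(bagel.columns) == ["bids_participant_id", "status"]
    assert bagel.loc[0, "status"] == "SUCCESS"


def test_load_file_from_path_missing_file():
    """A nonexistent path should return the error message instead of raising."""
    bagel, error = load_file_from_path(Path("does_not_exist.tsv"))
    assert bagel is None
    assert error == "File not found."


# Run directly when not using a test runner.
test_load_file_from_path_reads_tsv()
test_load_file_from_path_missing_file()
```

With the `pd.read_tsv` call from the diff, the first test would fail with an AttributeError, which is the kind of regression this route currently has no coverage for.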

surchs · 2025-01-07T22:03:08Z

digest/app.py

@@ -188,7 +189,7 @@ def process_bagel(upload_contents, available_digest_nclicks, filenames):
        {"type": schema, "data": overview_df.to_dict("records")},
        pipelines_dict,
        None,
-        "csv",
+        "csv",  # NOTE: "tsv" is not an option for export_format


Suggested change

-        "csv",  # NOTE: "tsv" is not an option for export_format
+        "csv",  # NOTE: the dash_table.DataTable object does not support "tsv" as an option for export_format


alyssadai added 26 commits December 9, 2024 01:24

update proc status file schema

3802068

- match column names to https://nipoppy.readthedocs.io/en/latest/schemas/index.html#imaging-bagel-file
- make pipeline_starttime optional
- remove "PHASE__" & "STAGE__" cols, "step" & "status" cols
- simplify column descriptions

remove columns about raw imaging data and IsPrefixedColumn property

38bff3a

- "HAS_DATATYPE__", "HAS_IMAGE__"

require TSV inputs instead of CSVs

4dd3d1f

align phenotypic file schema with proc status file changes

1b14759

- remove "IsPrefixedColumn"
- rename ID columns & update their descriptions

update id column references and handling to reflect new names

669adda

- refactor out primary session column name into a variable

update comment

5880361

fix imports

84dda4a

update references to renamed 'pipeline-complete' column

403ed1f

remove MissingValue column property from schema

e7df57d

refactor id column extraction

53047f0

update docstrings

4745d7f

update README

07c11d3

update README

e53c76d

update README

3e10487

update schema README

99cfdf5

update schema README

b25d14e

rename schemas

902947a

fix sneaky outdated session column reference

1f72a13

update columns in test data digests and convert to TSVs

b2dde0e

update missing column example based on revised schema

037e5dc

rename test data files

745483a

update docstring

a18a2d6

rename reference example inputs and update symlinks

cab23fb

update tests

3d492a6

replace csv with tsv in docstrings

e622585

rename PRIMARY_SESSION var

08a5812

alyssadai mentioned this pull request Dec 13, 2024

Update the digest to work with the latest imaging processing status file #158

Open

alyssadai marked this pull request as ready for review December 13, 2024 19:16

alyssadai requested a review from michellewang December 13, 2024 19:17


michellewang reviewed Dec 20, 2024


surchs self-requested a review January 7, 2025 21:35

surchs requested changes Jan 7, 2025


Update schema README

8615e20

Co-authored-by: Michelle Wang <tomichellewang@gmail.com>
