Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ENH] Update digest to support latest Nipoppy processing status files #166

Open
wants to merge 27 commits into
base: main
Choose a base branch
from

Conversation

alyssadai
Copy link
Collaborator

@alyssadai alyssadai commented Dec 13, 2024

Changes proposed in this pull request:

  • Rename READMEs, schemas, and example input files to remove "bagel" to avoid confusion with files used/generated by Neurobagel / Nipoppy (input files are now just called digest files or inputs)
  • Update imaging digest schema to align with latest Nipoppy processing status file schema
    • Rename identifier columns to match Nipoppy
    • Remove columns no longer needed: PHASE__, STAGE__, HAS_DATATYPE__, HAS_IMAGE, has_mri_data
    • Make pipeline_starttime optional
    • Add columns: pipeline_step, status
    • Remove no-longer-needed IsPrefixedColumn, MissingValue column properties
    • The digest still has a few additional columns for now that may be useful in processing status files in future
    • See also Consider removing column categories in digest schemas #163
  • Use combination of pipeline-version-step instead of pipeline-version for tracking completion
    • Each pipeline-version-step combination now has distinct plots
  • Accept TSV instead of CSV inputs (Nipoppy now only produces TSVs)
  • Regenerate reference example inputs as TSVs, with column names updated according to latest schema
  • Regenerate test example files as TSVs, with column names updated according to latest schema
  • Update tests (mostly renaming column references)

TODO: Update pre-generated QPN file paths

For reviewer: you can test out the changes in the digest using either of these files:

Checklist

This section is for the PR reviewer

  • PR has an interpretable title with a prefix ([ENH], [FIX], [REF], [TST], [CI], [MNT], [INF], [MODEL], [DOC]) (see our Contributing Guidelines for more info)
  • PR links to GitHub issue with mention Closes #XXXX
  • Tests pass
  • Checks pass

For new features:

  • Tests have been added

For bug fixes:

  • There is at least one test that would fail under the original bug conditions.

- match column names to https://nipoppy.readthedocs.io/en/latest/schemas/index.html#imaging-bagel-file
- make pipeline_starttime optional
- remove "PHASE__" & "STAGE__" cols, "step" & "status" cols
- simplify column descriptions
- remove "IsPrefixedColumn"
- rename ID columns & update their descriptions
- refactor out primary session column name into a variable
@alyssadai
Copy link
Collaborator Author

Hey @michellewang, requesting your review here on just the digest schema changes as well as text including READMEs and status descriptions. Will tag you on the relevant files/sections under files changed!

README.md Show resolved Hide resolved
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please review @michellewang

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please review @michellewang

Copy link
Collaborator

@michellewang michellewang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alyssadai let me know if I missed anything!

README.md Show resolved Hide resolved
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't comment on actual line because it hasn't been changed in the PR but at one point this README refers to https://github.com/neurodatascience/nipoppy-qpn which gives me a 404 error

digest/utility.py Show resolved Hide resolved
schemas/README.md Outdated Show resolved Hide resolved
"IsRequired": false
},
"session_id": {
"Description": "Participant session identifier.",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this Participant session on purpose or should it be only Session? For bids_session_id it just says session

Suggested change
"Description": "Participant session identifier.",
"Description": "Session identifier.",

"IsRequired": true
},
"status": {
"Description": "Completion status of the pipeline run or step for the subject-session pair. 'SUCCESS': All output files are present. 'FAIL': At least one output file is missing. 'INCOMPLETE': Parent pipeline has not been run for the subject session. 'UNAVAILABLE': Relevant MRI modality for pipeline not available for subject session.",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

INCOMPLETE should/will be about the pipeline itself, not a parent pipeline. What you have in digest/utility.py is good I think

Suggested change
"Description": "Completion status of the pipeline run or step for the subject-session pair. 'SUCCESS': All output files are present. 'FAIL': At least one output file is missing. 'INCOMPLETE': Parent pipeline has not been run for the subject session. 'UNAVAILABLE': Relevant MRI modality for pipeline not available for subject session.",
"Description": "Completion status of the pipeline run or step for the subject-session pair. 'SUCCESS': All output files are present. 'FAIL': At least one output file is missing. 'INCOMPLETE': Pipeline has not been run for the subject session. 'UNAVAILABLE': Relevant MRI modality for pipeline not available for subject session.",

"IsRequired": true
},
"bids_participant_id": {
"Description": "BIDS-compliant participant identifier.",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we could explicitly say that this should include the sub- prefix?

"IsRequired": true
},
"bids_session_id": {
"Description": "BIDS-compliant session identifier.",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment as above but for ses- prefix

"bids_id": {
"Description": "BIDS dataset identifier for a participant, if available/different from the participant_id.",
"bids_participant_id": {
"Description": "BIDS-compliant participant identifier.",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment as for imaging bagel: maybe we could explicitly say that this should include the sub- prefix?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

General question: is it on purpose that this file doesn't have bids_session_id? I guess that column is not very important since the pheno file cannot be used as input to other Neurobagel things

@surchs surchs self-requested a review January 7, 2025 21:35
Copy link
Contributor

@surchs surchs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @alyssadai. Changes look good, I did flag one thing I believe is an accidental change/bug with pd.read_csv to pd.read_tsv. I haven't tried this locally, but I'm pretty sure that doesn't work. Even if I'm wrong about that, I'd say add a test that covers the load_file_from_path route in utility, because it doesn't seem tested yet.

if not file_path.exists():
return None, "File not found."

bagel = pd.read_csv(file_path)
bagel = pd.read_tsv(file_path, sep="\t")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👀 I don't think read_tsv exists in pandas. Was this an accidental replace-all?

sidenote: using rename symbol in vscode can be a safer way to do rename-refactors: https://code.visualstudio.com/docs/editor/refactoring#_rename-symbol

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're likely not catching this with tests because you are only testing the load_file_from_contents route. At some point it would make sense to use an e2e test library like cypress or playwright to user-test the whole thing. But even now it should be possible to do a simple smoke test with a temp .tsv file here

@@ -188,7 +189,7 @@ def process_bagel(upload_contents, available_digest_nclicks, filenames):
{"type": schema, "data": overview_df.to_dict("records")},
pipelines_dict,
None,
"csv",
"csv", # NOTE: "tsv" is not an option for export_format
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"csv", # NOTE: "tsv" is not an option for export_format
"csv", # NOTE: the dash_table.DataTable object does not support "tsv" as an option for export_format

Co-authored-by: Michelle Wang <tomichellewang@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment