[ENH] Update digest to support latest Nipoppy processing status files #166

Open

wants to merge 27 commits into base: main

Commits (27)
3802068  update proc status file schema (alyssadai, Dec 9, 2024)
38bff3a  remove columns about raw imaging data and IsPrefixedColumn property (alyssadai, Dec 9, 2024)
4dd3d1f  require TSV inputs instead of CSVs (alyssadai, Dec 9, 2024)
1b14759  align phenotypic file schema with proc status file changes (alyssadai, Dec 9, 2024)
669adda  update id column references and handling to reflect new names (alyssadai, Dec 9, 2024)
5880361  update comment (alyssadai, Dec 9, 2024)
84dda4a  fix imports (alyssadai, Dec 9, 2024)
403ed1f  update references to renamed 'pipeline-complete' column (alyssadai, Dec 9, 2024)
e7df57d  remove MissingValue column property from schema (alyssadai, Dec 11, 2024)
53047f0  refactor id column extraction (alyssadai, Dec 11, 2024)
4745d7f  update docstrings (alyssadai, Dec 11, 2024)
07c11d3  update README (alyssadai, Dec 13, 2024)
e53c76d  update README (alyssadai, Dec 13, 2024)
3e10487  update README (alyssadai, Dec 13, 2024)
99cfdf5  update schema README (alyssadai, Dec 13, 2024)
b25d14e  update schema README (alyssadai, Dec 13, 2024)
902947a  rename schemas (alyssadai, Dec 13, 2024)
1f72a13  fix sneaky outdated session column reference (alyssadai, Dec 13, 2024)
b2dde0e  update columns in test data digests and convert to TSVs (alyssadai, Dec 13, 2024)
037e5dc  update missing column example based on revised schema (alyssadai, Dec 13, 2024)
745483a  rename test data files (alyssadai, Dec 13, 2024)
a18a2d6  update docstring (alyssadai, Dec 13, 2024)
cab23fb  rename reference example inputs and update symlinks (alyssadai, Dec 13, 2024)
3d492a6  update tests (alyssadai, Dec 13, 2024)
e622585  replace csv with tsv in docstrings (alyssadai, Dec 13, 2024)
08a5812  rename PRIMARY_SESSION var (alyssadai, Dec 13, 2024)
8615e20  Update schema README (alyssadai, Jan 9, 2025)
56 changes: 30 additions & 26 deletions README.md
Collaborator comment on README.md: Can't comment on the actual line because it hasn't been changed in the PR, but at one point this README refers to https://github.com/neurodatascience/nipoppy-qpn, which gives a 404 error.

@@ -1,42 +1,46 @@
-# Dashboard for neuroimaging and phenotypic dataset exploration
+# Descriptive & neuroImaging data Graphical Explorer for Subject Tracking
 
 - [Overview](#overview)
 - [Preview](#preview)
 - [Quickstart](#quickstart)
 - [Input schema](#input-schema)
-- [Creating a dashboard-ready dataset file](#creating-a-dataset-file-for-the-dashboard-bagelcsv)
+- [Creating a dashboard-ready "digest" file](#creating-a-dashboard-ready-digest-file)
 - [Running in a Docker container](#running-in-a-docker-container)
 - [Local development](#local-development)
 
 ## Overview
-`digest` is a web dashboard that provides interactive visual summaries and subject-level querying based on neuroimaging derivatives and phenotypic variables available for a dataset.
+`digest` is a web dashboard for exploring subject-level availability of pipeline derivatives and phenotypic variables in a neuroimaging dataset.
+It provides user-friendly options for querying data availability, along with interactive visual summaries.
 
-A `digest` dashboard can be generated for any tabular dataset file that follows a data modality-specific [schema](/schemas/), which we refer to as a "bagel" file.
-The dashboard is compatible with the processing status `bagel.csv` files automatically generated by the [Nipoppy framework for neuroimaging dataset organization and processing](https://github.com/neurodatascience/nipoppy).
-
-For more information on how to use `digest` with the Nipoppy project, also see the official [Nipoppy documentation](https://neurobagel.org/nipoppy/overview/).
-
-**Quickstart**: https://digest.neurobagel.org/
+`digest` supports any dataset TSV file that follows a data modality-specific [schema](/schemas/) (called a "digest" file).
+`digest` is also compatible with the processing status files generated by [Nipoppy](https://nipoppy.readthedocs.io/en/stable/).
 
 ## Preview
 ![alt text](img/ui_overview_table.png?raw=true)
 ![alt text](img/ui_overview_plots.png?raw=true)
 
+## Quickstart
+Try out `digest` at https://digest.neurobagel.org/!
+
+You can find correctly formatted example input files [here](/example_bagels/) to test out dashboard functionality.
+
 ## Input schema
-Input files to the dashboard contain long format data that must be formatted according to the [bagel schema](/schemas/) (see also the schemas [README](https://github.com/neurobagel/digest/tree/main/schemas#readme) for more info). A single file is expected to correspond to one dataset, but may contain status information for multiple processing pipelines for that dataset.
-
-### Try it out
-You can view and download correctly formatted, minimal input tabular files from [here](/example_bagels/) to test out dashboard functionality.
-
-## Creating a dashboard-ready dataset file (`bagel.csv`)
-While `digest` works on any input CSV compliant with a [bagel schema](/schemas/), the easiest way to generate a dashboard-ready file for a dataset's neuroimaging processing info is to follow the [Nipoppy](https://neurobagel.org/nipoppy/overview/) standard structure for organizing raw MRI data and processed outputs (data derivatives).
-`Nipoppy` offers scripts that can use this standardized dataset organization to automatically extract info about the raw imaging files and any pipelines that have been run, which is then stored in a dashboard-ready `bagel.csv`.
-
-Detailed instructions to get started using `Nipoppy` can be found in their [documentation](https://neurobagel.org/nipoppy/overview/).
-In brief, generating a `bagel.csv` for your dataset can be as simple as:
-1. Installing `Nipoppy` to generate a dataset directory tree for your dataset (see [Installation](https://neurobagel.org/nipoppy/installation/) section of docs) that you can populate with your existing data
-2. Update `Nipoppy` configuration to reflect the pipeline versions you are using (for tracking purposes), and augment your participant spreadsheet according to `Nipoppy` requirements (see [Configs](https://neurobagel.org/nipoppy/configs/) section of docs)
-3. Run the tracker ([run_tracker.py](https://github.com/neurodatascience/nipoppy/blob/main/trackers/run_tracker.py)) for the relevant pipeline(s) for your dataset to generate a comprehensive `bagel.csv`
-   - To see help text for this script: `python run_tracker.py --help`
-   - This step can be repeated as needed to update the `bagel.csv` with newly processed subjects
+`digest` supports long format TSVs that contain the columns specified in the [digest schemas](/schemas/) (see also the schema [README](https://github.com/neurobagel/digest/tree/main/schemas#readme)).
+At the moment, each digest file is expected to correspond to one dataset.
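For reference, a minimal sketch of what a long-format digest file might look like. The values are hypothetical, and the column names shown here (e.g. `bids_participant_id`, `session_id`, `status`) are assumptions based on this PR's renames; the authoritative column list lives in the [digest schemas](/schemas/). Columns are tab-separated:

```tsv
participant_id	bids_participant_id	session_id	pipeline_name	pipeline_version	status
01	sub-01	1	fmriprep	20.2.7	SUCCESS
01	sub-01	2	fmriprep	20.2.7	FAIL
02	sub-02	1	fmriprep	20.2.7	SUCCESS
```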

+## Creating a dashboard-ready "digest" file
+While `digest` accepts any TSV compliant with one of the [digest schemas](/schemas/), the easiest way to obtain dashboard-ready files for pipeline derivative availability is to use the [Nipoppy](https://neurobagel.org/nipoppy/overview/) specification for organizing your neuroimaging dataset.
+Nipoppy provides dataset [trackers](https://nipoppy.readthedocs.io/en/stable/user_guide/tracking.html) that can automatically extract subjects' imaging data and pipeline output availability, producing `digest`-compatible processing status files.
+
+For detailed instructions to get started using Nipoppy, see the [documentation](https://nipoppy.readthedocs.io/en/stable/).
+
+In brief, the (mostly automated!) Nipoppy steps to generate a processing status file can be as simple as:
+1. Initializing an empty, Nipoppy-compliant dataset directory tree for your dataset
+2. Updating your Nipoppy configuration with the pipeline versions you are using, and creating a manifest spreadsheet of all available participants and sessions
+3. Populating the directory tree with any existing data and pipeline outputs _*_
+4. Running the tracker for the relevant pipeline(s) to generate a processing status file
+
+_*Nipoppy also provides a protocol for running processing pipelines from raw imaging data._
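As a rough illustration of the steps above: the sketch below assumes a recent `nipoppy` CLI, but the exact command names, flags, and argument forms are assumptions here and vary across Nipoppy versions, so defer to the Nipoppy documentation.

```bash
# Hypothetical workflow; verify each command against your installed Nipoppy version.
nipoppy init --dataset my_study      # step 1: create a Nipoppy-compliant directory tree
# steps 2-3: edit the global config and manifest, then move existing data/outputs into the tree
nipoppy track --dataset my_study --pipeline fmriprep   # step 4: generate the processing status file
```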

## Running in a Docker container

35 changes: 19 additions & 16 deletions digest/app.py
@@ -1,6 +1,6 @@
"""
-Serves Dash app for viewing and filtering participant (meta)data for imaging and phenotypic tasks from a given dataset.
-App accepts and parses a user-uploaded bagel.csv file (assumed to be generated by mr_proc) as input.
+Serves Dash app for viewing and filtering participant (meta)data for imaging and phenotypic data events from a provided dataset.
+App accepts and parses a user-uploaded digest TSV file as input.
"""

import dash_bootstrap_components as dbc
@@ -11,6 +11,7 @@
from . import plotting as plot
from . import utility as util
from .layout import DEFAULT_DATASET_NAME, construct_layout, upload_buttons
+from .utility import PRIMARY_SESSION_COL

EMPTY_FIGURE_PROPS = {"data": [], "layout": {}, "frames": []}

@@ -116,7 +117,7 @@ def set_was_upload_used_flag(upload_contents, available_digest_nclicks):
)
def process_bagel(upload_contents, available_digest_nclicks, filenames):
"""
-From the contents of a correctly-formatted uploaded .csv file, parse and store (1) the pipeline overview data as a dataframe,
+From the contents of a correctly-formatted uploaded TSV file, parse and store (1) the pipeline overview data as a dataframe,
and (2) pipeline-specific metadata as individual dataframes within a dict.
Returns any errors encountered during input file processing as a user-friendly message.
"""
@@ -156,9 +157,9 @@ def process_bagel(upload_contents, available_digest_nclicks, filenames):
# TODO: Any existing NaNs will currently be turned into "nan". (See open issue https://github.com/pandas-dev/pandas/issues/25353)
# Another side effect of allowing NaN sessions is that if this column has integer values, they will be read in as floats
# (before being converted to str) if there are NaNs in the column.
-# This should not be a problem after we disallow NaNs value in "participant_id" and "session" columns, https://github.com/neurobagel/digest/issues/20
-bagel["session"] = bagel["session"].astype(str)
-session_list = bagel["session"].unique().tolist()
+# This should not be a problem after we disallow NaN values in "participant_id" and "session_id" columns, https://github.com/neurobagel/digest/issues/20
+bagel[PRIMARY_SESSION_COL] = bagel[PRIMARY_SESSION_COL].astype(str)
+session_list = bagel[PRIMARY_SESSION_COL].unique().tolist()

overview_df = util.get_pipelines_overview(
bagel=bagel, schema=schema
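For context on the comment above, a small self-contained sketch (not part of the PR) of the pandas behavior being described: a NaN in an otherwise-integer session column forces a float dtype, and `astype(str)` then yields the literal string "nan".

```python
import io

import pandas as pd

# A session_id column with one missing value: pandas infers float64,
# so the integer session 1 is read back as 1.0.
tsv = "participant_id\tsession_id\nsub-01\t1\nsub-02\t\n"
bagel = pd.read_csv(io.StringIO(tsv), sep="\t")
print(bagel["session_id"].tolist())  # [1.0, nan]

# Casting to str keeps the float formatting and stringifies the NaN.
print(bagel["session_id"].astype(str).tolist())  # ['1.0', 'nan']
```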
@@ -188,7 +189,7 @@ def process_bagel(upload_contents, available_digest_nclicks, filenames):
{"type": schema, "data": overview_df.to_dict("records")},
pipelines_dict,
None,
"csv",
"csv", # NOTE: "tsv" is not an option for export_format
Contributor comment (suggested change):
-"csv",  # NOTE: "tsv" is not an option for export_format
+"csv",  # NOTE: the dash_table.DataTable object does not support "tsv" as an option for export_format
)
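Background for the NOTE and the suggested change above: per the Dash documentation, `dash_table.DataTable` accepts only `"csv"`, `"xlsx"`, or `"none"` for `export_format`, so a TSV download is not available. A minimal sketch with illustrative ids and columns:

```python
from dash import dash_table

# The table's built-in export button can emit CSV or XLSX only.
table = dash_table.DataTable(
    id="overview-table",  # illustrative id
    columns=[{"name": "participant_id", "id": "participant_id"}],
    data=[{"participant_id": "sub-01"}],
    export_format="csv",  # allowed values: "csv", "xlsx", "none"
)
```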


@@ -203,7 +204,7 @@ def reset_upload_buttons(memory_filename):

Upload components need to be manually replaced to clear contents,
otherwise previously uploaded imaging/pheno bagels cannot be re-uploaded
-(e.g. if a user uploads pheno_bagel.csv, then imaging_bagel.csv, then pheno_bagel.csv again)
+(e.g. if a user uploads pheno_bagel.tsv, then imaging_bagel.tsv, then pheno_bagel.tsv again)
see https://github.com/plotly/dash-core-components/issues/816
"""
return upload_buttons()
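The docstring above refers to a known `dcc.Upload` limitation: its `contents` property persists after an upload, so selecting the same file twice in a row does not re-fire the callback. A minimal sketch of the replace-the-component workaround (component ids and trigger are illustrative):

```python
from dash import Dash, Input, Output, dcc, html

app = Dash(__name__)


def fresh_upload():
    # Build a new dcc.Upload; returning this from a callback replaces the old
    # component, which is the only way to clear its `contents`.
    return dcc.Upload(id="upload-data", children=html.Button("Select file..."))


app.layout = html.Div(
    [
        html.Div(fresh_upload(), id="upload-container"),
        html.Button("Reset upload", id="reset-btn"),
    ]
)


@app.callback(
    Output("upload-container", "children"),
    Input("reset-btn", "n_clicks"),
    prevent_initial_call=True,
)
def reset_upload(_n_clicks):
    return fresh_upload()
```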
@@ -293,7 +294,7 @@ def update_session_filter(parsed_data, session_list):
)
def create_pipeline_status_dropdowns(pipelines_dict, parsed_data):
"""
-Generates a dropdown filter with status options for each unique pipeline in the input csv,
+Generates a dropdown filter with status options for each unique pipeline in the input TSV,
and disables the native datatable filter UI for the corresponding columns in the datatable.
"""
pipeline_dropdowns = []
@@ -418,7 +419,7 @@ def update_matching_rows(columns, virtual_data):
)
def reset_selections(filename):
"""
-If file contents change (i.e., selected new CSV for upload), reset displayed file name and selection values related to data filtering or plotting.
+If file contents change (i.e., selected new TSV for upload), reset displayed file name and selection values related to data filtering or plotting.
Reset will occur regardless of whether there is an issue processing the selected file.
"""
return f"Input file: {filename}", "", "", None, False
@@ -436,8 +437,10 @@ def reset_selections(filename):
)
def generate_overview_status_fig_for_participants(parsed_data, session_list):
"""
-If new dataset uploaded, generate stacked bar plot of pipeline_complete statuses per session,
-grouped by pipeline. Provides overview of the number of participants with each status in a given session,
+When a new dataset is uploaded, generate stacked bar plots of pipeline statuses per session,
+grouped in subplots corresponding to each pipeline.
+
+Provides overview of the number of participants with each status in a given session,
per processing pipeline.
"""
if parsed_data is not None and parsed_data.get("type") != "phenotypic":
@@ -467,7 +470,7 @@ def generate_overview_status_fig_for_participants(parsed_data, session_list):
def update_overview_status_fig_for_records(data, pipelines_dict, parsed_data):
"""
When visible data in the overview datatable is updated (excluding built-in frontend datatable filtering
-but including custom component filtering), generate stacked bar plot of pipeline_complete statuses aggregated
+but including custom component filtering), generate stacked bar plot of pipeline statuses aggregated
by pipeline. Counts of statuses in plot thus correspond to unique records (unique participant-session
combinations).
"""
@@ -479,7 +482,7 @@ def update_overview_status_fig_for_records(data, pipelines_dict, parsed_data):
if not data_df.empty:
status_counts = (
plot.transform_active_data_to_long(data_df)
-.groupby(["pipeline_name", "pipeline_complete"])
+.groupby(["pipeline_name", "status"])
.size()
.reset_index(name="records")
)
@@ -512,7 +515,7 @@ def display_phenotypic_column_dropdown(parsed_data):
# exclude unique participant identifier columns from visualization
if column not in [
"participant_id",
"bids_id",
"bids_participant_id",
]: # TODO: Consider storing these column names in a constant
column_options.append({"label": column, "value": column})

@@ -552,7 +555,7 @@ def plot_phenotypic_column(
data_to_plot = virtual_data

if session_switch_value:
-color = "session"
+color = PRIMARY_SESSION_COL
else:
color = None

6 changes: 3 additions & 3 deletions digest/layout.py
@@ -101,7 +101,7 @@ def upload_buttons() -> list:
upload_imaging = dcc.Upload(
id={"type": "upload-data", "index": "imaging", "btn_idx": 0},
children=dbc.Button(
"Select imaging CSV file...",
"Select imaging TSV file...",
color="light",
),
multiple=False,
@@ -110,7 +110,7 @@
upload_phenotypic = dcc.Upload(
id={"type": "upload-data", "index": "phenotypic", "btn_idx": 1},
children=dbc.Button(
"Select phenotypic CSV file...",
"Select phenotypic TSV file...",
color="light",
),
multiple=False,
@@ -266,7 +266,7 @@ def status_legend_card():
"These are the recommended status definitions for processing progress. For more details, see the ",
html.A(
"schema for an imaging digest file",
-href="https://github.com/neurobagel/digest/blob/main/schemas/bagel_schema.json",
+href="https://github.com/neurobagel/digest/blob/main/schemas/imaging_digest_schema.json",
target="_blank",
),
],
25 changes: 13 additions & 12 deletions digest/plotting.py
@@ -7,6 +7,7 @@
import plotly.graph_objects as go

from . import utility as util
+from .utility import PRIMARY_SESSION_COL

CMAP = px.colors.qualitative.Bold
STATUS_COLORS = {
@@ -37,7 +38,7 @@ def transform_active_data_to_long(data: pd.DataFrame) -> pd.DataFrame:
data,
id_vars=util.get_id_columns(data),
var_name="pipeline_name",
-value_name="pipeline_complete",
+value_name="status",
)
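For context, a small self-contained sketch (with illustrative pipeline names and statuses, not from the PR) of the wide-to-long transformation above and the status-count aggregation later used for the plots:

```python
import pandas as pd

# Wide overview table: one column per pipeline, as shown in the dashboard datatable.
data = pd.DataFrame(
    {
        "participant_id": ["sub-01", "sub-01", "sub-02"],
        "session_id": ["1", "2", "1"],
        "fmriprep-20.2.7": ["SUCCESS", "FAIL", "SUCCESS"],
        "freesurfer-6.0.1": ["SUCCESS", "SUCCESS", "INCOMPLETE"],
    }
)

# Melt to long format: one row per (participant, session, pipeline).
long_data = data.melt(
    id_vars=["participant_id", "session_id"],
    var_name="pipeline_name",
    value_name="status",
)

# Count records per pipeline/status pair, mirroring the groupby in app.py.
status_counts = (
    long_data.groupby(["pipeline_name", "status"]).size().reset_index(name="records")
)
print(status_counts)
```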


@@ -60,28 +61,28 @@
) -> go.Figure:
status_counts = (
transform_active_data_to_long(data)
-.groupby(["pipeline_name", "pipeline_complete", "session"])
+.groupby(["pipeline_name", "status", PRIMARY_SESSION_COL])
.size()
.reset_index(name="participants")
)

fig = px.bar(
status_counts,
x="session",
x=PRIMARY_SESSION_COL,
y="participants",
color="pipeline_complete",
color="status",
text_auto=True,
facet_col="pipeline_name",
category_orders={
"pipeline_complete": util.PIPE_COMPLETE_STATUS_SHORT_DESC.keys(),
"session": session_list,
"status": util.PIPE_COMPLETE_STATUS_SHORT_DESC.keys(),
PRIMARY_SESSION_COL: session_list,
},
color_discrete_map=STATUS_COLORS,
labels={
"pipeline_name": "Pipeline",
"participants": "Participants (n)",
"pipeline_complete": "Processing status",
"session": "Session",
"status": "Processing status",
PRIMARY_SESSION_COL: "Session",
},
title="All participant pipeline statuses by session",
)
@@ -97,10 +98,10 @@ def plot_pipeline_status_by_records(status_counts: pd.DataFrame) -> go.Figure:
status_counts,
x="pipeline_name",
y="records",
color="pipeline_complete",
color="status",
text_auto=True,
category_orders={
"pipeline_complete": util.PIPE_COMPLETE_STATUS_SHORT_DESC.keys(),
"status": util.PIPE_COMPLETE_STATUS_SHORT_DESC.keys(),
"pipeline_name": status_counts["pipeline_name"]
.drop_duplicates()
.sort_values(),
@@ -109,7 +110,7 @@
labels={
"pipeline_name": "Pipeline",
"records": "Records (n)",
"pipeline_complete": "Processing status",
"status": "Processing status",
},
title="Pipeline statuses of records matching filter (default: all)",
)
@@ -124,7 +125,7 @@
"""Returns dataframe of counts representing 0 matching records in the datatable, i.e., 0 records with each pipeline status."""
status_counts = pd.DataFrame(
list(product(pipelines, statuses)),
-columns=["pipeline_name", "pipeline_complete"],
+columns=["pipeline_name", "status"],
)
status_counts["records"] = 0
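A brief usage sketch (with hypothetical pipeline and status values) of the cross-product trick above, which guarantees every pipeline/status combination appears in the plot even when zero records match the current filter:

```python
from itertools import product

import pandas as pd

pipelines = ["fmriprep-20.2.7", "freesurfer-6.0.1"]  # illustrative names
statuses = ["SUCCESS", "FAIL", "INCOMPLETE", "UNAVAILABLE"]  # assumed status set

# One row per (pipeline, status) pair, each with a count of 0.
status_counts = pd.DataFrame(
    list(product(pipelines, statuses)), columns=["pipeline_name", "status"]
)
status_counts["records"] = 0
print(status_counts.head())
```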
