[ENH] Update digest to support latest Nipoppy processing status files #166

Open

wants to merge 27 commits into base: main

Commits (27)
3802068  update proc status file schema (alyssadai, Dec 9, 2024)
38bff3a  remove columns about raw imaging data and IsPrefixedColumn property (alyssadai, Dec 9, 2024)
4dd3d1f  require TSV inputs instead of CSVs (alyssadai, Dec 9, 2024)
1b14759  align phenotypic file schema with proc status file changes (alyssadai, Dec 9, 2024)
669adda  update id column references and handling to reflect new names (alyssadai, Dec 9, 2024)
5880361  update comment (alyssadai, Dec 9, 2024)
84dda4a  fix imports (alyssadai, Dec 9, 2024)
403ed1f  update references to renamed 'pipeline-complete' column (alyssadai, Dec 9, 2024)
e7df57d  remove MissingValue column property from schema (alyssadai, Dec 11, 2024)
53047f0  refactor id column extraction (alyssadai, Dec 11, 2024)
4745d7f  update docstrings (alyssadai, Dec 11, 2024)
07c11d3  update README (alyssadai, Dec 13, 2024)
e53c76d  update README (alyssadai, Dec 13, 2024)
3e10487  update README (alyssadai, Dec 13, 2024)
99cfdf5  update schema README (alyssadai, Dec 13, 2024)
b25d14e  update schema README (alyssadai, Dec 13, 2024)
902947a  rename schemas (alyssadai, Dec 13, 2024)
1f72a13  fix sneaky outdated session column reference (alyssadai, Dec 13, 2024)
b2dde0e  update columns in test data digests and convert to TSVs (alyssadai, Dec 13, 2024)
037e5dc  update missing column example based on revised schema (alyssadai, Dec 13, 2024)
745483a  rename test data files (alyssadai, Dec 13, 2024)
a18a2d6  update docstring (alyssadai, Dec 13, 2024)
cab23fb  rename reference example inputs and update symlinks (alyssadai, Dec 13, 2024)
3d492a6  update tests (alyssadai, Dec 13, 2024)
e622585  replace csv with tsv in docstrings (alyssadai, Dec 13, 2024)
08a5812  rename PRIMARY_SESSION var (alyssadai, Dec 13, 2024)
8615e20  Update schema README (alyssadai, Jan 9, 2025)
56 changes: 30 additions & 26 deletions README.md
Collaborator comment on README.md: Can't comment on the actual line because it hasn't been changed in the PR, but at one point this README refers to https://github.com/neurodatascience/nipoppy-qpn, which gives a 404 error.

@@ -1,42 +1,46 @@
-# Dashboard for neuroimaging and phenotypic dataset exploration
+# Descriptive & neuroImaging data Graphical Explorer for Subject Tracking
 
 - [Overview](#overview)
 - [Preview](#preview)
 - [Quickstart](#quickstart)
 - [Input schema](#input-schema)
-- [Creating a dashboard-ready dataset file](#creating-a-dataset-file-for-the-dashboard-bagelcsv)
+- [Creating a dashboard-ready "digest" file](#creating-a-dashboard-ready-digest-file)
 - [Running in a Docker container](#running-in-a-docker-container)
 - [Local development](#local-development)
 
 ## Overview
-`digest` is a web dashboard that provides interactive visual summaries and subject-level querying based on neuroimaging derivatives and phenotypic variables available for a dataset.
+`digest` is a web dashboard for exploring subject-level availability of pipeline derivatives and phenotypic variables in a neuroimaging dataset.
+It provides user-friendly options for querying data availability, along with interactive visual summaries.
 
-A `digest` dashboard can be generated for any tabular dataset file that follows a data modality-specific [schema](/schemas/), which we refer to as a "bagel" file.
-The dashboard is compatible with the processing status `bagel.csv` files automatically generated by the [Nipoppy framework for neuroimaging dataset organization and processing](https://github.com/neurodatascience/nipoppy).
-
-For more information on how to use `digest` with the Nipoppy project, also see the official [Nipoppy documentation](https://neurobagel.org/nipoppy/overview/).
-
-**Quickstart**: https://digest.neurobagel.org/
+`digest` supports any dataset TSV file that follows a data modality-specific [schema](/schemas/) (called a "digest" file).
+`digest` is also compatible with the processing status files generated by [Nipoppy](https://nipoppy.readthedocs.io/en/stable/).
 
 ## Preview
 ![alt text](img/ui_overview_table.png?raw=true)
 ![alt text](img/ui_overview_plots.png?raw=true)
 
+## Quickstart
+Try out `digest` at https://digest.neurobagel.org/!
+
+You can find correctly formatted example input files [here](/example_bagels/) to test out dashboard functionality.
+
 ## Input schema
-Input files to the dashboard contain long format data that must be formatted according to the [bagel schema](/schemas/) (see also the schemas [README](https://github.com/neurobagel/digest/tree/main/schemas#readme) for more info). A single file is expected to correspond to one dataset, but may contain status information for multiple processing pipelines for that dataset.
-
-### Try it out
-You can view and download correctly formatted, minimal input tabular files from [here](/example_bagels/) to test out dashboard functionality.
-
-## Creating a dashboard-ready dataset file (`bagel.csv`)
-While `digest` works on any input CSV compliant with a [bagel schema](/schemas/), the easiest way to generate a dashboard-ready file for a dataset's neuroimaging processing info is to follow the [Nipoppy](https://neurobagel.org/nipoppy/overview/) standard structure for organizing raw MRI data and processed outputs (data derivatives).
-`Nipoppy` offers scripts that can use this standardized dataset organization to automatically extract info about the raw imaging files and any pipelines that have been run, which is then stored in a dashboard-ready `bagel.csv`.
-
-Detailed instructions to get started using `Nipoppy` can be found in their [documentation](https://neurobagel.org/nipoppy/overview/).
-In brief, generating a `bagel.csv` for your dataset can be as simple as:
-1. Installing `Nipoppy` to generate a dataset directory tree for your dataset (see [Installation](https://neurobagel.org/nipoppy/installation/) section of docs) that you can populate with your existing data
-2. Update `Nipoppy` configuration to reflect the pipeline versions you are using (for tracking purposes), and augment your participant spreadsheet according to `Nipoppy` requirements (see [Configs](https://neurobagel.org/nipoppy/configs/) section of docs)
-3. Run the tracker ([run_tracker.py](https://github.com/neurodatascience/nipoppy/blob/main/trackers/run_tracker.py)) for the relevant pipeline(s) for your dataset to generate a comprehensive `bagel.csv`
-   - To see help text for this script: `python run_tracker.py --help`
-   - This step can be repeated as needed to update the `bagel.csv` with newly processed subjects
+`digest` supports long format TSVs that contain the columns specified in the [digest schemas](/schemas/) (see also the schema [README](https://github.com/neurobagel/digest/tree/main/schemas#readme)).
+At the moment, each digest file is expected to correspond to one dataset.
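For reference, a minimal sketch of what a long-format digest file might look like. The values are hypothetical, and the column names shown here (e.g. `bids_participant_id`, `session_id`, `status`) are assumptions based on this PR's renames; the authoritative column list lives in the [digest schemas](/schemas/). Columns are tab-separated:

```tsv
participant_id	bids_participant_id	session_id	pipeline_name	pipeline_version	status
01	sub-01	1	fmriprep	20.2.7	SUCCESS
01	sub-01	2	fmriprep	20.2.7	FAIL
02	sub-02	1	fmriprep	20.2.7	SUCCESS
```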

+## Creating a dashboard-ready "digest" file
+While `digest` accepts any TSV compliant with one of the [digest schemas](/schemas/), the easiest way to obtain dashboard-ready files for pipeline derivative availability is to use the [Nipoppy](https://neurobagel.org/nipoppy/overview/) specification for organizing your neuroimaging dataset.
+Nipoppy provides dataset [trackers](https://nipoppy.readthedocs.io/en/stable/user_guide/tracking.html) that can automatically extract subjects' imaging data and pipeline output availability, producing `digest`-compatible processing status files.
+
+For detailed instructions to get started using Nipoppy, see the [documentation](https://nipoppy.readthedocs.io/en/stable/).
+
+In brief, the (mostly automated!) Nipoppy steps to generate a processing status file can be as simple as:
+1. Initializing an empty, Nipoppy-compliant dataset directory tree for your dataset
+2. Updating your Nipoppy configuration with the pipeline versions you are using, and creating a manifest spreadsheet of all available participants and sessions
+3. Populating the directory tree with any existing data and pipeline outputs _*_
+4. Running the tracker for the relevant pipeline(s) to generate a processing status file
+
+_*Nipoppy also provides a protocol for running processing pipelines from raw imaging data._
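As a rough illustration of the steps above: the sketch below assumes a recent `nipoppy` CLI, but the exact command names, flags, and argument forms are assumptions here and vary across Nipoppy versions, so defer to the Nipoppy documentation.

```bash
# Hypothetical workflow; verify each command against your installed Nipoppy version.
nipoppy init --dataset my_study      # step 1: create a Nipoppy-compliant directory tree
# steps 2-3: edit the global config and manifest, then move existing data/outputs into the tree
nipoppy track --dataset my_study --pipeline fmriprep   # step 4: generate the processing status file
```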

## Running in a Docker container

35 changes: 19 additions & 16 deletions digest/app.py
@@ -1,6 +1,6 @@
"""
-Serves Dash app for viewing and filtering participant (meta)data for imaging and phenotypic tasks from a given dataset.
-App accepts and parses a user-uploaded bagel.csv file (assumed to be generated by mr_proc) as input.
+Serves Dash app for viewing and filtering participant (meta)data for imaging and phenotypic data events from a provided dataset.
+App accepts and parses a user-uploaded digest TSV file as input.
"""

import dash_bootstrap_components as dbc
@@ -11,6 +11,7 @@
from . import plotting as plot
from . import utility as util
from .layout import DEFAULT_DATASET_NAME, construct_layout, upload_buttons
+from .utility import PRIMARY_SESSION_COL

EMPTY_FIGURE_PROPS = {"data": [], "layout": {}, "frames": []}

@@ -116,7 +117,7 @@ def set_was_upload_used_flag(upload_contents, available_digest_nclicks):
)
def process_bagel(upload_contents, available_digest_nclicks, filenames):
"""
-From the contents of a correctly-formatted uploaded .csv file, parse and store (1) the pipeline overview data as a dataframe,
+From the contents of a correctly-formatted uploaded TSV file, parse and store (1) the pipeline overview data as a dataframe,
and (2) pipeline-specific metadata as individual dataframes within a dict.
Returns any errors encountered during input file processing as a user-friendly message.
"""
@@ -156,9 +157,9 @@ def process_bagel(upload_contents, available_digest_nclicks, filenames):
# TODO: Any existing NaNs will currently be turned into "nan". (See open issue https://github.com/pandas-dev/pandas/issues/25353)
# Another side effect of allowing NaN sessions is that if this column has integer values, they will be read in as floats
# (before being converted to str) if there are NaNs in the column.
-# This should not be a problem after we disallow NaNs value in "participant_id" and "session" columns, https://github.com/neurobagel/digest/issues/20
-bagel["session"] = bagel["session"].astype(str)
-session_list = bagel["session"].unique().tolist()
+# This should not be a problem after we disallow NaN values in "participant_id" and "session_id" columns, https://github.com/neurobagel/digest/issues/20
+bagel[PRIMARY_SESSION_COL] = bagel[PRIMARY_SESSION_COL].astype(str)
+session_list = bagel[PRIMARY_SESSION_COL].unique().tolist()

overview_df = util.get_pipelines_overview(
bagel=bagel, schema=schema
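For context on the comment above, a small self-contained sketch (not part of the PR) of the pandas behavior being described: a NaN in an otherwise-integer session column forces a float dtype, and `astype(str)` then yields the literal string "nan".

```python
import io

import pandas as pd

# A session_id column with one missing value: pandas infers float64,
# so the integer session 1 is read back as 1.0.
tsv = "participant_id\tsession_id\nsub-01\t1\nsub-02\t\n"
bagel = pd.read_csv(io.StringIO(tsv), sep="\t")
print(bagel["session_id"].tolist())  # [1.0, nan]

# Casting to str keeps the float formatting and stringifies the NaN.
print(bagel["session_id"].astype(str).tolist())  # ['1.0', 'nan']
```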
@@ -188,7 +189,7 @@ def process_bagel(upload_contents, available_digest_nclicks, filenames):
{"type": schema, "data": overview_df.to_dict("records")},
pipelines_dict,
None,
"csv",
"csv", # NOTE: "tsv" is not an option for export_format
Contributor comment (suggested change):
-"csv",  # NOTE: "tsv" is not an option for export_format
+"csv",  # NOTE: the dash_table.DataTable object does not support "tsv" as an option for export_format
)
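Background for the NOTE and the suggested change above: per the Dash documentation, `dash_table.DataTable` accepts only `"csv"`, `"xlsx"`, or `"none"` for `export_format`, so a TSV download is not available. A minimal sketch with illustrative ids and columns:

```python
from dash import dash_table

# The table's built-in export button can emit CSV or XLSX only.
table = dash_table.DataTable(
    id="overview-table",  # illustrative id
    columns=[{"name": "participant_id", "id": "participant_id"}],
    data=[{"participant_id": "sub-01"}],
    export_format="csv",  # allowed values: "csv", "xlsx", "none"
)
```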


@@ -203,7 +204,7 @@ def reset_upload_buttons(memory_filename):

Upload components need to be manually replaced to clear contents,
otherwise previously uploaded imaging/pheno bagels cannot be re-uploaded
-(e.g. if a user uploads pheno_bagel.csv, then imaging_bagel.csv, then pheno_bagel.csv again)
+(e.g. if a user uploads pheno_bagel.tsv, then imaging_bagel.tsv, then pheno_bagel.tsv again)
see https://github.com/plotly/dash-core-components/issues/816
"""
return upload_buttons()
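The docstring above refers to a known `dcc.Upload` limitation: its `contents` property persists after an upload, so selecting the same file twice in a row does not re-fire the callback. A minimal sketch of the replace-the-component workaround (component ids and trigger are illustrative):

```python
from dash import Dash, Input, Output, dcc, html

app = Dash(__name__)


def fresh_upload():
    # Build a new dcc.Upload; returning this from a callback replaces the old
    # component, which is the only way to clear its `contents`.
    return dcc.Upload(id="upload-data", children=html.Button("Select file..."))


app.layout = html.Div(
    [
        html.Div(fresh_upload(), id="upload-container"),
        html.Button("Reset upload", id="reset-btn"),
    ]
)


@app.callback(
    Output("upload-container", "children"),
    Input("reset-btn", "n_clicks"),
    prevent_initial_call=True,
)
def reset_upload(_n_clicks):
    return fresh_upload()
```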
@@ -293,7 +294,7 @@ def update_session_filter(parsed_data, session_list):
)
def create_pipeline_status_dropdowns(pipelines_dict, parsed_data):
"""
-Generates a dropdown filter with status options for each unique pipeline in the input csv,
+Generates a dropdown filter with status options for each unique pipeline in the input TSV,
and disables the native datatable filter UI for the corresponding columns in the datatable.
"""
pipeline_dropdowns = []
@@ -418,7 +419,7 @@ def update_matching_rows(columns, virtual_data):
)
def reset_selections(filename):
"""
-If file contents change (i.e., selected new CSV for upload), reset displayed file name and selection values related to data filtering or plotting.
+If file contents change (i.e., selected new TSV for upload), reset displayed file name and selection values related to data filtering or plotting.
Reset will occur regardless of whether there is an issue processing the selected file.
"""
return f"Input file: {filename}", "", "", None, False
@@ -436,8 +437,10 @@ def reset_selections(filename):
)
def generate_overview_status_fig_for_participants(parsed_data, session_list):
"""
-If new dataset uploaded, generate stacked bar plot of pipeline_complete statuses per session,
-grouped by pipeline. Provides overview of the number of participants with each status in a given session,
+When a new dataset is uploaded, generate stacked bar plots of pipeline statuses per session,
+grouped in subplots corresponding to each pipeline.
+
+Provides overview of the number of participants with each status in a given session,
per processing pipeline.
"""
if parsed_data is not None and parsed_data.get("type") != "phenotypic":
@@ -467,7 +470,7 @@ def generate_overview_status_fig_for_participants(parsed_data, session_list):
def update_overview_status_fig_for_records(data, pipelines_dict, parsed_data):
"""
When visible data in the overview datatable is updated (excluding built-in frontend datatable filtering
-but including custom component filtering), generate stacked bar plot of pipeline_complete statuses aggregated
+but including custom component filtering), generate stacked bar plot of pipeline statuses aggregated
by pipeline. Counts of statuses in plot thus correspond to unique records (unique participant-session
combinations).
"""
@@ -479,7 +482,7 @@ def update_overview_status_fig_for_records(data, pipelines_dict, parsed_data):
if not data_df.empty:
status_counts = (
plot.transform_active_data_to_long(data_df)
-.groupby(["pipeline_name", "pipeline_complete"])
+.groupby(["pipeline_name", "status"])
.size()
.reset_index(name="records")
)
@@ -512,7 +515,7 @@ def display_phenotypic_column_dropdown(parsed_data):
# exclude unique participant identifier columns from visualization
if column not in [
"participant_id",
"bids_id",
"bids_participant_id",
]: # TODO: Consider storing these column names in a constant
column_options.append({"label": column, "value": column})

@@ -552,7 +555,7 @@ def plot_phenotypic_column(
data_to_plot = virtual_data

if session_switch_value:
-color = "session"
+color = PRIMARY_SESSION_COL
else:
color = None

6 changes: 3 additions & 3 deletions digest/layout.py
@@ -101,7 +101,7 @@ def upload_buttons() -> list:
upload_imaging = dcc.Upload(
id={"type": "upload-data", "index": "imaging", "btn_idx": 0},
children=dbc.Button(
"Select imaging CSV file...",
"Select imaging TSV file...",
color="light",
),
multiple=False,
@@ -110,7 +110,7 @@
upload_phenotypic = dcc.Upload(
id={"type": "upload-data", "index": "phenotypic", "btn_idx": 1},
children=dbc.Button(
"Select phenotypic CSV file...",
"Select phenotypic TSV file...",
color="light",
),
multiple=False,
@@ -266,7 +266,7 @@ def status_legend_card():
"These are the recommended status definitions for processing progress. For more details, see the ",
html.A(
"schema for an imaging digest file",
-href="https://github.com/neurobagel/digest/blob/main/schemas/bagel_schema.json",
+href="https://github.com/neurobagel/digest/blob/main/schemas/imaging_digest_schema.json",
target="_blank",
),
],
25 changes: 13 additions & 12 deletions digest/plotting.py
@@ -7,6 +7,7 @@
import plotly.graph_objects as go

from . import utility as util
+from .utility import PRIMARY_SESSION_COL

CMAP = px.colors.qualitative.Bold
STATUS_COLORS = {
@@ -37,7 +38,7 @@ def transform_active_data_to_long(data: pd.DataFrame) -> pd.DataFrame:
data,
id_vars=util.get_id_columns(data),
var_name="pipeline_name",
-value_name="pipeline_complete",
+value_name="status",
)
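For context, a small self-contained sketch (with illustrative pipeline names and statuses, not from the PR) of the wide-to-long transformation above and the status-count aggregation later used for the plots:

```python
import pandas as pd

# Wide overview table: one column per pipeline, as shown in the dashboard datatable.
data = pd.DataFrame(
    {
        "participant_id": ["sub-01", "sub-01", "sub-02"],
        "session_id": ["1", "2", "1"],
        "fmriprep-20.2.7": ["SUCCESS", "FAIL", "SUCCESS"],
        "freesurfer-6.0.1": ["SUCCESS", "SUCCESS", "INCOMPLETE"],
    }
)

# Melt to long format: one row per (participant, session, pipeline).
long_data = data.melt(
    id_vars=["participant_id", "session_id"],
    var_name="pipeline_name",
    value_name="status",
)

# Count records per pipeline/status pair, mirroring the groupby in app.py.
status_counts = (
    long_data.groupby(["pipeline_name", "status"]).size().reset_index(name="records")
)
print(status_counts)
```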


@@ -60,28 +61,28 @@
) -> go.Figure:
status_counts = (
transform_active_data_to_long(data)
-.groupby(["pipeline_name", "pipeline_complete", "session"])
+.groupby(["pipeline_name", "status", PRIMARY_SESSION_COL])
.size()
.reset_index(name="participants")
)

fig = px.bar(
status_counts,
x="session",
x=PRIMARY_SESSION_COL,
y="participants",
color="pipeline_complete",
color="status",
text_auto=True,
facet_col="pipeline_name",
category_orders={
"pipeline_complete": util.PIPE_COMPLETE_STATUS_SHORT_DESC.keys(),
"session": session_list,
"status": util.PIPE_COMPLETE_STATUS_SHORT_DESC.keys(),
PRIMARY_SESSION_COL: session_list,
},
color_discrete_map=STATUS_COLORS,
labels={
"pipeline_name": "Pipeline",
"participants": "Participants (n)",
"pipeline_complete": "Processing status",
"session": "Session",
"status": "Processing status",
PRIMARY_SESSION_COL: "Session",
},
title="All participant pipeline statuses by session",
)
@@ -97,10 +98,10 @@ def plot_pipeline_status_by_records(status_counts: pd.DataFrame) -> go.Figure:
status_counts,
x="pipeline_name",
y="records",
color="pipeline_complete",
color="status",
text_auto=True,
category_orders={
"pipeline_complete": util.PIPE_COMPLETE_STATUS_SHORT_DESC.keys(),
"status": util.PIPE_COMPLETE_STATUS_SHORT_DESC.keys(),
"pipeline_name": status_counts["pipeline_name"]
.drop_duplicates()
.sort_values(),
@@ -109,7 +110,7 @@
labels={
"pipeline_name": "Pipeline",
"records": "Records (n)",
"pipeline_complete": "Processing status",
"status": "Processing status",
},
title="Pipeline statuses of records matching filter (default: all)",
)
@@ -124,7 +125,7 @@
"""Returns dataframe of counts representing 0 matching records in the datatable, i.e., 0 records with each pipeline status."""
status_counts = pd.DataFrame(
list(product(pipelines, statuses)),
-columns=["pipeline_name", "pipeline_complete"],
+columns=["pipeline_name", "status"],
)
status_counts["records"] = 0
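A brief usage sketch (with hypothetical pipeline and status values) of the cross-product trick above, which guarantees every pipeline/status combination appears in the plot even when zero records match the current filter:

```python
from itertools import product

import pandas as pd

pipelines = ["fmriprep-20.2.7", "freesurfer-6.0.1"]  # illustrative names
statuses = ["SUCCESS", "FAIL", "INCOMPLETE", "UNAVAILABLE"]  # assumed status set

# One row per (pipeline, status) pair, each with a count of 0.
status_counts = pd.DataFrame(
    list(product(pipelines, statuses)), columns=["pipeline_name", "status"]
)
status_counts["records"] = 0
print(status_counts.head())
```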
