This repository contains tools to process, clean, and filter the mPower Parkinson's Disease dataset. The pipeline extracts raw audio from Synapse cache structures and filters participants by explicit clinical criteria to build cohorts for machine learning tasks.
This project utilizes data from the mPower study, a clinical observational study of Parkinson's Disease (PD) conducted purely through an iPhone app interface. Launched in March 2015, the study used Apple's ResearchKit library to collect frequent sensor-based recordings and surveys from participants with and without PD.
The goal of the study was to evaluate the feasibility of remotely collecting frequent information about daily changes in symptom severity and sensitivity to medication.
This repository specifically focuses on the Voice Activity portion of the mPower dataset.
- The Task: Participants were instructed to say "Aaaaah" into their microphone at a steady volume for up to 10 seconds.
- The Data: The activity recorded audio files for the 10 seconds of phonation as well as the 5-second countdown leading up to the task.
- Medication States: To track symptom fluctuation, participants with a professional diagnosis of PD were asked to perform tasks at three specific times:
  - Immediately before taking their medication.
  - After taking their medication (when feeling at their best).
  - At another time.
The repository has two main components:
- scripts/extract_raw_audio.py: Flattens the nested Synapse cache structure (.tmp files) into a usable directory of audio files.
- notebooks/mPower_filtering.ipynb: Filters the dataset into "People with Parkinson's" (PwPD) and "Healthy Controls" (HC).
The filtering logic aligns with standard voice analysis protocols:
- Age Range: 50–70 years old.
- Exclusions: Participants with Depression, Anxiety, Schizophrenia, Bipolar disorder, Asthma, Stroke, or COPD are removed.
- Group Definitions:
  - PwPD: Professional diagnosis = True, medication usage confirmed, no Deep Brain Stimulation (DBS).
  - HC: Professional diagnosis = False, explicitly states "I don't take Parkinson medications".
- Medication Timing: For PwPD, only recordings taken "Immediately before medication" or "Another time" are kept (OFF state). "Just after medication" (ON state) is excluded.
- Uniqueness: Ensures only one unique recording per participant (healthCode) is retained.
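As a rough illustration, the criteria above can be expressed as a small pure-Python filter. This is a sketch under stated assumptions, not the notebook's actual code: the record keys (`age`, `health_history`, `professional_diagnosis`, `dbs`, `med_timepoint`), the inclusive age bounds, the substring match on health history, and the exact medication-timing strings are hypothetical and would need to be mapped to the columns and labels in the real mPower tables. Only `healthCode` is taken from the text above.

```python
# Hypothetical labels and keys; adjust them to match the real mPower tables.
EXCLUDED_CONDITIONS = {
    "depression", "anxiety", "schizophrenia", "bipolar",
    "asthma", "stroke", "copd",
}
OFF_STATE = {"Immediately before medication", "Another time"}
HC_LABEL = "I don't take Parkinson medications"


def assign_group(rec):
    """Return 'PwPD', 'HC', or None if the recording is excluded."""
    # Age range 50-70, assumed inclusive on both ends.
    if not 50 <= rec["age"] <= 70:
        return None
    # Drop anyone whose health history mentions an excluded condition.
    history = rec["health_history"].lower()
    if any(condition in history for condition in EXCLUDED_CONDITIONS):
        return None
    if rec["professional_diagnosis"]:
        # PwPD: no DBS, and an OFF-state timing label. An OFF-state label
        # also implies the participant takes medication.
        if rec["dbs"] or rec["med_timepoint"] not in OFF_STATE:
            return None
        return "PwPD"
    # HC: no diagnosis and an explicit "no medication" answer.
    return "HC" if rec["med_timepoint"] == HC_LABEL else None


def filter_recordings(records):
    """Keep the first qualifying recording for each healthCode."""
    kept = {}
    for rec in records:
        group = assign_group(rec)
        if group is not None and rec["healthCode"] not in kept:
            kept[rec["healthCode"]] = {**rec, "group": group}
    return list(kept.values())
```

Keeping the first qualifying recording per participant is an arbitrary choice made for this sketch; the notebook may select recordings differently.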
To run the pipeline:
- Install dependencies:
  pip install -r requirements.txt
- Extract Audio: The raw data from Synapse is often downloaded into a nested cache structure (e.g., data(.synapseCache)/.../*.tmp). Run the extraction script to flatten this structure, rename the files to .wav, and organize them into a clean data/ directory.
  python scripts/extract_raw_audio.py
- Run Notebook: Open the Jupyter notebook to perform the clinical cohort filtering (Age, Diagnosis, Medication) and generate the training/testing splits for your machine learning model.
  jupyter notebook notebooks/mPower_filtering.ipynb
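For readers who want a feel for the extraction step, the flattening described above might look roughly like the following. This is a minimal sketch and not the contents of scripts/extract_raw_audio.py: the function name `flatten_cache`, the output naming scheme (parent folder name plus original stem), and the choice to copy rather than move files are all assumptions made for illustration. Following the description above, it simply gives each file a .wav suffix and does not check or convert the audio encoding.

```python
import shutil
from pathlib import Path


def flatten_cache(cache_root, out_dir):
    """Copy every *.tmp file under cache_root into out_dir as a .wav file.

    Hypothetical helper: the real script may name or organize files differently.
    """
    cache_root, out_dir = Path(cache_root), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    # rglob walks the nested cache folders recursively.
    for src in sorted(cache_root.rglob("*.tmp")):
        if not src.is_file():
            continue
        # Prefix the parent folder name so files with the same base name
        # in different cache folders do not overwrite each other.
        dest = out_dir / f"{src.parent.name}_{src.stem}.wav"
        shutil.copy2(src, dest)
        copied.append(dest)
    return copied
```

Copying leaves the Synapse cache untouched, so the extraction can be rerun safely if the output directory is deleted.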
Note: This repository does not contain the mPower dataset. Due to the sensitive nature of health data, the dataset is hosted on Synapse and is subject to strict governance to protect participant privacy.
To run the processing scripts in this repository, you must obtain the data directly from the mPower Public Researcher Portal.
Requirements for Access: Researchers interested in accessing the data must complete the following steps:
- Create a Synapse Account: Register at synapse.org.
- Profile Validation: Have your User Profile validated by the Synapse Access and Compliance Team (ACT).
- Certification: Become a Certified User by passing a short quiz on research ethics.
- Intended Data Use: Submit a statement describing how you intend to use the data.
- Conditions for Use: Agree to terms including a commitment not to re-identify participants and to keep data confidential.
Once approved, you can download the raw data (including audio files and demographic tables) to use as inputs for the preprocessing pipeline in this repository.
If you use this dataset in your work, please cite the original mPower paper:
Bot, B. M. et al. The mPower Study, Parkinson Disease Mobile Data Collected Using ResearchKit. Sci. Data 3:160011 doi: 10.1038/sdata.2016.11 (2016).