This directory contains the ridge regression pipeline for brain encoding models that predict fMRI responses to narrative video stimuli using LLM features. It implements the core methodology from the paper "Unveiling Multi-level and Multi-modal Semantic Representations in the Human Brain using Large Language Models" (Nakagi et al., EMNLP 2024).
This implementation is adapted from the following repositories:
- drama2brain - Multi-level and multi-modal brain encoding framework (paper repository)
- fMRI_Narrative_movie - Original fMRI analysis pipeline (dataset repository)
The code in this directory integrates utility functions and ridge regression workflows from these original implementations. It adds a new data module that works with the available dataset and adapts other parts of the code to resolve incompatibilities and fill in missing details.
This pipeline reconstructs brain responses by fitting ridge regression models that map LLM feature representations to fMRI activity recorded while participants watch narrative videos. The implementation uses the Himalaya library for efficient ridge regression with cross-validation.
The main training script, src/main.py, trains a ridge regression model on a single feature type. It does the following:
- Loads fMRI and stimulus features via StimRespCVDataModule
- Builds a ridge regression pipeline with GroupRidgeCV from Himalaya
- Performs leave-one-run-out cross-validation
- Evaluates predictions using $R^2$ and Pearson correlation
- Applies FDR correction for multiple comparisons
- Saves results as pickle files
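
The cross-validation and evaluation steps above can be sketched in plain Python. This is a minimal illustration, not the code in src/: the function names are hypothetical, the Benjamini-Hochberg procedure is assumed as the FDR method (the list above does not say which one is used), and the ridge fit itself is omitted.

```python
import math


def leave_one_run_out(run_ids):
    """Yield (held_out_run, train_indices, test_indices) for each run."""
    for run in sorted(set(run_ids)):
        test = [i for i, r in enumerate(run_ids) if r == run]
        train = [i for i, r in enumerate(run_ids) if r != run]
        yield run, train, test


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    # A constant signal has no defined correlation; report 0.0 here.
    return cov / (sx * sy) if sx > 0 and sy > 0 else 0.0


def fdr_bh(pvals, q=0.05):
    """Benjamini-Hochberg (assumed method): True where the null is rejected."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    # Find the largest rank k whose p-value satisfies p_(k) <= (k / m) * q.
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank / m * q:
            k_max = rank
    reject = [False] * m
    for rank, i in enumerate(order, start=1):
        reject[i] = rank <= k_max
    return reject


# Each sample (TR) is labelled with the run it came from.
for run, train, test in leave_one_run_out([1, 1, 2, 2, 3]):
    print(f"held-out run {run}: train={train} test={test}")
```

In a real encoding pipeline these operations would typically run per voxel on arrays rather than on Python lists.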
Example invocation:

python src/main.py \
-b gpu \
-d cuda:0 \
-s 4 \
--model GPT2large \
--feat story \
--layer 36 \
--time-delays 8 10

Model and Hardware:
- -b, --backend {cpu|gpu|cupy}: Computation backend (default: gpu)
- -d, --device: CUDA device string or device ID (default: cuda:0)

Experimental Setup:
- -s, --subject: Subject ID (default: 4)
- --model {GPT2large|Mistral}: LLM model to use (required)
- --feat {story|object|speech}: Feature annotation type (required)
  - story: Story-level semantic understanding
  - object: Object annotations (50-character objectiveAnnot)
  - speech: Speech transcription features
- --layer: Layer number to extract (required)
  - GPT-2 Large: 9, 18, 27, 36
  - Mistral: 8, 16, 24, 32

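
One way to enforce the model and layer combinations listed above is a small argparse check. The sketch below is hypothetical and covers only a subset of the options; VALID_LAYERS and parse_cli are illustrative names, not identifiers taken from src/main.py, and treating the subject ID as an integer is an assumption.

```python
import argparse

# Layers documented above for each model.
VALID_LAYERS = {
    "GPT2large": (9, 18, 27, 36),
    "Mistral": (8, 16, 24, 32),
}


def parse_cli(argv=None):
    """Parse an illustrative subset of the options and validate --layer."""
    parser = argparse.ArgumentParser(description="Illustrative subset of the CLI")
    parser.add_argument("-s", "--subject", type=int, default=4)
    parser.add_argument("--model", choices=sorted(VALID_LAYERS), required=True)
    parser.add_argument("--feat", choices=["story", "object", "speech"], required=True)
    parser.add_argument("--layer", type=int, required=True)
    args = parser.parse_args(argv)
    # Reject layers that are not listed for the chosen model.
    if args.layer not in VALID_LAYERS[args.model]:
        parser.error(
            f"--layer {args.layer} is not available for {args.model}; "
            f"choose from {VALID_LAYERS[args.model]}"
        )
    return args


args = parse_cli(["--model", "GPT2large", "--feat", "story", "--layer", "36"])
print(args.model, args.feat, args.layer)
```

Failing early on an invalid combination avoids starting a long run that would only break when the feature files for that layer cannot be found.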
Ridge Regression Parameters:
- --time-delays: Start and end time delays in TRs (default: 8 10)
- -bs, --batch-size: Batch size for ridge solver (default: 1000)
- -abs, --alphas-batch-size: Batch size for alpha search (default: 1000)
- -i, --iterations: Number of iterations for random search solver (default: 1000)
- --n-alphas: Number of alpha values to search (default: 8)
- --no-zscore-stim: Disable z-scoring of stimulus features

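
To make the --time-delays option concrete, the sketch below builds a delayed design matrix in plain Python. It assumes that the start and end delays are inclusive and that rows without enough history are zero-padded; neither detail is confirmed here, and make_delayed is a hypothetical helper rather than a function from this repository.

```python
def make_delayed(features, start, end):
    """Concatenate copies of `features` shifted by each delay in [start, end].

    features: list of T rows (one per TR), each a list of D floats.
    Row t of the result holds features[t - d] for d = start..end, using
    zeros wherever t - d falls before the first TR.
    """
    n_trs = len(features)
    n_dims = len(features[0])
    delayed = []
    for t in range(n_trs):
        row = []
        for d in range(start, end + 1):  # inclusive range (assumption)
            src = t - d
            row.extend(features[src] if 0 <= src < n_trs else [0.0] * n_dims)
        delayed.append(row)
    return delayed


# Toy example: 4 TRs, 1 feature dimension, delays of 1 and 2 TRs.
toy = [[1.0], [2.0], [3.0], [4.0]]
for row in make_delayed(toy, 1, 2):
    print(row)
```

Under these assumptions, the default of 8 10 would predict each TR from the features 8, 9, and 10 TRs earlier, so the delayed matrix has three times as many columns as the original features.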
Story features with GPT-2 Large:
python src/main.py \
-b gpu \
-d cuda:0 \
-s 4 \
-bs 1000 \
-i 100 \
--model GPT2large \
--feat story \
--layer 36

Use results_aggregation.ipynb to analyze and aggregate ridge regression results, and generate figures used in the report.
The notebook loads all pickle files from the results directory and collects statistics, aggregated across subjects and experimental conditions, into a dataframe for further analysis.
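
The loading and aggregation step can be approximated with the standard library alone. This is a sketch under stated assumptions, not the notebook's code: the .pkl extension, the record keys (model, feat, layer, subject, mean_r), and the function names are all hypothetical, and a plain dict stands in for the dataframe.

```python
import pickle
import tempfile
from collections import defaultdict
from pathlib import Path
from statistics import mean


def load_results(results_dir):
    """Load every pickle file found under `results_dir` (recursively)."""
    records = []
    for path in sorted(Path(results_dir).rglob("*.pkl")):
        with open(path, "rb") as f:
            records.append(pickle.load(f))
    return records


def aggregate(records, keys=("model", "feat", "layer")):
    """Average the hypothetical `mean_r` score over subjects per condition."""
    groups = defaultdict(list)
    for rec in records:
        groups[tuple(rec[k] for k in keys)].append(rec["mean_r"])
    return {condition: mean(scores) for condition, scores in groups.items()}


# Demonstration with two made-up result files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    for subject, score in [(4, 0.25), (5, 0.75)]:
        record = {"model": "GPT2large", "feat": "story", "layer": 36,
                  "subject": subject, "mean_r": score}
        with open(Path(tmp) / f"sub{subject}.pkl", "wb") as f:
            pickle.dump(record, f)
    summary = aggregate(load_results(tmp))
    print(summary)
```

Only unpickle files you produced yourself, since loading a pickle can execute arbitrary code.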
src/
├── main.py # Entry point for ridge regression training
├── results_aggregation.ipynb # Jupyter notebook for analyzing results
├── data_cv.py # Data loading and cross-validation setup
├── util/
│ ├── config__drama_data.yaml # Configuration file with data paths
│ ├── drama2brain_utils.py # Utility functions (cross-val, FDR correction)
│ ├── util_dataload.py # fMRI and stimulus data loading
│ ├── util_ridge.py # Ridge model evaluation and scoring
│ ├── util_feat.py # Feature processing utilities
│ ├── util_visualization.py # Visualization helpers
│ └── util_pycortex.py # Brain surface visualization
├── cache/ # Cached data for fast iteration
├── results/ # Output models and results
└── dataInfo.mat # MATLAB file with video/subject metadata
Edit src/util/config__drama_data.yaml to set data paths:
path:
dataInfo: "/path/to/dataInfo.mat"
cache_dir_path: "/path/to/cache/"
cache_active: True
dir:
derivative: "/path/to/ds005531-1.0.0" # Dataset root
ridge: "{derivative_dir}/derivatives/ridge_saba/" # Output root

Required paths:
- derivative: Root directory of the OpenNeuro dataset (ds005531)
- Subdirectories must contain:
  - derivatives/preprocessed_data/: fMRI data (NIfTI format)
  - derivatives/feature/: Extracted LLM features (NPY format)
  - derivatives/annotation/: Annotation files
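
Before launching a run, it can help to check that the configured dataset root has the expected layout. The sketch below is an assumption-laden illustration: it works on a plain dict that mirrors the dir section of the config rather than parsing the YAML file, it assumes the {derivative_dir} placeholder is filled by simple string substitution, and resolve_dirs and missing_subdirs are hypothetical helpers.

```python
import tempfile
from pathlib import Path

# Subdirectories listed above as required under the dataset root.
REQUIRED_SUBDIRS = (
    "derivatives/preprocessed_data",
    "derivatives/feature",
    "derivatives/annotation",
)


def resolve_dirs(dir_cfg):
    """Fill the {derivative_dir} placeholder in every entry (assumed scheme)."""
    derivative = dir_cfg["derivative"]
    return {key: value.format(derivative_dir=derivative)
            for key, value in dir_cfg.items()}


def missing_subdirs(derivative_root):
    """Return the required subdirectories that do not exist under the root."""
    root = Path(derivative_root)
    return [sub for sub in REQUIRED_SUBDIRS if not (root / sub).is_dir()]


# Demonstration with a temporary directory standing in for the dataset root.
with tempfile.TemporaryDirectory() as tmp:
    cfg = resolve_dirs({
        "derivative": tmp,
        "ridge": "{derivative_dir}/derivatives/ridge_saba/",
    })
    (Path(tmp) / "derivatives" / "feature").mkdir(parents=True)
    print(cfg["ridge"])
    print("missing:", missing_subdirs(cfg["derivative"]))
```

A check like this reports a missing or misplaced directory immediately instead of failing partway through data loading.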
- Original Paper: EMNLP 2024 - Unveiling Multi-level and Multi-modal Semantic Representations
- Dataset: OpenNeuro ds005531 - fMRI responses to narrative movies
- Original Code Repositories: