Savaw/CS662-Drama2Brain

Reproduction of Brain Encoding Models using LLM Features

This directory contains the ridge regression pipeline for brain encoding models that predict fMRI responses to narrative video stimuli using LLM features. It implements the core methodology from the paper "Unveiling Multi-level and Multi-modal Semantic Representations in the Human Brain using Large Language Models" (Nakagi et al., EMNLP 2024).

Code Attribution

This implementation is adapted from the following repositories:

  • drama2brain - Multi-level and multi-modal brain encoding framework (paper repository)
  • fMRI_Narrative_movie - Original fMRI analysis pipeline (dataset repository)

The code in this directory integrates utility functions and ridge regression workflows from these original implementations. It adds a new data module that works with the available dataset, and it adapts other parts of the code to resolve incompatibilities and fill in missing details.

Overview

This pipeline predicts brain responses by fitting ridge regression models that map LLM feature representations to fMRI activity recorded while participants watch narrative videos. The implementation uses the Himalaya library for efficient ridge regression with cross-validation.
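To make the mapping concrete, here is a minimal sketch of ridge regression with leave-one-run-out selection of the regularization strength. It deliberately uses plain NumPy with a closed-form solver instead of Himalaya, picks a single alpha for all voxels, and runs on synthetic data. The function names, the alpha grid, and the array sizes are illustrative assumptions, not the repository's code.

```python
import numpy as np

def ridge_fit(X, Y, alpha):
    """Closed-form ridge weights: (X^T X + alpha * I)^-1 X^T Y."""
    n_features = X.shape[1]
    gram = X.T @ X + alpha * np.eye(n_features)
    return np.linalg.solve(gram, X.T @ Y)

def leave_one_run_out(run_ids):
    """Yield (train_idx, test_idx) pairs, holding out one run at a time."""
    run_ids = np.asarray(run_ids)
    for run in np.unique(run_ids):
        test = np.flatnonzero(run_ids == run)
        train = np.flatnonzero(run_ids != run)
        yield train, test

def select_alpha(X, Y, run_ids, alphas):
    """Return the alpha with the lowest mean held-out squared error."""
    mean_errors = []
    for alpha in alphas:
        fold_errors = []
        for train, test in leave_one_run_out(run_ids):
            W = ridge_fit(X[train], Y[train], alpha)
            fold_errors.append(np.mean((Y[test] - X[test] @ W) ** 2))
        mean_errors.append(np.mean(fold_errors))
    return alphas[int(np.argmin(mean_errors))]

# Synthetic stand-in for stimulus features (X) and voxel responses (Y).
rng = np.random.default_rng(0)
n_trs, n_features, n_voxels = 120, 10, 5
X = rng.standard_normal((n_trs, n_features))
W_true = rng.standard_normal((n_features, n_voxels))
Y = X @ W_true + 0.1 * rng.standard_normal((n_trs, n_voxels))
run_ids = np.repeat(np.arange(4), n_trs // 4)  # 4 runs of 30 TRs each

alphas = np.logspace(-2, 4, 8)  # assumed grid of 8 candidate values
best_alpha = select_alpha(X, Y, run_ids, alphas)
W = ridge_fit(X, Y, best_alpha)  # refit on all data with the chosen alpha
```

Holding out whole runs rather than random TRs keeps temporally adjacent samples from leaking between the training and test folds.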

Quick Start

Basic Ridge Regression

Train a ridge regression model on a single feature type:

python src/main.py \
  -b gpu \
  -d cuda:0 \
  -s 4 \
  --model GPT2large \
  --feat story \
  --layer 36 \
  --time-delays 8 10

The main training script does the following:

  1. Loads fMRI and stimulus features via StimRespCVDataModule
  2. Builds a ridge regression pipeline with GroupRidgeCV from Himalaya
  3. Performs leave-one-run-out cross-validation
  4. Evaluates predictions using $R^2$ and Pearson correlation
  5. Applies FDR correction for multiple comparisons
  6. Saves results as pickle files
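The evaluation and FDR steps of the training script can be illustrated with a short NumPy sketch. It assumes the Benjamini-Hochberg procedure at q = 0.05, which is a common choice but is not confirmed by this README, and it takes p-values as input because how they are computed is not described here. All function names are hypothetical.

```python
import numpy as np

def pearson_per_voxel(y_true, y_pred):
    """Column-wise Pearson correlation (assumes no constant columns)."""
    yt = y_true - y_true.mean(axis=0)
    yp = y_pred - y_pred.mean(axis=0)
    denom = np.sqrt((yt ** 2).sum(axis=0) * (yp ** 2).sum(axis=0))
    return (yt * yp).sum(axis=0) / denom

def r2_per_voxel(y_true, y_pred):
    """Column-wise coefficient of determination."""
    ss_res = ((y_true - y_pred) ** 2).sum(axis=0)
    ss_tot = ((y_true - y_true.mean(axis=0)) ** 2).sum(axis=0)
    return 1.0 - ss_res / ss_tot

def fdr_bh(p_values, q=0.05):
    """Benjamini-Hochberg: boolean mask of tests rejected at FDR level q."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)
    thresholds = q * np.arange(1, m + 1) / m
    passed = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if passed.any():
        # Reject every test up to the largest rank that passes its threshold.
        k = np.flatnonzero(passed).max()
        reject[order[: k + 1]] = True
    return reject

# Ten illustrative p-values, one per voxel.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
significant = fdr_bh(p_values, q=0.05)
```

With these ten p-values only the two smallest survive correction, even though five are below 0.05 uncorrected.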

Command-Line Arguments

Model and Hardware:

  • -b, --backend {cpu|gpu|cupy}: Computation backend (default: gpu)
  • -d, --device: CUDA device string or device ID (default: cuda:0)

Experimental Setup:

  • -s, --subject: Subject ID (default: 4)
  • --model {GPT2large|Mistral}: LLM to use (required)
  • --feat {story|object|speech}: Feature annotation type (required)
    • story: Story-level semantic understanding
    • object: Object annotations (50-character objectiveAnnot)
    • speech: Speech transcription features
  • --layer: Layer number to extract (required)
    • GPT-2 Large: 9, 18, 27, 36
    • Mistral: 8, 16, 24, 32

Ridge Regression Parameters:

  • --time-delays: Start and end time delays in TRs (default: 8 10)
  • -bs, --batch-size: Batch size for ridge solver (default: 1000)
  • -abs, --alphas-batch-size: Batch size for alpha search (default: 1000)
  • -i, --iterations: Number of iterations for random search solver (default: 1000)
  • --n-alphas: Number of alpha values to search (default: 8)
  • --no-zscore-stim: Disable z-scoring of stimulus features
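As a reference for how these flags fit together, the following is a sketch of an argparse parser that mirrors the options and defaults listed above. It is a reconstruction from this documentation rather than the actual contents of src/main.py. The integer type for the subject ID and the START/END metavar are assumptions.

```python
import argparse

def build_parser():
    """Build a parser mirroring the documented command-line options."""
    parser = argparse.ArgumentParser(description="Ridge regression encoding model (sketch)")
    # Model and hardware
    parser.add_argument("-b", "--backend", choices=["cpu", "gpu", "cupy"], default="gpu")
    parser.add_argument("-d", "--device", default="cuda:0")
    # Experimental setup
    parser.add_argument("-s", "--subject", type=int, default=4)  # int type is assumed
    parser.add_argument("--model", choices=["GPT2large", "Mistral"], required=True)
    parser.add_argument("--feat", choices=["story", "object", "speech"], required=True)
    parser.add_argument("--layer", type=int, required=True)
    # Ridge regression parameters
    parser.add_argument("--time-delays", type=int, nargs=2, default=[8, 10],
                        metavar=("START", "END"))
    parser.add_argument("-bs", "--batch-size", type=int, default=1000)
    parser.add_argument("-abs", "--alphas-batch-size", type=int, default=1000)
    parser.add_argument("-i", "--iterations", type=int, default=1000)
    parser.add_argument("--n-alphas", type=int, default=8)
    parser.add_argument("--no-zscore-stim", action="store_true")
    return parser

args = build_parser().parse_args(
    ["-s", "4", "-i", "100", "--model", "GPT2large", "--feat", "story", "--layer", "36"]
)
```

Options that are omitted on the command line, such as --time-delays here, fall back to the documented defaults.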

Example Commands

Story features with GPT-2 Large:

python src/main.py \
    -b gpu \
    -d cuda:0 \
    -s 4 \
    -bs 1000 \
    -i 100 \
    --model GPT2large \
    --feat story \
    --layer 36
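This command leaves --time-delays at its default of 8 10. One common way to apply such delays in encoding models is to stack time-shifted copies of the feature matrix, so the regression can account for the lag of the hemodynamic response. The sketch below shows that idea. Whether the pipeline builds delays this way, and whether the end delay is inclusive, are both assumptions, and make_delayed is a hypothetical helper.

```python
import numpy as np

def make_delayed(features, delays):
    """Stack copies of `features` shifted forward in time by each delay (in TRs).

    Rows with no stimulus history for a given delay are left as zeros.
    Delays are assumed to be non-negative integers.
    """
    n_trs, n_feat = features.shape
    delayed = np.zeros((n_trs, n_feat * len(delays)), dtype=features.dtype)
    for i, d in enumerate(delays):
        cols = slice(i * n_feat, (i + 1) * n_feat)
        if d == 0:
            delayed[:, cols] = features
        elif d < n_trs:
            # The response at TR t is paired with the stimulus at TR t - d.
            delayed[d:, cols] = features[:-d]
    return delayed

features = np.arange(40, dtype=float).reshape(20, 2)  # 20 TRs, 2 features
delays = list(range(8, 10 + 1))  # 8, 9, 10 TRs; inclusive end is an assumption
X_delayed = make_delayed(features, delays)  # shape (20, 6)
```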

Results Aggregation and Analysis

Use results_aggregation.ipynb to analyze and aggregate the ridge regression results and to generate the figures used in the report.

The notebook loads all pickle files from the results directory and collects statistics aggregated across subjects and experimental conditions into a dataframe for further analysis.
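A minimal sketch of this kind of aggregation, using only the standard library, is shown below. The .pkl extension, the assumption that each file holds a dict, and the example keys "subject" and "mean_r" are all hypothetical. The actual structure of the saved results is not described in this README.

```python
import pickle
from pathlib import Path

def load_results(results_dir):
    """Load every *.pkl file in results_dir into a list of flat records."""
    records = []
    for path in sorted(Path(results_dir).glob("*.pkl")):
        with path.open("rb") as f:
            result = pickle.load(f)  # assumed to be a dict of scalar fields
        records.append({"file": path.name, **result})
    return records

def mean_by(records, key, value):
    """Average the field `value` over records grouped by the field `key`."""
    groups = {}
    for rec in records:
        groups.setdefault(rec[key], []).append(rec[value])
    return {k: sum(v) / len(v) for k, v in groups.items()}

# Hypothetical usage:
#   records = load_results("src/results")
#   per_subject = mean_by(records, "subject", "mean_r")
```

A list of dicts like this can be passed directly to pandas.DataFrame if a dataframe is preferred.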

Directory Structure

src/
├── main.py                          # Entry point for ridge regression training
├── results_aggregation.ipynb        # Jupyter notebook for analyzing results
├── data_cv.py                       # Data loading and cross-validation setup
├── util/
│   ├── config__drama_data.yaml      # Configuration file with data paths
│   ├── drama2brain_utils.py         # Utility functions (cross-val, FDR correction)
│   ├── util_dataload.py             # fMRI and stimulus data loading
│   ├── util_ridge.py                # Ridge model evaluation and scoring
│   ├── util_feat.py                 # Feature processing utilities
│   ├── util_visualization.py        # Visualization helpers
│   └── util_pycortex.py             # Brain surface visualization
├── cache/                           # Cached data for fast iteration
├── results/                         # Output models and results
└── dataInfo.mat                     # MATLAB file with video/subject metadata

Configuration

Edit src/util/config__drama_data.yaml to set data paths:

path:
  dataInfo: "/path/to/dataInfo.mat"
  cache_dir_path: "/path/to/cache/"
  cache_active: True

dir:
  derivative: "/path/to/ds005531-1.0.0" # Dataset root
  ridge: "{derivative_dir}/derivatives/ridge_saba/" # Output root

Required paths:

  • derivative: Root directory of the OpenNeuro dataset (ds005531)
  • The dataset root must contain these subdirectories:
    • derivatives/preprocessed_data/: fMRI data (NIfTI format)
    • derivatives/feature/: Extracted LLM features (NPY format)
    • derivatives/annotation/: Annotation files
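Before launching a run, it can help to check that the configured paths resolve and that the required subdirectories exist. The sketch below works on a plain dict shaped like the parsed YAML above. It assumes the {derivative_dir} placeholder is filled with the value of dir.derivative, which is a guess at how the pipeline interprets it, and both helper functions are hypothetical.

```python
from pathlib import Path

REQUIRED_SUBDIRS = (
    "derivatives/preprocessed_data",
    "derivatives/feature",
    "derivatives/annotation",
)

def resolve_dirs(config):
    """Return (dataset root, ridge output dir) with {derivative_dir} filled in."""
    derivative = config["dir"]["derivative"]
    # Assumption: the placeholder refers to the dataset root.
    ridge = config["dir"]["ridge"].format(derivative_dir=derivative)
    return Path(derivative), Path(ridge)

def missing_subdirs(derivative_dir):
    """List required subdirectories that do not exist under the dataset root."""
    root = Path(derivative_dir)
    return [sub for sub in REQUIRED_SUBDIRS if not (root / sub).is_dir()]

# A dict shaped like the YAML file after parsing.
config = {
    "dir": {
        "derivative": "/path/to/ds005531-1.0.0",
        "ridge": "{derivative_dir}/derivatives/ridge_saba/",
    }
}
derivative_dir, ridge_dir = resolve_dirs(config)
```

Calling missing_subdirs(derivative_dir) then returns an empty list when the dataset layout is complete.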

Reproduction of Nakagi et al., EMNLP 2024. Aligning human fMRI responses to LLM extracted embeddings of movie annotations.
