Reproduction of Brain Encoding Models using LLM Features

This directory contains the ridge regression pipeline for brain encoding models that predict fMRI responses to narrative video stimuli using LLM features. It implements the core methodology from the paper "Unveiling Multi-level and Multi-modal Semantic Representations in the Human Brain using Large Language Models" (Nakagi et al., EMNLP 2024).

Code Attribution

This implementation is adapted from the following repositories:

drama2brain - Multi-level and multi-modal brain encoding framework (paper repository)
fMRI_Narrative_movie - Original fMRI analysis pipeline (dataset repository)

The code in this directory integrates utility functions and ridge regression workflows from these original implementations. It adds a new data module that works with the available dataset and adapts other parts of the code to resolve incompatibilities and fill in missing details.

Overview

This pipeline predicts brain responses by fitting ridge regression models that map LLM feature representations to fMRI activity recorded while participants watch narrative videos. The implementation uses the Himalaya library for efficient ridge regression with cross-validation.
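As a rough illustration of the idea, the sketch below fits a closed-form ridge model with plain NumPy and picks the regularization strength by leave-one-run-out cross-validation. It is not the Himalaya-based code in this directory: the function names, the toy data sizes, and the use of a single alpha shared across all voxels are assumptions made for brevity.

```python
import numpy as np

def ridge_fit(X, Y, alpha):
    """Closed-form ridge weights (X^T X + alpha I)^-1 X^T Y, one column per voxel."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ Y)

def leave_one_run_out(run_ids):
    """Yield (train_idx, val_idx) pairs, holding out one run at a time."""
    run_ids = np.asarray(run_ids)
    for run in np.unique(run_ids):
        yield np.flatnonzero(run_ids != run), np.flatnonzero(run_ids == run)

def select_alpha(X, Y, run_ids, alphas):
    """Return the alpha with the lowest mean held-out squared error."""
    errors = []
    for alpha in alphas:
        fold_err = []
        for train, val in leave_one_run_out(run_ids):
            W = ridge_fit(X[train], Y[train], alpha)
            fold_err.append(np.mean((X[val] @ W - Y[val]) ** 2))
        errors.append(np.mean(fold_err))
    return alphas[int(np.argmin(errors))]

# Toy data: 3 runs of 40 TRs, 5 features, 4 voxels (all sizes are made up).
rng = np.random.default_rng(0)
X = rng.standard_normal((120, 5))
W_true = rng.standard_normal((5, 4))
Y = X @ W_true + 0.1 * rng.standard_normal((120, 4))
run_ids = np.repeat([0, 1, 2], 40)

alphas = np.logspace(-2, 4, 8)  # 8 candidate values; the range is an assumption
best_alpha = select_alpha(X, Y, run_ids, alphas)
W_hat = ridge_fit(X, Y, best_alpha)  # refit on all runs with the chosen alpha
```

Holding out whole runs rather than random TRs keeps temporally adjacent, autocorrelated samples from leaking between the training and validation sets.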

Quick Start

Basic Ridge Regression

Train a ridge regression model on a single feature type:

python src/main.py \
  -b gpu \
  -d cuda:0 \
  -s 4 \
  --model GPT2large \
  --feat story \
  --layer 36 \
  --time-delays 8 10

The main training script does the following:

Loads fMRI and stimulus features via StimRespCVDataModule
Builds a ridge regression pipeline with GroupRidgeCV from Himalaya
Performs leave-one-run-out cross-validation
Evaluates predictions using $R^2$ and Pearson correlation
Applies FDR correction for multiple comparisons
Saves results as pickle files
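The evaluation steps of the training script (voxel-wise $R^2$ and Pearson correlation, followed by FDR correction) can be sketched with NumPy as below. This is an illustrative sketch rather than the code in util_ridge.py or drama2brain_utils.py: the function names are hypothetical, the Benjamini-Hochberg procedure is assumed as the FDR method, and the p-values are made-up numbers, since this README does not say how they are computed.

```python
import numpy as np

def pearson_per_voxel(y_true, y_pred):
    """Pearson correlation between measured and predicted time courses, per column."""
    yt = y_true - y_true.mean(axis=0)
    yp = y_pred - y_pred.mean(axis=0)
    denom = np.sqrt((yt ** 2).sum(axis=0) * (yp ** 2).sum(axis=0))
    return (yt * yp).sum(axis=0) / denom

def r2_per_voxel(y_true, y_pred):
    """Coefficient of determination per column: 1 - SS_res / SS_tot."""
    ss_res = ((y_true - y_pred) ** 2).sum(axis=0)
    ss_tot = ((y_true - y_true.mean(axis=0)) ** 2).sum(axis=0)
    return 1.0 - ss_res / ss_tot

def fdr_bh(pvals, q=0.05):
    """Benjamini-Hochberg: boolean mask of tests rejected at FDR level q."""
    pvals = np.asarray(pvals, dtype=float)
    m = pvals.size
    order = np.argsort(pvals)
    below = pvals[order] <= q * np.arange(1, m + 1) / m
    rejected = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.flatnonzero(below))  # largest rank passing the threshold
        rejected[order[: k + 1]] = True
    return rejected

# Two toy voxels over four TRs: one predicted perfectly, one anti-correlated.
y_true = np.array([[1.0, 1.0], [2.0, 3.0], [3.0, 2.0], [4.0, 4.0]])
y_pred = np.array([[1.0, 4.0], [2.0, 2.0], [3.0, 3.0], [4.0, 1.0]])
r = pearson_per_voxel(y_true, y_pred)
r2 = r2_per_voxel(y_true, y_pred)

# Hypothetical p-values for five voxels, corrected at q = 0.05.
significant = fdr_bh([0.001, 0.008, 0.039, 0.041, 0.6], q=0.05)
```

Note that $R^2$ can be strongly negative for a voxel whose prediction is anti-correlated with the data, whereas Pearson correlation is bounded in [-1, 1], which is one reason to report both.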

Command-Line Arguments

Model and Hardware:

-b, --backend {cpu|gpu|cupy}: Computation backend (default: gpu)
-d, --device: CUDA device string or device ID (default: cuda:0)

Experimental Setup:

-s, --subject: Subject ID (default: 4)
--model {GPT2large|Mistral}: LLM to use (required)
--feat {story|object|speech}: Feature annotation type (required)
- story: Story-level semantic understanding
- object: Object annotations (50-character objectiveAnnot)
- speech: Speech transcription features
--layer: Layer number to extract (required)
- GPT-2 Large: 9, 18, 27, 36
- Mistral: 8, 16, 24, 32

Ridge Regression Parameters:

--time-delays: Start and end time delays in TRs (default: 8 10)
-bs, --batch-size: Batch size for ridge solver (default: 1000)
-abs, --alphas-batch-size: Batch size for alpha search (default: 1000)
-i, --iterations: Number of iterations for random search solver (default: 1000)
--n-alphas: Number of alpha values to search (default: 8)
--no-zscore-stim: Disable z-scoring of stimulus features
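To make the --time-delays and z-scoring options concrete, here is a minimal NumPy sketch of how delayed stimulus features are commonly built for encoding models. The helper names are hypothetical, and two details are assumptions rather than documented behavior of this pipeline: the end delay is treated as inclusive, and rows without enough history are zero-padded.

```python
import numpy as np

def zscore(X):
    """Standardize each feature column to zero mean and unit variance."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

def make_delayed(X, start, end):
    """Stack copies of X shifted by start..end TRs (inclusive) along the feature axis.

    Row t of the result holds the features from TRs t-start, ..., t-end;
    rows without enough history are left as zeros.
    """
    n_trs, n_feat = X.shape
    delays = range(start, end + 1)
    out = np.zeros((n_trs, n_feat * len(delays)))
    for i, d in enumerate(delays):
        if d < n_trs:
            out[d:, i * n_feat:(i + 1) * n_feat] = X[: n_trs - d]
    return out

X = np.arange(24, dtype=float).reshape(12, 2)  # 12 TRs, 2 toy features
X_delayed = make_delayed(X, 8, 10)             # shape (12, 6): 3 delays x 2 features
X_z = zscore(X)                                # skipped when --no-zscore-stim is set
```

With the default delays of 8 to 10 TRs, the response at each TR is modeled from stimulus features presented 8, 9, and 10 TRs earlier, which accounts for the lag of the hemodynamic response.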

Example Commands

Story features with GPT-2 Large:

python src/main.py \
    -b gpu \
    -d cuda:0 \
    -s 4 \
    -bs 1000 \
    -i 100 \
    --model GPT2large \
    --feat story \
    --layer 36
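To run the same configuration for every listed GPT-2 Large layer, the command can be assembled programmatically. The sketch below is a hypothetical helper, not part of this repository; it only uses flags and layer numbers documented above.

```python
import subprocess
import sys

GPT2_LARGE_LAYERS = (9, 18, 27, 36)  # layer choices listed for GPT-2 Large

def build_command(model, feat, layer, subject=4, device="cuda:0"):
    """Assemble the argv list for one src/main.py run."""
    return [
        sys.executable, "src/main.py",
        "-b", "gpu",
        "-d", device,
        "-s", str(subject),
        "--model", model,
        "--feat", feat,
        "--layer", str(layer),
    ]

def run_all(commands):
    """Run each command in sequence, stopping at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)

# One command per layer; call run_all(commands) from the repository root to execute them.
commands = [build_command("GPT2large", "story", layer) for layer in GPT2_LARGE_LAYERS]
```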

Results Aggregation and Analysis

Use results_aggregation.ipynb to analyze and aggregate ridge regression results and to generate the figures used in the report.

The notebook loads all pickle files from the results directory and aggregates statistics across subjects and experimental conditions into a single dataframe for further analysis.
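The loading step can be approximated outside the notebook with the standard library alone. The sketch below is not the notebook's code: it assumes that result files end in .pkl and that each one stores a dictionary, and it keeps only scalar entries so that every file maps to one flat row. The keys stored by the actual pipeline are not documented here.

```python
import pickle
from pathlib import Path

def load_results(results_dir):
    """Load every *.pkl file under results_dir into a list of row dictionaries."""
    rows = []
    for path in sorted(Path(results_dir).rglob("*.pkl")):
        with open(path, "rb") as f:
            result = pickle.load(f)
        row = {"file": path.name}
        # Keep only scalar entries so each file becomes one flat table row.
        if isinstance(result, dict):
            row.update({k: v for k, v in result.items()
                        if isinstance(v, (int, float, str))})
        rows.append(row)
    return rows
```

The returned list of dictionaries can be passed directly to pandas.DataFrame to obtain a table with one row per result file.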

Directory Structure

src/
├── main.py                          # Entry point for ridge regression training
├── results_aggregation.ipynb        # Jupyter notebook for analyzing results
├── data_cv.py                       # Data loading and cross-validation setup
├── util/
│   ├── config__drama_data.yaml      # Configuration file with data paths
│   ├── drama2brain_utils.py         # Utility functions (cross-val, FDR correction)
│   ├── util_dataload.py             # fMRI and stimulus data loading
│   ├── util_ridge.py                # Ridge model evaluation and scoring
│   ├── util_feat.py                 # Feature processing utilities
│   ├── util_visualization.py        # Visualization helpers
│   └── util_pycortex.py             # Brain surface visualization
├── cache/                           # Cached data for fast iteration
├── results/                         # Output models and results
└── dataInfo.mat                     # MATLAB file with video/subject metadata

Configuration

Edit src/util/config__drama_data.yaml to set data paths:

path:
  dataInfo: "/path/to/dataInfo.mat"
  cache_dir_path: "/path/to/cache/"
  cache_active: True

dir:
  derivative: "/path/to/ds005531-1.0.0" # Dataset root
  ridge: "{derivative_dir}/derivatives/ridge_saba/" # Output root

Required paths:

derivative: Root directory of the OpenNeuro dataset (ds005531). It must contain the following subdirectories:
- derivatives/preprocessed_data/: fMRI data (NIFTI format)
- derivatives/feature/: Extracted LLM features (NPY format)
- derivatives/annotation/: Annotation files
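Before launching a run, it can help to verify that the dataset root has the expected layout. The following is a small hypothetical check, not part of this repository, using only the three subdirectory names listed above.

```python
from pathlib import Path

REQUIRED_SUBDIRS = (
    "derivatives/preprocessed_data",
    "derivatives/feature",
    "derivatives/annotation",
)

def missing_subdirs(dataset_root):
    """Return the required subdirectories that do not exist under dataset_root."""
    root = Path(dataset_root)
    return [sub for sub in REQUIRED_SUBDIRS if not (root / sub).is_dir()]

# Placeholder path taken from the example config; replace it with the real dataset root.
missing = missing_subdirs("/path/to/ds005531-1.0.0")
if missing:
    print("Missing:", ", ".join(missing))
```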

References and Resources

Original Paper: Nakagi et al., EMNLP 2024 - Unveiling Multi-level and Multi-modal Semantic Representations in the Human Brain using Large Language Models
Dataset: OpenNeuro ds005531 - fMRI responses to narrative movies
Original Code Repositories:
- fMRI_Narrative_movie
- drama2brain
