From 3a0287a8d1c2b47e76f14f63d893278ebe943447 Mon Sep 17 00:00:00 2001
From: "Victor I. Afolabi"
Date: Thu, 19 Sep 2024 13:23:43 +0100
Subject: [PATCH] Update README.md

---
 README.md | 226 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 226 insertions(+)

diff --git a/README.md b/README.md
index 9b19cc4..1f3f29c 100644
--- a/README.md
+++ b/README.md
@@ -1,6 +1,7 @@
# EEG Thought Decoder

[![CI](https://github.com/victor-iyi/eeg-thought-decoder/actions/workflows/ci.yaml/badge.svg)](https://github.com/victor-iyi/eeg-thought-decoder/actions/workflows/ci.yaml)
[![pre-commit](https://github.com/victor-iyi/eeg-thought-decoder/actions/workflows/pre-commit.yml/badge.svg)](https://github.com/victor-iyi/eeg-thought-decoder/actions/workflows/pre-commit.yml)
[![formatter | docformatter](https://img.shields.io/badge/%20formatter-docformatter-fedcba.svg)](https://github.com/PyCQA/docformatter)
[![style | google](https://img.shields.io/badge/%20style-google-3666d6.svg)](https://google.github.io/styleguide/pyguide.html#s3.8-comments-and-docstrings)

@@ -16,6 +17,231 @@ can be decoded from EEG signals using a sophisticated Transformer-based AI
model. I incorporate elements from Graph Neural Networks (GNNs), expert models,
and agentic models to enhance the model's specialization and accuracy.

## Mathematical Formulation

Let:

- $\mathbf{X} \in \mathbb{R}^{N \times T}$ denote the EEG data matrix,
where $N$ is the number of electrodes (channels) and $T$ is the number of time steps.
- $\mathbf{y} \in \mathbb{R}^C$ represent the target thought or cognitive state,
encoded as a one-hot vector over $C$ possible classes.

Our goal is to find a function $f: \mathbb{R}^{N \times T} \rightarrow \mathbb{R}^C$
such that:

$\hat{\mathbf{y}} = f(\mathbf{X}),$

where $\hat{\mathbf{y}}$ is the model's prediction of the thought corresponding
to EEG input $\mathbf{X}$.

## Model Architecture

The proposed AI model integrates several advanced components:

1. **Transformer Encoder** for *Temporal Dynamics*
2. **Graph Neural Network** for *Spatial Relationships*
3. **Mixture of Experts** for *Specialization*
4. **Agentic Learning** for *Dynamic Adaptation*

### 1. Transformer Encoder for Temporal Dynamics

#### Input Embedding

Each EEG channel signal is embedded into a higher-dimensional space:

$\mathbf{E} = \text{Embedding}(\mathbf{X}) \in \mathbb{R}^{N \times T \times d_{\text{model}}},$

where $d_{\text{model}}$ is the model dimension.

#### Positional Encoding

To incorporate temporal information, a positional encoding is added:

$\mathbf{E}_{\text{pos}} = \mathbf{E} + \mathbf{P},$

where $\mathbf{P} \in \mathbb{R}^{N \times T \times d_{\text{model}}}$ is the
positional encoding matrix defined as:

$$\mathbf{P}_{(n,t,2k)} = \sin\left( \frac{t}{10000^{2k/d_{\text{model}}}} \right),
\quad
\mathbf{P}_{(n,t,2k+1)} = \cos\left( \frac{t}{10000^{2k/d_{\text{model}}}} \right).$$

#### Multi-Head Self-Attention

For each head $h$ and layer $l$:

- **Query**: $\mathbf{Q}_h^{(l)} = \mathbf{E}_{\text{pos}}^{(l)} \mathbf{W}_h^{Q(l)}$
- **Key**: $\mathbf{K}_h^{(l)} = \mathbf{E}_{\text{pos}}^{(l)} \mathbf{W}_h^{K(l)}$
- **Value**: $\mathbf{V}_h^{(l)} = \mathbf{E}_{\text{pos}}^{(l)} \mathbf{W}_h^{V(l)}$

Compute attention weights:

$\mathbf{A}_h^{(l)} = \text{softmax}\left( \frac{\mathbf{Q}_h^{(l)} (\mathbf{K}_h^{(l)})^\top}{\sqrt{d_k}} \right).$

Update embeddings:

$\mathbf{Z}_h^{(l)} = \mathbf{A}_h^{(l)} \mathbf{V}_h^{(l)}.$

Concatenate heads and apply a linear transformation:

$\mathbf{Z}^{(l)} = \text{Concat}\left( \mathbf{Z}_1^{(l)}, \dots, \mathbf{Z}_H^{(l)} \right) \mathbf{W}^{O(l)}.$

- $\mathbf{Z}^{(l)}$: the output of the multi-head attention mechanism at layer $l$.
- $\text{Concat}(\cdot)$: concatenates the outputs of all $H$ attention heads.
- $\mathbf{Z}_1^{(l)}, \dots, \mathbf{Z}_H^{(l)}$: the outputs of the $H$ attention heads at layer $l$.
- $\mathbf{W}^{O(l)}$: the output weight matrix at layer $l$.

#### Feed-Forward Network

Apply a position-wise feed-forward network:

$\mathbf{F}^{(l)} = \text{ReLU}\left( \mathbf{Z}^{(l)} \mathbf{W}_1^{(l)} + \mathbf{b}_1^{(l)} \right) \mathbf{W}_2^{(l)} + \mathbf{b}_2^{(l)}.$

#### Layer Normalization and Residual Connections

Each sub-layer includes a residual connection followed by layer normalization:

$\mathbf{E}_{\text{pos}}^{(l+1)} = \text{LayerNorm}\left( \mathbf{E}_{\text{pos}}^{(l)} + \mathbf{F}^{(l)} \right).$
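To make the encoder equations concrete, here is a minimal NumPy sketch of the
sinusoidal positional encoding and a single multi-head self-attention layer for
one channel's embedded signal. The shapes, head count, and the random matrices
standing in for the learned weights $\mathbf{W}^{Q}$, $\mathbf{W}^{K}$,
$\mathbf{W}^{V}$, and $\mathbf{W}^{O}$ are illustrative assumptions, not the
repository's implementation.

```python
import numpy as np


def positional_encoding(T: int, d_model: int) -> np.ndarray:
    """Sinusoidal encoding: P[t, 2k] = sin(t / 10000^(2k/d)), P[t, 2k+1] = cos(...)."""
    pos = np.arange(T)[:, None]                    # (T, 1)
    k = np.arange(0, d_model, 2)[None, :]          # (1, d_model / 2)
    angle = pos / np.power(10000.0, k / d_model)   # (T, d_model / 2)
    pe = np.zeros((T, d_model))
    pe[:, 0::2] = np.sin(angle)
    pe[:, 1::2] = np.cos(angle)
    return pe


def softmax(x: np.ndarray) -> np.ndarray:
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


def multi_head_self_attention(E: np.ndarray, n_heads: int, rng) -> np.ndarray:
    """One attention layer for E of shape (T, d_model); returns (T, d_model)."""
    T, d_model = E.shape
    d_k = d_model // n_heads
    heads = []
    for _ in range(n_heads):
        # Random projections stand in for the learned W^Q, W^K, W^V of one head.
        W_q, W_k, W_v = (rng.standard_normal((d_model, d_k)) / np.sqrt(d_model)
                         for _ in range(3))
        Q, K, V = E @ W_q, E @ W_k, E @ W_v
        A = softmax(Q @ K.T / np.sqrt(d_k))        # (T, T) attention weights
        heads.append(A @ V)                        # (T, d_k) updated embeddings
    W_o = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
    return np.concatenate(heads, axis=-1) @ W_o    # concatenate heads, project


rng = np.random.default_rng(0)
T, d_model = 128, 64                               # e.g. 128 time steps per channel
E = rng.standard_normal((T, d_model))              # embedded EEG signal (one channel)
E_pos = E + positional_encoding(T, d_model)
Z = multi_head_self_attention(E_pos, n_heads=4, rng=rng)
print(Z.shape)                                     # (128, 64)
```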
### 2. Graph Neural Network for Spatial Relationships

#### Graph Construction

Construct a graph $G = (V, E)$ where:

- $V$ represents the set of EEG electrodes.
- $E$ represents edges based on physical proximity or functional connectivity.

#### Graph Laplacian

Compute the normalized graph Laplacian:

$\mathbf{L} = \mathbf{I} - \mathbf{D}^{-1/2} \mathbf{A} \mathbf{D}^{-1/2},$

where $\mathbf{A}$ is the adjacency matrix and $\mathbf{D}$ is the degree matrix.

#### Graph Convolution

Apply a GNN to capture spatial dependencies:

$\mathbf{H}^{(k+1)} = \sigma\left( \mathbf{L} \mathbf{H}^{(k)} \mathbf{W}^{(k)} \right),$

where $\mathbf{H}^{(0)} = \mathbf{E}_{\text{pos}}^{(L)}$ is the output of the
Transformer encoder and $\sigma$ is an activation function (e.g., ReLU).

### 3. Mixture of Experts for Specialization

#### Expert Models

Define $M$ expert models $\{f_m\}_{m=1}^M$, each specializing in a different
aspect of the signal (e.g., frequency bands or cognitive tasks).

#### Gating Mechanism

Learn gating functions $g_m(\mathbf{X})$ that weight each expert's contribution:

$\hat{\mathbf{y}} = \sum_{m=1}^M g_m(\mathbf{X}) f_m(\mathbf{X}),$

subject to $\sum_{m=1}^M g_m(\mathbf{X}) = 1$ and $g_m(\mathbf{X}) \geq 0$.
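The spatial propagation and expert-gating steps above can be sketched compactly
as well. In the NumPy snippet below, the ring-shaped electrode graph, the random
expert and gating weights, the mean-pooling step, and all dimensions are
hypothetical placeholders used only to illustrate the equations, not the
project's actual architecture.

```python
import numpy as np


def normalized_laplacian(A: np.ndarray) -> np.ndarray:
    """L = I - D^{-1/2} A D^{-1/2} for a symmetric adjacency matrix A."""
    deg = A.sum(axis=1)
    d_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(deg, 1e-12)))
    return np.eye(A.shape[0]) - d_inv_sqrt @ A @ d_inv_sqrt


def graph_conv(H: np.ndarray, L: np.ndarray, W: np.ndarray) -> np.ndarray:
    """One propagation step: H^{(k+1)} = ReLU(L H^{(k)} W^{(k)})."""
    return np.maximum(L @ H @ W, 0.0)


def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max())
    return e / e.sum()


rng = np.random.default_rng(0)
N, d, C, M = 8, 16, 4, 3            # electrodes, feature dim, classes, experts

# Toy electrode graph: a ring where each electrode touches its two neighbours.
A = np.zeros((N, N))
for i in range(N):
    A[i, (i + 1) % N] = A[(i + 1) % N, i] = 1.0
L = normalized_laplacian(A)

H = rng.standard_normal((N, d))                      # per-electrode encoder features
H = graph_conv(H, L, rng.standard_normal((d, d)))    # spatial message passing

# Mixture of experts: each expert maps pooled features to class scores and a
# gating network produces non-negative weights g_m(X) that sum to one.
x = H.mean(axis=0)                                   # pooled graph representation
experts = [rng.standard_normal((d, C)) for _ in range(M)]
gate = softmax(x @ rng.standard_normal((d, M)))
y_hat = softmax(sum(g * (x @ W_m) for g, W_m in zip(gate, experts)))
print(gate.round(3), y_hat.round(3))
```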
### 4. Agentic Learning for Dynamic Adaptation

Incorporate an agent that interacts with the environment and adapts based on feedback.

#### Policy Network

Define a policy $\pi_\theta(a | \mathbf{X})$, where $a$ is an action
(e.g., adjusting model parameters).

#### Reward Function

Define a reward $R(\hat{\mathbf{y}}, \mathbf{y})$ based on decoding accuracy.

#### Optimization Objective

Maximize the expected reward:

$\max_\theta \mathbb{E}_{\mathbf{X}, \mathbf{y}} \left[ R\left( \hat{\mathbf{y}}, \mathbf{y} \right) \right].$

Update the parameters using policy gradients:

$\theta \leftarrow \theta + \eta \nabla_\theta \mathbb{E}\left[ R \right].$

### Training Procedure

1. **Loss Function**
   Use the cross-entropy loss over a batch of $B$ training examples:

   $L = -\frac{1}{B} \sum_{i=1}^B \mathbf{y}^{(i)\top} \log \hat{\mathbf{y}}^{(i)}.$

2. **Regularization**
   Include regularization terms to prevent overfitting:

   $L_{\text{total}} = L + \lambda \left( \| \theta \|^2 + \sum_{m=1}^M \| g_m \|^2 \right).$

3. **Optimization Algorithm**
   Use the Adam optimizer with gradients computed via backpropagation.

---

### Mathematical Proof of Decoding Capability

#### Universal Approximation Theorem for Transformers

Transformers are capable of approximating any sequence-to-sequence function,
given sufficient model capacity.

- **Existence of a function $f$:**
  There exists a function $f$ such that:

  $\mathbf{y} = f(\mathbf{X}) = f_{\text{agent}} \left( f_{\text{experts}} \left( f_{\text{GNN}} \left( f_{\text{Transformer}}(\mathbf{X}) \right) \right) \right).$

- **Approximation by the model:**
  Given sufficient model capacity and proper training, $f$ can be approximated
  arbitrarily well.

#### Proof Sketch

1. **Transformer Encoder:**
   Captures temporal dependencies, approximating temporal mappings in EEG data.

2. **Graph Neural Network:**
   Models spatial relationships, capturing the spatial structure of EEG electrodes.

3. **Mixture of Experts:**
   Enhances specialization, allowing the model to approximate complex functions
   by combining simpler ones.

4. **Agentic Model:**
   Adapts dynamically, refining the approximation based on feedback.

---

### Verification and Evaluation

1. **Cross-Validation**
   Implement $k$-fold cross-validation to assess generalization.

2. **Performance Metrics**

   - **Accuracy:** $\text{Accuracy} = \frac{1}{n} \sum_{i=1}^n \mathbf{1}\{\hat{\mathbf{y}}^{(i)} = \mathbf{y}^{(i)}\}$,
     where $n$ is the number of evaluation examples.
   - **Precision, Recall, F1-Score:** calculated per class.

3. **Statistical Significance**
   Perform hypothesis testing (e.g., permutation tests) to confirm that decoding
   performance is significantly better than chance.

4. **Ablation Studies**
   Evaluate the impact of each component (Transformer, GNN, experts, agentic
   learning) by systematically removing it and observing performance changes.

## Contribution

You are very welcome to modify and use them in your own projects.
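If you would like to experiment with the formulation above, the short NumPy
sketch below shows the cross-entropy objective with the $\lambda$-weighted
penalty from the Training Procedure section, together with a reward-driven
parameter update in the spirit of the rule
$\theta \leftarrow \theta + \eta \nabla_\theta \mathbb{E}[R]$. All shapes, the
learning rate, and the finite-difference gradient estimate (used here in place
of a learned policy gradient) are illustrative assumptions rather than the
repository's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
B, C, d = 32, 4, 16                      # batch size, classes, feature dimension

theta = rng.standard_normal((d, C))      # stand-in for the trainable parameters
X = rng.standard_normal((B, d))          # pooled features for one batch
y = np.eye(C)[rng.integers(0, C, B)]     # one-hot targets


def predict(theta: np.ndarray) -> np.ndarray:
    """Softmax predictions y_hat for the batch."""
    logits = X @ theta
    logits -= logits.max(axis=1, keepdims=True)
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)


def total_loss(theta: np.ndarray, lam: float = 1e-3) -> float:
    """L_total = cross-entropy + lambda * ||theta||^2."""
    y_hat = predict(theta)
    ce = -np.mean(np.sum(y * np.log(y_hat + 1e-12), axis=1))
    return ce + lam * np.sum(theta ** 2)


def reward(theta: np.ndarray) -> float:
    """Reward R(y_hat, y): decoding accuracy on the batch."""
    return float(np.mean(predict(theta).argmax(axis=1) == y.argmax(axis=1)))


# Crude finite-difference (evolution-strategies style) estimate of grad E[R],
# purely for illustration of the reward-driven update.
eta, sigma = 0.1, 0.5
noise = rng.standard_normal(theta.shape)
grad_est = (reward(theta + sigma * noise) - reward(theta)) / sigma * noise
theta = theta + eta * grad_est           # theta <- theta + eta * grad E[R]
print(f"loss={total_loss(theta):.3f}  reward={reward(theta):.3f}")
```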