Hello! I made this video to help fellow data scientists and developers better understand the Transformer architecture that underlies the GPT models and most modern large language models! I was struggling until I finally just created a spreadsheet with a toy example matrix and worked out each matrix transformation one step at a time. It was tough going, but once I was done the concepts finally "clicked" for me. If you are on the same journey of understanding, I hope this helps you too!
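If you prefer code to spreadsheets, here's a minimal NumPy sketch of the kind of toy single-head, causal self-attention calculation the video steps through. The matrix sizes and random values here are illustrative assumptions, not the actual numbers from my spreadsheet:

```python
import numpy as np

np.random.seed(0)

# Toy example: 4 tokens, each embedded as a 3-dimensional vector.
X = np.random.randn(4, 3)

# Projection matrices (random here, just to trace the shapes;
# in a real model these are learned).
W_q = np.random.randn(3, 3)
W_k = np.random.randn(3, 3)
W_v = np.random.randn(3, 3)

Q = X @ W_q  # queries, shape (4, 3)
K = X @ W_k  # keys,    shape (4, 3)
V = X @ W_v  # values,  shape (4, 3)

# Scaled dot-product attention scores, shape (4, 4).
scores = Q @ K.T / np.sqrt(K.shape[-1])

# Causal mask: each token may only attend to itself and earlier tokens.
mask = np.triu(np.ones_like(scores), k=1).astype(bool)
scores[mask] = -np.inf

# Softmax over each row gives attention weights; mixing the values
# produces one context-aware vector per token.
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
out = weights @ V  # shape (4, 3)
print(out)
```

Each line of this corresponds to one of the matrix transformations worked out cell-by-cell in the spreadsheet, so it can serve as a companion while you watch.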
Resources:
Andrej Karpathy’s YouTube video:
- Let’s build GPT: from scratch, in code, spelled out.
Andrej Karpathy’s Colab notebook:
- Building a GPT
The famous paper with the Transformer architecture:
- Attention Is All You Need
GPT papers:
- Language Models are Few-Shot Learners (GPT-3)
- Language Models are Unsupervised Multitask Learners (GPT-2)
YouTube video explaining self-attention:
- Intuition Behind Self Attention in Transformer Networks
Helpful blog post with matrix diagrams:
- Step-by-Step Illustrated Explanations for Transformer
Stanford lecture on word vectors:
- Stanford CS224N: NLP with Deep Learning | Winter 2021 | Lecture 1 - Intro & Word Vectors