As a first exploration, I looked into different implementations of word2vec. I quickly found 'gensim', which, as far as I know, is the first Python implementation of the original algorithm we explored in class for INST0075. It also seems relatively widely used, and I quickly found many example tutorials. Using the dialogue from 27 seasons of The Simpsons seemed like a silly enough way to get started without much of a plan. This project can be broken down into three parts:
- following the original tutorial (referenced later) to clean the data, then initialize and train the model on all dialogue
- looking into new ways to understand the results
- aggregating dialogue by character to understand how characters are represented by what they say.
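As a rough illustration of the first part, the sketch below cleans a few dialogue lines and defines a small gensim training helper. The sample lines, the regex-based `clean_line` helper, the `train_word2vec` wrapper and all hyperparameters are assumptions made for illustration; the notebook's actual cleaning and settings may differ.

```python
import re

# Hypothetical sample rows of (character, spoken line); the real dataset is far larger.
SAMPLE_LINES = [
    ("Homer Simpson", "Marge, where's the remote? I can't find it!"),
    ("Marge Simpson", "Homer, it is under the couch again."),
    ("Bart Simpson", "Eat my shorts!"),
]


def clean_line(text):
    """Lowercase a line and keep only letters and apostrophes (a simplified stand-in for real cleaning)."""
    text = re.sub(r"[^a-z' ]+", " ", text.lower())
    # Drop stray apostrophes at token edges and skip tokens that end up empty.
    return [tok.strip("'") for tok in text.split() if tok.strip("'")]


def train_word2vec(sentences):
    """Train a tiny model on tokenised sentences; requires gensim 4.x. Hyperparameters are illustrative guesses."""
    from gensim.models import Word2Vec  # imported lazily so the cleaning step works without gensim

    return Word2Vec(
        sentences=sentences,
        vector_size=50,  # dimensionality of the word vectors
        window=2,        # context words considered on each side
        min_count=1,     # keep every word, since the toy corpus is tiny
        workers=1,
        seed=1,
    )


# One list of tokens per dialogue line, the input format gensim expects.
sentences = [clean_line(line) for _, line in SAMPLE_LINES]
```

With gensim installed, `model = train_word2vec(sentences)` followed by `model.wv.most_similar("homer")` would list the nearest words, although neighbours learned from three lines are not meaningful.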
Python 3.8.15 (gensim does not support the newest versions of Python at the time of writing)
- gensim 4.3.0
- scikit-learn 1.2.1
- plotly 5.13.0
- spacy 3.5.0
- seaborn 0.12.2
Much of the first half comes directly from this tutorial. It also points to the dataset used, which is available here.
All of the code is run through the Jupyter notebook, which also explains each step as it is run. The original visualisation comes from another tutorial, and the last exploration is largely my own code.
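One possible way to represent a character by what they say is to pool all of their tokens and average the corresponding word vectors. The sketch below shows that idea only. The averaging approach, the toy 2-d vectors, the sample lines and the helper names are all assumptions, not necessarily what the notebook does.

```python
from collections import defaultdict

# Hypothetical 2-d word vectors; in practice these would come from the trained model.
TOY_VECTORS = {
    "eat": [1.0, 0.0],
    "my": [0.0, 1.0],
    "shorts": [1.0, 1.0],
    "doh": [-1.0, 0.0],
}

# Hypothetical (character, tokens) pairs, as they might look after cleaning.
LINES = [
    ("Bart Simpson", ["eat", "my", "shorts"]),
    ("Homer Simpson", ["doh"]),
    ("Homer Simpson", ["my", "doh"]),
]


def tokens_by_character(lines):
    """Pool every token spoken by each character into one list."""
    grouped = defaultdict(list)
    for character, tokens in lines:
        grouped[character].extend(tokens)
    return dict(grouped)


def mean_vector(tokens, vectors):
    """Average the vectors of all known tokens; return None if none are known."""
    known = [vectors[tok] for tok in tokens if tok in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


# One averaged vector per character.
character_vectors = {
    character: mean_vector(tokens, TOY_VECTORS)
    for character, tokens in tokens_by_character(LINES).items()
}
```

Averaging weights frequent words heavily, so filtering very common words first may give more distinctive character vectors.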
Simpsons picture from Wikipedia