An experiment to see if a pre-trained chat model could be taught to mimic the distinctive speech patterns and personality of Homer Simpson solely by exposure to relevant example dialog from the show, without explicitly informing it that the goal was to imitate Homer.
To provide the necessary data, I sourced transcripts of all episodes of The Simpsons and extracted over 28,000 lines spoken by Homer directly in response to another character. I separated these into prompts and responses, taking care to retain the metadata linking related pairs together.
I then utilized semantic vector embeddings to encapsulate the contextual meaning of each prompt and response line in a format conducive to similarity searches. This allowed me to automatically surface prompt-response examples from the dataset that were relevant to new input queries.
When interacting with the AI, I offered brief guidance to mimic a man named Homer based on the provided examples, while inferring personality and speech patterns appropriately. I peppered in relevant prompt-response pairs from Homer as illustrations.
Remarkably, this lightweight prompting and exposure to representative dialog snippets enabled the model to adopt Homer's distinctive voice, tone, and speech cadence with no outright instruction that it was specifically emulating Homer Simpson. It astutely extracted salient patterns purely from the examples provided.
While a relatively simple experiment, it highlights the impressive capabilities of large language models to implicitly learn nuanced human speech dynamics when supplied with suitable source material and incentives. The AI picked up on subtle attributes from the examples that would likely elude an explicit rule-based approach.
While I didn't directly inform the model that it was Homer Simpson, I did mention that it was a man named Homer. This was due to some dialog lines containing his name. I tested the prompt both with and without the examples, and the model fabricated information about "Homer" without the examples. However, I'm concerned that the examples might have made it obvious that I was referring to Homer Simpson, potentially allowing the model to draw from its pre-trained data.
For the next character, I plan to alter the name in the transcript/data set to something completely unrelated. This way, I can inform the model of the character's name without it being able to pull from pre-trained data.
The main area where the model fell short was in mimicking Homer's typing style. I believe this is because the example dataset, despite containing Homer's dialogues, isn't written in his style. It's transcribed as if a professional were documenting each episode. In future tests, I'll aim to provide examples that better reflect the character's unique typing style, or more lazily, I'll pick a character where this isn't a problem.