An SLM (small language model) trained to generate Visual Novel scripts in the style of Doki Doki Literature Club.
The model is a decoder-only transformer with 4 layers and 2,553,331 parameters in total.
The SLM undergoes Curriculum Learning, a training strategy in which the model is first shown simple examples and gradually introduced to more complex ones.
The model is first trained on Tiny Stories - a collection of short fairy tales generated by GPT-3.5 - so that it learns basic grammar and sentence structure.
The model is then trained on generic Visual Novels - a large text file containing scripts from a variety of Visual Novels - so that it learns the basic structure of VN scripts.
Finally, the model is trained on the entirety of the Doki Doki Literature Club VN script - so that it learns how to structure its final outputs.
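Concretely, the curriculum amounts to an ordered list of training stages. The file names and epoch counts below are illustrative assumptions, not the project's actual values:

```python
# Hypothetical stage definitions: (corpus file, epochs), in curriculum order.
CURRICULUM = [
    ("data/tiny_stories.txt", 3),   # stage 1: basic grammar and sentence structure
    ("data/generic_vns.txt", 2),    # stage 2: general VN script structure
    ("data/ddlc_script.txt", 5),    # stage 3: target style, Doki Doki Literature Club
]
```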
tokenise.py gathers the relevant lines from the input text files, trains the BPE tokeniser and tokenises each line into a fixed-length vector of INPUT_LENGTH tokens.
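A minimal sketch of what that step might look like, assuming the Hugging Face tokenizers library; the vocabulary size, special tokens and INPUT_LENGTH value here are illustrative rather than the ones used in tokenise.py:

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

INPUT_LENGTH = 64  # assumed value; the real constant lives in tokenise.py

def train_tokeniser(files, vocab_size=4096):
    # Train a byte-pair-encoding tokeniser on the gathered script lines.
    tokeniser = Tokenizer(BPE(unk_token="[UNK]"))
    tokeniser.pre_tokenizer = Whitespace()
    trainer = BpeTrainer(vocab_size=vocab_size, special_tokens=["[UNK]", "[PAD]"])
    tokeniser.train(files, trainer)
    # Pad or truncate every line to a fixed-size vector of INPUT_LENGTH ids.
    tokeniser.enable_padding(pad_id=tokeniser.token_to_id("[PAD]"),
                             pad_token="[PAD]", length=INPUT_LENGTH)
    tokeniser.enable_truncation(max_length=INPUT_LENGTH)
    return tokeniser

# Usage: train_tokeniser(["data/ddlc_script.txt"]).encode('Natsuki: "Eh? What?"').ids
```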
model.py contains the PyTorch model definition for SmallTransformer.
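A sketch of how a 4-layer decoder-only model can be built from PyTorch's stock transformer layers; the widths and vocabulary size below are guesses and will not reproduce the exact 2,553,331-parameter count:

```python
import torch
import torch.nn as nn

class SmallTransformer(nn.Module):
    """Decoder-only language model: 4 layers of causal self-attention."""
    def __init__(self, vocab_size=4096, d_model=128, n_heads=4, n_layers=4, max_len=64):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, dim_feedforward=4 * d_model,
                                           batch_first=True, norm_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, x):
        seq_len = x.size(1)
        # Causal mask so each position only attends to earlier tokens.
        mask = nn.Transformer.generate_square_subsequent_mask(seq_len).to(x.device)
        h = self.tok_emb(x) + self.pos_emb(torch.arange(seq_len, device=x.device))
        h = self.blocks(h, mask=mask)
        return self.lm_head(h)  # (batch, seq_len, vocab_size) next-token logits
```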
train.py uses the Curriculum Learning strategy to train the model on Tiny Stories, then generic visual novel text, then Doki Doki text.
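A sketch of the loop train.py might implement; the (name, dataloader, epochs) stage interface, the optimiser choice and the learning rate are assumptions:

```python
import torch
import torch.nn.functional as F

def train_curriculum(model, stages, lr=3e-4, device="cpu"):
    # `stages` is an ordered list of (name, dataloader, epochs); each dataloader
    # yields (inputs, targets) batches of token ids offset by one position.
    optimiser = torch.optim.AdamW(model.parameters(), lr=lr)
    model.to(device).train()
    for name, loader, epochs in stages:          # simple -> complex, in order
        for epoch in range(epochs):
            for inputs, targets in loader:
                inputs, targets = inputs.to(device), targets.to(device)
                logits = model(inputs)
                # Standard next-token cross-entropy over every position.
                loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                                       targets.reshape(-1))
                optimiser.zero_grad()
                loss.backward()
                optimiser.step()
            print(f"{name} epoch {epoch + 1}: last batch loss {loss.item():.4f}")
```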
early_stop.py halts training once the validation loss begins to increase.
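One common way to implement that check is a patience counter; the patience and min_delta values here are assumed, not read from early_stop.py:

```python
class EarlyStopper:
    # Stops training once validation loss has not improved for `patience` epochs.
    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def should_stop(self, val_loss):
        if val_loss < self.best - self.min_delta:
            self.best = val_loss    # improvement: remember it and reset the counter
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1    # validation loss rose (or stalled) this epoch
        return self.bad_epochs >= self.patience
```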
generate.py generates output from the trained model, using top-k sampling and a repetition penalty to diversify the output.
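A sketch of top-k sampling with a repetition penalty, assuming a tokeniser with Hugging Face-style encode/decode and the model shapes from the sketch above; the context window, k and penalty values are illustrative:

```python
import torch

@torch.no_grad()
def generate(model, tokeniser, prompt, max_new_tokens=100, k=40,
             rep_penalty=1.2, max_len=64):
    ids = torch.tensor([tokeniser.encode(prompt).ids])
    model.eval()
    for _ in range(max_new_tokens):
        logits = model(ids[:, -max_len:])[0, -1]   # logits for the next token
        # Repetition penalty: push down tokens already present in the context.
        for t in set(ids[0].tolist()):
            logits[t] = logits[t] / rep_penalty if logits[t] > 0 else logits[t] * rep_penalty
        # Top-k: sample only from the k most likely tokens, renormalised.
        topk = torch.topk(logits, k)
        probs = torch.softmax(topk.values, dim=-1)
        next_id = topk.indices[torch.multinomial(probs, 1)]
        ids = torch.cat([ids, next_id.view(1, 1)], dim=1)
    return tokeniser.decode(ids[0].tolist())
```

A few unedited samples from the trained model: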
<User>: "Ah , I think that's true."
<User>: "I mean..."
<User>: "No problem!"
<User>: "Not at all."
<User>: "It's a surprise that you can ' t wait for someone to make some tea..."
Natsuki: "I can't wait !"
Natsuki: "I don't do anything , but I don't have any of your work at all."
Natsuki: "Eh? What?"
Natsuki: "I can't take it seriously to your house in the first place."
<User>: "Yeah , I'll go ahead and get started soon."
More unformatted outputs can be found in generated_examples.txt.
