An SLM (small language model) trained to generate Visual Novel scripts in the style of Doki Doki Literature Club.
The model is a decoder-only transformer with 4 layers and 2,553,331 parameters in total.
The SLM undergoes Curriculum Learning, a training strategy in which the model is first shown simple examples and gradually introduced to more complex ones.
The model is first trained on Tiny Stories - a collection of short fairy tales generated by GPT-3.5 - so that it learns basic grammar and sentence structure.
The model is then trained on generic Visual Novels - a large text file containing scripts from a variety of Visual Novels - so that it learns the basic structure of VN scripts.
Finally, the model is trained on the entirety of the Doki Doki Literature Club VN script - so that it learns how to structure its final outputs.
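Concretely, the curriculum amounts to an ordered list of training stages. The file names and epoch counts below are illustrative assumptions, not the project's actual values:

```python
# Hypothetical stage definitions: (corpus file, epochs), in curriculum order.
CURRICULUM = [
    ("data/tiny_stories.txt", 3),   # stage 1: basic grammar and sentence structure
    ("data/generic_vns.txt", 2),    # stage 2: general VN script structure
    ("data/ddlc_script.txt", 5),    # stage 3: target style, Doki Doki Literature Club
]
```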
tokenise.py gathers the relevant lines from the input text files, trains the BPE tokeniser and tokenises each line into a fixed-length vector of INPUT_LENGTH tokens.
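A minimal sketch of what that step might look like, assuming the Hugging Face tokenizers library; the vocabulary size, special tokens and INPUT_LENGTH value here are illustrative rather than the ones used in tokenise.py:

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

INPUT_LENGTH = 64  # assumed value; the real constant lives in tokenise.py

def train_tokeniser(files, vocab_size=4096):
    # Train a byte-pair-encoding tokeniser on the gathered script lines.
    tokeniser = Tokenizer(BPE(unk_token="[UNK]"))
    tokeniser.pre_tokenizer = Whitespace()
    trainer = BpeTrainer(vocab_size=vocab_size, special_tokens=["[UNK]", "[PAD]"])
    tokeniser.train(files, trainer)
    # Pad or truncate every line to a fixed-size vector of INPUT_LENGTH ids.
    tokeniser.enable_padding(pad_id=tokeniser.token_to_id("[PAD]"),
                             pad_token="[PAD]", length=INPUT_LENGTH)
    tokeniser.enable_truncation(max_length=INPUT_LENGTH)
    return tokeniser

# Usage: train_tokeniser(["data/ddlc_script.txt"]).encode('Natsuki: "Eh? What?"').ids
```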
model.py contains the PyTorch model definition for SmallTransformer.
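A sketch of how a 4-layer decoder-only model can be built from PyTorch's stock transformer layers; the widths and vocabulary size below are guesses and will not reproduce the exact 2,553,331-parameter count:

```python
import torch
import torch.nn as nn

class SmallTransformer(nn.Module):
    """Decoder-only language model: 4 layers of causal self-attention."""
    def __init__(self, vocab_size=4096, d_model=128, n_heads=4, n_layers=4, max_len=64):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, dim_feedforward=4 * d_model,
                                           batch_first=True, norm_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, x):
        seq_len = x.size(1)
        # Causal mask so each position only attends to earlier tokens.
        mask = nn.Transformer.generate_square_subsequent_mask(seq_len).to(x.device)
        h = self.tok_emb(x) + self.pos_emb(torch.arange(seq_len, device=x.device))
        h = self.blocks(h, mask=mask)
        return self.lm_head(h)  # (batch, seq_len, vocab_size) next-token logits
```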
train.py uses the Curriculum Learning strategy to train the model on Tiny Stories, then generic visual novel text, then Doki Doki text.
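A sketch of the loop train.py might implement; the (name, dataloader, epochs) stage interface, the optimiser choice and the learning rate are assumptions:

```python
import torch
import torch.nn.functional as F

def train_curriculum(model, stages, lr=3e-4, device="cpu"):
    # `stages` is an ordered list of (name, dataloader, epochs); each dataloader
    # yields (inputs, targets) batches of token ids offset by one position.
    optimiser = torch.optim.AdamW(model.parameters(), lr=lr)
    model.to(device).train()
    for name, loader, epochs in stages:          # simple -> complex, in order
        for epoch in range(epochs):
            for inputs, targets in loader:
                inputs, targets = inputs.to(device), targets.to(device)
                logits = model(inputs)
                # Standard next-token cross-entropy over every position.
                loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                                       targets.reshape(-1))
                optimiser.zero_grad()
                loss.backward()
                optimiser.step()
            print(f"{name} epoch {epoch + 1}: last batch loss {loss.item():.4f}")
```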
early_stop.py halts training once the validation loss begins to increase.
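One common way to implement that check is a patience counter; the patience and min_delta values here are assumed, not read from early_stop.py:

```python
class EarlyStopper:
    # Stops training once validation loss has not improved for `patience` epochs.
    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def should_stop(self, val_loss):
        if val_loss < self.best - self.min_delta:
            self.best = val_loss    # improvement: remember it and reset the counter
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1    # validation loss rose (or stalled) this epoch
        return self.bad_epochs >= self.patience
```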
generate.py generates output from the trained model, using top-k sampling and a repetition penalty to diversify the output.
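A sketch of top-k sampling with a repetition penalty, assuming a tokeniser with Hugging Face-style encode/decode and the model shapes from the sketch above; the context window, k and penalty values are illustrative:

```python
import torch

@torch.no_grad()
def generate(model, tokeniser, prompt, max_new_tokens=100, k=40,
             rep_penalty=1.2, max_len=64):
    ids = torch.tensor([tokeniser.encode(prompt).ids])
    model.eval()
    for _ in range(max_new_tokens):
        logits = model(ids[:, -max_len:])[0, -1]   # logits for the next token
        # Repetition penalty: push down tokens already present in the context.
        for t in set(ids[0].tolist()):
            logits[t] = logits[t] / rep_penalty if logits[t] > 0 else logits[t] * rep_penalty
        # Top-k: sample only from the k most likely tokens, renormalised.
        topk = torch.topk(logits, k)
        probs = torch.softmax(topk.values, dim=-1)
        next_id = topk.indices[torch.multinomial(probs, 1)]
        ids = torch.cat([ids, next_id.view(1, 1)], dim=1)
    return tokeniser.decode(ids[0].tolist())
```

A few unedited samples from the trained model: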
<User>: "Ah , I think that's true."
<User>: "I mean..."
<User>: "No problem!"
<User>: "Not at all."
<User>: "It's a surprise that you can ' t wait for someone to make some tea..."
Natsuki: "I can't wait !"
Natsuki: "I don't do anything , but I don't have any of your work at all."
Natsuki: "Eh? What?"
Natsuki: "I can't take it seriously to your house in the first place."
<User>: "Yeah , I'll go ahead and get started soon."
More unformatted outputs can be found in generated_examples.txt.
