Data generation / retrieval for Training #14

JakubSchwenkbeck · 2024-12-21T19:19:31Z

Motivation

Implemented a scalable and reliable mechanism to retrieve datasets from raw text and tokenize them. Also added new structs that hold all values (matrices) involved in training/loss optimization.
Resolves #13

Changes

Added a new generation.rs file that reads a raw text file, processes the data, and then generates an input and target pair for training.
The tokenizer is reworked to expose more public functions and now only needs the raw input text as sentences (Vec<&str>).
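
For readers who want a feel for the flow described above, here is a minimal sketch of it, not the code from this PR. Every name in it (`Tokenizer`, `split_words`, `read_raw_text`, `staircase_pairs`) is hypothetical. It assumes a word-level vocabulary with `<UNK>` at id 0, lowercasing for case insensitivity, and one possible reading of the "staircase" approach in which the input grows by one token per step and the target is the remainder.

```rust
use std::collections::HashMap;
use std::fs;
use std::io;
use std::path::Path;

/// Placeholder token for words that are not in the vocabulary.
pub const UNK: &str = "<UNK>";

/// Hypothetical word-level tokenizer (an assumption, not the PR's type).
pub struct Tokenizer {
    vocab: HashMap<String, usize>,
}

impl Tokenizer {
    /// Builds the vocabulary from raw sentences; `<UNK>` always gets id 0.
    pub fn new(sentences: &[&str]) -> Self {
        let mut vocab = HashMap::new();
        vocab.insert(UNK.to_string(), 0);
        for sentence in sentences {
            for word in split_words(sentence) {
                // Ids are assigned in order of first appearance.
                let next_id = vocab.len();
                vocab.entry(word).or_insert(next_id);
            }
        }
        Tokenizer { vocab }
    }

    /// Maps text to token ids; unknown words fall back to the `<UNK>` id.
    pub fn tokenize(&self, text: &str) -> Vec<usize> {
        let unk = self.vocab[UNK];
        split_words(text)
            .map(|w| self.vocab.get(&w).copied().unwrap_or(unk))
            .collect()
    }
}

/// Splits on non-alphanumeric characters and lowercases each word.
fn split_words(text: &str) -> impl Iterator<Item = String> + '_ {
    text.split(|c: char| !c.is_alphanumeric())
        .filter(|w| !w.is_empty())
        .map(|w| w.to_lowercase())
}

/// Reads a raw text file into memory.
pub fn read_raw_text(path: &Path) -> io::Result<String> {
    fs::read_to_string(path)
}

/// Assumed "staircase" scheme: for each split point, the prefix is the
/// input and the rest is the target. Borrows slices instead of copying.
pub fn staircase_pairs(tokens: &[usize]) -> Vec<(&[usize], &[usize])> {
    (1..tokens.len()).map(|i| tokens.split_at(i)).collect()
}
```

With a vocabulary built from "Hello world" and "hello Rust", tokenizing "hello world rust" yields three ids, and `staircase_pairs` turns them into two (input, target) pairs. Returning borrowed slices keeps the pairs cheap to build, at the cost of tying them to the lifetime of the token buffer.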

Test

Run with an example of your choice and observe the output.

JakubSchwenkbeck added 8 commits December 21, 2024 19:13

tokenizer implemented <UNK>, case insensitivity and better REGEX

fa9a4b8

started a dataset definition and adjusted embedding-weights sizes

1cba942

implemented new model of tokenizer

152ee22

Full implementation of data generation with tokenizer rework

c37d01f

Implemented IO for file reading and then data generation

47dccc4

Integrated learnable matrices with fitting size parameters for whole …

8599517

…model

implemented staircase approach for TrainingsData

041966f

migration from Vec<usize> to &[usize]

e8d0ffb

JakubSchwenkbeck merged commit 31d3369 into main Dec 21, 2024
1 check passed
