Data generation / retrieval for Training #14

Merged
merged 8 commits into from
Dec 21, 2024

JakubSchwenkbeck
Owner

Motivation

Implemented a scalable, reliable mechanism to retrieve datasets from raw text and tokenize them. Also added new structs that hold the values (matrices) involved in training and loss optimization.
Resolves #13

Changes

  • Added a new generation.rs file that reads a raw text file, processes the data, and generates an input and target pair for training
  • Reworked the tokenizer to expose more public functions and to take only the raw input text as sentences (Vec<&str>)
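
The flow described above (raw text, then sentences, then tokens, then an input/target pair) can be sketched as follows. This is a minimal illustration and not the code in generation.rs: the names `Vocab`, `split_sentences`, `input_target_pair`, `pairs_from_text`, and `pairs_from_file` are hypothetical, the whitespace word tokenizer is a stand-in for the project's tokenizer, and the next-token shift used to build the target is an assumption about how the pairs are formed.

```rust
use std::collections::HashMap;
use std::fs;
use std::io;
use std::path::Path;

/// Hypothetical vocabulary: maps each distinct lowercased word to an integer id.
/// Id 0 is reserved for words that were never seen while building the vocabulary.
pub struct Vocab {
    ids: HashMap<String, usize>,
}

impl Vocab {
    /// Assign ids in order of first appearance, starting at 1.
    pub fn from_sentences(sentences: &[&str]) -> Self {
        let mut ids = HashMap::new();
        for sentence in sentences {
            for word in sentence.split_whitespace() {
                let next = ids.len() + 1;
                ids.entry(word.to_lowercase()).or_insert(next);
            }
        }
        Vocab { ids }
    }

    /// Encode a sentence as token ids; unknown words map to 0.
    pub fn encode(&self, sentence: &str) -> Vec<usize> {
        sentence
            .split_whitespace()
            .map(|w| self.ids.get(&w.to_lowercase()).copied().unwrap_or(0))
            .collect()
    }
}

/// Split raw text into trimmed, non-empty sentences on '.', '!' and '?'.
pub fn split_sentences(raw: &str) -> Vec<&str> {
    raw.split(|c: char| matches!(c, '.' | '!' | '?'))
        .map(str::trim)
        .filter(|s| !s.is_empty())
        .collect()
}

/// Assumed pairing scheme: the input is every token but the last,
/// the target is every token but the first (next-token prediction).
/// Returns None when there are fewer than two tokens.
pub fn input_target_pair(tokens: &[usize]) -> Option<(Vec<usize>, Vec<usize>)> {
    if tokens.len() < 2 {
        return None;
    }
    let input = tokens[..tokens.len() - 1].to_vec();
    let target = tokens[1..].to_vec();
    Some((input, target))
}

/// Build one (input, target) pair per sentence that has at least two tokens.
pub fn pairs_from_text(raw: &str) -> Vec<(Vec<usize>, Vec<usize>)> {
    let sentences = split_sentences(raw);
    let vocab = Vocab::from_sentences(&sentences);
    sentences
        .iter()
        .filter_map(|s| input_target_pair(&vocab.encode(s)))
        .collect()
}

/// Read a raw text file from disk and build the training pairs from it.
pub fn pairs_from_file(path: &Path) -> io::Result<Vec<(Vec<usize>, Vec<usize>)>> {
    let raw = fs::read_to_string(path)?;
    Ok(pairs_from_text(&raw))
}
```

Calling `pairs_from_text("The cat sat. The dog sat!")` yields two pairs, one per sentence; `pairs_from_file` does the same for a file path and returns the I/O error if the file cannot be read.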

Test

Run with an example of your choice and observe the output.

@JakubSchwenkbeck JakubSchwenkbeck merged commit 31d3369 into main Dec 21, 2024
1 check passed
Development

Successfully merging this pull request may close these issues.

Prepare Data for Machine Learning