EDIT: Machine learning has advanced significantly with the development of large language models such as OpenAI's GPT series (e.g., GPT-3.5, GPT-4), Google's Gemini, and Meta's LLaMA. These models address similar challenges, but this repository takes a different approach, using different techniques.
The goal of this project is to explore the feasibility of creating artificial conversational agents, or chatbots, utilizing novel sequence-to-sequence methods inspired by progress in natural language processing and neural machine translation (NMT).
The NMT model is trained on a dataset of Reddit comments that contains every publicly available comment posted on the platform since 2005. The dataset can be accessed through the provided link.
The hypothesis driving this project is that by feeding comment-response pairs to the NMT, it will learn to associate similar responses with their corresponding comments. With a sufficiently large dataset and adequate computational resources, the model is expected to generate coherent responses to any given input.
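As a rough illustration of what a comment-response pair could look like, the sketch below links each reply to its parent comment. The record layout (the `id`, `parent_id`, and `body` fields) and the `build_pairs` helper are assumptions made for this example, not the repository's actual code.

```python
# Hypothetical sketch: the field names `id`, `parent_id`, and `body`
# are assumptions about how each comment record is laid out.
def build_pairs(comments):
    """Return (comment, response) text pairs by linking each reply to its parent."""
    by_id = {c["id"]: c["body"] for c in comments}
    pairs = []
    for c in comments:
        parent_body = by_id.get(c["parent_id"])
        # Skip top-level comments and replies whose parent is not in the sample.
        if parent_body is not None:
            pairs.append((parent_body, c["body"]))
    return pairs


comments = [
    {"id": "a1", "parent_id": None, "body": "What is your favorite editor?"},
    {"id": "a2", "parent_id": "a1", "body": "Vim, mostly out of habit."},
    {"id": "a3", "parent_id": "a1", "body": "I switch between a few."},
    {"id": "a4", "parent_id": "zz", "body": "Reply to a comment not in the sample."},
]

pairs = build_pairs(comments)
for source, target in pairs:
    print(f"{source!r} -> {target!r}")
```

Each pair's first element would serve as the model input and the second as the target sequence.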
- Dataframe Creation: Generate a dataframe upon request, pickle its tensor representation, and save tokenizers for future use. If not requested, utilize existing resources.
- Training Parameter Initialization: Set training parameters including the number of epochs, buffer size, batch size, embedding dimension, and the number of hidden units for both encoding and decoding. Also, determine the vocabulary sizes for the input and output corpora.
- Tensor Batching: Create batches of tensors from the dataset for training.
- Model Components Initialization: Initialize the encoder, decoder, and optimizer.
- Checkpoint System: Implement a checkpoint system to save model states during training, allowing for recovery in case of interruptions or for comparative analysis of model iterations.
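The steps above can be sketched without any deep learning framework, as shown below. This is only an illustration of the control flow: `TrainingParams`, `make_batches`, `save_checkpoint`, and `load_latest_checkpoint` are hypothetical names, the buffer size is a placeholder, and the actual code presumably relies on a deep learning library for the encoder, decoder, optimizer, and checkpoints.

```python
import pickle
import random
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass
class TrainingParams:
    """Training parameters; all names here are hypothetical."""
    epochs: int = 30
    batch_size: int = 10
    embedding_dim: int = 256
    units: int = 256
    buffer_size: int = 1000  # assumed placeholder; the README gives no value


def make_batches(examples, batch_size, seed=0):
    """Shuffle the examples and split them into full batches, dropping any remainder."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    full = len(shuffled) - len(shuffled) % batch_size
    return [shuffled[i:i + batch_size] for i in range(0, full, batch_size)]


def save_checkpoint(directory, epoch, state):
    """Pickle the training state to a file named after the epoch."""
    path = Path(directory) / f"ckpt-{epoch:04d}.pkl"
    with path.open("wb") as f:
        pickle.dump(state, f)
    return path


def load_latest_checkpoint(directory):
    """Return the state from the highest-numbered checkpoint, or None if there is none."""
    paths = sorted(Path(directory).glob("ckpt-*.pkl"))
    if not paths:
        return None
    with paths[-1].open("rb") as f:
        return pickle.load(f)


params = TrainingParams()
# Toy stand-in for the tokenized (input, target) tensors.
examples = [([i, i + 1], [i + 2]) for i in range(25)]
batches = make_batches(examples, params.batch_size)

with tempfile.TemporaryDirectory() as ckpt_dir:
    for epoch in range(1, 4):  # shortened loop for illustration
        # A real loop would run the encoder, decoder, and optimizer here.
        state = {"epoch": epoch, "batches_seen": epoch * len(batches)}
        save_checkpoint(ckpt_dir, epoch, state)
    restored = load_latest_checkpoint(ckpt_dir)

print(len(batches), restored)
```

Saving one file per epoch, rather than overwriting a single file, is what makes it possible both to resume after an interruption and to compare model iterations.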
The project is extensively commented, so any additional information can be found in the code itself.
The repository contains a model trained with the following characteristics:
- number of epochs = 30
- batch size = 10
- embedding dimension = 256
- number of hidden recurrent units = 256
- optimizer = Adam
- maximum sentence length (in words) = 10
- final loss = 0.9
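To make the sentence-length limit concrete, here is a minimal sketch of one plausible reading: pairs containing a sentence longer than 10 words are dropped, and shorter sentences are padded to a fixed length. The repository may instead truncate long sentences. The whitespace tokenizer, the padding index, and all function names are assumptions made for this example.

```python
MAX_LEN = 10  # maximum sentence length in words, from the list above
PAD_ID = 0    # assumed padding index


def tokenize(sentence):
    """Naive whitespace tokenizer, used here only for illustration."""
    return sentence.lower().split()


def filter_pairs(pairs, max_len=MAX_LEN):
    """Keep only pairs where both sentences fit within max_len words."""
    return [
        (s, t) for s, t in pairs
        if len(tokenize(s)) <= max_len and len(tokenize(t)) <= max_len
    ]


def build_vocab(sentences):
    """Assign an integer id to each word in order of first appearance."""
    vocab = {"<pad>": PAD_ID}
    for sentence in sentences:
        for word in tokenize(sentence):
            vocab.setdefault(word, len(vocab))
    return vocab


def encode(sentence, vocab, max_len=MAX_LEN):
    """Map words to ids and right-pad with PAD_ID up to max_len."""
    ids = [vocab[w] for w in tokenize(sentence)]
    return ids + [PAD_ID] * (max_len - len(ids))


pairs = [
    ("how are you", "fine thanks"),
    ("this comment has far too many words to fit inside the configured limit", "ok"),
]
kept = filter_pairs(pairs)
vocab = build_vocab([s for pair in kept for s in pair])
encoded = encode(kept[0][0], vocab)
print(kept, encoded)
```

The size of `vocab` built this way corresponds to the vocabulary size that the embedding layer would need for that corpus.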
Note: More sophisticated models were developed but were too large to be hosted on this GitHub repository.