This project implements a text generation model using the GPT-2 architecture. It loads text data from multiple folders, preprocesses the data, trains a custom model, and allows for generating responses based on user input.
- Load and preprocess text data from multiple directories.
- Train a GPT-2 model on the custom dataset.
- Generate text responses based on user prompts.
- Save the trained model and tokenizer for future use.
- Python 3.6 or newer, but below 3.13
- PyTorch
- Transformers library from Hugging Face
To set up the environment, you can use the following commands:
pip install torch
pip install transformers
The script loads text data from a specified root folder containing subfolders. Update the root_folder
variable in config.py to point to your data directory.
The loaded text data is preprocessed by joining all texts into a single string, separated by newlines.
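A minimal sketch of the loading and preprocessing steps is shown below. The function name `load_texts` and the `.txt` extension filter are illustrative assumptions; the actual script may organize this differently.

```python
import os

def load_texts(root_folder):
    """Walk every subfolder of root_folder and read each .txt file it contains."""
    texts = []
    for dirpath, _, filenames in os.walk(root_folder):
        for name in filenames:
            if name.endswith(".txt"):
                with open(os.path.join(dirpath, name), encoding="utf-8") as f:
                    texts.append(f.read())
    # Preprocessing: join all documents into one training string, separated by newlines
    return "\n".join(texts)
```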
The script initializes the GPT-2 tokenizer and model, setting the padding token to be the same as the end-of-sequence token.
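In Hugging Face Transformers terms, this setup looks roughly like the following; the `"gpt2"` checkpoint name is an assumption, and the script may load a different model size.

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# GPT-2 has no dedicated padding token, so reuse the end-of-sequence token
tokenizer.pad_token = tokenizer.eos_token
model.config.pad_token_id = model.config.eos_token_id
```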
The model is trained on the preprocessed text data. You can specify the number of epochs and batch size in the train_model
function.
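A hypothetical call might look like this; the exact signature of `train_model` is defined in the script, so the parameter names and values here are assumptions.

```python
# Hypothetical call; check the actual signature of train_model in Q&A.py
train_model(model, tokenizer, text_data, epochs=3, batch_size=4)
```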
After training, the model and tokenizer are saved to a specified directory.
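With Hugging Face Transformers, saving typically amounts to the following; the output directory name is illustrative.

```python
save_dir = "trained_model"  # assumed output directory; the script may use a different path
model.save_pretrained(save_dir)
tokenizer.save_pretrained(save_dir)
```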
The script allows for interactive text generation. You can input prompts, and the model will generate responses until you type 'end' to exit.
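Below is a minimal sketch of such a loop using the standard Transformers `generate` API; the sampling settings and `max_new_tokens` value are illustrative, not the script's exact configuration.

```python
import torch

# Assumes `model` and `tokenizer` are already loaded as shown above
while True:
    prompt = input("You: ")
    if prompt.strip().lower() == "end":
        break
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        output_ids = model.generate(
            **inputs,
            max_new_tokens=100,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id,
        )
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```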
To run the script, execute the following command in your terminal:
python MultipleFiles/Q&A.py
Once training completes, you can chat with the model. Keep in mind that this is a homemade AI, far less powerful than ChatGPT or Mistral, and the quality of its responses depends heavily on the training data.
Contributions are welcome! Please feel free to submit a pull request or open an issue for any suggestions or improvements.
This project is licensed under the MIT License. See the LICENSE file for more details.
- Hugging Face Transformers for providing the pre-trained models and tokenizers.
- PyTorch for the deep learning framework.