Note: The most recent model was trained on CPU using 1200 (query, positive, negative) samples. The training loss is visualized in loss_1200.jpeg.
pip install -r requirements.txt
The command below loads a subset of the dataset, cleans, flattens, and tokenizes it, and then trains the model on 10 samples to demonstrate that the pipeline runs, that the loss decreases, and that a similarity matrix is produced. To train on a larger dataset, update the subset_size value in src/config.py.
python -m src.run
An initially empty folder that later stores tokenized_msmacro_dataset\ and cleaned_dataset.json.
Main source code directory with modular components.
Unified training pipeline that handles data import, cleaning, preprocessing, and model training using the configuration defined in config.py.
Holds training settings such as the dataset path, batch size, learning rate, and number of epochs. The subset size is currently set to 10 to check that the pipeline works end to end.
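As a rough illustration of what such a configuration could look like, here is a minimal sketch. Only subset_size = 10 comes from the description above; the class name TrainConfig, the other field names, and their default values are assumptions made for this example and may not match the actual config.py.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class TrainConfig:
    """Hypothetical container for the training settings (names are assumptions)."""

    dataset_path: str = "path/to/dataset"  # placeholder, not the real path
    subset_size: int = 10                  # value stated in the README
    batch_size: int = 2                    # assumed default
    learning_rate: float = 1e-5            # assumed default
    num_epochs: int = 3                    # assumed default


# Default configuration: a 10-sample smoke test of the pipeline.
smoke_test_cfg = TrainConfig()

# A larger run only needs a different subset size.
larger_cfg = replace(smoke_test_cfg, subset_size=1200)
```

Keeping the settings in one immutable object makes it harder to change a value halfway through a run by accident.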
import_data.py - Loads raw data and extracts query, positive, and negative passages.
preprocess.py - Flattens and tokenizes the dataset into triplets and saves them using Hugging Face Dataset.save_to_disk.
collator.py - Defines a custom data collator that pads and batches query, positive, and negative inputs for training.
dataloader.py - Creates a PyTorch DataLoader for the tokenized dataset using a custom collator.
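The padding step performed by a triplet collator can be sketched without any framework. The version below is an illustration only: the function names, the PAD_ID value, and the dictionary layout are assumptions, and the real collator.py presumably returns PyTorch tensors, not Python lists.

```python
PAD_ID = 0  # assumed padding id; in practice this comes from the tokenizer


def pad_field(sequences, pad_id=PAD_ID):
    """Right-pad token-id lists to the longest one and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


def collate_triplets(batch):
    """Pad the query, positive, and negative fields independently of each other."""
    roles = ("query", "positive", "negative")
    return {role: pad_field([example[role] for example in batch]) for role in roles}


# Two toy examples whose fields have different lengths.
batch = [
    {"query": [5, 6], "positive": [7, 8, 9], "negative": [10]},
    {"query": [11], "positive": [12], "negative": [13, 14]},
]
out = collate_triplets(batch)
```

A function of this shape can be passed to torch.utils.data.DataLoader through its collate_fn argument once the lists are converted to tensors.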
embeddingmodel.py - Wraps the Hugging Face transformer model (SmolLM2-135M) to produce sentence-level embeddings using the last token representation.
dualencoder.py - Combines the same embedding model for query, positive, and negative inputs. Performs normalization and returns embeddings.
tokenizer.py - Loads and returns the tokenizer used for preprocessing.
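Last-token pooling followed by normalization can be illustrated on plain Python lists. This is a sketch under stated assumptions: sequences are right-padded, the normalization is L2, and the function names are hypothetical. The actual modules operate on model outputs and may differ in detail.

```python
import math


def last_token_pool(hidden_states, attention_mask):
    """Pick the hidden vector of the last non-padding token (right padding assumed)."""
    last_index = sum(attention_mask) - 1
    return hidden_states[last_index]


def l2_normalize(vector, eps=1e-12):
    """Scale a vector to unit length; eps guards against division by zero."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / max(norm, eps) for x in vector]


# One sequence of three positions with 2-dimensional hidden states.
# The third position is padding and must be ignored by the pooling step.
hidden_states = [[1.0, 0.0], [3.0, 4.0], [9.0, 9.0]]
attention_mask = [1, 1, 0]

embedding = l2_normalize(last_token_pool(hidden_states, attention_mask))
```

With unit-length embeddings, a dot product between two of them equals their cosine similarity.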
utils.py - Includes save_checkpoint() to save the model periodically.
trainer.py - Contains train_embedding_model() and train_one_epoch() methods to train the model.
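The exact loss used by the trainer is not spelled out here. One common contrastive objective for (query, positive, negative) triplets is a margin loss on cosine similarity, sketched below. The margin value and the function names are assumptions, and the repository may use a different formulation.

```python
def dot(a, b):
    """Dot product; equals cosine similarity for unit-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def triplet_loss(query, positive, negative, margin=0.2):
    """Zero once the positive beats the negative by at least the margin."""
    return max(0.0, margin - dot(query, positive) + dot(query, negative))


def batch_loss(queries, positives, negatives, margin=0.2):
    """Mean triplet loss over a batch of already normalized embeddings."""
    losses = [
        triplet_loss(q, p, n, margin)
        for q, p, n in zip(queries, positives, negatives)
    ]
    return sum(losses) / len(losses)


# First triplet is ranked correctly, second one has the passages swapped.
queries = [[1.0, 0.0], [1.0, 0.0]]
positives = [[1.0, 0.0], [0.0, 1.0]]
negatives = [[0.0, 1.0], [1.0, 0.0]]

loss = batch_loss(queries, positives, negatives)
```

A training step would compute this value on the encoder outputs and backpropagate it, so a falling loss means positives are being pulled closer to their queries than negatives.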
Shows the contrastive loss trend over epochs during training on 1200 samples.
Contains:
- Logs from early experimentation on 500 samples.
- Saved similarity matrix and loss values for inspection.
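For readers inspecting the saved similarity matrix, the sketch below shows one plausible way such a matrix is built: entry (i, j) is the dot product of query i and passage j, assuming unit-length embeddings. The function name and the toy vectors are made up for illustration and are not taken from the logs.

```python
def similarity_matrix(queries, passages):
    """Entry [i][j] is the similarity between query i and passage j."""
    return [
        [sum(a * b for a, b in zip(query, passage)) for passage in passages]
        for query in queries
    ]


# Toy unit-length embeddings: passage i is the positive for query i.
queries = [[1.0, 0.0], [0.0, 1.0]]
positives = [[0.6, 0.8], [0.0, 1.0]]

sim = similarity_matrix(queries, positives)

# For each query, find which passage scores highest.
best_match = [row.index(max(row)) for row in sim]
```

When training works, the diagonal entry is the largest value in each row, meaning every query is closest to its own positive passage.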