Causal Dual Encoder

An embedding model built on a causal language model (HuggingFaceTB/SmolLM2-135M) and trained with a contrastive loss using in-batch and hard negatives.

Note: The most recent model was trained on CPU on 1200 (query, positive, negative) triplets. The training loss curve is shown in loss_1200.jpeg.

Section 1: Steps to run the code

Install dependencies:

pip install -r requirements.txt

To run the entire pipeline:

python -m src.run

This command loads a subset of the dataset; cleans, flattens, and tokenizes it; and then trains the model on 10 samples to demonstrate that the pipeline runs end to end, that the loss decreases, and that the similarity matrix behaves as expected. To train on a larger dataset, update the subset_size value in src/config.py.

Section 2: Project Structure & File Details

data/

An initially empty folder that later stores tokenized_msmacro_dataset/ and cleaned_dataset.json, the artifacts written by the data preparation steps.
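
For reference, a minimal sketch of how these artifacts could be reloaded after a run (the paths assume the defaults under data/, and the column layout of the tokenized dataset is an assumption):

import json
from datasets import load_from_disk

# Reload the artifacts written by the preprocessing step.
tokenized = load_from_disk("data/tokenized_msmacro_dataset")
with open("data/cleaned_dataset.json") as f:
    cleaned = json.load(f)

print(tokenized)      # tokenized (query, positive, negative) triplets
print(len(cleaned))   # number of cleaned examples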

src/

Main source code directory with modular components.

run.py

Unified training pipeline that handles data import, cleaning, preprocessing, and model training using configurations defined in config.py.

config.py

Holds training settings such as the dataset path, batch size, learning rate, and number of epochs. subset_size is currently set to 10 to verify that the pipeline works end to end.
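
The exact contents are not reproduced in this README; a plausible sketch with illustrative values (everything except subset_size = 10 and the base model name is an assumption) might look like:

# src/config.py (illustrative sketch; actual names and values may differ)
model_name = "HuggingFaceTB/SmolLM2-135M"        # base causal LM
dataset_path = "data/tokenized_msmacro_dataset"  # where preprocessing writes its output
subset_size = 10          # raise this to train on more samples
batch_size = 8            # assumed value
learning_rate = 2e-5      # assumed value
num_epochs = 3            # assumed value
max_length = 128          # assumed tokenization length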

src/data_prep/

  • import_data.py – Loads the raw data and extracts query, positive, and negative passages.
  • preprocess.py – Flattens and tokenizes the dataset into triplets and saves them using Hugging Face Dataset.save_to_disk.
  • collator.py – Defines a custom data collator that pads and batches the query, positive, and negative inputs for training.
  • dataloader.py – Creates a PyTorch DataLoader for the tokenized dataset using the custom collator (a sketch of how these two pieces fit together follows this list).
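
As a rough illustration of how collator.py and dataloader.py could fit together (the nested per-example structure and field names are assumptions, not the repository's actual schema):

# Illustrative collator for (query, positive, negative) triplets; the real
# collator.py may differ in field names and padding strategy.
class TripletCollator:
    def __init__(self, tokenizer):
        self.tokenizer = tokenizer

    def _pad(self, features):
        # Pad a list of {"input_ids": ..., "attention_mask": ...} dicts to the
        # longest sequence in the batch and stack them into tensors.
        return self.tokenizer.pad(features, padding=True, return_tensors="pt")

    def __call__(self, batch):
        return {
            "query": self._pad([ex["query"] for ex in batch]),
            "positive": self._pad([ex["positive"] for ex in batch]),
            "negative": self._pad([ex["negative"] for ex in batch]),
        }

# Usage with a torch.utils.data.DataLoader (tokenizer and dataset come from
# the other modules):
#   loader = DataLoader(tokenized_dataset, batch_size=8, shuffle=True,
#                       collate_fn=TripletCollator(tokenizer))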

src/model/

  • embeddingmodel.py – Wraps the Hugging Face transformer model (SmolLM2-135M) to produce sentence-level embeddings from the last-token representation (see the sketch after this list).
  • dualencoder.py – Applies the same embedding model to the query, positive, and negative inputs, normalizes the resulting embeddings, and returns them.
  • tokenizer.py – Loads and returns the tokenizer used for preprocessing.
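
A minimal sketch of these two pieces, assuming right-padded batches and shared weights across the three inputs (class names and implementation details in the actual files may differ):

import torch
import torch.nn.functional as F
from transformers import AutoModel

# Illustrative last-token pooling over the causal LM backbone.
class EmbeddingModel(torch.nn.Module):
    def __init__(self, name="HuggingFaceTB/SmolLM2-135M"):
        super().__init__()
        self.backbone = AutoModel.from_pretrained(name)

    def forward(self, input_ids, attention_mask):
        hidden = self.backbone(input_ids=input_ids,
                               attention_mask=attention_mask).last_hidden_state
        last = attention_mask.sum(dim=1) - 1          # index of the last real token
        return hidden[torch.arange(hidden.size(0)), last]

# Shares one embedding model across query, positive, and negative inputs
# and normalizes the embeddings to unit length.
class DualEncoder(torch.nn.Module):
    def __init__(self, embedder):
        super().__init__()
        self.embedder = embedder

    def forward(self, query, positive, negative):
        q = F.normalize(self.embedder(**query), dim=-1)
        p = F.normalize(self.embedder(**positive), dim=-1)
        n = F.normalize(self.embedder(**negative), dim=-1)
        return q, p, n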

src/training/

  • utils.py – Includes save_checkpoint() to save the model periodically during training.
  • trainer.py – Contains train_embedding_model() and train_one_epoch(), which train the model; a sketch of the contrastive training step follows this list.
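
The contrastive objective named in the project summary can be read as a cross-entropy over similarities: each query is scored against every positive and every hard negative in the batch, and its own positive is the target class. A minimal sketch of one such step (the temperature and the exact formulation used in trainer.py are assumptions):

import torch
import torch.nn.functional as F

def contrastive_step(q, p, n, temperature=0.05):
    # q, p, n: [batch, dim] normalized query / positive / hard-negative embeddings.
    # Candidates for every query: all in-batch positives plus all hard negatives.
    candidates = torch.cat([p, n], dim=0)                  # [2 * batch, dim]
    logits = q @ candidates.T / temperature                # scaled cosine similarities
    targets = torch.arange(q.size(0), device=q.device)     # i-th positive matches query i
    return F.cross_entropy(logits, targets)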

Section 3: Additional Files

loss_1200.jpeg

Shows the contrastive loss trend over epochs during training on 1200 samples.

train_experiments.ipynb

Contains:

  • Logs from early experimentation on 500 samples.
  • Saved similarity matrix and loss values for inspection.
