Causal Dual Encoder

Note: The most recent model was trained on CPU using 1200 (query, positive, negative) samples. The training loss is visualized in loss_1200.jpeg.

Section 1: Steps to run the code

Install dependencies:

pip install -r requirements.txt

To run the entire pipeline:

The command below loads a subset of the dataset, cleans, flattens, and tokenizes it, and then trains the model on 10 samples. This small run demonstrates that the pipeline works end to end, that the loss decreases, and that the similarity matrix is produced. To train on a larger dataset, increase the subset_size value in src/config.py.

python -m src.run

Section 2: Project Structure & File Details

`data/`

An initially empty folder. The pipeline later stores tokenized_msmacro_dataset/ and cleaned_dataset.json here.

`src/`

Main source code directory with modular components.

`run.py`

Unified training pipeline that handles data import, cleaning, preprocessing, and model training using configurations defined in config.py.

`config.py`

Holds training settings such as the dataset path, batch size, learning rate, and number of epochs. subset_size is currently set to 10 as a quick check that the pipeline works end to end.
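As an illustration of the kind of settings described above, here is a minimal sketch of a configuration object. Only subset_size and its value of 10 come from this README. Every other field name and default is a hypothetical placeholder and does not reflect the actual contents of src/config.py.

```python
from dataclasses import dataclass


@dataclass
class TrainConfig:
    # subset_size is named in this README; 10 is the smoke-test value it mentions.
    subset_size: int = 10
    # Everything below is a hypothetical placeholder, not taken from the repository.
    dataset_path: str = "data/tokenized_msmacro_dataset"
    batch_size: int = 2
    learning_rate: float = 1e-5
    num_epochs: int = 3


# Default settings for a quick pipeline check.
smoke_cfg = TrainConfig()
# Override subset_size to train on more samples, e.g. the 1200 mentioned in the note above.
larger_cfg = TrainConfig(subset_size=1200)
print(smoke_cfg.subset_size, larger_cfg.subset_size)
```

Keeping the subset size next to the other settings means a larger run needs only a one-value change rather than edits to the pipeline code.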

`src/data_prep/`

import_data.py - Loads raw data and extracts query, positive, and negative passages.
preprocess.py - Flattens and tokenizes the dataset into triplets and saves them using Hugging Face Dataset.save_to_disk.
collator.py - Defines a custom data collator that pads and batches query, positive, and negative inputs for training.
dataloader.py - Creates a PyTorch DataLoader for the tokenized dataset using a custom collator.
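The flattening and padding steps described above can be sketched without any framework. This is an assumption-laden illustration, not the repository's code: the record field names (query, positives, negatives), the pairing of every positive with every negative, and the pad id of 0 are all hypothetical. In the real pipeline the collator would presumably return PyTorch tensors rather than nested lists.

```python
from typing import Dict, List

PAD_ID = 0  # assumed pad token id; the real value would come from the tokenizer


def flatten_to_triplets(record: Dict) -> List[Dict[str, str]]:
    """Expand one record into (query, positive, negative) triplets.

    Pairing every positive with every negative is an assumption made here.
    """
    return [
        {"query": record["query"], "positive": pos, "negative": neg}
        for pos in record["positives"]
        for neg in record["negatives"]
    ]


def pad_batch(sequences: List[List[int]], pad_id: int = PAD_ID) -> Dict[str, List[List[int]]]:
    """Right-pad token id lists to the longest one and build matching attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


def collate(features: List[Dict[str, List[int]]]) -> Dict[str, Dict[str, List[List[int]]]]:
    """Pad query, positive, and negative inputs separately, each to its own max length."""
    return {
        role: pad_batch([f[role] for f in features])
        for role in ("query", "positive", "negative")
    }


# Toy data: one raw record, and two already-tokenized triplets.
triplets = flatten_to_triplets({"query": "q", "positives": ["p1"], "negatives": ["n1", "n2"]})
batch = collate([
    {"query": [5, 6], "positive": [7, 8, 9], "negative": [1]},
    {"query": [5], "positive": [7], "negative": [1, 2]},
])
print(len(triplets), batch["query"]["input_ids"])
```

Padding each role separately avoids stretching short queries to the length of long passages.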

`src/model/`

embeddingmodel.py – Wraps the Hugging Face transformer model (SmolLM2-135M) to produce sentence-level embeddings using the last token representation.
dualencoder.py – Applies the same shared embedding model to the query, positive, and negative inputs, normalizes the resulting embeddings, and returns them.
tokenizer.py – Loads and returns the tokenizer used for preprocessing.
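In a causal model each position attends only to earlier positions, so the last real token is the only one that has seen the whole sequence. That is why it is a natural choice for a sentence-level embedding. The sketch below shows last-token selection followed by L2 normalization on plain Python lists. It assumes right padding and uses hypothetical function names. The repository's embeddingmodel.py and dualencoder.py presumably do the equivalent on tensors.

```python
import math
from typing import List

Vector = List[float]


def last_token_embedding(hidden_states: List[Vector], attention_mask: List[int]) -> Vector:
    """Return the hidden state at the last non-padded position of one sequence."""
    # Assumes right padding: real tokens first (mask 1), then padding (mask 0).
    last_index = sum(attention_mask) - 1
    return hidden_states[last_index]


def l2_normalize(vec: Vector, eps: float = 1e-12) -> Vector:
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / max(norm, eps) for x in vec]


# Toy sequence of three positions; the third is padding and must be ignored.
hidden = [[1.0, 0.0], [3.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]
embedding = l2_normalize(last_token_embedding(hidden, mask))
print(embedding)
```

With left padding the last real token would instead sit at the final position, so the index computation depends on the tokenizer's padding side.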

`src/training/`

utils.py – Provides save_checkpoint() to save the model periodically.
trainer.py – Contains train_embedding_model() and train_one_epoch() methods to train the model.
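This README does not spell out the loss formula used by the trainer. As one common way to train a dual encoder on (query, positive, negative) triplets, the sketch below builds a similarity matrix between queries and all candidate passages and applies a softmax cross-entropy in which query i should match positive i. The function names and the temperature value are hypothetical, and the repository's loss may differ.

```python
import math
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def similarity_matrix(queries: List[Vector], candidates: List[Vector]) -> List[List[float]]:
    """Row i holds the similarity of query i to every candidate passage."""
    return [[dot(q, c) for c in candidates] for q in queries]


def contrastive_loss(
    queries: List[Vector],
    positives: List[Vector],
    negatives: List[Vector],
    temperature: float = 0.05,  # hypothetical value
) -> float:
    """Mean cross-entropy where the correct candidate for query i is positive i."""
    candidates = positives + negatives
    sims = similarity_matrix(queries, candidates)
    total = 0.0
    for i, row in enumerate(sims):
        logits = [s / temperature for s in row]
        max_logit = max(logits)  # subtract the max for numerical stability
        log_sum_exp = max_logit + math.log(sum(math.exp(z - max_logit) for z in logits))
        total += log_sum_exp - logits[i]
    return total / len(sims)


# Unit-length toy embeddings: aligned pairs should score a lower loss than swapped pairs.
queries = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
negatives = [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]
aligned = contrastive_loss(queries, [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]], negatives)
swapped = contrastive_loss(queries, [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0]], negatives)
print(aligned < swapped)
```

Because the embeddings are normalized, the dot products are cosine similarities, and the temperature controls how sharply the softmax separates the positive from the other candidates.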

Section 3: Additional Files

`loss_1200.jpeg`

Shows the contrastive loss trend over epochs during training on 1200 samples.

`train_experiments.ipynb`

Contains:

Logs from early experimentation on 500 samples.
Saved similarity matrix and loss values for inspection.
