
Ancient Greek Variant SBERT

An embedding model for Ancient Greek biblical verse similarity and variant clustering.
It produces 768-dimensional sentence embeddings (SentenceTransformers) that work well for:

  • semantic similarity / duplicate & near-duplicate detection
  • clustering verse variants
  • semantic search over verse corpora

Base model: pranaydeeps/Ancient-Greek-BERT

Model on Hugging Face: Paulanerus/AncientGreekVariantSBERT
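
For a quick start, the released model can be used directly with sentence-transformers. A minimal sketch (the verse strings are placeholders; note the normalization conventions described below):

from sentence_transformers import SentenceTransformer

# Load the released model from the Hugging Face Hub
model = SentenceTransformer("Paulanerus/AncientGreekVariantSBERT")

# Two verse variants (placeholders, already accent-stripped and lowercased)
sentences = [
    "εν αρχη ην ο λογος",
    "εν αρχη ην ο λογος και ο λογος ην προς τον θεον",
]

embeddings = model.encode(sentences)             # shape: (2, 768)
print(model.similarity(embeddings, embeddings))  # cosine similarity matrix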


Repository

This repo contains:

  • a data preparation pipeline (src/prepare.py) that builds train/dev/eval splits and training pairs
  • a training script (src/train.py) using MultipleNegativesRankingLoss
  • a small inference script (src/infer.py) that prints cosine similarities for sentence pairs
  • a convenience entrypoint script run.sh

Setup

Make the helper script executable, create a virtual environment, and install dependencies:

# Make the main helper script executable
chmod +x run.sh

# Create and activate a virtual environment
python3 -m venv .venv
source .venv/bin/activate

# Install packages (example CUDA build; adjust to your system)
pip install torch==2.9.1+cu126 \
  transformers==4.57.1 \
  sentence-transformers==5.1.2 \
  numpy==2.3.5 \
  pandas==2.3.3 \
  scipy==1.16.3 \
  datasets==4.4.1 \
  accelerate==1.12.0 \
  --extra-index-url https://download.pytorch.org/whl/cu126

You can then use run.sh:

  • ./run.sh prepare – data prep (required before training)
  • ./run.sh train – fine-tune the model
  • ./run.sh infer – run inference using the trained model in model/

Data preparation (src/prepare.py)

Dataset

Download verses.csv from Version 4 (released July 2, 2025) of this Zenodo dataset.

  1. Download verses.csv
  2. Place it in data/ (path: data/verses.csv)
  3. Run ./run.sh prepare

This creates an intermediate (cleaned/normalized) dataset and then generates the train/dev/eval splits and the training pairs consumed by train.py.

prepare.py reads the raw verses CSV configured in src/config.py (RAW_VERSES, default data/verses.csv) with these columns:

  • verse_id: verse identifier (used for stable ordering + de-duplication)
  • text: the verse text
  • nkv: verse identifier in the NKV scheme (used as the grouping key that indicates which verses are variants of the same underlying verse)

It first generates data/temp_verses.csv (see TEMP_VERSES in src/config.py) where nkv is converted to an integer nkv_group.
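
The exact conversion is defined in src/prepare.py; the following sketch illustrates the idea (the use of pandas.factorize is an assumption, not necessarily what prepare.py does internally):

import pandas as pd

df = pd.read_csv("data/verses.csv").drop_duplicates()

# Map each distinct nkv value to an integer group id (assumed approach)
df["nkv_group"] = pd.factorize(df["nkv"])[0]

df.to_csv("data/temp_verses.csv", index=False)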

What it does:

  1. Reads data/verses.csv, de-duplicates rows, and writes data/temp_verses.csv
  2. Drops rows with missing text / nkv_group
  3. Normalizes text (accent-stripping + lowercasing; see the sketch after this list)
  4. Removes groups with fewer than 2 samples
  5. Splits by group id into train/dev/eval (defaults: 95% / 2.5% / 2.5%)
  6. Writes split CSVs (data/processed/{train,dev,eval}.csv) and also generates training pairs (data/processed/{train,dev,eval}_pairs.csv)
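
One standard way to implement the normalization in step 3, shown here as a sketch (prepare.py's exact implementation may differ):

import unicodedata

def normalize(text: str) -> str:
    # Decompose characters, drop combining marks (accents, breathings,
    # iota subscripts), then lowercase
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.lower()

print(normalize("Ἐν ἀρχῇ ἦν ὁ Λόγος"))  # εν αρχη ην ο λογος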

Important: running prepare is required before training, because train.py reads the prepared pairs/splits.

Pair generation (positives only)

The generated *_pairs.csv files contain only positive pairs: both sentences come from the same nkv_group (i.e., they are variants of the same underlying verse). For each verse text in a group, prepare.py samples one other text from that same group to form a pair.
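
In outline, the sampling works as follows (a sketch; the anchor/positive column names are assumptions, and prepare.py may differ in details):

import random
import pandas as pd

train = pd.read_csv("data/processed/train.csv")

pairs = []
for _, group in train.groupby("nkv_group"):
    texts = group["text"].tolist()  # every group has at least 2 variants
    for i, anchor in enumerate(texts):
        # Sample one other variant from the same group as the positive
        positive = random.choice(texts[:i] + texts[i + 1:])
        pairs.append((anchor, positive))

pd.DataFrame(pairs, columns=["anchor", "positive"]).to_csv(
    "data/processed/train_pairs.csv", index=False)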

No explicit negative pairs are written. During training, MultipleNegativesRankingLoss uses in-batch negatives: for a given anchor sentence, the other sentences in the batch are treated as negatives.

For dev/eval, the InformationRetrievalEvaluator is built from the split CSVs: it treats items with the same nkv_group as relevant documents for a query.
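
A sketch of how such an evaluator can be built from a split CSV (the actual construction lives in src/train.py; excluding the query from its own relevant set is an assumption here):

import pandas as pd
from sentence_transformers.evaluation import InformationRetrievalEvaluator

dev = pd.read_csv("data/processed/dev.csv")
dev["verse_id"] = dev["verse_id"].astype(str)

corpus = dict(zip(dev["verse_id"], dev["text"]))
queries = dict(corpus)  # every verse doubles as a query

# Verses sharing an nkv_group are relevant to each other (self excluded)
group_members = dev.groupby("nkv_group")["verse_id"].apply(set)
relevant_docs = {
    qid: group_members[g] - {qid}
    for qid, g in zip(dev["verse_id"], dev["nkv_group"])
}

ir_evaluator = InformationRetrievalEvaluator(
    queries=queries, corpus=corpus, relevant_docs=relevant_docs, name="dev")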


Training (src/train.py)

Training uses SentenceTransformers with:

  • Loss: MultipleNegativesRankingLoss
  • Evaluator: InformationRetrievalEvaluator built from the dev/eval splits
  • Normalization: performed in prepare.py (accent stripping + lowercase)
  • Saving: save_best_model=True

Training/data configuration (paths, split ratios, batch size, epochs, LR, etc.) lives in src/config.py.
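
A condensed sketch of this setup using the current SentenceTransformers Trainer API (train.py reads its actual hyperparameters from src/config.py; the pair column names and values below are placeholders):

import pandas as pd
from datasets import Dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)
from sentence_transformers.losses import MultipleNegativesRankingLoss

train_ds = Dataset.from_pandas(pd.read_csv("data/processed/train_pairs.csv"))

# Base model; sentence-transformers adds a mean-pooling head automatically
model = SentenceTransformer("pranaydeeps/Ancient-Greek-BERT")
loss = MultipleNegativesRankingLoss(model)  # in-batch negatives

args = SentenceTransformerTrainingArguments(
    output_dir="model",
    num_train_epochs=1,              # placeholder; see src/config.py
    per_device_train_batch_size=64,  # larger batches -> more negatives
)

# An InformationRetrievalEvaluator can additionally be passed via the
# trainer's `evaluator` argument for dev-time retrieval metrics.
trainer = SentenceTransformerTrainer(
    model=model, args=args, train_dataset=train_ds, loss=loss)
trainer.train()
model.save_pretrained("model")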

Hardware note: training for the released model was run on a single NVIDIA A100 80GB PCIe.

Run:

./run.sh train

Outputs:

  • model artifacts written to model/ (see MODEL_DIR in src/config.py)

Inference (src/infer.py)

infer.py loads the model and prints cosine similarities between paired lists of verses, after normalizing the text.
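
In essence (a sketch; infer.py defines its own verse lists and applies the same normalization as prepare.py):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("model")  # the fine-tuned model directory

# Paired verse lists (placeholders, assumed already normalized)
left = ["εν αρχη ην ο λογος"]
right = ["εν αρχη ην ο λογος και ο λογος ην προς τον θεον"]

# Cosine similarity between corresponding entries of the two lists
scores = model.similarity_pairwise(model.encode(left), model.encode(right))
for l, r, score in zip(left, right, scores):
    print(f"{float(score):.3f}  {l}  <->  {r}")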

Run:

./run.sh infer

This is useful for quickly sanity-checking that:

  • near-identical verses score very high
  • verses with additions/omissions still score relatively high (depending on overlap)
  • unrelated verses score lower

Citation

If you use this model, please cite it as a model release:

@misc{ancient-greek-variant-sbert,
  author = {Fröhlich, Paul},
  title = {Ancient Greek Variant SBERT: Fine-tuned Embeddings for Biblical text verses in Ancient Greek},
  year = {2026},
  howpublished = {\url{https://huggingface.co/Paulanerus/AncientGreekVariantSBERT}},
  note = {Model release}
}

Acknowledgments

This work builds on pranaydeeps/Ancient-Greek-BERT as the base model.

This work was funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation), project 513300936.
