Semantic Mapping of Institution Aliases → Canonical Names Using Sentence Transformers + FAISS
Alias–Label Retriever is a GPU-accelerated pipeline that learns semantic relationships between institution aliases and canonical names. It fine-tunes a SentenceTransformer model on alias↔label pairs (e.g., "UdeG" → "University of Guadalajara") and builds FAISS indexes for instant bi-directional retrieval.
Built and trained on a DGX (Sparx) node, the system delivers lightning-fast semantic lookup across tens of thousands of entries.
```
/alias-label-retriever
├── data/
│   ├── raw/          # Source datasets (.parquet, .csv)
│   ├── processed/    # Cleaned, merged, or chunked data
│   └── cache/        # Temporary embeddings, staging data
│
├── models/
│   ├── base/         # Pretrained weights (e.g., intfloat/e5-large-v2)
│   └── trained/      # Fine-tuned SentenceTransformer checkpoints
│
├── faiss/            # Vector indexes for alias/label retrieval
│   ├── labels.index
│   └── aliases.index
│
├── logs/             # Training and lookup logs
│   └── metrics/      # Epoch-level performance data
│
├── scripts/          # Executable Python modules
│   ├── gpu_check.py
│   ├── train_alias_label.py
│   └── test_lookup.py
│
└── notebooks/        # Optional Jupyter exploration
```
```bash
git clone https://github.com/craigtrim/edu-alias-mapper-e5
cd edu-alias-mapper-e5

conda create -n sparx python=3.11 -y
conda activate sparx

pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu121
pip install sentence-transformers faiss-cpu datasets accelerate pandas pyarrow
```

(If your GPU supports CUDA, replace `faiss-cpu` with `faiss-gpu`.)
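Optionally, a quick sanity check (not one of the repo scripts) confirms the stack imports cleanly and can see the GPU:

```python
# Verify that the core libraries import and that CUDA is visible.
import torch
import faiss
import sentence_transformers

print("torch", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("faiss", faiss.__version__)
print("sentence-transformers", sentence_transformers.__version__)
```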
Train the semantic model on your dataset:
```bash
python scripts/train_alias_label.py
```

Outputs:

- Fine-tuned model → `models/trained/alias_label_e5/`
- FAISS indexes → `faiss/aliases.index` and `faiss/labels.index`
- Logs → `logs/training.log`
Alias → Canonical Label
```bash
python scripts/test_lookup.py --query "UdeG" --direction alias-to-label
```

Canonical Label → Likely Aliases

```bash
python scripts/test_lookup.py --query "University of Guadalajara" --direction label-to-alias
```

Example output:
```
🔎 Query: UdeG
📈 Top matches:
1. University of Guadalajara (0.9761)
2. Universidad de Guadalajara (0.9518)
3. UdG (0.9423)
```
Before training, confirm CUDA readiness:
```bash
python scripts/gpu_check.py
```

Example log:

```
🧮 Detected 1 CUDA device(s)
✅ GPU[0] NVIDIA GB10 — 79.1 GB free / 119.7 GB total
🎯 GPU environment verified successfully
```

| Component | Purpose |
|---|---|
| PyTorch | Model training and GPU compute |
| SentenceTransformers | Fine-tuning & embedding generation |
| FAISS | Fast vector similarity search |
| Hugging Face Datasets | Efficient data streaming |
| Pandas + Parquet | Data I/O and preprocessing |
| Accelerate | GPU orchestration backend |
| Column | Type | Description |
|---|---|---|
| `qid` | str | Wikidata or DBpedia ID |
| `label` | str | Canonical institution name |
| `alias` | str | Known alternative name or abbreviation |
| `description` | str | Optional metadata |
| `website` | str | Official site |
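For illustration, the schema above reduces to alias–label training pairs roughly as follows; the actual preprocessing lives in `scripts/train_alias_label.py`, and the file path matches the pipeline described later in this README:

```python
import pandas as pd

# Load the raw dataset with the schema shown above.
df = pd.read_parquet("data/raw/dbpedia_schools.parquet")

# Keep only complete, unique alias–label pairs for training.
pairs = df[["alias", "label"]].dropna().drop_duplicates()
print(f"{len(pairs):,} alias–label pairs")
```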
| Metric | Value |
|---|---|
| Epochs | 3 |
| Runtime | ~6.8 min |
| Throughput | 160 samples/sec |
| Final Loss | 0.388 |
| GPU | NVIDIA GB10 (119 GB) |
- ⚡ Batch embedding + streaming loader for 10M+ entries
- 🧩 ONNX export for low-latency inference
- 🌎 Multilingual alias handling (e.g., English ↔ Spanish)
- 🔁 Continuous fine-tuning from new institutional data
Craig Trim
⚙️ AI / Data Engineering – Maryville University
📍 Built and trained on local DGX (Sparx)
MIT License — feel free to fork, modify, and extend.
| Step | Command |
|---|---|
| 🧩 Verify GPU | `python scripts/gpu_check.py` |
| 🧠 Train Model | `python scripts/train_alias_label.py` |
| 🔍 Query Index | `python scripts/test_lookup.py --query "UdeG"` |
| 🧮 Monitor GPU | `watch -n 1 nvidia-smi` |
- Base Model: `intfloat/e5-large-v2` (Sentence Transformers)
- Embedding Dimension: 1024
- Fine-tuning Objective: `MultipleNegativesRankingLoss`
- Training Duration: ~6.8 minutes (3 epochs)
- Final Loss: 0.3884
- Batch Size: 64
- Framework: PyTorch 2.3.0 + CUDA
- Precision: Mixed-precision (AMP enabled)
- Device: NVIDIA GB10 (119 GB VRAM)
This setup uses bi-encoder embeddings—alias and label strings are independently encoded into the same vector space, allowing cosine similarity to rank candidate matches. The MultipleNegativesRankingLoss encourages the model to bring correct alias–label pairs closer together and push unrelated pairs apart.
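A minimal sketch of that idea, encoding with the base model directly (the fine-tuned checkpoint in `models/trained/alias_label_e5/` loads the same way; the candidate list here is made up for illustration):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("intfloat/e5-large-v2")

alias = "UdeG"
candidates = ["University of Guadalajara", "University of Glasgow", "University of Georgia"]

# Encode alias and candidates independently into the shared vector space;
# normalized embeddings make dot product equal to cosine similarity.
alias_vec = model.encode(alias, normalize_embeddings=True)
cand_vecs = model.encode(candidates, normalize_embeddings=True)

scores = util.cos_sim(alias_vec, cand_vecs)[0].tolist()
for name, score in sorted(zip(candidates, scores), key=lambda p: -p[1]):
    print(f"{name}: {score:.4f}")
```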
Two independent FAISS indexes are generated post-training:
| Index | Purpose | File Path |
|---|---|---|
| `labels.index` | Enables alias → label lookup | `/faiss/labels.index` |
| `aliases.index` | Enables label → alias lookup | `/faiss/aliases.index` |
Both use `IndexFlatIP` (inner-product similarity), which is equivalent to cosine similarity for normalized embeddings.
Each index is fully memory-resident for sub-millisecond retrieval on GPU or CPU.
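A sketch of the index build under those assumptions, with placeholder random vectors standing in for the real embeddings:

```python
import faiss
import numpy as np

dim = 1024                                   # E5-large embedding dimension
label_vecs = np.random.rand(50_000, dim).astype("float32")  # placeholder embeddings
faiss.normalize_L2(label_vecs)               # normalized → inner product == cosine

index = faiss.IndexFlatIP(dim)
index.add(label_vecs)
faiss.write_index(index, "faiss/labels.index")

# Reload and query: top-3 nearest labels for one vector.
index = faiss.read_index("faiss/labels.index")
scores, ids = index.search(label_vecs[:1], 3)
```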
- Load dataset from Parquet → Pandas (`data/raw/dbpedia_schools.parquet`)
- Construct alias↔label pairs
- Initialize the pretrained E5 model
- Fine-tune for 3 epochs using `MultipleNegativesRankingLoss`
- Persist the fine-tuned model → `models/trained/alias_label_e5/`
- Encode all aliases + labels → dense embeddings
- Write FAISS indexes for instant semantic search
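Condensed into code, those steps look roughly like the sketch below, assuming the classic SentenceTransformers `model.fit` API with `InputExample` pairs; the `pairs` DataFrame comes from the preprocessing sketch earlier, and `warmup_steps` is an illustrative choice:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Wrap alias–label pairs for in-batch contrastive training.
train_examples = [
    InputExample(texts=[row.alias, row.label])
    for row in pairs.itertuples()
]
loader = DataLoader(train_examples, shuffle=True, batch_size=64)

model = SentenceTransformer("intfloat/e5-large-v2")

# Other pairs in the same batch act as negatives for each anchor.
loss = losses.MultipleNegativesRankingLoss(model)

model.fit(train_objectives=[(loader, loss)], epochs=3, warmup_steps=100)
model.save("models/trained/alias_label_e5/")
```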
- User enters query (alias or label).
- Model encodes query → 1024-D embedding.
- FAISS performs nearest-neighbor search in the relevant index.
- Returns ranked candidates with cosine similarity scores.
Example:

```
Query: "UdeG"
Top-1 Match: "University of Guadalajara" (0.9761)
```
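In code, the alias → label direction reduces to something like this sketch, where `label_strings` is a hypothetical list mapping index rows back to label text:

```python
import faiss
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("models/trained/alias_label_e5/")
index = faiss.read_index("faiss/labels.index")

# Encode the query into the same normalized space as the index.
query_vec = model.encode(["UdeG"], normalize_embeddings=True)
scores, ids = index.search(query_vec, 3)     # top-3 nearest labels

# label_strings (hypothetical): label text in the same order rows were indexed.
for rank, (i, score) in enumerate(zip(ids[0], scores[0]), start=1):
    print(f"{rank}. {label_strings[i]} ({score:.4f})")
```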
| Library | Version | Role |
|---|---|---|
| `torch` | ≥ 2.3.0 | GPU compute |
| `sentence-transformers` | ≥ 3.0 | Model fine-tuning |
| `faiss-gpu` | ≥ 1.8.0 | Vector similarity search |
| `datasets` | ≥ 4.2.0 | Data streaming backend |
| `accelerate` | ≥ 0.26.0 | Trainer orchestration |
| `pandas` / `pyarrow` | latest | Data I/O |
- ⚡ Encoding Speed: ~30 k sentences/minute on GB10 GPU
- 💾 Index Size: ~400 MB per 50 k unique entries (float32)
- 🧠 Scalability: Extendable to 10 M+ records via FAISS IVF or HNSW variants
- 🔁 Retraining: Fine-tuning can resume from any checkpoint in `/models/trained/`
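For the 10M+ regime mentioned above, the flat index would be swapped for an IVF variant along these lines; the `nlist`/`nprobe` values are illustrative, not tuned:

```python
import faiss
import numpy as np

dim, nlist = 1024, 4096                      # nlist: number of coarse clusters
quantizer = faiss.IndexFlatIP(dim)
index = faiss.IndexIVFFlat(quantizer, dim, nlist, faiss.METRIC_INNER_PRODUCT)

vecs = np.random.rand(100_000, dim).astype("float32")   # placeholder embeddings
faiss.normalize_L2(vecs)
index.train(vecs)                            # IVF requires a training pass first
index.add(vecs)
index.nprobe = 32                            # clusters probed per query (speed/recall knob)
```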
To reproduce identical results:
```bash
python scripts/gpu_check.py
python scripts/train_alias_label.py
python scripts/test_lookup.py --query "UdeG"
```

Model artifacts, logs, and indexes are version-controlled under their respective folders to ensure deterministic behavior across reruns.