Semantic Mapping of Institution Aliases → Canonical Names Using Sentence Transformers + FAISS
Alias–Label Retriever is a GPU-accelerated pipeline that learns semantic relationships between institution aliases and canonical names. It fine-tunes a SentenceTransformer model on alias↔label pairs (e.g., "UdeG" → "University of Guadalajara") and builds FAISS indexes for instant bi-directional retrieval.
Built and trained on a DGX (Sparx) node, the system delivers lightning-fast semantic lookup across tens of thousands of entries.
```
/alias-label-retriever
├── data/
│   ├── raw/          # Source datasets (.parquet, .csv)
│   ├── processed/    # Cleaned, merged, or chunked data
│   └── cache/        # Temporary embeddings, staging data
│
├── models/
│   ├── base/         # Pretrained weights (e.g., intfloat/e5-large-v2)
│   └── trained/      # Fine-tuned SentenceTransformer checkpoints
│
├── faiss/            # Vector indexes for alias/label retrieval
│   ├── labels.index
│   └── aliases.index
│
├── logs/             # Training and lookup logs
│   └── metrics/      # Epoch-level performance data
│
├── scripts/          # Executable Python modules
│   ├── gpu_check.py
│   ├── train_alias_label.py
│   └── test_lookup.py
│
└── notebooks/        # Optional Jupyter exploration
```
```bash
git clone https://github.com/craigtrim/edu-alias-mapper-e5
cd edu-alias-mapper-e5

conda create -n sparx python=3.11 -y
conda activate sparx

pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu121
pip install sentence-transformers faiss-cpu datasets accelerate pandas pyarrow
```

(If your GPU supports CUDA, replace `faiss-cpu` with `faiss-gpu`.)
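Optionally, a quick sanity check (not one of the repo scripts) confirms the stack imports cleanly and can see the GPU:

```python
# Verify that the core libraries import and that CUDA is visible.
import torch
import faiss
import sentence_transformers

print("torch", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("faiss", faiss.__version__)
print("sentence-transformers", sentence_transformers.__version__)
```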
Train the semantic model on your dataset:
```bash
python scripts/train_alias_label.py
```

Outputs:

- Fine-tuned model → `models/trained/alias_label_e5/`
- FAISS indexes → `faiss/aliases.index` and `faiss/labels.index`
- Logs → `logs/training.log`
Alias → Canonical Label
```bash
python scripts/test_lookup.py --query "UdeG" --direction alias-to-label
```

Canonical Label → Likely Aliases

```bash
python scripts/test_lookup.py --query "University of Guadalajara" --direction label-to-alias
```

Example output:
```
🔎 Query: UdeG
📈 Top matches:
1. University of Guadalajara (0.9761)
2. Universidad de Guadalajara (0.9518)
3. UdG (0.9423)
```
Before training, confirm CUDA readiness:
```bash
python scripts/gpu_check.py
```

Example log:

```
🧮 Detected 1 CUDA device(s)
✅ GPU[0] NVIDIA GB10 — 79.1 GB free / 119.7 GB total
🎯 GPU environment verified successfully
```

| Component | Purpose |
|---|---|
| PyTorch | Model training and GPU compute |
| SentenceTransformers | Fine-tuning & embedding generation |
| FAISS | Fast vector similarity search |
| Hugging Face Datasets | Efficient data streaming |
| Pandas + Parquet | Data I/O and preprocessing |
| Accelerate | GPU orchestration backend |
| Column | Type | Description |
|---|---|---|
| `qid` | str | Wikidata or DBpedia ID |
| `label` | str | Canonical institution name |
| `alias` | str | Known alternative name or abbreviation |
| `description` | str | Optional metadata |
| `website` | str | Official site |
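For illustration, the schema above reduces to alias–label training pairs roughly as follows; the actual preprocessing lives in `scripts/train_alias_label.py`, and the file path matches the pipeline described later in this README:

```python
import pandas as pd

# Load the raw dataset with the schema shown above.
df = pd.read_parquet("data/raw/dbpedia_schools.parquet")

# Keep only complete, unique alias–label pairs for training.
pairs = df[["alias", "label"]].dropna().drop_duplicates()
print(f"{len(pairs):,} alias–label pairs")
```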
| Metric | Value |
|---|---|
| Epochs | 3 |
| Runtime | ~6.8 min |
| Throughput | 160 samples/sec |
| Final Loss | 0.388 |
| GPU | NVIDIA GB10 (119 GB) |
- ⚡ Batch embedding + streaming loader for 10M+ entries
- 🧩 ONNX export for low-latency inference
- 🌎 Multilingual alias handling (e.g., English ↔ Spanish)
- 🔁 Continuous fine-tuning from new institutional data
Craig Trim
⚙️ AI / Data Engineering – Maryville University
📍 Built and trained on local DGX (Sparx)
MIT License — feel free to fork, modify, and extend.
| Step | Command |
|---|---|
| 🧩 Verify GPU | `python scripts/gpu_check.py` |
| 🧠 Train Model | `python scripts/train_alias_label.py` |
| 🔍 Query Index | `python scripts/test_lookup.py --query "UdeG"` |
| 🧮 Monitor GPU | `watch -n 1 nvidia-smi` |
- Base Model: `intfloat/e5-large-v2` (Sentence Transformers)
- Embedding Dimension: 1024
- Fine-tuning Objective: `MultipleNegativesRankingLoss`
- Training Duration: ~6.8 minutes (3 epochs)
- Final Loss: 0.3884
- Batch Size: 64
- Framework: PyTorch 2.3.0 + CUDA
- Precision: Mixed-precision (AMP enabled)
- Device: NVIDIA GB10 (119 GB VRAM)
This setup uses bi-encoder embeddings—alias and label strings are independently encoded into the same vector space, allowing cosine similarity to rank candidate matches. The MultipleNegativesRankingLoss encourages the model to bring correct alias–label pairs closer together and push unrelated pairs apart.
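A minimal sketch of that idea, encoding with the base model directly (the fine-tuned checkpoint in `models/trained/alias_label_e5/` loads the same way; the candidate list here is made up for illustration):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("intfloat/e5-large-v2")

alias = "UdeG"
candidates = ["University of Guadalajara", "University of Glasgow", "University of Georgia"]

# Encode alias and candidates independently into the shared vector space;
# normalized embeddings make dot product equal to cosine similarity.
alias_vec = model.encode(alias, normalize_embeddings=True)
cand_vecs = model.encode(candidates, normalize_embeddings=True)

scores = util.cos_sim(alias_vec, cand_vecs)[0].tolist()
for name, score in sorted(zip(candidates, scores), key=lambda p: -p[1]):
    print(f"{name}: {score:.4f}")
```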
Two independent FAISS indexes are generated post-training:
| Index | Purpose | File Path |
|---|---|---|
| `labels.index` | Enables alias → label lookup | `/faiss/labels.index` |
| `aliases.index` | Enables label → alias lookup | `/faiss/aliases.index` |
Both use `IndexFlatIP` (inner-product similarity), which is equivalent to cosine similarity for normalized embeddings.
Each index is fully memory-resident for sub-millisecond retrieval on GPU or CPU.
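A sketch of the index build under those assumptions, with placeholder random vectors standing in for the real embeddings:

```python
import faiss
import numpy as np

dim = 1024                                   # E5-large embedding dimension
label_vecs = np.random.rand(50_000, dim).astype("float32")  # placeholder embeddings
faiss.normalize_L2(label_vecs)               # normalized → inner product == cosine

index = faiss.IndexFlatIP(dim)
index.add(label_vecs)
faiss.write_index(index, "faiss/labels.index")

# Reload and query: top-3 nearest labels for one vector.
index = faiss.read_index("faiss/labels.index")
scores, ids = index.search(label_vecs[:1], 3)
```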
- Load dataset from Parquet → Pandas (`data/raw/dbpedia_schools.parquet`)
- Construct alias↔label pairs
- Initialize the pretrained E5 model
- Fine-tune for 3 epochs using `MultipleNegativesRankingLoss`
- Persist the fine-tuned model → `models/trained/alias_label_e5/`
- Encode all aliases + labels → dense embeddings
- Write FAISS indexes for instant semantic search
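Condensed into code, those steps look roughly like the sketch below, assuming the classic SentenceTransformers `model.fit` API with `InputExample` pairs; the `pairs` DataFrame comes from the preprocessing sketch earlier, and `warmup_steps` is an illustrative choice:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Wrap alias–label pairs for in-batch contrastive training.
train_examples = [
    InputExample(texts=[row.alias, row.label])
    for row in pairs.itertuples()
]
loader = DataLoader(train_examples, shuffle=True, batch_size=64)

model = SentenceTransformer("intfloat/e5-large-v2")

# Other pairs in the same batch act as negatives for each anchor.
loss = losses.MultipleNegativesRankingLoss(model)

model.fit(train_objectives=[(loader, loss)], epochs=3, warmup_steps=100)
model.save("models/trained/alias_label_e5/")
```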
- User enters query (alias or label).
- Model encodes query → 1024-D embedding.
- FAISS performs nearest-neighbor search in the relevant index.
- Returns ranked candidates with cosine similarity scores.
Example:

```
Query: "UdeG"
Top-1 Match: "University of Guadalajara" (0.9761)
```
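In code, the alias → label direction reduces to something like this sketch, where `label_strings` is a hypothetical list mapping index rows back to label text:

```python
import faiss
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("models/trained/alias_label_e5/")
index = faiss.read_index("faiss/labels.index")

# Encode the query into the same normalized space as the index.
query_vec = model.encode(["UdeG"], normalize_embeddings=True)
scores, ids = index.search(query_vec, 3)     # top-3 nearest labels

# label_strings (hypothetical): label text in the same order rows were indexed.
for rank, (i, score) in enumerate(zip(ids[0], scores[0]), start=1):
    print(f"{rank}. {label_strings[i]} ({score:.4f})")
```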
| Library | Version | Role |
|---|---|---|
| `torch` | ≥ 2.3.0 | GPU compute |
| `sentence-transformers` | ≥ 3.0 | Model fine-tuning |
| `faiss-gpu` | ≥ 1.8.0 | Vector similarity search |
| `datasets` | ≥ 4.2.0 | Data streaming backend |
| `accelerate` | ≥ 0.26.0 | Trainer orchestration |
| `pandas` / `pyarrow` | latest | Data I/O |
- ⚡ Encoding Speed: ~30 k sentences/minute on GB10 GPU
- 💾 Index Size: ~400 MB per 50 k unique entries (float32)
- 🧠 Scalability: Extendable to 10 M+ records via FAISS IVF or HNSW variants
- 🔁 Retraining: Fine-tuning can resume from any checkpoint in `/models/trained/`
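For the 10M+ regime mentioned above, the flat index would be swapped for an IVF variant along these lines; the `nlist`/`nprobe` values are illustrative, not tuned:

```python
import faiss
import numpy as np

dim, nlist = 1024, 4096                      # nlist: number of coarse clusters
quantizer = faiss.IndexFlatIP(dim)
index = faiss.IndexIVFFlat(quantizer, dim, nlist, faiss.METRIC_INNER_PRODUCT)

vecs = np.random.rand(100_000, dim).astype("float32")   # placeholder embeddings
faiss.normalize_L2(vecs)
index.train(vecs)                            # IVF requires a training pass first
index.add(vecs)
index.nprobe = 32                            # clusters probed per query (speed/recall knob)
```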
To reproduce identical results:
```bash
python scripts/gpu_check.py
python scripts/train_alias_label.py
python scripts/test_lookup.py --query "UdeG"
```

Model artifacts, logs, and indexes are version-controlled under their respective folders to ensure deterministic behavior across reruns.