🚀 GPU-Accelerated Entity Resolver

A high-performance, production-ready Python package for entity resolution and deduplication at scale. Built on NVIDIA RAPIDS (cuDF, cuML, CuPy), the project keeps the entire pipeline on the GPU to minimize transfers and maximize throughput.

✨ Features

Core Capabilities

  • πŸƒβ€β™‚οΈ GPU-Accelerated Pipeline: End-to-end GPU execution minimizes data transfer bottlenecks
  • πŸ” Intelligent Address Processing: Leverages libpostal for robust address parsing and standardization
  • πŸ’Ύ Model Persistence: Save and load fitted models for production deployment
  • πŸ“Š Confidence Scoring: Multi-factor scoring system for match quality assessment
  • πŸ”„ Incremental Processing: Transform new data using pre-fitted models

Multi-Stream Vectorization

Creates rich entity embeddings by combining three complementary approaches:

| Stream    | Method       | Purpose                  | Key Technique                |
|-----------|--------------|--------------------------|------------------------------|
| Syntactic | TF-IDF       | Character-level patterns | N-gram analysis (3–5 chars)  |
| Phonetic  | Metaphone    | Sound-based matching     | Phonetic encoding            |
| Semantic  | Transformers | Contextual meaning       | bge-base-en-v1.5 embeddings  |
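
Conceptually, the combination looks like the following sketch. This is illustrative only: it uses CPU scikit-learn analogues of the package's cuML primitives, and `phonetics.metaphone` plus the bge model stand in for the internal encoders.

# Illustrative sketch of combining three streams (not the package's internal code).
import numpy as np
import phonetics
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sentence_transformers import SentenceTransformer

names = ['Acme Corp', 'ACME Corporation', 'TechStart Inc']

# Syntactic stream: character n-grams capture spelling-level patterns.
syntactic = TfidfVectorizer(analyzer='char_wb', ngram_range=(3, 5)).fit_transform(names)

# Phonetic stream: TF-IDF over Metaphone codes captures sound-alike names.
codes = [' '.join(phonetics.metaphone(tok.lower()) for tok in name.split()) for name in names]
phonetic = TfidfVectorizer().fit_transform(codes)

# Semantic stream: transformer embeddings capture contextual meaning.
semantic = SentenceTransformer('BAAI/bge-base-en-v1.5').encode(names)

# Concatenate the streams into one feature matrix per entity.
combined = np.hstack([hstack([syntactic, phonetic]).toarray(), semantic])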

Advanced Clustering

  • HDBSCAN + SNN Ensemble: Combines density-based clustering with graph-based rescue mechanism
  • UMAP Manifold Learning: Ensemble of models with randomized hyperparameters for robust dimensionality reduction
  • Automatic Chain Detection: Identifies and numbers entities with multiple locations (e.g., β€œStarbucks – 1”, β€œStarbucks – 2”)

⚡ Performance

  • Designed for hundreds of thousands to millions of rows on a single GPU.
  • Memory scales primarily with TF-IDF vocabulary size and UMAP neighbor graphs.
  • Performance depends on hardware and configuration. On an NVIDIA A100 40GB, end-to-end runs with the default settings typically process ~200k rows in tens of minutes to a few hours. Lighter settings (smaller vocabularies / fewer UMAP epochs) run faster; aggressive settings (larger vocabularies, more ensemble members) improve quality at the cost of runtime and memory.

💻 Requirements

System Requirements

  • GPU: NVIDIA GPU with Compute Capability 7.0+ (e.g., V100, T4, A100, RTX 2080+, Colab GPUs)
  • CUDA: Version 11.5+ or 12.x (CUDA 12.x recommended)
  • Memory:
    • GPU: Minimum 8 GB VRAM (16 GB+ recommended for large datasets)
    • System RAM: ~2Γ— your dataset size

Software Dependencies

Core Libraries:

  • Python 3.11+
  • pandas, numpy, pydantic, pyyaml

GPU Libraries (auto-installed with pip method):

  • RAPIDS Suite: cudf, cuml, cugraph, cupy, rmm

Specialized Libraries:

  • sentence-transformers - Semantic embeddings
  • phonetics - Phonetic encoding
  • postal - Address parsing (libpostal)
  • unicodedata2 - Unicode normalization

Development Tools (optional):

  • ruff - Linting and formatting
  • mypy - Static type checking
  • pytest - Testing framework

📦 Installation

The package supports two installation methods: pip (recommended for Colab and simpler setups) or conda (recommended for local development).


Method 1: Pip Installation (Recommended for Google Colab)

For CUDA 12.x (most recent systems):

# Clone the repository
git clone https://github.com/andrewjordan3/entity-resolver.git
cd entity-resolver

# Install with GPU dependencies for CUDA 12
pip install -e ".[gpu-cu12]"

For CUDA 11.x (older systems):

# Clone the repository
git clone https://github.com/andrewjordan3/entity-resolver.git
cd entity-resolver

# Install with GPU dependencies for CUDA 11
pip install -e ".[gpu-cu11]"

Google Colab Quick Setup:

# In a Colab notebook cell:
!git clone https://github.com/andrewjordan3/entity-resolver.git
%cd entity-resolver
!pip install -e ".[gpu-cu12]"

# Restart runtime after installation
# Go to: Runtime > Restart runtime

Method 2: Conda Installation (Recommended for Local Development)

Step 1: Create RAPIDS Environment

# Create a new environment with RAPIDS
conda create -n entity-resolver -c rapidsai -c conda-forge -c nvidia \
    rapids=24.12 python=3.11 cuda-version=11.8

# Activate the environment
conda activate entity-resolver

For more detailed instructions, see the RAPIDS installation guide.

Step 2: Install the Package

# Clone the repository
git clone https://github.com/andrewjordan3/entity-resolver.git
cd entity-resolver

# Install the package (CPU dependencies only, RAPIDS already installed)
pip install -e .

Verify Installation

import entity_resolver
import cudf

print(f"Entity Resolver version: {entity_resolver.__version__}")
print(f"cuDF version: {cudf.__version__}")

🚀 Quick Start

import pandas as pd
from entity_resolver import EntityResolver

# Load your data
df = pd.DataFrame({
    'company_name': [
        'Acme Corp', 'ACME Corporation', 'Acme Corp.',
        'TechStart Inc', 'Tech Start LLC'
    ],
    'address': [
        '123 Main St, Chicago IL 60601',
        '123 Main Street, Chicago, IL',
        '123 Main St Chicago IL',
        '456 Oak Ave, Hoffman Estates, IL',
        '456 Oak Avenue Hoffman Estates IL'
    ]
})

# Initialize the resolver.
# This uses the default settings. For custom settings, provide a path
# to your YAML file, e.g., EntityResolver(config_path='my_config.yaml')
resolver = EntityResolver()

# Fit and transform in one step
resolved_df = resolver.fit_transform(df)

# Examine results
print(resolved_df[['company_name', 'canonical_name', 'confidence_score']])

# Generate a detailed report
report = resolver.generate_report(df)
print(f"Reduced {report['summary']['unique_entities_before']} entities "
      f"to {report['summary']['unique_entities_after']}")

# Get records flagged for review
review_df = resolved_df[resolved_df['needs_review']]
print(f"Records needing review: {len(review_df)}")

🔧 Advanced Usage

Saving and Loading Models

# Train and save
resolver = EntityResolver(config_path='my_config.yaml')  # or EntityResolver() for defaults
resolver.fit(training_df)
resolver.save_model('models/my_resolver')

# Load and use
resolver = EntityResolver.load_model('models/my_resolver')
new_resolved_df = resolver.transform(new_df)

Custom Configuration

To customize the pipeline, create a config.yaml file and pass its path to the resolver. You only need to specify the parameters you want to override.

Example my_config.yaml:

# my_config.yaml
output:
  review_confidence_threshold: 0.80
  split_address_components: true

clusterer:
  hdbscan_params:
    min_cluster_size: 2
  umap_params:
    n_neighbors: 30

validation:
  name_fuzz_ratio: 85

Usage in Python:

from entity_resolver import EntityResolver

# Load the resolver with your custom configuration
resolver = EntityResolver(config_path='my_config.yaml')

Processing Large Datasets

# For datasets larger than GPU memory, process in chunks
chunk_size = 100_000
chunks = []

for i in range(0, len(large_df), chunk_size):
    chunk = large_df.iloc[i:i+chunk_size]
    resolved_chunk = resolver.transform(chunk)
    chunks.append(resolved_chunk)

final_df = pd.concat(chunks, ignore_index=True)

Note: fit() and fit_transform() must be run on a dataset that fits in GPU memory, because they build global models (e.g., vectorizer vocabularies). Once the resolver is fitted, transform() can process new, unseen data in chunks.
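
If the full dataset exceeds GPU memory, one workable pattern is to fit on a representative sample and then stream chunks through transform() as shown above (a sketch; `large_df` and the sample size are illustrative):

# Fit global models on a sample that fits in GPU memory, then
# process the full dataset with the chunked transform() loop above.
sample_df = large_df.sample(n=200_000, random_state=42)
resolver.fit(sample_df)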

βš™οΈ Configuration

The ResolverConfig object controls all aspects of the pipeline:

| Component     | Key Parameters                      | Description                 |
|---------------|-------------------------------------|-----------------------------|
| columns       | entity_col, address_cols            | Input column mapping        |
| normalization | replacements, suffixes_to_remove    | Text cleaning rules         |
| vectorizer    | encoders, stream_proportions        | Feature extraction settings |
| clusterer     | umap_params, hdbscan_params         | Clustering hyperparameters  |
| validation    | name_fuzz_ratio, address_fuzz_ratio | Match validation thresholds |
| output        | review_confidence_threshold         | Output formatting options   |
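
For reference, a hedged YAML example covering the first three components. Key names follow the table above; the exact spellings and accepted values may differ in your version, so treat this as illustrative:

# illustrative_config.yaml (values are examples, not defaults)
columns:
  entity_col: company_name
  address_cols: [address]

normalization:
  suffixes_to_remove: ['inc', 'llc', 'corporation']

vectorizer:
  stream_proportions:
    syntactic: 0.4
    phonetic: 0.2
    semantic: 0.4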

πŸ—οΈ Pipeline Architecture

graph TB
    Input[📄 Input DataFrame<br/>company_name, address] --> Normalizer

    subgraph Preprocessing["🔧 PREPROCESSING"]
        Normalizer[Text Normalizer<br/>• Remove suffixes<br/>• Replace patterns<br/>• Unicode normalization]
        AddressProc[Address Processor<br/>• libpostal parsing<br/>• Component extraction<br/>• Name consolidation]
        Normalizer --> AddressProc
    end

    subgraph Vectorization["🧬 VECTORIZATION"]
        AddressProc --> TfidfStream[Syntactic Stream<br/>TF-IDF n-grams]
        AddressProc --> PhoneticStream[Phonetic Stream<br/>Metaphone encoding]
        AddressProc --> SemanticStream[Semantic Stream<br/>MiniLM embeddings]

        TfidfStream --> ContextVec[Context Vectorizer<br/>• Energy balancing<br/>• Stream concatenation<br/>• SVD reduction]
        PhoneticStream --> ContextVec
        SemanticStream --> ContextVec
    end

    subgraph Clustering["🔍 CLUSTERING"]
        ContextVec --> UMAP[UMAP Ensemble<br/>• Randomized params<br/>• Manifold learning]
        UMAP --> HDBSCAN[HDBSCAN<br/>Density clustering]
        HDBSCAN --> SNN[SNN Graph Rescue<br/>• Noise recovery<br/>• Consensus voting]
    end

    subgraph Refinement["✨ REFINEMENT"]
        SNN --> Validator[Cluster Validator<br/>• Fuzzy matching<br/>• Address verification]
        Validator --> Merger[Cluster Merger<br/>• Similar clusters<br/>• High-confidence joins]
        Merger --> Refiner[Cluster Refiner<br/>• Canonical selection<br/>• Chain numbering]
    end

    subgraph Output["📊 OUTPUT"]
        Refiner --> Scorer[Confidence Scorer<br/>• Multi-factor scoring<br/>• Review flagging]
        Scorer --> Reporter[Reporter<br/>• Statistics<br/>• Quality metrics]
        Reporter --> Result[📋 Resolved DataFrame<br/>canonical_name, confidence, chain_num]
    end

    style Input fill:#4A90E2,stroke:#2E5C8A,stroke-width:3px,color:#fff
    style Result fill:#50C878,stroke:#2D7A4A,stroke-width:3px,color:#fff
    style Preprocessing fill:#FFB74D,stroke:#E68900,stroke-width:2px,color:#333
    style Vectorization fill:#BA68C8,stroke:#7B1FA2,stroke-width:2px,color:#fff
    style Clustering fill:#4DB6AC,stroke:#00695C,stroke-width:2px,color:#fff
    style Refinement fill:#F06292,stroke:#C2185B,stroke-width:2px,color:#fff
    style Output fill:#64B5F6,stroke:#1976D2,stroke-width:2px,color:#fff

Pipeline Stages

Preprocessing (Steps 1–2)

  • Text normalization and standardization
  • Address parsing with libpostal
  • Name consolidation for shared addresses

Vectorization (Step 3)

  • TF-IDF character n-grams
  • Phonetic encoding
  • Semantic embeddings
  • Energy balancing across streams

Clustering (Steps 4–5)

  • UMAP ensemble for dimensionality reduction
  • HDBSCAN for core clustering
  • SNN graph for noise rescue
  • Cluster validation and merging

Finalization (Steps 6–7)

  • Canonical entity selection
  • Chain entity numbering
  • Confidence scoring
  • Review flagging

📚 API Reference

Main Classes

  • EntityResolver: Main orchestrator class
  • ResolverConfig: Configuration management
  • Component classes: TextNormalizer, AddressProcessor, MultiStreamVectorizer, EntityClusterer, ClusterValidator, ClusterMerger, ClusterRefiner, ConfidenceScorer

Key Methods

# EntityResolver methods
resolver.fit(df)                 # Train on data
resolver.transform(df)           # Apply to new data
resolver.fit_transform(df)       # Train and transform
resolver.save_model(path)        # Save fitted model
EntityResolver.load_model(path)  # Load saved model (classmethod)
resolver.get_review_dataframe()  # Get review summary
resolver.generate_report(df)     # Generate statistics

πŸ› Troubleshooting

Common Issues

CUDA Out of Memory

# Reduce batch sizes
config.vectorizer.semantic_batch_size = 256
config.clusterer.cosine_consensus_batch_size = 1024

Import Errors

# Verify RAPIDS installation
python -c "import cudf; print(cudf.__version__)"

Poor Clustering Results

# Adjust sensitivity
config.clusterer.hdbscan_params['min_cluster_size'] = 2
config.validation.name_fuzz_ratio = 80

🤝 Contributing

Contributions are welcome!

# Development setup
git clone https://github.com/andrewjordan3/entity-resolver.git
cd entity-resolver

# Install with GPU and development dependencies
pip install -e ".[gpu-cu12,dev]"

# Run linting and formatting
ruff check .
ruff format .

# Run type checking
mypy .

# Run tests (when available)
pytest tests/

πŸ“ Citation

If you use this package in your research, please cite:

@software{entity_resolver,
  author = {Andrew Jordan},
  title = {GPU-Accelerated Entity Resolver},
  year = {2025},
  url = {https://github.com/andrewjordan3/entity-resolver}
}

📄 License

This project is licensed under the MIT License; see the LICENSE file for details.

πŸ™ Acknowledgments

  • NVIDIA RAPIDS team for the GPU acceleration framework
  • OpenCage for libpostal address parsing
  • Hugging Face for sentence-transformers

Maintained by: Andrew Jordan
Contact: andrewjordan3@gmail.com
Last Updated: November 2025
