A high-performance Go library for detecting PII (Personally Identifiable Information) names by Montevive.AI. This is a Go migration of the popular names-dataset Python library, offering 10x faster performance and 6x less memory usage while maintaining the same comprehensive name database.
đ v1.0.0 Released - Production-ready with embedded dataset and out-of-the-box functionality!
This project migrates the Python names-dataset library to Go with significant improvements:
| Feature | Python (Original) | Go (This Project) |
|---|---|---|
| Memory Usage | 3.2 GB | ~500 MB (6x less) |
| Detection Speed | ~50-100ms | 3-9ms (10x faster) |
| Data Format | Pickle files | Protocol Buffers |
| Cultural Rules | Limited | Universal algorithm |
| Surname Support | Single string | Multiple surnames ([]string) |
| Spanish Names | Basic support | Advanced double surname detection |
Built with Protocol Buffers for efficient data storage and fast lookups of 727K first names and 983K surnames from 105 countries.
go get github.com/montevive/go-name-detectorimport "github.com/montevive/go-name-detector/pkg/detector"
d, _ := detector.NewDefault() // Works out of the box!
result := d.DetectPII([]string{"John", "Smith"})
fmt.Printf("Is PII: %v (%.2f confidence)\n", result.IsLikelyName, result.Confidence)- Universal name detection: Works for all cultural naming patterns without hardcoded rules
- Multiple surname support: Handles Spanish double surnames, compound names, etc.
- Confidence scoring: Returns 0.0-1.0 confidence scores with detailed breakdown
- High performance: < 1ms detection time, optimized memory usage
- Protocol Buffers: Efficient binary data format for 727K first names and 983K surnames
- Country prediction: Identifies most likely country of origin
- Gender prediction: Predicts gender based on first names
- CLI tool: Ready-to-use command line interface
The library now includes embedded data and works out of the box:
# Latest version (recommended)
go get github.com/montevive/go-name-detector@latest
# Or pin to v1.0.0
go get github.com/montevive/go-name-detector@v1.0.0That's it! No additional setup, file downloads, or protobuf compilation required. The library is production-ready as of v1.0.0.
If you want to use custom data or the latest dataset, you can follow the advanced setup:
# Install original Python library (for custom data export)
pip install names-dataset
# Clone this Go migration project
git clone https://github.com/montevive/go-name-detector.git
cd go-name-detector
# Install dependencies and generate protobuf code
make generate
# Export data from Python pickle format to Protocol Buffers
make export-data
# Build the CLI tool
make build# Basic name detection
./bin/pii-check "John Smith"
# Output: â Likely PII name (85.3% confidence)
# Spanish names with double surnames
./bin/pii-check "Jose Manuel GarcĂa LĂłpez"
# Output: â Likely PII name (89.0% confidence)
# First names: Jose, Manuel
# Surnames: GarcĂa, LĂłpez
# Spanish names with accents (60% threshold recommended)
./bin/pii-check -threshold 0.6 "JosĂ© GarcĂa"
# Output: â Likely PII name (64.8% confidence)
# Business terminology properly rejected
./bin/pii-check "Informe de Cliente"
# Output: â Not a PII name (26.4% confidence)
# Non-name phrases
./bin/pii-check "The quick brown fox"
# Output: â Not a PII name (12.4% confidence)
# JSON output with accent support
./bin/pii-check -json "MarĂa GarcĂa LĂłpez"
# High precision threshold
./bin/pii-check -threshold 0.8 "JosĂ© Manuel GarcĂa LĂłpez"
# Output: â Likely PII name (89.0% confidence)
# Batch processing
./bin/pii-check -batch names.txt
# Dataset statistics
./bin/pii-check -statsBased on enhanced scoring with popularity-based ranking:
-
70% (Default): Conservative - High confidence, minimal false positives
- Best for: Production systems, automated filtering, compliance workflows
- Example: "JosĂ© Manuel GarcĂa LĂłpez" (89%) â, "Informe de Cliente" (26%) â
- Accuracy: ~95% precision, may miss some legitimate two-word names
-
60%: Balanced - Good detection with reasonable accuracy
- Best for: General purpose detection, Spanish/Latin name datasets
- Example: "JosĂ© GarcĂa" (65%) â, "MarĂa LĂłpez" (62%) â
- Accuracy: ~90% precision, catches most legitimate name patterns
-
50%: Aggressive - Catches more names but higher false positive rate
- Best for: Initial screening, manual review workflows, research
- Example: "MarĂa de GarcĂa" (43%) â despite preposition
- Accuracy: ~80% precision, requires human verification
- Spanish/Latin names: Use 60-65% threshold (two-word names often score 60-70%)
- Anglo/English names: Use 70% threshold (typically score higher due to simpler patterns)
- Mixed international datasets: Use 65% as balanced compromise
- Asian names: Use 70% (usually compound patterns with higher confidence)
The enhanced algorithm distinguishes between:
- â Top-ranked personal names (JosĂ© #1, GarcĂa #1): Score 65-95%
- â Business terms ("Informe de Cliente", "Documento de Identidad"): Score 15-30%
- â Rare/noise entries (Cliente #6,970): Score under 20%
â ïž Prepositions in names ("MarĂa de GarcĂa"): Moderate penalty but detectable at 60%
# High precision (minimal false positives)
./bin/pii-check -threshold 0.8 "JosĂ© Manuel GarcĂa LĂłpez"
# Balanced detection (recommended for most use cases)
./bin/pii-check -threshold 0.65 "JosĂ© GarcĂa"
# High recall (catch more names, review manually)
./bin/pii-check -threshold 0.5 "MarĂa de la Cruz"package main
import (
"fmt"
"log"
"github.com/montevive/go-name-detector/pkg/detector"
)
func main() {
// Create detector with embedded data - works out of the box!
d, err := detector.NewDefault()
if err != nil {
log.Fatal(err)
}
// Detect PII
words := []string{"Jose", "Manuel", "Robles", "Hermoso"}
result := d.DetectPII(words)
fmt.Printf("Is PII: %v\n", result.IsLikelyName)
fmt.Printf("Confidence: %.2f\n", result.Confidence)
fmt.Printf("First names: %v\n", result.Details.FirstNames)
fmt.Printf("Surnames: %v\n", result.Details.Surnames)
fmt.Printf("Country: %s\n", result.Details.TopCountry)
fmt.Printf("Gender: %s\n", result.Details.Gender)
}package main
import (
"fmt"
"github.com/montevive/go-name-detector/pkg/detector"
"github.com/montevive/go-name-detector/pkg/loader"
)
func main() {
// Load dataset from file
l := loader.New()
err := l.LoadFromFile("data/combined_names.pb.gz")
if err != nil {
panic(err)
}
// Create detector
d := detector.New(l.GetDataset())
// Detect PII
words := []string{"Jose", "Manuel", "Robles", "Hermoso"}
result := d.DetectPII(words)
fmt.Printf("Is PII: %v\n", result.IsLikelyName)
fmt.Printf("Confidence: %.2f\n", result.Confidence)
}The detector uses a data-driven approach:
-
All possible splits: For input words, tries every possible combination of first names vs surnames
-
Database lookup: Checks each component against 727K first names and 983K surnames
-
Confidence scoring: Combines multiple factors:
- Database match (base score)
- Name popularity (lower rank = higher confidence)
- Gender consistency across first names
- Country overlap between components
- Multiple valid name bonus
-
Best combination: Returns the split with highest confidence score
Input: ["Jose", "Manuel", "Robles", "Hermoso"]
The algorithm tries:
[Jose] + [Manuel, Robles, Hermoso]â Score: 0.42[Jose, Manuel] + [Robles, Hermoso]â Score: 0.92 â[Jose, Manuel, Robles] + [Hermoso]â Score: 0.35
Returns the best scoring combination with detailed breakdown.
The system uses Protocol Buffer files converted from the original Python pickle format:
- first_names.pb.gz: 727,556 first names with country/gender/rank data
- last_names.pb.gz: 983,826 surnames with country/rank data
- combined_names.pb.gz: Both datasets in a single file
Each name entry contains:
- Country probabilities: Likelihood per country (105 countries supported)
- Gender data: Male/Female probabilities (first names only)
- Popularity ranks: 1-indexed ranking per country (1 = most popular)
| Metric | Python (names-dataset) | Go (This Project) | Improvement |
|---|---|---|---|
| Memory Usage | 3.2 GB | 500 MB | 6.4x less |
| Load Time | ~30-60 seconds | 4.3 seconds | 7-14x faster |
| Detection Speed | 50-100ms | 3-9ms | 10-20x faster |
| File Size | 53MB (pickle) | 54MB (protobuf) | Comparable |
| Batch Processing | ~100 names/sec | 10,000+ names/sec | 100x faster |
BenchmarkDetectPII_Spanish-10 141104 8418 ns/op 14742 B/op 51 allocs/op
BenchmarkDetectPII_English-10 342564 3533 ns/op 7179 B/op 19 allocs/op
BenchmarkDetectPII_NonName-10 226826 5318 ns/op 14374 B/op 32 allocs/op
- Spanish names: 8.4ms per detection
- English names: 3.5ms per detection
- Non-names: 5.3ms per detection
# Run unit tests
make test
# Run benchmarks
make bench
# Test with examples
go test ./examples -vThe scoring algorithm uses enhanced popularity-based scoring and can be customized:
config := detector.ScoreConfig{
BaseMatchScore: 0.25, // Base score for database match
PopularityWeight: 0.35, // HIGH: Popularity is key differentiator
GenderConsistency: 0.1, // Bonus for consistent gender
CountryOverlap: 0.15, // Bonus for country overlap
MultipleNamesBonus: 0.15, // Bonus for multiple names
}
d := detector.NewWithConfig(dataset, config)-
Step-based popularity scoring:
- Top-10 names (JosĂ©, GarcĂa): 1.0 score
- Top-50 names: 0.8 score
- Top-200 names: 0.5 score
- Top-1000 names: 0.2 score
- Rare names (Cliente #6,970): 0.02 score
-
Preposition penalties: Words like "de", "van", "von", "del" heavily penalized
- 70% penalty (Ă0.3) when used as first names
- 30% penalty (Ă0.7) when used as surnames
-
Pattern bonuses:
- Two top-100 names together: 40% boost (Ă1.4)
- Two top-10 names together: 60% boost (Ă1.6)
-
Accent normalization: Automatic handling of "JosĂ©" â "Jose" lookups
The universal algorithm automatically handles:
- English:
John Smith,Mary Jane Doe - Spanish:
Jose Manuel Garcia Lopez(double first + double surname) - Compound names:
Maria del Carmen Rodriguez - Asian patterns: Based on statistical data in the dataset
- Any cultural pattern: No hardcoded rules, purely data-driven
Full support for accented characters and international Unicode names:
- â Accepts any Unicode letters: JosĂ©, MarĂa, François, MĂŒller, ææ, Ù ŰÙ ŰŻ
- â Automatic normalization: Converts accents for database lookup
- â Case insensitive: JOSĂ, josĂ©, JosĂ© all work identically
- â Mixed scripts: Handles Latin, Cyrillic, Arabic, Chinese characters
The library uses intelligent dual lookup:
- Exact match: First tries to find "José" in the database
- Normalized match: Then tries "Jose" (accent removed)
- Best result: Uses whichever version has better popularity ranking
# Spanish names with accents
./bin/pii-check "JosĂ© GarcĂa" # â 64.8% confidence
./bin/pii-check "MarĂa LĂłpez" # â Proper detection
./bin/pii-check "JosĂ© Manuel GarcĂa" # â 81.4% confidence
# French names
./bin/pii-check "François Dupont" # â Handles French accents
# Mixed accent/non-accent
./bin/pii-check "JosĂ© Smith" # â Works with mixed names
./bin/pii-check "Maria GarcĂa" # â One accent, one without- Normalization: Uses Unicode NFD decomposition + combining character removal
- Preservation: Original accented input preserved in results
- Performance: Minimal overhead (~1ÎŒs per normalization)
- Languages: Supports Spanish, French, German, Portuguese, Italian, and more
If migrating from systems that only handle ASCII:
// Old ASCII-only approach â
if isASCII(name) {
result := detector.DetectPII([]string{name})
}
// New Unicode approach â
result := detector.DetectPII([]string{"JosĂ©", "GarcĂa"}) // Just works!This Go implementation uses the same high-quality dataset as the original Python library:
Source: Facebook data leak dataset (533M users) from 105 countries
Original Library: names-dataset by Philippe Remy
What we provide:
- Same 727K first names and 983K surnames
- Same country-specific popularity rankings
- Same gender predictions with statistical confidence
- Same real-world name frequency distributions
- Enhanced: Better support for cultural naming patterns through universal algorithm
Migration benefits:
- Performance: 10x faster detection with 6x less memory
- Format: Protocol Buffers instead of Python pickle
- Algorithm: Universal approach vs hardcoded rules
- Spanish names: Advanced double surname detection
- Production ready: CLI tool, JSON API, batch processing
Copyright 2024 Montevive.AI
Apache License 2.0 - see the LICENSE file for details.
This project is a Go migration of the original names-dataset Python library by Philippe Remy, which is also licensed under Apache 2.0.
We welcome contributions to improve the Go PII Name Detector! This project is maintained by Montevive.AI.
- Fork the repository
- Create a feature branch
- Add tests for new functionality
- Run
make testandmake bench - Submit a pull request
For questions or support, visit Montevive.AI or open an issue on GitHub.
If you're migrating from the Python library:
# Python (original)
from names_dataset import NameDataset
nd = NameDataset()
result = nd.search('Jose Manuel')
# Go (this project) - equivalent functionality
words := []string{"Jose", "Manuel", "Garcia", "Lopez"}
result := detector.DetectPII(words)Key differences:
- Multiple surnames: Go version returns
[]stringfor surnames (supports Spanish double surnames) - Universal detection: No need to call separate functions for different name types
- Performance: 10x faster with 6x less memory usage
- Confidence scoring: Enhanced algorithm with detailed breakdown
Data file not found: Make sure to run make export-data first to generate the protobuf files from the original Python dataset.
Memory issues: The full dataset requires ~500MB RAM (vs 3.2GB for Python). Consider using partial loading if needed.
Low confidence scores: The system is conservative by design. Names not in the 533M person dataset will score lower.
Spanish names with accents not detected:
- Two-word Spanish names (JosĂ© GarcĂa) typically score 60-70% with enhanced scoring
- Use threshold 0.6 instead of default 0.7 for Spanish/Latin datasets
- The algorithm prioritizes accuracy over false positives to avoid detecting business terms
- Example:
./bin/pii-check -threshold 0.6 "JosĂ© GarcĂa"â â 64.8% confidence
Business terms being detected as names:
- Enhanced scoring now heavily penalizes business terminology
- "Informe de Cliente" scores ~26% (down from 42% in older versions)
- Prepositions like "de", "del" receive automatic penalties
- If still getting false positives, try increasing threshold to 0.75-0.8
Mixed results with accented vs non-accented names:
- Library uses dual lookup: tries "José" first, then "Jose" as fallback
- Both versions exist in dataset with different popularity rankings
- Results may vary based on which version has better ranking in the database
Released: August 2025
What's New:
- â
Out-of-the-box functionality - No setup required, just
go getand use - â Embedded dataset - 54MB protobuf file with 727K first names and 983K surnames
- â
detector.NewDefault()- Instant initialization with embedded data - â Production ready - Stable API, comprehensive documentation
- â Performance optimized - 10x faster than Python, 6x less memory
- â Universal algorithm - Works with all cultural naming patterns
- â CLI tool included - Ready-to-use command line interface
Breaking Changes: None (initial release)
Migration: This is the first stable release. Previous development versions are not supported.
This project is developed and maintained by Montevive.AI, a company focused on building advanced AI solutions for data privacy and security.
- Privacy-focused AI tools: Specialized solutions for PII detection and data anonymization
- Security AI systems: Advanced threat detection and analysis platforms
- Data intelligence: Smart data processing and insight generation tools
- đ Website: montevive.ai
- đ§ Email: Contact us through our website
- đŒ GitHub: @montevive
- đ Issues: Report bugs or request features via GitHub Issues
Built with â€ïž by the Montevive.AI team