UBS Entity Resolution - Lauzhack 2024

Dignity Team

Daniel López Gala - daniel.lopezgala@epfl.ch
Carlos Hurtado - carloshurtadocomin@gmail.com
Mario Rico Ibáñez - mario.ricoibanez@epfl.ch
Tran Huong Lan - tranhuonglantk@gmail.com

Kaggle Team Name: Dignity

This project resolves financial entities using Locality-Sensitive Hashing (LSH) to efficiently identify and match similar non-UBS entities in transaction records. It is tailored for the financial industry, where precision and scalability are essential.

Key Features

Linear Models for Financial Entity Matching

The system uses LSH for blocking and similarity-based feature engineering, so the matching cost grows roughly linearly with the number of records rather than quadratically:

Scalability: LSH restricts pairwise comparisons to records that share a bucket, which keeps runtime close to linear in data size as long as buckets stay small.
Robustness: Features like phonetic encoding (Soundex, Metaphone, NYSIIS), company detection, and custom similarity metrics make the model highly adaptable to diverse financial datasets.

These methods ensure quick and accurate matching, making the approach ideal for financial transactions where entity resolution needs to be precise, scalable, and interpretable.

Workflow

Preprocessing
- Standardizes input datasets by normalizing names, addresses, and phone numbers.
- Maps categorical variables into numerical formats.
Feature Engineering
- Extracts and encodes phonetic features for entity matching.
- Identifies companies using keyword-based detection.
- Splits names into "given_name" and "surname," enabling nuanced comparisons.
LSH for Blocking
- Employs LSH to group similar entities into buckets, reducing the number of comparisons needed.
Similarity Scoring
- Combines phonetic encoding, string similarity (Jaro-Winkler), and attribute weighting for fine-grained matching.
Evaluation
- Measures performance using precision, recall, and F1-score.
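
The feature engineering step above (keyword-based company detection and splitting names into "given_name" and "surname") can be sketched roughly as follows. This is a minimal illustration, not the project's code: the keyword list `COMPANY_KEYWORDS` and the helpers `is_company` and `split_name` are hypothetical names, and treating the last token as the surname is an assumption.

```python
# Hypothetical keyword list; the project's real keywords are not given in this README.
COMPANY_KEYWORDS = {"ag", "gmbh", "sa", "ltd", "inc", "llc", "corp", "bank"}


def is_company(name: str) -> bool:
    """Return True if any token of the name is a known company keyword."""
    # Drop dots so "S.A." becomes "sa"; treat commas as token separators.
    tokens = name.lower().replace(".", "").replace(",", " ").split()
    return any(token in COMPANY_KEYWORDS for token in tokens)


def split_name(name: str) -> dict:
    """Split a person name; assumes the last token is the surname."""
    tokens = name.split()
    if not tokens:
        return {"given_name": "", "surname": ""}
    if len(tokens) == 1:
        return {"given_name": "", "surname": tokens[0]}
    return {"given_name": " ".join(tokens[:-1]), "surname": tokens[-1]}
```

For example, `split_name("anna maria keller")` yields `{"given_name": "anna maria", "surname": "keller"}`. Names written surname-first would need a different rule.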

Models Used

Locality-Sensitive Hashing (LSH)

Blocking: Groups entities based on attributes like names, addresses, and phone numbers using n-grams and MinHash.
Efficiency: Significantly reduces computational costs by limiting comparisons to similar buckets.
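
To make the blocking idea concrete, here is a small standard-library sketch of MinHash LSH over character n-grams. It is an illustration under assumptions, not the repository's implementation: the function names, the 3-gram size, and the 8 bands of 4 rows are all hypothetical choices.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def shingles(text: str, n: int = 3) -> set:
    """Character n-grams of a string, padded so short strings still yield one."""
    padded = f" {text} "
    return {padded[i:i + n] for i in range(max(1, len(padded) - n + 1))}


def minhash(items: set, num_perm: int = 32) -> list:
    """Signature with one minimum hash value per seeded hash function."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")  # a different salt acts as a different hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for s in items
        ))
    return signature


def candidate_pairs(records: dict, bands: int = 8, rows: int = 4) -> set:
    """Return pairs of record ids that share at least one LSH bucket."""
    buckets = defaultdict(set)
    for rec_id, text in records.items():
        sig = minhash(shingles(text), num_perm=bands * rows)
        for b in range(bands):
            # Records collide in a band only if all rows of that band agree.
            key = (b, tuple(sig[b * rows:(b + 1) * rows]))
            buckets[key].add(rec_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


records = {1: "john smith", 2: "john smith", 3: "zzz qqq www"}
pairs = candidate_pairs(records)
```

Only the pairs returned by `candidate_pairs` would then go through the more expensive similarity scoring. More bands with fewer rows each raises recall at the cost of more candidate pairs.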

Phonetic Encodings

Soundex, Metaphone, NYSIIS: Improve matching accuracy for noisy and diverse name datasets.
Adaptability: Handles variations in spelling and cultural differences in entity names.
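
As an illustration of how a phonetic encoding collapses spelling variants, the following is a compact sketch of American Soundex, one of the three encodings listed above. The project may well rely on a library for this; the function below is a stand-alone assumption-free-of-dependencies example that drops non-ASCII letters for simplicity.

```python
SOUNDEX_CODES = {
    **dict.fromkeys("bfpv", "1"),
    **dict.fromkeys("cgjkqsxz", "2"),
    **dict.fromkeys("dt", "3"),
    "l": "4",
    **dict.fromkeys("mn", "5"),
    "r": "6",
}


def soundex(name: str) -> str:
    """American Soundex: first letter plus three digits, zero-padded."""
    # Simplification: non-ASCII letters (e.g. accented ones) are dropped.
    letters = [c for c in name.lower() if c.isascii() and c.isalpha()]
    if not letters:
        return ""
    code = letters[0].upper()
    prev = SOUNDEX_CODES.get(letters[0], "")
    for c in letters[1:]:
        digit = SOUNDEX_CODES.get(c, "")
        if digit and digit != prev:
            code += digit
        if c not in "hw":
            # Vowels reset the previous code; 'h' and 'w' do not.
            prev = digit
        if len(code) == 4:
            break
    return code.ljust(4, "0")
```

Both `soundex("Robert")` and `soundex("Rupert")` return `"R163"`, so the two spellings would land in the same phonetic group.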

String Similarity

Jaro-Winkler Similarity: Computes robust similarity scores for attributes like names and phone numbers.
Attribute Weighting: Prioritizes key attributes (e.g., IBAN, surname) to align with financial data resolution requirements.
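
A rough sketch of how Jaro-Winkler similarity and attribute weighting could fit together is shown below. The Jaro-Winkler part follows the standard definition (prefix scale 0.1, prefix length capped at 4); the `weighted_similarity` helper, the field names, and the weights are hypothetical and only illustrate the idea of prioritizing attributes such as IBAN and surname.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1, matched2 = [False] * len1, [False] * len2
    matches = 0
    for i, c in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == c:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order // 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler(s1: str, s2: str, scale: float = 0.1) -> float:
    """Jaro similarity boosted for a shared prefix of up to 4 characters."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return base + prefix * scale * (1 - base)


# Hypothetical weights; the project's actual weighting is not given in this README.
WEIGHTS = {"iban": 0.5, "surname": 0.3, "phone": 0.2}


def weighted_similarity(a: dict, b: dict, weights: dict = WEIGHTS) -> float:
    """Weighted mean of Jaro-Winkler scores over fields present in both records."""
    score = total = 0.0
    for field, weight in weights.items():
        if a.get(field) and b.get(field):
            score += weight * jaro_winkler(a[field], b[field])
            total += weight
    return score / total if total else 0.0
```

Fields missing from either record are skipped and the remaining weights are renormalized, so a record without a phone number is not penalized for it. Whether that is the right behavior depends on the data.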

Preprocessing Steps

Lowercase normalization and removal of special characters.
Title stripping for names (e.g., "Dr.", "Mr.").
Standardized phone numbers.
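
The three preprocessing steps above might look something like this in Python. The helper names and the `TITLES` set are hypothetical, and keeping only digits is just one possible way to standardize phone numbers.

```python
import re

# Hypothetical title list; the project's actual list is not shown in this README.
TITLES = {"dr", "mr", "mrs", "ms", "prof"}


def normalize_name(raw: str) -> str:
    """Lowercase, replace special characters with spaces, and drop title tokens."""
    text = raw.lower()
    # Keep letters (including accented ones), digits, and whitespace.
    text = re.sub(r"[^\w\s]", " ", text).replace("_", " ")
    tokens = [token for token in text.split() if token not in TITLES]
    return " ".join(tokens)


def normalize_phone(raw: str) -> str:
    """Keep digits only, so differently formatted numbers compare equal."""
    return re.sub(r"\D", "", raw)
```

For instance, `normalize_name("Dr. Mario RICO-Ibáñez")` returns `"mario rico ibáñez"`.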

How to Run

Place input files in the data/ directory:
- account_booking_train.csv
- external_parties_train.csv
- account_booking_test.csv
- external_parties_test.csv
Run the main script:
```
python main.py
```
Processed files will be saved in the data/processed/ directory, including:
- external_parties_train.csv
- submission.csv

Evaluation Metrics

Precision: Proportion of predicted matching pairs that are true matches.
Recall: Proportion of true matching pairs that the system finds.
F1-Score: Harmonic mean of precision and recall.
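
As a minimal sketch, the three metrics can be computed over sets of matched pairs as below. The function name `pair_metrics` is hypothetical, and the example assumes each pair is stored as a tuple with the smaller id first so that the same pair always compares equal.

```python
def pair_metrics(predicted: set, actual: set) -> dict:
    """Pairwise precision, recall, and F1 for predicted vs. true matching pairs."""
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


predicted = {(1, 2), (3, 4), (5, 6)}
actual = {(1, 2), (3, 4), (7, 8), (9, 10)}
metrics = pair_metrics(predicted, actual)
```

Here two of the three predicted pairs are correct (precision 2/3) and two of the four true pairs are found (recall 1/2), giving an F1 of 4/7.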

Why This Approach Works for Financial Entity Matching

Linear Scalability: Handles large datasets efficiently, crucial for financial systems with millions of transactions.
Domain-Specific Features: Incorporates financial-specific attributes (e.g., IBAN, company detection).
Accuracy: Combines blocking, similarity, and phonetic features to minimize false positives and negatives.

Created by the Dignity team for Lauzhack 2024!
