- Daniel López Gala - daniel.lopezgala@epfl.ch
- Carlos Hurtado - carloshurtadocomin@gmail.com
- Mario Rico Ibáñez - mario.ricoibanez@epfl.ch
- Tran Huong Lan - tranhuonglantk@gmail.com
Kaggle Team Name: Dignity
This project resolves financial entities by using Locality-Sensitive Hashing (LSH) to efficiently identify and match similar non-UBS entities in transaction records. It is tailored for the financial industry, where both precision and scalability matter.
The system uses LSH for blocking and similarity-based feature engineering, so the number of candidate comparisons grows roughly linearly with the data instead of quadratically:
- Scalability: LSH only compares entities that land in the same bucket, which avoids the all-pairs comparison and keeps runtime close to linear in data size.
- Robustness: Features like phonetic encoding (Soundex, Metaphone, NYSIIS), company detection, and custom similarity metrics make the model highly adaptable to diverse financial datasets.
These methods ensure quick and accurate matching, making the approach ideal for financial transactions where entity resolution needs to be precise, scalable, and interpretable.
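To make the blocking idea concrete, here is a minimal sketch of MinHash-based LSH in plain Python. It is not the project's actual code: the function names, the trigram size, and the `num_perm` and `bands` values are assumptions chosen for illustration, and a real pipeline would more likely rely on a dedicated library.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def shingles(text, n=3):
    """Return the set of character n-grams of a string (n=3 is an assumption)."""
    text = text.lower()
    return {text[i:i + n] for i in range(max(len(text) - n + 1, 1))}


def minhash(tokens, num_perm=32):
    """Keep one minimum per seeded hash function; similar sets share many minima."""
    return [
        min(
            int.from_bytes(
                hashlib.blake2b(f"{seed}:{t}".encode(), digest_size=8).digest(), "big"
            )
            for t in tokens
        )
        for seed in range(num_perm)
    ]


def candidate_pairs(records, num_perm=32, bands=16):
    """Bucket records by signature band and pair only records sharing a bucket."""
    rows = num_perm // bands
    buckets = defaultdict(set)
    for rec_id, name in records.items():
        sig = minhash(shingles(name), num_perm)
        for b in range(bands):
            band = tuple(sig[b * rows:(b + 1) * rows])
            buckets[(b, band)].add(rec_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


# Hypothetical records: ids mapped to already-normalized names.
records = {1: "john smith", 2: "john smith", 3: "zzz qqq www"}
pairs = candidate_pairs(records)
```

Only records that share at least one band end up as candidate pairs, so the expensive similarity scoring runs on a small fraction of all possible pairs. More bands with fewer rows each make the blocking more permissive; fewer bands make it stricter.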
- Preprocessing
  - Standardizes input datasets by normalizing names, addresses, and phone numbers.
  - Maps categorical variables into numerical formats.
- Feature Engineering
  - Extracts and encodes phonetic features for entity matching.
  - Identifies companies using keyword-based detection.
  - Splits names into "given_name" and "surname," enabling nuanced comparisons.
- LSH for Blocking
  - Employs LSH to group similar entities into buckets, reducing the number of comparisons needed.
- Similarity Scoring
  - Combines phonetic encoding, string similarity (Jaro-Winkler), and attribute weighting for fine-grained matching.
- Evaluation
  - Measures performance using precision, recall, and F1-score.
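The feature-engineering step of the pipeline above (keyword-based company detection and the given_name/surname split) could look roughly like the sketch below. The keyword list, the function names, and the "last token is the surname" rule are assumptions for illustration only, not the project's actual rules.

```python
# Hypothetical keyword list; a real one would be tuned to the dataset.
COMPANY_KEYWORDS = {"ltd", "inc", "gmbh", "ag", "llc", "corp", "bank"}


def is_company(name):
    """Flag a name as a company if any token is a known legal-form keyword."""
    tokens = name.lower().replace(".", " ").replace(",", " ").split()
    return any(token in COMPANY_KEYWORDS for token in tokens)


def split_name(name):
    """Split a person name; assumes the last token is the surname."""
    parts = name.split()
    if not parts:
        return {"given_name": "", "surname": ""}
    if len(parts) == 1:
        return {"given_name": "", "surname": parts[0]}
    return {"given_name": " ".join(parts[:-1]), "surname": parts[-1]}
```

For example, `is_company("Acme Trading Ltd.")` is true, and `split_name("Maria Elena Rossi")` puts "Maria Elena" in `given_name` and "Rossi" in `surname`. Names written surname-first would need a different rule.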
- Blocking: Groups entities based on attributes like names, addresses, and phone numbers using n-grams and MinHash.
- Efficiency: Significantly reduces computational costs by limiting comparisons to similar buckets.
- Soundex, Metaphone, NYSIIS: Improve matching accuracy for noisy and diverse name datasets.
- Adaptability: Handles variations in spelling and cultural differences in entity names.
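Of the three phonetic encoders named above, Soundex is compact enough to sketch in plain Python. The version below follows the standard American Soundex rules and is meant as an illustration; the project may well use a library implementation instead, and Metaphone and NYSIIS are not shown.

```python
# Consonant groups of standard American Soundex.
SOUNDEX_CODES = {}
for letters, digit in (
    ("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
    ("l", "4"), ("mn", "5"), ("r", "6"),
):
    for ch in letters:
        SOUNDEX_CODES[ch] = digit


def soundex(name):
    """Encode a name as one letter plus three digits, e.g. Robert -> R163."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    first = letters[0]
    digits = []
    prev = SOUNDEX_CODES.get(first, "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        code = SOUNDEX_CODES.get(ch, "")
        if code and code != prev:
            digits.append(code)
        prev = code  # vowels reset prev, so a repeated code after a vowel counts
    return (first.upper() + "".join(digits) + "000")[:4]
```

Names that sound alike collapse to the same key: "Robert" and "Rupert" both encode to R163, so a spelling variation no longer prevents the two records from being compared.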
- Jaro-Winkler Similarity: Computes a score between 0 and 1 for attributes like names and phone numbers, rewarding strings that share a common prefix.
- Attribute Weighting: Prioritizes key attributes (e.g., IBAN, surname) to align with financial data resolution requirements.
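The two ideas above can be combined as in the following sketch: a plain-Python Jaro-Winkler similarity, followed by a weighted average over attributes. The weights and the attribute names in `WEIGHTS` are hypothetical values picked for illustration; the source does not state the actual weights.

```python
def jaro(s1, s2):
    """Jaro similarity: matching characters within a window, minus transpositions."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1, s2, prefix_scale=0.1):
    """Boost the Jaro score for a shared prefix of up to four characters."""
    score = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return score + prefix * prefix_scale * (1 - score)


# Hypothetical weights: identifiers such as the IBAN count more than names.
WEIGHTS = {"iban": 0.5, "surname": 0.3, "given_name": 0.2}


def weighted_similarity(a, b, weights=WEIGHTS):
    """Weighted mean of per-attribute scores, skipping attributes missing on either side."""
    score = total = 0.0
    for attr, weight in weights.items():
        left, right = a.get(attr), b.get(attr)
        if not left or not right:
            continue
        score += weight * jaro_winkler(left, right)
        total += weight
    return score / total if total else 0.0
```

Skipping missing attributes and renormalizing by the weights actually used keeps a record with an empty IBAN from being penalized as if the IBANs disagreed. Whether that is the right behavior depends on the data, so treat it as one possible design choice.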
- Lowercase normalization and removal of special characters.
- Title stripping for names (e.g., "Dr.", "Mr.").
- Standardized phone numbers.
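The normalization rules listed above can be sketched with the standard library alone. The title list, the function names, and the handling of the international "00" prefix are assumptions for illustration; the project's actual rules may differ.

```python
import re

# Hypothetical list of titles to strip from person names.
TITLES = {"dr", "mr", "mrs", "ms", "prof"}


def normalize_name(name):
    """Lowercase, replace punctuation with spaces, and drop title tokens."""
    text = re.sub(r"[^\w\s]|_", " ", name.lower())  # keeps accented letters
    tokens = [token for token in text.split() if token not in TITLES]
    return " ".join(tokens)


def normalize_phone(phone):
    """Keep digits only and drop a leading international '00' prefix."""
    digits = re.sub(r"\D", "", phone)
    return digits[2:] if digits.startswith("00") else digits
```

With these rules, "Dr. José O'Neil-Smith" becomes "josé o neil smith", and both "+41 21 693-11-11" and "0041 21 693 11 11" become "41216931111", so the two phone formats compare as equal.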
- Place input files in the data/ directory:
  - account_booking_train.csv
  - external_parties_train.csv
  - account_booking_test.csv
  - external_parties_test.csv
- Run the main script:
    python main.py
- Processed files will be saved in the data/processed/ directory, including:
  - external_parties_train.csv
  - submission.csv
- Precision: Share of predicted pairs that are true matches.
- Recall: Share of true matches that the system finds.
- F1-Score: Harmonic mean of precision and recall.
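For pairwise entity resolution, these three metrics can be computed directly from the sets of predicted and true matching pairs. The sketch below assumes pairs are given as 2-tuples of record ids and treats (a, b) and (b, a) as the same pair; the function name is hypothetical.

```python
def pair_metrics(predicted, actual):
    """Return (precision, recall, f1) for predicted vs. true matching pairs."""
    predicted = {tuple(sorted(pair)) for pair in predicted}
    actual = {tuple(sorted(pair)) for pair in actual}
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical example: 2 of 3 predictions are correct, 2 of 4 true pairs are found.
precision, recall, f1 = pair_metrics(
    predicted=[(1, 2), (3, 4), (5, 6)],
    actual=[(2, 1), (3, 4), (7, 8), (9, 10)],
)
```

Here precision is 2/3, recall is 1/2, and F1 is 4/7. Because F1 is a harmonic mean, it stays low whenever either precision or recall is low.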
- Linear Scalability: Handles large datasets efficiently, crucial for financial systems with millions of transactions.
- Domain-Specific Features: Incorporates financial-specific attributes (e.g., IBAN, company detection).
- Accuracy: Combines blocking, similarity, and phonetic features to minimize false positives and negatives.
Created by the Dignity team for LauzHack 2024!