An end-to-end machine learning pipeline to detect "Address Poisoning" attacks on the Ethereum blockchain.
Address poisoning is a deceptive tactic where attackers send small or zero-value transactions from addresses that mimic a user's recent counterparties (often by matching the first and last few characters). The goal is to "poison" the user's transaction history so that they might accidentally copy the attacker's address for future transfers.
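The prefix/suffix mimicry described above can be illustrated with a small check. This is a minimal sketch, not the project's detection logic: the function name `is_lookalike`, the default of four matching characters at each end, and the example addresses are all assumptions made for illustration.

```python
def _hex_body(address: str) -> str:
    """Lowercase the address and drop a leading '0x' if present."""
    address = address.lower()
    return address[2:] if address.startswith("0x") else address


def is_lookalike(candidate: str, known: str,
                 prefix_len: int = 4, suffix_len: int = 4) -> bool:
    """Return True if `candidate` differs from `known` but shares
    its first `prefix_len` and last `suffix_len` hex characters."""
    a, b = _hex_body(candidate), _hex_body(known)
    if a == b:
        return False  # the same address is not a spoof of itself
    return a[:prefix_len] == b[:prefix_len] and a[-suffix_len:] == b[-suffix_len:]


# Made-up 40-character addresses, for illustration only.
legit = "0x" + "1a2b" + "0" * 32 + "c3d4"
spoof = "0x" + "1A2B" + "f" * 32 + "C3D4"
unrelated = "0x" + "9" * 40

print(is_lookalike(spoof, legit))      # same ends, different middle
print(is_lookalike(unrelated, legit))  # nothing in common
```

A wallet UI that truncates addresses to their first and last characters would display `legit` and `spoof` identically, which is what makes the attack effective.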
This project provides tools to:
- Extract Data: Query a MySQL Ethereum database to collect transaction metadata.
- Engineer Features: Calculate metrics like counterparty frequency and transaction bursts.
- Train & Detect: Utilize a Support Vector Machine (SVM) model to classify addresses as malicious or benign.
- Language: Python 3.7
- Database: MySQL (Ethereum blockchain data)
- Libraries: Pandas, Scikit-learn, Matplotlib, Seaborn
- Environment: Pipenv, Jupyter Notebooks
- scripts/: Python and Bash scripts for data collection.
  - gather_addresses_metadata.py: The primary data extraction engine.
  - start_dataset_generation.sh: Wrapper for the extraction process.
- address_poisoning_dataset.ipynb: Notebook for data exploration and preprocessing.
- address_poisining_model.ipynb: Notebook for model training (SVC) and evaluation.
- docs/: Visual documentation and diagrams.
- dataset/: (Required) Folder for input/output CSV data.
- Python 3.7 and Pipenv.
- Access to an Ethereum MySQL database.
- Create a dataset/ directory in the project root.
- Environment Variables: Copy .env.example to .env and update it with your database credentials:

      cp .env.example .env
    # Install dependencies
    pipenv install
    # Enter the virtual environment
    pipenv shell

Update dataset/address_poisoning_addresses_list.csv with the target phishing addresses, then run:

    bash scripts/start_dataset_generation.sh

This will generate address_poisoning_transactions.csv and use address_poisoning_transactions_checkpoint.txt to track progress.
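Checkpoint-based progress tracking can be sketched as follows. The format of the real checkpoint file is not documented here, so this example assumes it stores a single integer: the number of addresses already processed. The function names and the `handle` callback are hypothetical.

```python
import os


def read_checkpoint(path):
    """Return how many addresses were already processed (0 if no checkpoint)."""
    try:
        with open(path) as fh:
            return int(fh.read().strip() or 0)
    except FileNotFoundError:
        return 0


def write_checkpoint(path, count):
    """Persist progress via a temp file and rename, so an interrupted
    write cannot leave a half-written checkpoint behind."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as fh:
        fh.write(str(count))
    os.replace(tmp_path, path)


def process_addresses(addresses, checkpoint_path, handle):
    """Call `handle` on each address, skipping those already done."""
    start = read_checkpoint(checkpoint_path)
    for i in range(start, len(addresses)):
        handle(addresses[i])                      # may raise, e.g. on a DB error
        write_checkpoint(checkpoint_path, i + 1)  # record only completed work
```

Because the checkpoint is updated only after an address has been handled, rerunning after a failure retries the address that failed and skips the ones that succeeded.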
Launch Jupyter and open the notebooks:

    jupyter notebook

- Run address_poisoning_dataset.ipynb to analyze the raw transaction data.
- Run address_poisining_model.ipynb to train the classifier and visualize detection performance.
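For orientation, training an SVC with Scikit-learn might look roughly like the sketch below. It is not taken from the notebook: the toy feature rows are invented for illustration, and the scaling step, kernel, and hyperparameters are assumptions rather than the project's actual settings.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Invented toy rows: [is_repeat_counterparty, counterparty_tx_count, burst_flag]
X = np.array([
    [1, 12, 0], [1, 8, 0], [1, 20, 0], [1, 15, 0],  # labeled benign (0)
    [0, 1, 1], [0, 1, 1], [0, 2, 1], [0, 1, 1],     # labeled malicious (1)
])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# Scale features first: SVMs are sensitive to differing feature ranges.
# Kernel and hyperparameters are assumed values, not the notebook's.
model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
model.fit(X, y)

# Classify two unseen rows, one resembling each group.
predictions = model.predict(np.array([[1, 10, 0], [0, 1, 1]]))
print(predictions)
```

On real data the notebook's train/test split and evaluation metrics matter far more than this toy fit; the sketch only shows the shape of the API calls.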
The classifier relies on several engineered features:
- is_repeat_counterparty: Identifies if a transaction pair has been seen before.
- counterparty_tx_count: The total number of interactions between two addresses.
- burst_flag: Detects rapid-fire transactions within a short time threshold (5 minutes).
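One way these three features could be derived with Pandas is sketched below. The column names (`from_address`, `to_address`, `timestamp`) are assumptions about the dataset's schema, and two definitions are interpretations rather than confirmed behavior: address pairs are treated as unordered, and a burst is a transaction sent within 5 minutes of the same sender's previous one.

```python
import pandas as pd

BURST_WINDOW = pd.Timedelta(minutes=5)


def add_features(tx: pd.DataFrame) -> pd.DataFrame:
    """Return a time-sorted copy of `tx` with the three feature columns added."""
    df = tx.sort_values("timestamp").reset_index(drop=True)

    # Order-independent key, so A->B and B->A count as the same pair.
    df["pair"] = ["|".join(sorted(p))
                  for p in zip(df["from_address"], df["to_address"])]

    # True once the pair has appeared in an earlier transaction.
    df["is_repeat_counterparty"] = df.groupby("pair").cumcount() > 0

    # Total number of transactions between the two addresses.
    df["counterparty_tx_count"] = df.groupby("pair")["pair"].transform("size")

    # Time since the same sender's previous transaction (NaT for the first).
    gap = df.groupby("from_address")["timestamp"].diff()
    df["burst_flag"] = gap <= BURST_WINDOW  # NaT compares as False

    return df.drop(columns="pair")


# Placeholder addresses and times, for illustration only.
tx = pd.DataFrame({
    "from_address": ["0xaaa", "0xaaa", "0xccc", "0xbbb"],
    "to_address":   ["0xbbb", "0xbbb", "0xaaa", "0xaaa"],
    "timestamp": pd.to_datetime([
        "2024-01-01 10:00", "2024-01-01 10:02",
        "2024-01-01 10:30", "2024-01-01 11:00",
    ]),
})
features = add_features(tx)
print(features)
```

Note that `counterparty_tx_count` as written uses the whole dataset, including later transactions; a per-row running count would be needed if features must only reflect information available at transaction time.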
Database credentials are managed via environment variables using python-dotenv. A template is provided in .env.example. Never commit your .env file to version control.
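A minimal sketch of reading those credentials is shown below. The variable names (`DB_HOST`, `DB_PORT`, `DB_USER`, `DB_PASSWORD`, `DB_NAME`) and the helper `load_db_config` are hypothetical; check .env.example for the names the scripts actually expect.

```python
import os

try:
    from dotenv import load_dotenv  # provided by the python-dotenv package
except ImportError:
    load_dotenv = None  # fall back to variables already set in the shell

# Hypothetical variable names; the real ones are listed in .env.example.
REQUIRED = ("DB_HOST", "DB_USER", "DB_PASSWORD", "DB_NAME")


def load_db_config(env=None):
    """Build a connection-settings dict from environment variables."""
    if env is None:
        if load_dotenv is not None:
            load_dotenv()  # loads variables from a .env file into os.environ
        env = os.environ

    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError("Missing database settings: " + ", ".join(missing))

    return {
        "host": env["DB_HOST"],
        "port": int(env.get("DB_PORT", "3306")),  # 3306 is MySQL's default port
        "user": env["DB_USER"],
        "password": env["DB_PASSWORD"],
        "database": env["DB_NAME"],
    }
```

Failing fast on missing variables gives a clearer error than a connection timeout later in the extraction run.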