Phishing detection with machine learning

This project uses machine learning to classify a website as phishing or legitimate.

About

Definition

Phishing is a common social engineering technique in which a website mimics trusted uniform resource locators (URLs) and webpages.

Dataset

The models in this project were trained on two datasets:

https://www.kaggle.com/datasets/shashwatwork/phishing-dataset-for-machine-learning/data (some features have been changed to allow real-time inference)
https://huggingface.co/datasets/ealvaradob/phishing-dataset (warning: the webpage code in this dataset contains numerous viruses and trojans)

Approach

This project combines two methods:

Feature extraction for tabular data classification
Bag-of-words NLP classification

Both methods are used, and the final answer combines their outputs based on accuracy.
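As a rough illustration of the two methods, the sketch below derives a few tabular features from a URL, builds a bag-of-words count from page text, and merges two phishing probabilities with a weighted average. The feature names, the tokenizer, and the weighting rule are all assumptions made for this example; they are not taken from the project code, which may extract different features and combine the models differently.

```python
import re
from collections import Counter
from urllib.parse import urlparse


def url_features(url: str) -> dict:
    """Hypothetical tabular features computed from the URL string alone."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    return {
        "url_length": len(url),
        "num_dots": url.count("."),
        "num_dashes": url.count("-"),
        "has_at_symbol": int("@" in url),
        "uses_https": int(parsed.scheme == "https"),
        "host_is_ip": int(bool(re.fullmatch(r"\d{1,3}(\.\d{1,3}){3}", host))),
    }


def bag_of_words(text: str) -> Counter:
    """Lowercased word counts; a stand-in for a fitted vectorizer."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def combine(p_tabular: float, p_text: float, w_tabular: float, w_text: float) -> float:
    """Weighted average of two phishing probabilities.

    Assumption: each weight is a validation score of the corresponding model.
    """
    return (p_tabular * w_tabular + p_text * w_text) / (w_tabular + w_text)


if __name__ == "__main__":
    print(url_features("http://192.168.0.1/login-update@secure"))
    print(bag_of_words("Verify your account. Verify NOW!"))
    print(combine(0.9, 0.5, w_tabular=0.974, w_text=0.959))
```

With a weighted average, the model that scored better on validation data has slightly more influence on the final answer, which is one plausible reading of "combined based on accuracy".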

Models

Multiple models were tested. For the sake of accuracy and inference speed, two were chosen:

RandomForestClassifier (sklearn)
GradientBoostingClassifier (lightgbm)

The models have been compressed to fit on GitHub. To speed up inference, either create a PhishingDetector object to keep them loaded, or recompress them yourself.
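The benefit of keeping the models loaded can be sketched with a small cache that decompresses each file only once. This is not the project's PhishingDetector class, whose interface is not described here; the class name CachedModelLoader, the gzip plus pickle storage format, and the file names are assumptions for illustration only.

```python
import gzip
import pickle
from pathlib import Path


class CachedModelLoader:
    """Hypothetical helper: decompress each model file once, then reuse it."""

    def __init__(self, model_dir: str):
        self.model_dir = Path(model_dir)
        self._cache = {}

    def get(self, name: str):
        # Decompress and unpickle only on first access; later calls hit the cache.
        if name not in self._cache:
            with gzip.open(self.model_dir / name, "rb") as fh:
                self._cache[name] = pickle.load(fh)
        return self._cache[name]
```

Every call to get after the first returns the object already in memory, so the decompression cost is paid once per process rather than once per prediction. Unpickling executes code from the file, so only load model files from a source you trust.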

Metrics

The main metric is the F1 score. Because the models were trained on different datasets, their scores are reported separately:

RandomForestClassifier f1 score = 0.974 (97%)
GradientBoostingClassifier f1 score = 0.959 (96%)
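For reference, the F1 score is the harmonic mean of precision and recall. The sketch below computes it from confusion-matrix counts; the counts in the usage line are made up for illustration and are not the project's evaluation results.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 from true positives, false positives, and false negatives."""
    if tp == 0:
        # No correct positive predictions: precision and recall are both zero.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Illustrative counts only.
    print(f1_score(tp=8, fp=2, fn=2))
```

Because F1 ignores true negatives, it stays informative when legitimate sites far outnumber phishing sites in the evaluation data.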

Usage

This project uses Poetry with Python 3.10, so Python should already be installed. If Poetry is already installed in your main environment, it will create its own virtual environment automatically, so you can skip the "Create new VENV" and "Install poetry" steps.

Clone repository

git clone https://github.com/Golgovskiy/Phishing-Detection-ML.git <your folder name>

Enter the root folder in a shell, or open a shell in it

cd <your folder path>

Create new VENV

python -m venv <your_folder_path>\.venv

Install poetry

pip install poetry

Then install dependencies

poetry install

Finally, run the script with the URL to check

poetry run python .\console_app.py <your_url>

To start the UI or API, run

poetry run python .\launch_api.py

