A malicious URL detector built with several data mining, machine learning, and data science concepts, techniques, and algorithms (PAs 1 and 2 from the Applied Data Mining course - DCOMP - UFSJ).
All the project dependencies (languages, libraries, package managers, frameworks, ...) are listed in this section, along with the instructions to install each of them.
./install_dependencies.sh
-
Python3 and pip package manager:
sudo apt install python3 python3-pip build-essential python3-dev
-
Node.js package manager - npm (optional):
sudo apt-get install npm
-
scikit-learn library:
pip install -U scikit-learn
-
xgboost library:
pip install xgboost
-
mlxtend library:
pip install mlxtend
-
imbalanced-learn library:
pip install imbalanced-learn
-
pandas library:
pip install pandas
-
joblib library:
pip install joblib
-
Matplotlib library:
pip install matplotlib
-
seaborn library:
pip install seaborn
-
numpy library:
pip install numpy
-
Beautiful Soup library:
pip install beautifulsoup4
-
mechanize library:
pip install mechanize
-
Random User Agents library:
pip install random_user_agent
-
PyCryptodome library:
pip install pycryptodomex
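
After installing the Python libraries above, a quick check can confirm that each one is importable. The following is a minimal sketch, not part of the project: the helper name `missing_packages` is hypothetical, and the mapping pairs each pip package name from this section with the module name it is imported under (for example, `pycryptodomex` is imported as `Cryptodome`).

```python
from importlib.util import find_spec

# pip package name -> module name used in "import ..."
REQUIRED_MODULES = {
    "scikit-learn": "sklearn",
    "xgboost": "xgboost",
    "mlxtend": "mlxtend",
    "imbalanced-learn": "imblearn",
    "pandas": "pandas",
    "joblib": "joblib",
    "matplotlib": "matplotlib",
    "seaborn": "seaborn",
    "numpy": "numpy",
    "beautifulsoup4": "bs4",
    "mechanize": "mechanize",
    "random_user_agent": "random_user_agent",
    "pycryptodomex": "Cryptodome",
}


def missing_packages(required=None):
    """Return the pip names of packages whose module cannot be found."""
    if required is None:
        required = REQUIRED_MODULES
    return [
        pip_name
        for pip_name, module_name in required.items()
        if find_spec(module_name) is None  # None means "not installed"
    ]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages: " + ", ".join(missing))
    else:
        print("All Python dependencies are importable.")
```

Any package reported as missing can then be installed with the matching `pip install` command from this section.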
-
To install all GUI dependencies:
npm i
-
Vue.js framework:
npm install -g @vue/cli
-
Bootstrap framework:
npm install bootstrap@4.6.0 --save
-
axios library:
npm i axios
-
Font Awesome tool kit:
npm i --save @fortawesome/free-solid-svg-icons && npm i --save @fortawesome/vue-fontawesome@latest-2
All the instructions for exploring the project functionalities are listed in this section, as well as the commands to execute each application.
You can explore all functionalities (different models, datasets, ...) by modifying (or uncommenting) a few parts of the source code.
python3 main.py
python3 phishing_scraper.py
-
Inside the src directory, run the command using the following template:
python3 predict.py cli <url> <algorithm>
-
Example with a phishing URL:
python3 predict.py cli https://bujhanginamfb.github.io/taelasos/update-recovry/ XGB
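
To classify several URLs in one go, the command template above can be driven from a small script. This is a rough sketch under stated assumptions: the helper names `build_predict_command` and `classify` are hypothetical, `XGB` is the only algorithm name taken from this README, and the output of `predict.py` is returned as raw text because its exact format is not documented here.

```python
import subprocess
import sys
from typing import List


def build_predict_command(url: str, algorithm: str) -> List[str]:
    """Build the argument list for: python3 predict.py cli <url> <algorithm>."""
    return [sys.executable, "predict.py", "cli", url, algorithm]


def classify(url: str, algorithm: str = "XGB", src_dir: str = "src") -> str:
    """Run predict.py from the src directory and return whatever it prints."""
    result = subprocess.run(
        build_predict_command(url, algorithm),
        cwd=src_dir,          # the command is meant to run inside src
        capture_output=True,
        text=True,
        check=True,           # raise if predict.py exits with an error
    )
    return result.stdout


if __name__ == "__main__":
    # Assumes the script is started from the repository root.
    for candidate in sys.argv[1:]:
        print(candidate)
        print(classify(candidate))
```

Passing the arguments as a list avoids shell quoting problems with URLs that contain characters such as `&` or `?`.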
-
Open two terminal instances and execute the following commands in each one of them, respectively.
-
Terminal 1 - Back-end (inside the src directory):
python3 predict.py server
-
Terminal 2 - Front-end (inside the url-detector directory):
npm run serve
-
You should receive two URLs as outputs (http://localhost:<port number>). To view the application, open either of them in a browser of your choice. The front-end server (GUI) should be running at: http://localhost:8080
-
Finally, feel free to test the model with your own URLs! 🍾
Because the models were trained on the Kaggle dataset, their reliability can vary considerably with the format of the URL the user enters. Most of the URLs in the Kaggle dataset do not specify their communication protocol (HTTP, HTTPS, ...), which can introduce a large bias in the trained models and their results, making the classifications quite unstable.
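
One thing to try when results look unstable is to submit the URL without its protocol prefix, so that the input resembles most of the training data. The snippet below is only a sketch of that idea: the function name `strip_scheme` is hypothetical, it is not part of the project, and whether it actually improves the classifications has not been verified.

```python
import re

# Matches a leading scheme such as "http://", "https://" or "ftp://".
_SCHEME_PREFIX = re.compile(r"^[a-zA-Z][a-zA-Z0-9+.\-]*://")


def strip_scheme(url: str) -> str:
    """Return the URL without a leading 'scheme://' prefix, if it has one."""
    return _SCHEME_PREFIX.sub("", url.strip(), count=1)


if __name__ == "__main__":
    for example in ("https://example.com/login", "example.com/login"):
        print(strip_scheme(example))
```

Comparing the prediction for the original URL with the prediction for the stripped one gives a quick sense of how sensitive a given model is to the protocol prefix.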