This repository contains the projects of the Data Mining course. There are four projects.
- The first project is about preprocessing, using `pandas` and `Scikit-learn` for this purpose. The dataset we use is the iris dataset.
- The second project is about creating a Neural Network, and we train the model on two datasets: `make_circles` and `fashion_mnist`.
- The third project is about Association Rules and also Clustering; it contains two different sub-projects.
- In the final project, we train a model on a dataset with more than 70k records and decide whether a person has a particular disease or not. For this project we use XGBoost, a gradient-boosted decision-tree method.
This project is about preprocessing, using `pandas` and `Scikit-learn` for this purpose. The dataset we use is the iris dataset, which can be downloaded via this link. I perform the following steps to preprocess the iris dataset:
- Handle missing values: find NaN values and either fill them with proper values or remove them.
- Convert categorical features to numerical features with Label Encoding and One-Hot Encoding.
- Normalize the data frame with `StandardScaler`.
- Reduce dimensionality with PCA.
- Visualize the result.
The visualization of the final result is:
You can access the project and code via this link.
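The preprocessing steps above can be sketched roughly as follows. This is a minimal sketch that loads iris through scikit-learn rather than the downloaded file; the mean-fill strategy and the choice of 2 PCA components are assumptions for illustration, not necessarily the project's exact choices.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Load iris into a DataFrame (via scikit-learn here, not the downloaded file).
iris = load_iris(as_frame=True)
df = iris.frame  # 4 feature columns plus a "target" column

# Handle missing values: fill numeric NaNs with the column mean
# (iris actually has none, so this is a no-op here).
df = df.fillna(df.mean(numeric_only=True))

# Label-encode the class names ("setosa", ... -> 0, 1, 2).
class_names = iris.target_names[df["target"]]
labels = LabelEncoder().fit_transform(class_names)

# Standardize the features, then reduce to 2 dimensions with PCA
# so the result can be plotted.
features = StandardScaler().fit_transform(df.drop(columns="target"))
reduced = PCA(n_components=2).fit_transform(features)
print(reduced.shape)  # (150, 2)
```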
This project is about creating a Neural Network, and we train the model on two datasets: `make_circles` and `fashion_mnist`.
For the first dataset (`make_circles`) I follow these steps:
- Generate 1000 circle samples.
- Split the data into train and test sets.
- Create a Neural Network with two hidden layers.
- Train the model.
- Plot the loss and accuracy.

For the activation functions I used `relu` for the hidden layers and `sigmoid` for the output layer, with `binary_crossentropy` as the loss function.
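A minimal Keras sketch of these steps; the layer widths (16 units), noise level, and epoch count are assumed values, while the activations and loss match the description above:

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
import tensorflow as tf

# 1000 points on two concentric circles (noise level is an assumed value).
X, y = make_circles(n_samples=1000, noise=0.05, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Two relu hidden layers, sigmoid output, binary cross-entropy loss.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(2,)),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
history = model.fit(X_train, y_train, epochs=50,
                    validation_data=(X_test, y_test), verbose=0)

# history.history holds the per-epoch "loss" and "accuracy"
# curves used for plotting.
```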
You can access the code of this section via this link.
For the second dataset (`fashion_mnist`) I follow these steps:
- Load the dataset.
- Split the data into train and test sets.
- Create a Convolutional Neural Network with two hidden layers.
- Train the model.
- Plot the loss and accuracy.
- Print the `confusion_matrix` and `classification_report`.

For the activation functions I used `relu` for the hidden layers and `softmax` for the output layer, with `categorical_crossentropy` as the loss function and `adam` as the optimizer.
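These steps can be sketched as below; the exact conv/pooling layout, filter counts, and the single short training epoch on a subset are assumptions made to keep the sketch small, not the project's exact configuration:

```python
import tensorflow as tf
from sklearn.metrics import classification_report, confusion_matrix

# Load Fashion-MNIST: 60k train / 10k test 28x28 grayscale images.
(X_train, y_train), (X_test, y_test) = \
    tf.keras.datasets.fashion_mnist.load_data()
X_train = X_train[..., None] / 255.0  # add channel dim, scale to [0, 1]
X_test = X_test[..., None] / 255.0
y_train = tf.keras.utils.to_categorical(y_train, 10)
y_test = tf.keras.utils.to_categorical(y_test, 10)

# A small CNN: relu conv/pooling hidden layers, softmax over 10 classes.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(28, 28, 1)),
    tf.keras.layers.Conv2D(32, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(64, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
# One epoch on a subset keeps this sketch fast; the real run trains longer.
model.fit(X_train[:10000], y_train[:10000],
          epochs=1, batch_size=128, verbose=0)

# Confusion matrix and classification report on (part of) the test set.
pred = model.predict(X_test[:2000], verbose=0).argmax(axis=1)
true = y_test[:2000].argmax(axis=1)
print(confusion_matrix(true, pred))
print(classification_report(true, pred))
```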
You can access the code of this section via this link.
This project is about Association Rules and also Clustering; there are two different sub-projects.

For the clustering project I did these tasks:
- Work with the `KMeans` class from `sklearn.cluster` and plot the results.
- Determine an efficient number of clusters with two methods: the elbow method and PCA.
- Work with complex datasets and cluster them.
- Work with the `load_digits` dataset and cluster it.
- Reduce the dimensionality of a picture.
- Apply the `DBSCAN` algorithm to two datasets.
- Determine efficient values for `MinPts` and `epsilon`.
- Plot and compare the results.
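Two of the tasks above, the elbow method with `KMeans` and `DBSCAN` on a non-convex dataset, can be sketched like this; the `make_moons` stand-in dataset and the `eps`/`min_samples` values are assumed for illustration, not the project's tuned choices:

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import load_digits, make_moons

# Elbow method: fit KMeans for a range of k on the digits dataset and
# record the inertia; the "elbow" in the inertia-vs-k plot suggests an
# efficient number of clusters.
digits = load_digits()
inertias = [
    KMeans(n_clusters=k, n_init=10, random_state=0)
    .fit(digits.data).inertia_
    for k in range(2, 15)
]

# DBSCAN on a non-convex dataset where KMeans struggles; eps (epsilon)
# and min_samples (MinPts) are assumed values for this toy data.
X, _ = make_moons(n_samples=500, noise=0.05, random_state=0)
labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)
print(np.unique(labels))  # cluster ids; -1 marks noise points
```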
You can access the code of this section via this link.
For the Association Rules project I did these tasks:
- Work with the Apriori algorithm.
- Load this dataset, preprocess it, and create a data frame.
- Find the frequent itemsets and print them.
- Extract the association rules.
You can access the code of this section via this link.
And finally, in this project I train a model on a dataset with more than 70k records, which you can download via this link, to decide whether a person has diabetes or not. For this project we use XGBoost, a gradient-boosted decision-tree method.
Each record has 21 features, and based on these 21 features we decide whether a person has diabetes or not.
To do this, I carried out the following tasks in order:
- Preprocess the data (load the dataset, rename the columns, fill null values with the mode, normalize, convert categorical features to numerical features with One-Hot Encoding and Min-Max scaling, split the label column off the dataset).
- Build the model (split the train and test data, create an `XGBClassifier`, train the model, print the accuracy, plot the confusion matrix, plot the tree, print precision and recall).
- Tune the parameters with the help of `GridSearchCV` and determine the best parameters.
- Plot the metric changes.
You can access the code of this section via this link.
Project is created with:
- Python version: 3.8