HalukSumen/Classification_Alberghi

Classification with 9 different algorithms in Hotel Dataset

Alberghi Classification

Exploratory Data Analysis + Data Visualization + Modelling

1 - Abstract

In this project I performed Exploratory Data Analysis, Data Visualisation and, lastly, Modelling with 9 different models. In the Exploratory Data Analysis I cleaned irrelevant data and NaN values and changed data types for convenience. In the second part I show the data in plots, such as the number of places by type and by province. After these steps I looked at the Pearson and Spearman correlations, which gave very similar results, as expected. Before modelling I split the data into training and testing sets, with a test size of 0.33. Then I applied each model; the algorithms I used for this project are Logistic Regression, K Neighbors Classification, Decision Tree Classification, Random Forest Classification, AdaBoost Classification, Gradient Boosting Classification, XGB Classification, ExtraTrees Classification and Bagging Classification. Finally, the Random Forest Classifier gives the best result, but tuning the algorithms or cleaning the data further (which I believe would shrink the dataset considerably) could be effective.

2 - Data

The dataset contains 6775 rows and 25 columns. Description and type of each column:

  • ID int64 - id
  • PROVINCIA object - id of state
  • COMUNE object - name of city
  • LOCALITA object - name of town
  • CAMERE int64 - number of rooms
  • SUITE int64 - number of suites
  • LETTI int64 - number of beds
  • BAGNI int64 - number of bathrooms
  • PRIMA_COLAZIONE int64 - breakfast included or not
  • IN_ABITATO float64 - in a built-up area or not
  • SUL_LAGO float64 - close to a lake or not
  • VICINO_ELIPORTO float64 - close to a heliport or not
  • VICINO_AEREOPORTO float64 - close to an airport or not
  • ZONA_CENTRALE float64 - in the central zone or not
  • VICINO_IMP_RISALITA float64 - close to a ski lift or not
  • ZONA_PERIFERICA float64 - in a suburb or not
  • ZONA_STAZIONE_FS float64 - close to a railway station or not
  • ATTREZZATURE_VARIE object - equipment types (elevator, parking, restaurants etc.)
  • CARTE_ACCETTATE object - accepted credit cards (Visa, Mastercard etc.)
  • LINGUE_PARLATE object - languages spoken by the host or hotel
  • SPORT object - sport options (football, table tennis etc.)
  • CONGRESSI object - congress room(s)
  • LATITUDINE float64 - latitude
  • LONGITUDINE float64 - longitude
  • OUTPUT object - types of places

3 - Exploratory Data Analysis

Firstly, I checked the data types and the number of NaN values in each column. After that I decided which columns and which rows to delete: I dropped the LOCALITA, SPORT, CONGRESSI, LATITUDINE and LONGITUDINE columns, and I dropped the rows with NaN values in the IN_ABITATO, SUL_LAGO, VICINO_ELIPORTO, VICINO_AEREOPORTO, ZONA_CENTRALE, VICINO_IMP_RISALITA, ZONA_PERIFERICA and ZONA_STAZIONE_FS columns. But I kept 3 columns that contain a very high number of NaN values, because the data they contain could be helpful for future work.
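The cleaning steps above can be sketched with pandas on a small toy frame (column names are taken from the dataset; the real code would load the full CSV and list all of the columns named above):

```python
import pandas as pd
import numpy as np

# Toy stand-in for the hotel DataFrame.
df = pd.DataFrame({
    "CAMERE": [10, 20, 30],
    "SUL_LAGO": [1.0, np.nan, 0.0],
    "LOCALITA": ["a", "b", "c"],
    "SPORT": [None, "calcio", None],
})

# Drop the columns judged irrelevant for modelling
# (in the project: LOCALITA, SPORT, CONGRESSI, LATITUDINE, LONGITUDINE).
df = df.drop(columns=["LOCALITA", "SPORT"])

# Drop only the rows with NaN in the binary location columns,
# keeping the high-NaN columns that may help future work.
df = df.dropna(subset=["SUL_LAGO"])
```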

Pearson Correlation

Spearman Correlation
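Both correlation matrices can be computed with pandas' `corr` method; a minimal sketch on made-up numbers (the project runs this on the cleaned hotel columns):

```python
import pandas as pd

# Toy data: rooms vs beds move together, so both coefficients are high.
df = pd.DataFrame({"CAMERE": [1, 2, 3, 4], "LETTI": [2, 4, 7, 8]})

pearson = df.corr(method="pearson")    # linear correlation
spearman = df.corr(method="spearman")  # rank (monotonic) correlation
```

For strictly monotonic data like this, Spearman is exactly 1.0 while Pearson is slightly below it, which matches the "very similar results" observed in the project.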

4 - Data Visualization

Number of Places According to Their Types

Number of Hotels According to Province

Number of Room Comparing to Bed

Importances of Columns
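Column importances like those plotted above are typically read from a fitted tree ensemble's `feature_importances_` attribute; a sketch on synthetic stand-in data (the project would fit on the hotel features instead):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the hotel features.
X, y = make_classification(n_samples=200, n_features=5, random_state=42)

model = RandomForestClassifier(random_state=42).fit(X, y)

# One importance per column; the values sum to 1.
importances = model.feature_importances_
```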

5 - Modelling

  • 5.1 - Logistic Regression

Logistic Regression is used to predict a categorical dependent variable from a given set of independent variables.

Logistic Regression

  • 5.2 - K Neighbors Classification

A non-parametric classification method that assigns each sample the majority class of its k nearest neighbours.

K Neighbors Classification

  • 5.3 Decision Tree Classification

Breaks the data into smaller subsets in the form of a tree structure.

Decision Tree Classification

  • 5.4 - Random Forest Classification

Consists of many decision trees built with bagging and feature randomness; the forest then averages/votes over the trees to give the result.

Random Forest Classification

  • 5.5 - AdaBoost Classification

A meta-estimator that begins by fitting a classifier on the original dataset and then fits additional copies of the classifier on the same dataset, with the weights of incorrectly classified instances adjusted at each round.

AdaBoost Classification

  • 5.6 - Gradient Boosting Classification

Combines weak learning models to create a strong model.

Gradient Boosting Classification

  • 5.7 - XGB Classification

An implementation of gradient boosted decision trees optimised for speed and performance.

XGB Classification

  • 5.8 - ExtraTrees Classification

Implements a meta-estimator that fits a number of randomised decision trees on various subsets of the dataset and uses averaging/voting to improve prediction.

ExtraTrees Classification

  • 5.9 - Bagging Classification

An ensemble meta-estimator that fits base classifiers on random subsets of the original dataset and then aggregates their individual predictions to form a final prediction.

Bagging Classification
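All of the models above share scikit-learn's fit/score API, so the whole comparison can be sketched as one loop. This uses synthetic stand-in data and the project's 0.33 test size; `XGBClassifier` comes from the separate `xgboost` package but fits the same API, so it is left as a comment:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              ExtraTreesClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)

# Synthetic stand-in for the cleaned hotel data.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=0)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "K Neighbors": KNeighborsClassifier(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
    "AdaBoost": AdaBoostClassifier(random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
    "Extra Trees": ExtraTreesClassifier(random_state=0),
    "Bagging": BaggingClassifier(random_state=0),
    # "XGB": xgboost.XGBClassifier()  # same fit/score interface
}

# Fit each model and record its accuracy on the held-out test set.
scores = {name: model.fit(X_train, y_train).score(X_test, y_test)
          for name, model in models.items()}
```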

6 - Result & Future Work

  • Logistic Regression Score: 0.6460699681962744
  • K Neighbors Classifier Score: 0.6338028169014085
  • DecisionTree Classifier Score: 0.8391640163562017
  • Random Forest Classifier Score: 0.8727850976828714
  • AdaBoost Classifier Score: 0.5483870967741935
  • Gradient Boosting Classifier Score: 0.8714220808723308
  • XGB Classifier Score: 0.7878237164925034
  • ExtraTrees Classifier Score: 0.8632439800090868
  • Bagging Classifier Score: 0.8514311676510677

According to the scores, the Random Forest Classifier gives the best result with 0.872785. Gradient Boosting comes very close with 0.871422, and AdaBoost gives the worst performance with 0.548387. In the end the Random Forest Classifier gives the best result, but tuning XGB might increase its score.