In this project I carried out Exploratory Data Analysis, Data Visualisation and, lastly, Modelling with 9 different models. In the Exploratory Data Analysis I removed irrelevant data and NaN values and changed data types for easier handling. In the second part I show the data in plots, such as the number of places by type and by province. After these steps I computed the Pearson and Spearman correlations, which gave very similar results, as expected. Before modelling I split the data into training and testing sets, with a test size of 0.33. Then I applied each model; the algorithms I used for this project are Logistic Regression, K Neighbors Classification, Decision Tree Classification, Random Forest Classification, AdaBoost Classification, Gradient Boosting Classification, XGB Classification, ExtraTrees Classification and Bagging Classification. Finally, the Random Forest Classifier gives the best result, but tuning the algorithms or cleaning the data further (which I believe would shrink the dataset considerably) could be effective.
The dataset contains 6775 rows and 25 columns. Description and type of each column:
- ID int64 - id
- PROVINCIA object - id of the province
- COMUNE object - name of city
- LOCALITA object - name of town
- CAMERE int64 - number of rooms
- SUITE int64 - number of suites
- LETTI int64 - number of beds
- BAGNI int64 - number of bathrooms
- PRIMA_COLAZIONE int64 - breakfast included or not
- IN_ABITATO float64 - in a built-up area or not
- SUL_LAGO float64 - close to lake or not
- VICINO_ELIPORTO float64 - close to heliport or not
- VICINO_AEREOPORTO float64 - close to airport or not
- ZONA_CENTRALE float64 - in the city center or not
- VICINO_IMP_RISALITA float64 - close to ski lifts or not
- ZONA_PERIFERICA float64 - suburb or not
- ZONA_STAZIONE_FS float64 - close to station or not
- ATTREZZATURE_VARIE object - equipment types (elevator, parking, restaurants, etc.)
- CARTE_ACCETTATE object - accepted credit cards (Visa, Mastercard, etc.)
- LINGUE_PARLATE object - languages spoken by the host or hotel
- SPORT object - sport options (football, table tennis, etc.)
- CONGRESSI object - congress room(s)
- LATITUDINE float64 - latitude
- LONGITUDINE float64 - longitude
- OUTPUT object - types of places
Firstly, I checked the data types and the number of NaN values in each column. After that I decided which columns and which rows to delete. I deleted the LOCALITA, SPORT, CONGRESSI, LATITUDINE and LONGITUDINE columns, and I deleted the rows with NaN values in the IN_ABITATO, SUL_LAGO, VICINO_ELIPORTO, VICINO_AEREOPORTO, ZONA_CENTRALE, VICINO_IMP_RISALITA, ZONA_PERIFERICA and ZONA_STAZIONE_FS columns. But I kept 3 columns that contain a very high number of NaN values, because the data they contain could be helpful for future work.
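The cleaning step above can be sketched with pandas. This is a minimal illustration on a toy DataFrame that mimics a few of the real columns (the actual CSV and variable names are not shown in this README, so they are assumptions here); the same `drop` / `dropna` / `astype` calls apply to the full dataset.

```python
import pandas as pd
import numpy as np

# Toy frame standing in for the real 6775x25 dataset; only a few
# columns are mimicked here for illustration.
df = pd.DataFrame({
    "ID": [1, 2, 3],
    "SUL_LAGO": [1.0, np.nan, 0.0],
    "ZONA_CENTRALE": [0.0, 1.0, 1.0],
    "LATITUDINE": [45.1, 45.2, 45.3],
    "OUTPUT": ["hotel", "b&b", "hotel"],
})

# Drop columns that are not used for modelling (full list in the text above).
df = df.drop(columns=["LATITUDINE"])

# Drop rows with NaN in the 0/1 location flags, then cast them
# from float64 to int64 for easier handling.
flag_cols = ["SUL_LAGO", "ZONA_CENTRALE"]
df = df.dropna(subset=flag_cols)
df[flag_cols] = df[flag_cols].astype("int64")
```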
Pearson Correlation
Spearman Correlation
Number of Places According to Their Types
Number of Hotels According to Province
Number of Rooms Compared to Beds
Importance of Columns
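Both correlation matrices above come from the same pandas call with a different `method` argument. A small sketch on toy numeric columns (the values below are made up for illustration):

```python
import pandas as pd

# Toy numeric columns standing in for the dataset's count features.
num = pd.DataFrame({
    "CAMERE": [10, 25, 40, 60],
    "LETTI":  [20, 48, 85, 120],
    "BAGNI":  [8, 20, 35, 55],
})

# Pearson measures linear correlation; Spearman measures rank
# (monotonic) correlation. On this dataset the two were very similar.
pearson = num.corr(method="pearson")
spearman = num.corr(method="spearman")
```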
- 5.1 - Logistic Regression
Used to predict a categorical dependent variable from a given set of independent variables.
Logistic Regression
- 5.2 - K Neighbors Classification
A non-parametric classification method.
K Neighbors Classification
- 5.3 Decision Tree Classification
Breaks the data into smaller subsets in the form of a tree structure.
Decision Tree Classification
- 5.4 - Random Forest Classification
Consists of many decision trees built using bagging and randomness; the final result is obtained by averaging/voting across the trees.
Random Forest Classification
- 5.5 - AdaBoost Classification
A meta-estimator that begins by fitting a classifier on the original dataset and then fits additional copies of the classifier on the same dataset, focusing on the instances that were misclassified.
AdaBoost Classification
- 5.6 - Gradient Boosting Classification
Combines weak learning models to create a strong model.
Gradient Boosting Classification
- 5.7 - XGB Classification
An implementation of gradient boosted decision trees that is more efficient in performance.
XGB Classification
- 5.8 - ExtraTrees Classification
A meta-estimator that fits a number of randomized decision trees on various subsets of the dataset and uses averaging/voting to improve the prediction.
ExtraTrees Classification
- 5.9 - Bagging Classification
An ensemble meta-estimator that fits base classifiers on random subsets of the original dataset and then aggregates their individual predictions to form a final prediction.
Bagging Classification
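The split-and-score workflow for all of the models above can be sketched with scikit-learn. Synthetic data stands in for the cleaned hotel features and the OUTPUT target (the real variable names are assumptions), and `XGBClassifier`, which lives in the separate `xgboost` package, is omitted to keep the sketch scikit-learn only:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              ExtraTreesClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)

# Synthetic stand-in for the cleaned feature matrix and OUTPUT target.
X, y = make_classification(n_samples=500, n_features=10, random_state=42)

# Same split ratio as in the project: test size 0.33.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "K Neighbors": KNeighborsClassifier(),
    "Decision Tree": DecisionTreeClassifier(),
    "Random Forest": RandomForestClassifier(),
    "AdaBoost": AdaBoostClassifier(),
    "Gradient Boosting": GradientBoostingClassifier(),
    "Extra Trees": ExtraTreesClassifier(),
    "Bagging": BaggingClassifier(),
}

# Fit each model on the training set and score accuracy on the test set.
scores = {name: model.fit(X_train, y_train).score(X_test, y_test)
          for name, model in models.items()}
```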
- Logistic Regression Score: 0.6460699681962744
- K Neighbors Classifier Score: 0.6338028169014085
- Decision Tree Classifier Score: 0.8391640163562017
- Random Forest Classifier Score: 0.8727850976828714
- AdaBoost Classifier Score: 0.5483870967741935
- Gradient Boosting Classifier Score: 0.8714220808723308
- XGB Classifier Score: 0.7878237164925034
- ExtraTrees Classifier Score: 0.8632439800090868
- Bagging Classifier Score: 0.8514311676510677
According to the scores, the Random Forest Classifier gives the best result with 0.872785. Gradient Boosting comes very close with 0.871422, and AdaBoost gives the worst performance with 0.548387. In the end the Random Forest Classifier gives the best result, but tuning XGB might increase its score.
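The tuning suggested in the conclusion could be done with a grid search over the Random Forest. A minimal sketch on synthetic data; the parameter grid values here are assumptions, not the settings actually tried in this project:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the cleaned feature matrix and OUTPUT target.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Cross-validated search over a small, illustrative parameter grid.
grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [100, 200], "max_depth": [None, 10]},
    cv=3,
)
grid.fit(X, y)

# grid.best_params_ and grid.best_score_ hold the tuned configuration.
```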