Skip to content

The code snippet cleans and analyzes a hotel bookings dataset, handling missing values, dropping unnecessary columns, and creating new features. It visualizes the data using various plots and performs feature encoding and selection. It then trains machine learning models to predict hotel booking cancellations.

Notifications You must be signed in to change notification settings

Jishnuadhikary10/ml_hotel_booking_prediction

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

5 Commits
 
 
 
 

Repository files navigation

ml_hotel_booking_prediction

ML_Hotel_Booking_Prediction

This project appears to analyze hotel booking data and perform various data cleaning and exploratory data analysis tasks. Here's a brief description of the steps performed:

Importing necessary libraries: Pandas, NumPy, Seaborn, and Matplotlib. Reading the hotel booking data from a CSV file into a Pandas DataFrame. Performing data cleaning tasks such as handling missing values, dropping unnecessary columns, and filling missing values with appropriate values. Analyzing the country of origin for the guests and visualizing it on a choropleth map using Plotly. Analyzing the prices of hotel rooms per night using box plots. Identifying the busiest months for resort and city hotels based on the number of guests. Visualizing the monthly trends using line plots. Analyzing the average daily rate (ADR) of hotels across different months using bar plots and box plots. Determining whether bookings were made for weekdays, weekends, or both, and visualizing the data using stacked bar plots. Creating additional features to improve the machine learning model's performance, such as identifying family bookings, calculating the total number of customers and nights, and encoding categorical features. Handling outliers in numerical features using log transformation. Assessing feature importance using Lasso regression and selecting the relevant features. Splitting the data into training and testing sets and applying logistic regression as a machine learning model. Evaluating the model's performance using confusion matrix, accuracy score, and cross-validation. Applying various other classification models, including Naive Bayes, Random Forest, Decision Tree, and K-Nearest Neighbors, and evaluating their performance. Overall, the project aims to analyze and understand hotel booking data, clean and preprocess the data, and build a machine learning model to predict hotel booking cancellations based on the available features.

Documentation

The code written in Python script for data cleaning and analysis of a hotel bookings dataset. It performs various tasks such as data cleaning, missing value imputation, visualization, feature engineering, feature encoding, feature selection, and model training.

Here is a breakdown of the code:

Importing necessary libraries:

pandas: for data manipulation and analysis

numpy: for numerical operations seaborn: for data visualization

matplotlib.pyplot: for creating plots plotly: for interactive visualizations

chart_studio.plotly: for accessing Plotly charts

sort_dataframeby_monthorweek: for sorting data by month or week Reading the dataset:

The code reads the hotel bookings dataset from a CSV file using the pandas read_csv function and assigns it to a DataFrame named df.

Data cleaning:

The code performs several data cleaning steps, including handling missing values and removing irrelevant columns. It drops the 'agent' and 'company' columns using the drop function. It fills the missing values in the 'country' column with the mode value using the fillna function. It fills all other missing values with 0 using the fillna function.

Data analysis and visualization:

The code includes several sections for analyzing and visualizing different aspects of the dataset. It explores the home country of guests by analyzing the 'is_canceled' column and creates a choropleth map using the Plotly library. It analyzes the prices of hotel rooms per night using box plots. It identifies the busiest months for resort and city hotels by analyzing the 'arrival_date_month' column and creates line charts using Plotly. It analyzes the average daily rates (ADR) in hotels by month using bar plots and box plots. It explores whether bookings were made for weekdays, weekends, or both, and creates stacked bar charts to visualize the data.

Feature engineering:

The code creates additional features to improve the machine learning model's performance. It creates a binary feature 'is_family' based on the number of adults, children, and babies. It calculates the total number of customers and total number of nights using the 'adults', 'children', 'babies', 'stays_in_week_nights', and 'stays_in_weekend_nights' columns. It maps the 'deposit_type' column to numerical values using a dictionary. It drops unnecessary columns and combines the categorical and numerical features into a new DataFrame named 'dataframe'.

Feature encoding:

The code performs feature encoding to convert categorical features into numerical format. It calculates the cancellation rate for each category and maps it to the corresponding categories in the DataFrame.

Handling outliers:

The code handles outliers by applying a logarithmic transformation to the 'lead_time' and 'adr' columns.

Feature selection:

The code uses the Lasso method to select important features for the machine learning model. It selects features with non-zero coefficients from the Lasso model and updates the 'x' DataFrame with the selected features.

Model training and evaluation:

The code splits the dataset into training and test sets using the 'train_test_split' function. It trains a logistic regression model on the training data and makes predictions on the test data. It evaluates the model's performance using a confusion matrix and accuracy score. It also performs cross-validation to assess the model's performance using the 'cross_val_score' function. The code then trains and evaluates several other models, including Naive Bayes, Random Forest,

Features

This code snippet performs various data cleaning, analysis, visualization, feature engineering, feature encoding, feature selection, and machine learning techniques on a hotel booking dataset.

The technologies and algorithms used in the code include:

Libraries:

Pandas, NumPy, Seaborn, Matplotlib, Plotly, Chart Studio Data cleaning: Handling missing values, dropping columns Data analysis: Exploring data shape, checking for missing values, filtering data based on conditions, computing value counts, creating new features, performing statistical analysis Data visualization: Choropleth maps, boxplots, bar charts, line charts Feature engineering: Creating new features based on existing data Feature encoding: Converting categorical data into numerical data using mean encoding Feature selection: Selecting relevant features using Lasso regularization Machine learning models: Logistic Regression, Naive Bayes, Random Forest, Decision Tree, K-Nearest Neighbors Model evaluation: Confusion matrix, accuracy score, cross-validation Overall, the code aims to analyze the hotel bookings dataset, gain insights from the data, create useful features, encode categorical features, select relevant features, and train various machine learning models to predict hotel booking cancellations.

General overview:

The code is written in Python and involves data cleaning, analysis, visualization, feature engineering, and machine learning. Here is a breakdown of the code:

  1. Importing necessary libraries: Pandas, NumPy, Seaborn, and Matplotlib.
  2. Reading a CSV file into a DataFrame using Pandas.
  3. Performing data cleaning tasks such as handling missing values, dropping irrelevant columns, and filling missing values with mode or 0.
  4. Analyzing the country of origin for guests and creating a choropleth map using Plotly.
  5. Analyzing the prices of hotels per night using boxplots and grouping the data by different room types and hotels.
  6. Identifying the busiest month based on hotel bookings and creating a line plot using Plotly.
  7. Analyzing the average daily rate (ADR) in hotels by month using bar plots and boxplots.
  8. Determining if bookings were made for weekdays, weekends, or both and creating a bar plot.
  9. Creating additional features for machine learning models, such as the "is_family" feature and calculating total customers and total nights.
  10. Applying feature encoding to categorical data using mean encoding.
  11. Handling outliers by applying log transformation.
  12. Dropping unnecessary features and handling missing values.
  13. Splitting the data into training and testing sets.
  14. Building a logistic regression model, making predictions, and evaluating performance using confusion matrix, accuracy score, and cross-validation.
  15. Testing other classification models such as Naive Bayes, Random Forest, Decision Tree, and K-Nearest Neighbors.

Overall, this code is a data analysis and machine learning pipeline for a hotel booking dataset, including data cleaning, visualization, feature engineering, and model building.

About

The code snippet cleans and analyzes a hotel bookings dataset, handling missing values, dropping unnecessary columns, and creating new features. It visualizes the data using various plots and performs feature encoding and selection. It then trains machine learning models to predict hotel booking cancellations.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published