This project builds a predictive model of bike rental demand in Seoul from a dataset of 8,760 rows and 14 columns. The workflow begins with data collection and a preliminary analysis of the dataset's basic characteristics, including its dimensions and data types. Data filtering and cleaning follow, removing superfluous columns and handling missing values to improve data quality.
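
The preliminary checks and cleaning steps described above can be sketched roughly as follows. The tiny frame is made-up stand-in data, and the `Unnamed: 0` column and the median fill are illustrative assumptions, not the project's actual choices.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real dataset (8,760 rows and 14 columns).
df = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2, 3],                  # assumed superfluous column
    "Hour": [0, 1, 2, 3],
    "Temperature": [-5.2, np.nan, -6.0, -6.2],   # one missing value
    "Rented Bike Count": [254, 204, 173, 107],
})

# Preliminary analysis: dimensions, data types, missing values per column.
print(df.shape)
print(df.dtypes)
print(df.isna().sum())

# Cleaning: drop the superfluous column, then fill the missing temperature
# with the column median (one of several reasonable strategies).
clean = df.drop(columns=["Unnamed: 0"])
clean["Temperature"] = clean["Temperature"].fillna(clean["Temperature"].median())
```
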
The project then moves to Exploratory Data Analysis (EDA), where visualizations are used to examine relationships between the dependent and independent variables. This phase also covers an analysis of mean distributions and of correlations between columns. With this understanding of the data, attention shifts to data preparation: feature engineering, encoding, and splitting the data into training and test sets.
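
As a rough illustration of the EDA summaries and the preparation steps named above, the sketch below uses a few made-up rows. The one-hot encoding via `pd.get_dummies` and the 25% test split are assumptions chosen for the example, not settings taken from the project.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical sample rows; the real data has 8,760 hourly records.
df = pd.DataFrame({
    "Hour": [0, 6, 12, 18, 0, 6, 12, 18],
    "Temperature": [-5.0, -6.1, 2.3, 0.5, 21.0, 19.4, 29.8, 27.1],
    "Seasons": ["Winter"] * 4 + ["Summer"] * 4,
    "Holiday": ["No Holiday", "No Holiday", "Holiday", "Holiday"] * 2,
    "Rented Bike Count": [254, 181, 449, 862, 710, 980, 1405, 2301],
})

# EDA: mean rentals per season and correlation of numeric columns with the target.
season_means = df.groupby("Seasons")["Rented Bike Count"].mean()
correlations = df.select_dtypes("number").corr()["Rented Bike Count"]
print(season_means)
print(correlations)

# Encoding: one-hot encode the categorical columns, dropping one level each.
encoded = pd.get_dummies(df, columns=["Seasons", "Holiday"], drop_first=True)

# Split features and target, holding out 25% of the rows for testing.
X = encoded.drop(columns=["Rented Bike Count"])
y = encoded["Rented Bike Count"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)
```
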
Features are scaled before modeling so that variables measured in different units are comparable. Model selection compares candidate algorithms to find the most suitable one. Model evaluation uses several metrics, and hyperparameter tuning is applied to improve accuracy and reduce overfitting. Finally, comparing performance on the training and test data shows how well the model generalizes and where its errors lie.
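
A minimal sketch of scaling followed by a train/test evaluation is shown below. It runs on synthetic data, and the choice of `StandardScaler`, `LinearRegression`, and the three metrics (R2, MAE, RMSE) is an assumption for illustration; the project may use different components.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: three features on very different scales.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3)) * [10.0, 1.0, 100.0]
y = X @ np.array([2.0, -5.0, 0.1]) + rng.normal(scale=3.0, size=500)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Fit the scaler on the training split only, then apply it to both splits,
# so no information from the test set leaks into preprocessing.
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

model = LinearRegression().fit(X_train_s, y_train)

def report(X_part, y_part):
    """Return R2, MAE and RMSE of the fitted model on one split."""
    pred = model.predict(X_part)
    return {
        "r2": r2_score(y_part, pred),
        "mae": mean_absolute_error(y_part, pred),
        "rmse": float(np.sqrt(mean_squared_error(y_part, pred))),
    }

train_scores = report(X_train_s, y_train)
test_scores = report(X_test_s, y_test)
print("train:", train_scores)
print("test: ", test_scores)
```

Similar scores on the two splits suggest the model generalizes; a much higher training score would point to overfitting.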
The Seoul bike sharing demand data set is hosted in the UCI Machine Learning Repository. The data set contains the count of the number of bikes rented at each hour in the Seoul bike-sharing system and information regarding weather conditions.
The final product is a model that predicts the number of bicycles rented in any given hour, based on the hour of the day and weather-related variables such as rainfall and humidity. These predictions can help ensure that enough bikes are available to meet demand for the service.
The dataset contains weather information (Temperature, Humidity, Windspeed, Visibility, Dewpoint, Solar radiation, Snowfall, Rainfall), the number of bikes rented per hour and date information.
Date - year-month-day
Rented Bike Count - Count of bikes rented at each hour
Hour - Hour of the day
Temperature - Temperature in Celsius
Humidity - %
Windspeed - m/s
Visibility - 10m
Dew point temperature - Celsius
Solar radiation - MJ/m2
Rainfall - mm
Snowfall - cm
Seasons - Winter, Spring, Summer, Autumn
Holiday - Holiday/No holiday
Functional Day - NoFunc (Non Functional Hours), Fun (Functional Hours)
Date: Stored as str type and must be converted to datetime (format DD/MM/YYYY). The new features extracted from Date are Day, Month and Year.
Rented Bike Count: Number of bikes rented; this is the dependent variable for our problem statement. int type.
Hour: Hour of the day in 24-hour format, so each row gives the number of bikes rented in that hour. int type.
Temperature: Temperature in degrees Celsius (°C). float type.
Humidity (%): Humidity of the air in percent. int type.
Wind speed (m/s): Wind speed in m/s. float type.
Visibility (10m): Visibility in units of 10 m. int type.
Dew point temperature (°C): Dew point temperature in °C, the temperature to which air must be cooled to become saturated with water vapor. float type.
Solar Radiation (MJ/m2): Solar radiation in MJ/m2. float type.
Rainfall (mm): Rainfall in mm, where 1 mm of rainfall equals 1 litre of water per square metre. float type.
Snowfall (cm): Snowfall in cm. float type.
Seasons: Season of the year; four seasons are present in the data. str type.
Holiday: Whether the day is a holiday or not. str type.
Functioning Day: Whether the day is a functioning day or not. str type.
Weekend: Derived from Date; 1 when the day is Saturday or Sunday, 0 on weekdays.
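
The date-derived features listed above (Day, Month, Year and Weekend) could be extracted along these lines. This is a sketch that assumes the raw Date strings are in DD/MM/YYYY form; the sample dates are made up for illustration.

```python
import pandas as pd

# Hypothetical sample of the Date column as DD/MM/YYYY strings.
df = pd.DataFrame(
    {"Date": ["01/12/2017", "02/12/2017", "03/12/2017", "30/11/2018"]}
)

# Parse with an explicit format so day and month are never swapped.
df["Date"] = pd.to_datetime(df["Date"], format="%d/%m/%Y")

df["Day"] = df["Date"].dt.day
df["Month"] = df["Date"].dt.month
df["Year"] = df["Date"].dt.year

# dayofweek is 0 for Monday through 6 for Sunday, so 5 and 6 mark the weekend.
df["Weekend"] = (df["Date"].dt.dayofweek >= 5).astype(int)

print(df)
```
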
- Data Collection
- Data Filtering
- EDA
- Data Preparation
- Data Modeling
- The highest number of bikes was rented in summer, with counts reaching about 3,500, while the lowest number was rented in winter, with counts close to just 1,000. This suggests that people tend to rent more bikes in summer and fewer in winter than in the other seasons.
- On working days people tend to rent more bikes, with counts around 3,500, which suggests that fewer bikes are rented on holidays. People also rent few or no bikes on non-functioning days.
- Comparing Weekend against Rented Bike Count shows that people tend to rent more bikes on weekdays than on weekends.
- The linear regression model gave an R2 score of 0.755 on the training data and 0.764 on the test data. The similar scores indicate that the model is not overfitting.
- We also tried tree-based models. A Decision Tree Regressor gave an R2 score of 0.906 on the training data and 0.849 on the test data.
- To improve on the single tree, we applied Random Forest, which gave an R2 score of 0.970 on the training data and 0.919 on the test data.
- Finally, we applied Gradient Boosting with parameters selected by grid search, which gave the highest R2 scores: 1.000 on the training data and 0.924 on the test data.
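
The train versus test comparison across the four model families above can be sketched as follows. The data here is synthetic, and the hyperparameters and grid values are illustrative assumptions, so the printed scores will not match the figures reported above.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic non-linear stand-in data; the project's real features are not used.
rng = np.random.default_rng(42)
X = rng.uniform(0, 24, size=(600, 3))
y = (100 * np.sin(X[:, 0] / 24 * 2 * np.pi)
     + 5 * X[:, 1]
     + rng.normal(scale=5, size=600))

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Small illustrative grid; these values are assumptions, not the tuned ones.
param_grid = {"n_estimators": [50, 100], "max_depth": [2, 3]}
gb_search = GridSearchCV(
    GradientBoostingRegressor(random_state=42), param_grid, cv=3, scoring="r2"
)

models = {
    "Linear Regression": LinearRegression(),
    "Decision Tree": DecisionTreeRegressor(max_depth=6, random_state=42),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=42),
    "Gradient Boosting (grid search)": gb_search,
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # score() returns R2 here; GridSearchCV scores with its refit best estimator.
    scores[name] = (model.score(X_train, y_train), model.score(X_test, y_test))
    print(f"{name}: train R2 = {scores[name][0]:.3f}, "
          f"test R2 = {scores[name][1]:.3f}")

print("Best Gradient Boosting parameters:", gb_search.best_params_)
```

A large gap between the training and test R2 of a model is the usual sign that it is fitting the training data more closely than it generalizes.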