Real-Time Data Extraction and Machine Learning for Optimized Uber Ride Booking with Continuous Learning
This project is aimed at optimizing taxi operations through a dynamic pricing system that utilizes real-time data analysis and machine learning techniques, incorporating a continuous learning framework. By leveraging data extraction methods and predictive modeling, the project seeks to enhance taxi company revenue and operational efficiency by adjusting fares in real-time and providing personalized ride booking recommendations to users. The project explores the significance, methodology, key features, technological components, and future enhancements of this system, with a focus on driving business growth and improving customer satisfaction.
The rise of ride-sharing services like Uber has transformed urban transportation, presenting both opportunities and challenges for taxi companies. This project addresses these challenges by creating a dynamic pricing system that uses real-time data and machine learning to adjust taxi fares as market conditions shift.
As urban populations continue to grow, the demand for efficient transportation solutions increases. By optimizing ride pricing and booking times, this project can lead to:
Increased Revenue: Dynamic pricing can help taxi companies maximize revenue by adjusting fares in response to changing demand.
Improved Operational Efficiency: Real-time data and machine learning can help taxi companies optimize their fleet utilization, reducing costs and improving overall efficiency.
Enhanced Customer Experience: By providing users with reliable and personalized predictions, taxi companies can improve customer satisfaction, leading to increased loyalty and retention.
Continuous Real-time Data Collection: Data is scraped from Uber for seven locations, capturing all possible routes at one-hour intervals from 7 AM to 11 PM.
Database Management: A MySQL database is employed for data storage, facilitating continuous collection through job scheduling.
Geolocation API Integration: Latitude and longitude are calculated using the Nominatim API, and distances are computed via the Open Route Service API, extending the model's scope beyond the scraped locations.
Machine Learning Models: Various ML models, including Random Forest and XGBoost, are trained, tested and tuned on historical data to predict optimal booking times.
Interactive Web Interface: A Streamlit application allows users to select locations and view predictions for future booking times, while visualizing maps using Pydeck.
Continuous Learning Framework: The system is designed to continuously learn from new data, automatically retraining models with MLflow to adapt to changing patterns in ride demand and pricing.
Programming Language: Python
IDE: PyCharm and Jupyter notebook
Web Scraping: Selenium
Mapping: Pydeck, Geopy
APIs: Nominatim, Open Route Service
Database: MySQL
Machine Learning Frameworks: Scikit-learn
Version Control: Git, GitHub
Model Logging and Continuous Learning: MLflow
Data Scraping:
Data was scraped from the Uber website using Selenium and continues to be scraped at one-hour intervals between 7 AM and 11 PM IST for 7 locations in the city of Chennai, Tamil Nadu, covering all possible routes among them.
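The 42 routes follow directly from the 7 locations: every ordered (from, to) pair of distinct locations is one route, giving 7 × 6 = 42. The sketch below illustrates this; the location names are placeholders, not necessarily the 7 locations actually scraped.

```python
from itertools import permutations

# Hypothetical Chennai location names for illustration; the actual
# 7 scraped locations live in the scraper's configuration.
LOCATIONS = [
    "Chennai Central", "T Nagar", "Velachery", "Adyar",
    "Anna Nagar", "Guindy", "Tambaram",
]

def all_routes(locations):
    """Every ordered (from, to) pair: 7 locations -> 7 * 6 = 42 routes."""
    return list(permutations(locations, 2))

routes = all_routes(LOCATIONS)
print(len(routes))  # 42
```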
Types of Data:
Data collected includes:
- Ride type (Uber Go, Uber Sedan, Uber XL, Uber Auto, Uber Moto, Uber Premier),
- Maximum ride persons (1,2,3,4,6)
- Route location from
- Route location to
- Ride request date
- Ride request time
- Waiting time (minutes)
- Reaching time (minutes)
- Ride time (minutes)
- Ride price in Rupees
The following images illustrate a sample row, the total rows × columns, and the date range of the collected data as of September 19, 2024:
Data Formatting:
Since data is collected at hourly intervals but Selenium takes a while to scrape all 42 routes, not every record is captured exactly on the hour (for example, scraping starts at 9:00 AM but may finish at 9:05 AM or 9:10 AM). Timestamps are therefore rounded to the hour, and if multiple records share the same ride type and locations within a time slot, their values are averaged to give a single record per slot. Date and time formatting were applied as well.
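The rounding and averaging step can be sketched in pandas as follows; the column names and toy values here are illustrative, not the project's actual schema.

```python
import pandas as pd

# Toy rows: two scrapes of the same route and ride type land at 9:02
# and 9:07, so both belong to the 9 AM slot and their prices average.
df = pd.DataFrame({
    "ride_type": ["Uber Go", "Uber Go"],
    "route_from": ["T Nagar", "T Nagar"],
    "route_to": ["Adyar", "Adyar"],
    "request_time": pd.to_datetime(["2024-09-19 09:02", "2024-09-19 09:07"]),
    "price": [180.0, 190.0],
})

df["request_time"] = df["request_time"].dt.round("h")  # snap to the hour
hourly = (
    df.groupby(["ride_type", "route_from", "route_to", "request_time"],
               as_index=False)
      .mean(numeric_only=True)
)
print(hourly["price"].iloc[0])  # 185.0
```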
Feature Engineering:
New features, including day of the week and hour of the day, were added to enhance model predictions, while unwanted columns were removed.
A depiction of the resulting data is shown below.
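Deriving these calendar features from the request timestamp is a one-liner each in pandas; a minimal sketch, with an assumed `request_time` column name:

```python
import pandas as pd

df = pd.DataFrame({
    "request_time": pd.to_datetime(["2024-09-19 09:00", "2024-09-21 18:00"]),
})

# Calendar features the models can learn demand patterns from.
df["day_of_week"] = df["request_time"].dt.dayofweek  # Monday = 0
df["hour"] = df["request_time"].dt.hour

print(df[["day_of_week", "hour"]].values.tolist())  # [[3, 9], [5, 18]]
```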
Exploratory Data Analysis (EDA):
EDA was performed to analyze key metrics such as the distribution of ride times, prices, and waiting times by day of week and hour of day. They are represented below.
The distributions of ride price, waiting time, and ride time are not heavily skewed and can therefore be used by the machine learning models. They are shown below.
Correlation analysis among the numerical features is shown below. Although hour and day of week likely influence price, no features are correlated strongly enough to warrant dropping any columns, so we proceed to the next steps.
Model Selection:
For our project we chose Linear Regression as the baseline model, along with ensemble techniques such as Random Forest Regressor and XGBoost Regressor, since they handle both normally distributed and skewed data; as our data is slightly skewed, these models are a good fit. Overall, ensemble techniques tend to perform well on this kind of regression problem.
Model Evaluation:
The models were then trained after conducting a train-test split, encoding categorical features, and scaling numerical features. Mean Absolute Error (MAE), Mean Squared Error (MSE), and R² Score were used as evaluation metrics. The results obtained were as follows:
From the observations, we can clearly see that Random Forest performs the best, while Linear Regression performs the worst due to its inability to capture non-linearity in the data. The model was then hyperparameter-tuned using RandomizedSearchCV over the number of estimators, max depth, min samples split, and min samples leaf. Results indicated that the default model performed best, so we use that.
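The train/encode/scale/evaluate loop can be sketched as below. This is a minimal illustration on synthetic data, not the project's actual pipeline: the feature names and the generated target are assumptions, and XGBoost and the RandomizedSearchCV tuning step are omitted for brevity.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the scraped ride data.
rng = np.random.default_rng(0)
n = 400
X = pd.DataFrame({
    "ride_type": rng.choice(["Uber Go", "Uber Sedan"], n),
    "hour": rng.integers(7, 23, n),
    "day_of_week": rng.integers(0, 7, n),
})
y = 100 + 10 * (X["ride_type"] == "Uber Sedan") + 2 * X["hour"] \
    + rng.normal(0, 5, n)

# Encode categoricals, scale numericals, then fit each candidate model.
pre = ColumnTransformer([
    ("cat", OneHotEncoder(), ["ride_type"]),
    ("num", StandardScaler(), ["hour", "day_of_week"]),
])
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

for name, model in [("LinearRegression", LinearRegression()),
                    ("RandomForest", RandomForestRegressor(random_state=42))]:
    pipe = Pipeline([("pre", pre), ("model", model)]).fit(X_tr, y_tr)
    pred = pipe.predict(X_te)
    print(name,
          "MAE:", round(mean_absolute_error(y_te, pred), 2),
          "MSE:", round(mean_squared_error(y_te, pred), 2),
          "R2:", round(r2_score(y_te, pred), 3))
```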
Why?
So why do we need further improvement? The current model gives good performance metrics, but it is limited to the 7 locations and the 42 unique routes among them. To expand the model's scope beyond those routes and make it usable for arbitrary locations across the city, we can incorporate geo-locational features.
Geo-locational features using APIs:
Thus the latitudes and longitudes of these locations, and the distances along the 42 routes, were calculated using APIs: the Nominatim API to obtain latitudes and longitudes, and the Open Route Service API to get the route distances. Results are as follows:
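Both APIs are called over the network; as an offline stand-in, the sketch below computes a great-circle (haversine) distance between two coordinates. Road distances returned by Open Route Service will be somewhat longer than this. The coordinates are illustrative, not the project's exact values.

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km (rough stand-in for an ORS road distance)."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * asin(sqrt(a))  # 6371 km = mean Earth radius

# Approximate coordinates for two Chennai areas (illustrative only).
t_nagar = (13.04, 80.23)
velachery = (12.98, 80.22)
print(round(haversine_km(*t_nagar, *velachery), 1))  # roughly 7 km
```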
Different Train-Test Split:
The train-test split was not done randomly. Instead, two locations were completely held out: every route between the two, and every route between either of them and the remaining locations, was removed from the proposed train set, so these locations and their routes are entirely new, unseen data to the model.
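This location-holdout split can be sketched with a boolean mask in pandas; the column names and toy rows are illustrative.

```python
import pandas as pd

# Toy ride records over four locations A-D.
df = pd.DataFrame({
    "route_from": ["A", "A", "B", "C", "C", "D"],
    "route_to":   ["B", "C", "C", "D", "A", "A"],
    "price":      [100, 120, 90, 110, 125, 95],
})

held_out = {"C", "D"}  # hypothetical pair of completely hidden locations

# Any row touching a held-out location goes to the test set.
mask = df["route_from"].isin(held_out) | df["route_to"].isin(held_out)
train, test = df[~mask], df[mask]
print(len(train), len(test))  # 1 5
```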
Training, Testing, and Evaluation: Models were trained, tested, and evaluated, with results at this step as follows:
Random Forest again performs the best. Although performance is slightly lower than the model using location names instead of geo-locational features, it still performs very well, and using it extends the project's scope beyond a limited set of locations to choose from.
Web App:
To let users interact with what we have built in real time, we leveraged Streamlit to serve the application.
Location Restriction:
Since the data was collected from only 7 locations in the city of Chennai, the best approach was to restrict both the "from" and "to" selections to within 20 km of the centroid (mean latitude/longitude) of the 7 locations. That way, predictions stay within the region the model has seen.
Dynamic geo-locational features:
Since we are using new locations, we need to fetch their corresponding latitudes, longitudes, and distances as well. This was dynamically done using the Nominatim API and Open Route Service API.
Routes Mapping:
Further, the road route between the selected locations is also mapped using the Open Route Service API, giving users a visual view of the route.
Error Handling:
Errors are also handled when a location exceeds the 20 km radius or when an invalid address is entered.
Prediction of Value:
Upon selection of features, the app generates the ride price, ride waiting time, and ride time for the selected date and hour. It also provides the values for the next three hours, with percentage change and colour coding, helping users select the best ride for cost savings, convenience, and satisfaction.
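The percentage-change display can be sketched as below; the predicted prices are made-up values, not real model output.

```python
# Hypothetical predicted prices (Rupees) for the selected hour (9 AM)
# and the next three hours.
predictions = {9: 185.0, 10: 203.5, 11: 175.75, 12: 185.0}

base = predictions[9]
for hour in (10, 11, 12):
    change = (predictions[hour] - base) / base * 100
    # In the app, negative changes (cheaper rides) would be colour-coded
    # green and positive ones red.
    print(f"{hour}:00 -> Rs {predictions[hour]} ({change:+.1f}%)")
```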
New data = New Model:
To ensure that the model remains effective over time, a continuous learning framework is implemented. This involves:
Scheduled Retraining:
Models are automatically retrained on new data that is collected daily at 9 AM, allowing them to adapt to the latest trends and patterns.
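In the project, scheduling is handled by the scheduler script; purely as an illustration of the timing logic, the sketch below computes the next 9 AM retraining slot with the standard library.

```python
from datetime import datetime, time, timedelta

def next_retrain(now):
    """Next 9 AM retraining slot: today if before 9 AM, otherwise tomorrow."""
    run_at = datetime.combine(now.date(), time(9, 0))
    if now >= run_at:
        run_at += timedelta(days=1)
    return run_at

print(next_retrain(datetime(2024, 9, 19, 10, 30)))  # 2024-09-20 09:00:00
print(next_retrain(datetime(2024, 9, 19, 7, 0)))    # 2024-09-19 09:00:00
```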
Model Performance Tracking and Best Model Selection:
Using MLflow, the performance of the various models is tracked, and the best-performing model is automatically selected.
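The selection logic reduces to comparing logged metrics across runs. The sketch below uses a plain dictionary with made-up metric values; in the real pipeline, the metrics would come from MLflow's tracking store (e.g. via `mlflow.search_runs`).

```python
# Hypothetical metrics as they would be logged to MLflow for each run.
runs = {
    "LinearRegression": {"r2": 0.71, "mae": 24.3},
    "RandomForest":     {"r2": 0.93, "mae": 9.8},
    "XGBoost":          {"r2": 0.91, "mae": 10.9},
}

# Promote the run with the highest R² score as the serving model.
best = max(runs, key=lambda name: runs[name]["r2"])
print(best)  # RandomForest
```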
The project presents multiple avenues for future expansion:
Additional Routes:
Extend the model's functionality to other cities and locations.
Real-Time API Integration:
Establish direct connections with the Uber API for enhanced data accuracy and responsiveness.
Enhanced Continuous Learning:
Incorporate user feedback and behavioural data to refine prediction algorithms further.
This project successfully integrates real-time data extraction and machine learning techniques to optimize Uber ride bookings. The inclusion of a continuous learning framework ensures that the system adapts to evolving patterns in ride demand and pricing. By providing users with predictive insights into ride prices and wait times, it enhances the overall transportation experience. The project highlights the potential of data-driven approaches in urban mobility, paving the way for future innovations.
- https://openrouteservice.org/
- https://nominatim.org/
- https://www.uber.com/in/en/
- https://scikit-learn.org/stable/
- https://mlflow.org/docs/latest/index.html
- https://www.selenium.dev/documentation/
- https://dev.mysql.com/
- https://deckgl.readthedocs.io/en/latest/layer.html
- https://docs.streamlit.io/
Install Required Packages:
pip install streamlit pandas scikit-learn geopy mlflow pydeck

Clone Repository:
git clone https://github.com/pramodkondur/UberWise-EndtoEnd.git
cd uber-price-prediction

Configure MySQL Database and Scheduler:
- Set up MySQL to store the scraped Uber data.
- Update database credentials accordingly.
- Run scheduler.py for the job to run on schedule.

Run Streamlit Application:
streamlit run app.py
You can view the code in depth in the notebooks:
Data Preparation and Model Train/Test/Eval