Predict Doctor's Consultation Fee

Problem Statement

We have all been in a situation where we visit a doctor in an emergency and find that the consultation fee is too high. As data scientists, we should be able to do better. What if you had data recording important details about a doctor and could build a model to predict that doctor's consultation fee?

Libraries

scikit-learn==0.19.2
pandas==0.23.4
matplotlib==2.2.2
seaborn==0.9.0
catboost==0.9.1.1

Data Features

Qualification: Qualification and degrees held by the doctor
Experience: Experience of the doctor in number of years
Rating: Rating given by patients
Profile: Type of the doctor
Miscellaeous_Info: Extra information about the doctor
Place: Area and the city where the doctor is located.
Fees: Fees charged by the doctor

Data Pre-processing and Feature creation

The Qualification column holds all of a doctor's study records as a comma-separated string, e.g. MBBS, MS - Otorhinolaryngology. It would make sense to derive features such as Diploma, Bachelor, Masters, and Extra study from it, but there are more than 600 unique study and major combinations, and building such features requires an understanding of all these medical courses. That's why I applied a TF-IDF vectorizer followed by TruncatedSVD to this column, reducing the Qualification information to 20 final features.
Extracted just the number from Experience.
Removed % from Rating.
More than 3000 records don't have Miscellaeous_Info, and in the majority of cases it only contains information about Experience and location, which we already have, so I am not using this column in the model. I have created an extra column, Has_M_Info, which indicates whether the record has miscellaneous info or not.
The Place column has Area and City info separated by a comma, so I have created two columns, Area and City, from it.
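The steps above can be sketched in plain Python as follows. This is a minimal illustration, not the repository's actual code: the helper names, the sample value formats (such as "24 years experience" or "98%"), and the choice of treating each comma-separated degree as one TF-IDF token are all assumptions. The scikit-learn import is deferred so the simple parsers can be used on their own.

```python
import re


def _is_missing(value):
    """True for None, NaN (NaN != NaN), or blank strings."""
    return value is None or value != value or not str(value).strip()


def parse_experience(text):
    """Return the first integer found in the Experience string, or None."""
    if _is_missing(text):
        return None
    match = re.search(r"\d+", str(text))
    return int(match.group()) if match else None


def parse_rating(text):
    """Strip a trailing '%' from the Rating string and return a float, or None."""
    if _is_missing(text):
        return None
    return float(str(text).strip().rstrip("%"))


def has_misc_info(text):
    """Has_M_Info flag: 1 if the miscellaneous info field is present, else 0."""
    return 0 if _is_missing(text) else 1


def split_place(text):
    """Split 'Area, City' on the last comma into an (area, city) tuple."""
    if _is_missing(text):
        return (None, None)
    parts = [part.strip() for part in str(text).rsplit(",", 1)]
    if len(parts) == 1:
        # Assumption: a value without a comma is treated as the city.
        return (None, parts[0])
    return (parts[0], parts[1])


def split_qualifications(text):
    """Tokenize a Qualification string into its comma-separated entries."""
    return [part.strip() for part in text.split(",") if part.strip()]


def qualification_svd_features(texts, n_components=20, seed=0):
    """TF-IDF over qualification entries, reduced with TruncatedSVD.

    Requires scikit-learn. n_components must be smaller than the number
    of distinct qualification entries in `texts`.
    """
    from sklearn.decomposition import TruncatedSVD
    from sklearn.feature_extraction.text import TfidfVectorizer

    tfidf = TfidfVectorizer(tokenizer=split_qualifications, token_pattern=None)
    matrix = tfidf.fit_transform(texts)
    svd = TruncatedSVD(n_components=n_components, random_state=seed)
    return svd.fit_transform(matrix)
```

With pandas, each parser could be applied column-wise, for example `df["Experience"].apply(parse_experience)`.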

Final Features for model creation

20 features from Qualification
Experience
Rating
Profile
Has_M_Info
City

I haven't used Area because it has 800+ unique values, which could cause the model to overfit.

EDA

The data analysis shows that Rating and Experience have low correlation with the target variable, Fees.
Box plots of Fees by Profile and by City show that these columns are more useful for separating the data.
Even so, without the education (Qualification) information we still cannot get a good separation between groups of data.

Evaluation metric

Root mean squared log error (RMSLE)
The reason for selecting this metric is that our target variable is skewed, so we don't want the model to be penalized more heavily for extreme values.
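For reference, RMSLE can be computed as the square root of the mean squared difference between log(1 + prediction) and log(1 + target). The function below is a minimal sketch of that definition; the name `rmsle` and the plain-list inputs are choices made for illustration, not taken from the repository.

```python
import math


def rmsle(y_true, y_pred):
    """Root mean squared log error for non-negative targets and predictions."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")
    squared_log_errors = [
        (math.log1p(pred) - math.log1p(true)) ** 2
        for true, pred in zip(y_true, y_pred)
    ]
    return math.sqrt(sum(squared_log_errors) / len(squared_log_errors))
```

Because the error is taken on a log scale, predicting 200 for a true fee of 100 costs about the same as predicting 2000 for a true fee of 1000, whereas RMSE would penalize the second case ten times more.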

Model

I used the CatBoost Regressor model.
Mean squared error (MSE) is used as the loss function and RMSLE as the evaluation metric.
Profile and City are used as categorical variables.
Used 5-fold cross-validation and averaged the test-set predictions from each fold for the final submission.
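The fold-averaging scheme described above can be sketched as follows. This is an illustrative outline under stated assumptions, not the repository's training script: the helper names are hypothetical, the splitting is done with the standard library rather than scikit-learn, and the categorical column positions in `catboost_fit_predict` are placeholders. CatBoost's `RMSE` loss is used as the squared-error objective; no evaluation metric is configured here.

```python
import random
from statistics import mean


def kfold_indices(n_rows, n_splits=5, seed=0):
    """Yield (train_idx, valid_idx) pairs for a shuffled K-fold split."""
    order = list(range(n_rows))
    random.Random(seed).shuffle(order)
    folds = [order[i::n_splits] for i in range(n_splits)]
    for k in range(n_splits):
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, folds[k]


def cv_average_predictions(X, y, X_test, fit_predict, n_splits=5, seed=0):
    """Train one model per fold and average their test-set predictions."""
    fold_predictions = []
    for train_idx, valid_idx in kfold_indices(len(X), n_splits, seed):
        X_train = [X[i] for i in train_idx]
        y_train = [y[i] for i in train_idx]
        X_valid = [X[i] for i in valid_idx]
        y_valid = [y[i] for i in valid_idx]
        fold_predictions.append(
            fit_predict(X_train, y_train, X_valid, y_valid, X_test)
        )
    # Average across folds, one value per test row.
    return [mean(row_preds) for row_preds in zip(*fold_predictions)]


def catboost_fit_predict(X_train, y_train, X_valid, y_valid, X_test,
                         cat_features=(0, 1)):
    """Fit a CatBoostRegressor on one fold (requires the catboost package).

    cat_features holds hypothetical column positions for Profile and City.
    """
    from catboost import CatBoostRegressor

    model = CatBoostRegressor(loss_function="RMSE", verbose=False)
    model.fit(X_train, y_train, cat_features=list(cat_features),
              eval_set=(X_valid, y_valid))
    return list(model.predict(X_test))
```

A call such as `cv_average_predictions(X, y, X_test, catboost_fit_predict)` would then return one averaged prediction per test row; any other function with the same signature can be swapped in for `catboost_fit_predict`.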

Final model result

Achieved Validation RMSLE of 0.625 and RMSE of 168.87

How to further improve the result?

If we created more features from the Qualification column, based on a proper understanding of all the medical courses, there is a good chance the model's performance would improve.
Try different models, such as XGBoost and LinearRegression, and ensemble them.
