- Google slides link
- Reason for selected topic
- Description of the source of data
- Database
- Questions to Answer with the Data
- Data Exploration
- ERD
- Machine learning model
- Preliminary data preprocessing, feature engineering and feature selection
- Description of how data was split into training and testing data, and data standardization
- Explanation of model choice, including limitations and benefits
- Explanation of changes in model choice
- Description of how we have trained the model and the additional training
- Description of current accuracy score
- Tools
- Dashboard link
- Tableau Analysis
- Reference Resources
- Summary
- Team Members
In this project, our hypothesis was that restaurants in higher-income areas receive better health inspection grades. We wanted to find out whether the income level of an area affects the health inspection grades of its local restaurants.
We looked at datasets on NYC DOH restaurant inspections provided by Kaggle. We settled on this dataset because it contains many of the variables we were interested in, such as DBA, borough, cuisine description, and grade, which we use to assess how predictable a restaurant's inspection grade is from these variables.
Data Resources: NYC Restaurant Inspection & NYC-Precovid Restaurant Data
Utilizing pandas, we read in the two CSV files we would be using for our analysis and dropped any columns that were not necessary for our data visualization, analysis, and machine learning. We then dropped any null values and exported the cleaned data to CSV files. The two cleaned datasets were joined to preview what they would look like once joined in our PostgreSQL database, and duplicates were dropped. After joining the two datasets in PostgreSQL, we exported the final dataset used for the rest of our analysis. This dataset was similar to the one we created through pandas and contained no null values or duplicates.
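A minimal sketch of these cleaning steps is below; the file names and exact column selections are illustrative assumptions, not the project's actual notebook code.

```python
import pandas as pd

# Read in the two raw datasets (file names are placeholders).
inspections = pd.read_csv("nyc_restaurant_inspections.csv")
precovid = pd.read_csv("nyc_precovid_restaurant_data.csv")

# Keep only the columns needed for visualization, analysis, and ML.
inspections = inspections[["DBA", "BOROUGH", "STREET", "ZIPCODE",
                           "CUISINE_DESCRIPTION", "SCORE", "GRADE"]]
precovid = precovid[["DBA", "STREET", "INCOME_LEVEL"]]

# Drop null values, then export the cleaned frames.
inspections = inspections.dropna()
precovid = precovid.dropna()
inspections.to_csv("inspections_clean.csv", index=False)
precovid.to_csv("precovid_clean.csv", index=False)

# Preview of the join performed later in PostgreSQL, with duplicates removed.
joined = inspections.merge(precovid, on=["DBA", "STREET"], how="inner")
joined = joined.drop_duplicates()
```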
Do restaurants in higher-income areas receive better health inspection grades?
After cleaning the two datasets ('NYC Restaurant Inspection' and 'NYC-Precovid Restaurant Data') using Python and Jupyter Notebook, the datasets were loaded via SQLAlchemy into PostgreSQL, managed through pgAdmin 4.
- A table containing the 'NYC Restaurant Inspection' data was created with the following columns: 'DBA', 'BOROUGH', 'STREET', 'ZIPCODE', 'CUISINE_DESCRIPTION', 'SCORE', and 'GRADE'.
- A second table using the 'NYC-Precovid Restaurant' dataset was created with the following columns: 'DBA', 'STREET', and 'INCOME_LEVEL'. The two tables were then inner joined on the 'DBA' and 'STREET' columns, as presented in the ERD diagram. After the join, duplicate rows were removed.
- The final dataset was generated with the following specific attributes: 'DBA', 'STREET', 'INCOME_LEVEL', 'BOROUGH', 'ZIPCODE', 'CUISINE_DESCRIPTION', 'SCORE', and 'GRADE'.
- We use SQLAlchemy to connect to our PostgreSQL database and pull the joined data into a DataFrame for use in the machine learning model.
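A hedged sketch of that connection; the connection string and table name below are placeholders, not our actual credentials or schema.

```python
import pandas as pd
from sqlalchemy import create_engine

# Connect to the local PostgreSQL database (placeholder credentials).
engine = create_engine("postgresql://user:password@localhost:5432/nyc_restaurants")

# Pull the joined table into a DataFrame for the ML workflow.
df = pd.read_sql("SELECT * FROM final_dataset", engine)
```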
The dataset was preprocessed, split, and used to train and test supervised machine learning models in ML_model_NormalizedStandardized_0.
The data has categorical and numerical variables. The categorical variables include 'DBA', 'STREET', 'INCOME_LEVEL', 'BOROUGH', 'CUISINE_DESCRIPTION', and 'GRADE'. 'ZIPCODE' was converted to a string data type, making it a categorical variable as well. 'SCORE' is a numerical variable. The 'GRADE' variable consists of the unique values 'A', 'B', 'C', 'N', 'P', and 'Z'. However, since we do not know what 'N' and 'Z' signify, and 'P' signifies 'pending', we dropped 'N', 'P', and 'Z' from our dataset to build a more reliable model.
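A short sketch of these two steps, continuing from the DataFrame `df` pulled above:

```python
# Cast ZIPCODE to a string so it is treated as a categorical variable.
df["ZIPCODE"] = df["ZIPCODE"].astype(str)

# Drop the grades whose meaning is unknown ('N', 'Z') or pending ('P').
df = df[~df["GRADE"].isin(["N", "P", "Z"])]
```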
Initially we experimented with 'DBA', 'STREET', 'INCOME_LEVEL', 'BOROUGH', 'CUISINE_DESCRIPTION', 'ZIPCODE', and 'SCORE' as feature variables, and 'GRADE' as our target variable. We later experimented with various combinations of feature variables to determine which are most significant for our model. The goal is a model that can predict whether or not a restaurant will get a 'high' grade given our feature variables.
We removed 'SCORE' from our feature variables since 'SCORE' is akin to 'GRADE': 'GRADE' is created from gradations over 'SCORE' values, which denote the number of violation points a restaurant accrues, so 'SCORE' and 'GRADE' are inversely related. In fact, 'SCORE' could serve as a target variable too. The experiments we ran (found in the ML_experiments folder) were as follows; a sketch of the experiment loop appears after the list:
- Model 2: 'INCOME_LEVEL' and 'ZIPCODE' against 'GRADE'
- Model 3: 'INCOME_LEVEL', 'ZIPCODE', and 'BOROUGH' against 'GRADE'
- Model 4: 'INCOME_LEVEL', 'ZIPCODE', and 'CUISINE_DESCRIPTION' against 'GRADE'
- Model 5: 'INCOME_LEVEL', 'ZIPCODE', 'BOROUGH', and 'CUISINE_DESCRIPTION' against 'GRADE'
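A hypothetical sketch of how such an experiment loop could look (the actual notebooks live in the ML_experiments folder); the classifier, split, and metric choices here are illustrative assumptions.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split

# Feature combinations tried in Models 2-5, all against the 'GRADE' target.
experiments = {
    "Model 2": ["INCOME_LEVEL", "ZIPCODE"],
    "Model 3": ["INCOME_LEVEL", "ZIPCODE", "BOROUGH"],
    "Model 4": ["INCOME_LEVEL", "ZIPCODE", "CUISINE_DESCRIPTION"],
    "Model 5": ["INCOME_LEVEL", "ZIPCODE", "BOROUGH", "CUISINE_DESCRIPTION"],
}

for name, features in experiments.items():
    X = pd.get_dummies(df[features])  # one-hot encode the categorical features
    y = df["GRADE"]
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
    model = RandomForestClassifier(n_estimators=100, random_state=1)
    model.fit(X_train, y_train)
    print(name, balanced_accuracy_score(y_test, model.predict(X_test)))
```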
The preliminary data preprocessing included normalizing the categorical variables 'DBA', 'STREET', 'ZIPCODE', and 'CUISINE_DESCRIPTION'. These particular variables were singled out because they contain enough rare (uncommon) values that, left as is, they would make the encoded dataset too wide to work with. A density plot of the value counts was drawn for each of these variables to identify where the counts 'fall off', and the threshold was set in that region. The thresholds selected are: 5 for 'DBA', 30 for 'STREET', 200 for 'ZIPCODE', and 250 for 'CUISINE_DESCRIPTION'. Values below the threshold were bucketed into an 'other' category to help normalize the uneven distribution. The rare values and density plot for 'CUISINE_DESCRIPTION' prior to normalization are shown below, after a short sketch of the bucketing step:
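A minimal sketch of the bucketing, assuming `df` is the joined DataFrame and using the thresholds quoted above:

```python
# Thresholds chosen from the density plots of each variable's value counts.
thresholds = {"DBA": 5, "STREET": 30, "ZIPCODE": 200, "CUISINE_DESCRIPTION": 250}

for col, cutoff in thresholds.items():
    counts = df[col].value_counts()
    # counts.plot.density() was used to spot where the counts 'fall off'.
    rare = counts[counts < cutoff].index.tolist()
    df[col] = df[col].replace(rare, "other")  # bucket rare values together
```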
The encoding process included encoding the 'GRADE' variable into 'high' and 'low' grades: 'high' comprised grades 'A' and 'B', whereas 'low' comprised grade 'C'. This was followed by running a OneHotEncoder on all the categorical variables in our data.
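A sketch of this encoding, assuming the letter grades are still in 'GRADE'; note that the `sparse_output` argument requires scikit-learn 1.2+ (older versions use `sparse=False`).

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Bin the target: 'A' and 'B' become 'high', 'C' becomes 'low'.
df["GRADE"] = df["GRADE"].replace({"A": "high", "B": "high", "C": "low"})

# One-hot encode every categorical feature except the target itself.
categorical_cols = df.select_dtypes(include="object").columns.drop("GRADE")
encoder = OneHotEncoder(sparse_output=False, handle_unknown="ignore")
encoded = pd.DataFrame(
    encoder.fit_transform(df[categorical_cols]),
    columns=encoder.get_feature_names_out(categorical_cols),
    index=df.index,
)
df = df.drop(columns=categorical_cols).join(encoded)
```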
As the results from these model experiments show, it makes the most sense to use 'INCOME_LEVEL' and 'ZIPCODE' as our feature variables. ML_model_NormalizedStandardized_2 contains our chosen feature variables and target variable.
Theoretically, too, one can hypothesize how income level and zip code may affect the grade a restaurant receives. We chose 'ZIPCODE' as our marker of geographical locality rather than 'STREET' or 'BOROUGH', which are too specific and too general respectively; zip code captures just the right level of geographical distinctness for our purposes. 'DBA' (doing business as) denotes the restaurant or restaurant chain name, and we surmise that it will have little, if any, effect on the outcome. 'CUISINE_DESCRIPTION' is left out as well, as we hypothesize that the type of cuisine does not affect the grade a restaurant gets either.
The data was then split into training and testing sets (75% training, 25% testing). The training data was used to train (or 'fit') the models, and the testing data was used to test each model. After the split, any numerical variables in the data were standardized. We standardize after splitting, not before, because we do not want the testing values to leak into the scale. The data we are working with is tabular rather than raw (i.e. it contains no natural language or images), so supervised machine learning models run well on it.
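A sketch of that split-then-standardize order, assuming `X` and `y` hold the encoded features and the binary 'GRADE' target (the `random_state` and `stratify` settings are illustrative):

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# 75% training / 25% testing split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1, stratify=y
)

# Fit the scaler on the training data only, so no information from the test
# set leaks into the scale, then apply the same transform to both splits.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
```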
The supervised machine learning approaches tried on our data include resampling and ensemble learning models. The resampling methods used were: Naive Random Oversampling, SMOTE Oversampling, Cluster Centroids Undersampling, and SMOTEENN, which combines over- and under-sampling techniques.
Naive Random Oversampling and SMOTE Oversampling 'oversample' the minority class so its count is on par with the majority class; the resample gives us 5,598 'high' and 5,598 'low' grades to run the ML model on. We experimented with Cluster Centroids Undersampling as well, which 'undersamples' the majority class down to the number of values in the minority class, giving us 159 'high' and 159 'low' grades. The SMOTEENN method combines over- and under-sampling by 'oversampling' the minority class up to the majority class count, then 'undersampling' by eliminating data points that fall in the neighborhood of both classes; this gave us 1,780 'high' and 773 'low' grades. The ensemble learning models used on the data include Random Forest Classifier, Balanced Random Forest Classifier, and Easy Ensemble AdaBoost Classifier, each trained on the original distribution of 5,598 'high' and 159 'low' grades.
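A sketch of the four resampling strategies using the imbalanced-learn API, assuming the `X_train`/`y_train` split from above:

```python
from imblearn.combine import SMOTEENN
from imblearn.over_sampling import SMOTE, RandomOverSampler
from imblearn.under_sampling import ClusterCentroids

# Oversample the minority class up to the majority class count.
X_ros, y_ros = RandomOverSampler(random_state=1).fit_resample(X_train, y_train)
X_smote, y_smote = SMOTE(random_state=1).fit_resample(X_train, y_train)

# Undersample the majority class down to the minority class count.
X_cc, y_cc = ClusterCentroids(random_state=1).fit_resample(X_train, y_train)

# Combine over- and under-sampling (SMOTE followed by edited nearest neighbours).
X_se, y_se = SMOTEENN(random_state=1).fit_resample(X_train, y_train)
```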
From among these models, the best model turns out to be the Random Forest Classifier, which has an accuracy score of 0.97.
Our chosen model is the same as last week's, the Random Forest Classifier, which yields the best results. However, the feature variables used in this model have changed: we narrowed them down to only two, 'INCOME_LEVEL' and 'ZIPCODE'. The target variable remains the same as before: 'GRADE'. Initially we were working with 'DBA', 'STREET', 'INCOME_LEVEL', 'BOROUGH', 'CUISINE_DESCRIPTION', 'ZIPCODE', and 'SCORE' as feature variables, and 'GRADE' as our target variable. The re-selection of feature variables is explained in detail above in the section titled "Feature engineering, selection, and model tweaking". ML_model_NormalizedStandardized_2 contains our model from this week.
To improve this model, we can try binning 'GRADE' another way: instead of placing 'A' and 'B' in the 'high' grade and only 'C' in the 'low' grade, we can bin 'A' as 'high' and both 'B' and 'C' as 'low'. Another way to refine our results is to use 'SCORE' instead of 'GRADE', build multiple classes out of 'SCORE' as the target variable, and run our models against that.
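A one-line sketch of that alternative binning, assuming the raw letter grades are still in 'GRADE':

```python
# Stricter binning: only an 'A' counts as a 'high' grade.
df["GRADE"] = df["GRADE"].replace({"A": "high", "B": "low", "C": "low"})
```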
Our best model is the Random Forest Classifier, with an accuracy score of 0.97. A Random Forest Classifier trains each weak learner on a subset of the data and bases its result on the consensus reached by those weak learners together. A Random Forest Classifier can, however, miss some of the variability in the data; if the number of estimators and the tree depth are sufficient, it should perform quite well. The confusion matrix for this model can be seen below:
The model's precision for 'high' grades is 0.97, whereas its precision for 'low' grades is 0.00. The model's recall (or sensitivity) for 'high' grades is 1.00, whereas its recall for 'low' grades is 0.00.
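A sketch of how these metrics can be produced, assuming the scaled split from above; the estimator count is an illustrative assumption:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Fit the chosen model on the standardized training data.
rf = RandomForestClassifier(n_estimators=128, random_state=1)
rf.fit(X_train_scaled, y_train)
predictions = rf.predict(X_test_scaled)

print(accuracy_score(y_test, predictions))
print(confusion_matrix(y_test, predictions))
print(classification_report(y_test, predictions))  # per-class precision/recall
```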
We plot maps based on income level and the distribution of inspection grades across local areas:
- Creating a heat map showing the average scores by cuisine in each borough.
- Plotting a geographic map indicating the grades restaurants received across the five boroughs.
- An interactive dashboard built with JavaScript, HTML, CSS, and Bootstrap, where we can filter New York City restaurant data by zip code, income level, grade, borough, and cuisine, separately or in combination.
- Using Plotly and JavaScript to make an interactive graph visualizing the mean grading score by cuisine across zip codes.
- Using Plotly, we visualize the accuracy scores of the different ML models to see which gives the best results.
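As an illustration in Python Plotly (the dashboard itself uses the JavaScript library), assuming a hypothetical `fitted_models` dict mapping model names to fitted classifiers:

```python
import plotly.express as px
from sklearn.metrics import accuracy_score

# Collect the accuracy score of each fitted model on the test split.
scores = {name: accuracy_score(y_test, m.predict(X_test_scaled))
          for name, m in fitted_models.items()}

# Bar chart comparing the models.
fig = px.bar(x=list(scores), y=list(scores.values()),
             labels={"x": "Model", "y": "Accuracy score"},
             title="Accuracy score by ML model")
fig.show()
```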
- PostgreSQL
- SQLAlchemy
- Pandas
- Imbalanced-learn
- sklearn
- Tableau
- JavaScript
- Plotly
- HTML
- CSS
- Bootstrap
The Dashboard provides detail about the tools and elements that we used to create the dashboard.
This is the webpage we built after our data analysis: NYC Restaurant Analysis Dashboard.
Based on the dashboard, we can see that the majority of high-income neighborhoods are located in the borough of Manhattan, and some minority cuisines, such as Afghan, Ethiopian, and Filipino, are found only in Manhattan.
We can also tell from the heat map that African cuisine in Brooklyn and Eastern cuisine in Manhattan have the lowest inspection grades in their neighborhoods. All analyses are included in the Story. We also found that zip code 10119 has the lowest points of any zip code across the five boroughs, which means it has the highest inspection grade; it is a high-income neighborhood located in Manhattan. Zip code 10461 has the lowest inspection grade among low-income neighborhoods and is located in the Bronx.
Accuracy scores were obtained and visualized to compare which machine learning model performs best with our data. The Random Forest Classifier was the best model for predicting the grade from a given zip code and income level.