Weather-To-Eat-Or-Not-Dataset

-- Uncover the nation’s appetite – a decision support dataset for would-be restauranteurs

Authors: Devanshi Dholakia, Abeer Almahyawi, Akshay Jayakumar and Ou Stella Liang

Last updated: 12/11/2018

Introduction

Project Scope:

Our project consists of creating a national restaurant dataset using the Yelp Fusion API and other data sources. There are four cities of interest for comparison: Philadelphia, San Francisco, Chicago and Miami. We extracted attributes associated with the restaurants such as coordinates, phone number, operating hour, address, pricing level, rating and count of reviews. Also, we have decided to extract data from openweathermap.org such as humidity, Minimum temperature, Maximum temperature and wind speed for the four cities.

Motivation and Purpose:

The dataset is ideal for answering questions such as:

Does weather impact reviewer’s rating?
Do restaurants benefit from certain characteristics of locations in terms of the reviews they receive? E.g. Chinese restaurant in Chinatown vs. a tourist site.
Comparison of competing restaurants in the same area? Eg:- Which cafe is better in the University City? Starbucks, Saxby’s or Wawa
What are the flavor preferences of residents of different cities?

The dataset is also good for developing products such as:

Geo-mapping/heatmaps of ‘cuisine districts’

A consulting algorithm that recommends the ideal location/price range/cuisine type given certain inputs from a business owner.

Finally, we believe the dataset may be of interests to several stakeholders:

Business owners who are interested in entering the market. (Market Analysis )
Public health agencies who are interested in the population’s access and attitudes in dining out.
Public agencies interested in studying the local economy.
Food reviewers who are interested in assessing population’ interest.
Students/travelers who are looking for a type of restaurant in a new area (via the products of this dataset, e.g. heatmaps)

Working with the APIs

This program allows users to obtain restaurant review and weather information based on customized location and business type.

The Yelp Fusion APIs offer several endpoints including search, business details, etc. Each endpoint returns a specific facet about the businesses in Yelp's database. Its construct is similar to many database APIs in that one first needs to identify a unique ID of the desired entity by certain parameters in order to obtain more attributes of it, hence the separate end points search and business details.

Our 'Search' Parameters

Name	Data Type	Description
term	string	Search term, for example "food" or "restaurants". The term may also be business names, such as "Starbucks". If term is not included the endpoint will default to searching across businesses from a small number of popular categories
location	string	Required if either latitude or longitude is not provided. This string indicates the geographic area to be used when searching for businesses. Examples: "New York City", "NYC", "350 5th Ave, New York, NY 10118". Businesses returned in the response may not be strictly within the specified location.
limit	int	Number of business results to return. By default, it will return 20. Maximum is 50.
sort by	string	Suggestion to the search algorithm that the results be sorted by one of the these modes: best_match, rating, review_count or distance. The default is best_match. Note that specifying the sort_by is a suggestion (not strictly enforced) to Yelp's search, which considers multiple input parameters to return the most relevant results. For example, the rating sort is not strictly sorted by the rating value, but by an adjusted rating value that takes into account the number of ratings, similar to a Bayesian average. This is to prevent skewing results to businesses with a single review.
offset	int	Offset the list of returned business results by this amount.

In reviewing the API documentations, we summarized the following limits of the service.

A client is limited to 5,000 APIs per 24 hours resetting every midnight UTC.
If queries-per-second (QPS) against the API are deemed too frequent, the service will return a HTTP 429 error (too many requests). However, the documentation does not specify the maximum allowed frequency. We set our requests to be one second apart.

{
  "error": {
      "code": "TOO_MANY_REQUESTS_PER_SECOND",
      "description": "You have exceeded the queries-per-second limit for this endpoint.
                      Try reducing the rate at which you make queries."
  }
}

A client can get up to 1,000 businesses from the search endpoint.
Each call can have a maximum of 50 return results. (paging)

Due to the aforementioned limits, our data scope is to query the top 1000 restaurants for one city. We had to use the offset parameter, so that we obtain 1,000 in 20 calls. Results from all 20 queries from the same day were stored in a list and output to one JSON file.

The dataset generates a "pulse" for each restaurant by keeping track of the daily comment growth for each restaurant, despite Yelp does not provide the historical view of a business. This provides important intelligence on the business, and also enables users to associate the volume of comments with any number of measures of interest for a particular day to understand the factors impacting restaurant businesses. For example, the daily weather information is provided to address the question whether weather impacts people's dinning behaviors. Please refer to the automation manual for how to set up the daily queries.

In storing the results of our queries, we made the decision to store our daily results in separate files so that historical files are preserved and the JSON format cannot be corrupted by errors. The process of aggregating the daily results is discussed below.

Data aggregation/processing

The two programs work on digging into the JSON files returned by the Yelp Fusion API and OpenWeatherMap API, choosing required fields and combine data for each city and across cities in CSV format.

The JSON file returned from OpenWeatherMap API is extracted as a dictionary of dictionaries. Each file consists of each day's statistics in each city. On investigation, it turns out that OpenWeatherMap API sometimes concatenates the same data twice and stores it in one file. This causes the system to fail while creating the dataset. This requires manual intervention as it is difficult to correct the syntax of a JSON file using Python. The time complexity of extracting weather data is O(n⁴). The extracted fields are:

Weather Information Dataset

Name	Data Type	Description
date_id	string	Unique ID to map the restaurant reviews and weather information dataset
weather_main	string	Group of weather parameters (Example: Clouds, Rain, Snow, Extreme etc.)
weather_description	string	Weather condition within the group mentioned above (Example: Broken clouds, Few clouds, Light rain etc.)
main_temp	decimal	The temperature of that day. The default unit considered here is Kelvin
main_pressure	decimal	Atmospheric pressure of that day in hPa (hectopascals)
main_humidity	int	Relative air humidity of that day in %
main_temp_min	decimal	Minimum temperature on that day. The default unit considered here is Kelvin
main_temp_max	decimal	Maximum temperature on that day. The default unit considered here is Kelvin
visibility	decimal	Visibility on that day in meters
wind_speed	decimal	Speed of the wind in mps (meters per second)
wind_deg	decimal	Direction of the wind on that day in degrees (meteorogical). As an example, 270 degrees means the wind is blowing from the west
clouds_all	int	The % of cloudiness/cloud cover for that day
dt	int	Date & time of receiving data in the unix timestamp format, UTC (Example: Unix timestamp 1541696160 is equivalent to 11/08/2018 @ 4:56pm (UTC))
sys_sunrise	int	Date & time of sunrise in the unix timestamp format, UTC
sys_sunset	int	Date & time of sunrise in the unix timestamp format, UTC
id	int	City ID
name	string	City name
coord.lon, coord.lat	decimal	Longitude and latitude values for the given city

{
  "coord": {
    "lon": -75.16,
    "lat": 39.95
  },
  "weather": [
    {
      "id": 803,
      "main": "Clouds",
      "description": "broken clouds",
      "icon": "04n"
    }
  ],
  "base": "stations",
  "main": {
    "temp": 273.23,
    "pressure": 1019,
    "humidity": 83,
    "temp_min": 271.15,
    "temp_max": 274.85
  },
  "visibility": 16093,
  "wind": {
    "speed": 1.15,
    "deg": 274.504
  },
  "clouds": {
    "all": 75
  },
  "dt": 1543552500,
  "sys": {
    "type": 1,
    "id": 4743,
    "message": 0.0041,
    "country": "US",
    "sunrise": 1543579340,
    "sunset": 1543613777
  },
  "id": 4560349,
  "name": "Philadelphia",
  "cod": 200
}

The JSON file returned from Yelp Fusion API is extracted as a list of dictionaries. Extracting restaurant data is a more convoluted process since the nesting go as deep as 4-5 levels. This shows the level of detail returned each day. Unlike weather data, the data returned has always been clean. Due to the level of nesting, the complexity of extracting data is O(n⁵). The extracted fields are:

Restaurant Review Dataset

Name	Data Type	Description
date_id	string	Unique ID to map the restaurant reviews and weather information dataset
id	string	Unique Yelp ID of this business. Example: '4kMBvIEWPxWkWKFN__8SxQ'
name	string	Name of this business
is_closed	bool	Whether business has been (permanently) closed
url	string	URL for business page on Yelp
review_count	int	Number of reviews for this business
categories	string	Title of a category for display purpose. Example: Japanese, Italian, Juicebars vegan
rating	decimal	Rating for this business (value ranges from 1, 1.5, ... 4.5, 5)
latitude	decimal	City geo location - latitude coordinates of this business
longitude	decimal	City geo location - longitude coordinates of this business
transactions	string[]	List of transactions that the business is registered for. Current supported values are pickup, delivery and restaurant_reservation
price	string	Price level of the business. Value is one of $, $$, $$$ and $$$$.
display_address	string	Array of strings that if organized vertically give an address that is in the standard address format for the business's country
display_phone	string	Phone number of the business formatted nicely to be displayed to users. The format is the standard phone number format for the business's country
distance	decimal	Distance in meters from the search location. This returns meters regardless of the locale

[
  {
    "businesses": [
      {
        "id": "kT8IlV47kz1rz2lTuNyO1w",
        "alias": "christies-deli-philadelphia",
        "name": "Christie's Deli",
        "image_url": "https://s3-media3.fl.yelpcdn.com/bphoto/h7eEiQigXwXVq6MYdhG8vg/o.jpg",
        "is_closed": false,
        "url": "https://www.yelp.com/biz/christies-deli-philadelphia?adjust_creative=czYYjyLsMG5Xo5jy1DcI4A&utm_campaign=yelp_api_v3&utm_medium=api_v3_business_search&utm_source=czYYjyLsMG5Xo5jy1DcI4A",
        "review_count": 79,
        "categories": [
          {
            "alias": "delis",
            "title": "Delis"
          },
          {
            "alias": "breakfast_brunch",
            "title": "Breakfast & Brunch"
          },
          {
            "alias": "sandwiches",
            "title": "Sandwiches"
          }
        ],
        "rating": 5,
        "coordinates": {
          "latitude": 39.9630809276773,
          "longitude": -75.1693602651358
        },
        "transactions": [],
        "price": "$",
        "location": {
          "address1": "1822 Spring Garden St",
          "address2": "Unit B",
          "address3": "",
          "city": "Philadelphia",
          "zip_code": "19130",
          "country": "US",
          "state": "PA",
          "display_address": [
            "1822 Spring Garden St",
            "Unit B",
            "Philadelphia, PA 19130"
          ]
        },
        "phone": "+12155630555",
        "display_phone": "(215) 563-0555",
        "distance": 1045.8502098235545
      }
    ] }
    ]

In the program, datacombination.py, data from all the 4 cities were combined. During combination, an extra field was created using Pandas DataFrame to mention clearly the city in which the given restaurant is located.

Eliminated Attributes from each dataset

Restaurant Reviews

Name	Data Type	Reason not considered
business.alias	string	Every Yelp business has both a unique ID as well as a unique alias (eg: "name-of-business-separated-by-hyphen"). These can be used interchangeably. However, the business alias contains unicode characters and hence we thought using the business id is ideal and also we have the business name captured in column 'name'
business.image_url	string	Since our output presents the data in a csv format, decision to use just the URL of the business page on Yelp and not the the URL of the image was taken
business.location	object	The location object includes address1, address2, address3, city, state, zip and country. The attribute display_address, instead, merges all these elements as an array of strings which gives the address of the business in the standard address
business.phone	string	This attribute displays the phone number of the business in a simple format like - +17864520068 whereas the attribute we have used is the display_phone which displays the same number in a better standard format - (786) 452-0068

Weather Information

Name	Data Type	Reason not considered
weather.id	int	Displays the weather condition ID. Since the weather parameters and condition attributes are considered, this column was eliminated
weather.icon	string	Since our output presents the data in a csv format, use of the description of weather parameter and not related icon seemed fit
base.stations	string	This is an internal parameter for OpenWeatherMap to source weather data from meteorological broadcast services, raw data from airport weather stations, radar stations and other official weather stations
sys.type, sys.id, sys.message	number	These are internal parameters of the Sys structure that contain general information about the request and the surrounding area for where the request was made
cod	int	An internal parameter indicating the structure defined for JSON to be unmarshaled into

The processing will work even if the dataset is expanded to contain all cities in the United States, if not the world, as all that needs to be clear is the naming scheme of the files, which is taken care of by this system. The consistency in data returned by the two APIs adds to our confidence in stating so.

Challenges/Discussions/Future Work

The dataset only contains the top-rated restaurants. It may take the inclusion of businesses with lower rating to design an algorithm that predicts business viability in a select locale.
We found out that Yelp's limits were enforced in a funny way. On day 1, a total of 86 calls were accepted before the server gave an out of limit message. On day 3, only 20 calls were accepted. A closer look at the results from the 86 calls revealed that the server was looping through the same 1000 restaurants and giving out repeated returns.
Multiple returns from the same city name may create confusion. We made sure the querying result reflects the city of our choosing.
We plan to integrate the American Community Survey to the dataset by matching the neighborhood characteristics including the following. This would require us to first generate a geography ID based on the coordinates of restaurants using the census.gov geocoder API, and then query the geography of interest using the ACS APIs. Of note, the Census Bureau has made a large amount of its data repository machine-discoverable for users needing to retrieve historical data.
- Social: ancestry, language spoken at home, marital status, migration/residence 1 year ago, school enrollment, etc.
- Housing: computer and internet use, kitchen facilities, occupants per room, owner/renter, vehicle available, etc.
- Economic: employment status, health insurance coverage, industry and occupation, etc.
- Demographic: age;sex, total population =======
The distance field doesn't seem to be reflective of actual querying location (PHL) and other cities.
Timing of the API calls may yield better results in warmer seasons.

Name		Name	Last commit message	Last commit date
Latest commit History 90 Commits
Data		Data
API_request.py		API_request.py
AutomationManual.md		AutomationManual.md
Chi_map.png		Chi_map.png
ChicagoMap.png		ChicagoMap.png
PhilaMap.png		PhilaMap.png
Pil_map.png		Pil_map.png
README.md		README.md
datacombination.py		datacombination.py
dataextraction.py		dataextraction.py
offset.png		offset.png
owmcity.png		owmcity.png

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Weather-To-Eat-Or-Not-Dataset

-- Uncover the nation’s appetite – a decision support dataset for would-be restauranteurs

Introduction

Project Scope:

Motivation and Purpose:

Working with the APIs

Data aggregation/processing

Weather Information Dataset

Restaurant Review Dataset

Eliminated Attributes from each dataset

Restaurant Reviews

Weather Information

Challenges/Discussions/Future Work

About

Uh oh!

Releases

Packages

Languages

CHWangXing/Weather-To-Eat-Or-Not-Dataset

Folders and files

Latest commit

History

Repository files navigation

Weather-To-Eat-Or-Not-Dataset

-- Uncover the nation’s appetite – a decision support dataset for would-be restauranteurs

Introduction

Project Scope:

Motivation and Purpose:

Working with the APIs

Data aggregation/processing

Weather Information Dataset

Restaurant Review Dataset

Eliminated Attributes from each dataset

Restaurant Reviews

Weather Information

Challenges/Discussions/Future Work

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages