Skip to content

Design Doc: Adding in Geography and Time

Tristan Crockett edited this page Dec 16, 2016 · 1 revision

Geography and Time Deep Dive Writeup

Geography

What geography should we store for jobs and skills? My intuition is that anything more granular than metro area is probably a waste. In some cases, we only get a state. In others, we get metro area, and in others, a city/state. I think we can convert city/state to metro area, and store data long-term as either state or metro area.

At some point in the future, we can join state with metro area, so if a user wants to filter by state, we can reference both the state and metro area entries. This can be done using PostGIS, but for something as high-level as metro area that may be unnecessary.

##Time

We should just be able to store quarter. There might be more granular data during processing time, but that can be aggregated to quarter in the database.

What do we store for each job title per time/geo?

We have numbers of jobs, but we can't really just add up jobs from different sources and call it the total because there would be duplicates between sources, among other reasons. Do we just store the fact that the job title exists? Or some simple way to denote volume of jobs, so we can make a job title with 1000 instances appear higher than a job title with 1 instance? For instance, normalizing to 1-5?

##Schema Additions ###Quarters

  • id
  • name

###Geographies

  • id
  • geography_type
  • geography_name

###JobPresence

  • geography_id
  • quarter_id
  • job_uuid
  • num_jobs? (min_jobs? max_jobs? boolean? S/M/L?)

##Future Schema Additions ###Geography_Shapes

  • id
  • geom

Notes on Storage Capacity

The idea that we want to control database storage capacity as the project grows has come up. Here are some unrelated ideas I have on this.

  1. Small win: If we can decide on quarter and metro area/state as the most granular we get, those ids can easily be smallints.

  2. We could store change over time, instead of each geo/job combo every quarter. For instance, does the fact that the job title Software Engineer exists in the Chicago metro area really change from quarter to quarter? How about the skills needed to be a software engineer in the Chicago area? I doubt the list of skills will fully change from quarter to quarter, so what if we stored a baseline and then deviations from the baseline? Fully store the skills needed at year 1, and then only store what changed each year. The same goes for geography. For software engineer, won't most of the skills be the same for different geographies?

  3. Complex historical queries could be computed on-demand instead of cached; if we know how to compute through grabbing the correct time/geography data from S3, do we really need to store computed versions of all of them?