Skip to content

Data Assessment CMP data

Diana M. Dorinson edited this page Apr 7, 2017 · 18 revisions

Preliminary Questions

  • How far back should we go within the CMP folder? The 2015 files include some prior year data, so we could start with that folder and organize ourselves for updates going forward, but there may be reasons to look backwards at this stage.
    • Billy: I think we want to go all the way back to 2001, right? So we'll need to verify the data is consistent across those years...
    • Bhargav: The previous years' data in 2015 folder should be accurate. In case where numbers have changed due to change in methodology, the numbers using latest methods should be used.
    • Joe: Yes, include previous years' data, as far back as available.
  • Are we storing and displaying only the synthesized results, or do we also need to worry about capturing the individual runs & raw data, e.g.:
    • For auto speeds (In CMP > 2015...> Task 11...), the file Floating Car Results 2015 v1.0_WentToLOSF.xlsx has a tab labeled "Run Times", in addition to "Aggregated Results"
    • For the transit speeds files, do we have to do anything with the APC data?
    • Bhargav: For CMP no raw data needs to go into the data warehouse. Hopefully, fewer floating cars runs need to be done in the future. We need to figure out an approach for APC data since we need that for other applications (Fast-Trips, CHAMP validation, TNC studies etc.).
    • Joe: Agree that no raw LOS data needs to go into the data warehouse. Would like to better define use cases for APC data for CMP.
  • What about handling old versions of data that seem to have been subsequently been revised / superseded, e.g.:
    • In the 2015 Task 11 folder, do we care about the file 2013 LOS Monitoring Results, original and updated.xlsx).
    • In Task 13 folder, do we need both v1.0 and v2.0?
    • Bhargav: older versions of CMP data could be handled outside the data warehouse. Only the latest versions should be used.
  • Do we need to assess the data/structure for all contents in our target files, or do we only focus on the data table(s) of particular interest for this exercise?
    • Billy: I think we just want to focus on the data tables themselves?
    • Bhargav: Yes, we need to make sure all the metrics reported/tracked in Q:\CMP\CMPSF 2015\Final\CMP_2015_FINAL.pdf are in the database. We must note that other CMP datasets would go into the warehouse in the future - SWITRS, mainline and intersection counts etc.

Source of Data

  • Where does it come from?
    • All LOS data is generated through the work of a consultant team under contract to SFCTA (e.g., Iteris+Veritech)
    • Auto speed data: Prior to 2009, all auto speed data was derived from floating car runs. Beginning in 2009 [2011?] INRIX data became available, and is now the first choice source. The consultant downloads raw data directly from the INRIX site via MTC's license; for any segments with insufficient INRIX data, consultant conducts physical floating car runs and aggregates the results.
    • Transit speed data: Historically, some AVL data has been used for this purpose [years?]. More recently, the analysis is derived from an analysis of raw APC data; calculations are performed by consultant team.
  • Who owns it?
    • Data from the sources above is manipulated by consultants with scripts to create final tables of processed results.
    • Auto speed data: For INRIX, SFCTA receives only processed outputs from consultants; raw TMC data and scripts are not currently provided as contract deliverables at this time. For floating car runs, SFCTA receives a summary of individual run results in the Excel file that include the aggregated summaries, but the individual run data will not be included in the Data Warehouse anyway.
      • FUTURE FOLLOW-UP: At this time, SFCTA does not receive COV calculations for INRIX data. Consider adding to future scope of work, for parity with transit speed analysis?
    • Transit speed data: SFCTA receives APC data from SFMTA [confirm?] and sends this to the consultant. SFCTA receives a flat Excel file back with processed results; scripts are not currently provided as contract deliverables.
      • FUTURE FOLLOW-UP: Look into options for regular updates of APC data, as appropriate for SFCTA use cases (TBD) and current or pending availability of data feed from external sources.

Current data handling

  • Where does it live now? (on the network drive)
    • Stored on the SFCTA "gis" server [see O:\CMP]
    • There are sub-folders for each year that CMP monitoring/reporting is conducted, which occurs roughly every two years.
    • The files with the data of interest for this effort are named "LOS Monitoring YYYY"
    • We will store and display historical data, but the files in the folder for 2015 contain columns for prior years' data, so our approach will be to focus on the folder LOS Monitoring 2015 as the starting point.
  • How is it managed: subfolders? file naming? etc
    • Within each "LOS Monitoring..." folder for each year, the sub-folder organization varies depending on the exact scope of work for the consultants that year. In 2015, there were three scope tasks of interest for the overall DW/DV effort (descriptions below were provided by Bhargava):
      • Task 11 LOS Monitoring -- This has speeds on CMP segments and calculated LOS. Speeds mostly come from processed INRIX data but a few floating car runs are also conducted on segments that do not have reliable data from INRIX. These segments vary each year (Iteris does the relevant analysis). My hope is that the number of CMP segments that need floating car runs is decreasing each cycle (due to INRIX data getting better). I will be closely working with Iteris during CMP 2017 cycle and will gain a better understanding of the processes involved.
      • Task 12 Traffic Counts -- We have started monitoring both mainline counts and intersection movements at important locations throughout the city. Ideally these counts would be regularly uploaded into Count Dracula. This folder has the summarized data but you can find it more detailed form here - O:\CMP\LOS Monitoring 2015\From Iteris\Task 12 Traffic Counts.
        • Note: Traffic Counts are a low-priority use case that will be handled separately, after we tackle auto LOS (above) and transit speeds (below).
      • Task 13 Transit Speeds -- Transit speeds on CMP segments. This also has APC data we sent to Iteris along with documents describing the process used to derive transit speeds on CMP segments using APC data.
    • In addition, some data files can be found under sub-folder named Task 14 - LOS Monitoring report
    • As implied in the descriptions above, there are two tiers of files inside the LOS Monitoring 2015 folder. The discussion below will include the full file path to clarify where the referenced file is located:
      • Top level folders for Task 11 through Task 14 --or--
      • Secondary folders for Task 11 through Task 14 within the sub-folder From Iteris
  • Who at SFCTA is in charge of it?
    • For 2017, Bhargava Sana will lead the effort. Drew Cooper has been involved in recent years.
  • What are the QA/QC steps involved when it is received?
    • General review of consultant work for quality & completeness, but no formal validation at this time.

Data Records & Format

CMP Segment Definitions

  • All CMP data is organized based on a standard set of CMP segments, which can be defined by the following attributes:
Field Name Data Type Alternative Column Names Description / Notes
CMP ID integer Cmp ID unique numerical value that identifies a specific roadway segment ^
Route string Name, Route Name name of street being measured *
From string Start Intersection name of cross-street at beginning of segment (in the direction of travel) *
To string End Intersection name of cross-street at end of segment (in the direction of travel) *
Direction char Dir single letter representing the compass direction (N, S, E, W) for the direction of travel being measured
Length float Dist, Distance, Old Dist., 2006 Dist CMP segment length, in miles *
Class 1985 string Class indicates the roadway facility type according to the 1985 HCM definitions; values may be an Arabic numeral, a Roman numeral, or the text "Fwy" +
Class 2000 string Class, HCM 2000 Class indicates the roadway facility type according to the 2000 HCM definitions; values may be an Arabic numeral, a Roman numeral, or the text "Fwy" +
Comments string Notes notations on known reasons for data anomalies (optional?) #
  • Notes:
    • ^ Each of the records in both auto & transit files is associated with a peak period (AM or PM). In some cases, an entire worksheet is devoted to results for one of the two peaks. In some (but not all) of the worksheets that contain data for both peaks, the peak indicator is concatenated to the CMP ID and stored in a field named simply "ID." Although this ID is potentially a unique identifier, the time period is more logically associated with the record of measured travel speed than as an attribute of the roadway itself.
    • * These attributes may sometimes change over time as roadways are re-named and/or re-constructed. We will need a mechanism to identify the year(s) for which a particular segment definition is valid.
    • + The numerical Class assigned to each arterial segment is determined based on guidelines in the Highway Capacity Manual (HCM). LOS calculations are performed using both the 1985 and 2000 versions of the HCM, and both sets of calculations should be stored in the Data Warehouse. The results for 1985 HCM and 2000 HCM are often reported on separate worksheets, in which case the source for the Class category may be implied by the context of the worksheet, rather than being explicitly labeled in the column header.
    • # 'Comments' might logically be associated with CMP segments (above) or speed records (see below) or both. Will consider potential needs/uses for this field during Task 2.
  • The files for auto and transit speeds will have some or all of the attributes above listed in the leftmost columns of each worksheet. In some cases, three of the attributes above are displayed using the compound format "ROUTE: FROM to TO" in a single column. Regardless of the column layout, there should be a 1:1 association between the CMP segment and the individual records of speed and LOS rating described below.
  • There are currently 245 defined CMP segments. Over time, new segments are added to monitoring activities, so some historical files may have fewer records. It is our understanding that CMP ID numbers are never re-used.

Auto speeds file: O:\CMP\LOS Monitoring 2015\From Iteris\Task 14 LOS Monitoring Report\4. LOS Spreadsheet (updated)\Task 11_LOS monitoring v2.0.xlsx

  • How large is one set of the data?
    • Processed data files are extremely small, typically < 2MB
  • Column detail: a database will require each column to have a very specific defined TYPE.
    • Based on discussions with Bhargava & Drew, file formats and worksheets vary so much that it will be more efficient to define a preferred file format, and have someone else (Jayne? Iteris?) convert the data sets into the target format, rather than to document the numerous options available. See below for additional details.
  • Missing data: how are missing cells recorded? (By column, if need be)
    • Blank cells typically indicate that data was not collected or available for a particular year and should remain as NULL. In some cases, worksheets cells that appear empty are populated with formulas that return "" for missing values.
  • Worksheets to focus on: Data of interest is currently stored on multiple worksheets in this file. SFCTA staff will help assemble into flat files after the data schema has been drafted.
    • HCM1985 AM -- Contains the latest calculation of average speed and LOS rating in the AM Peak going back to 1991, calculated using the 1985 HCM formulas and definitions. Values are generally pasted in, except for 2013 and 2015, which use lookups to the "Master" worksheets noted below.
    • HCM1985 PM -- Same as above, but for PM peak period.
    • HCM2000 -- Same as two tabs above, except that data only goes back to 2000, and AM Peak and PM Peak columns are shown side-by-side for each year instead of on separate worksheets.
    • Master (2015) -- Contains a formatted table displaying more data for each record than the multi-year summaries in the three worksheets noted above. In particular, this sheet contains separate columns showing whether the speed value was obtained from INRIX or the Floating Car runs; the sample size for each speed calculation; LOS ratings displayed in separate columns for Arterials using 1985 HCM, Arterials using 2000 HCM and Freeway segments; and a calculation of travel time along the segment according to the average speed.
    • Master (2013) -- Same as worksheet above, but with 2013 data.
      • Note: per 'Sheet Descriptions' tab, this tab now uses updated methodology for 2013.
  • Additional data: This file does not contain some items of interest for historical records earlier than 2013. SFCTA staff will review historical files to identify the files that contain this information for historical years:
    • The source of each speed record for 2009 and 2011, i.e., INRIX vs. Floating Car (all entries prior to 2009 are floating car only).
    • Sample size for historical records.
    • Travel time can be calculated from other available fields, but we will need to confirm the timing of any changes in segment length.
    • Comment field, used to flag unusual results or causes of data anomalies.

Transit speeds file #1: O:\CMP\LOS Monitoring 2015\Task 13 Transit Speed Monitoring\2015_Transit_LOS_Results.xlsx

  • How large is one set of the data?
    • Processed data files are extremely small, typically < 2MB
  • Column detail: a database will require each column to have a very specific defined TYPE.
    • Based on discussions with Bhargava & Drew, file formats and worksheets vary so much that it will be more efficient to define a preferred file format, and have someone else (Jayne? Iteris?) convert the data sets into the target format, rather than to document the numerous options available. See below for additional details.
  • Missing data: how are missing cells recorded? (By column, if need be)
    • Some cells are empty, but several columns show a value of "NA" to indicate that data is not available. These values should probably be converted to NULL.
  • Worksheets to focus on: Data of interest is currently stored on multiple worksheets in this file. SFCTA staff will help assemble into flat files after the data schema has been drafted.
    • 2015_Transit_AM -- For each CMP segment, presents average transit speed in 2015 (or NA if speed was not available), sample size used to derive the 2015 average, the standard deviation of the sample, and the 2013 average transit speed.
    • 2015_Transit_LOS_PM -- Same as above, but for PM peak.
  • Additional data: This file does not contain some items of interest for historical records. SFCTA staff will review historical files to identify the files that contain this information for historical years:
    • Sample size and standard deviation for 2013 and years prior.
    • Travel time can be calculated using other available fields, as long as we confirm when the length changed over time.
    • For 2015, the source is analysis of APC data. Prior years' source may include other data sources.
    • Note: Transit speed files do not appear to include comments that are separate from those included in the auto speeds files.

Transit speeds file #2: O:\CMP\LOS Monitoring 2015\Task 13 Transit Speed Monitoring\Transit Speeds v2.0.xlsx

  • This file is an alternative presentation of the data contained in the previous file; it may be easier to work with, depending on the final database schema.
  • There is a single sheet displays values for AM Peak and PM Peak in separate columns side-by-side, instead of separate sheets. Also, where data is unavailable, cells are empty/blank. On the other hand, the file contains multiple header rows repeated throughout, likely to support hardcopy printing.

Transit speeds file #3: O:\CMP\LOS Monitoring 2015\Task 13 Transit Speed Monitoring\Transit Speeds v2.0 compare 2013 2015.xlsx

  • This file contains a good example of the derived values that we probably want to pre-calculate and store within the Data Warehouse for rapid retrieval:
    • Worksheet name: AllCMPSegments
    • Records: 245 total; first record in row 6
    • 16 columns in each record. Excel worksheet contains both numerical values and formulas to calculate results.
Position Description Type (V)alue or (F)ormula Notes
1 CMP segment ID number integer V permanent designation; lookup table available linking ID & description
2 CMP segment description string F permanent designation; uses vlookup from elsewhere in same file
3 2015 auto speed - AM Peak float V consider using double due to some values with many decimal places?
4 2015 auto speed - PM Peak float V consider using double due to some values with many decimal places?
5 2013 auto speed - AM Peak float V consider using double due to some values with many decimal places?
6 2013 auto speed - PM Peak float V consider using double due to some values with many decimal places?
7 Difference in auto speed - AM Peak float F calculated as 2015 (column 3) less 2013 (column 5)
8 Difference in auto speed - PM Peak float F calculated as 2015 (column 4) less 2013 (column 6)
9 2015 transit speed - AM Peak float F vlookup from elsewhere in same file
10 2015 transit speed - PM Peak float F vlookup from elsewhere in same file
11 2015 transit to auto ratio - AM Peak float F calculated as transit (column 9) divided by auto (column 3)
12 2015 transit to auto ratio - PM Peak float F calculated as transit (column 10) divided by auto (column 4)
13 2015 auto to transit ratio - AM Peak float F calculated as auto (column 3) divided by transit (column 9)
14 2015 auto to transit ratio - PM Peak float F calculated as auto (column 10) divided by transit (column 4)
15 2015 auto to transit category - AM Peak float F value from column 13 rounded down to nearest 0.25
16 2015 auto to transit category - PM Peak float F value from column 14 rounded down to nearest 0.25
  • Note that there are no empty cells in this table. Zero values are present in some transit data columns; this indicates either no transit service on the segment, or no data available for the segment. Should be interpreted as (or converted to) NULL value.

Potential approach for storing speed records

  • As noted above, the file structures across the years are quite varied, but can likely be harmonized into a single format, potentially across both auto and transit, if each record contains the following fields:
Field Name Data Type Description / Notes
CMP ID integer segment ID
Year integer use validation to ensure values relate to CMP reporting years
Peak string AM or PM designation
Mode string 'Auto' or 'Transit'
Speed float average value
StdDev float standard deviation of speed (is this transit only??)
Source string data source, from specified set (e.g., 'INRIX', 'Floating Car', 'APC')
Sample Size integer number of records used to derive the mean & SD values
Travel Time float calculated value, in minutes
LOS Rating char letter-grade, according to HCM definition noted in next field
LOS Method string indicate whether LOS value comes from 1985 HCM, 2000 HCM, or Freeway table
Comments string notations on known reasons for data anomalies (optional?) #
  • # 'Comments' might logically be associated with CMP segments or speed records or both. Will consider potential needs/uses for this field during Task 2.

Updates & Upkeep

  • How often is new data delivered & available?
    • LOS results (all files noted above) are delivered once every two years, as part of the formal CMP process.
  • Does existing data get corrected/updated after arrival?
    • Yes, for one of two reasons:
      • On rare occasions, CMP segments change between reporting cycles. This happens most often when a street is renamed, but in some cases the value of segment length changes due to reconstruction. The conversion of Doyle Drive to Presidio Parkway is a prime example of changes in both name and length. As far as we know, CMP segment ID numbers always refer to the same general stretch of roadway, even if some of the roadway's attributes change over time.
      • More generally, values of speed and/or LOS rating are sometimes updated to reflect minor changes in calculation methodology. If this happens, new data should replace the old.

Privacy

  • Does any aspect of this dataset have potential privacy issues?
    • None identified.
  • If so are there current practices for hiding records/merging cells etc?
    • N/A

Meta Stuff

  • Are there any issues with the current setup needing attention, that might be in scope as long as we're in there doing stuff anyway?
  • Other notes & findings
    • CMP files not in priority use cases:
      • Traffic count files are < 2MB for each year.
      • APC data files are ~ 3GB for a single year.
    • Background notes on the INRIX data:
      • One of the reasons that consultants have been needed for this part of the effort is that INRIX periodically changes the definition of its unit of analysis (called Traffic Message Channels, or TMCs), which then requires an analyst to re-work the calculations that assemble the TMCs into the segments of interest for CMP reporting.
      • For future development, we should keep in mind that INRIX link speed data is available from MTC more frequently than biennially. At this time, storing and displaying the fine-grained link data is a very low priority use case. In addition, SFCTA does not (yet?) have access to the scripts that would be needed to process the raw data into a more useful form, but this may be a possibility in the long-term.
Clone this wiki locally