Data interchange specifications

K. Shankari edited this page Sep 23, 2015 · 5 revisions

Phone -> server component

This can be split into two components: user-generated data and data from background sensing.

User generated data

We want this to be as flexible as possible in order to support the rapid deployment of screens that can collect additional data. Each screen can add new keys to this area, and the corresponding server side module can read the corresponding keys, e.g.

{
  'confirmed_sections':
    {section_id_1: confirmed1,
     section_id_2: confirmed2,
     section_id_3: confirmed3}
}

Since our screens are javascript/HTML based, this allows us to quickly deploy new functionality without requiring native code changes and an update to the app. We can simply create a new javascript screen that writes to this data, and deploy a new python module that reads from it.
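
As a minimal sketch of the server side of this pattern, a Python module could read only the key it understands and ignore everything else. The function name `read_confirmed_sections` and the sample values are hypothetical; only the `confirmed_sections` key comes from the example above.

```python
def read_confirmed_sections(user_data):
    """Return the section_id -> confirmed value mapping.

    Returns an empty dict if no screen has written the key yet.
    """
    # Hypothetical helper: each module reads only its own key, so keys
    # added by other screens are ignored rather than rejected.
    return dict(user_data.get("confirmed_sections", {}))


# Sample payload; the ids and values are made up for illustration.
user_data = {
    "confirmed_sections": {"section_id_1": "walking", "section_id_2": "bus"},
    "some_future_key": {"anything": 1},
}
confirmed = read_confirmed_sections(user_data)
```

Because the reader tolerates unknown keys, a new screen can start writing its own key before the matching server module is deployed.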

Data from background sensing

This requires less flexibility since we have to write native code or deploy native code plugins to collect data. Since we can only get background data collection on iOS when we receive location updates, we will always collect location updates.

{
    "locations": [
        location_obj1,
        location_obj2,
        location_obj3,
        ...],
    "activities": [
        activity_obj1,
        activity_obj2,
        activity_obj3,
        ...],
    "accelerometer": [
        accelerometer_obj1,
        accelerometer_obj2,
        accelerometer_obj3,
        ...],
    "fuel_consumption": [
        fuel_consumption_obj1,
        fuel_consumption_obj2,
        fuel_consumption_obj3,
        ...]
    ...
}

Some other observations and requirements on the data from background sensing:

  1. The location array is currently required, since background data collection on iOS is driven by location updates.
  2. The format of the objects can be platform specific -- we will have a conversion layer on the server to transform data into the common long-term data storage format. This allows us to have a much more flexible conversion step that can be tweaked as necessary without updating the client. In particular, we expect that both the location and activity objects will be different on android and iOS.
  3. We expect that all data objects will have at least an associated timestamp.
  4. The JSON objects will be inserted into SQLite databases on android and iOS. Since the location points are used by the data collection state machine, each row will contain some extracted information for easy querying. It will also contain the raw JSON for easy syncing!
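
The server-side conversion layer from item 2 might look roughly like the sketch below. Every field name here (`mTime`, `mLatitude`, `lat`, `hAccuracy`, and so on) is an assumption made for illustration, not the actual android or iOS object format, and `to_common_location` is a hypothetical function name.

```python
def to_common_location(platform, raw):
    """Convert a platform-specific location object into a common dict.

    All raw field names are assumed for this sketch. A missing timestamp
    raises KeyError, which matches the expectation that every data object
    carries at least a timestamp.
    """
    if platform == "android":
        common = {
            "ts": raw["mTime"] / 1000.0,  # assumed: milliseconds since epoch
            "latitude": raw["mLatitude"],
            "longitude": raw["mLongitude"],
            "accuracy": raw["mAccuracy"],
        }
    elif platform == "ios":
        common = {
            "ts": raw["timestamp"],  # assumed: seconds since epoch
            "latitude": raw["lat"],
            "longitude": raw["lng"],
            "accuracy": raw["hAccuracy"],
        }
    else:
        raise ValueError("unknown platform: %r" % (platform,))
    common["platform"] = platform
    return common


# Made-up sample objects, one per platform.
android_raw = {"mTime": 1442966400000, "mLatitude": 37.87,
               "mLongitude": -122.26, "mAccuracy": 20.0}
ios_raw = {"timestamp": 1442966400.0, "lat": 37.87,
           "lng": -122.26, "hAccuracy": 65.0}
converted = [to_common_location("android", android_raw),
             to_common_location("ios", ios_raw)]
```

Keeping this mapping on the server means a change in how a field is interpreted only needs a server deploy, not a client update.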

Combined

The two will be combined into a single data structure as follows:

{ 'user': 
  {'confirmed_sections': 
        {section_id_1: confirmed1,
         section_id_2: confirmed2,
         section_id_3: confirmed3}
  },
  'background':
  {
    "locations": [
        location_obj1,
        location_obj2,
        location_obj3,
        ...],
    "activities": [
        activity_obj1,
        activity_obj2,
        activity_obj3,
        ...],
    "accelerometer": [
        accelerometer_obj1,
        accelerometer_obj2,
        accelerometer_obj3,
        ...],
    "fuel_consumption": [
        fuel_consumption_obj1,
        fuel_consumption_obj2,
        fuel_consumption_obj3,
        ...]
    ...
  }
}
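
A minimal sketch of building this combined structure, assuming a hypothetical helper named `combine_phone_data`. The check on `locations` reflects the requirement above that the location array is currently always present; the sample values are made up.

```python
def combine_phone_data(user_data, background_data):
    """Wrap the two phone -> server parts in a single dict."""
    # The location array is currently required, so reject payloads without it.
    if "locations" not in background_data:
        raise ValueError("background data must include a 'locations' array")
    return {"user": user_data, "background": background_data}


# Made-up sample values for illustration.
combined = combine_phone_data(
    {"confirmed_sections": {"section_id_1": "walking"}},
    {"locations": [{"ts": 1442966400.0}], "activities": []},
)
```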

Server -> phone component

This can again be divided into two components. The first is intended for user display (results, etc), and the second is intended for configuring the background collection. Again, the structure of the first part needs to be very flexible, in order to support dynamically deploying screens along with dynamically generated data. At the same time, the second part does not need to be as flexible, since changes to it will not work unless there are changes to the corresponding native code.

User results

{
    'data':
    {
        'carbon_footprint':
        {
            'mine': mFootprint,
            'mean': meanFootprint,
            'all_drive': driveFootprint,
            'optimal': optimalFootprint,
        },
        'distances':
        {
            'mode1': mode1Distance,
             ....
        },
        ...
    },
    'game':
    {   
        'my_score': 3435,
        'other_scores':
        {
            'person1': 2313,
            'person2': 4123,
            'person3': 2111
        }
    },
    ...
}

Background configuration

{
    'pull_sensors': ['accelerometer', 'gyroscope', ...],
    'location':
    {
         'accuracy': POWER_BALANCED_ACCURACY,
         'filter': DISTANCE_FILTER,
         'geofence_radius': 100,
         ...
    },
    ...
}

The actual travel diary!!

Should the "documents" be the raw data, or should they be geojson encoded representations? If they are geojson encoded representations, we can simply display them using leaflet and put more of the processing on the server side. But it also means less flexibility on the client side for interactivity.

In general, my gut feeling is to send the raw data and generate the view on the client side. But geojson is data, just data in a different model than we have on the server side. Is it wrong to send data in the geojson format?

Let us think through this some more.

What is the difference between our data and geojson data? Why aren't we just using the geojson format?

Using the raw data

  • geojson encodes the spatial data, but not the temporal data or the accuracy, which is pretty important. But every geojson geometry can have an arbitrary number of properties, which can be used to encode the time and accuracy, albeit in a crude fashion.
  • it is a quick and cheap representation, but does not allow for as much customization. For example, if we wanted to see the intermediate points along a line using div markers, it wouldn't be supported using the default representation. But for a quick and dirty fix, we could just not show the points along the line. Also, it looks like the pointToLayer representation does support that.
  • It feels like a visualization of the data rather than the raw data, and in general, sending the raw data gives the most flexibility. For example, it is not consistent with our notion of a timeseries with an overlay of the trips and sections on top.
  • if we wanted to encode user edits to the sections and trips, how would we link back to the object IDs?

Using geojson

  • it is a standard, standards are good
  • it is not a bad idea to send data that is optimized for visualization, since it is a visualization. As long as we store/model the data in the generic way, we are fine.
  • even in a progressive system, we do have to materialize views over the data - think of this as a materialized view.
  • in particular, consider the case in which we want to store multiple versions of some inferred property such as a trip or a section or a mode. This could be from multiple inference algorithms, as well as from user overrides. If we wanted to send the raw data over, we should theoretically send all the versions over. If we wanted to send only the best version over, for some version of best, then we are already sending a view, not the full representation. Sending over geojson allows us to do most of the reasoning about the view on the server side, in a fairly flexible fashion.
  • It also means that we can unify the visualization on the server side for analysis and the visualization on the phone. One method that exports to geojson can be used to generate both files/ipython notebook visualizations, and phone visualizations.
  • The pointToLayer, style and onEachFeature options support pretty much any interaction and styling that I can think of now.
  • There are also a few implementation decisions that become easier with this.
    • the document sent can be a geojson document for a day instead of a set of small documents for each point
    • for historical days, we can just call an API that returns the geojson for that day and display it. if we wanted to convert to geojson on the client side, we would need to do the conversion while the user is waiting, which is not cool.
    • if we wanted to generate geojson on the client, presumably we would read the trip and section data, and then query the database for points that are within them. But on the phone, we can only really query by metadata, and note that we have the gap in the sections if we use write_ts because of the gap between ts and write_ts.
    • also, if we then wanted to display data for a prior day, we would need to get all the points, store them into the database and then query for them to reconstruct the geojson. That is an admin nightmare. Alternatively, we have to be able to work from points in memory, which means that for the current day, we need to be able to query all points into memory and then select from them, which is not too bad, but seems like a pretty heavy memory drain.

Conclusion: We are going to return GeoJSON for now. We can revisit this later if it turns out to be a hassle. This also implies that for many of the result visualizations, such as the carbon footprint, we can send a visualization instead of the raw data. In particular, the matplotlib -> d3 library mpld3 (http://mpld3.github.io/index.html) has come a long way since I last looked at it. It now supports outputting JSON that corresponds to a matplotlib figure and then displaying it using the mpld3.js library. So we can use all the cool stuff that generates matplotlib figures, including from pandas dataframes, generate JSON, and send the JSON over to the client as the "document". Again, this allows us to quickly iterate on common visualizations from the server side. A similar effort is vega (https://github.com/vega/vega), which also allows some interactivity in the grammar (unsure how important this is).
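
To make the GeoJSON option concrete, here is a rough sketch of a server-side export for one day. It uses the feature `properties` to carry the timestamps, the accuracies and a section id, which is one possible answer to the concerns above about temporal data and linking edits back to object IDs. The function names, the property names and the shape of the input points are all assumptions for illustration.

```python
import json


def section_to_feature(section_id, points):
    """Build a GeoJSON Feature for one section.

    `points` is assumed to be a list of dicts with ts, latitude, longitude
    and accuracy keys. A LineString needs at least two points.
    """
    return {
        "type": "Feature",
        "geometry": {
            "type": "LineString",
            # GeoJSON positions are [longitude, latitude], in that order.
            "coordinates": [[p["longitude"], p["latitude"]] for p in points],
        },
        "properties": {
            "section_id": section_id,  # lets user edits refer back to the object
            "times": [p["ts"] for p in points],
            "accuracies": [p["accuracy"] for p in points],
        },
    }


def day_to_geojson(sections):
    """`sections` maps a section id to its list of points."""
    return {
        "type": "FeatureCollection",
        "features": [section_to_feature(sid, pts)
                     for sid, pts in sections.items()],
    }


# Made-up sample data: one section with two points.
sample_day = {
    "section_id_1": [
        {"ts": 1442966400.0, "latitude": 37.87, "longitude": -122.26, "accuracy": 20.0},
        {"ts": 1442966460.0, "latitude": 37.88, "longitude": -122.27, "accuracy": 15.0},
    ]
}
day_document = json.dumps(day_to_geojson(sample_day))
```

The same `day_to_geojson` output could be written to a file for server-side analysis or returned to the phone as the document for that day.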

Common results

Things that can be used for both user results and background configuration, e.g. common trips.

Combined

{
    'user':
    {
        'data':
        {
            'carbon_footprint':
            {
                'mine': mFootprint,
                'mean': meanFootprint,
                'all_drive': driveFootprint,
                'optimal': optimalFootprint,
            },
            'distances':
            {
                'mode1': mode1Distance,
                 ....
            },
            ...
        },
        'game':
        {
            'my_score': 3435,
            'other_scores':
            {
                'person1': 2313,
                'person2': 4123,
                'person3': 2111
            }
        },
        ...
    },
    'background_config':
    {
        'pull_sensors': ['accelerometer', 'gyroscope', ...],
        'location':
        {
             'accuracy': POWER_BALANCED_ACCURACY,
             'filter': DISTANCE_FILTER,
             'geofence_radius': 100,
             ...
        },
        ...
    },
    'common':
    {
        'tour_models': {
            ...
        }
    }
}

Builtin implementation

In the builtin implementation, the cache sync happens through two REST API calls.

  • The /usercache/put call sends the data from the phone to the server in a field called phone_to_server, i.e.
{
    'phone_to_server': {
        'user': {...},
        'background': {...}
    }
}
  • The /usercache/get call pulls the data from the server to the phone. This doesn't have a root node (Question: Should it?), e.g.
{
    'background_config': {...},
    'user': {...}
}
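
A small sketch of the two message bodies, leaving out the HTTP transport and any authentication. The helper names `build_put_body` and `parse_get_body` are hypothetical; only the `phone_to_server`, `user`, `background` and `background_config` keys come from the description above.

```python
import json


def build_put_body(user_data, background_data):
    """Serialize the body for /usercache/put, wrapped in phone_to_server."""
    return json.dumps({
        "phone_to_server": {"user": user_data, "background": background_data}
    })


def parse_get_body(body):
    """Parse the body returned by /usercache/get, which has no root node.

    Missing parts default to empty dicts in this sketch.
    """
    doc = json.loads(body)
    return doc.get("user", {}), doc.get("background_config", {})


# Made-up sample values for illustration.
put_body = build_put_body({"confirmed_sections": {}}, {"locations": []})
user_results, background_config = parse_get_body(
    '{"background_config": {"location": {"geofence_radius": 100}}, "user": {}}'
)
```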

Implementation plan

Since this is the background data collection repo, we will implement the background parts of both the server -> phone and phone -> server components as part of enhancing the background data collection.