
Architecture

Gregory Martin edited this page Aug 11, 2020 · 4 revisions

The code is organized according to Python module best practices.

The pipeline queries three endpoints:

  • Google Analytics
  • QScend
  • Citizenserve

The cleaned data is stored in Socrata. Google Analytics and QScend are standard REST endpoints, while Citizenserve is an FTP endpoint with a collection of dated CSV files stored in a shared directory. The pipeline queries the endpoints (or downloads the data files), cleans and organizes the data into Socrata-friendly shapes, and uploads it to the public Socrata data endpoints.
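The query, clean, and upload flow described above can be sketched roughly as follows. This is a minimal illustration, not the project's code: `run_stage` and the `fake_*` helpers are hypothetical stand-ins, and no real endpoint is contacted.

```python
from typing import Callable, Dict, List

Record = Dict[str, object]


def run_stage(
    fetch: Callable[[], List[Record]],
    clean: Callable[[List[Record]], List[Record]],
    upload: Callable[[List[Record]], int],
) -> int:
    """Fetch raw records, clean them, upload them; return the number uploaded."""
    raw = fetch()
    cleaned = clean(raw)
    return upload(cleaned)


# Hypothetical stand-ins so the sketch runs without network access.
def fake_fetch() -> List[Record]:
    return [{"Type": " Pothole ", "count": "3"}, {"Type": "", "count": "1"}]


def fake_clean(rows: List[Record]) -> List[Record]:
    # Drop rows with no type, then normalise key names and value types.
    return [
        {"type": str(r["Type"]).strip().lower(), "count": int(str(r["count"]))}
        for r in rows
        if str(r["Type"]).strip()
    ]


uploaded: List[Record] = []


def fake_upload(rows: List[Record]) -> int:
    # A real upload would send the rows to Socrata; here they are only collected.
    uploaded.extend(rows)
    return len(rows)


count = run_stage(fake_fetch, fake_clean, fake_upload)
```

Passing the three steps in as functions keeps each one easy to replace or test in isolation; the actual pipeline organizes them as classes, as described in the sections below.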

Organization

/config

Settings that depend on outside systems (such as Socrata data store locations, passwords, or QScend data categories) are tracked here. If a permit category, a status code, or a QScend category changes, this is the simplest place to update it.

Remember to validate any new JSON before committing to master.
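One way to check the JSON before committing is a small script built on the standard library. This is a sketch only: the function name `find_invalid_json` is hypothetical, and it assumes the config files use a `.json` extension and sit directly in the directory passed in.

```python
import json
from pathlib import Path
from typing import Dict


def find_invalid_json(config_dir: str) -> Dict[str, str]:
    """Return a mapping of file name to parse error for every invalid JSON file."""
    errors: Dict[str, str] = {}
    for path in sorted(Path(config_dir).glob("*.json")):
        try:
            json.loads(path.read_text(encoding="utf-8"))
        except json.JSONDecodeError as exc:
            # lineno and colno point at the place where parsing failed.
            errors[path.name] = f"line {exc.lineno}, column {exc.colno}: {exc.msg}"
    return errors
```

Calling `find_invalid_json("config")` from the repository root would return an empty dict when every file parses. A single file can also be checked from the terminal with `python -m json.tool <file>`.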

/bin

This is the command-line interface for the pipeline. It is mostly used in production, but can also be run from any terminal whose IP address is allowed by the QScend API. The available commands are:

  • stat_pipeline: a normal pipeline run
  • stat_pipeline -i: one-off queries that dump to CSV files for upload to the Somerville Socrata instance via the Socrata UI
  • stat_pipeline -m: a (long) historical data dump to the QScend Data Endpoints
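The flag handling above could be modelled with `argparse` along these lines. Only the flags `-i` and `-m` come from this page; the function names, the `dest` names, the mode strings, and the choice to make the two flags mutually exclusive are assumptions for illustration, and the real script may parse its arguments differently.

```python
import argparse
from typing import List


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="stat_pipeline")
    # Assumption: the two special modes cannot be combined.
    group = parser.add_mutually_exclusive_group()
    group.add_argument("-i", dest="one_off", action="store_true",
                       help="one-off queries dumped to CSV files")
    group.add_argument("-m", dest="historical", action="store_true",
                       help="long historical data dump")
    return parser


def select_mode(argv: List[str]) -> str:
    """Map command-line arguments to a (hypothetical) mode name."""
    args = build_parser().parse_args(argv)
    if args.one_off:
        return "one_off_csv"
    if args.historical:
        return "historical_dump"
    return "normal_run"
```

With this sketch, `select_mode([])` selects the normal run, while `select_mode(["-i"])` and `select_mode(["-m"])` select the two special modes.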

/stat_dashboard_pipeline

This is the main Python codebase for the pipeline.

/

The primary entry point (called by /bin/stat_pipeline) is __init__.py.
Its root class inherits from /stat_dashboard_pipeline/config.py, which stores the authentication and endpoint variables, and instantiates the child classes of the four API client classes.
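The relationship between the root class and the configuration can be pictured with the sketch below. All class names, attribute names, and values here are hypothetical; the real classes in config.py and __init__.py may be shaped differently.

```python
from typing import Any, Dict, List


class Config:
    """Hypothetical stand-in for the configuration class in config.py."""

    def __init__(self, settings: Dict[str, Any]) -> None:
        # Authentication and endpoint values live on the instance.
        self.credentials: Dict[str, str] = settings.get("credentials", {})
        self.endpoints: Dict[str, str] = settings.get("endpoints", {})


class ExampleSource:
    """Hypothetical child class for one data source."""

    def __init__(self, endpoint: str, token: str) -> None:
        self.endpoint = endpoint
        self.token = token


class Pipeline(Config):
    """Root class: inherits the configuration, then builds its child classes."""

    def __init__(self, settings: Dict[str, Any]) -> None:
        super().__init__(settings)
        # Each child receives only the values it needs from the inherited config.
        self.sources: List[ExampleSource] = [
            ExampleSource(self.endpoints[name], self.credentials[name])
            for name in sorted(self.endpoints)
        ]


pipeline = Pipeline({
    "endpoints": {"example": "https://example.invalid/api"},
    "credentials": {"example": "not-a-real-token"},
})
```

Because the root class inherits the configuration, endpoint and credential values are resolved in one place and handed down to each child class when it is constructed.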

/clients

This directory contains the classes for the four data endpoints, along with some commonly used methods. Each class is the parent of the corresponding class in the /pipeline directory, so its methods are available there. The exception is socrata_client.py, which has no /pipeline counterpart because it is not a data source.

If an endpoint changes, or there is an issue connecting to an endpoint, this is where a contributor would want to look to find answers.

/pipeline

These are child classes of their respective clients/* classes. They call the methods of those parent classes, groom the data into Socrata-friendly shapes, and store the result so that the parent Pipeline class in /stat_dashboard_pipeline/__init__.py can read it.
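The client and pipeline relationship can be sketched as below. `ExampleClient`, `ExamplePipeline`, the field names, and the canned data are all hypothetical; a real client would make a network request in `fetch`.

```python
from typing import Dict, List, Optional


class ExampleClient:
    """Hypothetical parent class: knows how to talk to one endpoint."""

    def fetch(self) -> List[Dict[str, str]]:
        # Canned data keeps the sketch runnable offline.
        return [
            {"Category": " Pothole ", "Status": "OPEN"},
            {"Category": "Graffiti", "Status": "closed"},
        ]


class ExamplePipeline(ExampleClient):
    """Hypothetical child class: grooms the client's data and stores it."""

    def __init__(self) -> None:
        super().__init__()
        self.rows: Optional[List[Dict[str, str]]] = None

    def run(self) -> None:
        raw = self.fetch()  # method inherited from the client class
        # Groom: rename keys, trim whitespace, normalise case.
        self.rows = [
            {"category": r["Category"].strip(), "status": r["Status"].lower()}
            for r in raw
        ]


stage = ExamplePipeline()
stage.run()
```

After `run()`, the groomed rows sit on `stage.rows`, where a coordinating class could read them before uploading.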

At its simplest, the pipeline queries the API endpoint, transforms the returned JSON into a Python dict, and then sends that dict to Socrata. The transformation step is where one would discard data (such as PII), or reorganize or rename data tables.
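As an illustration of that transformation step, the sketch below parses a JSON payload, drops PII fields, and renames the remaining keys. The field names, the `PII_FIELDS` and `RENAMES` tables, and the `transform` function are invented for this example and do not reflect any real endpoint's schema.

```python
import json
from typing import Any, Dict

# Hypothetical field names, chosen only for illustration.
PII_FIELDS = {"name", "email", "phone"}
RENAMES = {"typeName": "type", "dateCreated": "created"}


def transform(raw_json: str) -> Dict[Any, Dict[str, Any]]:
    """Parse a JSON list of records into a dict keyed by record id."""
    result: Dict[Any, Dict[str, Any]] = {}
    for record in json.loads(raw_json):
        # Discard PII and rename the remaining keys in one pass.
        cleaned = {
            RENAMES.get(key, key): value
            for key, value in record.items()
            if key not in PII_FIELDS
        }
        result[cleaned.pop("id")] = cleaned
    return result


raw = (
    '[{"id": 7, "typeName": "Pothole", '
    '"email": "resident@example.com", "dateCreated": "2020-08-11"}]'
)
shaped = transform(raw)
```

Here `shaped` holds one entry keyed by `7`, with the email address removed and the two remaining fields renamed.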

A contributor who wants to pull a different type of data, or to change which data is kept and which is discarded, should look here (unless the change is to categories, which is covered by the JSON files in /config).
The Google Analytics pipeline (analytics.py) is the simplest data transformation, and a good place for a new contributor to get their feet wet.

/tests

Unit tests are built with the Python unittest library. A developer can run them in the terminal with python setup.py test.
Contributors who add or change functionality must get these tests passing before a pull request is accepted for merging to master (the tests run automatically when a PR is opened or updated). Contributors who add functionality are also encouraged to write tests for it.
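A new test might look like the sketch below. The function under test, `normalize_category`, is a hypothetical helper defined inline so the example is self-contained; in the repository the test would import the code it exercises.

```python
import unittest


def normalize_category(value: str) -> str:
    """Hypothetical helper: trim whitespace and lower-case a category name."""
    return value.strip().lower()


class NormalizeCategoryTest(unittest.TestCase):
    def test_trims_whitespace(self) -> None:
        self.assertEqual(normalize_category("  Pothole "), "pothole")

    def test_leaves_clean_value_unchanged(self) -> None:
        self.assertEqual(normalize_category("graffiti"), "graffiti")
```

Saved as a file under /tests, a case like this is discovered by unittest's standard test discovery and can also be run on its own with `python -m unittest`.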
