GeoBC Foundational Information and Technology (FIT) Section tool for downloading open data and reporting on changes since last download.
- Based on sources and schedule defined in a provided config file, download spatial data from the internet
- Compare downloaded data to cached version on object storage
- If changes are detected, write the latest download to object storage along with a change report
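At its simplest, the download/compare/report loop amounts to comparing the fresh download against a cached digest. The sketch below is an illustration only (function names are hypothetical, and the real tool compares data at the feature level per its config, not whole files):

```python
import hashlib


def file_digest(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file's contents."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()


def has_changed(new_path, cached_digest):
    """Compare a fresh download against the digest of the cached copy."""
    return file_digest(new_path) != cached_digest
```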
Using pip managed by the target Python environment:
git clone git@github.com:bcgov/FIT_opendatadownloader.git
cd FIT_opendatadownloader
pip install .
A command line interface is provided:
$ fit_downloader process --help
Usage: fit_downloader process [OPTIONS] CONFIG_FILE
For each configured layer - download latest, detect changes, write to file
Options:
-l, --layer TEXT Layer to process in provided config.
-p, --prefix S3 prefix.
-f, --force Force download to out-path without running
change detection.
-s, --schedule [D|W|M|Q|A] Process only sources with given schedule tag.
-V, --validate Validate configuration
-v, --verbose Increase verbosity.
-q, --quiet Decrease verbosity.
--help Show this message and exit.
Examples:

- Validate a configuration file for a given source:

      $ fit_downloader process -vV example_config.json

- Process data defined in the `sources/CAPRD/victoria.json` configuration file, saving to `s3://$BUCKET/Change_Detection/CAPRD/victoria`:

      $ fit_downloader process -v \
          --prefix s3://$BUCKET/Change_Detection/CAPRD/victoria \
          sources/CAPRD/victoria.json
Output data will look something like this:
$ aws s3 ls s3://$BUCKET/Change_Detection/CAPRD/victoria --human --recursive
2025-02-11 14:41:16 4.2 KiB Change_Detection/CAPRD/victoria/fit_downloader.log
2025-01-13 14:48:38 148.6 KiB Change_Detection/CAPRD/victoria/parks.gdb.zip
2025-01-13 14:48:41 236.7 KiB Change_Detection/CAPRD/victoria/roads.gdb.zip
Layers to download are configured per jurisdiction in `sources`. Each config .json file has several tags defining how to handle data for the given jurisdiction:
| tag | required | description |
|---|---|---|
| `out_layer` | Y | Name of target file/layer (`parks`, `roads`, etc) |
| `source` | Y | URL or file path to file-based source. For `http` protocol sources, data must be in a format readable by GDAL/OGR |
| `protocol` | Y | Type of download (`http` - file via http/curl, `esri` - ESRI REST API endpoint, `bcgw` - download BCGW table via WFS/`bcdata`) |
| `fields` | Y | List of source field(s) to retain in the download |
| `schedule` | Y | Download frequency (must be one of `[D, W, M, Q, A]` - daily/weekly/monthly/quarterly/annual) |
| `source_layer` | N | Name of layer to use within source (defaults to first layer in file) |
| `query` | N | Query to subset data in source/layer (OGR SQL; currently only supported for sources where protocol is `http`) |
| `primary_key` | N | List of source field(s) used as primary key (must be a subset of `fields`) |
| `hash_fields` | N | List of additional source field(s) to add to a synthetic geometry-hash based primary key (must be a subset of `fields`) |
| `metadata_url` | N | Link to source metadata |
For the full schema definition, see `source.schema.json`.
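For illustration, a layer definition using the tags above might look like the following (the URL, field names, and exact file structure are invented for this example; consult `source.schema.json` for the authoritative schema):

```json
[
  {
    "out_layer": "parks",
    "source": "/vsicurl/https://example.com/opendata/parks.geojson",
    "protocol": "http",
    "fields": ["PARK_NAME", "PARK_TYPE"],
    "primary_key": ["PARK_NAME"],
    "schedule": "M",
    "metadata_url": "https://example.com/opendata/parks-metadata"
  }
]
```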
To add data sources:

- Create (or edit) a config file with location/name corresponding to the admin area. For example: `/sources/CAPRD/central_saanich.json`

- If adding config files, consider validating the file names. A simple validation script is provided to check that file names correspond to values in `sources/valid_sources.csv`. To use the script:

      $ cd sources
      $ python validate_source_filenames.py .
      Names of all 15 json files in . are valid

- Add sources to the config as needed. As a guide, see other files present in `/sources` and the configuration notes above. Note that only two `out_layer` values are supported at this time: `parks` and `roads`.
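A minimal version of such a filename check might look like the sketch below (this is hypothetical, not the contents of `validate_source_filenames.py`; it assumes the first CSV column holds the valid names):

```python
import csv
from pathlib import Path


def invalid_config_names(config_dir, valid_csv):
    """Return .json file stems in config_dir not listed in the valid-names CSV."""
    with open(valid_csv, newline="") as f:
        valid = {row[0].strip() for row in csv.reader(f) if row}
    return sorted(
        p.stem for p in Path(config_dir).glob("*.json") if p.stem not in valid
    )
```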
As noted above, the `source` tag in the config file is the URL or file path. For sources of protocol `http`, the data must be stored in a format readable by GDAL/OGR.
Steps to determine this will vary by data source, but the general sequence is:
- navigate to the data source's public web page and find the open data page/portal/etc (eg https://opendata.victoria.ca/)
- find the best link to the data of interest, where the general preference (in descending order) is:
- direct links to data files (eg https://www.nanaimo.ca/GISFiles/shp/Parks.zip)
- ArcGIS REST API endpoints (eg https://maps.victoria.ca/server/rest/services/OpenData/OpenData_Transportation/MapServer/25)
- links that auto re-direct to data files (eg https://governmentofbc.maps.arcgis.com/sharing/rest/content/items/4bba119c2e9042d683cc9378fb1e836e/data)
- generally, any format that is readable by OGR is acceptable, but (with all else being equal) the order of preference would be:
- GDB/GPKG
- geojson
- shp
- while the script may handle `source` urls without modification, prefixing sources of protocol `http` with `/vsicurl/` (or `/vsizip//vsicurl/` if the data is zipped) will generally be more reliable
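For instance, a zipped shapefile served over http could be referenced in a config like this (the URL is hypothetical):

```json
"source": "/vsizip//vsicurl/https://example.com/shp/Parks.zip"
```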
To test/debug sources of protocol `http`, use `ogrinfo` in read-only mode, with debugging on and curl output set to verbose:
ogrinfo -ro \
/vsizip//vsicurl/https://opendata.chilliwack.com/shp/Parks_SHP.zip \
--debug ON \
--config CPL_CURL_VERBOSE=TRUE
The resulting output is very verbose. If a given source cannot be read by `ogrinfo`, look through the output for things like:
- any network errors reported
- redirects from endpoints to static files (if this is the case, replace the endpoint url with direct file url)
If problems continue, try downloading the file with a web browser or curl
and reading the result.
In some cases, the name of the zipfile downloaded does not correspond with the .gdb within, or zipfiles may be nested.
For example, Langley (City) packages an arbitrary .gdb into a file called `transport.gdb.zip`, which can be handled like this:
"source": "/vsizip/{/vsicurl/https://governmentofbc.maps.arcgis.com/sharing/rest/content/items/4bba119c2e9042d683cc9378fb1e836e/data}/CoL_TransportationNetwork September 25 2024.gdb"
- see the GDAL `/vsizip/` virtual file system documentation for how to handle zipfile complications.
When debugging a connection to a quirky server/file combination, consult the ogr2ogr/GDAL references.
Using GDAL on your system:
$ git clone git@github.com:bcgov/FIT_opendatadownloader.git
$ cd FIT_opendatadownloader
$ python -m venv .venv
$ source .venv/bin/activate
$ pip install -e .[test]
(.venv) $ py.test
Using GDAL on a docker image:
To build:
$ git clone git@github.com:bcgov/FIT_opendatadownloader.git
$ cd FIT_opendatadownloader
$ docker build -t fit_opendatadownloader .
Drop in to a bash session:
$ docker run --rm -it -v ./:/home/fit_opendatadownloader fit_opendatadownloader bash