KIRK Project Plan (v0.0.2)
This document briefly describes the replication systems in use at DataBC and, in doing so, identifies the path that led us to the development of KIRK (Keeping Information Replicated Kontinuously).
The long term high level objectives of KIRK are:
- To be a self-serve replication tool that gives clients complete control over their replications (configuration, scheduling, status retrieval, logs, notifications, etc.).
- To provide an API with replication metadata that can be integrated with the BC Data Catalog, allowing replication information to be shared in the catalog.
Version 1 of KIRK was implemented in November 2018. A significant amount of development work is still needed for KIRK to meet its objectives. This document aims to identify the current status of KIRK and briefly describe the work that needs to be completed for it to achieve those objectives.
KIRK draws on many other technologies and in its current implementation is built on top of FME Server. The design objective for KIRK is to serve as a replication abstraction layer that allows us to swap new and emerging technologies in and out more efficiently.
KIRK is the latest approach to modernizing DataBC's replication system. KIRK inherits and builds upon previous efforts to improve DataBC's replication systems, such as the DataBC FME Framework. This section provides some background on the different replication systems in use today at DataBC. The most detail is provided on FME Server, as this is the technology that KIRK is currently built on and the tool that we use for spatial data replications.
DataBC maintains a data warehouse which is currently referred to as the BC Geographic Warehouse (BCGW). The purpose of the warehouse is to provide a central location where people can easily consume government information. All of the information in the warehouse is copied / replicated from other systems.
While DataBC has used many different systems over the years to replicate data to the BCGW, the focus of this section will be on systems that are currently in use. Systems currently in use are:
- FME Server
- Oracle Materialized Views
- SDR
Oracle materialized views are primarily used for tabular data replications between two Oracle databases.
SDR is a custom solution that has been used to replicate spatial data in an incremental manner from various IIT databases to the BCGW. SDR relies on update timestamps that are built into the data models that it replicates from.
FME Server is our primary replication tool, especially for spatial data. FME Server offers the most flexibility as it can replicate from almost any data source and perform many complex data transformations along the way. FME Server has been our attempt to use a COTS solution for replication rather than the previous bespoke solutions.
Experience using FME Server has identified the following problems:
- Replication metadata is defined inside individual replication scripts. Extracting information from these scripts is awkward and frequently requires reverse engineering the .FMW file format.
- Related to the previous issue, building the relationship between source / destination and replication script from any single piece of information is difficult.
- Testing replications against different environments (DLV/TST/PRD) requires editing the connection parameters of replication scripts. Human error has frequently resulted in DLV or TST scripts being implemented instead of PRD versions.
- Related to the previous point, FME's power and flexibility are also its downfall when it is used as an enterprise-scale replication tool. There are so many configuration options that can easily be set incorrectly, resulting in replication failures or, worse still, replications that succeed from the standpoint of the software but do not accomplish what we intended.
- Replication scripts require the embedding / duplication of database credentials, which makes updating credentials cumbersome.
- There is no API that can be used to update FMW scripts. Our current inventory of 400+ scripts makes updating cumbersome. This means that implementing things like incremental updates requires editing individual scripts.
The DataBC FME Framework was created to address many of the technical shortcomings of the FME Server product from DataBC's business requirements perspective. The framework is made up of the following components:
- Standardized Parameterization of Scripts - A standard way of defining source / destination in FME scripts, making it possible to build the relationship between source / destination / FME script.
- Parameterization of Destination Environment - Allows us to use the same script to replicate to DLV / TST / PRD environments. Also allows us to migrate the entire replication system to new databases without having to edit individual scripts.
- Automated Credential Retrieval - Framework scripts allow us to remove embedded authentication credentials from replication scripts. The framework retrieves credentials from PMP using an API key at run time.
- Easy Injection of Startup / Shutdown Processes - We can inject processes into the Framework scripts at the start and conclusion of each replication script. Currently we have processes to provide notifications on failures, report on replication status, run database analyze after data loads, etc. Ideas that are not yet implemented but that the framework would enable include adding data model validation at startup and data validation at shutdown.
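The environment parameterization and run-time credential retrieval described above can be sketched roughly as follows. This is a minimal illustration, not the framework's actual code: the host names, service names, `Connection` class, and `fake_credential_lookup` function are all hypothetical stand-ins, and the real framework retrieves credentials from PMP rather than from a local function.

```python
from dataclasses import dataclass

# Hypothetical connection targets per environment (assumption: the real
# hosts and service names are not given in this document).
DESTINATIONS = {
    "DLV": {"host": "dlv-db.example", "service": "bcgw_dlv"},
    "TST": {"host": "tst-db.example", "service": "bcgw_tst"},
    "PRD": {"host": "prd-db.example", "service": "bcgw_prd"},
}


@dataclass(frozen=True)
class Connection:
    host: str
    service: str
    schema: str
    password: str


def fake_credential_lookup(schema, env):
    """Stand-in for a run-time request to a credential store (assumption)."""
    return f"secret-for-{schema}-{env}"


def resolve_destination(schema, env, credential_lookup=fake_credential_lookup):
    """Build destination connection details from a single environment key."""
    env = env.upper()
    if env not in DESTINATIONS:
        raise ValueError(f"unknown destination environment: {env}")
    target = DESTINATIONS[env]
    return Connection(
        host=target["host"],
        service=target["service"],
        schema=schema,
        password=credential_lookup(schema, env),
    )
```

Because the script only ever receives an environment key, pointing a replication at TST instead of PRD is a parameter change and no connection details or passwords need to live in the script itself.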
Framework standardization also makes it much easier to test how proposed architectural changes will impact our replication systems. With the framework we can easily duplicate our replication system on a test server and run all replications against either DLV or TST to identify problems with proposed architectural changes.
While the DataBC FME Framework addresses a lot of the shortcomings of a vanilla FME implementation, it still does not address the problem of bulk replication script updates. Currently we try to maintain a one-to-one-to-one relationship between source / destination / FME replication script. This means that the following things require individual script editing:
- Updating deprecated readers / writers
- Adding features like incremental update
- Implementation of failed features handling
- Tracking of more detailed replication metadata: when replication was attempted, what data changed, etc.
KIRK addresses many of these shortcomings. Many of the ideas behind KIRK were originally implemented in a proof of concept project called Data Driven FME, or DDF. In a nutshell, the idea is to externalize all the replication configuration information and then consume that information in a single replication script. With this approach, updating a deprecated reader means updating a single script instead of 450.
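The data-driven idea can be illustrated with a small sketch in which the job configuration lives outside the replication logic. Everything here is hypothetical: the `READERS` registry, the `read_fgdb` placeholder, and the job dictionary keys are illustrative names, and the actual KIRK script is an FME workspace rather than Python.

```python
from typing import Callable, Dict, Iterable, List

Row = Dict[str, object]


def read_fgdb(path: str) -> Iterable[Row]:
    """Placeholder reader; a real one would open the file geodatabase."""
    return [{"NAME": "a", "AREA": 1.0}, {"NAME": "b", "AREA": 2.0}]


# One reader per source format. Replacing a deprecated reader means changing
# this single registry entry, not hundreds of scripts.
READERS: Dict[str, Callable[[str], Iterable[Row]]] = {"FGDB": read_fgdb}


def replicate(job: dict, write: Callable[[str, List[Row]], None]) -> int:
    """Run one replication driven entirely by externally stored configuration."""
    reader = READERS[job["source_format"]]
    field_map = job["field_map"]
    rows = [
        {dest: row[src] for src, dest in field_map.items()}
        for row in reader(job["source"])
    ]
    write(job["destination"], rows)
    return len(rows)
```

The same `replicate` function serves every job; only the configuration passed in differs.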
The current implementation of KIRK is made up of the following components:
A REST API created using the Django REST framework. The code is on GitHub: https://github.com/bcgov/kirk
This is a simple API where the current replication metadata is stored. Currently KIRK tracks:
- Source
- Destination
- Field mapping
- Counter transformer configuration
The API also provides a place where further replication information can be stored, including things like replication event / job statistics, how much data changed and when, etc. Other ideas that have been explored include the ability for people to subscribe to data updates on specific data sets.
The KIRK replication script is configured to run with a single input, the KIRK job id. Upon receiving the job id parameter, the replication script communicates with the API and retrieves the following information for that job id:
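As a rough picture of the metadata listed above, the following sketch models one job as plain Python dataclasses and serializes it to JSON. The class and field names are assumptions made for illustration; they are not taken from the KIRK code base, where the data is held in Django models.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class FieldMap:
    source_column: str
    dest_column: str


@dataclass
class CounterTransformer:
    # Hypothetical settings for a counter that numbers features as they load.
    counter_name: str
    output_attribute: str
    start_value: int = 1


@dataclass
class Job:
    job_id: int
    source: str
    destination: str
    field_maps: List[FieldMap] = field(default_factory=list)
    counters: List[CounterTransformer] = field(default_factory=list)


def job_to_json(job: Job) -> str:
    """Serialize a job and its nested configuration for transport."""
    return json.dumps(asdict(job), sort_keys=True)
```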
- Source dataset
- Destination dataset
- Authentication credentials
- Field mapping
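The lookup by job id might look something like the sketch below. The base URL, endpoint paths, and JSON keys are hypothetical and should not be read as the real KIRK API; credential retrieval is left out of the sketch. The HTTP call is passed in as a parameter so the logic can be exercised without a running server.

```python
import json
from urllib.request import urlopen

# Hypothetical API root; the real host and routes are not given here.
API_ROOT = "https://kirk.example/api/v1"


def http_get_json(url):
    """Fetch a URL and decode the JSON body."""
    with urlopen(url) as response:
        return json.load(response)


def load_job_config(job_id, get_json=http_get_json):
    """Collect everything the replication script needs for one job id."""
    job = get_json(f"{API_ROOT}/jobs/{job_id}/")
    field_maps = get_json(f"{API_ROOT}/jobs/{job_id}/fieldmaps/")
    return {
        "source": job["source"],
        "destination": job["destination"],
        "field_map": {
            fm["source_column"]: fm["dest_column"] for fm in field_maps
        },
    }
```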
Currently KIRK schedules are located on FME Server. In other words, FME Server triggers the replication events for KIRK-related jobs. FME Server schedules different job ids at different times, re-using the same replication script.
The following diagram attempts to communicate our current implementation:
The motivation for implementing KIRK was the ESRI SDE upgrade project. In order to support the SDE upgrade project we needed to update approximately half of our 400+ replication scripts. By implementing KIRK we were able to transition approximately 150 jobs from standalone scripts to KIRK.
In order to meet timelines, technical debt was incurred. The following are the immediate development priorities for KIRK, in approximate order of priority:
- Improve build / deployment pipelines
- Create UI to allow other users to view API data
- Enhance logging
- Simplify backup / restore procedures
- Create users / groups to control access to different aspects of KIRK
The following section breaks up the development objectives into categories. It's likely that we will develop select features from various categories moving forward. The categories are only there to help organize the different features that are planned for KIRK.
At the time of writing there is no frontend to the REST API. This bundle of tasks describes desired features that would be part of the UI. The order is a rough approximation of priority.
This is both a backend and a frontend task. We have attempted to add Keycloak authentication to the built-in browsable API that comes with the Django REST framework, but this proved too difficult as a short-term stopgap measure. The current thinking is that when we create the first version of the API we will also introduce Keycloak authentication. Because I don't have direct access to the Keycloak configuration, it will likely be easier to handle authorization in the app than to try to configure it in Keycloak.
Ability to view configured jobs. A smart table view with column filtering would be adequate. Something like this: https://xaksis.github.io/vue-good-table/
Or this: https://tochoromero.github.io/vuejs-smart-table/filtering/#filters
Initially we would like DataBC staff to be able to edit jobs, but later on would like to be able to allow other users to view the jobs they are responsible for and edit them themselves. Edit functionality would include:
- Change source
- Change destination
- Change field mapping
- Change counter configuration
Ability to define new jobs, including:
- Source dataset
- Destination dataset
- Field maps
Would be great to have a widget to handle fieldmapping. Thinking that when a new source or destination dataset is defined the UI would get populated with the data model for that dataset, with the end goal of making it much easier to map the source / destination fields.
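One way such a widget could pre-populate its suggestions is to pair source and destination columns whose names match once case is ignored, leaving the rest for the user to map by hand. This is only a sketch of that one heuristic; the function name and the matching rule are assumptions, not a planned design.

```python
def suggest_field_map(source_fields, dest_fields):
    """Pair columns with the same name (ignoring case); report the leftovers."""
    dest_by_key = {name.lower(): name for name in dest_fields}
    mapping = {}
    unmatched = []
    for name in source_fields:
        match = dest_by_key.get(name.lower())
        if match is not None:
            mapping[name] = match
        else:
            unmatched.append(name)
    return mapping, unmatched
```

The unmatched source columns are the ones the UI would need to surface for manual mapping.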
Add the ability to run a job from the KIRK UI. This should be pretty easy to implement technically as it's just a REST call to FME Server. Firewall / network configuration will be the biggest barrier.
Ability to define a schedule in the UI. This task should also identify where scheduling should reside in the long term. One option is to remove it from FME Server and implement scheduling within KIRK using one of the following options or something similar:
- Redis Queue (RQ): http://python-rq.org/
- Advanced Python Scheduler (APScheduler): https://pypi.org/project/APScheduler/
Give different user types the ability to run different jobs. For example, as a DA we should be able to run / manage any job in the system. An external user, however, should be able to view only their jobs, run only their jobs, and possibly run their jobs only against DLV or TST environments.
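The access rules in this paragraph can be written down as two small predicates. The `User` and `Job` classes and the `is_da` flag are hypothetical; the restriction of external users to DLV and TST is described above only as a possibility, so it is treated as an assumption here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class User:
    name: str
    is_da: bool = False  # hypothetical flag for DA users


@dataclass(frozen=True)
class Job:
    job_id: int
    owner: str


# Assumption: external users may only run against non-production environments.
EXTERNAL_RUN_ENVIRONMENTS = {"DLV", "TST"}


def can_view(user: User, job: Job) -> bool:
    """DA users see every job; everyone else sees only their own."""
    return user.is_da or job.owner == user.name


def can_run(user: User, job: Job, env: str) -> bool:
    """DA users run anything; owners run their jobs in permitted environments."""
    if user.is_da:
        return True
    return job.owner == user.name and env.upper() in EXTERNAL_RUN_ENVIRONMENTS
```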
The current version of the framework supports only ESRI file geodatabase sources. This task is about adding support for other popular source formats that we replicate from.
The current version of KIRK supports only FGDB sources. This task would see a quick analysis of the other simple replications in our library to identify the next most popular data source and add it to KIRK. Eventually we would like to see support for:
- Oracle SDE
- Oracle spatial
- Oracle non spatial
- CSV
- SQL Server
- Shape files
- Excel
This section describes changes to the actual replication process in some way.
This is a high priority as it benefits our replication system on a couple of different fronts. First, incremental refresh will allow our database to start performing like a database. Our current approach of dumping and replacing data prevents the database query optimizers from doing their job. Second, it will allow us to fill in some key metadata that clients frequently request. This includes information like:
- When did the data actually last change, as opposed to when was the last replication
- When was the last attempted replication
- How much of the data changed since the last replication
- Possibly what actual records were changed
This information is necessary to support some of the ideas that are being piloted in the Llama zoo proof of concept work.
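One possible way to derive these figures is to compare source and destination rows by key and content digest, as sketched below. This is an illustrative technique chosen for the example, not the planned KIRK design; it assumes every row carries a unique key column and that both sides fit in memory.

```python
import hashlib
import json


def row_digest(row):
    """Stable digest of a row's content, independent of column order."""
    payload = json.dumps(row, sort_keys=True, default=str)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def diff_rows(source_rows, dest_rows, key):
    """Report which keys would be inserted, updated, or deleted."""
    src = {row[key]: row_digest(row) for row in source_rows}
    dst = {row[key]: row_digest(row) for row in dest_rows}
    return {
        "inserted": sorted(src.keys() - dst.keys()),
        "updated": sorted(k for k in src.keys() & dst.keys() if src[k] != dst[k]),
        "deleted": sorted(dst.keys() - src.keys()),
    }
```

An empty result for all three lists would mean the replication ran but the data did not actually change, which is the distinction clients are asking about.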
The current implementation of KIRK inherits the file change logic that was used with the DataBC FME Framework. It stores the file change logs in .csv files. This task would modify the file change logic so that it stores the change log data in the API instead of a CSV.
Should consult with the catalog team when we implement this feature to see if there is an opportunity for a quick win in terms of adding this key piece of information to the catalog.
We can currently identify changes that have taken place on file sources by checking the last modified date on the file system. This task would add the ability to identify if database objects had changed. The incremental refresh process would only continue if the source object was deemed to have changed since the last replication.
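The file-based check described here amounts to comparing the file's last modified time with the time of the last replication. A minimal sketch, assuming the last replication time is available as a Unix timestamp (the function name and that storage format are assumptions):

```python
import os


def source_changed(path, last_replicated_epoch):
    """Return True when the file was modified after the last replication.

    A missing timestamp (None) is treated as "never replicated".
    """
    if last_replicated_epoch is None:
        return True
    return os.path.getmtime(path) > last_replicated_epoch
```

An equivalent check for database objects would need a different source for the modification time, which is what this task would have to work out.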
Currently all the actual work associated with a replication is done by FME. This is primarily because using FME was the fastest way to get to our first version of KIRK. KIRK only requires a very small subset of the functionality that exists in FME. Using the open source GDAL/OGR libraries would likely be more than adequate. The biggest barrier is determining whether it's possible to write to ESRI SDE using the GDAL/OGR libraries. Another approach to the ESRI / SDE problem would be to enable a two-step replication process where the data gets loaded to a File Geodatabase first, then ESRI feature validation is run on the data, and finally data that passes those tests proceeds through the pipeline and is written directly to Oracle Spatial.
FME is currently a black box system. Having little to no control over what FME is doing makes debugging replication scripts difficult. Determining the specific records that cause failures in a replication can be a very cumbersome process and uses up significant amounts of time. A GDAL / OGR solution would provide additional control over capturing errors, with the potential to significantly reduce the amount of time we spend hunting down records that cause replication failures.
All features in the data pipeline would go through a filter that verifies that they are within the bounding box defined by the province's boundaries.
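A bounding box filter of this kind is straightforward to sketch. The example below assumes features carry longitude / latitude coordinate pairs under a hypothetical `coords` key, and the box values are rough, illustrative figures for British Columbia rather than an authoritative extent; a real filter would use the coordinate system of the data being loaded.

```python
from typing import Dict, Iterable, List, Tuple

Point = Tuple[float, float]  # (longitude, latitude)

# Approximate, illustrative extent: (min_lon, min_lat, max_lon, max_lat).
BC_BBOX = (-139.1, 48.2, -114.0, 60.1)


def within_bbox(point: Point, bbox=BC_BBOX) -> bool:
    x, y = point
    min_x, min_y, max_x, max_y = bbox
    return min_x <= x <= max_x and min_y <= y <= max_y


def split_by_bbox(features: Iterable[Dict], bbox=BC_BBOX) -> Tuple[List[Dict], List[Dict]]:
    """Separate features lying fully inside the box from those that do not."""
    inside, outside = [], []
    for feature in features:
        if all(within_bbox(p, bbox) for p in feature["coords"]):
            inside.append(feature)
        else:
            outside.append(feature)
    return inside, outside
```

Returning the rejected features rather than silently dropping them keeps the records available for reporting.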
Currently all job statistics are recorded in FME. This task would give each job definition an associated events table that records all the replication events for that job. It could also serve as a way of sharing the status of a replication job externally.
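An events table along these lines could be as simple as the following sketch, which uses SQLite purely for illustration. The table name, columns, and status values are assumptions; in KIRK this would more likely be a Django model exposed through the API.

```python
import sqlite3
from datetime import datetime, timezone


def create_events_table(conn):
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS replication_event (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            job_id INTEGER NOT NULL,
            finished_at TEXT NOT NULL,
            status TEXT NOT NULL,
            rows_changed INTEGER
        )
        """
    )


def record_event(conn, job_id, status, rows_changed=None, finished_at=None):
    """Append one replication event for a job."""
    finished_at = finished_at or datetime.now(timezone.utc)
    conn.execute(
        "INSERT INTO replication_event (job_id, finished_at, status, rows_changed) "
        "VALUES (?, ?, ?, ?)",
        (job_id, finished_at.isoformat(), status, rows_changed),
    )


def latest_status(conn, job_id):
    """Status of the most recently recorded event for a job, or None."""
    row = conn.execute(
        "SELECT status FROM replication_event WHERE job_id = ? "
        "ORDER BY id DESC LIMIT 1",
        (job_id,),
    ).fetchone()
    return row[0] if row else None
```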
As discussed in the introduction, the intent of KIRK is to be a replication system abstraction layer. Managing the replication configuration data in our own API allows us to swap out replication technologies more easily. KIRK is to be the "garnish" or perhaps "duct tape" around other technologies. Our original thinking at DataBC was that FME Server would be a COTS all-in-one solution. As we've gained more experience with it, we have discovered that, like most COTS solutions, it requires significant customization to truly meet our objectives.
We have explored the following other replication technologies:
- Kettle: https://community.hitachivantara.com/docs/DOC-1009855
- Apache NiFi: https://nifi.apache.org/
- NiFi vs StreamSets: https://statsbot.co/blog/open-source-etl/
The following bullets summarize the main reasons for continuing to build KIRK on top of FME Server:
- We are embedded with FME Server. Transitioning to a new tool without introducing service interruptions would require significantly more work and effort.
- Staff and contractors are familiar with using FME and with building replications that we manage on their behalf.
- ESRI SDE! ESRI no longer supports the SDE SDKs. They only support writing to SDE through their own tools. FME is the best option currently available for writing to ESRI SDE.