-
Notifications
You must be signed in to change notification settings - Fork 3
Configuration
Recce is configured by adding datasources and datasets that you wish to reconcile.
You can manage configuration as multiple files; or one single file. They will be merged together at runtime.
As a Micronaut application, much of Recce's configuration is open for hacking and can be expressed in multiple ways.
Recce-specific configuration (as opposed to all possible generic Micronaut configuration) is documented within the default config.
To make it easier to configure Recce's own DB, some dedicated properties are respected to construct the appropriate URL including connection pooling.
Env value | Default | Description |
---|---|---|
DATABASE_HOST | localhost |
Host your Postgres DB is on |
DATABASE_PORT | 9000 |
Port your Postgres DB is on |
DATABASE_NAME | db |
The name of the logical database within Postgres |
DATABASE_USERNAME | user |
Username to connect with |
DATABASE_PASSWORD | password |
Password to connect with |
In the normal Micronaut way, additional configuring can be configured using
-
R2DBC_DATASOURCES_DEFAULT_*
environment variables, or -
r2dbc.datasource.default.*
system properties or - Merging in an additional YAML configuration file (see below)
As a Micronaut application, configuration can be externalised in many ways.
However, the recommended way to add additional configuration for your own datasources and datasets to reconcile is to
mount a volume with your configuration and set MICRONAUT_CONFIG_FILES
to point Recce at your additional configuration
which will be merged into the base configuration. This allows you to separate and manage configuration as appropriate
for your environment.
mkdir -p my-dataset-configs
touch my-dataset-configs/config1.yml my-dataset-configs/config2.yml
## Run latest from docker
docker run -p 8080:8080 \
-v $(pwd)/my-dataset-configs:/config \
-e MICRONAUT_CONFIG_FILES=/config/config1.yml,/config/config2.yml \
ghcr.io/thoughtworks-sea/recce-server:latest
The credentials used in basic authentication to protect the API endpoints can be configured in the auth
block of your config file.
auth:
username: some-username
password: some-password
Arbitrary #s of data sources can be configured in the r2dbc.datasources
block of your config file.
r2dbc:
datasources:
my-source-db: # Name your datasource anything you want, other than "default"
# R2DBC URL for your database r2dbc:pool:DB_TYPE://DB_HOST:DB_PORT/DB_NAME
# DB_TYPE supported = mysql|postgresql|mssql|mariadb|oracle
url: r2dbc:pool:mysql://source-db:3306/db
username: user
password: password
my-target-db:
url: r2dbc:pool:mysql://target-db:3306/db
username: user
password: password
For configuration specific to a given driver/DB you can consult their documentation. Usually additional settings needs to be in the options:
block; or sometimes inside the connection URL.
- MySQL
- Postgres
- MS SQL Server
- MariaDB (can also be used for MySQL, as well as AWS RDS Aurora MySQL)
- Oracle
By default, Recce is deployed with r2dbc-pool to manage connection pooling to data sources. If you remove :pool
from the URL, this will be disabled.
You can thus customise the connection pool size, etc
r2dbc:
datasources:
my-source-db:
url: r2dbc:pool:mysql://source-db:3306/db
username: user
password: password
options: # Connection Pool options
initialSize: 1
maxSize: 5
maxLifeTime: 5
# etc, see https://github.com/r2dbc/r2dbc-pool
Datasets are the heart of your configuration. This tells Recce what and how to reconcile logically equivalent chunks of data.
Datasets are groupings of data which
- point to source and target datasources configured above
- express queries against source and target that will should produce identical output if the datasets are to be considered reconciled
- can use normal SQL to express an equivalent data representation between source and target schemas which accounts for intended differences in the way data has been structured and migrated
- can be scheduled to run on regular intervals
reconciliation:
datasets:
pets: # Name your datasets however you would like
source:
# Reference to a datasource defined in `r2dbc.datasources`
datasourceRef: my-source-db
# Any SQL query to evaluate against the source DB
query: >
SELECT pet.id AS MigrationKey, category, name, status
FROM pet
target:
# Reference to a datasource defined in `r2dbc.datasources`
datasourceRef: my-target-db
# Any SQL query to evaluate against the source DB
query: >
SELECT pet.id AS MigrationKey, category.name AS category, pet.name, status
FROM pet INNER JOIN category ON pet.category_id = category.id
# Optional scheduling of regular or one-of reconciliations
schedule:
# Must adhere to format https://docs.micronaut.io/latest/api/io/micronaut/scheduling/cron/CronExpression.html
# or https://crontab.guru/ (without seconds)
cronExpression: 0 0 * * *
Fuller example Recce-specific configuration is available here.
The general philosophy of Recce is that differences between source and target are best handled by the developer in SQL.
Recce needs to know which column represents a unique identifier for the row that should be consistent between source
and target
and implies these rows represent the same entity.
To do this, designate a column by naming it as MigrationKey
(case insensitive)
SELECT natural_id AS MigrationKey, some, other, columns
FROM my_table
Recce will complain if there is more than one column in your dataset with this name.
Currently Recce ignores names of columns other than the MigrationKey
column. That means that the order of columns is critical and must match between your two queries.
If the column in position 3 represents datum X
in the source
dataset, then the column in position 3 in the target
dataset should also represent the same datum.
The data types of the columns need not match exactly; however if the values produce different hashes, this will lead to a row mismatch. For example an integer of 111
in source will produce a different hash to a string of "111"
in target. If you want these values to be considered identical, you should use SQL to coerce the data types to match and express this intended difference.
If you would like to see a nameBased
column matching option, consider adding your thoughts to #55.
You should look to handle differences in types and semantics using the SQL expressions and functions available on the relevant database platform.
For example, if your source database represented an enumeration using integers (1, 2, 3
etc) whereas your target represented them using VARCHAR
s, you would use a CASE
statement to express this expected difference
-- source
SELECT id AS MigrationKey, CASE WHEN enumerated = 1 THEN 'Completed' ELSE 'PENDING' END
FROM my_table
-- target
SELECT id AS MigrationKey, enumerated_text
FROM my_table
When reconciling very large data sets, it may not be feasible to do row-by-row, column-by-column comparisons. In such cases it may be sufficient to use aggregate queries to get an idea whether you have lost data. e.g
--- check #s of login audit records for a month of data by location and type
SELECT user_location || user_type AS MigrationKey, count(*) AS Count
FROM user_login_audit
WHERE user_login_datetime >= timestamp('2021-11-01 00:00:00') AND
user_login_datetime < timestamp('2021-12-01 00:00:00')
GROUP BY user_location, user_type