yaml diff Array of Hashes Options
- Introduction
- By Position
- Deep Comparison by Position
- By Value
- By Key
- Deep Comparison by Key
- Configuration File Options
This document is part of the body of knowledge about yaml-diff, one of the reference command-line tools provided by the YAML Path project.
The yaml-diff command-line tool enables users to control how Arrays-of-Hashes (AoH) are compared. This is different from comparing regular Arrays, discussed elsewhere. The position mode is used by default. It and the other modes are explored in the following sections. These include:
- position (the default): each record is treated as a whole unit and any differences between the LHS and RHS elements are reported as changes, no matter how nested the change may be.
- dpos: compares each element by position, except they are deeply compared, one node at a time. Any differences are reported at each node's own YAML Path instead of at the AoH element path.
- value: treats each element as a whole unit, synchronizing the two Arrays being compared by equal values before they are checked for differences. The entire Hash must be perfectly identical to be matched up.
- key: treats each element as a record with an identity key, arranging the two Arrays being compared by matching only these key fields. However, the entire record is treated as a whole unit, so any differences -- no matter how deeply nested in any record -- cause a change of the whole element to be reported.
- deep: treats each element as a record with an identity key, arranging the two Arrays being compared by matching only these key fields before recursively comparing the LHS and RHS records, one node at a time. Any differences are reported at each node's own YAML Path instead of at the AoH element path.
When comparing AoH elements by position, the Hashes are treated as whole units. Differences in their child nodes are detected but not individually reported. Rather, each differing Hash is reported in its entirety, in JSON format.
This is the default comparison mode. It is not ideal for every use-case, so several other modes are available.
For an example, consider these two documents and their position-based differences:
File: LHS1.yaml
---
products:
  - product: doodad
    availability:
      start:
        date: 2020-10-10
        time: 08:00
      stop:
        date: 2020-10-29
        time: 17:00
    dimensions:
      width: 5
      height: 5
      depth: 5
      weight: 10
  - product: dumdow
    availability:
      start:
        date: 2020-10-23
        time: 08:00
      stop:
        date: 2020-11-23
        time: 17:00
    dimensions:
      width: 3
      height: 3
      depth: 3
      weight: 27
  - product: doohickey
    availability:
      start:
        date: 2020-08-01
        time: 10:00
      stop:
        date: 2020-09-25
        time: 10:00
    dimensions:
      width: 1
      height: 2
      depth: 3
      weight: 4
  - product: widget
    availability:
      start:
        date: 2020-01-01
        time: 12:00
      stop:
        date: 2020-01-01
        time: 16:00
    dimensions:
      width: 9
      height: 10
      depth: 1
      weight: 4
File: RHS1.yaml
---
products:
  - product: doodad
    availability:
      start:
        date: 2020-10-10
        time: 08:00
      stop:
        date: 2020-10-29
        time: 17:00
    dimensions:
      width: 5
      height: 5
      depth: 5
      weight: 10
  - product: dumdow
    availability:
      start:
        date: 2020-10-23
        time: 08:00
      stop:
        date: 2020-11-23
        time: 17:00
    dimensions:
      width: 4
      height: 3
      depth: 3
      weight: 27
  - product: widget
    availability:
      start:
        date: 2020-01-01
        time: 12:00
      stop:
        date: 2020-01-01
        time: 16:00
    dimensions:
      width: 9
      height: 10
      depth: 1
      weight: 4
  - product: doohickey
    availability:
      start:
        date: 2020-08-01
        time: 10:00
      stop:
        date: 2020-09-25
        time: 10:00
    dimensions:
      width: 1
      height: 2
      depth: 3
      weight: 4
At a glance, we might spot a couple of differences, or think there are more differences than there really are, depending on how these documents are compared. When we instruct yaml-diff to compare these by position, it produces this report:
c products[1]
< {"product": "dumdow", "availability": {"start": {"date": "2020-10-23", "time": "08:00"}, "stop": {"date": "2020-11-23", "time": "17:00"}}, "dimensions": {"width": 3, "height": 3, "depth": 3, "weight": 27}}
---
> {"product": "dumdow", "availability": {"start": {"date": "2020-10-23", "time": "08:00"}, "stop": {"date": "2020-11-23", "time": "17:00"}}, "dimensions": {"width": 4, "height": 3, "depth": 3, "weight": 27}}
c products[2]
< {"product": "doohickey", "availability": {"start": {"date": "2020-08-01", "time": "10:00"}, "stop": {"date": "2020-09-25", "time": "10:00"}}, "dimensions": {"width": 1, "height": 2, "depth": 3, "weight": 4}}
---
> {"product": "widget", "availability": {"start": {"date": "2020-01-01", "time": "12:00"}, "stop": {"date": "2020-01-01", "time": "16:00"}}, "dimensions": {"width": 9, "height": 10, "depth": 1, "weight": 4}}
c products[3]
< {"product": "widget", "availability": {"start": {"date": "2020-01-01", "time": "12:00"}, "stop": {"date": "2020-01-01", "time": "16:00"}}, "dimensions": {"width": 9, "height": 10, "depth": 1, "weight": 4}}
---
> {"product": "doohickey", "availability": {"start": {"date": "2020-08-01", "time": "10:00"}, "stop": {"date": "2020-09-25", "time": "10:00"}}, "dimensions": {"width": 1, "height": 2, "depth": 3, "weight": 4}}
This report has low granularity. It displays entire Hashes when there are differences, no matter how small or large. There is no change to the first Hash -- at index 0 -- between the two documents, so it doesn't appear in the report at all. The next record -- at index 1 -- has a very small change to its width property. Because this is a position-based report, the entirety of both the LHS and RHS Hashes is reported as having been changed. The next two elements are arguably identical between the two documents other than their ordinal positions within their respective documents. They are reported as entire changes because a comparison by position is concerned only with whether the LHS and RHS elements have any differences whatsoever at the same Array index.
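The idea behind a position-based comparison can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not yaml-diff's actual implementation: the function name diff_by_position and the tuple layout of each report entry are hypothetical, while the a/d/c markers mirror the ones used in the reports on this page.

```python
import json
from itertools import zip_longest

_MISSING = object()  # sentinel meaning "no element at this index"

def diff_by_position(lhs, rhs, path="products"):
    """Compare two Arrays index by index, reporting whole elements as JSON."""
    report = []
    pairs = zip_longest(lhs, rhs, fillvalue=_MISSING)
    for index, (left, right) in enumerate(pairs):
        where = f"{path}[{index}]"
        if left is _MISSING:
            report.append(("a", where, None, json.dumps(right)))
        elif right is _MISSING:
            report.append(("d", where, json.dumps(left), None))
        elif left != right:
            # Any nested difference marks the whole element as changed.
            report.append(("c", where, json.dumps(left), json.dumps(right)))
    return report

# Hypothetical, trimmed-down records for illustration only.
lhs = [{"product": "doodad", "width": 5}, {"product": "dumdow", "width": 3}]
rhs = [{"product": "doodad", "width": 5}, {"product": "dumdow", "width": 4}]
changes = diff_by_position(lhs, rhs)
```

The unchanged element at index 0 produces no entry, while the one-field change at index 1 yields a single "c" entry carrying both complete Hashes.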
If you want a high-granularity version of the same report from By Position, use dpos (an abbreviation for "deep position" comparison). Be warned: high-granularity reports can be quite long! Doing so with the same LHS1.yaml and RHS1.yaml documents produces this very detailed report:
c products[1].dimensions.width
< 3
---
> 4
c products[2].product
< doohickey
---
> widget
c products[2].availability.start.date
< 2020-08-01
---
> 2020-01-01
c products[2].availability.start.time
< 10:00
---
> 12:00
c products[2].availability.stop.date
< 2020-09-25
---
> 2020-01-01
c products[2].availability.stop.time
< 10:00
---
> 16:00
c products[2].dimensions.width
< 1
---
> 9
c products[2].dimensions.height
< 2
---
> 10
c products[2].dimensions.depth
< 3
---
> 1
c products[3].product
< widget
---
> doohickey
c products[3].availability.start.date
< 2020-01-01
---
> 2020-08-01
c products[3].availability.start.time
< 12:00
---
> 10:00
c products[3].availability.stop.date
< 2020-01-01
---
> 2020-09-25
c products[3].availability.stop.time
< 16:00
---
> 10:00
c products[3].dimensions.width
< 9
---
> 1
c products[3].dimensions.height
< 10
---
> 2
c products[3].dimensions.depth
< 1
---
> 3
Identified by their YAML Paths, every single leaf-node-level difference between both documents is reported. We can see the anticipated -- very small -- change to the "dumdow" element: its "width" changed from 3 to 4. The remainder of the report is really just overly accurate noise with this particular data. Other comparison modes -- Deep Comparison by Key, in particular -- would be far more useful for this use-case.
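A deep, position-based comparison amounts to a recursive walk that reports differences at the lowest node where they occur. The following Python sketch illustrates that idea only; deep_diff and its tuple layout are hypothetical names chosen for this example and are not part of yaml-diff.

```python
def deep_diff(lhs, rhs, path):
    """Recursively compare two nodes, yielding (marker, yaml_path, old, new)."""
    if isinstance(lhs, dict) and isinstance(rhs, dict):
        for key in lhs:
            child = f"{path}.{key}"
            if key in rhs:
                yield from deep_diff(lhs[key], rhs[key], child)
            else:
                yield ("d", child, lhs[key], None)
        for key in rhs:
            if key not in lhs:
                yield ("a", f"{path}.{key}", None, rhs[key])
    elif isinstance(lhs, list) and isinstance(rhs, list):
        # Arrays are paired strictly by index, as in the position mode.
        for index in range(max(len(lhs), len(rhs))):
            child = f"{path}[{index}]"
            if index >= len(lhs):
                yield ("a", child, None, rhs[index])
            elif index >= len(rhs):
                yield ("d", child, lhs[index], None)
            else:
                yield from deep_diff(lhs[index], rhs[index], child)
    elif lhs != rhs:
        yield ("c", path, lhs, rhs)

# Hypothetical, trimmed-down records for illustration only.
lhs = [{"product": "dumdow", "dimensions": {"width": 3, "height": 3}}]
rhs = [{"product": "dumdow", "dimensions": {"width": 4, "height": 3}}]
leaf_changes = list(deep_diff(lhs, rhs, "products"))
```

Here the only entry points at products[0].dimensions.width, rather than at the whole products[0] element.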
If we compare the same two documents from By Position using the value mode, we get a much smaller report:
c products[1]
< {"product": "dumdow", "availability": {"start": {"date": "2020-10-23", "time": "08:00"}, "stop": {"date": "2020-11-23", "time": "17:00"}}, "dimensions": {"width": 3, "height": 3, "depth": 3, "weight": 27}}
---
> {"product": "dumdow", "availability": {"start": {"date": "2020-10-23", "time": "08:00"}, "stop": {"date": "2020-11-23", "time": "17:00"}}, "dimensions": {"width": 4, "height": 3, "depth": 3, "weight": 27}}
In this mode, the records for "widget" and "doohickey" were identified as being identical, so they were omitted from the report. However, like the position mode, this value mode still has low granularity. We can see that there was a change to the "dumdow" record but it may be difficult to see precisely what that change is with this mode.
Note that value is not tricked by any reordering of the child nodes within any of the elements. As long as the nodes are identical, their order is irrelevant.
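Matching by value can be pictured as pairing off identical elements wherever they sit in either Array and keeping only the leftovers. The Python sketch below is an illustration of that idea and not yaml-diff's code; the function name diff_by_value and the trimmed records are hypothetical. It relies on the fact that Python dict equality ignores key order, which parallels the behavior described above.

```python
def diff_by_value(lhs, rhs):
    """Pair up identical elements regardless of position; return the leftovers."""
    unmatched_rhs = list(rhs)
    unmatched_lhs = []
    for element in lhs:
        if element in unmatched_rhs:      # dict equality ignores key order
            unmatched_rhs.remove(element)
        else:
            unmatched_lhs.append(element)
    return unmatched_lhs, unmatched_rhs

# Hypothetical, trimmed-down records for illustration only.
lhs = [
    {"product": "doodad", "width": 5},
    {"product": "dumdow", "width": 3},
    {"product": "doohickey", "width": 1},
    {"product": "widget", "width": 9},
]
rhs = [
    {"product": "doodad", "width": 5},
    {"product": "dumdow", "width": 4},
    {"width": 9, "product": "widget"},    # reordered keys, moved position
    {"product": "doohickey", "width": 1},
]
only_lhs, only_rhs = diff_by_value(lhs, rhs)
```

Only the two "dumdow" variants remain unmatched; the swapped and reordered records cancel out.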
When the Array of Hash elements are records with identity keys and you want a low-granularity report of differences, use the key comparison mode. The value comparison mode can produce suboptimal results when there are material differences between the two record sets. This is particularly the case when there are differences in otherwise equivalent records which also happen to be at different ordinal positions in the two documents. Further, when the identity key attribute may not be the first attribute of the first record in a particular Array of Hashes, or you need special handling for certain records, use the yaml-diff Configuration File.
Consider these two variations of the above YAML data files:
File: LHS2.yaml
---
products:
  - product: doodad
    sku: 0-000-0001-0
    availability:
      start:
        date: 2020-10-10
        time: 08:00
      stop:
        date: 2020-10-29
        time: 17:00
    dimensions:
      width: 5
      height: 5
      depth: 5
      weight: 10
  - product: dumdow
    sku: 0-000-0002-0
    availability:
      start:
        date: 2020-10-23
        time: 08:00
      stop:
        date: 2020-11-23
        time: 17:00
    dimensions:
      width: 3
      height: 3
      depth: 3
      weight: 27
  - product: doohickey
    sku: 0-000-0003-0
    availability:
      start:
        date: 2020-08-01
        time: 10:00
      stop:
        date: 2020-09-25
        time: 10:00
    dimensions:
      width: 1
      height: 2
      depth: 3
      weight: 4
  - product: widget
    sku: 0-000-0004-0
    availability:
      start:
        date: 2020-01-01
        time: 12:00
      stop:
        date: 2020-01-01
        time: 16:00
    dimensions:
      width: 9
      height: 10
      depth: 1
      weight: 4
File: RHS2.yaml
---
products:
  - sku: 0-000-0001-0
    availability:
      start:
        date: 2020-10-10
        time: 08:00
      stop:
        date: 2020-10-29
        time: 17:00
    dimensions:
      width: 5
      height: 5
      depth: 5
      weight: 10
  - sku: 0-000-0002-0
    availability:
      start:
        date: 2020-10-23
        time: 08:00
      stop:
        date: 2020-11-23
        time: 17:00
    dimensions:
      width: 4
      height: 3
      depth: 3
      weight: 27
  - dimensions:
      width: 9
      height: 10
      depth: 1
      weight: 4
    sku: 0-000-0004-0
    availability:
      start:
        date: 2020-01-01
        time: 12:00
      stop:
        date: 2020-01-01
        time: 16:00
  - product: doohickey
    availability:
      stop:
        date: 2020-09-25
        time: 10:00
      start:
        date: 2020-08-01
        time: 10:00
    dimensions:
      weight: 4
      width: 1
      depth: 3
      height: 2
Note that all of the "product" fields were removed from the RHS document except for the "doohickey" record, which is missing its "sku" field.
Comparing these two documents with the value mode produces this report:
c products[0]
< {"product": "doodad", "sku": "0-000-0001-0", "availability": {"start": {"date": "2020-10-10", "time": "08:00"}, "stop": {"date": "2020-10-29", "time": "17:00"}}, "dimensions": {"width": 5, "height": 5, "depth": 5, "weight": 10}}
---
> {"sku": "0-000-0001-0", "availability": {"start": {"date": "2020-10-10", "time": "08:00"}, "stop": {"date": "2020-10-29", "time": "17:00"}}, "dimensions": {"width": 5, "height": 5, "depth": 5, "weight": 10}}
c products[1]
< {"product": "dumdow", "sku": "0-000-0002-0", "availability": {"start": {"date": "2020-10-23", "time": "08:00"}, "stop": {"date": "2020-11-23", "time": "17:00"}}, "dimensions": {"width": 3, "height": 3, "depth": 3, "weight": 27}}
---
> {"sku": "0-000-0002-0", "availability": {"start": {"date": "2020-10-23", "time": "08:00"}, "stop": {"date": "2020-11-23", "time": "17:00"}}, "dimensions": {"width": 4, "height": 3, "depth": 3, "weight": 27}}
c products[2]
< {"product": "doohickey", "sku": "0-000-0003-0", "availability": {"start": {"date": "2020-08-01", "time": "10:00"}, "stop": {"date": "2020-09-25", "time": "10:00"}}, "dimensions": {"width": 1, "height": 2, "depth": 3, "weight": 4}}
---
> {"dimensions": {"width": 9, "height": 10, "depth": 1, "weight": 4}, "sku": "0-000-0004-0", "availability": {"start": {"date": "2020-01-01", "time": "12:00"}, "stop": {"date": "2020-01-01", "time": "16:00"}}}
c products[3]
< {"product": "widget", "sku": "0-000-0004-0", "availability": {"start": {"date": "2020-01-01", "time": "12:00"}, "stop": {"date": "2020-01-01", "time": "16:00"}}, "dimensions": {"width": 9, "height": 10, "depth": 1, "weight": 4}}
---
> {"product": "doohickey", "availability": {"stop": {"date": "2020-09-25", "time": "10:00"}, "start": {"date": "2020-08-01", "time": "10:00"}}, "dimensions": {"weight": 4, "width": 1, "depth": 3, "height": 2}}
This report doesn't make a lot of sense. The changes to the earlier records were detected only because those records happen to occupy the same ordinal positions in both documents, and the "doohickey" and "widget" records could not be matched up at all. This is because the value mode falls back to position when records are not identical.
In this case, we need to use the key mode. Contrast the value report above with the report generated using the key mode:
c products[0]
< {"product": "doodad", "sku": "0-000-0001-0", "availability": {"start": {"date": "2020-10-10", "time": "08:00"}, "stop": {"date": "2020-10-29", "time": "17:00"}}, "dimensions": {"width": 5, "height": 5, "depth": 5, "weight": 10}}
---
> {"sku": "0-000-0001-0", "availability": {"start": {"date": "2020-10-10", "time": "08:00"}, "stop": {"date": "2020-10-29", "time": "17:00"}}, "dimensions": {"width": 5, "height": 5, "depth": 5, "weight": 10}}
c products[1]
< {"product": "dumdow", "sku": "0-000-0002-0", "availability": {"start": {"date": "2020-10-23", "time": "08:00"}, "stop": {"date": "2020-11-23", "time": "17:00"}}, "dimensions": {"width": 3, "height": 3, "depth": 3, "weight": 27}}
---
> {"sku": "0-000-0002-0", "availability": {"start": {"date": "2020-10-23", "time": "08:00"}, "stop": {"date": "2020-11-23", "time": "17:00"}}, "dimensions": {"width": 4, "height": 3, "depth": 3, "weight": 27}}
d products[2]
< {"product": "doohickey", "sku": "0-000-0003-0", "availability": {"start": {"date": "2020-08-01", "time": "10:00"}, "stop": {"date": "2020-09-25", "time": "10:00"}}, "dimensions": {"width": 1, "height": 2, "depth": 3, "weight": 4}}
a products[3]
> {"product": "doohickey", "availability": {"stop": {"date": "2020-09-25", "time": "10:00"}, "start": {"date": "2020-08-01", "time": "10:00"}}, "dimensions": {"weight": 4, "width": 1, "depth": 3, "height": 2}}
c products[3]
< {"product": "widget", "sku": "0-000-0004-0", "availability": {"start": {"date": "2020-01-01", "time": "12:00"}, "stop": {"date": "2020-01-01", "time": "16:00"}}, "dimensions": {"width": 9, "height": 10, "depth": 1, "weight": 4}}
---
> {"dimensions": {"width": 9, "height": 10, "depth": 1, "weight": 4}, "sku": "0-000-0004-0", "availability": {"start": {"date": "2020-01-01", "time": "12:00"}, "stop": {"date": "2020-01-01", "time": "16:00"}}}
For this case -- when we need to use an identity key to match up otherwise very different record sets -- the records were properly matched up and the real differences were reported. This includes a correct delete-add difference for the "doohickey" record because it is missing the mandatory identity key in the RHS document and could therefore not be matched for direct comparison.
What if you want the "doohickey" product from this record set to be matched despite the record lacking the mandatory "sku"? That's easy: use a yaml-diff Configuration File like so:
[rules]
/products = key
[keys]
/products[product=doohickey] = product
This changes the report to:
c products[0]
< {"product": "doodad", "sku": "0-000-0001-0", "availability": {"start": {"date": "2020-10-10", "time": "08:00"}, "stop": {"date": "2020-10-29", "time": "17:00"}}, "dimensions": {"width": 5, "height": 5, "depth": 5, "weight": 10}}
---
> {"sku": "0-000-0001-0", "availability": {"start": {"date": "2020-10-10", "time": "08:00"}, "stop": {"date": "2020-10-29", "time": "17:00"}}, "dimensions": {"width": 5, "height": 5, "depth": 5, "weight": 10}}
c products[1]
< {"product": "dumdow", "sku": "0-000-0002-0", "availability": {"start": {"date": "2020-10-23", "time": "08:00"}, "stop": {"date": "2020-11-23", "time": "17:00"}}, "dimensions": {"width": 3, "height": 3, "depth": 3, "weight": 27}}
---
> {"sku": "0-000-0002-0", "availability": {"start": {"date": "2020-10-23", "time": "08:00"}, "stop": {"date": "2020-11-23", "time": "17:00"}}, "dimensions": {"width": 4, "height": 3, "depth": 3, "weight": 27}}
c products[2]
< {"product": "doohickey", "sku": "0-000-0003-0", "availability": {"start": {"date": "2020-08-01", "time": "10:00"}, "stop": {"date": "2020-09-25", "time": "10:00"}}, "dimensions": {"width": 1, "height": 2, "depth": 3, "weight": 4}}
---
> {"product": "doohickey", "availability": {"stop": {"date": "2020-09-25", "time": "10:00"}, "start": {"date": "2020-08-01", "time": "10:00"}}, "dimensions": {"weight": 4, "width": 1, "depth": 3, "height": 2}}
c products[3]
< {"product": "widget", "sku": "0-000-0004-0", "availability": {"start": {"date": "2020-01-01", "time": "12:00"}, "stop": {"date": "2020-01-01", "time": "16:00"}}, "dimensions": {"width": 9, "height": 10, "depth": 1, "weight": 4}}
---
> {"dimensions": {"width": 9, "height": 10, "depth": 1, "weight": 4}, "sku": "0-000-0004-0", "availability": {"start": {"date": "2020-01-01", "time": "12:00"}, "stop": {"date": "2020-01-01", "time": "16:00"}}}
Notice that the "doohickey" record was successfully compared against its LHS equivalent, despite lacking a "sku". This custom configuration changed the identity key of just this specific oddball record from "sku" to "product". All other records were still matched up by their "sku" values.
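The matching step of a key-based comparison can be sketched as an index lookup on the identity key. This Python sketch is illustrative only and makes simplifying assumptions: one identity key for the whole Array (no per-record override like the configuration file above provides), and a hypothetical function name, diff_by_key, with a hypothetical tuple layout.

```python
def diff_by_key(lhs, rhs, key):
    """Match records on one identity key; report whole records that differ."""
    rhs_by_id = {rec[key]: rec for rec in rhs if key in rec}
    matched = set()
    report = []
    for rec in lhs:
        ident = rec.get(key)
        other = rhs_by_id.get(ident) if ident is not None else None
        if other is None:
            report.append(("d", rec))         # no RHS partner: deleted
        else:
            matched.add(ident)
            if rec != other:
                report.append(("c", rec, other))
    for rec in rhs:
        if rec.get(key) not in matched:
            report.append(("a", rec))         # no LHS partner: added
    return report

# Hypothetical, trimmed-down records for illustration only.
lhs = [
    {"sku": "0-000-0003-0", "product": "doohickey"},
    {"sku": "0-000-0004-0", "product": "widget"},
]
rhs = [
    {"sku": "0-000-0004-0"},
    {"product": "doohickey"},                 # lacks the identity key
]
key_report = diff_by_key(lhs, rhs, "sku")
```

Because the RHS "doohickey" record has no "sku", it cannot be paired and shows up as a delete plus an add, while "widget" is matched despite having moved and is reported as one whole-record change.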
The deep mode is the preferred comparison mode when the AoH elements are records with identity keys, those records may be at different ordinal positions and may have minor differences between the compared documents, or you need to see the minute differences between the records, no matter how few or many.
Using this mode against LHS2.yaml and RHS2.yaml (without the bonus configuration file) produces this report:
d products[0].product
< doodad
d products[1].product
< dumdow
c products[1].dimensions.width
< 3
---
> 4
d products[2]
< {"product": "doohickey", "sku": "0-000-0003-0", "availability": {"start": {"date": "2020-08-01", "time": "10:00"}, "stop": {"date": "2020-09-25", "time": "10:00"}}, "dimensions": {"width": 1, "height": 2, "depth": 3, "weight": 4}}
a products[3]
> {"product": "doohickey", "availability": {"stop": {"date": "2020-09-25", "time": "10:00"}, "start": {"date": "2020-08-01", "time": "10:00"}}, "dimensions": {"weight": 4, "width": 1, "depth": 3, "height": 2}}
d products[2].product
< widget
We can clearly see that the most common difference between the records is the deletion of the "product" field. Because the RHS "doohickey" record lacks a "sku", it was still not matched for comparison but was instead marked correctly as a delete-add.
Adding the same bonus configuration file from Bonus: Configuration Files with Key Comparisons further reduces the differences report to:
d products[0].product
< doodad
d products[1].product
< dumdow
c products[1].dimensions.width
< 3
---
> 4
d products[3].sku
< 0-000-0003-0
d products[2].product
< widget
Note that the delete-add difference pair was consolidated to show the "sku" field was removed from the oddball record.
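Conceptually, a deep comparison by key pairs records on their identity key first and only then walks each pair node by node. The Python sketch below illustrates that two-step idea under simplifying assumptions: a single identity key, Hash-only nesting, and paths built from the LHS index. The names leaf_diff and diff_deep_by_key are hypothetical and this is not yaml-diff's implementation.

```python
def leaf_diff(lhs, rhs, path):
    """Yield (marker, yaml_path, old, new) for each differing node of two Hashes."""
    for key in list(lhs) + [k for k in rhs if k not in lhs]:
        child = f"{path}.{key}"
        if key not in rhs:
            yield ("d", child, lhs[key], None)
        elif key not in lhs:
            yield ("a", child, None, rhs[key])
        elif isinstance(lhs[key], dict) and isinstance(rhs[key], dict):
            yield from leaf_diff(lhs[key], rhs[key], child)
        elif lhs[key] != rhs[key]:
            yield ("c", child, lhs[key], rhs[key])

def diff_deep_by_key(lhs, rhs, key, path="products"):
    """Pair records by identity key, then compare each pair node by node."""
    rhs_by_id = {rec[key]: rec for rec in rhs if key in rec}
    matched = set()
    for index, rec in enumerate(lhs):
        other = rhs_by_id.get(rec.get(key))
        if other is None:
            yield ("d", f"{path}[{index}]", rec, None)
        else:
            matched.add(rec[key])
            yield from leaf_diff(rec, other, f"{path}[{index}]")
    for index, rec in enumerate(rhs):
        if rec.get(key) not in matched:
            yield ("a", f"{path}[{index}]", None, rec)

# Hypothetical, trimmed-down records for illustration only.
lhs = [
    {"sku": "0-000-0001-0", "product": "doodad", "dimensions": {"width": 5}},
    {"sku": "0-000-0002-0", "product": "dumdow", "dimensions": {"width": 3}},
]
rhs = [
    {"sku": "0-000-0002-0", "dimensions": {"width": 4}},
    {"sku": "0-000-0001-0", "dimensions": {"width": 5}},
]
deep_report = list(diff_deep_by_key(lhs, rhs, "sku"))
```

Even though the two RHS records are in reverse order, each is compared against its LHS counterpart, so only the removed "product" fields and the changed width are reported.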
The yaml-diff tool can read per-YAML-Path comparison options from an INI-style configuration file via its --config (-c) argument. Whereas the --aoh (-O) argument supplies an overarching mode for comparing AoHs, using a configuration file permits far more precise control whenever you need a different mode for specific parts of the documents being compared.
The [defaults] section permits a key named aoh, which behaves identically to the --aoh (-O) command-line argument to the yaml-diff tool. The [defaults]aoh setting is overridden by the same-named command-line argument, when supplied. In practice, this file may look like:
File: merge-options.ini
[defaults]
aoh = position
Note that the spaces around the = sign are optional, but only an = sign may be used to separate each key from its value.
The [rules] section takes any YAML Paths as keys and, as their values, any of the AoH comparison modes that are available to the --aoh (-O) command-line argument. This enables extremely fine precision for applying the available modes.
This has already been explored at Bonus: Configuration Files with Key Comparisons.
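To make the interplay of these settings concrete, here is a minimal Python sketch that reads such a file with the standard configparser module and resolves the mode for one YAML Path. It is an illustration, not yaml-diff's code: the function names load_options and resolve_mode are hypothetical, and the lookup order (a matching [rules] entry first, then the command-line value, then [defaults]aoh, then position) is partly an assumption. The source only states that the command-line argument overrides [defaults]aoh and that position is the default mode.

```python
import configparser

INI_TEXT = """
[defaults]
aoh = position

[rules]
/products = key
"""

def load_options(text):
    """Parse INI text, accepting only '=' between keys and values."""
    parser = configparser.ConfigParser(delimiters=("=",))
    parser.optionxform = str  # keep YAML Paths exactly as written
    parser.read_string(text)
    return parser

def resolve_mode(parser, yaml_path, cli_aoh=None):
    """Pick the AoH mode for one YAML Path (lookup order is an assumption)."""
    if parser.has_option("rules", yaml_path):
        return parser.get("rules", yaml_path)
    if cli_aoh is not None:
        return cli_aoh  # stands in for a supplied --aoh (-O) value
    return parser.get("defaults", "aoh", fallback="position")

options = load_options(INI_TEXT)
products_mode = resolve_mode(options, "/products")
other_mode = resolve_mode(options, "/other", cli_aoh="deep")
```

With this sample file, /products resolves to key from its rule, while any other path falls back to the command-line value or, failing that, to the [defaults] entry.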
Like the [rules] section, the [keys] section takes any YAML Paths as keys. In contrast, each entry specifies the identity key for the AoH at the specified YAML Path, overriding implicit identity key detection for the targeted AoHs.
See Bonus: Configuration Files with Key Comparisons for an example.