Seeking advice on my very hacky "schema guessing" to generate YAML Paths with wildcards matching non-str document nodes #200
Unanswered · AndydeCleyre asked this question in Q&A · Replies: 1 comment, 2 replies
-
In my mind, the best answer requires two things:
I threw this together in about 40 minutes:

from typing import Any
from datetime import date
import pprint
from dateutil.parser import parse
from yamlpath.enums import PathSeperators
from yamlpath import YAMLPath
# I don't know how you are getting from TOML to internal Python, so I hand-
# edited the sample data.
sample_data = {
    "title": "TOML Example",
    "owner": {
        "name": "Tom Preston-Werner",
        "dob": parse("1979-05-27T07:32:00-08:00"),
    },
    "database": {
        "enabled": True,
        "ports": [ 8000, 8001, 8002 ],
        "data": [ ["delta", "phi"], [3.14] ],
        "temp_targets": { "cpu": 79.5, "case": 72.0 },
    },
    "servers": {},
    "servers.alpha": {
        "ip": "10.0.0.1",
        "role": "frontend",
    },
    "servers.beta": {
        "ip": "10.0.0.2",
        "role": "backend",
    },
}

class ShortCircuitException(Exception):
    """Custom exception to break out of a recursive iteration."""

def container_is_homogeneous(node: Any, seek_type: type) -> bool:
    """Test whether every leaf in a container is of one data-type."""
    is_homogeneous = True
    if isinstance(node, list):
        for val in node:
            if not isinstance(val, seek_type):
                is_homogeneous = False
                break
    elif isinstance(node, dict):
        for val in node.values():
            if not isinstance(val, seek_type):
                is_homogeneous = False
                break
    return is_homogeneous

# Somewhere in your code, you're recursively processing the source data. In
# THAT method, you know when you've found something you want to add to the
# result. If you keep track of the parent node, you could check the type of
# the parent node and when it is a container type, you could short-circuit the
# iteration if every child of the parent matches the sought data-type,
# returning your wildcard path.
def recurse_data(
    node: Any,
    parent: Any,
    seek_type: type,
    build_schema: dict,
    build_path: YAMLPath
) -> dict:
    """Build the expected schema."""
    if isinstance(node, list):
        for lidx, lnode in enumerate(node):
            try:
                build_schema = recurse_data(
                    lnode, node, seek_type, build_schema,
                    build_path + f"[{lidx}]")
            except ShortCircuitException:
                break
    elif isinstance(node, dict):
        for key, value in node.items():
            # If your intention is to later use these YAML Paths as input to
            # another process, you should escape the keys.
            escaped_key = YAMLPath.escape_path_section(key, PathSeperators.DOT)
            try:
                build_schema = recurse_data(
                    value, node, seek_type, build_schema,
                    build_path + escaped_key)
            except ShortCircuitException:
                break
    elif isinstance(node, seek_type):
        if seek_type not in build_schema:
            build_schema[seek_type] = []
        if (isinstance(parent, (list, dict, set))
            and container_is_homogeneous(parent, seek_type)
        ):
            build_path.pop()
            build_schema[seek_type].append(build_path + "*")
            raise ShortCircuitException()
        build_schema[seek_type].append(build_path)
    return build_schema

# Try it out
pp = pprint.PrettyPrinter(indent=4)
print("Source Data:")
pp.pprint(sample_data)
schema = {}
schema = recurse_data(sample_data, None, date, schema, YAMLPath(""))
schema = recurse_data(sample_data, None, bool, schema, YAMLPath(""))
schema = recurse_data(sample_data, None, int, schema, YAMLPath(""))
schema = recurse_data(sample_data, None, float, schema, YAMLPath(""))
print("\n\nSchema:")
pp.pprint(schema)

This walks through the source data, scanning it for the required type. I don't know how you're building the schema or checking for those custom data-types, so I made this as generic as I could. I used a custom exception, ShortCircuitException, to break out of the recursion once a wildcard path has been recorded.

The output is:
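
A caveat on the isinstance checks in the script above: in Python, bool is a subclass of int, so the int pass would also collect boolean nodes such as database.enabled. Below is a minimal sketch of a stricter test, assuming you want booleans reported only under bool. The names matches_type and container_is_strictly_homogeneous are hypothetical helpers for illustration, not part of yamlpath.

```python
from typing import Any


def matches_type(value: Any, seek_type: type) -> bool:
    """Like isinstance(), but a bool only matches when bool itself is sought."""
    if isinstance(value, bool) and seek_type is not bool:
        return False
    return isinstance(value, seek_type)


def container_is_strictly_homogeneous(node: Any, seek_type: type) -> bool:
    """Return True when every direct child of a list or dict matches seek_type."""
    if isinstance(node, list):
        children = node
    elif isinstance(node, dict):
        children = list(node.values())
    else:
        # Mirror the script above: non-containers are reported as homogeneous.
        return True
    return all(matches_type(child, seek_type) for child in children)
```

Swapping matches_type in for the bare isinstance test on leaf nodes should keep True and False out of the int results while leaving the date, bool, and float passes unchanged.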
-
Hello!
It might be best if I lead with the example, then explain.
My yamlpath-powered tool, NestedTextTo, in the feature/experimental-schema-guessing branch, behaves as follows:
sample.toml:
$ toml2nt sample.toml --to-schema
The part I'm working on is that commented bit in the output. In this case, it is producing the result I want: an efficient list of properly matching YAML Paths for each non-string node.
This example has TOML input, but it works the same for YAML input.
But the way I construct these guesses is fragile; I'm deconstructing YAMLPath segments, doing string manipulations, then reconstructing, and it'll probably break hard on YAML Paths that include special characters, etc. And I'm just guessing that if the first array item is a match, it's ok to wildcard that index... but I'm ok with that heuristic for now.
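
To make the trade-off in that heuristic concrete, here is a small plain-Python sketch comparing "first item matches" against "every item matches". Both function names are hypothetical and nothing here comes from NestedTextTo or yamlpath; it only illustrates where the two checks disagree.

```python
from typing import Any, List


def first_item_matches(items: List[Any], seek_type: type) -> bool:
    """The heuristic described above: trust the first element only."""
    return bool(items) and isinstance(items[0], seek_type)


def every_item_matches(items: List[Any], seek_type: type) -> bool:
    """Stricter alternative: wildcard the index only when all elements match."""
    return bool(items) and all(isinstance(item, seek_type) for item in items)


uniform = [8000, 8001, 8002]
mixed = [3.14, "delta", "phi"]

# Both checks agree on the uniform list; they disagree on the mixed one,
# where a wildcard path would also match the two strings.
print(first_item_matches(uniform, int), every_item_matches(uniform, int))
print(first_item_matches(mixed, float), every_item_matches(mixed, float))
```

The stricter check costs one pass over each array, which may or may not matter for the documents you expect.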
So I'd very much appreciate any ideas about how to do this better, more robustly handling weird paths, using yamlpath's own objects and functions in more places rather than hoping that joining strings will just be fine.
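
One way to avoid string surgery is to keep each path as a tuple of segments until the very last step, and only then escape and join. The sketch below is plain Python under several assumptions: mapping keys are strings, every list index may be wildcarded, and escape_key is a simplified stand-in for YAMLPath.escape_path_section rather than yamlpath's real escaping rules. The names walk, collapse_indexes, escape_key, and render are all hypothetical.

```python
from typing import Any, Dict, Iterator, List, Tuple

WILDCARD = object()  # sentinel, so a key literally named "*" stays distinct
Segments = Tuple[Any, ...]


def walk(node: Any, prefix: Segments = ()) -> Iterator[Tuple[Segments, Any]]:
    """Yield (segments, leaf) pairs; dict keys stay str, list indexes stay int."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from walk(value, prefix + (key,))
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from walk(value, prefix + (index,))
    else:
        yield prefix, node


def collapse_indexes(paths: List[Segments]) -> List[Segments]:
    """Replace every list index with WILDCARD and drop duplicates, keeping order."""
    brief: Dict[Segments, None] = {}
    for segments in paths:
        key = tuple(WILDCARD if isinstance(s, int) else s for s in segments)
        brief[key] = None
    return list(brief)


def escape_key(key: str, sep: str = ".") -> str:
    """Simplified stand-in for real path escaping (assumption, not yamlpath)."""
    return key.replace("\\", "\\\\").replace(sep, "\\" + sep)


def render(segments: Segments, sep: str = ".") -> str:
    """Join segments into a path string, escaping keys only at this last step."""
    out = ""
    for seg in segments:
        if seg is WILDCARD:
            out += (sep if out else "") + "*"
        elif isinstance(seg, int):
            out += f"[{seg}]"
        else:
            out += (sep if out else "") + escape_key(str(seg), sep)
    return out


data = {"servers.alpha": {"ip": "10.0.0.1"}, "database": {"ports": [8000, 8001, 8002]}}
int_paths = [segs for segs, leaf in walk(data) if isinstance(leaf, int)]
print([render(p) for p in collapse_indexes(int_paths)])
```

Because a key such as servers.alpha travels as a single tuple element, the dot inside it cannot be confused with a path separator until render runs. In real code, that final step is where yamlpath's own escaping would go.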
The function that does this is nt2.yamlpath_tools.guess_briefer_schema, copied at the bottom of this post. It takes a dict parameter with more literal matches; in this case:
and returns, in this case:
And now, hidden behind this fold to protect the faint of heart, is the function itself, in its current state:
Here be dragons