-
Notifications
You must be signed in to change notification settings - Fork 585
fix serialiser makes invalid turtle curis #3364
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: 7.x
Are you sure you want to change the base?
fix serialiser makes invalid turtle curis #3364
Conversation
|
Suggested location for fix where the entire string should be perform backslash escaping for any character which will fail to parse later (includes % character). https://github.com/RDFLib/rdflib/blob/7.x/rdflib/plugins/serializers/turtle.py#L331 The list of characters that need backslash escaping is at https://www.w3.org/TR/turtle/#grammar-production-PN_LOCAL_ESC |
…cial characters in local names of IRIs
|
pre-commit.ci autofix |
for more information, see https://pre-commit.ci
…ackslash escaping
|
This looks fine to me at least. I have compared it to the discussions in the corresponding issue. Can you test every to be escaped character? |
I will address these point on Friday. I actually have a feeling some other magic code deals with every character except percent but haven't had a chance to get back to dev environment to test thoroughly. I will split each test into its own file if needed.... |
rdflib/plugins/serializers/turtle.py
Outdated
| prefix, namespace, local = parts | ||
|
|
||
| local = local.replace(r"(", r"\(").replace(r")", r"\)") | ||
| local = re.sub(r"[\"'~!$&\(\)*+,;=/\?#@%]", r"\\\0", local) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! Should this RE be compiled somewhere to make it faster?
I don't know whether there are perf test somewhere though.
e.g.
# After imports ...
RE_ESCAPE_CHARS = re.compile(r"[\"'~!$&\(\)*+,;=/\?#@%]")
...
# in the function ...
local RE_ESCAPE_CHARS.sub(r"\\\0", local)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes absolutely should :)
|
These are the canonical references that I need to check when I get back to this:
|
|
Removing the following from the test file as none of these triples failed roundtrip and I assume there is probably test coverage elsewhere for these. It was just the percent sign being escaped that failed. |
|
Ready again. The test and fix is now very precisely targeted to escaped percent character in a localname. This works in Jena turtle serialiser/parser just fine. Rdflib must have missed it due to percent character appearing in the grammar specifically to percent-escape other/unprintable characters with 2-digit hexadecimal sequence. The fix is using a precompiled regex to detect percent (%) characters not followed by 2-digit hex sequence. Such characters are replaced by blackslash plus percent character. As requested also took opportunity to rename getQName function to get_pname where it appeared to be in fact getting a Prefixed Name. This was relevant to or touched four serialisers: Turtle, Long turtle, Trig and N3. |
|
Am I expected to rebase and selectively squash to clean up commits or will they get squashed on merge? |
|
Bump - anyone out there? |
|
Hi @lisat-dstg, thanks for providing a fix. I haven't reviewed your code just yet but I have approved the running of the validate workflows on this PR. There appears to be a mypy error. Do you mind taking a look? If you're unsure of anything, please reach out and I'll try to respond promptly. |
Thanks @edmondchuc I think I've fixed but having trouble with dev envt. Can you please rerun the CI pipelines? Thx |
|
@lisat-dstg I've triggered the CI again. Sorry for the delay. Your changes look great. It correctly escapes |
Fixes issue #1395
So far just the breaking test (1 new test artefact generates 4 breaking tests) -- one each for formats: trig, ttl, turtle and n3