@lisat-dstg

Fixes issue #1395

So far this contains only the failing test: one new test artefact generates four failing tests, one each for the formats trig, ttl, turtle and n3.

@lisat-dstg lisat-dstg changed the base branch from main to 7.x January 9, 2026 05:44
@lisat-dstg
Author

Suggested location for the fix, where backslash escaping should be applied across the entire string to any character that would otherwise fail to parse later (this includes the % character).

https://github.com/RDFLib/rdflib/blob/7.x/rdflib/plugins/serializers/turtle.py#L331

The list of characters that need backslash escaping is at https://www.w3.org/TR/turtle/#grammar-production-PN_LOCAL_ESC
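
For reference, a minimal sketch of what escaping against that production could look like. The helper name `escape_pn_local` and the constant `PN_LOCAL_ESC_CHARS` are hypothetical and not part of rdflib. The character set is transcribed from PN_LOCAL_ESC, minus `_`, `-` and `.`, which the sketch assumes can stay bare (a simplification that ignores PN_LOCAL's positional rules, such as a trailing `.`).

```python
# Characters from the Turtle PN_LOCAL_ESC production that this sketch escapes.
# Assumption: "_", "-" and "." are left bare, ignoring PN_LOCAL's positional rules.
PN_LOCAL_ESC_CHARS = frozenset("~!$&'()*+,;=/?#@%")


def escape_pn_local(local: str) -> str:
    """Hypothetical helper: backslash-escape reserved characters in a local name."""
    return "".join("\\" + ch if ch in PN_LOCAL_ESC_CHARS else ch for ch in local)


if __name__ == "__main__":
    print(escape_pn_local("foo_(bar)"))
    print(escape_pn_local("50%_off"))
```

This version escapes every `%`, including ones that already begin a percent-encoded triplet; whether that is desirable is a separate question for the serialiser.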

@lisat-dstg lisat-dstg marked this pull request as ready for review January 12, 2026 04:25
@lisat-dstg
Author

pre-commit.ci autofix

@WhiteGobo
Contributor

This looks fine to me at least. I have compared it to the discussions in the corresponding issue.
I think getQName should also be renamed to getPName as one of the comments of niklasl suggests.

Can you test every character that needs to be escaped?
And you could add the link https://www.w3.org/TR/turtle/#grammar-production-PN_LOCAL_ESC as a comment to getQName.

@lisat-dstg
Author

I will address these points on Friday. I actually have a feeling some other magic code deals with every character except percent, but I haven't had a chance to get back to my dev environment to test thoroughly. I will split each test into its own file if needed.

prefix, namespace, local = parts

local = local.replace(r"(", r"\(").replace(r")", r"\)")
local = re.sub(r"[\"'~!$&\(\)*+,;=/\?#@%]", r"\\\0", local)

@ioggstream ioggstream Jan 14, 2026

Thanks! Should this RE be compiled somewhere to make it faster?
I don't know whether there are perf tests somewhere, though.

e.g.

# After imports ...
RE_ESCAPE_CHARS = re.compile(r"[\"'~!$&\(\)*+,;=/\?#@%]")

...
        # in the function ...
        # \g<0> re-inserts the whole match; \0 would be read as an octal NUL escape
        local = RE_ESCAPE_CHARS.sub(r"\\\g<0>", local)

Author

Yes absolutely should :)
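
A sketch of the precompiled form discussed in this thread, with one caveat worth checking: in Python's `re` replacement templates the whole match is written `\g<0>`, whereas `\0` is parsed as an octal escape and yields a NUL character. The function name `escape_local` is illustrative only and is not the name used in rdflib.

```python
import re

# Compiled once at import time rather than on every call.
RE_ESCAPE_CHARS = re.compile(r"[\"'~!$&\(\)*+,;=/\?#@%]")


def escape_local(local: str) -> str:
    """Illustrative only: prefix each matched character with a backslash."""
    # r"\\" emits a literal backslash; r"\g<0>" re-inserts the matched character.
    return RE_ESCAPE_CHARS.sub(r"\\\g<0>", local)


if __name__ == "__main__":
    print(escape_local("foo_(bar)"))
```

Python's `re` module already caches recently compiled patterns, so the gain from precompiling is likely small; the clearer benefit is having a single named pattern to point at.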

@lisat-dstg
Author

lisat-dstg commented Jan 19, 2026

These are the canonical references that I need to check when I get back to this:

  1. Turtle grammar (see https://www.w3.org/TR/turtle/#sec-grammar-grammar) defines LOCALNAME in the PN_LOCAL production (see https://www.w3.org/TR/turtle/#grammar-production-PN_LOCAL).
  2. TriG grammar (see https://www.w3.org/TR/rdf12-trig/#grammar-ebnf) defines LOCALNAME in the PN_LOCAL production (see https://www.w3.org/TR/rdf12-trig/#grammar-production-PN_LOCAL).
  3. N-Triples grammar (see https://www.w3.org/TR/rdf12-n-triples/#sec-grammar-grammar) does not define LOCALNAME because N-Triples doesn't permit prefixed names. It permits something similar for blank nodes, but not for named nodes, so it is irrelevant to the discussion/specification of LOCALNAME.
  4. Same for N-Quads.
  5. RDF/XML does include the concept of LOCALNAME. IRIs can be formed in three ways, one of which is via qualified names (namespace-qualified element or attribute names). Qualified names (QNames) are basically the XML version of prefixed names. Just like a prefixed name, a QName has a namespace prefix followed by a colon followed by a localname. The QName grammar (see https://www.w3.org/TR/REC-xml-names/#ns-qualnames) describes the localname-equivalent grammar in the LocalPart production (see https://www.w3.org/TR/REC-xml-names/#NT-LocalPart), which in turn is described in the NCName production (see https://www.w3.org/TR/REC-xml-names/#NT-NCName), which states that the localname comprises all characters in the Name production minus colon (see https://www.w3.org/TR/REC-xml/#NT-Name).

@lisat-dstg
Author

Removing the following from the test file as none of these triples failed roundtrip and I assume there is probably test coverage elsewhere for these. It was just the percent sign being escaped that failed.

:foo_\'_bar :prop "test iri including escaped char '" .
:foo_\~_bar :prop "test iri including escaped char ~" .
:foo_\!_bar :prop "test iri including escaped char !" .
:foo_\$_bar :prop "test iri including escaped char $" .
:foo_\&_bar :prop "test iri including escaped char &" .
:foo_\(_bar :prop "test iri including escaped char (" .
:foo_\)_bar :prop "test iri including escaped char )" .
:foo_\*_bar :prop "test iri including escaped char *" .
:foo_\+_bar :prop "test iri including escaped char +" .
:foo_\,_bar :prop "test iri including escaped char ," .
:foo_\;_bar :prop "test iri including escaped char ;" .
:foo_\=_bar :prop "test iri including escaped char =" .
:foo_\/_bar :prop "test iri including escaped char /" .
:foo_\?_bar :prop "test iri including escaped char ?" .
:foo_\#_bar :prop "test iri including escaped char #" .
:foo_\@_bar :prop "test iri including escaped char @" .

@lisat-dstg
Author

Ready again.

The test and fix are now targeted precisely at an escaped percent character in a localname. Jena's Turtle serialiser/parser handles this case just fine. RDFLib must have missed it because the percent character also appears in the grammar specifically to percent-encode other or unprintable characters as a two-digit hexadecimal sequence.

The fix uses a precompiled regex to detect percent (%) characters that are not followed by a two-digit hex sequence. Such characters are replaced by a backslash plus the percent character.
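
A minimal sketch of that approach, assuming a negative lookahead is used for the "not followed by two hex digits" check. The names `BARE_PERCENT` and `escape_bare_percent` are hypothetical and are not the identifiers used in the PR.

```python
import re

# "%" not followed by two hex digits, i.e. not the start of a percent-encoded triplet.
BARE_PERCENT = re.compile(r"%(?![0-9A-Fa-f]{2})")


def escape_bare_percent(local: str) -> str:
    """Hypothetical helper: backslash-escape only the bare percent signs."""
    # In the replacement template, r"\\" is a literal backslash and "%" is literal.
    return BARE_PERCENT.sub(r"\\%", local)


if __name__ == "__main__":
    for name in ("100%", "a%20b", "a%2", "%%41"):
        print(name, "->", escape_bare_percent(name))
```

Percent-encoded triplets such as `%20` pass through unchanged, while a trailing or malformed `%` gains a backslash.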

As requested, I also took the opportunity to rename the getQName function to get_pname wherever it was in fact getting a prefixed name. This touched four serialisers: Turtle, Long Turtle, TriG and N3.

@lisat-dstg
Author

Am I expected to rebase and selectively squash to clean up commits or will they get squashed on merge?

@lisat-dstg
Author

Bump - anyone out there?

@edmondchuc
Contributor

Hi @lisat-dstg, thanks for providing a fix. I haven't reviewed your code just yet but I have approved the running of the validate workflows on this PR.

There appears to be a mypy error. Do you mind taking a look? If you're unsure of anything, please reach out and I'll try to respond promptly.

  py38: commands[3]> poetry run python -m mypy --show-error-context --show-error-codes --junit-xml=test_reports/3.8-macos-latest-mypy-junit.xml
  rdflib/plugins/serializers/longturtle.py: note: In member "get_pname" of class "LongTurtleSerializer":
  rdflib/plugins/serializers/longturtle.py:198: error: "LongTurtleSerializer" has no attribute "LOCALNAME_PECRENT_CHARACTER_REQUIRING_ESCAPE_REGEX"  [attr-defined]
  Found 1 error in 1 file (checked 464 source files)

@lisat-dstg
Author

Thanks @edmondchuc, I think I've fixed it, but I'm having trouble with my dev environment. Can you please rerun the CI pipelines? Thanks.

@edmondchuc
Contributor

@lisat-dstg I've triggered the CI again. Sorry for the delay.

Your changes look great. It correctly escapes % when not followed by two hex digits, as demonstrated in your test file, so percent-encoded values remain untouched. Nice work!
