Characters inside plain text urls should not be escaped #324

fredck · 2020-03-06T15:10:01Z

Input:

https://text.com/some_page

Current output:

https://text.com/some\_page

Expected:

https://text.com/some_page

Rational

All I can confirm is that the current output is broken when loaded in GitHub (it points to the url with backslash), not to the original url without it.

I can't confirm whether or not escaping is right or if this is a GitHub bug (by design?). And I can assume that GitHub will never change this, considering that amount of content it has already.

The text was updated successfully, but these errors were encountered:

fredck · 2020-03-06T15:11:42Z

Additionally, I was undecided whether this is a turndown-plugin-gfm or turndown issue. Decided to start here but please correct me if I'm wrong and I can move it to turndown.

martincizek · 2020-06-27T17:41:03Z

Actually this issue is the break-even point where naive iterative escaping can no longer work. That's why we implemented union-replacer and gfm-escape on top of it . Unfortunately, there is no hook allowing to choose it as an alternative to the default escaping atm, so we're using a fork with such a hook introduced.

For more details, see this discussion - it's an older proposal of such hook, but we've used the textReplacement approach instead in the end.

Edit: I realized the principle of addressing this issue is actually demonstrated in union-replacer docs here: https://github.com/orchitech/union-replacer#conservative-markdown-escaping. It is still a simplified solution for demonstration purposes. A comprehensive rule for the extended autolink particular replace is a little more complex, especially trailing punctuation chars are extremely tricky, as they include formatting tags, but they must not be backslash-escaped. These tricky cases can be found in the tests, which are automatically tested on backrendering equality ... the feeling of uncorrupted data is priceless. :)

@domchristie Any chance of some follow-up on this suggestion?

fredck · 2020-06-29T12:34:59Z

To add to this discussion, let me just point you guys to the solution (hack) we implemented to workaround for this problem:
https://github.com/ckeditor/ckeditor5/blob/92aa0b3c999397c8f87cd29595fe4d9ad9d4c376/packages/ckeditor5-markdown-gfm/src/html2markdown/html2markdown.js#L12-L54

…prehensive, context-dependent escaping. See mixmark-io#324.

martincizek · 2020-07-06T13:10:16Z

To put @fredck's "hack" in context - that's in principle the approach where the text escaping is not performed in the iterative way. Some text is matched, the corresponding escape rule is applied and escaping continues after the matched text without escaping the already escaped text again using another rules. My point is that this mechanism should just be provided by default for all the escape rules, as pointed out above. Extended web autolinks are not the only case making it necessary and Fred's hack indeed has side effects, as it changes the matching context for the regular expressions (e.g. introduces false text beginnings).

@fredck Sorry for being such a bitch :-), but I should probably explain why I said it is not possible to implement this correctly with current Turndown's hooks. While Turndown's iterative escaping approach can be worked around, the other problem is that this escaping is also Markdown context-dependent. Special chars need to be escaped even in URLs when they are part of a link text. Which cannot be easily hacked now and which is why the hook has to be extended like suggested in #339.

Consider this:

HTML: <a href="http://example.org/_foo_/">See http://example.org/_foo_/</a>
If it converts to: [See http://example.org/_foo_/](http://example.org/_foo_/).
then it translates back as: <a href="http://example.org/_foo_/">See http://example.org/<em>foo</em>/</a>.
The correct output is: [See http://example.org/\_foo\_/](http://example.org/_foo_/).

The same issue is probably achievable even by just hitting twice edit and save in your wysiwyg, when the original text contains an autolink. :-)
Indeed, it is still better to corrupt only the text and not the URL, but I hope it would be solvable consistently soon...

fredck · 2020-07-06T13:34:04Z

@fredck Sorry for being such a bitch :-)

@martincizek, I'm with you on that. As said, we implemented a "hack" to solve a very specific problem. As such, it is far from being the right-right way of moving forward here. Thanks for taking care of this ;)

domchristie transferred this issue from mixmark-io/turndown-plugin-gfm May 11, 2020

domchristie added the escape label May 13, 2020

martincizek pushed a commit to orchitech/turndown that referenced this issue Jul 6, 2020

Provide text escaping and replacement hooks to allow implementing com…

f1805ba

…prehensive, context-dependent escaping. See mixmark-io#324.

martincizek mentioned this issue Jul 6, 2020

Provide text escaping and replacement hooks for context-dependent escaping #339

Draft

martincizek mentioned this issue Jul 6, 2020

Do no escape urls in text nodes mixmark-io/turndown-plugin-gfm#23

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Characters inside plain text urls should not be escaped #324

Characters inside plain text urls should not be escaped #324

fredck commented Mar 6, 2020

fredck commented Mar 6, 2020

martincizek commented Jun 27, 2020 •

edited

Loading

fredck commented Jun 29, 2020

martincizek commented Jul 6, 2020 •

edited

Loading

fredck commented Jul 6, 2020

Characters inside plain text urls should not be escaped #324

Characters inside plain text urls should not be escaped #324

Comments

fredck commented Mar 6, 2020

Rational

fredck commented Mar 6, 2020

martincizek commented Jun 27, 2020 • edited Loading

fredck commented Jun 29, 2020

martincizek commented Jul 6, 2020 • edited Loading

fredck commented Jul 6, 2020

martincizek commented Jun 27, 2020 •

edited

Loading

martincizek commented Jul 6, 2020 •

edited

Loading