Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Characters inside plain text urls should not be escaped #324

Open
fredck opened this issue Mar 6, 2020 · 5 comments
Open

Characters inside plain text urls should not be escaped #324

fredck opened this issue Mar 6, 2020 · 5 comments
Labels

Comments

@fredck
Copy link
Contributor

fredck commented Mar 6, 2020

Input:

https://text.com/some_page

Current output:

https://text.com/some\_page

Expected:

https://text.com/some_page

Rational

All I can confirm is that the current output is broken when loaded in GitHub (it points to the url with backslash), not to the original url without it.

I can't confirm whether or not escaping is right or if this is a GitHub bug (by design?). And I can assume that GitHub will never change this, considering that amount of content it has already.

@fredck
Copy link
Contributor Author

fredck commented Mar 6, 2020

Additionally, I was undecided whether this is a turndown-plugin-gfm or turndown issue. Decided to start here but please correct me if I'm wrong and I can move it to turndown.

@domchristie domchristie transferred this issue from mixmark-io/turndown-plugin-gfm May 11, 2020
@martincizek
Copy link
Collaborator

martincizek commented Jun 27, 2020

Actually this issue is the break-even point where naive iterative escaping can no longer work. That's why we implemented union-replacer and gfm-escape on top of it . Unfortunately, there is no hook allowing to choose it as an alternative to the default escaping atm, so we're using a fork with such a hook introduced.

For more details, see this discussion - it's an older proposal of such hook, but we've used the textReplacement approach instead in the end.

Edit: I realized the principle of addressing this issue is actually demonstrated in union-replacer docs here: https://github.com/orchitech/union-replacer#conservative-markdown-escaping. It is still a simplified solution for demonstration purposes. A comprehensive rule for the extended autolink particular replace is a little more complex, especially trailing punctuation chars are extremely tricky, as they include formatting tags, but they must not be backslash-escaped. These tricky cases can be found in the tests, which are automatically tested on backrendering equality ... the feeling of uncorrupted data is priceless. :)

@domchristie Any chance of some follow-up on this suggestion?

@fredck
Copy link
Contributor Author

fredck commented Jun 29, 2020

To add to this discussion, let me just point you guys to the solution (hack) we implemented to workaround for this problem:
https://github.com/ckeditor/ckeditor5/blob/92aa0b3c999397c8f87cd29595fe4d9ad9d4c376/packages/ckeditor5-markdown-gfm/src/html2markdown/html2markdown.js#L12-L54

@martincizek
Copy link
Collaborator

martincizek commented Jul 6, 2020

To put @fredck's "hack" in context - that's in principle the approach where the text escaping is not performed in the iterative way. Some text is matched, the corresponding escape rule is applied and escaping continues after the matched text without escaping the already escaped text again using another rules. My point is that this mechanism should just be provided by default for all the escape rules, as pointed out above. Extended web autolinks are not the only case making it necessary and Fred's hack indeed has side effects, as it changes the matching context for the regular expressions (e.g. introduces false text beginnings).

@fredck Sorry for being such a bitch :-), but I should probably explain why I said it is not possible to implement this correctly with current Turndown's hooks. While Turndown's iterative escaping approach can be worked around, the other problem is that this escaping is also Markdown context-dependent. Special chars need to be escaped even in URLs when they are part of a link text. Which cannot be easily hacked now and which is why the hook has to be extended like suggested in #339.

Consider this:

  • HTML: <a href="http://example.org/_foo_/">See http://example.org/_foo_/</a>
  • If it converts to: [See http://example.org/_foo_/](http://example.org/_foo_/).
  • then it translates back as: <a href="http://example.org/_foo_/">See http://example.org/<em>foo</em>/</a>.
  • The correct output is: [See http://example.org/\_foo\_/](http://example.org/_foo_/).

The same issue is probably achievable even by just hitting twice edit and save in your wysiwyg, when the original text contains an autolink. :-)
Indeed, it is still better to corrupt only the text and not the URL, but I hope it would be solvable consistently soon...

@fredck
Copy link
Contributor Author

fredck commented Jul 6, 2020

@fredck Sorry for being such a bitch :-)

@martincizek, I'm with you on that. As said, we implemented a "hack" to solve a very specific problem. As such, it is far from being the right-right way of moving forward here. Thanks for taking care of this ;)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants