Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce “UnforgivingHtml” parser. #231

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

theengineear
Copy link
Collaborator

@theengineear theengineear commented Dec 5, 2024

Goals of the parser:

  • Tighten control over things like double-quotes & closing tags.
  • Improve error messaging for malformed markup.
  • Improve performance.

TODOs / Questions:

  • Do we need to strictly reject CDATA in math and svg namespaces? https://w3c.github.io/html-reference/syntax.html#cdata-sections / https://developer.mozilla.org/en-US/docs/Web/API/CDATASection. Note, it appears that when attempting to add CDATA to html… it is converted to an html comment. Otherwise, in svg and math — it will work as expected. Pass, we can introduce new “forbidden” parsing states to throw an error here. It was already failing due to the fact that it wouldn’t match — but we now add a special error-time lookup to throw a helpful error.
  • Some of the validations around tag name, attribute name, and property name may be slow… should we be less strict? Is it the case that the browser would throw on any of these anyhow? Could we just let that happen? Punting this problem to later — let’s start strict.
  • I think HTML is often assumed to be UTF-8, but JS strings (i think) are sometimes treated via UTF-16. That means we can accept UTF-16 characters and inject them as text content. But, should we reject that? I would prefer to go as long as possible without concerning character encodings. If / when we can think of a way this could break (i.e., differing character encodings in text / files) — we could always revisit.
  • Is there a way to test UTF-8 multi-byte sequence interop with multi-by UTF-16 surrogate pair interop. Maybe not a problem, but it could be important to know one way or the other! I think if you really blow it, you will get some sort of error. See isWellFormed for some context.
  • Do we need special handling of NUL character — U+0000? It is forbidden in text / character data. What about “permanently undefined Unicode characters”? YAGNI.
  • Can text nodes be added in svg / math namespaces? yes
  • Should we use setAttributeNS when in a namespace? I believe this would only be required if there was a prefix associated with an attribute name like spec:align… since we would reject that attribute name anyhow, I think we already can’t get into this situation. Going to not bother with this one.
  • Do we need special consideration around optional newline for pre / textarea — https://html.spec.whatwg.org/multipage/syntax.html#element-restrictions. Indeed, if you set like document.body.innerHTML = '<pre>\nhi</pre> — that initial newline will not appear. Yuck. Yes. Fixed. (still “yuck” though!)
  • Are there any more maps we ought to turn into more-performant switches? This is a “later” problem.
  • Seems like there may be some conventions to use camelCase element names in SVG. Are we okay forcing them to be lowercase? Might as well start strict.

@theengineear theengineear force-pushed the unforgiving-html branch 3 times, most recently from 3bcbba0 to 295b4ed Compare December 11, 2024 17:00
@theengineear theengineear marked this pull request as draft December 17, 2024 01:40
@theengineear theengineear force-pushed the unforgiving-html branch 2 times, most recently from e5c6421 to 6ac7000 Compare December 18, 2024 21:30
x-template.js Outdated Show resolved Hide resolved
x-template.js Outdated Show resolved Hide resolved
x-template.js Outdated Show resolved Hide resolved
x-template.js Outdated Show resolved Hide resolved
x-template.js Outdated Show resolved Hide resolved
x-template.js Outdated Show resolved Hide resolved
x-template.js Outdated Show resolved Hide resolved
x-template.js Outdated Show resolved Hide resolved
x-template.js Outdated Show resolved Hide resolved
x-template.js Outdated Show resolved Hide resolved
@theengineear theengineear force-pushed the unforgiving-html branch 4 times, most recently from a3ed7d8 to 4ca7863 Compare December 20, 2024 00:15
}
// Again, there might be a quote we need to slice off here still.
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As mentioned — removing that Forgiving comment has made the git diff super wonky here. Let me know if you want that back in 👌

@theengineear theengineear marked this pull request as ready for review December 20, 2024 00:20
@theengineear theengineear requested a review from klebba December 20, 2024 00:21
@theengineear theengineear force-pushed the unforgiving-html branch 2 times, most recently from 3bfc3cf to 9874ee3 Compare December 20, 2024 01:18
@theengineear
Copy link
Collaborator Author

OK @klebba — Certainly, there are still bugs remaining in here, but I believe this should be a reasonable first-pass at the new “unforgiving” parser. I’m going to flip my attention to integration testing this in code we’ve authored to start confirming parsing behavior.

@theengineear theengineear force-pushed the unforgiving-html branch 3 times, most recently from ae0ea29 to 72c5137 Compare December 23, 2024 15:48
@theengineear theengineear force-pushed the unforgiving-html branch 2 times, most recently from 650a871 to 42257d6 Compare December 27, 2024 23:15
Goals of the parser:
* Tighten control over things like double-quotes & closing tags.
* Improve error messaging for malformed markup.
* Improve performance.

Closes #239.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants