-
Notifications
You must be signed in to change notification settings - Fork 14
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Introduce “UnforgivingHtml” parser. #231
base: main
Are you sure you want to change the base?
Conversation
3bcbba0
to
295b4ed
Compare
295b4ed
to
a964a2b
Compare
e5c6421
to
6ac7000
Compare
a3ed7d8
to
4ca7863
Compare
} | ||
// Again, there might be a quote we need to slice off here still. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As mentioned — removing that Forgiving
comment has made the git diff super wonky here. Let me know if you want that back in 👌
3bfc3cf
to
9874ee3
Compare
OK @klebba — Certainly, there are still bugs remaining in here, but I believe this should be a reasonable first-pass at the new “unforgiving” parser. I’m going to flip my attention to integration testing this in code we’ve authored to start confirming parsing behavior. |
ae0ea29
to
72c5137
Compare
650a871
to
42257d6
Compare
Goals of the parser: * Tighten control over things like double-quotes & closing tags. * Improve error messaging for malformed markup. * Improve performance. Closes #239.
42257d6
to
72ade95
Compare
Goals of the parser:
TODOs / Questions:
Do we need to strictly rejectPass, we can introduce new “forbidden” parsing states to throw an error here. It was already failing due to the fact that it wouldn’t match — but we now add a special error-time lookup to throw a helpful error.CDATA
inmath
andsvg
namespaces? https://w3c.github.io/html-reference/syntax.html#cdata-sections / https://developer.mozilla.org/en-US/docs/Web/API/CDATASection. Note, it appears that when attempting to add CDATA to html… it is converted to an html comment. Otherwise, in svg and math — it will work as expected.Some of the validations around tag name, attribute name, and property name may be slow… should we be less strict? Is it the case that the browser would throw on any of these anyhow? Could we just let that happen?Punting this problem to later — let’s start strict.I think HTML is often assumed to be UTF-8, but JS strings (i think) are sometimes treated via UTF-16. That means we can accept UTF-16 characters and inject them as text content. But, should we reject that?I would prefer to go as long as possible without concerning character encodings. If / when we can think of a way this could break (i.e., differing character encodings in text / files) — we could always revisit.Is there a way to test UTF-8 multi-byte sequence interop with multi-by UTF-16 surrogate pair interop. Maybe not a problem, but it could be important to know one way or the other!I think if you really blow it, you will get some sort of error. See isWellFormed for some context.Do we need special handling of NUL character — U+0000? It is forbidden in text / character data. What about “permanently undefined Unicode characters”?YAGNI.Should we use setAttributeNS when in a namespace?I believe this would only be required if there was a prefix associated with an attribute name likespec:align
… since we would reject that attribute name anyhow, I think we already can’t get into this situation. Going to not bother with this one.Do we need special consideration around optional newline for pre / textarea — https://html.spec.whatwg.org/multipage/syntax.html#element-restrictions. Indeed, if you set likeYes. Fixed. (still “yuck” though!)document.body.innerHTML = '<pre>\nhi</pre>
— that initial newline will not appear. Yuck.Are there any more maps we ought to turn into more-performant switches?This is a “later” problem.Seems like there may be some conventions to use camelCase element names in SVG. Are we okay forcing them to be lowercase?Might as well start strict.