Skip to content

Conversation

SirNickolas
Copy link
Contributor

  1. Quotation marks are mandatory in XML. Since there is no easy way of telling HTML apart from HTML-looking XML (<title> is both an HTML and SVG tag, for example), we should quote everything.
    This also addresses Attribute values not quoted when getting HTML from Document #31.

  2. Some entities must be preserved to prevent JS injection attacks:

// Read properly escaped text.
auto doc = createDocument(`&lt;script&gt;alert("Hello!");&lt;/script&gt;`);
// Write malicious code.
assert(doc.root.innerHTML == `<script>alert("Hello!");</script>`);

@marcioapm
Copy link
Contributor

marcioapm commented Nov 29, 2018

I'm leaning towards always quote only SVG and other XML tags and all descendants, when in compact*HTML functions. The idea behind those is to be used as a simple HTML compressor. For the normal output, and your current change will greatly reduce the effectiveness for this use-case. Enabling it only where necessary seems like a sensible solution. Maybe even do it for all tags except the known "HTML5" ones. What do you think?

@SirNickolas
Copy link
Contributor Author

SirNickolas commented Nov 30, 2018

Minification is indeed a nice goal as long as it doesn't lead to invalid outcome.

Hmm, it might be a good idea to keep a whitelist of HTML block tags. Then an emitter could work in one of three modes:

  1. Generic (conservative) mode,
  2. HTML mode,
  3. XML mode.

In HTML mode it omits quotation marks; in XML mode it "self-closes" empty tags (which is not legal in HTML). When stepping into a non-whitelisted tag, it switches to XML.

However, since this library can be used not only for complete documents but also for fragments, an emitter should start in generic mode by default (to not accidentally break anything). There should be an optional parameter in compactHTML which specifies the initial mode. That way a user can manually enable quotes omission if they know for sure their document is an HTML document.

Any remarks on that?

Edit: I just realized there already is "HTML" word in the method's name, so specifying it explicitly would be a bit redundant. Does it attract enough attention that we could make DocumentType.html the default?

@forbjok
Copy link
Contributor

forbjok commented Nov 30, 2018

Personally, I think that attribute values should either always be quoted, or that there should be some sort of setting to specify whether the output should be "compact" (non-quoted) or fully quoted.

I've rarely if ever seen HTML with non-quoted attribute values before encountering it with htmld, and while yeah, it does save a few bytes here and there, it just feels sloppy and inconsistent to me to leave certain values unquoted even if it is technically allowed by the spec.

@SirNickolas
Copy link
Contributor Author

Yes, I also consistently quote values when have to write HTML by hand. But here we deal with machine-generated code. I see nothing bad in saving a few bytes if it is not intended to be human-readable.

If, however, it is (I don't know your use-case), then you might want to turn minification off altogether (with innerHTML/outerHTML). Or you will be able to disable just HTML-specific tricks via compactHTML(DocumentType.unknown) (not implemented yet; syntax might change): it will have "always quote" behaviour and will self-close void tags (like <br/>).

@marcioapm
Copy link
Contributor

marcioapm commented Dec 1, 2018

The idea of the compactHTML methods was to produce the smallest possible valid representation of the nodes, not for humans to look at or having feelings about, but for crawlers and browsers to ingest. Note that being XML compatible is not at all the primary goal of this library, but I do agree it doesn't hurt to aim for it. I think having the non-compact methods allow for a setting is a great idea, but compactHTML methods should try to live up to their name.

@marcioapm
Copy link
Contributor

marcioapm commented Dec 1, 2018

How about having a compactXML method? Then we could switch to this when we encounter for example an SVG node. Would be useful for someone using the library as XML as well.

@SirNickolas
Copy link
Contributor Author

SirNickolas commented Dec 1, 2018

How about having a compactXML method?

I'm afraid there would be too much duplicated code, so I'd stick with an enum parameter to compactHTML.

The idea is to be able to get the smallest possible representation that is:

  1. valid HTML (the most common use-case),
  2. valid XML,
  3. both valid HTML and XML when it's not known ahead of time what the document is (by coincidence, it's more aesthetically appealing for some people than the 1st).

Edit: By the way, there's XHTML as well.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants