I have at least partially confirmed high memory use for large HTML files.

Summary

The word size of the result of parsing HTML can be significantly larger than the source HTML, increasing with node density.

A 19MB HTML file is parsed by Meeseeks to an estimated 280MB document (excluding binary data size), while a 500KB HTML file is parsed to an estimated 3MB (excluding binary data size).

This increase is less significant when parsing into a tuple-tree, and Html5ever (the Elixir library) parses the same files to tuple-trees around 112MB and 1.3MB (excluding binary data size) respectively.

Data

For a 500KB (on disk) HTML file parsing to about 11,000 nodes, :erts_debug.flat_size/1 returns the following when used on the result of parsing:

Meeseeks (parse/1): ~375,000 words (3MB at 8 bytes per word on a 64-bit system, or 272 bytes/node)
Html5ever (flat_parse/1): ~350,000 words (2.8MB, or 254 bytes/node)
Html5ever (parse/1): ~160,000 words (1.3MB, or 116 bytes/node)

This excludes most of the binary data (binaries larger than 64 bytes are stored off-heap, so only their references are counted), but it already shows how much more memory a large HTML document uses once parsed.
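For anyone who wants to reproduce these figures, here is a minimal sketch of the word-to-byte conversion used above. The module and function names (`SizeEstimate`, `flat_bytes/1`, `bytes_per_node/2`) are made up for illustration, and a generated list of tuples stands in for a parsed document so the snippet runs without Meeseeks or Html5ever.

```elixir
defmodule SizeEstimate do
  # Bytes per machine word on this VM (8 on a 64-bit system).
  def vm_word_size, do: :erlang.system_info(:wordsize)

  # Convert a word count to bytes.
  def words_to_bytes(words, word_size \\ vm_word_size()), do: words * word_size

  # Flat (unshared) size of a term in bytes. Binaries larger than
  # 64 bytes live off-heap and only their references are counted.
  def flat_bytes(term), do: words_to_bytes(:erts_debug.flat_size(term))

  # Average bytes per node, given a node count obtained separately.
  def bytes_per_node(term, node_count) when node_count > 0 do
    flat_bytes(term) / node_count
  end
end

# Stand-in for a parsed document (hypothetical data, not real parser output).
term = for i <- 1..1_000, do: {"p", [], ["text #{i}"]}

IO.puts("flat size: #{SizeEstimate.flat_bytes(term)} bytes")
IO.puts("per node:  #{SizeEstimate.bytes_per_node(term, 1_000)} bytes")
```

In practice `term` would be whatever the parser returns, and the node count would have to come from the parser or a traversal of the result.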

Both implementations that use flat-maps to represent the HTML document weigh in at ~6x the original document size, and even the tuple-tree representation weighs in at ~2.5x.

So how does this scale up to even larger documents? To find out I created a file duplicating the largest portion of the original document (a big <table>) 100 times, leaving me with a 19MB (on disk) HTML document that parsed to around 1,030,000 nodes.

Meeseeks (parse/1): ~35,000,000 words (280MB, or 271 bytes/node)
Html5ever (flat_parse/1): ~30,000,000 words (240MB, or 233 bytes/node)
Html5ever (parse/1): ~14,000,000 words (112MB, or 109 bytes/node)

The bytes per node have stayed nearly constant, so total size has scaled close to linearly with node count. Given the density of nodes in this file, though, the flat-map representations are ~14x and ~12x the original document size, while the tuple-tree representation is ~5.5x.

In both examples, the flat-map representations seem to be roughly 2x to 2.5x the size of the tuple-tree representation.

Memory Pressure

Sampling total memory allocation with :erlang.memory/0 before and after parsing shows a difference in memory efficiency between tuple-trees (more efficient) and flat-maps similar to the one suggested by word size, though the Html5ever flat-map seems particularly inefficient.

500KB File:

Meeseeks (parse/1): ~+8.5MB
Html5ever (flat_parse/1): ~+18.5MB
Html5ever (parse/1): ~+3MB

19MB File:

Meeseeks (parse/1): ~+775MB
Html5ever (flat_parse/1): ~+1300MB
Html5ever (parse/1): ~+225MB
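The before/after sampling described above can be sketched roughly as follows. `MemorySample.measure/1` is a hypothetical helper, not part of any library, and the list-building function is only a stand-in for a parser call. The delta is noisy because other processes in the VM allocate and free memory at the same time, so treat it as an approximation.

```elixir
defmodule MemorySample do
  # Runs `fun` and returns {result, delta_bytes}, where delta_bytes is the
  # change in total VM memory (as reported by :erlang.memory/0) around the call.
  def measure(fun) when is_function(fun, 0) do
    # Collect garbage in the calling process first to reduce noise.
    :erlang.garbage_collect()
    before_total = Keyword.fetch!(:erlang.memory(), :total)

    result = fun.()

    after_total = Keyword.fetch!(:erlang.memory(), :total)
    {result, after_total - before_total}
  end
end

# Stand-in workload; replace the function body with the parse call under test.
{_result, delta} = MemorySample.measure(fn -> Enum.to_list(1..100_000) end)
IO.puts("total memory changed by #{delta} bytes")
```

Returning the result alongside the delta keeps the parsed term reachable while the second sample is taken, so it is not garbage collected before being measured.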

Discussion

It is not unexpected that the flat-map representation would be bigger than the tuple-tree representation, since flat-maps both make explicit the relationship between nodes that a tuple-tree holds implicitly (more data), and represent nodes using maps instead of tuples (less memory efficient).
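To make the comparison concrete, here is a small sketch that builds the same toy document both ways and compares word sizes. The node shapes are simplified and hypothetical; they are not the actual structures Meeseeks or Html5ever produce, only an illustration of implicit nesting versus explicit parent and child ids stored in maps.

```elixir
defmodule ShapeDemo do
  # Tuple-tree: {tag, attributes, children}. Parent/child relationships
  # are implicit in the nesting.
  def tuple_tree(n) do
    children = for i <- 1..n, do: {"p", [], ["text #{i}"]}
    {"div", [], children}
  end

  # Flat-map: every node is a map stored under its id, with explicit
  # parent and children ids (hypothetical, simplified shape).
  def flat_map(n) do
    nodes =
      for i <- 1..n, reduce: %{} do
        acc ->
          p_id = 2 * i - 1
          text_id = 2 * i

          acc
          |> Map.put(p_id, %{id: p_id, tag: "p", attributes: [], parent: 0, children: [text_id]})
          |> Map.put(text_id, %{id: text_id, text: "text #{i}", parent: p_id})
      end

    root_children = Enum.map(1..n, &(2 * &1 - 1))
    root = %{id: 0, tag: "div", attributes: [], parent: nil, children: root_children}
    Map.put(nodes, 0, root)
  end
end

tree_words = :erts_debug.flat_size(ShapeDemo.tuple_tree(1_000))
flat_words = :erts_debug.flat_size(ShapeDemo.flat_map(1_000))

IO.puts("tuple-tree: #{tree_words} words")
IO.puts("flat-map:   #{flat_words} words")
```

The exact ratio depends on the node shape, so this toy example should not be expected to match the measured figures above.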

It is unfortunate, however, that parsing a 19MB (albeit node-dense) HTML file with Meeseeks can yield ~300-800MB of memory usage (about 280MB estimated from word size, about 775MB observed via :erlang.memory/0).

Next Steps

I'm not sure yet. I'm open to suggestions.

Parsing large HTML files uses way too much memory #31
