Skip to content
This repository has been archived by the owner on Feb 15, 2023. It is now read-only.

Proposal: Limited mutability API #311

Closed
nostrademons opened this issue Feb 26, 2015 · 6 comments
Closed

Proposal: Limited mutability API #311

nostrademons opened this issue Feb 26, 2015 · 6 comments
Milestone

Comments

@nostrademons
Copy link
Contributor

Based off discussion in #295. Both Sigil and Github have a need to make small local modifications to the parse tree before reserializing it out. This is currently very difficult because of the number of pointers that must be kept in sync, the possibility of introducing memory leaks by not updating them, and the need to pass a GumboParser around for the allocator.

Concrete proposal

  1. Remove the ability to set custom allocators on GumboOptions. Use the system malloc for all memory.
  2. Expose create_node, destroy_node, get_attribute, set_attribute, set_attribute_value, and the vector modification functions (add, remove, remove_at, insert_at) to the public API.

Current workaround

We currently recommend that people who want mutation wrap the whole parse tree in an API of their choice, mutate that, and then serialize it out. Gumbo's API is simple enough that a tree-walker can be written in a page or so of code, and tree traversal time is negligible compared to parse time (~1%). Several outside bindings have DOM APIs already, eg. lua-gumbo, gumbo-libxml, and the html5lib and BeautifulSoup adaptors that come with the main distribution.

Benefits

If this is useful to you, you'll probably know it immediately. :-) But enumerating them:

  1. No need to use & learn an outside library just to do mutation.
  2. Mutation can work in terms of the GumboNodes you already have; if you're doing querying or traversal already, there's no need to adapt that code to work on a different DOM representation.
  3. Well-suited to small local mutations, where it feels like overkill to have to reserialize the whole parse tree just to change one node.
  4. Possibly marginally faster, since there's no need to traverse & allocate for a new parse tree, although empirically this effect has been negligible.
  5. Simplified API in some cases, since some functions that previously needed a GumboParser/GumboOptions argument no longer do (notably gumbo_destroy_output).

There is a partial branch demonstrating some of these changes at vmg/development.

Drawbacks

  1. Incompatible with the existing allocator machinery.
  2. Incompatible with the arena change in Arena #309.
  3. Backwards-incompatible; at a minimum, this change results in signature changes for GumboOptions and gumbo_destroy_output, and exposes a half dozen or so new functions.
  4. Possibly more API surface for third-party bindings to wrap. External bindings are under no obligation to offer the full feature set of Gumbo, but if this goes in, there will likely be pressure from users to expand the feature set of them.
  5. More API surface for new users of the library to learn.
  6. Many of the existing helpers that would be exposed by this proposal are not designed for efficiency or for this usage. gumbo_get_attribute, for example, takes linear time, and gumbo_create_node wouldn't know where to insert the node in the list of next/prev pointers.

Compromise solutions

  1. Replace the custom allocators in GumboOptions with global gumbo_set_allocator/gumbo_set_deallocator functions. This restores custom allocators, but still eliminates the ability of different instances of gumbo_parse (eg. in a multithreaded program) to run with separate heaps, so eg. a per-parse arena would require locking that destroys many of the speed benefits of an arena.
  2. Have functions take an optional first parameter, perhaps a GumboOptions with the allocator/deallocator functions or a GumboArena, and use that if provided. If NULL, it falls back to the system malloc. This gets all the functionality and doesn't compromise any design options, but it still results in an ugly API, and it's very easy to make a mistake and forget what allocator you used.

Comment with a +1 or -1, or any additional comments or considerations.

@nostrademons nostrademons added this to the v1.0.0 milestone Feb 26, 2015
@rgrove
Copy link
Contributor

rgrove commented Feb 26, 2015

As a Gumbo user, I'm divided on this. Gumbo is a fantastic HTML5 parser. Is there sufficient motivation behind the mutability API to make it equally fantastic? Will the next step be to add a serializer? If so, will that be fantastic?

If the answer to those questions is yes, then this seems like an exciting possibility. But @nostrademons has presented some strong arguments against going down that path here and in other issues, and I find them convincing even though I also stand to benefit from a mutability API.

This seems like a potentially vast scope expansion for Gumbo, and something that might be better left to other projects (either existing or new). As @nostrademons points out, there are a number of third party bindings that provide DOM APIs, and those projects benefit from being able to focus on that problem while Gumbo focuses on parsing.

@kevinhendricks
Copy link
Contributor

To clarify a few things:

Drawbacks

• Incompatible with the existing allocator machinery.

But will work just fine with any global allocator, even a custom one. And of course will work with the simple malloc, realloc, and free.

• Backwards-incompatible; at a minimum, this change results in signature changes for GumboOptions and gumbo_destroy_output, and exposes a half dozen or so new functions.

As will any change to the memory allocator including the arena proposal. We are talking about v1.0 here not v0.9.5. It is really the change to memory allocation that will will need to make changes to the Options. Furthermore the changes to support fragment_parsing, templates, add rtc, and the like all create similar backwards incompatibility issues but they need to be done at some point and are now in the master tree. That is why we are proposing this for v1.0.0 not v0.9.5.

• More API surface for new users of the library to learn.

Actually if they want to walk the tree at al, they need to be up to speed on all of the vector and attribute calls and understand how they work. So it is really not much new (or shouldn't be). There are no new functions being proposed just asking to expose existing ones.

• Many of the existing helpers that would be exposed by this proposal are not designed for efficiency or for this usage. gumbo_get_attribute, for example, takes linear time, and gumbo_create_node wouldn't know where to insert the node in the list of next/prev pointers.

Based on Nostrademons recent statistical studies on the 60K+ websites, given the very low number of attributes and children nodes, being O(n) is really a moot point if 99.5% of the time n is 1 or 2.

Some other points based on an earlier response:

  1. There is already a serializer (see the examples)
  2. We are not talking about adding new code at all, just exposing currently existing code, so this is not technically an area of growth in anything but the half-dozen or so function calls that already exist being exposed. If you already wrap gumbo and use your own language DOM nothing much if at all really need change for you. The newly exposed but already existing functions will be enough for others to use to hook into the library to allow for some mutability of the tree using code external to gumbo. No new code base to support mutability is actually needed or even desired for the gumbo-parser. It will simply allow some downstream users to do more with the library than they can now.

Hope this clarifies things a bit. I realize the discussion at #295 is very very long, but it can make for interesting reading!

@craigbarnes craigbarnes mentioned this issue May 2, 2015
@sylvinus
Copy link

My 2c: I think some work could be done on the current workarounds to make them more practical without adding this to gumbo.

Gumbo is amazing because it does one thing really well and interoperates quite well with other tools that might have a wider scope.

So I'm all in favor of keeping Gumbo as simple as possible, and maybe focus on speed & other optimizations instead.

@lastmjs
Copy link

lastmjs commented Mar 3, 2017

@rgrove Are there any DOM implementations in C that you know of? Which projects do you know of that offer good DOM APIs that work in conjunction with this project?

@rgrove
Copy link
Contributor

rgrove commented Mar 3, 2017

@lastmjs Sorry, C isn't my area of expertise so I'm not sure what the landscape looks like for DOM-related projects in C. I currently use Gumbo via Nokogumbo in Ruby, which provides bindings with Nokogiri.

@kevinhendricks
Copy link
Contributor

kevinhendricks commented Mar 3, 2017 via email

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants