Skip to content

Conversation

@behdad
Copy link
Contributor

@behdad behdad commented Dec 30, 2025

This is the result of a few days vibe-coding with codex. This moves access checks from table construction time, to each individual fields's getter method.

If there's out of bounds access, this code crashes. We can see if we can codegen sanitize methods ala HB to address that and reject at font table level bad data.

Accompanying HR PR: harfbuzz/harfrust#314

Experiment details:
https://docs.google.com/document/d/1LjYFjZj8Kw8zyhqsfZg0_VgzHf2zhYUsksPCkZL7GJI/edit

@behdad behdad requested review from cmyr and dfrg December 30, 2025 23:30
@behdad

This comment was marked as outdated.

@behdad

This comment was marked as outdated.

@behdad

This comment was marked as outdated.

@behdad

This comment was marked as outdated.

behdad added 4 commits January 4, 2026 16:21
This will panic for bad font data, but good for measurements in
HarfRust.
This change is not necessary for performance gains in my measurement.
Invalid memory access on bad font data, but useful for measuring
HarfRust performance boost.
@cmyr
Copy link
Member

cmyr commented Jan 14, 2026

So this is doing two things: it is skipping any validation of the input data, and it is subsequently doing unsafe reads.

I don't see that this is actually doing any access checks in the getters?

I do see the benefit of trying to do bounds checking on some full table graph instead of always needing to do it on each read, I can try to play around with what that might look like..

@behdad
Copy link
Contributor Author

behdad commented Jan 14, 2026

I do see the benefit of trying to do bounds checking on some full table graph instead of always needing to do it on each read, I can try to play around with what that might look like..

Right. Here's some unhooked vibe-coded sanitize methods:
f9aeba4#diff-60c87c44076fdcbdd9d4c569cb2cebed11c15a8c3c7ab4a076f5380e747170f6

@behdad
Copy link
Contributor Author

behdad commented Jan 14, 2026

This is what I propose: an alternate reading path, full-on HarfBuzz style:

All the simple codegen'ed tables & structs get a sanitize() method, that returns a result object in the sanitized namespace. If the result is ok, then the entire subtree can be accessed infallibly and without bounds checks. The object is simply a typed pointer.

OpenType has the following wording:

In some cases, fields for subtable offsets are documented as permitting NULL values when the given subtable is optional. For example, in the BASE table header, the horizAxisOffset and vertAxisOffset fields may be NULL, meaning that either subtable (or both) is optional. A NULL subtable offset always indicates that the given subtable is not present. Applications must never interpret a NULL offset value as the offset to subtable data. For cases in which subtable offset fields are not documented as permitting NULL values, font compilers must include a subtable of the indicated format, even if it is a header stub without further data (for example, a coverage table with no glyph IDs). Applications parsing font data should, however, anticipate non-conformant font data that has a NULL subtable offset where only a non-NULL value is expected.

https://learn.microsoft.com/en-us/typography/opentype/spec/otff

To implement this, I suggest that we adopt the HarfBuzz null-object model, whereas if a null offset is dereferenced, a pointer to a shared singleton null object of that type is returned instead, removing codepath divergence in the callers for when an offset is null vs points to an empty object.

cc @raphlinus @dfrg @rsheeter

@dfrg
Copy link
Member

dfrg commented Jan 14, 2026

So the problem I'm seeing with the sanitize approach is that we have a TOCTOU problem unless you retain a sanitized reference to the whole checked subtree.

For HR, this means that we have to run the sanitize pass every time we construct a Shaper since we receive a fresh reference to the font data and that might not be the same one that was used to initialize the shaping caches. In the current state, if a user provides different font data, the shaping will be incorrect but the code will still be memory safe.

This might not matter for Chrome because I imagine we'll keep the Shaper type around anyway (the HB/HR code currently does this) so it might be good enough.

In either case, sanitize and read cannot really be separate-- we need to sanitize on read to guarantee safety at the API level.

@behdad
Copy link
Contributor Author

behdad commented Jan 15, 2026

So the problem I'm seeing with the sanitize approach is that we have a TOCTOU problem unless you retain a sanitized reference to the whole checked subtree.

With mmaped files, you can't get around TOCTOU since the data can change under you anyway. Just saying.

Sanitizing per Shaper is fine I think. As long as it's not per shape() call, we should be good.

@dfrg
Copy link
Member

dfrg commented Jan 15, 2026

With mmaped files, you can't get around TOCTOU since the data can change under you anyway. Just saying.

Point taken. I believe there are still ongoing arguments about the combination of mmap and Rust. Same for anything else that lets you poke at process memory.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants