
Conversation

yakovsh (Contributor) commented Feb 5, 2026

This PR adds support for grouping fragments into a smaller number of files by combining fragments whose hashes share the same first N characters into the same file (issues #74 and #739). I tried to keep the changes minimally invasive and added some tests. Let me know if anything needs to be adjusted or changed.
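
For reference, here is a minimal sketch of the grouping idea described above, assuming each fragment is identified by a hex hash string. The names (`group_fragments`, `prefix_len`) are illustrative only and do not correspond to Pagefind's internal API.

```rust
use std::collections::BTreeMap;

/// Group fragment hashes by their first `prefix_len` characters.
/// Fragments whose hashes share a prefix end up in the same output file.
/// (Illustrative names only; this is not Pagefind's internal API.)
fn group_fragments(hashes: &[String], prefix_len: usize) -> BTreeMap<String, Vec<String>> {
    let mut groups: BTreeMap<String, Vec<String>> = BTreeMap::new();
    for hash in hashes {
        let prefix: String = hash.chars().take(prefix_len).collect();
        groups.entry(prefix).or_default().push(hash.clone());
    }
    groups
}

fn main() {
    let hashes = vec![
        "en_a1b2c3".to_string(),
        "en_a1f9e8".to_string(),
        "en_b77d01".to_string(),
    ];
    // With a 4-character prefix, the first two hashes land in the same file.
    for (prefix, members) in group_fragments(&hashes, 4) {
        println!("{prefix}: {} fragment(s)", members.len());
    }
}
```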

yakovsh requested a review from bglw as a code owner on February 5, 2026 at 04:08
bglw (Member) commented Feb 8, 2026

👋 Thanks for the PR!

This is a feature we need, though this isn't quite how I'd like to approach it. Some notes:

  1. I don't love shipping options that are tied to implementation details rather than to what someone actually wants to achieve. Rather than --fragment-group-len, we should just specify something like --max-fragments.
  2. This does require more explicit grouping when indexing, but one bonus would be storing the shorter hashes in the meta index file. For large sites this meta file gets somewhat large, so reducing the length and quantity of the fragment hashes stored in it is an extra win.
  3. The grouped (multi-page) fragments should be cached by the JS in the browser, rather than loading and parsing the chunk and discarding the rest each time.
  4. The test should cover a scenario where the grouping actually has an effect.

Let me know how you want to proceed with this one: whether you want to reshape this PR, or whether you'd rather I take a crack at this feature after 1.5.0 ships :)

yakovsh (Contributor, Author) commented Feb 10, 2026

With the "--max-fragments" approach, we would first need to know how many total pages there are, then decide how many fragments to generate. If "--max-fragments" is larger than the total number of pages, I assume things can stay as they are today. Otherwise, once we know the total page count, each fragment would hold roughly pages / max_fragments pages.

(An alternative approach would be to specify how many pages we store per fragment, something like "--pages-per-fragment", which could be a little cleaner but is probably too tied to implementation details.)
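
A rough sketch of the arithmetic described above, assuming --max-fragments caps the number of fragment files; pages_per_fragment is a hypothetical helper, not anything in the PR.

```rust
/// If the site has fewer pages than the cap, behaviour stays as it is today;
/// otherwise pages are spread across at most `max_fragments` files.
/// (Hypothetical helper for illustration, not part of this PR.)
fn pages_per_fragment(total_pages: usize, max_fragments: usize) -> usize {
    if total_pages <= max_fragments {
        1 // one page per fragment file, as today
    } else {
        // Ceiling division so the cap is never exceeded.
        (total_pages + max_fragments - 1) / max_fragments
    }
}

fn main() {
    assert_eq!(pages_per_fragment(500, 1_000), 1); // below the cap: unchanged
    assert_eq!(pages_per_fragment(10_000, 256), 40); // 10,000 pages into at most 256 files
}
```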

Then, instead of using page hashes, the fragments could be named like "en_001.pf_fragment", "en_002.pf_fragment", etc., and the metadata that ties pages to fragments could be numeric ("1, 2, 3") instead of the hashes used today?

Is this how you understand it?

bglw (Member) commented Feb 10, 2026

No, we do still need hashes. In fact, an issue I didn't highlight with this PR is that reducing the hash prefix too far will cause cache collisions. One of the jobs of the hashes is to allow indefinite caching of Pagefind assets, since they naturally cache-bust when content changes. An en_001.pf_fragment file would thus be cached ~forever and would serve stale data. These files will need to be named with a hash of all the fragments within (or a hash of the hashes, etc.).
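
A minimal sketch of deriving a group filename from its members' hashes, so the name changes whenever any member changes. DefaultHasher is just a stand-in here; Pagefind's actual hashing would differ, and the names are illustrative only.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Name a grouped fragment file from the hashes of the fragments inside it,
/// so the filename changes whenever any member's content changes.
/// (DefaultHasher is not stable across Rust releases and is used here
/// purely for illustration.)
fn group_file_hash(member_hashes: &[String]) -> String {
    let mut sorted = member_hashes.to_vec();
    sorted.sort(); // order-independent: the same members always yield the same name
    let mut hasher = DefaultHasher::new();
    for h in &sorted {
        h.hash(&mut hasher);
    }
    format!("{:016x}", hasher.finish())
}

fn main() {
    let group = vec!["a1b2c3".to_string(), "d4e5f6".to_string()];
    println!("en_{}.pf_fragment", group_file_hash(&group));
}
```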

yakovsh (Contributor, Author) commented Feb 10, 2026

I hadn't thought of that, since in my use case I refresh the cache manually via a CDN. Let me try to reshape the PR.

