zmre (Contributor) commented Jan 17, 2026

I'm using pagefind on some large folders and it takes quite a while. I noticed the CPU wasn't being taxed during indexing and started to investigate whether I could speed things up. I tried external approaches first, such as preloading all files into memory in batches and running them through pagefind, but those changes had very little impact.

So I started poking at pagefind code myself trying to see if I could remove some of the loops that blocked and processed files one at a time, but I didn't get very far on my own.

I leveraged Opus and Claude Code to see if I could find a way to make this work, using the custom skills, quality controls, and Rust standards that I use elsewhere. This has created a fairly big set of changes, which I've reviewed to the best of my ability given my lack of familiarity with pagefind internals.

You may not like AI-generated code, and if that's a blocker for you, I understand. I spent quite a bit of time cycling on this, improving the code, and running benchmarks, so this isn't without a fair bit of human effort.

But here's the deal: indexing a large directory (I'm using 44k Wikipedia files as a test) is now more than 2x faster, and all tests pass.

I hope this can help the project.

bglw (Member) commented Jan 17, 2026

Hi!

This is work that needed doing, and this seems like the correct approach. In its nascent days, Pagefind was far more IO-bound, so I recall the tokio pattern winning over rayon. Since then it does a lot more work, though, so this change makes total sense as a performance improvement.
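
To illustrate the distinction: when per-file work is CPU-bound, spreading files across OS threads keeps every core busy, whereas an async runtime mainly helps while tasks are waiting on IO. The following is a minimal sketch of that idea using only the standard library's scoped threads rather than rayon, and it is not the code in this PR. `index_file` and `index_all` are hypothetical names, and the word count is a stand-in for real indexing work.

```rust
use std::thread;

/// Hypothetical stand-in for CPU-bound per-file indexing work.
fn index_file(contents: &str) -> usize {
    contents.split_whitespace().count()
}

/// Splits `files` into roughly one chunk per worker and processes the
/// chunks on separate threads. Results are returned in input order.
fn index_all(files: &[String], workers: usize) -> Vec<usize> {
    if files.is_empty() {
        return Vec::new();
    }
    let workers = workers.max(1);
    // Ceiling division so that every file lands in some chunk.
    let chunk_size = (files.len() + workers - 1) / workers;

    thread::scope(|scope| {
        // Spawn one scoped thread per chunk; scoped threads may borrow `files`.
        let handles: Vec<_> = files
            .chunks(chunk_size)
            .map(|chunk| {
                scope.spawn(move || {
                    chunk.iter().map(|f| index_file(f)).collect::<Vec<usize>>()
                })
            })
            .collect();

        // Join in spawn order so the output order matches the input order.
        handles
            .into_iter()
            .flat_map(|handle| handle.join().expect("worker thread panicked"))
            .collect()
    })
}
```

A caller would typically pick `workers` from `std::thread::available_parallelism()`. A work-stealing pool such as rayon balances uneven file sizes better than the fixed chunks used here.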

In principle, I prefer to do large refactors myself (so that I am comfortable maintaining the code going forward). That said, from skimming this PR it all looks about right, and it comes in as a smaller diff than I would have predicted for this refactor, so I'm happy to review this and merge it in. Plus, one of the goals of the large test suite here is to be confident in changes under the hood.

There is a beta for 1.5.0 out right now [1], so I wouldn't want to merge this before putting that stable release out, both for peace of mind and so that I have time to properly review the changes. That means this won't land in a release imminently. After 1.5.0 lands, I could put together an alpha with this and some other post-1.5 changes.

Footnotes

  1. which funnily enough does speed up Pagefind by ~18% over 1.4.0, purely by remembering to turn on LTO

zmre (Contributor, Author) commented Jan 17, 2026

Glad you like the approach. I saw the 1.5.0 release notes and they look great. I'm sad to wait longer for the speedup now that the changes exist, but I agree with your plan, and I'll look forward to the 1.5 updates and the speed improvements that follow. Thanks for looking.

For what it's worth, I also played with caching language stemmers instead of creating one for every file in the hopes that would bring some big gains, but it came out at about a 3% bump so I didn't pursue that beyond an initial PoC phase. Something for the future though, perhaps.
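
For reference, the caching idea can be sketched roughly as follows: build one stemmer per language on first use and reuse it for every later file in that language. This is only an illustration under assumptions, not the PoC itself. `Stemmer` and `StemmerCache` are hypothetical types, and the toy `stem` method stands in for a real stemming library.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a language stemmer that is costly to construct.
struct Stemmer {
    language: String,
}

impl Stemmer {
    fn new(language: &str) -> Self {
        Stemmer { language: language.to_string() }
    }

    fn language(&self) -> &str {
        &self.language
    }

    /// Toy "stemming": strips one trailing 's'. Real stemmers do far more.
    fn stem<'a>(&self, word: &'a str) -> &'a str {
        word.strip_suffix('s').unwrap_or(word)
    }
}

/// Builds each language's stemmer once and hands out references to it.
#[derive(Default)]
struct StemmerCache {
    stemmers: HashMap<String, Stemmer>,
    /// Number of stemmers actually constructed, kept for illustration.
    built: usize,
}

impl StemmerCache {
    fn get(&mut self, language: &str) -> &Stemmer {
        if !self.stemmers.contains_key(language) {
            self.built += 1;
            self.stemmers
                .insert(language.to_string(), Stemmer::new(language));
        }
        &self.stemmers[language]
    }
}
```

With files processed in parallel, a cache like this would need to be per thread or shared behind synchronization, which may partly explain why the measured gain was small.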

bglw (Member) left a comment

I've taken a while to tackle some of the final 1.5.0 work, and I don't want this to stagnate, so I'll merge it in ahead of that after all 🙂

I have tacked a few changes onto your branch. The main concern I had was maintaining both sync and async paths for all the entrypoints into indexing, so I've just removed the async variants.

Additionally, the blocking in the service API needed some extra handling so as to not block the runtime of users of the Rust crate.

bglw merged commit dbc1e52 into Pagefind:main Feb 8, 2026
8 of 9 checks passed
zmre (Contributor, Author) commented Feb 8, 2026

Hurray! Glad it will be in 1.5. Can't wait for that. Thanks for all your work here.
