Description
Hi team,
First, thank you for building Pagefind — it's an excellent static search solution with great performance and low bandwidth usage.
However, the current search experience for languages without word boundaries (especially Chinese and Japanese) is quite limited. Pagefind appears to rely on word segmentation for indexing, which works well for space-separated languages like English, but performs poorly for CJK text.
Example:
Consider the following sentence in a document:
“这是一段简单的测试文本”
Expected behavior (common user expectation in Chinese search):

- Searching “一段” → matches
- Searching “简单” → matches
- Searching “段简” → should match (substring within “一段简单”)
- Searching “是一段” → should match

Current behavior:

- “一段” and “简单” usually match
- “段简” or “是一段” often return no results, even though the characters are clearly present in the text
The mismatch arises because Chinese users typically expect substring / containment matching (similar to Python's `"keyword" in text` or JavaScript's `text.includes(keyword)`), rather than word-based matching.
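To make the expected semantics concrete, the behavior users have in mind is essentially the containment check below (plain TypeScript for illustration, using the sample sentence from the example above; this is not Pagefind code):

```ts
// The matching behavior CJK users expect: a plain substring check.
const text = "这是一段简单的测试文本";

for (const query of ["一段", "简单", "段简", "是一段"]) {
  // String.prototype.includes finds any contiguous character run,
  // so all four queries match, including "段简" and "是一段".
  console.log(query, text.includes(query)); // every line prints true
}
```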
A great reference implementation is Fuse.js, which performs character-level fuzzy search by default and works perfectly for Chinese without any language-specific configuration. It reliably returns results for any substring within the text.
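For instance, a minimal Fuse.js sketch (assuming `fuse.js` is installed from npm and default options are used) returns hits for all four queries from the example above:

```ts
import Fuse from "fuse.js";

// Index the example sentence with Fuse.js default options.
const fuse = new Fuse(["这是一段简单的测试文本"], { includeScore: true });

// All four queries return the document, including the substrings
// "段简" and "是一段" that cross word-segmentation boundaries.
for (const query of ["一段", "简单", "段简", "是一段"]) {
  console.log(query, fuse.search(query));
}
```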
The extended release of Pagefind (run via npx), which ships with specialized CJK word segmentation, already improves CJK support significantly, but it still falls short of the substring matching that native speakers expect.
It would be amazing if Pagefind could offer an optional fuzzy / character-level indexing mode (perhaps behind a flag like `--fuzzy-cjk`, or enabled automatically for `lang="zh"` / `ja` / `ko`) to better serve users of spaceless languages.
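For what such a mode could look like internally, here is a rough, hypothetical sketch of character-bigram indexing, a common dictionary-free approach for spaceless languages (used, for example, by Lucene's CJK analysis). The function names are illustrative only, not Pagefind internals:

```ts
// Hypothetical sketch: overlapping two-character tokens give
// substring-like recall without a dictionary-based segmenter.
// "一段简单" -> ["一段", "段简", "简单"]
function bigrams(text: string): string[] {
  const chars = Array.from(text); // code-point-safe split
  const out: string[] = [];
  for (let i = 0; i < chars.length - 1; i++) {
    out.push(chars[i] + chars[i + 1]);
  }
  return out;
}

// A query matches when all of its bigrams appear in the document's
// bigram set, which approximates contiguous substring matching.
// (Single-character queries would additionally need unigrams.)
function matches(doc: string, query: string): boolean {
  const indexed = new Set(bigrams(doc));
  const tokens = bigrams(query);
  return tokens.length > 0 && tokens.every((t) => indexed.has(t));
}

console.log(matches("这是一段简单的测试文本", "段简"));  // true
console.log(matches("这是一段简单的测试文本", "是一段")); // true
```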
Thank you for considering this improvement — it would make Pagefind much more usable for a huge portion of the web.