Description
Hi team,
First, thank you for building Pagefind — it's an excellent static search solution with great performance and low bandwidth usage.
However, the current search experience for languages without word boundaries (especially Chinese and Japanese) is quite limited. Pagefind appears to rely on word segmentation for indexing, which works well for space-separated languages like English, but performs poorly for CJK text.
Example:
Consider the following sentence in a document:
“这是一段简单的测试文本”
Expected behavior (common user expectation in Chinese search):

- Searching “一段” → matches
- Searching “简单” → matches
- Searching “段简” → should match (substring within “一段简单”)
- Searching “是一段” → should match

Current behavior:

- “一段” and “简单” usually match
- “段简” or “是一段” often return no results, even though the characters are clearly present in the text
The mismatch arises because Chinese users typically expect substring / containment matching (similar to Python's `"keyword" in text` or JavaScript's `text.includes(keyword)`), rather than word-based matching.
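To make the expected semantics concrete, the behavior users have in mind is essentially the containment check below (plain TypeScript for illustration, using the sample sentence from the example above; this is not Pagefind code):

```ts
// The matching behavior CJK users expect: a plain substring check.
const text = "这是一段简单的测试文本";

for (const query of ["一段", "简单", "段简", "是一段"]) {
  // String.prototype.includes finds any contiguous character run,
  // so all four queries match, including "段简" and "是一段".
  console.log(query, text.includes(query)); // every line prints true
}
```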
A great reference implementation is Fuse.js, which performs character-level fuzzy search by default and works perfectly for Chinese without any language-specific configuration. It reliably returns results for any substring within the text.
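For instance, a minimal Fuse.js sketch (assuming `fuse.js` is installed from npm and default options are used) returns hits for all four queries from the example above:

```ts
import Fuse from "fuse.js";

// Index the example sentence with Fuse.js default options.
const fuse = new Fuse(["这是一段简单的测试文本"], { includeScore: true });

// All four queries return the document, including the substrings
// "段简" and "是一段" that cross word-segmentation boundaries.
for (const query of ["一段", "简单", "段简", "是一段"]) {
  console.log(query, fuse.search(query));
}
```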
The extended release of Pagefind (run via npx), which ships with specialized CJK word segmentation, already improves CJK support significantly, but it still falls short of the substring matching that native speakers expect.
It would be amazing if Pagefind could offer an optional fuzzy / character-level indexing mode (perhaps behind a flag like `--fuzzy-cjk`, or enabled automatically for `lang="zh"` / `ja` / `ko`) to better serve users of spaceless languages.
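For what such a mode could look like internally, here is a rough, hypothetical sketch of character-bigram indexing, a common dictionary-free approach for spaceless languages (used, for example, by Lucene's CJK analysis). The function names are illustrative only, not Pagefind internals:

```ts
// Hypothetical sketch: overlapping two-character tokens give
// substring-like recall without a dictionary-based segmenter.
// "一段简单" -> ["一段", "段简", "简单"]
function bigrams(text: string): string[] {
  const chars = Array.from(text); // code-point-safe split
  const out: string[] = [];
  for (let i = 0; i < chars.length - 1; i++) {
    out.push(chars[i] + chars[i + 1]);
  }
  return out;
}

// A query matches when all of its bigrams appear in the document's
// bigram set, which approximates contiguous substring matching.
// (Single-character queries would additionally need unigrams.)
function matches(doc: string, query: string): boolean {
  const indexed = new Set(bigrams(doc));
  const tokens = bigrams(query);
  return tokens.length > 0 && tokens.every((t) => indexed.has(t));
}

console.log(matches("这是一段简单的测试文本", "段简"));  // true
console.log(matches("这是一段简单的测试文本", "是一段")); // true
```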
Thank you for considering this improvement — it would make Pagefind much more usable for a huge portion of the web.