Support fuzzy substring search for CJK (Chinese/Japanese/Korean) languages instead of word-based segmentation #987

@Aruelius

Hi team,

First, thank you for building Pagefind — it's an excellent static search solution with great performance and low bandwidth usage.

However, the current search experience for languages without word boundaries (especially Chinese and Japanese) is quite limited. Pagefind appears to rely on word segmentation for indexing, which works well for space-separated languages like English, but performs poorly for CJK text.

Example:
Consider the following sentence in a document:
“这是一段简单的测试文本”
Expected behavior (common user expectation in Chinese search):

Searching “一段” → matches
Searching “简单” → matches
Searching “段简” → should match (substring within “一段简单”)
Searching “是一段” → should match

Current behavior:

“一段” and “简单” usually match
“段简” or “是一段” often return no results, even though the characters are clearly present in the text.

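This is a problem because Chinese users typically expect substring containment matching (similar to Python's "keyword" in text or JavaScript's text.includes(keyword)) rather than word-based matching: a query should match wherever its characters appear consecutively in the text, even when they span a word boundary.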
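To make the difference concrete, here is a minimal Python sketch comparing the two behaviors on the example sentence. The word list is a hypothetical segmentation chosen for illustration, and word_match is a simplified stand-in for word-based lookup; neither is taken from Pagefind's actual implementation.

```python
text = "这是一段简单的测试文本"

# Hypothetical segmentation, for illustration only. A real segmenter
# (including whatever Pagefind uses) may split the sentence differently.
words = ["这是", "一段", "简单", "的", "测试", "文本"]


def word_match(query, words):
    """Simplified word-based lookup: the query must be a prefix of some word."""
    return any(word.startswith(query) for word in words)


def substring_match(query, text):
    """Containment matching: the query may start anywhere in the text."""
    return query in text


queries = ["一段", "简单", "段简", "是一段"]

# Maps each query to (word-based result, substring result).
results = {q: (word_match(q, words), substring_match(q, text)) for q in queries}

for query, (by_word, by_substring) in results.items():
    print(query, "word-based:", by_word, "substring:", by_substring)
```

Under this assumed segmentation, queries that line up with a segmented word match both ways, while queries that straddle a word boundary only match under containment.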

A great reference implementation is Fuse.js, which performs character-level fuzzy search by default and works perfectly for Chinese without any language-specific configuration. It reliably returns results for any substring within the text.

The extended release of Pagefind (which I run via npx) already improves CJK support significantly, but it still falls short of the substring matching that native speakers expect.

It would be amazing if Pagefind could offer an optional fuzzy/character-level indexing mode (perhaps behind a flag like --fuzzy-cjk or automatically for lang="zh" / ja / ko) to better serve users of spaceless languages.
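As a rough idea of what a character-level mode could look like, here is a small Python sketch of a character bigram inverted index. This is only one possible approach under my own assumptions, not a description of how Pagefind works internally; the function names (char_ngrams, build_index, search) and the sample documents are hypothetical.

```python
from collections import defaultdict


def char_ngrams(s, n=2):
    """Return every run of n consecutive characters in s."""
    return [s[i:i + n] for i in range(len(s) - n + 1)]


def build_index(docs, n=2):
    """Map each character n-gram to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for gram in char_ngrams(text, n):
            index[gram].add(doc_id)
    return index


def search(query, index, docs, n=2):
    """Return sorted ids of documents that contain query as a substring."""
    grams = char_ngrams(query, n)
    if not grams:
        # Query is shorter than n characters: fall back to a linear scan.
        return sorted(d for d, t in docs.items() if query in t)
    # Candidates must contain every n-gram of the query...
    candidates = set.intersection(*(index.get(g, set()) for g in grams))
    # ...and are then verified, since the n-grams may not be adjacent.
    return sorted(d for d in candidates if query in docs[d])


# Hypothetical sample documents.
docs = {
    "a": "这是一段简单的测试文本",
    "b": "另一个简单的例子",
}
index = build_index(docs)

print(search("段简", index, docs))
print(search("是一段", index, docs))
print(search("简单", index, docs))
```

The trade-off is index size: storing every character bigram takes more space than storing segmented words, which is why I would expect this to be opt-in or limited to CJK pages.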

Thank you for considering this improvement — it would make Pagefind much more usable for a huge portion of the web.
