feat: pluggable chunker registry with settings integration #105

Open
dukelion wants to merge 2 commits into cocoindex-io:main from dukelion:feat/pluggable-chunkers

@dukelion

Problem

The default RecursiveSplitter uses a fixed line-window strategy for all file types. This works well for most code but produces poor chunks for structured formats like TOML, SQL, HCL, or Ansible where semantic boundaries don't align with line counts. There was no way to override splitting logic per file type without modifying the core indexer.

Solution

A pluggable chunker registry: users map file extensions to custom splitting functions via settings.yml. No code changes required to activate.

```yaml
chunkers:
  - ext: toml
    module: my_package.chunkers:toml_chunker
```

Any callable with signature (Path, str) -> (str | None, list[Chunk]) importable from the project's venv can be registered. The language_override return value lets chunkers correct language detection for ambiguous extensions (e.g. .sls files that are Python renderers, not YAML).
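
To make the signature concrete, here is a minimal sketch of a chunker for the .sls case mentioned above. The Chunk dataclass is a stand-in with hypothetical field names (the real Chunk and TextPosition types come from upstream and may look different), and sls_chunker is an illustrative name, not part of this PR.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, Optional


@dataclass
class Chunk:
    """Stand-in for the upstream Chunk type; these field names are hypothetical."""
    text: str
    start_line: int
    end_line: int


# Same shape as the signature described above: (Path, str) -> (str | None, list[Chunk])
ChunkerFn = Callable[[Path, str], tuple[Optional[str], list[Chunk]]]


def sls_chunker(path: Path, content: str) -> tuple[Optional[str], list[Chunk]]:
    """Return the whole file as one chunk and override the language for #!py files."""
    lines = content.splitlines()
    # A "#!py" first line marks a Python renderer; otherwise keep the detected language.
    language = "python" if lines and lines[0].startswith("#!py") else None
    chunk = Chunk(text=content, start_line=1, end_line=max(len(lines), 1))
    return language, [chunk]
```

Returning None as the first element leaves language detection untouched, so a chunker only needs to override it for the files it is sure about.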

Changes

| File | Change |
| --- | --- |
| cocoindex_code/chunking.py | New public API module exporting ChunkerFn, CHUNKER_REGISTRY, Chunk, TextPosition. Single import path for chunker authors. |
| cocoindex_code/settings.py | New ChunkerMapping(ext, module) dataclass and ProjectSettings.chunkers field. Mirrors existing language_overrides pattern. |
| cocoindex_code/daemon.py | Resolves ChunkerMapping entries to callables via importlib at project load time. Fails loudly at startup on bad config, not per-file. |
| cocoindex_code/project.py | Project.create() gains optional chunker_registry parameter. Empty by default, so zero behavioural delta for existing users. |
| cocoindex_code/indexer.py | process_file checks registry per suffix; falls through to RecursiveSplitter unchanged when no match. |
| tests/example_toml_chunker.py | Demo implementation splitting TOML at [section] headers. Lives in tests/ to signal it belongs to a downstream package. |
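
The resolution step in daemon.py can be pictured roughly as follows. This is a sketch under assumptions, not the code in this PR: resolve_chunker and resolve_registry are hypothetical names, mappings are plain (ext, spec) tuples rather than ChunkerMapping instances, and keying the registry by a dotted suffix is a guess at how the lookup is normalised.

```python
import importlib
from typing import Any, Callable


def resolve_chunker(spec: str) -> Callable[..., Any]:
    """Resolve a 'module.path:callable' string to the callable it names."""
    module_name, sep, attr = spec.partition(":")
    if not sep or not module_name or not attr:
        raise ValueError(f"chunker spec must look like 'module.path:callable', got {spec!r}")
    module = importlib.import_module(module_name)  # raises ImportError on a bad module
    try:
        obj = getattr(module, attr)
    except AttributeError as exc:
        raise ValueError(f"module {module_name!r} has no attribute {attr!r}") from exc
    # Guard here so a bad entry fails at startup, not with a TypeError per file.
    if not callable(obj):
        raise TypeError(f"{spec!r} resolved to a non-callable {type(obj).__name__}")
    return obj


def resolve_registry(mappings: list[tuple[str, str]]) -> dict[str, Callable[..., Any]]:
    """Build a suffix -> callable map; the leading-dot key format is an assumption."""
    return {f".{ext.lstrip('.')}": resolve_chunker(spec) for ext, spec in mappings}
```

Resolving every entry eagerly means a typo in settings.yml surfaces once, when the project loads, instead of on each indexed file.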

Design decisions

  • ChunkerFn is a Callable alias, not a Protocol — consistent with codebase style
  • tracked=False — callables are not fingerprint-able, consistent with SQLITE_DB, CODEBASE_DIR, and other non-serialisable context keys; changing a chunker requires a daemon restart which triggers a full re-index anyway
  • Resolution lives in daemon.py, its only call site — keeps settings.py as pure schema/IO and chunking.py as pure type definitions

Commits

  • chore: enable mypy explicit_package_bases, remove 16 now-unused type: ignore comments in test_daemon.py (standalone, bisectable)
  • feat: pluggable chunker registry

…comments

explicit_package_bases = true resolves module paths from the repo root,
preventing double-discovery of files in tests/ under different module names.
Required for tests/example_toml_chunker.py to be importable as
example_toml_chunker rather than an ambiguous bare module.

Also sets asyncio_mode = auto so pytest-asyncio behaviour is explicit
(default in 1.3.0 is strict).

With explicit_package_bases active, mypy can fully resolve the Response
union in test_daemon.py — the 16 type: ignore[union-attr] and
type: ignore[attr-defined] comments that were suppressing false positives
under the old resolution are now unused and removed.

Improves retrieval precision by letting users split specific file types at
semantic boundaries (e.g. TOML sections, SQL statements) instead of the
default line-window splitter.

## What

- cocoindex_code/chunking.py: public API module exporting ChunkerFn
  (Callable alias), CHUNKER_REGISTRY context key, and re-exports of
  Chunk/TextPosition from upstream. Single import path for chunker authors.

- tests/example_toml_chunker.py: demo chunker splitting at [section]
  headers; excludes [[array_of_tables]] via negative lookahead. Lives in
  tests/ to signal it belongs to a separate package, not the core library.

- ProjectSettings.chunkers: new list[ChunkerMapping] field, serialised as
  YAML. Each entry maps a file extension to a 'module.path:callable' string.
  Users activate chunkers by editing .cocoindex_code/settings.yml — no code
  changes required.

- daemon.py: _resolve_chunker_registry resolves ChunkerMapping entries via
  importlib at project load time and passes the result to Project.create().
  callable() guard gives a clear error at startup rather than a TypeError
  per file.

- Project.create(chunker_registry=...): new optional parameter. Injected as
  a cocoindex context key (tracked=False) rather than exposed via env internals.
  Empty registry by default — zero behavioural delta for existing users.

- indexer.py: process_file checks the registry per file suffix; falls through
  to RecursiveSplitter unchanged when no chunker is registered.
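
The [section] splitting with a negative lookahead described for the demo chunker could look roughly like this. It is an illustrative sketch, not the contents of tests/example_toml_chunker.py: the pattern, the function name split_toml_sections, and the choice to return plain strings instead of Chunk objects are all assumptions.

```python
import re

# Match a table header like [server] at the start of a line.
# (?!\[) is the negative lookahead that rejects [[array_of_tables]] headers.
SECTION_HEADER = re.compile(r"^\[(?!\[)[^\]\n]+\][ \t]*(?:#.*)?$", re.MULTILINE)


def split_toml_sections(text: str) -> list[str]:
    """Split TOML text at [section] headers, keeping any preamble as its own part."""
    starts = [m.start() for m in SECTION_HEADER.finditer(text)]
    if not starts or starts[0] != 0:
        starts.insert(0, 0)  # top-level keys before the first header
    bounds = starts + [len(text)]
    # Drop parts that contain only whitespace.
    return [text[a:b] for a, b in zip(bounds, bounds[1:]) if text[a:b].strip()]
```

Because [[array_of_tables]] headers are skipped, their entries stay attached to the preceding [section] rather than becoming separate chunks.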

## Design decisions

- ChunkerFn returns (language_override, chunks): language_override=None keeps
  detect_code_language() result; non-None lets the chunker correct it (e.g.
  .sls files starting with #!py).

- tracked=False is consistent with SQLITE_DB, CODEBASE_DIR, and other
  non-serialisable context keys. Changing a chunker requires a daemon restart,
  which triggers a full re-index anyway.

- _resolve_chunker_registry lives in daemon.py, its only call site, keeping
  settings.py as pure schema/IO and chunking.py as pure type definitions.
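
Taken together, the lookup and fall-through in process_file might behave like the sketch below. Everything here is a stand-in: detect_language_stub replaces detect_code_language(), default_split replaces RecursiveSplitter, chunks are plain strings, and chunk_file is a hypothetical name.

```python
from pathlib import Path
from typing import Callable, Optional

ChunkerFn = Callable[[Path, str], tuple[Optional[str], list[str]]]


def detect_language_stub(path: Path) -> Optional[str]:
    """Hypothetical stand-in for detect_code_language()."""
    return {".py": "python", ".sls": "yaml"}.get(path.suffix)


def default_split(content: str, window: int = 2) -> list[str]:
    """Stand-in for the default splitter: fixed windows of `window` lines."""
    lines = content.splitlines()
    return ["\n".join(lines[i:i + window]) for i in range(0, len(lines), window)]


def chunk_file(
    path: Path, content: str, registry: dict[str, ChunkerFn]
) -> tuple[Optional[str], list[str]]:
    language = detect_language_stub(path)
    chunker = registry.get(path.suffix)
    if chunker is None:
        # No chunker registered for this suffix: default behaviour is unchanged.
        return language, default_split(content)
    override, chunks = chunker(path, content)
    # None keeps the detected language; any other value replaces it.
    return (override if override is not None else language), chunks
```

With an empty registry every file takes the first branch, which is why existing users see no behavioural change.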