feat: pluggable chunker registry with settings integration #105
Open
dukelion wants to merge 2 commits into cocoindex-io:main
…comments

explicit_package_bases = true resolves module paths from the repo root, preventing double-discovery of files in tests/ under different module names. Required for tests/example_toml_chunker.py to be importable as example_toml_chunker rather than as an ambiguous bare module.

Also sets asyncio_mode = auto so pytest-asyncio behaviour is explicit (the default in 1.3.0 is strict).

With explicit_package_bases active, mypy can fully resolve the Response union in test_daemon.py. The 16 type: ignore[union-attr] and type: ignore[attr-defined] comments that were suppressing false positives under the old resolution are now unused and have been removed.
Improves retrieval precision by letting users split specific file types at semantic boundaries (e.g. TOML sections, SQL statements) instead of the default line-window splitter.

## What

- cocoindex_code/chunking.py: public API module exporting ChunkerFn (Callable alias), CHUNKER_REGISTRY context key, and re-exports of Chunk/TextPosition from upstream. Single import path for chunker authors.
- tests/example_toml_chunker.py: demo chunker splitting at [section] headers; excludes [[array_of_tables]] via negative lookahead. Lives in tests/ to signal it belongs to a separate package, not the core library.
- ProjectSettings.chunkers: new list[ChunkerMapping] field, serialised as YAML. Each entry maps a file extension to a 'module.path:callable' string. Users activate chunkers by editing .cocoindex_code/settings.yml; no code changes required.
- daemon.py: _resolve_chunker_registry resolves ChunkerMapping entries via importlib at project load time and passes the result to Project.create(). A callable() guard gives a clear error at startup rather than a TypeError per file.
- Project.create(chunker_registry=...): new optional parameter. Injected as a cocoindex context key (tracked=False) rather than exposed via env internals. Empty registry by default, so zero behavioural delta for existing users.
- indexer.py: process_file checks the registry per file suffix; falls through to RecursiveSplitter unchanged when no chunker is registered.

## Design decisions

- ChunkerFn returns (language_override, chunks): language_override=None keeps the detect_code_language() result; non-None lets the chunker correct it (e.g. .sls files starting with #!py).
- tracked=False is consistent with SQLITE_DB, CODEBASE_DIR, and other non-serialisable context keys. Changing a chunker requires a daemon restart, which triggers a full re-index anyway.
- _resolve_chunker_registry lives in daemon.py, its only call site, keeping settings.py as pure schema/IO and chunking.py as pure type definitions.
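For readers who want to see the startup-time resolution step concretely, here is a minimal sketch of turning a 'module.path:callable' string into a callable with importlib and a callable() guard. It is not the PR's `_resolve_chunker_registry`: the function names `resolve_callable` and `resolve_registry`, the `(ext, spec)` tuple shape, and the use of dot-prefixed extensions as registry keys are all assumptions made for illustration.

```python
from __future__ import annotations

import importlib
from typing import Any, Callable


def resolve_callable(spec: str) -> Callable[..., Any]:
    """Resolve a 'module.path:callable' string to the object it names.

    Hypothetical helper; the real resolver in daemon.py may differ.
    """
    module_path, sep, attr = spec.partition(":")
    if not sep or not module_path or not attr:
        raise ValueError(f"expected 'module.path:callable', got {spec!r}")
    module = importlib.import_module(module_path)  # ImportError on a bad module
    obj = getattr(module, attr)  # AttributeError on a missing name
    if not callable(obj):
        # Fail once at startup instead of with a TypeError for every file.
        raise TypeError(f"{spec!r} resolved to non-callable {type(obj).__name__}")
    return obj


def resolve_registry(mappings: list[tuple[str, str]]) -> dict[str, Callable[..., Any]]:
    """Build {extension: callable}; the first bad entry aborts the whole load."""
    return {ext: resolve_callable(spec) for ext, spec in mappings}
```

Because every entry is resolved eagerly, a typo in the settings file surfaces when the project loads rather than midway through indexing.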
## Problem

The default `RecursiveSplitter` uses a fixed line-window strategy for all file types. This works well for most code but produces poor chunks for structured formats like TOML, SQL, HCL, or Ansible, where semantic boundaries don't align with line counts. There was no way to override splitting logic per file type without modifying the core indexer.

## Solution

A pluggable chunker registry: users map file extensions to custom splitting functions via `settings.yml`. No code changes are required to activate it.

Any callable with signature `(Path, str) -> (str | None, list[Chunk])` that is importable from the project's venv can be registered. The `language_override` return value lets chunkers correct language detection for ambiguous extensions (e.g. `.sls` files that are Python renderers, not YAML).

## Changes

| File | Change |
| --- | --- |
| `cocoindex_code/chunking.py` | Public API module exporting `ChunkerFn`, `CHUNKER_REGISTRY`, `Chunk`, `TextPosition`. Single import path for chunker authors. |
| `cocoindex_code/settings.py` | `ChunkerMapping(ext, module)` dataclass and `ProjectSettings.chunkers` field. Mirrors the existing `language_overrides` pattern. |
| `cocoindex_code/daemon.py` | Resolves `ChunkerMapping` entries to callables via `importlib` at project load time. Fails loudly at startup on bad config, not per file. |
| `cocoindex_code/project.py` | `Project.create()` gains an optional `chunker_registry` parameter. Empty by default, so zero behavioural delta for existing users. |
| `cocoindex_code/indexer.py` | `process_file` checks the registry per suffix; falls through to `RecursiveSplitter` unchanged when there is no match. |
| `tests/example_toml_chunker.py` | Demo chunker splitting at `[section]` headers. Lives in `tests/` to signal it belongs to a downstream package. |

## Design decisions

- `ChunkerFn` is a `Callable` alias, not a `Protocol`: consistent with codebase style.
- `tracked=False`: callables are not fingerprint-able, consistent with `SQLITE_DB`, `CODEBASE_DIR`, and other non-serialisable context keys; changing a chunker requires a daemon restart, which triggers a full re-index anyway.
- `_resolve_chunker_registry` lives in `daemon.py`, its only call site: keeps `settings.py` as pure schema/IO and `chunking.py` as pure type definitions.

## Commits

1. chore: enable mypy `explicit_package_bases`, remove 16 now-unused `type: ignore` comments in `test_daemon.py` (standalone, bisectable)
2. feat: pluggable chunker registry