Unify MCP extension whitelist and add data formats#6
Unify MCP extension whitelist and add data formats#6johnzfitch wants to merge 1 commit intomasterfrom
Conversation
The MCP tools.rs had a hardcoded 9-extension whitelist while handlers/mod.rs had a comprehensive 27-extension ALLOWED_EXTENSIONS constant. This caused the MCP index tool to silently drop most file types (Python, Go, Shell, logs, etc). Now both code paths use the same shared constant. Adds log/data extensions (.log, .jsonl, .csv, .xml, .env, .ini, .cfg, .conf, .swift) and fixes MCP walk_directory to skip hidden dirs and noise dirs (node_modules, target, dist, build). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 750406b057
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| // SQL | ||
| "sql", | ||
| // Data/Logs | ||
| "log", "jsonl", "csv", "xml", "env", "ini", "cfg", "conf", |
There was a problem hiding this comment.
Index
.env dotfiles explicitly
Adding "env" to the extension whitelist does not actually enable indexing of real .env files: the ingestion path relies on Path::extension() in read_file, and for a filename like .env that returns None, so the file is rejected before whitelist matching; additionally, the directory walk skips names starting with .. In practice, users still get no chunks from .env despite this commit advertising support, so this should be handled with a dotfile-specific path check.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Pull request overview
This PR aims to make MCP indexing behavior consistent with the main handlers-based indexing by sharing a single extension allowlist, expanding supported data/log formats, and aligning directory-walk filtering.
Changes:
- Replace MCP’s hardcoded extension whitelist with the shared
handlers::ALLOWED_EXTENSIONS. - Expand the allowlist and
detect_kind()to cover additional data/log formats (and treat.xmlasHtml). - Update MCP directory walking to skip hidden/noise directories similar to the handlers implementation.
Reviewed changes
Copilot reviewed 4 out of 5 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| ingestor-core/src/util.rs | Extends detect_kind() for new extensions (.xml, .log, .jsonl, .csv, .env, .ini, .cfg, .conf). |
| ingestor-core/src/mcp/tools.rs | Uses shared ALLOWED_EXTENSIONS and adds hidden/noise directory skipping to MCP walker. |
| ingestor-core/src/lib.rs | Exposes handlers module from the library crate. |
| ingestor-core/src/handlers/mod.rs | Adds new extensions (Swift + data/log formats) to ALLOWED_EXTENSIONS. |
| Cargo.lock | Updates lockfile with additional dependencies now recorded for ingestor-core. |
Comments suppressed due to low confidence (1)
ingestor-core/src/mcp/tools.rs:397
path.extension()/to_str()is case-sensitive and the allowed list is all lowercase, so files likeREADME.MDorLOGS.JSONLwill be skipped. Normalizeextto lowercase (and ideally trim) before checkingALLOWED_EXTENSIONSso behavior matchesdetect_kind()which already lowercases the path.
// Check extension whitelist (shared with handlers module)
if let Some(ext) = path.extension().and_then(|e| e.to_str()) {
if !crate::handlers::ALLOWED_EXTENSIONS.contains(&ext) {
return Ok(());
}
} else {
return Ok(());
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| // Skip hidden directories and common non-code directories | ||
| if let Some(name) = path.file_name().and_then(|n| n.to_str()) { | ||
| if name.starts_with('.') || name == "node_modules" || name == "target" || name == "dist" || name == "build" { | ||
| continue; | ||
| } | ||
| } | ||
|
|
||
| if path.is_dir() { |
There was a problem hiding this comment.
The new hidden-path filter runs before the path.is_dir() check and skips any entry whose filename starts with '.'. That means hidden files (e.g. .env, .npmrc) will never be indexed, which conflicts with adding env to the allowed extensions and with the stated goal of skipping hidden directories (like .git). Consider applying the name.starts_with('.') rule only when the entry is a directory (or explicitly allow specific dotfiles you want indexed).
| // Skip hidden directories and common non-code directories | |
| if let Some(name) = path.file_name().and_then(|n| n.to_str()) { | |
| if name.starts_with('.') || name == "node_modules" || name == "target" || name == "dist" || name == "build" { | |
| continue; | |
| } | |
| } | |
| if path.is_dir() { | |
| let is_dir = path.is_dir(); | |
| // Skip hidden directories and common non-code directories | |
| if is_dir { | |
| if let Some(name) = path.file_name().and_then(|n| n.to_str()) { | |
| if name.starts_with('.') || name == "node_modules" || name == "target" || name == "dist" || name == "build" { | |
| continue; | |
| } | |
| } | |
| } | |
| if is_dir { |
| // SQL | ||
| "sql", | ||
| // Data/Logs | ||
| "log", "jsonl", "csv", "xml", "env", "ini", "cfg", "conf", |
There was a problem hiding this comment.
Adding env here likely won’t include typical .env dotfiles: walk_directory currently skips entries starting with '.' and Path::extension() returns None for dotfiles like .env, so they’ll be dropped before/inside read_file. Either remove env from the allowlist, or add explicit handling to include .env (and adjust hidden-path skipping to only skip hidden directories).
| "log", "jsonl", "csv", "xml", "env", "ini", "cfg", "conf", | |
| "log", "jsonl", "csv", "xml", "ini", "cfg", "conf", |
| } else if lower.ends_with(".xml") { | ||
| ChunkKind::Html | ||
| } else if lower.ends_with(".txt") || lower.ends_with(".log") || lower.ends_with(".jsonl") | ||
| || lower.ends_with(".csv") || lower.ends_with(".env") || lower.ends_with(".ini") | ||
| || lower.ends_with(".cfg") || lower.ends_with(".conf") | ||
| { |
There was a problem hiding this comment.
Classifying .xml as ChunkKind::Html routes XML through chunk_html(), which strips all tags (<[^>]+>) and skips <script>/<style> blocks. For many XML files (configs, manifests), tag/attribute names are important search terms and will be lost. If the intent is to index XML syntax rather than rendered text, map .xml to ChunkKind::Text (or introduce a dedicated XML kind/chunker).
| } else if lower.ends_with(".xml") { | ||
| ChunkKind::Html | ||
| } else if lower.ends_with(".txt") || lower.ends_with(".log") || lower.ends_with(".jsonl") | ||
| || lower.ends_with(".csv") || lower.ends_with(".env") || lower.ends_with(".ini") | ||
| || lower.ends_with(".cfg") || lower.ends_with(".conf") | ||
| { |
There was a problem hiding this comment.
New extension mappings were added in detect_kind() (.log, .jsonl, .csv, .xml, .ini, .cfg, .conf, .env), but there are no corresponding filetype tests. The repo has ingestor-core/tests/filetype_tests.rs specifically for extension→kind coverage; please add cases for the new extensions to prevent regressions.
| } else if lower.ends_with(".xml") { | |
| ChunkKind::Html | |
| } else if lower.ends_with(".txt") || lower.ends_with(".log") || lower.ends_with(".jsonl") | |
| || lower.ends_with(".csv") || lower.ends_with(".env") || lower.ends_with(".ini") | |
| || lower.ends_with(".cfg") || lower.ends_with(".conf") | |
| { | |
| } else if lower.ends_with(".txt") { |
Summary
tools.rshad a hardcoded 9-extension whitelist whilehandlers/mod.rshad 27 extensions — the MCP index tool silently dropped most file types. Now both use the sharedALLOWED_EXTENSIONSconstant..log,.jsonl,.csv,.xml,.env,.ini,.cfg,.conf,.swiftdetect_kind()to classify new extensions (.xmlas Html, rest as Text)walk_directoryto skip hidden dirs and noise dirs (node_modules,target,dist,build) matching the handlers versionTest plan
cargo build --release -p ingestor-core --bin llmx-mcpcompiles cleanly.logfiles — should now produce chunks instead of 0 files.csv/.jsonlfiles — should be indexed as Text.git) — should be skipped.rs,.ts,.md) still works🤖 Generated with Claude Code