Skip to content

Comments

Unify MCP extension whitelist and add data formats#6

Open
johnzfitch wants to merge 1 commit intomasterfrom
mac
Open

Unify MCP extension whitelist and add data formats#6
johnzfitch wants to merge 1 commit intomasterfrom
mac

Conversation

@johnzfitch
Copy link
Owner

Summary

  • MCP tools.rs had a hardcoded 9-extension whitelist while handlers/mod.rs had 27 extensions — the MCP index tool silently dropped most file types. Now both use the shared ALLOWED_EXTENSIONS constant.
  • Adds data/log extensions: .log, .jsonl, .csv, .xml, .env, .ini, .cfg, .conf, .swift
  • Extends detect_kind() to classify new extensions (.xml as Html, rest as Text)
  • Fixes MCP walk_directory to skip hidden dirs and noise dirs (node_modules, target, dist, build) matching the handlers version

Test plan

  • cargo build --release -p ingestor-core --bin llmx-mcp compiles cleanly
  • Index a folder with .log files — should now produce chunks instead of 0 files
  • Index a folder with .csv/.jsonl files — should be indexed as Text
  • Index a folder with hidden dirs (.git) — should be skipped
  • Existing codebase indexing (.rs, .ts, .md) still works

🤖 Generated with Claude Code

The MCP tools.rs had a hardcoded 9-extension whitelist while
handlers/mod.rs had a comprehensive 27-extension ALLOWED_EXTENSIONS
constant. This caused the MCP index tool to silently drop most file
types (Python, Go, Shell, logs, etc). Now both code paths use the
same shared constant.

Adds log/data extensions (.log, .jsonl, .csv, .xml, .env, .ini, .cfg,
.conf, .swift) and fixes MCP walk_directory to skip hidden dirs and
noise dirs (node_modules, target, dist, build).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings February 19, 2026 08:30
Copy link

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 750406b057

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

// SQL
"sql",
// Data/Logs
"log", "jsonl", "csv", "xml", "env", "ini", "cfg", "conf",

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Index .env dotfiles explicitly

Adding "env" to the extension whitelist does not actually enable indexing of real .env files: the ingestion path relies on Path::extension() in read_file, and for a filename like .env that returns None, so the file is rejected before whitelist matching; additionally, the directory walk skips names starting with .. In practice, users still get no chunks from .env despite this commit advertising support, so this should be handled with a dotfile-specific path check.

Useful? React with 👍 / 👎.

Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR aims to make MCP indexing behavior consistent with the main handlers-based indexing by sharing a single extension allowlist, expanding supported data/log formats, and aligning directory-walk filtering.

Changes:

  • Replace MCP’s hardcoded extension whitelist with the shared handlers::ALLOWED_EXTENSIONS.
  • Expand the allowlist and detect_kind() to cover additional data/log formats (and treat .xml as Html).
  • Update MCP directory walking to skip hidden/noise directories similar to the handlers implementation.

Reviewed changes

Copilot reviewed 4 out of 5 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
ingestor-core/src/util.rs Extends detect_kind() for new extensions (.xml, .log, .jsonl, .csv, .env, .ini, .cfg, .conf).
ingestor-core/src/mcp/tools.rs Uses shared ALLOWED_EXTENSIONS and adds hidden/noise directory skipping to MCP walker.
ingestor-core/src/lib.rs Exposes handlers module from the library crate.
ingestor-core/src/handlers/mod.rs Adds new extensions (Swift + data/log formats) to ALLOWED_EXTENSIONS.
Cargo.lock Updates lockfile with additional dependencies now recorded for ingestor-core.
Comments suppressed due to low confidence (1)

ingestor-core/src/mcp/tools.rs:397

  • path.extension()/to_str() is case-sensitive and the allowed list is all lowercase, so files like README.MD or LOGS.JSONL will be skipped. Normalize ext to lowercase (and ideally trim) before checking ALLOWED_EXTENSIONS so behavior matches detect_kind() which already lowercases the path.
    // Check extension whitelist (shared with handlers module)
    if let Some(ext) = path.extension().and_then(|e| e.to_str()) {
        if !crate::handlers::ALLOWED_EXTENSIONS.contains(&ext) {
            return Ok(());
        }
    } else {
        return Ok(());

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +374 to 381
// Skip hidden directories and common non-code directories
if let Some(name) = path.file_name().and_then(|n| n.to_str()) {
if name.starts_with('.') || name == "node_modules" || name == "target" || name == "dist" || name == "build" {
continue;
}
}

if path.is_dir() {
Copy link

Copilot AI Feb 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new hidden-path filter runs before the path.is_dir() check and skips any entry whose filename starts with '.'. That means hidden files (e.g. .env, .npmrc) will never be indexed, which conflicts with adding env to the allowed extensions and with the stated goal of skipping hidden directories (like .git). Consider applying the name.starts_with('.') rule only when the entry is a directory (or explicitly allow specific dotfiles you want indexed).

Suggested change
// Skip hidden directories and common non-code directories
if let Some(name) = path.file_name().and_then(|n| n.to_str()) {
if name.starts_with('.') || name == "node_modules" || name == "target" || name == "dist" || name == "build" {
continue;
}
}
if path.is_dir() {
let is_dir = path.is_dir();
// Skip hidden directories and common non-code directories
if is_dir {
if let Some(name) = path.file_name().and_then(|n| n.to_str()) {
if name.starts_with('.') || name == "node_modules" || name == "target" || name == "dist" || name == "build" {
continue;
}
}
}
if is_dir {

Copilot uses AI. Check for mistakes.
// SQL
"sql",
// Data/Logs
"log", "jsonl", "csv", "xml", "env", "ini", "cfg", "conf",
Copy link

Copilot AI Feb 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding env here likely won’t include typical .env dotfiles: walk_directory currently skips entries starting with '.' and Path::extension() returns None for dotfiles like .env, so they’ll be dropped before/inside read_file. Either remove env from the allowlist, or add explicit handling to include .env (and adjust hidden-path skipping to only skip hidden directories).

Suggested change
"log", "jsonl", "csv", "xml", "env", "ini", "cfg", "conf",
"log", "jsonl", "csv", "xml", "ini", "cfg", "conf",

Copilot uses AI. Check for mistakes.
Comment on lines +28 to +33
} else if lower.ends_with(".xml") {
ChunkKind::Html
} else if lower.ends_with(".txt") || lower.ends_with(".log") || lower.ends_with(".jsonl")
|| lower.ends_with(".csv") || lower.ends_with(".env") || lower.ends_with(".ini")
|| lower.ends_with(".cfg") || lower.ends_with(".conf")
{
Copy link

Copilot AI Feb 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Classifying .xml as ChunkKind::Html routes XML through chunk_html(), which strips all tags (<[^>]+>) and skips <script>/<style> blocks. For many XML files (configs, manifests), tag/attribute names are important search terms and will be lost. If the intent is to index XML syntax rather than rendered text, map .xml to ChunkKind::Text (or introduce a dedicated XML kind/chunker).

Copilot uses AI. Check for mistakes.
Comment on lines +28 to +33
} else if lower.ends_with(".xml") {
ChunkKind::Html
} else if lower.ends_with(".txt") || lower.ends_with(".log") || lower.ends_with(".jsonl")
|| lower.ends_with(".csv") || lower.ends_with(".env") || lower.ends_with(".ini")
|| lower.ends_with(".cfg") || lower.ends_with(".conf")
{
Copy link

Copilot AI Feb 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

New extension mappings were added in detect_kind() (.log, .jsonl, .csv, .xml, .ini, .cfg, .conf, .env), but there are no corresponding filetype tests. The repo has ingestor-core/tests/filetype_tests.rs specifically for extension→kind coverage; please add cases for the new extensions to prevent regressions.

Suggested change
} else if lower.ends_with(".xml") {
ChunkKind::Html
} else if lower.ends_with(".txt") || lower.ends_with(".log") || lower.ends_with(".jsonl")
|| lower.ends_with(".csv") || lower.ends_with(".env") || lower.ends_with(".ini")
|| lower.ends_with(".cfg") || lower.ends_with(".conf")
{
} else if lower.ends_with(".txt") {

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant