code_skim

Transform source code by removing implementation details whilst preserving structure. Typically achieves a 60-80% character reduction, which helps fit more code into AI context windows.

Status

🔒 Disabled by default - Enable with ENABLE_ADDITIONAL_TOOLS=code_skim

⚠️ Platform Availability: Due to tree-sitter's CGO dependency, code_skim is only available on:

  • macOS (darwin) - included in GitHub release binaries
  • Linux AMD64 with CGO enabled - included in GitHub release binaries
  • Docker images exclude this tool (built with CGO_ENABLED=0 for minimal size)

Linux ARM64 and Windows builds exclude this tool. If you need code_skim on those platforms, you'll need to build from source with CGO enabled.

Overview

The code_skim tool uses tree-sitter to parse source code and strip function/method bodies whilst preserving signatures, types, and overall structure. Language is automatically detected from file extensions. Results are paginated to prevent overwhelming context windows.

Supported languages:

  • Python (.py)
  • Go (.go)
  • JavaScript (.js, .jsx)
  • TypeScript (.ts, .tsx)
  • Rust (.rs)
  • C (.c, .h)
  • C++ (.cpp, .cc, .cxx, .hpp, .hxx, .hh)
  • Bash (.sh, .bash)
  • HTML (.html, .htm)
  • CSS (.css)
  • Swift (.swift)
  • Java (.java)
  • YAML (.yml, .yaml)
  • HCL/Terraform (.hcl, .tf)
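
The extension-based detection described above can be pictured as a simple lookup table. The sketch below is illustrative Python, not the tool's implementation: the detect_language helper, the language labels, and the case-insensitive matching are all assumptions, and only the extensions themselves come from the list above.

```python
from pathlib import Path
from typing import Optional

# Extensions copied from the supported-languages list. The label strings are
# hypothetical and may not match the tool's "language" response field.
EXTENSION_LANGUAGES = {
    ".py": "python",
    ".go": "go",
    ".js": "javascript", ".jsx": "javascript",
    ".ts": "typescript", ".tsx": "typescript",
    ".rs": "rust",
    ".c": "c", ".h": "c",
    ".cpp": "cpp", ".cc": "cpp", ".cxx": "cpp",
    ".hpp": "cpp", ".hxx": "cpp", ".hh": "cpp",
    ".sh": "bash", ".bash": "bash",
    ".html": "html", ".htm": "html",
    ".css": "css",
    ".swift": "swift",
    ".java": "java",
    ".yml": "yaml", ".yaml": "yaml",
    ".hcl": "hcl", ".tf": "hcl",
}


def detect_language(path: str) -> Optional[str]:
    """Return a language label for a supported extension, else None.

    Lower-casing the suffix is an assumption; the tool may be case-sensitive.
    """
    return EXTENSION_LANGUAGES.get(Path(path).suffix.lower())
```

A pre-check along these lines can help explain why a file in a directory or glob match did not appear in the results.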

Why Use code_skim?

When working with large codebases, you often don't need implementation details to understand architecture, APIs, or structure. The code_skim tool addresses the context attention problem:

  • Large contexts degrade model performance (attention dilution)
  • 80% of the time, you don't need implementation details
  • Focus on what code does, not how it does it

Character reduction example:

  • Original: ~200,000 characters
  • After skimming: ~60,000 characters (70% reduction)
  • Fits more code in limited context windows
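
The 70% figure follows directly from the two character counts. As a small worked example (the reduction_percentage helper is hypothetical, and the tool's own rounding is not specified here):

```python
def reduction_percentage(original_chars: int, transformed_chars: int) -> float:
    """Percentage of characters removed relative to the original."""
    if original_chars <= 0:
        raise ValueError("original_chars must be positive")
    return (original_chars - transformed_chars) * 100 / original_chars


# Figures from the example above: 200,000 characters reduced to 60,000.
print(reduction_percentage(200_000, 60_000))  # 70.0
```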

Parameters

Required

  • source (array): Array of file paths, directory paths, or glob patterns
    • Single file: ["/path/to/file.py"]
    • Directory: ["/path/to/directory"] (recursively finds supported files)
    • Glob pattern: ["/path/to/**/*.py"] (matches using glob syntax)
    • Multiple: ["/path/to/file1.py", "/path/to/file2.go", "/path/**/*.ts"]
    • Multiple sources are automatically deduplicated

Optional

  • clear_cache (boolean): Clear cache entry before processing
    • Default: false
  • starting_line (number): Line number to start from (1-based) for pagination
    • Use when previous response was truncated
    • Take the value from the next_starting_line field of the truncated response
  • filter (array): Array of glob patterns to filter function/method/class names
    • Single pattern: ["handle_*"], ["test_*"], ["*Controller"]
    • Multiple patterns: ["handle_*", "process_*", "get*"]
    • Inverse filter (exclusion): Prefix with ! (e.g., ["!temp_*"], ["!test_*"])
    • Combined: ["handle_*", "!handle_temp*"] (include handle_* but exclude handle_temp*)
    • Exclusions take priority over inclusions
    • Returns matched_items, total_items, filtered_items counts in response
  • extract_graph (boolean): Extract relationship graph including imports, calls, and inheritance
    • Default: false
    • Adds graph field to file results with structured relationship data
  • output_format (string): Output format for the transformed code
    • "json" (default): Standard JSON response
    • "sigil": Compressed notation optimised for LLM context (see Sigil Format below)
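
The include/exclude rules for filter can be sketched as follows. This is a Python approximation that uses fnmatch as a stand-in for whatever glob matcher the tool actually uses; the apply_filter helper and the case-sensitive matching are assumptions. It mirrors the documented behaviour that exclusions take priority and that a filter containing only exclusions keeps everything else.

```python
from fnmatch import fnmatchcase
from typing import Iterable, List


def apply_filter(names: Iterable[str], patterns: List[str]) -> List[str]:
    """Keep names that match an include pattern and no exclude pattern."""
    includes = [p for p in patterns if not p.startswith("!")]
    excludes = [p[1:] for p in patterns if p.startswith("!")]
    kept = []
    for name in names:
        if any(fnmatchcase(name, p) for p in excludes):
            continue  # exclusions take priority over inclusions
        # With no include patterns, everything not excluded is kept.
        if not includes or any(fnmatchcase(name, p) for p in includes):
            kept.append(name)
    return kept


names = ["handle_request", "handle_temp_file", "test_login", "get_user"]
print(apply_filter(names, ["handle_*", "!handle_temp*"]))  # ['handle_request']
```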

How It Works

The tool removes function/method bodies whilst preserving:

  • Function and method signatures
  • Class declarations
  • Type definitions
  • Overall code structure

Character reduction: 60-80%

Example:

# Before
def process_user(user):
    validated = validate_user(user)
    if not validated:
        raise ValueError("Invalid user")
    normalised = normalise_data(user)
    return save_to_database(normalised)

# After transformation
def process_user(user): { /* ... */ }
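
To make the transformation concrete, here is a minimal sketch of the same idea for Python sources only. It uses Python's built-in ast module as a stand-in for tree-sitter, so it is not how code_skim itself works; skim_python is a hypothetical helper, and it keeps only top-level functions, classes, and their methods, whereas the tool preserves more of the surrounding structure. Requires Python 3.9+ for ast.unparse.

```python
import ast

PLACEHOLDER = "{ /* ... */ }"


def _signature(fn) -> str:
    """Render a function signature followed by the body placeholder."""
    keyword = "async def" if isinstance(fn, ast.AsyncFunctionDef) else "def"
    returns = f" -> {ast.unparse(fn.returns)}" if fn.returns else ""
    return f"{keyword} {fn.name}({ast.unparse(fn.args)}){returns}: {PLACEHOLDER}"


def skim_python(source: str) -> str:
    """Keep function signatures and class headers; drop all bodies."""
    functions = (ast.FunctionDef, ast.AsyncFunctionDef)
    out = []
    for node in ast.parse(source).body:
        if isinstance(node, functions):
            out.append(_signature(node))
        elif isinstance(node, ast.ClassDef):
            bases = ", ".join(ast.unparse(b) for b in node.bases)
            out.append(f"class {node.name}({bases}):" if bases else f"class {node.name}:")
            for item in node.body:
                if isinstance(item, functions):
                    out.append("    " + _signature(item))
    return "\n".join(out)
```

Running skim_python on the "Before" snippet above yields the single "After transformation" line.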

Line Limiting

By default, results are limited to 10,000 lines per file to prevent overwhelming context windows. When results exceed this limit:

  • Response includes truncated: true
  • total_lines shows the total line count of the transformed output
  • returned_lines shows how many lines were returned
  • next_starting_line specifies where to continue from

Configure the limit with the CODE_SKIM_MAX_LINES environment variable.
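
A client can follow next_starting_line until truncated is false. The loop below is a sketch under stated assumptions: call_tool is a hypothetical callable that sends a code_skim request and returns the parsed JSON response (how you invoke the tool depends on your client), and joining pages with a newline is a guess about how the chunks fit together.

```python
from typing import Callable, Dict, List


def skim_all_pages(call_tool: Callable[[Dict], Dict], path: str) -> str:
    """Collect every page of one file's transformed output."""
    request: Dict = {"source": [path]}
    chunks: List[str] = []
    while True:
        result = call_tool(request)["files"][0]
        if result.get("error"):
            raise RuntimeError(result["error"])
        chunks.append(result["transformed"])
        if not result.get("truncated"):
            return "\n".join(chunks)
        # Continue from where the previous response stopped.
        request = {"source": [path], "starting_line": result["next_starting_line"]}
```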

Examples

Transform a single file

{
  "source": ["/path/to/src/api.py"]
}

Transform all supported files in a directory

{
  "source": ["/path/to/src"]
}

Transform files matching a glob pattern

{
  "source": ["/path/to/src/**/*.ts"]
}

Clear cache and re-process

{
  "source": ["/path/to/app.js"],
  "clear_cache": true
}

Paginate through a large file

{
  "source": ["/path/to/large_file.py"],
  "starting_line": 10001
}

Filter by function name pattern

{
  "source": ["/path/to/api.py"],
  "filter": ["handle_*"]
}

Show only test functions

{
  "source": ["/path/to/tests.py"],
  "filter": ["test_*"]
}

Multiple source files

{
  "source": [
    "/path/to/api.py",
    "/path/to/handlers.py",
    "/path/to/models.py"
  ]
}

Multiple filter patterns

{
  "source": ["/path/to/api.py"],
  "filter": ["handle_*", "process_*", "validate_*"]
}

Exclude specific patterns (inverse filter)

{
  "source": ["/path/to/api.py"],
  "filter": ["handle_*", "!handle_temp*"]
}

Show everything except test functions

{
  "source": ["/path/to/src"],
  "filter": ["!test_*"]
}

Response Format

Single File

{
  "files": [
    {
      "path": "/path/to/api.py",
      "transformed": "def hello(name): { /* ... */ }",
      "language": "python",
      "from_cache": false,
      "truncated": false,
      "total_lines": 8,
      "returned_lines": 8,
      "reduction_percentage": 65
    }
  ],
  "total_files": 1,
  "processed_files": 1,
  "failed_files": 0,
  "processing_time_ms": 15
}

With Filtering

{
  "files": [
    {
      "path": "/path/to/api.py",
      "transformed": "def handle_request(): { /* ... */ }\ndef handle_response(): { /* ... */ }",
      "language": "python",
      "from_cache": false,
      "truncated": false,
      "total_lines": 4,
      "returned_lines": 4,
      "reduction_percentage": 75,
      "matched_items": 2,
      "total_items": 10,
      "filtered_items": 8
    }
  ],
  "total_files": 1,
  "processed_files": 1,
  "failed_files": 0,
  "processing_time_ms": 18
}

Truncated Response (Pagination)

{
  "files": [
    {
      "path": "/path/to/large_file.py",
      "transformed": "...first 10,000 lines...",
      "language": "python",
      "from_cache": false,
      "truncated": true,
      "total_lines": 25000,
      "returned_lines": 10000,
      "next_starting_line": 10001
    }
  ],
  "total_files": 1,
  "processed_files": 1,
  "failed_files": 0
}

Response Fields:

  • files: Array of file results
    • path: Absolute file path
    • transformed: Transformed source code
    • language: Detected language
    • from_cache: Whether result came from cache
    • truncated: Whether output was truncated due to line limit
    • total_lines: Total line count of transformed output
    • returned_lines: Number of lines returned in this response
    • next_starting_line: Line number to use for next request (if truncated)
    • reduction_percentage: Percentage of token/character reduction from original (0-100)
    • matched_items: Number of functions/methods/classes that matched filter (only when filtering)
    • total_items: Total number of functions/methods/classes found (only when filtering)
    • filtered_items: Number of functions/methods/classes excluded by filter (only when filtering)
    • error: Error message (if file processing failed)
  • total_files: Total number of files found
  • processed_files: Number of successfully processed files
  • failed_files: Number of files that failed processing
  • processing_time_ms: Total processing time in milliseconds
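
As an illustration of how these fields fit together, the following sketch turns a parsed response into a short per-file report. The summarise helper and its output wording are hypothetical; only the field names come from the list above, and the sketch assumes failed entries still carry a path.

```python
from typing import Dict, List


def summarise(response: Dict) -> List[str]:
    """Build one human-readable line per file, plus a totals line."""
    lines = []
    for f in response.get("files", []):
        path = f.get("path", "?")
        if f.get("error"):
            lines.append(f"{path}: FAILED ({f['error']})")
            continue
        status = "cached" if f.get("from_cache") else "fresh"
        # next_starting_line is only present on truncated results.
        more = f", continue at line {f['next_starting_line']}" if f.get("truncated") else ""
        lines.append(f"{path}: {f['returned_lines']}/{f['total_lines']} lines, {status}{more}")
    lines.append(
        f"{response.get('processed_files', 0)} processed, "
        f"{response.get('failed_files', 0)} failed"
    )
    return lines
```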

Graph Extraction

When extract_graph: true, the response includes relationship data:

{
  "files": [
    {
      "path": "/path/to/handler.py",
      "graph": {
        "imports": ["os", "json", "typing.Optional"],
        "functions": [
          {
            "name": "handle_request",
            "calls": ["validate", "process", "respond"],
            "connectivity": 3
          }
        ],
        "classes": [
          {
            "name": "RequestHandler",
            "extends": "BaseHandler",
            "implements": ["Loggable"],
            "methods": ["__init__", "handle"]
          }
        ]
      }
    }
  ]
}

Graph Fields:

  • imports: Module/package imports
  • functions: Function details with call relationships
    • calls: Functions called by this function
    • connectivity: Total number of relationships (★ rating)
  • classes: Class details with inheritance
    • extends: Parent class
    • implements: Implemented interfaces
    • methods: Method names
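
One way to use the graph data is to invert the calls lists and see which functions call a given name. The sketch below assumes the response shape shown above; callers_by_callee is a hypothetical helper, not part of the tool.

```python
from collections import defaultdict
from typing import Dict, List


def callers_by_callee(files: List[Dict]) -> Dict[str, List[str]]:
    """Map each called name to the functions that call it."""
    callers = defaultdict(list)
    for f in files:
        # Files without a "graph" field simply contribute nothing.
        for fn in f.get("graph", {}).get("functions", []):
            for callee in fn.get("calls", []):
                callers[callee].append(fn["name"])
    return dict(callers)
```

Applied to the example response, every callee (validate, process, respond) maps back to handle_request.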

Sigil Format

The output_format: "sigil" option provides compressed notation optimised for LLM consumption:

# /path/to/handler.py [python]
!os !json !typing.Optional
$RequestHandler < BaseHandler & Loggable
  #__init__() -> #_setup_logging
  #handle() -> #validate #process ★3
#main() -> $RequestHandler.#handle ★1

Sigil Meanings:

  • ! - import/module
  • $ - class/type
  • # - function/method
  • < - extends
  • & - implements
  • -> - calls (outgoing)
  • ★n - connectivity rating (n relationships)
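
If you need to read sigil output programmatically, a function line can be split on whitespace. This sketch is inferred solely from the example above and is not a specification of the format; parse_function_line is a hypothetical helper, and it only handles function/method lines (those starting with #).

```python
from typing import Dict


def parse_function_line(line: str) -> Dict:
    """Split a line like '#handle() -> #validate #process ★3' into parts."""
    tokens = line.split()
    if not tokens or not tokens[0].startswith("#"):
        raise ValueError("not a function line")
    rating = None
    if tokens[-1].startswith("★"):
        rating = int(tokens.pop()[1:])  # trailing connectivity rating, if any
    name = tokens[0].lstrip("#").split("(")[0]
    # Everything after '->' is treated as an outgoing call target.
    calls = tokens[2:] if len(tokens) > 1 and tokens[1] == "->" else []
    return {"name": name, "calls": calls, "connectivity": rating}


print(parse_function_line("  #handle() -> #validate #process ★3"))
# {'name': 'handle', 'calls': ['#validate', '#process'], 'connectivity': 3}
```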

Example with Sigil Format:

{
  "source": ["/path/to/api.py"],
  "extract_graph": true,
  "output_format": "sigil"
}

Caching

Results are cached using a key based on:

  • File path
  • Language
  • Filter patterns (if applied)
  • Source code hash (SHA256)

Cache behaviour:

  • First call: Processes and caches result (from_cache: false)
  • Subsequent calls: Returns cached result if file content unchanged (from_cache: true)
  • Clear cache: Set clear_cache: true to force re-processing
  • Each file in batch operations is cached independently
  • Pagination: Cached transformed output is reused for different line ranges
  • Different filter patterns create separate cache entries
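
The cache behaviour above follows from how the key is built. A minimal sketch, assuming the four components are simply joined together (the real key layout is not documented here, and cache_key is a hypothetical helper):

```python
import hashlib
from typing import Optional, Sequence


def cache_key(path: str, language: str, source: bytes,
              filters: Optional[Sequence[str]] = None) -> str:
    """Combine path, language, filter patterns, and a SHA256 content hash."""
    content_hash = hashlib.sha256(source).hexdigest()
    # Whether filter order matters in the real tool is unknown; here it does.
    filter_part = ",".join(filters) if filters else ""
    return "|".join([path, language, filter_part, content_hash])
```

Because the content hash is part of the key, editing a file produces a new key and therefore a cache miss, while changing only starting_line leaves the key untouched.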

Use Cases

1. Codebase Overview

Quickly understand code structure without implementation noise:

{
  "source": ["/path/to/src"]
}

2. API Documentation

Extract function signatures for documentation:

{
  "source": ["/path/to/api.py"]
}

3. Architecture Analysis

Analyse entire packages or modules:

{
  "source": ["/path/to/project/**/*.go"]
}

4. Context Window Optimisation

Fit more code into limited AI context windows by removing implementation noise.

When to Use

Use when:

  • Analysing code structure without implementation details
  • Fitting large codebases into limited AI context windows
  • Providing architectural overviews
  • Examining API surfaces and function signatures
  • Understanding "what" code does without the "how" details

Don't use when:

  • Debugging implementation logic
  • Examining algorithm details
  • Reviewing line-by-line code quality
  • Actual implementation is required for the task
  • Working with unsupported languages

Troubleshooting

File not found or access denied

Problem: Error about file not found or access denied

Solution: Ensure the file path is absolute and exists. Check that the security configuration allows access to the file location.

No files match glob pattern

Problem: Error when using glob patterns

Solution: Verify the glob pattern is correct and matches existing files. Use **/*.py for recursive matching.

Language detection failed

Problem: Error about unsupported file extension or language

Solution: Ensure files have supported extensions. See the full list of supported languages and extensions in the Overview section.

Transformation failed with parse error

Problem: Tree-sitter parser error

Solution: Ensure source code is syntactically valid for the specified language. Tree-sitter requires valid syntax to parse.

Cache returning stale results

Problem: Getting old transformation when source has changed

Solution: The cache key includes a hash of the file content, so changes should be detected automatically. If stale results persist, set clear_cache: true to force re-processing.

Token reduction lower than expected

Problem: Reduction percentage is much lower than 60-80%

Solution: The tool typically achieves a 60-80% reduction. A lower figure may indicate that the source contains few function bodies to strip (e.g., mostly declarations or empty functions).

File too large error

Problem: Individual file exceeds 500KB size limit

Solution: The tool limits individual file sizes to 500KB to prevent memory exhaustion. Consider splitting large files, or if the file is genuinely needed, process it in smaller chunks or use alternative tools.

Memory limit exceeded error

Problem: Total memory usage would exceed 4GB limit

Solution: The tool limits total memory to 4GB across all files being processed. Process fewer files at once, use more specific glob patterns to target subsets, or process files in batches sequentially.

Memory and Resource Limits

To ensure safe operation and prevent resource exhaustion:

  • Maximum file size: 500KB per individual file
  • Maximum total memory: 4GB across all files being processed
  • Maximum AST depth: 500 levels (prevents stack overflow)
  • Maximum AST nodes: 100,000 per file (prevents memory exhaustion)
  • Parallel workers: Up to 10 concurrent file processors

Files exceeding these limits are skipped with detailed error messages in the response.
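
To avoid skipped files, a client can check sizes before building the source array. The sketch below assumes the 500KB limit means 500 × 1024 bytes, which this document does not confirm; split_by_size is a hypothetical helper.

```python
import os
from typing import Iterable, List, Tuple

MAX_FILE_BYTES = 500 * 1024  # assumption: KB is interpreted as 1024 bytes


def split_by_size(paths: Iterable[str],
                  limit: int = MAX_FILE_BYTES) -> Tuple[List[str], List[str]]:
    """Return (within_limit, too_large) based on on-disk file size."""
    within, too_large = [], []
    for path in paths:
        (too_large if os.path.getsize(path) > limit else within).append(path)
    return within, too_large
```

Only the first list would then be passed as source; the second can be reported or handled with another tool.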

Implementation Details

  • Built on go-tree-sitter
  • Uses tree-sitter parsers for accurate AST analysis
  • Parallel processing with worker pool (up to 10 workers)
  • In-memory caching with SHA256 hashing for performance
  • File access controlled by security integration
  • Batch processing for directories and glob patterns using doublestar
  • Memory-safe with configurable limits

Related Tools

  • code_search: Semantic search over indexed code using natural language
  • find_long_files: Identify large files that may benefit from skimming
  • get_library_documentation: Get focused library documentation
  • fetch_url: Fetch web content (can be combined with skimming)

Extended Help

Use the get_tool_help tool to access detailed usage information:

{
  "tool_name": "code_skim"
}

This provides:

  • Detailed examples for all languages
  • Common usage patterns
  • Troubleshooting tips
  • Parameter explanations
  • When to use / when not to use guidance