ChunkSmith is a specialized workbench for Chunk Engineers. It allows you to visualize, test, and refine PDF chunking algorithms.
Designed for developers building RAG (Retrieval-Augmented Generation) pipelines, ChunkSmith provides a visual interface to see exactly where and how your documents are being split.
ChunkSmith is built with Python and uses uv for fast dependency management.
- Python 3.12+
- uv (Python package and project manager)
- Clone the repository:
    git clone https://github.com/fhalde/chunksmith.git
    cd chunksmith
- Install dependencies:
    uv sync
- Run the application:
    uv run main.py
ChunkSmith comes with several reference implementations to get you started:
- Basic Word Chunker:
  - Treats every individual word as a chunk.
  - Use case: Debugging bounding box accuracy and coordinate systems.
- Sentence Chunker:
  - Splits text by sentence boundaries using PyMuPDF.
  - Use case: Standard NLP tasks where sentence-level granularity is needed.
- Semantic Chunker (Percentile):
  - Uses sentence-transformers to generate embeddings for sliding windows of text.
  - Calculates cosine distance between adjacent sentences.
  - Dynamically splits at the 90th percentile of distances (the "peaks" of semantic change).
  - Use case: Resumes, scientific papers, or structured documents where you want to capture distinct sections (e.g., "Experience" vs "Education").
- Topic Chunker (K-Means):
  - Clusters sentences based on semantic similarity using K-Means.
  - Non-sequential: Can group a paragraph from Page 1 and a paragraph from Page 10 into the same chunk if they discuss the same topic.
  - Use case: Topic modeling, extracting specific themes (e.g., "Legal Disclaimers" scattered throughout a contract).
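
The percentile split used by the Semantic Chunker (Percentile) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not ChunkSmith's actual implementation: the function names (cosine_distance, percentile, split_at_percentile) are hypothetical, the embeddings are passed in as plain lists of floats rather than produced by sentence-transformers, and the percentile uses linear interpolation, which may differ from what the project does.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    """Return 1 - cosine similarity. Assumes equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def percentile(values: list[float], pct: float) -> float:
    """Linear-interpolation percentile (pct in 0..100) of a non-empty list."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * pct / 100.0
    lower = math.floor(rank)
    upper = math.ceil(rank)
    fraction = rank - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


def split_at_percentile(
    sentences: list[str],
    embeddings: list[list[float]],
    pct: float = 90.0,
) -> list[list[str]]:
    """Group consecutive sentences into chunks.

    A new chunk starts after every gap whose cosine distance is strictly
    greater than the pct-th percentile of all adjacent-sentence distances.
    """
    if len(sentences) < 2:
        return [list(sentences)] if sentences else []

    # Distance between each sentence and the one that follows it.
    distances = [
        cosine_distance(embeddings[i], embeddings[i + 1])
        for i in range(len(embeddings) - 1)
    ]
    threshold = percentile(distances, pct)

    chunks: list[list[str]] = []
    current = [sentences[0]]
    for sentence, distance in zip(sentences[1:], distances):
        if distance > threshold:
            # A "peak" of semantic change: close the current chunk.
            chunks.append(current)
            current = []
        current.append(sentence)
    chunks.append(current)
    return chunks
```

With toy 2-D embeddings such as [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0]] for four sentences, only the gap between the second and third sentence exceeds the threshold, so the result is two chunks of two sentences each.
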
- Create a new file in backend/chunkers/ (e.g., my_chunker.py).
- Create a class that inherits from BaseChunker.
- Implement the chunk method.

    from typing import List

    from .base import BaseChunker, Chunk, BoundingBox


    class MyCustomChunker(BaseChunker):
        @property
        def name(self) -> str:
            return "My Custom Logic"

        @property
        def description(self) -> str:
            return "Splits by... magic?"

        def chunk(self, pdf_path: str) -> List[Chunk]:
            # Your logic here using pymupdf (fitz)
            return []

- Register your new chunker in backend/api.py.

    from .chunkers.my_chunker import MyCustomChunker

    # ... inside Api.__init__
    self.chunkers = {
        # ...
        "My Custom Logic": MyCustomChunker(),
    }

- Restart the app. Your new algorithm will appear in the dropdown.
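
As a starting point for the body of a custom chunk method, the sketch below groups a fixed number of consecutive words into each chunk. It is only an illustration: the BoundingBox and Chunk dataclasses here are hypothetical stand-ins (the real ones live in backend/chunkers/base.py and may have different fields), chunk_fixed_window is an invented name, and the (page, x0, y0, x1, y1, word) input shape is an assumption. In a real chunker the words could come from PyMuPDF, for example via page.get_text("words").

```python
from dataclasses import dataclass
from typing import List, Tuple


# Hypothetical stand-ins for the types in backend/chunkers/base.py;
# the real field names may differ.
@dataclass
class BoundingBox:
    page: int
    x0: float
    y0: float
    x1: float
    y1: float


@dataclass
class Chunk:
    text: str
    boxes: List[BoundingBox]


# Assumed input shape: (page, x0, y0, x1, y1, word)
Word = Tuple[int, float, float, float, float, str]


def chunk_fixed_window(words: List[Word], size: int = 3) -> List[Chunk]:
    """Group every `size` consecutive words into one Chunk.

    Each chunk keeps one bounding box per word so the splits can be
    drawn back onto the page.
    """
    if size < 1:
        raise ValueError("size must be at least 1")

    chunks: List[Chunk] = []
    for start in range(0, len(words), size):
        window = words[start:start + size]
        chunks.append(
            Chunk(
                text=" ".join(w[5] for w in window),
                boxes=[BoundingBox(*w[:5]) for w in window],
            )
        )
    return chunks
```

Keeping one box per word, rather than a single merged rectangle, lets a chunk that spans a line or page break still be highlighted accurately.
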
Only AI-generated code will be merged.
Licensed under the MIT License.
