Structure-aware semantic chunking for RAG/LLMs (test.pypi.org/project/smartchunk/)
SmartChunk is a Python package + CLI that creates higher-quality chunks for Retrieval-Augmented Generation (RAG) pipelines. Instead of breaking text blindly, SmartChunk respects structure and meaning — no more chopped sentences, broken code blocks, or messy lists.
The result? 👉 Better retrieval quality 👉 Lower token costs 👉 Chunks your LLM can actually understand
Naive splitters cut text every N tokens. That causes:
- ❌ Broken headings, lists, or tables
- ❌ Incoherent fragments across paragraphs
- ❌ Duplicate/boilerplate content bloating your index
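To make the failure mode concrete, here is a minimal sketch of a naive fixed-size splitter. It cuts every N characters rather than every N tokens to stay dependency-free, and `naive_split` is a hypothetical helper for illustration, not part of SmartChunk.

```python
def naive_split(text: str, size: int) -> list[str]:
    """Cut text every `size` characters, ignoring sentences and structure."""
    return [text[i:i + size] for i in range(0, len(text), size)]


text = "SmartChunk respects structure. Naive splitters do not."
pieces = naive_split(text, 20)

# Each piece is printed with repr() so the mid-word cuts are visible.
for piece in pieces:
    print(repr(piece))
```

No content is lost, but the boundaries fall wherever the counter runs out, so words and sentences end up split across neighbouring chunks.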
SmartChunk fixes this by combining structure awareness + semantic similarity.
- Structure-Aware Splitting: Never slices through a heading, list, table, or fenced code block.
- Semantic Boundary Detection: Uses embeddings to find natural breakpoints between topics.
- Noise & Duplication Guard: Strips headers/footers, removes near-duplicates, normalizes whitespace.
- Flexible & Tunable: Control chunk size, overlap, and semantic sensitivity to fit your pipeline.
- End-to-End Ready: From URL → parsed → cleaned → JSONL chunks in one command.
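The first two ideas in the list above can be sketched roughly as follows. This is an illustration under stated assumptions, not SmartChunk's actual implementation: `split_blocks`, `semantic_chunks`, and the `threshold` value are hypothetical, and `toy_embed` is a bag-of-words stand-in for a real embedding model such as all-MiniLM-L6-v2.

```python
import math
import re
from collections import Counter

FENCE = "`" * 3  # Markdown code-fence marker


def split_blocks(markdown: str) -> list[str]:
    """Split on blank lines, but never inside a fenced code block."""
    blocks, current, in_fence = [], [], False
    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence  # entering or leaving a fence
            current.append(line)
        elif not in_fence and not line.strip():
            # A blank line outside a fence closes the current block.
            if current:
                blocks.append("\n".join(current))
                current = []
        else:
            current.append(line)
    if current:
        blocks.append("\n".join(current))
    return blocks


def toy_embed(text: str) -> Counter:
    """Assumption: word counts stand in for a real embedding vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def semantic_chunks(blocks, embed=toy_embed, threshold=0.2):
    """Merge adjacent blocks; start a new chunk where similarity drops."""
    if not blocks:
        return []
    chunks = [[blocks[0]]]
    for prev, cur in zip(blocks, blocks[1:]):
        if cosine(embed(prev), embed(cur)) < threshold:
            chunks.append([cur])  # topic shift: open a new chunk
        else:
            chunks[-1].append(cur)
    return ["\n\n".join(parts) for parts in chunks]


doc = (
    "Cats purr when happy.\n\nHappy cats purr loudly.\n\n"
    + FENCE + "python\nx = 1\n\ny = 2\n" + FENCE
    + "\n\nStock markets fell today."
)
blocks = split_blocks(doc)
for chunk in semantic_chunks(blocks):
    print(repr(chunk))
```

The fenced block survives intact even though it contains a blank line, and the two related sentences about cats are merged while the unrelated sentence starts a new chunk. With a real model, `embed` would return dense vectors and the threshold would need tuning.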
For hackathon/demo (TestPyPI):
pip install -i https://test.pypi.org/simple/ smartchunk

Once it is published to PyPI:
pip install smartchunk

smartchunk chunk README.md \
  --mode markdown \
  --max-tokens 500 \
  --overlap 100 \
  --semantic \
  --semantic-model all-MiniLM-L6-v2 \
  --format jsonl \
  --output chunks.jsonl

smartchunk fetch "https://en.wikipedia.org/wiki/Crayon_Shin-chan" \
  --semantic \
  --semantic-model all-MiniLM-L6-v2 \
  --format table

smartchunk compare README.md --mode markdown --max-chars 800

This prints a terminal table comparing the naive splitter and SmartChunk side by side.
Each line in the .jsonl output is a coherent chunk with rich metadata:
{
  "id": "c0033",
  "text": "###### Opening\n\n \n [\n\n \n edit\n\n \n ]\n\n* Footage from Japanese opening 8 (\"PLEASURE\") but with completely different lyrics, to the melody of a techno remix of Japanese opening 3 (\"Ora wa Ninkimono\").Musical Director, Producer and English Director: World Worm Studios composerGary Gibbons",
  "header_path": "Media / Anime / Music / LUK Internacional dub / Opening",
  "start_line": 709,
  "end_line": 727
}

- fetch → Fetch, parse & chunk a URL in one go
- chunk → Chunk a local file
- compare → Compare SmartChunk vs naive splitter (HTML report)
- stream → Stream chunks from STDIN in real-time
Run smartchunk --help for full options.
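Downstream code can consume the JSONL output with the standard library alone. The sketch below assumes only the fields visible in the sample record (`id`, `text`, `header_path`, `start_line`, `end_line`) and that `header_path` components are joined by " / ", as that sample suggests. The helper names and the two records are made up for illustration.

```python
import json
from collections import defaultdict


def parse_chunks(lines):
    """Parse JSONL: one JSON object per non-blank line."""
    return [json.loads(line) for line in lines if line.strip()]


def group_by_section(chunks, sep=" / "):
    """Group chunk ids by the first component of header_path."""
    groups = defaultdict(list)
    for chunk in chunks:
        top = chunk.get("header_path", "").split(sep)[0]
        groups[top].append(chunk["id"])
    return dict(groups)


# Two made-up records shaped like the sample record.
sample = [
    '{"id": "c0001", "text": "Intro", "header_path": "Media", '
    '"start_line": 1, "end_line": 4}',
    '{"id": "c0002", "text": "Songs", "header_path": "Media / Anime / Music", '
    '"start_line": 5, "end_line": 9}',
]
chunks = parse_chunks(sample)
print(group_by_section(chunks))
```

For a real file, pass the open file object instead of the list, for example `with open("chunks.jsonl", encoding="utf-8") as f: chunks = parse_chunks(f)`.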
Contributions are welcome! Please see CONTRIBUTING.md for guidelines. By participating, you agree to follow our Code of Conduct.
MIT License. Free to use, modify, and share.
SmartChunk = “Don’t let your RAG cut sentences in half.” It’s the first step for any production-grade RAG pipeline: clean, coherent, AI-ready chunks.