SmartChunk is a lightweight, structure-aware semantic chunking toolkit designed to supercharge RAG (Retrieval-Augmented Generation) and LLM pipelines. Unlike naive splitters that break text arbitrarily, SmartChunk respects document structure (headings, lists, tables, code blocks) and semantic flow, ensuring cleaner, more coherent chunks.

ayush585/SmartChunk

SmartChunk 🧩

Structure-aware semantic chunking for RAG/LLMs (test.pypi.org/project/smartchunk/)

SmartChunk is a Python package + CLI that creates higher-quality chunks for Retrieval-Augmented Generation (RAG) pipelines. Instead of breaking text blindly, SmartChunk respects structure and meaning — no more chopped sentences, broken code blocks, or messy lists.

The result?

  • Better retrieval quality
  • Lower token costs
  • Chunks your LLM can actually understand


✨ Why SmartChunk?

Naive splitters cut text every N tokens. That causes:

  • ❌ Broken headings, lists, or tables
  • ❌ Incoherent fragments across paragraphs
  • ❌ Duplicate/boilerplate content bloating your index

SmartChunk fixes this by combining structure awareness + semantic similarity.
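To make the contrast concrete, here is a minimal sketch of a naive fixed-size splitter next to a block-aware packer. It is an illustration only, not SmartChunk's actual implementation: the function names (`naive_split`, `split_blocks`, `pack_blocks`) and the blank-line block heuristic are assumptions made for this example.

```python
FENCE = "`" * 3  # Markdown code-fence marker, built this way to keep the example readable


def naive_split(text: str, max_chars: int) -> list[str]:
    """Cut every max_chars characters, ignoring document structure."""
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]


def split_blocks(markdown: str) -> list[str]:
    """Split Markdown into blocks on blank lines, keeping fenced code whole."""
    blocks, current, in_fence = [], [], False
    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
        if not line.strip() and not in_fence:
            # A blank line outside a fence closes the current block.
            if current:
                blocks.append("\n".join(current))
                current = []
        else:
            current.append(line)
    if current:
        blocks.append("\n".join(current))
    return blocks


def pack_blocks(blocks: list[str], max_chars: int) -> list[str]:
    """Greedily pack whole blocks into chunks; a block is never cut.

    A single block longer than max_chars is emitted as its own oversized chunk.
    """
    chunks, current = [], ""
    for block in blocks:
        candidate = f"{current}\n\n{block}" if current else block
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = block
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


# Hypothetical sample document for the demonstration.
doc = "\n".join([
    "# Install", "",
    "Run the installer:", "",
    FENCE + "bash", "pip install smartchunk", "", "smartchunk --help", FENCE, "",
    "# Usage", "",
    "- chunk a file", "- fetch a URL",
])

naive_chunks = naive_split(doc, 40)
aware_chunks = pack_blocks(split_blocks(doc), 60)
```

In this sample, the naive splitter leaves the opening code fence in one chunk and the closing fence in another, while every block-aware chunk contains either a complete code block or none at all.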


🧠 Key Features

  • Structure-Aware Splitting: Never slices through a heading, list, table, or fenced code block.
  • Semantic Boundary Detection: Uses embeddings to find natural breakpoints between topics.
  • Noise & Duplication Guard: Strips headers/footers, removes near-duplicates, normalizes whitespace.
  • Flexible & Tunable: Control chunk size, overlap, and semantic sensitivity to fit your pipeline.
  • End-to-End Ready: From URL → parsed → cleaned → JSONL chunks in one command.
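As a rough illustration of the noise and duplication guard, the sketch below normalizes whitespace and drops near-duplicate chunks using Jaccard similarity over word shingles. The README does not say which technique SmartChunk uses internally, so the approach, the function names, and the `threshold` default are all assumptions.

```python
import re


def normalize(text: str) -> str:
    """Collapse runs of whitespace into single spaces and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()


def shingles(text: str, k: int = 3) -> set[tuple[str, ...]]:
    """Return the set of k-word shingles of the normalized, lowercased text."""
    words = normalize(text).lower().split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: size of the intersection over size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def drop_near_duplicates(chunks: list[str], threshold: float = 0.8) -> list[str]:
    """Keep the first occurrence of each chunk; skip later near-duplicates."""
    kept, kept_shingles = [], []
    for chunk in chunks:
        sig = shingles(chunk)
        if any(jaccard(sig, seen) >= threshold for seen in kept_shingles):
            continue
        kept.append(normalize(chunk))
        kept_shingles.append(sig)
    return kept


# Hypothetical input: a repeated footer with different whitespace, plus real content.
raw_chunks = [
    "Copyright 2024 Example Corp. All rights reserved.",
    "Copyright  2024 Example Corp.\nAll rights reserved.",
    "SmartChunk respects document structure.",
]
clean_chunks = drop_near_duplicates(raw_chunks)
```

The pairwise comparison is quadratic in the number of chunks, which is fine for a single document; a large index would more likely use a hashing scheme such as MinHash.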

⚡ Quickstart

1. Install

For hackathon/demo (TestPyPI):

pip install -i https://test.pypi.org/simple/ smartchunk

Once it is published to PyPI:

pip install smartchunk

2. Chunk a Document

smartchunk chunk README.md \
  --mode markdown \
  --max-tokens 500 \
  --overlap 100 \
  --semantic \
  --semantic-model all-MiniLM-L6-v2 \
  --format jsonl \
  --output chunks.jsonl
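The README says the semantic mode uses embeddings to find natural breakpoints between topics, but does not describe the algorithm. One common approach, sketched below under that assumption, is to embed adjacent sentences and start a new chunk wherever their cosine similarity drops below a threshold. To stay self-contained, the example uses a toy bag-of-words `embed` function as a stand-in for a real model such as all-MiniLM-L6-v2; the function names and the `threshold` value are hypothetical.

```python
import math
import re
from collections import Counter
from typing import Callable

Vector = dict[str, float]


def bag_of_words(sentence: str) -> Vector:
    """Toy stand-in for an embedding model: lowercase word counts."""
    return dict(Counter(re.findall(r"[a-z0-9]+", sentence.lower())))


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def semantic_breakpoints(
    sentences: list[str],
    embed: Callable[[str], Vector] = bag_of_words,
    threshold: float = 0.2,
) -> list[int]:
    """Return each index i where sentences[i] starts a new topic."""
    vectors = [embed(s) for s in sentences]
    return [
        i for i in range(1, len(vectors))
        if cosine(vectors[i - 1], vectors[i]) < threshold
    ]


def group_by_topic(sentences: list[str], breakpoints: list[int]) -> list[list[str]]:
    """Slice the sentence list at the given breakpoints."""
    groups, start = [], 0
    for bp in breakpoints + [len(sentences)]:
        if bp > start:
            groups.append(sentences[start:bp])
        start = bp
    return groups


# Hypothetical sample: two sentences about installing, two about tables.
sentences = [
    "Install the package with pip.",
    "The pip install command fetches the package.",
    "Tables are kept whole during chunking.",
    "Chunking never splits tables in half.",
]
breaks = semantic_breakpoints(sentences)
topics = group_by_topic(sentences, breaks)
```

A lower threshold produces fewer, larger topic groups; a higher one splits more aggressively. With a real embedding model the useful threshold range will differ from this toy example.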

3. Fetch & Chunk a URL

smartchunk fetch "https://en.wikipedia.org/wiki/Crayon_Shin-chan" \
  --semantic \
  --semantic-model all-MiniLM-L6-v2 \
  --format table

4. Compare with a Naive Splitter

smartchunk compare README.md --mode markdown --max-chars 800

Prints a terminal table comparing naive vs SmartChunk side-by-side.


📦 Example Output

Each line in the .jsonl output is a coherent chunk with rich metadata:

{
    "id": "c0033",
    "text": "###### Opening\n\n \n        [\n\n \n         edit\n\n \n        ]\n\n* Footage from Japanese opening 8 (\"PLEASURE\") but with completely different lyrics, to the melody of a techno remix of Japanese opening 3 (\"Ora wa Ninkimono\").Musical Director, Producer and English Director: World Worm Studios composerGary Gibbons",
    "header_path": "Media / Anime / Music / LUK Internacional dub / Opening",
    "start_line": 709,
    "end_line": 727
}
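Downstream code can consume this output with the standard library alone. The sketch below parses JSONL lines and groups chunk ids by the leading parts of `header_path`. The field names come from the example above; the helper names and the assumption that `header_path` levels are always separated by " / " are mine, and a heading that itself contains " / " would be split incorrectly.

```python
import json
from collections import defaultdict
from pathlib import Path
from typing import Iterable


def parse_jsonl(lines: Iterable[str]) -> list[dict]:
    """Parse one JSON object per non-empty line."""
    return [json.loads(line) for line in lines if line.strip()]


def load_chunks(path: str) -> list[dict]:
    """Read a JSONL file such as chunks.jsonl into a list of chunk dicts."""
    with Path(path).open(encoding="utf-8") as handle:
        return parse_jsonl(handle)


def group_by_section(chunks: list[dict], depth: int = 1) -> dict[str, list[str]]:
    """Group chunk ids by the first `depth` levels of their header_path."""
    sections: dict[str, list[str]] = defaultdict(list)
    for chunk in chunks:
        parts = chunk.get("header_path", "").split(" / ")
        sections[" / ".join(parts[:depth])].append(chunk["id"])
    return dict(sections)


# Hypothetical single-line record shaped like the example output.
sample_lines = [
    '{"id": "c0033", "text": "...", '
    '"header_path": "Media / Anime / Music / LUK Internacional dub / Opening", '
    '"start_line": 709, "end_line": 727}',
]
sample_chunks = parse_jsonl(sample_lines)
by_section = group_by_section(sample_chunks, depth=2)
```

Grouping by section this way makes it easy to attach the heading trail as retrieval metadata or to filter an index down to one part of a document.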

💻 CLI Overview

  • fetch → Fetch, parse & chunk a URL in one go
  • chunk → Chunk a local file
  • compare → Compare SmartChunk vs naive splitter (HTML report)
  • stream → Stream chunks from STDIN in real-time

Run smartchunk --help for full options.


🤝 Contributing

Contributions are welcome! Please see CONTRIBUTING.md for guidelines. By participating, you agree to follow our Code of Conduct.


🔑 License

MIT License. Free to use, modify, and share.


(In Simple Words) 📝

SmartChunk = “Don’t let your RAG cut sentences in half.” It’s the first step for any production-grade RAG pipeline: clean, coherent, AI-ready chunks.
