SmartChunk 🧩

Structure-aware semantic chunking for RAG/LLMs (test.pypi.org/project/smartchunk/)

SmartChunk is a Python package + CLI that creates higher-quality chunks for Retrieval-Augmented Generation (RAG) pipelines. Instead of breaking text blindly, SmartChunk respects structure and meaning — no more chopped sentences, broken code blocks, or messy lists.

The result? 👉 Better retrieval quality 👉 Lower token costs 👉 Chunks your LLM can actually understand

✨ Why SmartChunk?

Naive splitters cut text every N tokens. That causes:

❌ Broken headings, lists, or tables
❌ Incoherent fragments across paragraphs
❌ Duplicate/boilerplate content bloating your index

SmartChunk fixes this by combining structure awareness + semantic similarity.
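
To make the failure mode concrete, here is a minimal sketch of a naive fixed-size splitter. It is not SmartChunk code; `naive_split` and the sample text are made up for illustration, and it cuts on characters rather than tokens to keep the example dependency-free.

```python
def naive_split(text: str, size: int) -> list[str]:
    """Cut text into fixed-size character windows, ignoring all structure."""
    return [text[i:i + size] for i in range(0, len(text), size)]


# Hypothetical sample document: a heading followed by two sentences.
doc = "# Setup\n\nInstall the package first. Then run the CLI on your file."

pieces = naive_split(doc, 24)
for piece in pieces:
    # repr() makes the cut points visible, including newlines.
    print(repr(piece))
```

The first piece glues the heading to half a sentence and stops in the middle of the word "package". That is exactly the kind of fragment a structure-aware splitter is meant to avoid.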

🧠 Key Features

Structure-Aware Splitting: Never slices through a heading, list, table, or fenced code block.
Semantic Boundary Detection: Uses embeddings to find natural breakpoints between topics.
Noise & Duplication Guard: Strips headers/footers, removes near-duplicates, normalizes whitespace.
Flexible & Tunable: Control chunk size, overlap, and semantic sensitivity to fit your pipeline.
End-to-End Ready: From URL → parsed → cleaned → JSONL chunks in one command.
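
The idea behind semantic boundary detection can be sketched in a few lines. This is an illustration only, not SmartChunk's implementation: `embed` is a toy word-count stand-in for a real embedding model (such as the all-MiniLM-L6-v2 model used in the Quickstart), and `semantic_groups` and its `threshold` parameter are hypothetical names.

```python
import math
import re
from collections import Counter


def embed(sentence: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase word counts (an assumption)."""
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def semantic_groups(sentences: list[str], threshold: float = 0.2) -> list[list[str]]:
    """Start a new group wherever adjacent sentences fall below the threshold."""
    if not sentences:
        return []
    groups = [[sentences[0]]]
    for prev, cur in zip(sentences, sentences[1:]):
        if cosine(embed(prev), embed(cur)) < threshold:
            groups.append([cur])      # similarity dropped: topic boundary
        else:
            groups[-1].append(cur)    # same topic: keep sentences together
    return groups


sentences = [
    "The parser reads the markdown file.",
    "The parser then splits the markdown file into blocks.",
    "Bananas are rich in potassium.",
    "Bananas are also rich in fiber.",
]
groups = semantic_groups(sentences)
for group in groups:
    print(group)
```

With this sample the break lands between the parser sentences and the banana sentences. A real embedding model would catch topic shifts that share no vocabulary cues at all, which word counts cannot.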

⚡ Quickstart

1. Install

For hackathon/demo (TestPyPI):

pip install -i https://test.pypi.org/simple/ smartchunk

Once it is published to PyPI:

pip install smartchunk

2. Chunk a Document

smartchunk chunk README.md \
  --mode markdown \
  --max-tokens 500 \
  --overlap 100 \
  --semantic \
  --semantic-model all-MiniLM-L6-v2 \
  --format jsonl \
  --output chunks.jsonl
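
For intuition about how a size limit and an overlap interact, here is a generic sliding-window sketch using the same numbers as the command above. How SmartChunk itself counts tokens or applies overlap is not described in this README, so treat `window_with_overlap` and the placeholder tokens as assumptions.

```python
def window_with_overlap(tokens: list[str], max_tokens: int, overlap: int) -> list[list[str]]:
    """Slide a window of max_tokens over tokens, repeating `overlap` tokens each step."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must be >= 0 and smaller than max_tokens")
    step = max_tokens - overlap
    windows = []
    for start in range(0, len(tokens), step):
        windows.append(tokens[start:start + max_tokens])
        if start + max_tokens >= len(tokens):
            break  # the last window already reaches the end of the input
    return windows


# Placeholder tokens; a real pipeline would use a tokenizer here.
tokens = [f"t{i}" for i in range(1200)]
windows = window_with_overlap(tokens, max_tokens=500, overlap=100)
print([len(w) for w in windows])
```

Each window starts 400 tokens after the previous one, so the last 100 tokens of one window reappear at the start of the next. A larger overlap preserves more context across boundaries at the cost of more stored tokens.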

3. Fetch & Chunk a URL

smartchunk fetch "https://en.wikipedia.org/wiki/Crayon_Shin-chan" \
  --semantic \
  --semantic-model all-MiniLM-L6-v2 \
  --format table

4. Compare with a Naive Splitter

smartchunk compare README.md --mode markdown --max-chars 800

Prints a terminal table comparing naive vs SmartChunk side-by-side.

📦 Example Output

Each line in the .jsonl output is a coherent chunk with rich metadata:

{
  "id": "c0033",
  "text": "###### Opening\n\n \n        [\n\n \n         edit\n\n \n        ]\n\n* Footage from Japanese opening 8 (\"PLEASURE\") but with completely different lyrics, to the melody of a techno remix of Japanese opening 3 (\"Ora wa Ninkimono\").Musical Director, Producer and English Director: World Worm Studios composerGary Gibbons",
  "header_path": "Media / Anime / Music / LUK Internacional dub / Opening",
  "start_line": 709,
  "end_line": 727
}
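
A downstream script can consume this output with the standard library alone. The sketch below assumes the field names shown in the example above and that `header_path` components are separated by " / ". The sample records are made up, and `load_chunks` and `group_by_top_section` are hypothetical helpers, not part of SmartChunk.

```python
import json
from collections import defaultdict


def load_chunks(lines):
    """Parse JSONL input: one JSON object per non-empty line."""
    return [json.loads(line) for line in lines if line.strip()]


def group_by_top_section(chunks):
    """Group chunk ids by the first component of their header_path."""
    grouped = defaultdict(list)
    for chunk in chunks:
        top = chunk["header_path"].split(" / ")[0]
        grouped[top].append(chunk["id"])
    return dict(grouped)


# Made-up records that mimic the fields in the example output.
sample_lines = [
    '{"id": "c0001", "text": "Intro", "header_path": "Media / Anime", "start_line": 1, "end_line": 4}',
    '{"id": "c0002", "text": "Songs", "header_path": "Media / Anime / Music", "start_line": 5, "end_line": 9}',
    '{"id": "c0003", "text": "Story", "header_path": "Plot", "start_line": 10, "end_line": 20}',
]

chunks = load_chunks(sample_lines)
grouped = group_by_top_section(chunks)
print(grouped)

# With a real file, pass the open file object instead of the list:
# with open("chunks.jsonl", encoding="utf-8") as fh:
#     chunks = load_chunks(fh)
```

Keeping `header_path` and the line range alongside each chunk makes it possible to filter by section or point a retrieved chunk back to its place in the source document.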

💻 CLI Overview

fetch → Fetch, parse & chunk a URL in one go
chunk → Chunk a local file
compare → Compare SmartChunk vs naive splitter (HTML report)
stream → Stream chunks from STDIN in real-time

Run smartchunk --help for full options.

🤝 Contributing

Contributions are welcome! Please see CONTRIBUTING.md for guidelines. By participating, you agree to follow our Code of Conduct.

🔑 License

MIT License. Free to use, modify, and share.

(In Simple Words) 📝

SmartChunk = “Don’t let your RAG cut sentences in half.” It’s the first step for any production-grade RAG pipeline: clean, coherent, AI-ready chunks.
