From 1fd5ce676734897aa9e3a4567920cddf2bd01364 Mon Sep 17 00:00:00 2001
From: Justin Parker
Date: Tue, 5 Nov 2024 23:14:20 -0800
Subject: [PATCH] update docs

---
 README.md       | 25 +++++++++++++------------
 webui/README.md | 25 +++++++++++++------------
 2 files changed, 26 insertions(+), 24 deletions(-)

diff --git a/README.md b/README.md
index f4540ec..75d1e00 100644
--- a/README.md
+++ b/README.md
@@ -1,6 +1,6 @@
 # 🍱 semantic-chunking
 
-Semantically create chunks from large texts. Useful for workflows involving large language models (LLMs).
+NPM Package for Semantically creating chunks from large texts. Useful for workflows involving large language models (LLMs).
 
 ## Features
 
@@ -12,6 +12,16 @@ Semantically create chunks from large texts. Useful for workflows involving larg
 - Chunk prefix support for RAG workflows
 - Web UI for experimenting with settings
 
+## Semantic Chunking Workflow
+_how it works_
+
+1. **Sentence Splitting**: The input text is split into an array of sentences.
+2. **Embedding Generation**: A vector is created for each sentence using the specified ONNX model.
+3. **Similarity Calculation**: Cosine similarity scores are calculated for each sentence pair.
+4. **Chunk Formation**: Sentences are grouped into chunks based on the similarity threshold and max token size.
+5. **Chunk Rebalancing**: Optionally, similar adjacent chunks are combined into larger ones up to the max token size.
+6. **Output**: The final chunks are returned as an array of objects, each containing the properties described above.
+
 ## Installation
 
 ```bash
@@ -83,15 +93,6 @@ The output is an array of chunks, each containing the following properties:
 - `embedding`: Array - The embedding vector (if `returnEmbedding` is `true`).
 - `token_length`: Integer - The token length (if `returnTokenLength` is `true`).
 
-## Semantic Chunking Workflow
-
-1. **Sentence Splitting**: The input text is split into an array of sentences.
-2. **Embedding Generation**: A vector is created for each sentence using the specified ONNX model.
-3. **Similarity Calculation**: Cosine similarity scores are calculated for each sentence pair.
-4. **Chunk Formation**: Sentences are grouped into chunks based on the similarity threshold and max token size.
-5. **Chunk Rebalancing**: Optionally, similar adjacent chunks are combined into larger ones up to the max token size.
-6. **Output**: The final chunks are returned as an array of objects, each containing the properties described above.
-
 ## Examples
 
 Example 1: Basic usage with custom similarity threshold:
@@ -219,6 +220,7 @@ The behavior of the `chunkit` function can be finely tuned using several optiona
 | Xenova/all-MiniLM-L6-v2 | true | [https://huggingface.co/Xenova/all-MiniLM-L6-v2](https://huggingface.co/Xenova/all-MiniLM-L6-v2) | 23 MB |
 | Xenova/all-MiniLM-L6-v2 | false | [https://huggingface.co/Xenova/all-MiniLM-L6-v2](https://huggingface.co/Xenova/all-MiniLM-L6-v2) | 90.4 MB |
 | Xenova/paraphrase-multilingual-MiniLM-L12-v2 | true | [https://huggingface.co/Xenova/paraphrase-multilingual-MiniLM-L12-v2](https://huggingface.co/Xenova/paraphrase-multilingual-MiniLM-L12-v2) | 118 MB |
+| thenlper/gte-base | false | [https://huggingface.co/thenlper/gte-base](https://huggingface.co/thenlper/gte-base) | 436 MB |
 | Xenova/all-distilroberta-v1 | true | [https://huggingface.co/Xenova/all-distilroberta-v1](https://huggingface.co/Xenova/all-distilroberta-v1) | 82.1 MB |
 | Xenova/all-distilroberta-v1 | false | [https://huggingface.co/Xenova/all-distilroberta-v1](https://huggingface.co/Xenova/all-distilroberta-v1) | 326 MB |
 | BAAI/bge-base-en-v1.5 | false | [https://huggingface.co/BAAI/bge-base-en-v1.5](https://huggingface.co/BAAI/bge-base-en-v1.5) | 436 MB |
@@ -241,8 +243,7 @@ The Semantic Chunking Web UI allows you to experiment with the chunking paramete
 - Example texts for testing
 - Dark mode interface
 
-
-
+![Semantic Chunking Web UI](./img/semantic-chunking_web-ui.gif)
 
 ---
 
diff --git a/webui/README.md b/webui/README.md
index 3cc0ddc..cc2fb30 100644
--- a/webui/README.md
+++ b/webui/README.md
@@ -1,6 +1,6 @@
 # 🍱 Semantic Chunking Web UI
 
-A web-based interface for experimenting with and tuning Semantic Chunking settings. This tool provides a visual way to test and configure the `semantic-chunking` library's settings to get optimal results for your specific use case.
+A web-based interface for experimenting with and tuning Semantic Chunking settings. This tool provides a visual way to test and configure the `semantic-chunking` library's settings to get optimal results for your specific use case. Once you've found the best settings, you can generate code to implement them in your project.
 
 ## Features
 
@@ -13,6 +13,8 @@ A web-based interface for experimenting with and tuning Semantic Chunking settin
 - Example texts for testing
 - Dark mode interface
 
+![semantic-chunking_web-ui](../img/semantic-chunking_web-ui.gif)
+
 ## Getting Started
 
 ### Prerequisites
@@ -22,36 +24,30 @@ A web-based interface for experimenting with and tuning Semantic Chunking settin
 ### Installation
 
 1. Clone the repository:
----bash
+```bash
 git clone https://github.com/jparkerweb/semantic-chunking.git
 ```
 
-
 2. Navigate to the webui directory:
----bash
+```bash
 cd semantic-chunking/webui
 ```
 
-
 3. Install dependencies:
----bash
+```bash
 npm install
 ```
 
 4. Start the server:
----bash
+```bash
 npm start
 ```
 
-
 5. Open your browser and visit:
----bash
+```bash
 http://localhost:3000
 ```
 
----
-
-
 ## Usage
 
 ### Basic Controls
@@ -104,3 +100,8 @@ The web UI is built with:
 ## License
 
 This project is licensed under the MIT License - see the LICENSE file for details.
+
+## Appreciation
+
+If you enjoy this package please consider sending me a tip to support my work 😀
+# [🍵 tip me here](https://ko-fi.com/jparkerweb)
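
Reviewer note: the six-step "Semantic Chunking Workflow" that this patch moves above Installation in README.md can be illustrated with a small, dependency-free JavaScript sketch. This is not the package's implementation. It assumes that similarity is compared between adjacent sentences, it swaps the ONNX embedding model for a toy hashed bag-of-words vector, it approximates token counts by whitespace-separated words, and it skips the optional rebalancing step. Every function name, option default, and the `text` output field here is hypothetical.

```javascript
// Illustrative sketch only: NOT the semantic-chunking implementation.
// All names and defaults below are hypothetical.

// Step 1: naive sentence splitting on ., ! or ? followed by whitespace.
function splitSentences(text) {
  return text
    .split(/(?<=[.!?])\s+/)
    .map((s) => s.trim())
    .filter((s) => s.length > 0);
}

// Step 2 (stand-in): hashed bag-of-words vector instead of an ONNX model.
function toyEmbed(sentence, dims = 64) {
  const vec = new Array(dims).fill(0);
  const words = sentence.toLowerCase().match(/[a-z0-9]+/g) || [];
  for (const word of words) {
    let hash = 0;
    for (const ch of word) hash = (hash * 31 + ch.charCodeAt(0)) >>> 0;
    vec[hash % dims] += 1;
  }
  return vec;
}

// Step 3: cosine similarity between two equal-length vectors.
function cosineSimilarity(a, b) {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Rough token estimate: whitespace-separated words (an assumption).
function countTokens(text) {
  return text.split(/\s+/).filter(Boolean).length;
}

// Steps 4 and 6: group adjacent sentences while they stay similar enough
// and the chunk stays within the token budget, then return chunk objects.
function sketchChunks(text, { similarityThreshold = 0.3, maxTokenSize = 50 } = {}) {
  const sentences = splitSentences(text);
  if (sentences.length === 0) return [];
  const embeddings = sentences.map((s) => toyEmbed(s));
  const chunks = [];
  let current = [sentences[0]];
  for (let i = 1; i < sentences.length; i++) {
    const similarity = cosineSimilarity(embeddings[i - 1], embeddings[i]);
    const candidate = current.concat(sentences[i]).join(" ");
    if (similarity >= similarityThreshold && countTokens(candidate) <= maxTokenSize) {
      current.push(sentences[i]);
    } else {
      chunks.push(current.join(" "));
      current = [sentences[i]];
    }
  }
  chunks.push(current.join(" "));
  return chunks.map((chunkText) => ({
    text: chunkText,
    token_length: countTokens(chunkText),
  }));
}
```

Calling `sketchChunks(someText, { similarityThreshold: 0.5, maxTokenSize: 100 })` returns an array of chunk objects. Raising the threshold tends to produce more, smaller chunks, which is the same trade-off the Web UI lets you explore with the real library.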