From 18933c325b9ea4578117a0276ed1bf8ae3cf6370 Mon Sep 17 00:00:00 2001
From: Titusz Pan
Date: Mon, 19 Aug 2024 19:03:04 +0200
Subject: [PATCH] docs: update README to clarify text chunking process and add visual representation of ISCC generation process

---
 README.md | 18 +++++++++++++++++-
 1 file changed, 17 insertions(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 5bb493c..65fa11f 100644
--- a/README.md
+++ b/README.md
@@ -135,13 +135,29 @@ options:
 
 `iscc-sct` employs the following process:
 
-1. Splits the text into semantically coherent chunks.
+1. Splits the text into overlapping chunks (using syntactically sensible breakpoints).
 1. Uses a pre-trained deep learning model for text embedding.
 1. Generates feature vectors capturing essential characteristics of the chunks.
 1. Aggregates these vectors and binarizes them to produce a Semantic Text-Code.
+1. Prefixes the binarized vector with the matching ISCC header, encodes it with base32, and adds the
+   "ISCC:" prefix.
 
 This process ensures robustness to variations and translations, enabling cross-lingual matching.
 
+Here's a visual representation of the ISCC Semantic Text-Code generation process:
+
+```mermaid
+graph TD
+    A[Input Text] --> B[Split into Overlapping Chunks]
+    B --> C[Create Multilingual Vector Embeddings per Chunk]
+    C --> D[Calculate Document Vector using Mean Pooling]
+    D --> E[Binarize Document Vector]
+    E --> F[Prefix with ISCC Header]
+    F --> G[Encode with Base32]
+    G --> H[Prefix with 'ISCC:']
+    H --> I[Final ISCC Semantic Text-Code]
+```
+
 ## Development and Contributing
 
 We welcome contributions to enhance the capabilities and efficiency of this proof of concept. For
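
As a rough illustration of the last steps this patch documents (mean pooling, binarization, header prefix, base32 encoding, and the "ISCC:" prefix), the Python sketch below walks through them on toy data. It is not the `iscc-sct` implementation. The function names, the two-byte `PLACEHOLDER_HEADER`, the zero threshold used for binarization, and the unpadded base32 output are all assumptions made for illustration. Chunking and the embedding model are left out; the chunk vectors are hand-written stand-ins for real embeddings.

```python
import base64
from typing import List, Sequence

# Hypothetical placeholder: the real header bytes are defined by the ISCC
# specification and are not reproduced here.
PLACEHOLDER_HEADER = b"\x00\x00"


def mean_pool(chunk_vectors: Sequence[Sequence[float]]) -> List[float]:
    """Average the chunk vectors element-wise into one document vector."""
    if not chunk_vectors:
        raise ValueError("need at least one chunk vector")
    count = len(chunk_vectors)
    return [sum(column) / count for column in zip(*chunk_vectors)]


def binarize(vector: Sequence[float]) -> bytes:
    """Map each dimension to one bit (1 if >= 0, else 0), packed MSB first.

    Thresholding at zero is an assumption of this sketch.
    """
    if not vector or len(vector) % 8:
        raise ValueError("vector length must be a non-zero multiple of 8")
    bits = "".join("1" if value >= 0 else "0" for value in vector)
    return int(bits, 2).to_bytes(len(bits) // 8, "big")


def encode_code(digest: bytes, header: bytes = PLACEHOLDER_HEADER) -> str:
    """Prefix the digest with a header, base32-encode it, and add 'ISCC:'."""
    body = base64.b32encode(header + digest).decode("ascii").rstrip("=")
    return "ISCC:" + body


# Toy 8-dimensional "embeddings" for two chunks (real embeddings are far larger).
chunk_vectors = [
    [0.5, -0.2, 0.1, -0.9, 0.3, 0.4, -0.1, -0.6],
    [0.1, -0.4, -0.3, -0.1, 0.5, -0.2, 0.3, -0.2],
]
document_vector = mean_pool(chunk_vectors)
digest = binarize(document_vector)
print(encode_code(digest))
```

Because the header here is a placeholder, the printed string only has the shape of an ISCC and is not a valid Semantic Text-Code. Similar texts should yield document vectors with mostly matching signs, so their binarized digests differ in only a few bits.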