Tesseract spawning 3 threads, no matter how many CPU cores the system has #57


Open

nodesocket opened this issue Mar 19, 2025 · 7 comments

nodesocket commented Mar 19, 2025

It looks like tesseract (the OCR engine) is only spawning 3 threads, even though my machine has 8 cores. Is this an optimization that could be made, i.e. spawning n(cores) + 1 tesseract threads/processes when creating a new RAG?

@nodesocket (Author)

Even worse, pdftoppm looks to be single-threaded. Is that a known limitation?

DonTizi (Owner) commented Mar 20, 2025

Hi, yes, thanks for letting me know! This is exactly why I appreciate feedback—it's such a small oversight, but an important one.

I've just applied the new implementation, which is now in the pull request. I should be merging it for the next release, v0.1.33, which will be out tomorrow or the day after. Let me know what you think.

This implementation includes several optimizations (sketched in code below):

Parallel PDF Processing:

  • Uses pdfinfo to determine the page count
  • For larger PDFs, processes each page in parallel
  • Implements a semaphore to limit concurrent processes

Parallel OCR Processing:

  • Processes multiple images simultaneously
  • Determines the optimal number of workers using runtime.NumCPU() + 1
  • Implements a worker pool with a semaphore to control resource usage

Memory Management:

  • Uses a semaphore to limit the number of concurrent processes
  • Collects results in an ordered way to preserve document flow

Error Handling:

  • Gracefully handles failures in individual page processing
  • Provides informative warning messages
  • Continues processing even if some pages fail

This implementation should significantly speed up OCR processing on multi-core machines while optimizing memory usage.
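
A minimal sketch of the design described above (my illustration, not the merged PR code; the ocrPDF helper, temp-file paths, and warning text are invented for the example):

```go
package main

import (
	"fmt"
	"os/exec"
	"regexp"
	"runtime"
	"strconv"
	"sync"
)

// pageCount parses "Pages: N" out of pdfinfo's output.
func pageCount(pdf string) (int, error) {
	out, err := exec.Command("pdfinfo", pdf).Output()
	if err != nil {
		return 0, err
	}
	m := regexp.MustCompile(`Pages:\s+(\d+)`).FindSubmatch(out)
	if m == nil {
		return 0, fmt.Errorf("pdfinfo: no page count for %s", pdf)
	}
	return strconv.Atoi(string(m[1]))
}

// ocrPDF renders and OCRs every page in parallel, capped by a semaphore,
// collecting results by page index so document order is preserved.
func ocrPDF(pdf string) ([]string, error) {
	n, err := pageCount(pdf)
	if err != nil {
		return nil, err
	}
	workers := runtime.NumCPU() + 1     // worker count per the list above
	sem := make(chan struct{}, workers) // semaphore limiting concurrent processes
	texts := make([]string, n)          // one slot per page keeps output ordered
	var wg sync.WaitGroup

	for page := 1; page <= n; page++ {
		wg.Add(1)
		go func(p int) {
			defer wg.Done()
			sem <- struct{}{}        // acquire a worker slot
			defer func() { <-sem }() // release it when done

			img := fmt.Sprintf("/tmp/page-%d", p) // pdftoppm adds the .png suffix
			// Render only page p, then OCR it; a failure skips the page
			// with a warning but never aborts the whole document.
			if err := exec.Command("pdftoppm", "-png", "-singlefile",
				"-f", strconv.Itoa(p), "-l", strconv.Itoa(p), pdf, img).Run(); err != nil {
				fmt.Printf("warning: rendering page %d failed: %v\n", p, err)
				return
			}
			out, err := exec.Command("tesseract", img+".png", "stdout").Output()
			if err != nil {
				fmt.Printf("warning: OCR of page %d failed: %v\n", p, err)
				return
			}
			texts[p-1] = string(out)
		}(page)
	}
	wg.Wait()
	return texts, nil
}

func main() {
	texts, err := ocrPDF("sample.pdf")
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Printf("OCR'd %d pages\n", len(texts))
}
```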

nodesocket (Author) commented Mar 20, 2025

@DonTizi awesome, I look forward to trying this out. I've been running rlama rag mistral:7b my-docs /mnt/a/documents for over 24 hours on a very large set of documents. With the previous code it doesn't come close to hitting system CPU or memory limits (system load is around 1.25) on an 8-core system.

@nodesocket (Author)

@DonTizi is there a way to specify the number of parallel workers, or to default to (CPU cores + 1) in this step instead of 3?

⚠️ Could not use snowflake-arctic-embed2 for embeddings: failed to generate embedding: {"error":"model \"snowflake-arctic-embed2\" not found, try pulling it first"} (status: 404)
Attempting to pull snowflake-arctic-embed2 automatically...
Pulling snowflake-arctic-embed2 model (this may take a while)...
pulling manifest
pulling 8c625c9569c3... 100% ▕████████████████▏ 1.2 GB
pulling 58d1e17ffe51... 100% ▕████████████████▏  11 KB
pulling 959bce8be135... 100% ▕████████████████▏ 337 B
verifying sha256 digest
writing manifest
success
Generating embeddings: 109613/109613 chunks processed (100%)
Successfully generated embeddings for 109613 chunks using 3 parallel workers
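
For what it's worth, the default/override logic being asked for could look something like this sketch (the --workers flag name is hypothetical, not an existing rlama option):

```go
package main

import (
	"flag"
	"fmt"
	"runtime"
)

// embeddingWorkers picks the worker count: an explicit override wins,
// otherwise scale with the machine instead of a hard-coded 3.
func embeddingWorkers(override int) int {
	if override > 0 {
		return override
	}
	return runtime.NumCPU() + 1
}

func main() {
	workers := flag.Int("workers", 0, "parallel embedding workers (0 = auto)")
	flag.Parse()
	fmt.Printf("using %d parallel workers\n", embeddingWorkers(*workers))
}
```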

DonTizi (Owner) commented Mar 20, 2025

I’ll add it tonight. I see that you have a large knowledge base—if that’s correct, can I ask how many documents you’re trying to feed? Also, which LLM will you be using?

I’m currently working on a robust vector store to optimize performance for large knowledge bases, and discovering its current weaknesses would greatly help me with your use case!

You can leave your feedback here or at rmelbouci@rlama.dev—it would be greatly appreciated!

@nodesocket (Author)

@DonTizi it's 2,182 PDFs totaling 6 GB. The LLM is mistral:7b.

DonTizi (Owner) commented Mar 20, 2025

OK, I recommend experimenting with the context size when running your RAG, using --context-size=xx.

Smaller models (below 32B) struggle with a context size of 20 or above.

Try values in the range of 5 to 15 to see how they impact performance.
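
For example (the run subcommand here is my assumption of the invocation; adjust the RAG name to yours):

```
rlama run my-docs --context-size=10
```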

In the version I'm working on, I'm trying to implement agentic RAG so it retrieves only the best contexts, rather than passing many of them to smaller LLMs.
