Tesseract spawning 3 threads, no matter how many CPU cores the system has #57


Open

nodesocket opened this issue Mar 19, 2025 · 7 comments

nodesocket commented Mar 19, 2025

It looks like tesseract (the OCR engine) is only spawning 3 threads, even though my machine has 8 cores. Is this an optimization that could be made, i.e. spawning n(cores) + 1 tesseract threads/processes when creating a new RAG?

@nodesocket (Author)

Even worse, pdftoppm looks to be single-threaded. Is that a known limitation?

DonTizi (Owner) commented Mar 20, 2025

Hi, yes, thanks for letting me know! This is exactly why I appreciate feedback—it's such a small oversight, but an important one.

I've just applied the new implementation, which is now in the pull request. I should be merging it for the next release, v0.1.33, which will be out tomorrow or the day after. Let me know what you think.

This implementation includes several optimizations (sketched in code below):

Parallel PDF Processing:

  • Uses pdfinfo to determine the page count
  • For larger PDFs, processes each page in parallel
  • Implements a semaphore to limit concurrent processes

Parallel OCR Processing:

  • Processes multiple images simultaneously
  • Determines the optimal number of workers using runtime.NumCPU() + 1
  • Implements a worker pool with a semaphore to control resource usage

Memory Management:

  • Uses a semaphore to limit the number of concurrent processes
  • Collects results in an ordered way to preserve document flow

Error Handling:

  • Gracefully handles failures in individual page processing
  • Provides informative warning messages
  • Continues processing even if some pages fail

This implementation should significantly speed up OCR processing on multi-core machines while optimizing memory usage.
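
A minimal sketch of the design described above (my illustration, not the merged PR code; the ocrPDF helper, temp-file paths, and warning text are invented for the example):

```go
package main

import (
	"fmt"
	"os/exec"
	"regexp"
	"runtime"
	"strconv"
	"sync"
)

// pageCount parses "Pages: N" out of pdfinfo's output.
func pageCount(pdf string) (int, error) {
	out, err := exec.Command("pdfinfo", pdf).Output()
	if err != nil {
		return 0, err
	}
	m := regexp.MustCompile(`Pages:\s+(\d+)`).FindSubmatch(out)
	if m == nil {
		return 0, fmt.Errorf("pdfinfo: no page count for %s", pdf)
	}
	return strconv.Atoi(string(m[1]))
}

// ocrPDF renders and OCRs every page in parallel, capped by a semaphore,
// collecting results by page index so document order is preserved.
func ocrPDF(pdf string) ([]string, error) {
	n, err := pageCount(pdf)
	if err != nil {
		return nil, err
	}
	workers := runtime.NumCPU() + 1     // worker count per the list above
	sem := make(chan struct{}, workers) // semaphore limiting concurrent processes
	texts := make([]string, n)          // one slot per page keeps output ordered
	var wg sync.WaitGroup

	for page := 1; page <= n; page++ {
		wg.Add(1)
		go func(p int) {
			defer wg.Done()
			sem <- struct{}{}        // acquire a worker slot
			defer func() { <-sem }() // release it when done

			img := fmt.Sprintf("/tmp/page-%d", p) // pdftoppm adds the .png suffix
			// Render only page p, then OCR it; a failure skips the page
			// with a warning but never aborts the whole document.
			if err := exec.Command("pdftoppm", "-png", "-singlefile",
				"-f", strconv.Itoa(p), "-l", strconv.Itoa(p), pdf, img).Run(); err != nil {
				fmt.Printf("warning: rendering page %d failed: %v\n", p, err)
				return
			}
			out, err := exec.Command("tesseract", img+".png", "stdout").Output()
			if err != nil {
				fmt.Printf("warning: OCR of page %d failed: %v\n", p, err)
				return
			}
			texts[p-1] = string(out)
		}(page)
	}
	wg.Wait()
	return texts, nil
}

func main() {
	texts, err := ocrPDF("sample.pdf")
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Printf("OCR'd %d pages\n", len(texts))
}
```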

nodesocket (Author) commented Mar 20, 2025

@DonTizi awesome, I look forward to trying this out. I've been running rlama rag mistral:7b my-docs /mnt/a/documents for over 24 hours on a very large set of documents. With the previous code it doesn't come close to hitting system CPU or memory limits (system load is around 1.25) on an 8-core system.

@nodesocket (Author)

@DonTizi is there a way to specify the number of parallel workers, or to default to (CPU cores + 1) in this step instead of 3?

⚠️ Could not use snowflake-arctic-embed2 for embeddings: failed to generate embedding: {"error":"model \"snowflake-arctic-embed2\" not found, try pulling it first"} (status: 404)
Attempting to pull snowflake-arctic-embed2 automatically...
Pulling snowflake-arctic-embed2 model (this may take a while)...
pulling manifest
pulling 8c625c9569c3... 100% ▕████████████████▏ 1.2 GB
pulling 58d1e17ffe51... 100% ▕████████████████▏  11 KB
pulling 959bce8be135... 100% ▕████████████████▏ 337 B
verifying sha256 digest
writing manifest
success
Generating embeddings: 109613/109613 chunks processed (100%)
Successfully generated embeddings for 109613 chunks using 3 parallel workers
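
For what it's worth, the default/override logic being asked for could look something like this sketch (the --workers flag name is hypothetical, not an existing rlama option):

```go
package main

import (
	"flag"
	"fmt"
	"runtime"
)

// embeddingWorkers picks the worker count: an explicit override wins,
// otherwise scale with the machine instead of a hard-coded 3.
func embeddingWorkers(override int) int {
	if override > 0 {
		return override
	}
	return runtime.NumCPU() + 1
}

func main() {
	workers := flag.Int("workers", 0, "parallel embedding workers (0 = auto)")
	flag.Parse()
	fmt.Printf("using %d parallel workers\n", embeddingWorkers(*workers))
}
```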

DonTizi (Owner) commented Mar 20, 2025

I’ll add it tonight. I see that you have a large knowledge base—if that’s correct, can I ask how many documents you’re trying to feed? Also, which LLM will you be using?

I’m currently working on a robust vector store to optimize performance for large knowledge bases, and discovering its current weaknesses would greatly help me with your use case!

You can leave your feedback here or at rmelbouci@rlama.dev—it would be greatly appreciated!

@nodesocket (Author)

@DonTizi it's 2,182 PDFs totaling 6 GB. The LLM is mistral:7b.

DonTizi (Owner) commented Mar 20, 2025

OK, I recommend experimenting with the context size when running your RAG, using --context-size=xx.

Smaller models (below 32B) struggle with a context size of 20 or above.

Try values in the range of 5 to 15 to see how they impact performance.
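
For example (the run subcommand here is my assumption of the invocation; adjust the RAG name to yours):

```
rlama run my-docs --context-size=10
```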

In the version I'm working on, I'm trying to implement agentic RAG so it retrieves only the best contexts, rather than passing many of them to smaller LLMs.
