Considering how to analyse book collections, Large Language Model style
- Get all the good spaceship names
- Finding old stories you have forgotten
- References found
- State of the art models for a current mineral exploration problem
- Consider an extensive ebook library that is decades old
- text
- html
- rtf
- prc
- mobipocket
- epub
- pdf
- cbz / cbr
- Not all digital native
- Where text extraction fails, use OCR - what is the open-source state of the art?
- On Windows, multithreaded robocopy is useful for mirroring the library: robocopy "C:\Users\bananasplits\OneDrive\Calibre Books" "D:\books\Calibre Books" /MIR /MT:16
- Fiction
- Non-fiction
- Games
- Academic
- Papers etc
- Code
- Many languages
- Do you want covers?
- Diagrams from non-fiction
- Pictures [or comic strips] from games
- Actual comics
- 2000 AD, Image, Humble Bundles have CBR/CBZ possibilities
- Calibre https://calibre-ebook.com/
- https://huggingface.co/learn/cookbook/en/rag_llamaindex_librarian
- https://github.com/huggingface/cookbook/blob/main/notebooks/en/rag_llamaindex_librarian.ipynb
- https://www.bitsgalore.org/2023/03/09/extracting-text-from-epub-files-in-python
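A minimal sketch of EPUB text extraction using only the standard library, on the assumption that an EPUB is a zip archive of (X)HTML files. It is not the method from the linked article. The function name `epub_to_text` is hypothetical, and files are read in archive order, which may differ from the book's reading order.

```python
import zipfile
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collect visible text, skipping <script> and <style> content."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def epub_to_text(path):
    """Return the text of every (X)HTML member of an EPUB, in archive order."""
    chunks = []
    with zipfile.ZipFile(path) as archive:
        for name in archive.namelist():
            if name.lower().endswith((".xhtml", ".html", ".htm")):
                parser = _TextCollector()
                parser.feed(archive.read(name).decode("utf-8", errors="replace"))
                chunks.append(" ".join(parser.parts))
    return "\n".join(chunks)
```

Older formats such as .mobi or .prc would still need converting first, and scanned books have no text to extract this way.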
- Clearly will need some sort of Retrieval Augmented Generation
- Currently a folder with metadata, db and a folder per 'author' - which is not important, but is the structure
- Is an in-memory index OK if run on a decent server?
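To make the retrieval step concrete, here is a toy in-memory sketch. It assumes plain word-count vectors and cosine similarity as a stand-in for real embeddings; a real setup would use an embedding model and a vector store. All function names are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query, chunks, k=2):
    """Return up to k chunks most similar to the query, dropping zero scores."""
    query_vec = Counter(tokenize(query))
    scored = [(cosine(query_vec, Counter(tokenize(c))), c) for c in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for score, c in scored[:k] if score > 0]


def build_prompt(question, chunks, k=2):
    """Augment the question with retrieved context before sending it to a model."""
    context = "\n".join(retrieve(question, chunks, k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"
```

Everything here lives in ordinary Python lists, which is the "in memory" case; whether that scales to the whole library is the open question.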
- A books library example
.azw: 12
.azw3: 234
.azw4: 2
.db: 2
.doc: 1
.docx: 1
.epub: 1567
.htmlz: 6
.jpg: 9829
.json: 1
.lit: 2
.lrf: 2
.mobi: 10054
.opf: 10891
.original_epub: 20
.pdb: 9
.pdf: 102
.prc: 315
.rar: 18
.rtf: 9
.txt: 28
.zip: 17
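A listing like the one above can be produced with a short script. This is a sketch, not necessarily how the counts were originally generated; the function names are hypothetical.

```python
from collections import Counter
from pathlib import Path


def count_extensions(root):
    """Count files under root by lower-cased extension."""
    counts = Counter()
    for path in Path(root).rglob("*"):
        if path.is_file():
            counts[path.suffix.lower() or "(none)"] += 1
    return counts


def format_counts(counts):
    """Format the counts one per line, sorted by extension."""
    return "\n".join(f"{ext}: {n}" for ext, n in sorted(counts.items()))
```

For example, `print(format_counts(count_extensions("D:/books/Calibre Books")))`, using the destination folder from the robocopy command.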
- use this for conversion
- ubuntu has an apt package
- install https://ollama.com/download/linux
- curl -fsSL https://ollama.com/install.sh | sh
- models: https://ollama.com/library
- operation ollama/ollama#707
>>> Creating ollama user...
>>> Adding ollama user to render group...
>>> Adding ollama user to video group...
>>> Adding current user to ollama group...
>>> Creating ollama systemd service...
>>> Enabling and starting ollama service...
Created symlink /etc/systemd/system/default.target.wants/ollama.service → /etc/systemd/system/ollama.service.
>>> The Ollama API is now available at 127.0.0.1:11434.
>>> Install complete. Run "ollama" from the command line.
WARNING: No NVIDIA/AMD GPU detected. Ollama will run in CPU-only mode.
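Once the service is up, the API at 127.0.0.1:11434 can be called from Python. A minimal sketch using only the standard library, assuming the `llama3` model has already been pulled; the helper names are hypothetical, and the request is kept non-streaming for simplicity.

```python
import json
import urllib.request

OLLAMA_URL = "http://127.0.0.1:11434"  # address reported by the install log


def build_generate_request(prompt, model="llama3", base_url=OLLAMA_URL):
    """Build (but do not send) a non-streaming request to /api/generate."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        f"{base_url}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, model="llama3", timeout=300):
    """Send the request and return the model's text; needs a running server."""
    request = build_generate_request(prompt, model)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read())["response"]
```

In CPU-only mode, as in the warning above, responses can be slow, hence the generous default timeout.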
- https://docs.llamaindex.ai/en/v0.9.48/examples/data_connectors/simple_directory_reader.html
- Readers https://github.com/run-llama/llama_index/tree/main/llama-index-integrations/readers/llama-index-readers-file
- Vector Stores https://docs.llamaindex.ai/en/stable/module_guides/storing/vector_stores/
- https://docs.trychroma.com/
- is multimodal
- installation https://docs.trychroma.com/getting-started
- vector store installation pip install llama-index-vector-stores-chroma
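A sketch of how the pieces above might fit together: SimpleDirectoryReader loads files, Chroma persists the vectors, and an Ollama model produces the embeddings. It assumes `llama-index`, `chromadb`, `llama-index-vector-stores-chroma` and `llama-index-embeddings-ollama` are installed. The paths, collection name and embedding model are hypothetical, and which file types actually load depends on the readers installed.

```python
def build_book_index(
    books_dir="./books",          # hypothetical folder of already-converted files
    chroma_dir="./chroma_db",     # hypothetical on-disk location for Chroma
    collection_name="books",      # hypothetical collection name
    embed_model_name="llama2",    # assumption: any model pulled into Ollama
):
    """Load documents, embed them via Ollama, and store vectors in Chroma."""
    # Imports are local so the module can be imported without these packages.
    import chromadb
    from llama_index.core import SimpleDirectoryReader, StorageContext, VectorStoreIndex
    from llama_index.embeddings.ollama import OllamaEmbedding
    from llama_index.vector_stores.chroma import ChromaVectorStore

    client = chromadb.PersistentClient(path=chroma_dir)
    collection = client.get_or_create_collection(collection_name)
    vector_store = ChromaVectorStore(chroma_collection=collection)
    storage_context = StorageContext.from_defaults(vector_store=vector_store)

    documents = SimpleDirectoryReader(books_dir, recursive=True).load_data()
    embed_model = OllamaEmbedding(
        model_name=embed_model_name, base_url="http://127.0.0.1:11434"
    )
    return VectorStoreIndex.from_documents(
        documents, storage_context=storage_context, embed_model=embed_model
    )
```

Passing `embed_model` explicitly matters here, since the LlamaIndex default would otherwise reach for a hosted embedding service rather than the local Ollama server.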
- embedding speed https://www.youtube.com/watch?v=7FvdwwvqrD4&ab_channel=JohnnyCode
- examples -
- ?
- Text
- Multimodal - Llava?
- Model Storage ollama/ollama#1737
- ollama pull llama2
- ollama run llama3
- etc
- ollama/ollama#707
- Hey all, not seeing ollama in the output of lsof could be a permissions issue. When you install ollama on linux via the install script it creates a service user for the background process. You may need to stop the process via systemctl in that case.
Here is some troubleshooting steps that will hopefully help:
Stop the background service:
sudo systemctl stop ollama
ollama serve #to restart
Run lsof as sudo to rule out permissions issues:
sudo lsof -i :11434
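A quick complementary check from Python: this sketch only tests whether anything accepts TCP connections on the port. Unlike lsof it cannot say which process owns it. The function name is hypothetical.

```python
import socket


def port_is_open(host="127.0.0.1", port=11434, timeout=1.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

If this returns True after `sudo systemctl stop ollama`, something else is still bound to the port.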
- multi GPU approach
- https://github.com/UKPLab/sentence-transformers/blob/master/examples/applications/computing-embeddings/computing_embeddings_multi_gpu.py
- Chroma will do multimodal
- https://thenewstack.io/exploring-chroma-the-open-source-vector-database-for-llms/
- sqlite3 linux package can be used to check pragma basics etc.
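The same basic checks can be run from Python's built-in sqlite3 module. A minimal sketch, assuming the target is any SQLite file, for instance a vector store's database or Calibre's; the function name and the choice of pragmas are illustrative.

```python
import sqlite3
from pathlib import Path


def sqlite_basics(db_path):
    """Return a few PRAGMA values and the table names, opening read-only."""
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        info = {}
        for pragma in ("integrity_check", "page_count", "page_size", "journal_mode"):
            info[pragma] = conn.execute(f"PRAGMA {pragma}").fetchone()[0]
        info["tables"] = [
            row[0]
            for row in conn.execute(
                "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
            )
        ]
        return info
    finally:
        conn.close()
```

Opening with `mode=ro` avoids accidentally modifying a database that another process is using.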
- discussion https://www.reddit.com/r/LocalLLaMA/comments/1brk3qo/pdf_to_json_for_rag_through_multimodal_models/
- Llava?
- layoutlm https://huggingface.co/docs/transformers/en/model_doc/layoutlm
- grobid https://grobid.readthedocs.io/en/latest/Introduction/
- LlamaParse https://medium.com/@salujav4/parsing-pdfs-text-image-and-tables-for-rag-based-applications-using-llamaparse-llamaindex-0f4c5ed50fb7
- Unstructured https://github.com/Unstructured-IO
- https://github.com/parsee-ai/parsee-pdf-reader
- TableTransformer
- PyMuPDF
- PDFMiner.six
- Camelot
- PyPDF2
- pikepdf https://github.com/RichardScottOZ/pikepdf
- tesseract https://tesseract-ocr.github.io/tessdoc/Installation.html
- sudo apt install tesseract-ocr
- sudo apt install libtesseract-dev
- https://github.com/parsee-ai/parsee-pdf-reader
- pdf reader conflicts
- Installing collected packages: pytesseract, pypdf, pycparser, pdf2image, opencv-python, cffi, cryptography, pdfminer-six, parsee-pdf-reader
Attempting uninstall: pypdf
Found existing installation: pypdf 4.2.0
Uninstalling pypdf-4.2.0:
Successfully uninstalled pypdf-4.2.0
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
llama-index-readers-file 0.1.20 requires pypdf<5.0.0,>=4.0.1, but you have pypdf 3.17.4 which is incompatible.
Successfully installed cffi-1.16.0 cryptography-42.0.7 opencv-python-4.9.0.80 parsee-pdf-reader-0.1.5.8 pdf2image-1.17.0 pdfminer-six-20221105 pycparser-2.22 pypdf-3.17.4 pytesseract-0.3.10
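The conflict above comes down to a version range check: llama-index-readers-file wants pypdf >= 4.0.1 and < 5.0.0, but the install left 3.17.4 in place. A small sketch for checking this after an install, assuming plain numeric versions only; the helper names are hypothetical, and a real tool would use a full version parser.

```python
from importlib import metadata


def parse_version(text):
    """Turn '4.2.0' into (4, 2, 0); only handles plain numeric versions."""
    return tuple(int(part) for part in text.split("."))


def in_range(version, minimum, below):
    """True if minimum <= version < below."""
    return parse_version(minimum) <= parse_version(version) < parse_version(below)


def installed_version(package):
    """Return the installed version string, or None if not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None
```

With the versions from the log, `in_range("3.17.4", "4.0.1", "5.0.0")` is False and `in_range("4.2.0", "4.0.1", "5.0.0")` is True. Separate virtual environments for the two readers would sidestep the clash.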
- could use isfdb database dumps
- or rpggeek similarly
- needs logging added to track things in main program
- The game collection has lots of images - some items are old scans, so processing will be slow
- Consider parallelising
- It is an embarrassingly parallel job - everything can run at once
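Before reaching for cloud functions, the logging and parallelising notes above can be tried locally. A sketch using a thread pool, on the assumption that the per-file work mostly waits on I/O or external tools such as OCR; CPU-bound Python work would suit a process pool better. The logger name and `process_all` are hypothetical.

```python
import logging
from concurrent.futures import ThreadPoolExecutor, as_completed

logging.basicConfig(level=logging.INFO, format="%(asctime)s [%(levelname)s] %(message)s")
log = logging.getLogger("book-pipeline")  # hypothetical logger name


def process_all(paths, process_one, max_workers=8):
    """Run process_one(path) for every path; return (results, failures)."""
    results, failures = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(process_one, p): p for p in paths}
        for future in as_completed(futures):
            path = futures[future]
            try:
                results[path] = future.result()
                log.info("done %s", path)
            except Exception as exc:  # one bad scan should not stop the batch
                failures[path] = exc
                log.warning("failed %s: %s", path, exc)
    return results, failures
```

Because each file is independent, the same `process_one` function could later be handed to a serverless map without restructuring.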
- Perhaps Lithops to abstract some configuration [supposedly]?
- Config File
- Compute Backend [policy, role] https://github.com/lithops-cloud/lithops/blob/master/docs/source/compute_config/aws_lambda.md
- Storage Backend [bucket] - https://github.com/lithops-cloud/lithops/blob/master/docs/source/storage_config/aws_s3.md
/book-mentat/src/games$ lithops hello
2024-05-25 12:37:06,919 [INFO] config.py:139 -- Lithops v3.3.0 - Python3.10
2024-05-25 12:37:07,708 [INFO] aws_s3.py:59 -- S3 client created - Region: us-west-2
2024-05-25 12:37:09,791 [INFO] aws_lambda.py:97 -- AWS Lambda client created - Region: us-west-2
2024-05-25 12:37:09,793 [INFO] invokers.py:107 -- ExecutorID 8xc5f5-0 | JobID A000 - Selected Runtime: default-runtime-v310 - 256MB
2024-05-25 12:37:10,010 [INFO] invokers.py:115 -- Runtime default-runtime-v310 with 256MB is not yet deployed
2024-05-25 12:37:10,010 [INFO] aws_lambda.py:388 -- Deploying runtime: default-runtime-v310 - Memory: 256 - Timeout: 180
2024-05-25 12:37:10,838 [INFO] aws_lambda.py:187 -- Creating lambda layer for runtime default-runtime-v310
2024-05-25 12:38:53,088 [INFO] invokers.py:174 -- ExecutorID 8xc5f5-0 | JobID A000 - Starting function invocation: hello() - Total: 1 activations
2024-05-25 12:38:53,166 [INFO] invokers.py:213 -- ExecutorID 8xc5f5-0 | JobID A000 - View execution logs at /tmp/lithops-richard/logs/8dc5f5-0-A000.log
2024-05-25 12:38:53,191 [INFO] executors.py:491 -- ExecutorID 8xc5f5-0 - Getting results from 1 function activations
2024-05-25 12:38:53,191 [INFO] wait.py:101 -- ExecutorID 8xc5f5-0 - Waiting for 1 function activations to complete
- Need to build a Docker image for anything bespoke of interest
- Follow examples
- Needs Docker Desktop https://docs.docker.com/desktop/install/ubuntu/
At the end of the installation process, apt displays an error due to installing a downloaded package. You can ignore this error message.
N: Download is performed unsandboxed as root, as file '/home/user/Downloads/doc
- Lithops has two timeouts: a runtime_timeout, and an execution_timeout set in the lithops section of the config
- lithops runtime build -f MyDockerfile -b aws_lambda my-container-runtime-name
- runtime memory: --memory, -m: memory size in MB to assign to the runtime.
- permissions issues:
- list runtimes: lithops runtime list -b aws_lambda
- libgl error
- so good questions?
- other solutions
- pip install opencv-contrib-python
- install opencv-contrib-python rather than opencv-python.
- At one point, I used opencv-python-headless which worked for my case with FastAPI when I deployed on Heroku once. What's the
- pip install -U opencv-python
- apt update && apt install -y libsm6 libxext6 ffmpeg libfontconfig1 libxrender1 libg
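To confirm whether a given container image actually has the library, a small check can run before importing OpenCV. This sketch assumes the libgl error is the usual one where `libGL.so.1` cannot be opened at import time; the function names are hypothetical.

```python
import ctypes
import ctypes.util


def find_libgl():
    """Return the resolved name of the GL shared library, or None if absent."""
    return ctypes.util.find_library("GL")


def can_load_libgl(name="libGL.so.1"):
    """Try to load libGL the way a native extension would."""
    try:
        ctypes.CDLL(name)
        return True
    except OSError:
        return False
```

If `can_load_libgl()` is False inside the runtime, either install the system packages or switch to a headless OpenCV build, as in the suggestions above.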
- runtime memory you can adjust in .lithops_config for pool
- /tmp is writeable in container
- Batch needs a service role that can do ECS
- also a job Role lithops-cloud/lithops#1359
- Choose Elastic Container Service on the use case list and then click on Elastic Container Service Task.
- Click Next: Permissions. Select the policy created before (lithops-policy).
- Click Next: Tags and Next: Review.
- Type a role name, for example ecsTaskJobRole. Click on Create Role.
- Set the LD_LIBRARY_PATH environment variable in the Dockerfile: ENV LD_LIBRARY_PATH="/lib/x86_64-linux-gnu:$LD_LIBRARY_PATH"
- docker system prune --all --force