jdwh08/Autodoc-Lifter

title: Autodoc Lifter
emoji: 🦊📝
colorFrom: yellow
colorTo: red
python_version: 3.11.9
sdk: streamlit
sdk_version: 1.37.1
suggested_hardware: t4-small
suggested_storage: small
app_file: app.py
header: mini
short_description: Good Local RAG for Bad PDFs
models:
  - timm/resnet18.a1_in1k
  - microsoft/table-transformer-detection
  - mixedbread-ai/mxbai-embed-large-v1
  - mixedbread-ai/mxbai-rerank-large-v1
  - meta-llama/Meta-Llama-3.1-8B-Instruct
  - Salesforce/xgen-mm-phi3-mini-instruct-interleave-r-v1.5
tags:
  - rag
  - llm
  - pdf
  - document
license: agpl-3.0
pinned: true
preload_from_hub:
  - timm/resnet18.a1_in1k
  - microsoft/table-transformer-detection
  - mixedbread-ai/mxbai-embed-large-v1
  - mixedbread-ai/mxbai-rerank-large-v1
  - Salesforce/xgen-mm-phi3-mini-instruct-interleave-r-v1.5

Autodoc Lifter

Document RAG system with LLMs. Some key goals for the project, once finished:

  1. All open, all local. I don't want to be calling APIs. You can run the entire app locally, and inspect the code and models. This is particularly suitable for handling restricted information. Yes, I know this is a web demo on Spaces, so don't actually do that here. Use the GitHub link: (here, once it's no longer ClosedAI)

  2. Support for atrocious and varied PDFs. Have images? Have tables? Have a set of PDFs with the worst quality and page layout known to man? Give it a try in here. I've been slowly building out custom processing for difficult documents by connecting Unstructured.IO to LlamaIndex in a slightly useful way. (A future dream: get rid of Unstructured and build our own pipeline one day.)

  3. Multiple PDFs, handled with agents. Instead of dumping all the documents into one central vector store and praying it works out, I'm trying to be more thoughtful about how to incorporate multiple documents.

  4. Answers that are sourced and verifiable. I'm sorry, but as a Definitely Human Person, I don't like hallucinated answers-ex-machina. Responses should give actual citations [0] when pulling text directly from source documents, and there should be a way to view the citations, the referenced text, and the document itself.

    --- CITATIONS --- [0] Relies primarily on fuzzy string matching, because it's computationally cheaper and also ensures that cited text actually occurs in the source documents.
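
To make the fuzzy-matching idea in [0] concrete, here is a minimal sketch of how a quoted passage could be checked against a source document. This is not the project's actual code: it assumes the standard-library `difflib` as the matcher, and the names `best_fuzzy_match`, `is_supported`, and the `0.85` threshold are hypothetical choices made for illustration only.

```python
from difflib import SequenceMatcher


def normalize(text: str) -> list[str]:
    """Lowercase and split on whitespace so PDF layout differences don't matter."""
    return text.lower().split()


def best_fuzzy_match(quote: str, source: str) -> tuple[float, str]:
    """Slide a quote-sized word window over the source.

    Returns (best similarity score in [0, 1], best-matching source passage).
    """
    quote_words = normalize(quote)
    source_words = source.split()
    if not quote_words or not source_words:
        return 0.0, ""

    window = min(len(quote_words), len(source_words))
    target = " ".join(quote_words)
    best_score, best_passage = 0.0, ""

    for start in range(len(source_words) - window + 1):
        passage = " ".join(source_words[start:start + window])
        # autojunk=False keeps the heuristic from discarding common characters.
        score = SequenceMatcher(None, target, passage.lower(), autojunk=False).ratio()
        if score > best_score:
            best_score, best_passage = score, passage

    return best_score, best_passage


def is_supported(quote: str, source: str, threshold: float = 0.85) -> bool:
    """Hypothetical acceptance rule: the quote must closely match some passage."""
    score, _ = best_fuzzy_match(quote, source)
    return score >= threshold


if __name__ == "__main__":
    page = "The quarterly  revenue rose by 12 percent,\ncompared with the prior year."
    print(best_fuzzy_match("quarterly revenue rose by 12 percent", page))
    print(is_supported("profits fell sharply in the third quarter", page))
```

A citation that fails `is_supported` would be dropped or flagged rather than shown, which is what keeps cited text tied to something that actually occurs in the source. The word-window scan is quadratic-ish and only meant to show the idea; a real pipeline would likely run it per retrieved chunk rather than over whole documents.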
