Replies: 8 comments 6 replies
-
Hey @katie312! Just tested, and it works just fine! The problem may be that your input is a full image. Please look at this test; it works just fine, and I think the problem really is that it is an image, so please try with PDF text content. Also, try another DocumentLoader, or just activate vision:
This will add each page as an image.
-
Thank you so much for your reply! Yes, I did input a PDF made of images. Actually, I edited the code of llm.py because I have to use an API on the company LAN (not deployed by Ollama), and I don't know if I edited it correctly. However, when I changed the OCR function from DocumentLoaderPypdf() to Tesseract, it worked successfully, but once I input a multi-page PDF document, the model responded "input too long". Should I try the splitting-files example? Thanks for the advice; I will try the vision way.
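The "input too long" error above is a context-window limit, which the splitting-files approach addresses. A minimal sketch of the idea in plain Python (the page list, token heuristic, and function names here are hypothetical, for illustration only; ExtractThinker's completion strategies handle this for you):

```python
def estimate_tokens(text: str) -> int:
    # Rough heuristic: roughly 4 characters per token for English text.
    return max(1, len(text) // 4)

def split_pages(pages: list[str], max_tokens: int) -> list[list[str]]:
    """Group consecutive pages into chunks that each fit the token budget."""
    chunks, current, used = [], [], 0
    for page in pages:
        cost = estimate_tokens(page)
        if current and used + cost > max_tokens:
            chunks.append(current)  # current chunk is full, start a new one
            current, used = [], 0
        current.append(page)
        used += cost
    if current:
        chunks.append(current)
    return chunks

# Three pages of ~500 tokens each, with a 1000-token budget -> 2 chunks.
pages = ["word " * 400] * 3
print(len(split_pages(pages, max_tokens=1000)))  # -> 2
```

Each chunk can then be sent to the model in a separate call and the partial results merged afterwards.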
-
You are welcome @katie312! So, for on-premise solutions, do the following: use a proper OCR and DocumentLoader. If you are using a non-vision model, I would combine Docling or MarkItDown (Docling is better) with Tesseract or EasyOCR. If it is a vision model, use Docling without any configs, just with vision=true. If your model has a context window of 8k or similarly small, just use:
You can read more about strategies here: https://enoch3712.github.io/ExtractThinker/core-concepts/completion-strategies/
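The linked page describes completion strategies such as FORBIDDEN and PAGINATE. As an illustration of when each applies, here is a hypothetical selection rule in plain Python (the strategy names come from the docs, but this decision logic is only a sketch, not ExtractThinker's actual code):

```python
def pick_strategy(doc_tokens: int, context_window: int, reserve: int = 1024) -> str:
    """Illustrative choice of completion strategy by document size.

    FORBIDDEN: the document must fit in a single call.
    PAGINATE: process the document page by page, then merge the results.
    (Strategy names from the ExtractThinker docs; the rule itself is a sketch.)
    """
    budget = context_window - reserve  # leave room for the prompt and the reply
    return "FORBIDDEN" if doc_tokens <= budget else "PAGINATE"

print(pick_strategy(doc_tokens=2000, context_window=8192))   # fits in one call
print(pick_strategy(doc_tokens=20000, context_window=8192))  # too large, paginate
```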
-
Thanks for the advice! My model is a non-vision model, but I will try using a vision model later on. I ran into another question when I use
What is the ResponseModel in this case? I tried with:
but it showed: Also, is there a code demo for combining Docling with Tesseract for the extractor? Thank you so much!
-
Update: I tried using a vision model (GPT-4-Vision), but the output shows that. I think maybe it's because the input and output format of my LLM API is different from the project's?
-
https://docs.litellm.ai/docs/providers/openai Yes, the right way is gpt-4o.
-
Thank you so much! I will try using gpt-4o. By the way, any suggestions on CompletionStrategy.FORBIDDEN with AttributeError: 'dict' object has no attribute 'model_dump'?
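That AttributeError means a plain dict reached code that expected a Pydantic v2 model (model_dump is Pydantic v2's serialization method). A minimal stdlib reproduction of the mismatch, plus a defensive helper (FakeModel and to_dict are hypothetical names for illustration; they are not part of ExtractThinker):

```python
class FakeModel:
    """Stand-in for a Pydantic v2 model, which exposes .model_dump()."""
    def __init__(self, **fields):
        self._fields = fields

    def model_dump(self) -> dict:
        return dict(self._fields)

def to_dict(obj) -> dict:
    """Accept either a model-like object or an already-plain dict."""
    if isinstance(obj, dict):
        return obj  # calling obj.model_dump() here would raise AttributeError
    return obj.model_dump()

print(to_dict(FakeModel(name="invoice", total=42)))   # works: model path
print(to_dict({"name": "invoice", "total": 42}))      # works: dict path
```

In practice the error usually means the response model class was not passed where the library expected it, so the result stayed a raw dict.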
-
@katie312 Right, your model has a small context window! Strategies are always set to the default, FORBIDDEN. You have two options, and you want to use PAGINATE: it goes through all the content and then merges the results.
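A paginate-style strategy extracts from each page separately and then merges the partial results. A sketch of that merge step in plain Python (the merge rule below is a naive assumption for illustration, not ExtractThinker's actual implementation):

```python
def merge_partials(partials: list[dict]) -> dict:
    """Naively merge per-page extraction results: later non-empty scalar
    values win, and list fields are concatenated. (Illustrative rule only.)"""
    merged: dict = {}
    for part in partials:
        for key, value in part.items():
            if isinstance(value, list):
                merged.setdefault(key, []).extend(value)
            elif value not in (None, ""):
                merged[key] = value
    return merged

# Two per-page partial results for a hypothetical invoice schema.
pages = [
    {"invoice_number": "INV-7", "lines": [{"item": "A"}], "total": None},
    {"invoice_number": "", "lines": [{"item": "B"}], "total": 99.0},
]
print(merge_partials(pages))
```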
-
I tried using DocumentLoaderPyPdf() as the PDF extractor, but saw nothing in the output JSON. So I tried to print the extracted content of the PDF, and it also showed nothing. However, the program still goes into the "content not empty" branch.
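This is consistent with a scanned (image-only) PDF: pypdf-based loaders return an empty or whitespace-only string for pages with no text layer, and a loose check like `if content:` still passes when the content is a whitespace string or a list of empty page strings. A stricter check you could add (has_real_text is a hypothetical helper, not an ExtractThinker function):

```python
def has_real_text(extracted) -> bool:
    """True only if extraction produced non-whitespace characters.

    Loaders built on pypdf typically return '' (not None) for scanned pages
    without a text layer, so `if extracted:` can pass with effectively empty
    content, e.g. for ' \n' or for a list of empty per-page strings.
    """
    if extracted is None:
        return False
    if isinstance(extracted, list):  # e.g. one string per page
        return any(isinstance(p, str) and p.strip() for p in extracted)
    return bool(str(extracted).strip())

print(has_real_text(" \n\t"))            # whitespace only -> False
print(has_real_text(["", "  "]))         # truthy list, empty pages -> False
print(has_real_text(["", "Invoice 42"])) # real text on page 2 -> True
```

If this check returns False, the PDF has no text layer and you need an OCR-based loader (as discussed above with Tesseract) rather than a text extractor.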