Can text in non-scanned PDFs be extracted from PyMuPDF metadata? #2277
Closed
manbuheiniu
started this conversation in
Ideas
Replies: 1 comment
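That is already how it works, as long as the mode is left at the default `auto`. If you manually specify `ocr`, the OCR result is forced and replaces text extraction.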
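**Problem:** OCR currently misrecognizes some symbols and characters, and superscripts come out as ordinary text.
**Idea:** After layout detection, each text block carries position coordinates, and the metadata PyMuPDF extracts also carries position coordinates. For the text inside a text block, first use those coordinates to select the text elements PyMuPDF provides and use them in place of OCR, which should improve accuracy for characters and special symbols (the text metadata also makes it possible to recognize inline styles such as bold and superscript). If no text is found in the region, fall back to OCR.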
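
To make the idea concrete, here is a minimal sketch of the region lookup. It is not the project's actual implementation. It assumes the page text comes from PyMuPDF's `page.get_text("dict")`, whose spans carry a `bbox` and a `flags` bitfield (bit 0 for superscript, bit 4 for bold). The helper names `spans_in_region`, `render_span` and `text_for_region`, the `ocr_fallback` callback, and the `<sup>` / `**` markup are all hypothetical. It also assumes the layout block's bounding box has already been converted to the same coordinate space as the PDF page (points), which a real pipeline would have to do when the layout model runs on a rendered image.

```python
from typing import Callable, Dict, List, Tuple

# Span flag bits as documented by PyMuPDF for get_text("dict") output.
FLAG_SUPERSCRIPT = 1 << 0
FLAG_BOLD = 1 << 4

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def spans_in_region(text_dict: Dict, region: BBox) -> List[Dict]:
    """Return the spans whose centre point lies inside the layout region."""
    x0, y0, x1, y1 = region
    hits = []
    for block in text_dict.get("blocks", []):
        if block.get("type", 0) != 0:  # 0 = text block, 1 = image block
            continue
        for line in block.get("lines", []):
            for span in line.get("spans", []):
                sx0, sy0, sx1, sy1 = span["bbox"]
                cx, cy = (sx0 + sx1) / 2, (sy0 + sy1) / 2
                if x0 <= cx <= x1 and y0 <= cy <= y1:
                    hits.append(span)
    return hits


def render_span(span: Dict) -> str:
    """Wrap the span text in (hypothetical) inline-style markup."""
    text = span["text"]
    flags = span.get("flags", 0)
    if flags & FLAG_SUPERSCRIPT:
        text = f"<sup>{text}</sup>"
    if flags & FLAG_BOLD:
        text = f"**{text}**"
    return text


def text_for_region(
    text_dict: Dict, region: BBox, ocr_fallback: Callable[[BBox], str]
) -> str:
    """Prefer embedded PDF text; call OCR only if the region has none."""
    text = "".join(render_span(s) for s in spans_in_region(text_dict, region))
    if text.strip():
        return text
    return ocr_fallback(region)
```

With PyMuPDF installed, `text_dict` would come from something like `pymupdf.open(path)[page_index].get_text("dict")`, and `ocr_fallback` would wrap whatever OCR engine the pipeline already uses. Matching on the span centre rather than requiring full containment is a deliberate simplification, so that a span slightly overhanging the detected block is still picked up.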