Fix the bug where Python scripts fail to execute PDF text recognition… #11994

Merged
merged 3 commits into from
Apr 25, 2024

Conversation

guangyunms
Contributor

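  1. Fix the bug where Python scripts fail to run PDF text recognition tasks, and optimize the logic for detecting PDF files.
  2. Add examples to the layout analysis quickstart document.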

… tasks, optimize the logic of judging PDF files, and add cases to the quickstart document for layout analysis.

paddle-bot bot commented Apr 24, 2024

Thanks for your contribution!

@guangyunms
Contributor Author

The original PR was #11984. I opened this new PR because the commit messages in the original were too messy.

@GreatV GreatV assigned GreatV and unassigned Sunting78 Apr 24, 2024
@GreatV
Collaborator

GreatV commented Apr 24, 2024

PaddleOCR/paddleocr.py

Lines 845 to 960 in 00f0d42

def main():
    # for cmd
    args = parse_args(mMain=True)
    image_dir = args.image_dir
    if is_link(image_dir):
        download_with_progressbar(image_dir, "tmp.jpg")
        image_file_list = ["tmp.jpg"]
    else:
        image_file_list = get_image_file_list(args.image_dir)
    if len(image_file_list) == 0:
        logger.error("no images find in {}".format(args.image_dir))
        return
    if args.type == "ocr":
        engine = PaddleOCR(**(args.__dict__))
    elif args.type == "structure":
        engine = PPStructure(**(args.__dict__))
    else:
        raise NotImplementedError
    for img_path in image_file_list:
        img_name = os.path.basename(img_path).split(".")[0]
        logger.info("{}{}{}".format("*" * 10, img_path, "*" * 10))
        if args.type == "ocr":
            result = engine.ocr(
                img_path,
                det=args.det,
                rec=args.rec,
                cls=args.use_angle_cls,
                bin=args.binarize,
                inv=args.invert,
                alpha_color=args.alphacolor,
            )
            if result is not None:
                lines = []
                for idx in range(len(result)):
                    res = result[idx]
                    for line in res:
                        logger.info(line)
                        val = "["
                        for box in line[0]:
                            val += str(box[0]) + "," + str(box[1]) + ","
                        val = val[:-1]
                        val += "]," + line[1][0] + "," + str(line[1][1]) + "\n"
                        lines.append(val)
                if args.savefile:
                    if os.path.exists(args.output) is False:
                        os.mkdir(args.output)
                    outfile = args.output + "/" + img_name + ".txt"
                    with open(outfile, "w", encoding="utf-8") as f:
                        f.writelines(lines)
        elif args.type == "structure":
            img, flag_gif, flag_pdf = check_and_read(img_path)
            if not flag_gif and not flag_pdf:
                img = cv2.imread(img_path)
            if args.recovery and args.use_pdf2docx_api and flag_pdf:
                from pdf2docx.converter import Converter
                docx_file = os.path.join(args.output, "{}.docx".format(img_name))
                cv = Converter(img_path)
                cv.convert(docx_file)
                cv.close()
                logger.info("docx save to {}".format(docx_file))
                continue
            if not flag_pdf:
                if img is None:
                    logger.error("error in loading image:{}".format(img_path))
                    continue
                img_paths = [[img_path, img]]
            else:
                img_paths = []
                for index, pdf_img in enumerate(img):
                    os.makedirs(os.path.join(args.output, img_name), exist_ok=True)
                    pdf_img_path = os.path.join(
                        args.output, img_name, img_name + "_" + str(index) + ".jpg"
                    )
                    cv2.imwrite(pdf_img_path, pdf_img)
                    img_paths.append([pdf_img_path, pdf_img])
            all_res = []
            for index, (new_img_path, img) in enumerate(img_paths):
                logger.info("processing {}/{} page:".format(index + 1, len(img_paths)))
                new_img_name = os.path.basename(new_img_path).split(".")[0]
                result = engine(img, img_idx=index)
                save_structure_res(result, args.output, img_name, index)
                if args.recovery and result != []:
                    from copy import deepcopy
                    from ppstructure.recovery.recovery_to_doc import sorted_layout_boxes
                    h, w, _ = img.shape
                    result_cp = deepcopy(result)
                    result_sorted = sorted_layout_boxes(result_cp, w)
                    all_res += result_sorted
            if args.recovery and all_res != []:
                try:
                    from ppstructure.recovery.recovery_to_doc import convert_info_docx
                    convert_info_docx(img, all_res, args.output, img_name)
                except Exception as ex:
                    logger.error(
                        "error in layout recovery image:{}, err msg: {}".format(
                            img_name, ex
                        )
                    )
                    continue
            for item in all_res:
                item.pop("img")
                item.pop("res")
                logger.info(item)
            logger.info("result save to {}".format(args.output))

The relevant logic already exists here. How does what this PR adds differ from the existing code?

@GreatV
Collaborator

GreatV commented Apr 24, 2024

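The layout analysis quickstart examples indeed do not provide a demo for processing PDF files. I think the relevant parts of main could be extracted and adapted into a PDF demo.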

@guangyunms
Contributor Author

> PaddleOCR/paddleocr.py, lines 845 to 960 in 00f0d42 (`main()`, quoted in full above)
>
> The relevant logic already exists here. How does what this PR adds differ from the existing code?

My contribution is based on this code. The difference is that the existing code runs from the command line, whereas my contribution runs from a Python script. Developers may be more used to the Python-script approach.

@guangyunms
Contributor Author

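> The layout analysis quickstart examples indeed do not provide a demo for processing PDF files. I think the relevant parts of main could be extracted and adapted into a PDF demo.

Indeed. For now I have written a demo modeled on the existing examples in the quickstart document.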

@GreatV
Collaborator

GreatV commented Apr 25, 2024

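main first parses the PDF file into individual images and then processes each image. It may not be necessary to pass the PDF directly to the PPStructure engine. The demo only needs to be changed to parse the PDF first and then process the images. That keeps the change minimal and also resolves users' confusion.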
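
The "parse the PDF first, then process each page image" flow suggested here can be sketched roughly as follows. This is only an illustration, not the demo added by the PR: `split_pdf_into_pages` and `fake_engine` are hypothetical stand-ins for a real PDF rasterizer and for a `PPStructure` instance, and pages are plain strings so the snippet runs without PaddleOCR installed.

```python
from typing import Callable, List, Sequence


def split_pdf_into_pages(pdf_pages: Sequence[str]) -> List[str]:
    """Hypothetical stand-in for PDF rasterization: one 'image' per page."""
    return list(pdf_pages)


def fake_engine(page_image: str, img_idx: int = 0) -> List[dict]:
    """Hypothetical stand-in for a layout engine called on ONE page image."""
    return [{"type": "text", "img_idx": img_idx, "res": page_image.upper()}]


def analyze_pdf(
    pdf_pages: Sequence[str],
    engine: Callable[..., List[dict]],
) -> List[List[dict]]:
    """Split first, then call the engine once per page, keeping page order."""
    results = []
    for index, page_image in enumerate(split_pdf_into_pages(pdf_pages)):
        results.append(engine(page_image, img_idx=index))
    return results


if __name__ == "__main__":
    for page_result in analyze_pdf(["page one", "page two"], fake_engine):
        print(page_result)
```

With this shape the engine itself never needs to know about PDFs; the caller owns the page loop, which mirrors what `main()` does on the command-line path.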

@@ -189,7 +190,29 @@ im_show.save('result.jpg')
```

<a name="223"></a>
-#### 2.2.3 版面分析 (Layout analysis)
+#### 2.2.3 版面分析+文本识别 (Layout analysis + text recognition)
Collaborator

Layout analysis should still be covered here; just merge this into the layout analysis section.

@guangyunms
Contributor Author

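> main first parses the PDF file into individual images and then processes each image. It may not be necessary to pass the PDF directly to the PPStructure engine. The demo only needs to be changed to parse the PDF first and then process the images. That keeps the change minimal and also resolves users' confusion.

That works too. I think we could document both approaches. One passes the PDF directly: the command-line usage in the existing docs already takes a PDF file path directly, so users who have read it may find this more intuitive. The other has users parse the PDF into images themselves and then process the images.

Once that is written, I will merge it into the existing layout analysis section.

What do you think, @GreatV?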

@GreatV
Collaborator

GreatV commented Apr 25, 2024

@guangyunms I took a look: the OCR part does support PDF inference, so this change is reasonable too. Go ahead with your approach.

 # for infer pdf file
-if isinstance(img, list):
+if isinstance(img, list) and flag_pdf:
Collaborator

Will this change affect GIF handling?

Collaborator

Is returning flag_gif and flag_pdf really necessary? Checking whether it is a list here should also achieve the goal.

Contributor Author

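  1. I tested it, and GIFs are not affected. From the code, GIF files are detected only because they are read differently from ordinary images; the subsequent processing should be the same, so they will not fail the way PDFs do because of multiple pages.
  2. Yes, for now checking whether it is a list would also achieve the goal, but deciding based on data type alone does not feel robust. Since the code already has flag_pdf as a criterion, adding it to the condition better matches the definition of the related functions and the code logic, and it will not affect later design.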
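
The difference between the two checks can be sketched as below. `read_input` is a hypothetical stand-in that only mimics the `(img, flag_gif, flag_pdf)` tuple shape discussed in this thread, with strings in place of image arrays; it is not PaddleOCR's actual reader.

```python
from typing import Any, List, Tuple


def read_input(path: str) -> Tuple[Any, bool, bool]:
    """Hypothetical reader returning (img, flag_gif, flag_pdf)."""
    ext = path.rsplit(".", 1)[-1].lower()
    if ext == "pdf":
        return ["page-0", "page-1"], False, True  # one entry per page
    if ext == "gif":
        return "first-frame", True, False  # read differently, still one image
    return "image", False, False


def pages_to_process(path: str) -> List[Any]:
    """Treat the input as multi-page only when it is a list AND flagged as PDF."""
    img, flag_gif, flag_pdf = read_input(path)
    if isinstance(img, list) and flag_pdf:
        return img
    # GIFs and ordinary images fall through to the single-image path.
    return [img]


if __name__ == "__main__":
    for name in ("doc.pdf", "anim.gif", "photo.jpg"):
        print(name, pages_to_process(name))
```

Under these assumptions a GIF takes the same single-image path as a JPEG, which matches the observation above that only its reading step differs.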

Comment on lines +850 to 852
+            return res_list
         res, _ = super().__call__(img, return_ocr_result_in_table, img_idx=img_idx)
         return res
Collaborator

The return type changes here. Could that confuse users? I suggest handling it the way the OCR part does.

Contributor Author

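On whether the changed return type will confuse users: looking at the OCR part, it handles the return type as follows. If img is a list, the result is left unchanged; if it is not a list, the result is wrapped in a list before being returned, so everything ends up as a list. The PPStructure class, however, is defined differently from OCR and seems designed to return the result for a single page. The main function confirms this: the command-line path currently calls PPStructure expecting a single value. Handling it the OCR way would inevitably require changing main, so I think it is better to leave it alone for now. Since you may have other plans for how this will be written later, I am trying not to change the existing behavior.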
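
The "always return a list" convention that this comment attributes to the OCR part could look roughly like the sketch below. `as_result_list` is a hypothetical helper written for illustration; it is not a function in PaddleOCR, and the PR deliberately does not apply this normalization to PPStructure.

```python
from typing import Any, List


def as_result_list(result: Any, input_was_list: bool) -> List[Any]:
    """Wrap a single-page result in a list; leave multi-page results as-is."""
    return result if input_was_list else [result]


if __name__ == "__main__":
    # Multi-page input: the per-page results are already a list.
    print(as_result_list(["page-0 result", "page-1 result"], input_was_list=True))
    # Single image: the lone result is wrapped, so callers can always iterate.
    print(as_result_list("single result", input_was_list=False))
```

The trade-off raised in the comment is that callers written against a single-value return, such as the command-line `main()`, would have to be updated to index into the list.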

@guangyunms
Contributor Author

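  1. I tested it, and GIFs are not affected. From the code, GIF files are detected only because they are read differently from ordinary images; the subsequent processing should be the same, so they will not fail the way PDFs do because of multiple pages.
  2. Yes, for now checking whether it is a list would also achieve the goal, but deciding based on data type alone does not feel robust. Since the code already has flag_pdf as a criterion, adding it to the condition better matches the definition of the related functions and the code logic, and it will not affect later design.
  3. On whether the changed return type will confuse users: the OCR part leaves the result unchanged if img is a list and otherwise wraps it in a list, so everything ends up as a list. The PPStructure class, however, is defined differently and seems designed to return the result for a single page. The main function confirms this: the command-line path currently calls PPStructure expecting a single value. Handling it the OCR way would inevitably require changing main, so I think it is better to leave it alone for now. Since you may have other plans for how this will be written later, I am trying not to change the existing behavior.
  4. I added two examples and merged them into the layout analysis section.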

@guangyunms guangyunms requested a review from GreatV April 25, 2024 03:41
paddleocr.py Outdated
@@ -561,6 +561,7 @@ def check_img(img, alpha_color=(255, 255, 255)):
alpha_color: Background color in images in RGBA format
return: numpy.array (h, w, 3)
Collaborator

The return type description needs to be updated as well.

Contributor Author

Updated.

@GreatV
Collaborator

GreatV commented Apr 25, 2024

I tested both demos and they work correctly.

@guangyunms guangyunms requested a review from GreatV April 25, 2024 05:00
Collaborator

@GreatV GreatV left a comment

LGTM

@GreatV GreatV merged commit f7117ef into PaddlePaddle:main Apr 25, 2024
3 checks passed
@luotao1
Collaborator

luotao1 commented Oct 15, 2024

@guangyunms Thanks for your contribution! You will receive a beautiful PaddlePaddle gift. Please provide your mailing address by filling out the following questionnaire before October 18th.

Looking forward to the future, we will walk further together in the world of open source!
Click here: https://paddle.wjx.cn/vm/h4On9gJ.aspx#

@luotao1
Collaborator

luotao1 commented Nov 6, 2024

hi, @guangyunms

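  • Thank you very much for your contribution to PaddlePaddle. We run an organization called PFCC, which contributes to PaddlePaddle on an ongoing basis by regularly sharing technical knowledge and publishing developer-led tasks. See the description on the https://github.com/luotao1 profile page for details.
  • If you are interested in PFCC, please send an email to ext_paddle_oss@baidu.com and we will invite you to join.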

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 11, 2024