Fix the bug where Python scripts fail to execute PDF text recognition… #11994

guangyunms · 2024-04-24T13:58:26Z

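Fix the bug where Python scripts fail to execute PDF text recognition tasks, and optimize the logic for detecting PDF files.
Add examples to the layout analysis quickstart document.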

… tasks, optimize the logic of judging PDF files, and add cases to the quickstart document for layout analysis.

paddle-bot · 2024-04-24T13:58:32Z

Thanks for your contribution!

guangyunms · 2024-04-24T14:01:00Z

The original PR was #11984. I recreated the PR because the commit messages in the original were too messy.

GreatV · 2024-04-24T15:41:08Z

PaddleOCR/paddleocr.py

Lines 845 to 960 in 00f0d42

    
    def main():
        # for cmd
        args = parse_args(mMain=True)
        image_dir = args.image_dir
        if is_link(image_dir):
            download_with_progressbar(image_dir, "tmp.jpg")
            image_file_list = ["tmp.jpg"]
        else:
            image_file_list = get_image_file_list(args.image_dir)
        if len(image_file_list) == 0:
            logger.error("no images find in {}".format(args.image_dir))
            return
        if args.type == "ocr":
            engine = PaddleOCR(**(args.__dict__))
        elif args.type == "structure":
            engine = PPStructure(**(args.__dict__))
        else:
            raise NotImplementedError
        for img_path in image_file_list:
            img_name = os.path.basename(img_path).split(".")[0]
            logger.info("{}{}{}".format("*" * 10, img_path, "*" * 10))
            if args.type == "ocr":
                result = engine.ocr(
                    img_path,
                    det=args.det,
                    rec=args.rec,
                    cls=args.use_angle_cls,
                    bin=args.binarize,
                    inv=args.invert,
                    alpha_color=args.alphacolor,
                )
                if result is not None:
                    lines = []
                    for idx in range(len(result)):
                        res = result[idx]
                        for line in res:
                            logger.info(line)
                            val = "["
                            for box in line[0]:
                                val += str(box[0]) + "," + str(box[1]) + ","
                            val = val[:-1]
                            val += "]," + line[1][0] + "," + str(line[1][1]) + "\n"
                            lines.append(val)
                    if args.savefile:
                        if os.path.exists(args.output) is False:
                            os.mkdir(args.output)
                        outfile = args.output + "/" + img_name + ".txt"
                        with open(outfile, "w", encoding="utf-8") as f:
                            f.writelines(lines)
            elif args.type == "structure":
                img, flag_gif, flag_pdf = check_and_read(img_path)
                if not flag_gif and not flag_pdf:
                    img = cv2.imread(img_path)
                if args.recovery and args.use_pdf2docx_api and flag_pdf:
                    from pdf2docx.converter import Converter
                    docx_file = os.path.join(args.output, "{}.docx".format(img_name))
                    cv = Converter(img_path)
                    cv.convert(docx_file)
                    cv.close()
                    logger.info("docx save to {}".format(docx_file))
                    continue
                if not flag_pdf:
                    if img is None:
                        logger.error("error in loading image:{}".format(img_path))
                        continue
                    img_paths = [[img_path, img]]
                else:
                    img_paths = []
                    for index, pdf_img in enumerate(img):
                        os.makedirs(os.path.join(args.output, img_name), exist_ok=True)
                        pdf_img_path = os.path.join(
                            args.output, img_name, img_name + "_" + str(index) + ".jpg"
                        )
                        cv2.imwrite(pdf_img_path, pdf_img)
                        img_paths.append([pdf_img_path, pdf_img])
                all_res = []
                for index, (new_img_path, img) in enumerate(img_paths):
                    logger.info("processing {}/{} page:".format(index + 1, len(img_paths)))
                    new_img_name = os.path.basename(new_img_path).split(".")[0]
                    result = engine(img, img_idx=index)
                    save_structure_res(result, args.output, img_name, index)
                    if args.recovery and result != []:
                        from copy import deepcopy
                        from ppstructure.recovery.recovery_to_doc import sorted_layout_boxes
                        h, w, _ = img.shape
                        result_cp = deepcopy(result)
                        result_sorted = sorted_layout_boxes(result_cp, w)
                        all_res += result_sorted
                if args.recovery and all_res != []:
                    try:
                        from ppstructure.recovery.recovery_to_doc import convert_info_docx
                        convert_info_docx(img, all_res, args.output, img_name)
                    except Exception as ex:
                        logger.error(
                            "error in layout recovery image:{}, err msg: {}".format(
                                img_name, ex
                            )
                        )
                        continue
                for item in all_res:
                    item.pop("img")
                    item.pop("res")
                    logger.info(item)
                logger.info("result save to {}".format(args.output))

The relevant logic already exists here. How does what this PR adds differ from the existing code?

GreatV · 2024-04-24T15:50:03Z

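The layout analysis examples in the quickstart document indeed do not include a demo for handling PDF files. I think the relevant parts of main could be extracted and adapted into a PDF demo.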

guangyunms · 2024-04-24T15:56:56Z

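The relevant logic already exists here. How does what this PR adds differ from the existing code?

My contribution is based on this code. The difference is that the existing code runs from the command line, whereas my contribution runs from a Python script. Developers may be more used to the Python script approach.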

guangyunms · 2024-04-24T15:57:43Z

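The layout analysis examples in the quickstart document indeed do not include a demo for handling PDF files. I think the relevant parts of main could be extracted and adapted into a PDF demo.

Agreed. I have written a demo modeled on the existing examples in the quickstart document.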

GreatV · 2024-04-25T00:29:15Z

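main first parses the PDF file into individual images and then processes each image. It may not be necessary to pass the PDF directly to the PPStructure engine. The demo only needs to be changed so that it parses the PDF first and then processes the images. This keeps the change minimal and also resolves users' confusion.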
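
The "parse the PDF first, then process each image" flow could be sketched roughly as below. This is only an illustration of the control flow, not code from the PR: `split_pdf_into_pages` and `fake_engine` are hypothetical stand-ins (a real demo would decode pages with a PDF library and call a `PPStructure` instance), and plain strings stand in for page images.

```python
from typing import Any, Callable, List


def split_pdf_into_pages(pdf: List[Any]) -> List[Any]:
    """Hypothetical stand-in for PDF decoding: the 'PDF' is already a list of pages."""
    return list(pdf)


def analyze_pdf(pdf: List[Any], engine: Callable[[Any], list]) -> List[list]:
    """Run a single-image engine once per page and collect one result per page."""
    results = []
    for page in split_pdf_into_pages(pdf):
        # The engine only ever receives one page image at a time.
        results.append(engine(page))
    return results


def fake_engine(page: Any) -> list:
    """Hypothetical engine: returns a single 'region' dict for the given page."""
    return [{"type": "text", "page": page}]


page_results = analyze_pdf(["page-0", "page-1"], fake_engine)
```

With this shape the engine itself never needs to know about PDFs, which is why the approach requires the smallest change.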

GreatV · 2024-04-25T00:33:11Z

ppstructure/docs/quickstart.md

@@ -189,7 +190,29 @@ im_show.save('result.jpg')
 ```

 <a name="223"></a>
-#### 2.2.3 版面分析
+#### 2.2.3 版面分析+文本识别


Layout analysis should still be covered here. Just merge this directly into the layout analysis section.

guangyunms · 2024-04-25T02:10:01Z

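main first parses the PDF file into individual images and then processes each image. It may not be necessary to pass the PDF directly to the PPStructure engine. The demo only needs to be changed so that it parses the PDF first and then processes the images. This keeps the change minimal and also resolves users' confusion.

That works too. I think both approaches could be documented. One passes the PDF directly: the command-line usage in the existing docs already takes a PDF file path, so users may find this more intuitive. The other has users parse the PDF into images themselves and then process the images.

Once it is written, I will merge it into the existing layout analysis section.

What do you think?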
@GreatV

GreatV · 2024-04-25T02:28:58Z

@guangyunms I took a look: the ocr part does support PDF inference, so this change is reasonable. You can go ahead with your approach.

GreatV · 2024-04-25T02:30:22Z

paddleocr.py

        # for infer pdf file
-        if isinstance(img, list):
+        if isinstance(img, list) and flag_pdf:


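Will this change affect GIF handling?

Is it really necessary to return flag_gif and flag_pdf? Checking whether it is a list here should also achieve the goal.

I tested it and it does not affect GIFs. From the code, GIF files are detected because they are read differently from ordinary images. The subsequent processing should be the same, and, unlike PDFs, they do not fail because of multiple pages.

Yes, checking whether it is a list also achieves the goal for now, but deciding on data type alone does not feel robust. Since the code already has flag_pdf as a criterion, adding this condition better matches the definition of the related functions and the code logic, and it will not affect later design.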
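
As a rough illustration of the point being argued, the combined check can be isolated into a small predicate. This is a sketch under assumptions, not the code in `paddleocr.py`: `should_iterate_pages` is a hypothetical helper name, and strings stand in for decoded images.

```python
def should_iterate_pages(img, flag_pdf: bool) -> bool:
    """Treat the input as multi-page only if it is a list AND was flagged as a PDF."""
    return bool(isinstance(img, list) and flag_pdf)


pdf_pages = ["page-0", "page-1"]  # stand-in for the list of images decoded from a PDF
gif_frame = "frame-0"             # stand-in for the single image read from a GIF
plain_list = ["img-a", "img-b"]   # a list that did not come from a PDF

checks = {
    "pdf": should_iterate_pages(pdf_pages, flag_pdf=True),
    "gif": should_iterate_pages(gif_frame, flag_pdf=False),
    "plain_list": should_iterate_pages(plain_list, flag_pdf=False),
}
```

The type check alone would also treat `plain_list` as a PDF. Requiring the flag as well keeps the branch tied to how the file was actually read.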

GreatV · 2024-04-25T02:35:36Z

paddleocr.py

+            return res_list
        res, _ = super().__call__(img, return_ocr_result_in_table, img_idx=img_idx)
        return res


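The return type changes here. Could that confuse users? I suggest handling it the way the ocr part does.

Regarding whether the changed return type could confuse users: the ocr part handles the return type by leaving img unchanged if it is a list and otherwise wrapping it in a list, so everything is returned as a list. The PPStructure class, however, is defined differently from ocr and appears to be designed to return the result for a single page. The main function confirms this: the command-line path currently calls PPStructure expecting a single value. Handling it the way the OCR part does would require changing main, so I think it is better to leave it alone for now. Since you may have other plans for how this should be written later, I am trying not to change the existing usage.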
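
The two return conventions contrasted above can be sketched side by side. This is an assumption-laden illustration rather than the library code: `normalize_like_ocr`, `structure_style`, and `analyze_page` are hypothetical names, and strings stand in for images.

```python
def normalize_like_ocr(img):
    """OCR-style convention as described: a list passes through, anything else is wrapped."""
    return img if isinstance(img, list) else [img]


def structure_style(img, flag_pdf, analyze_page):
    """PPStructure-style convention in this PR: a list of per-page results for a PDF,
    otherwise the single page result, returned unwrapped."""
    if isinstance(img, list) and flag_pdf:
        return [analyze_page(page) for page in img]
    return analyze_page(img)


def analyze_page(page):
    """Hypothetical single-page analysis returning a list of region dicts."""
    return [{"type": "text", "page": page}]


single = structure_style("image-0", False, analyze_page)
multi = structure_style(["page-0", "page-1"], True, analyze_page)
```

Callers that already pass single images, such as the loop in main, keep receiving one page's result. Only PDF input yields the extra outer list.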

guangyunms · 2024-04-25T03:40:20Z

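I tested it and it does not affect GIFs. From the code, GIF files are detected because they are read differently from ordinary images. The subsequent processing should be the same, and, unlike PDFs, they do not fail because of multiple pages.
Yes, checking whether it is a list also achieves the goal for now, but deciding on data type alone does not feel robust. Since the code already has flag_pdf as a criterion, adding this condition better matches the definition of the related functions and the code logic, and it will not affect later design.
In addition, regarding whether the changed return type could confuse users: the ocr part handles the return type by leaving img unchanged if it is a list and otherwise wrapping it in a list, so everything is returned as a list. The PPStructure class, however, is defined differently from ocr and appears to be designed to return the result for a single page. The main function confirms this: the command-line path currently calls PPStructure expecting a single value. Handling it the way the OCR part does would require changing main, so I think it is better to leave it alone for now. Since you may have other plans for how this should be written later, I am trying not to change the existing usage.
I added two examples and merged them into the layout analysis section.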

GreatV · 2024-04-25T04:37:31Z

paddleocr.py

@@ -561,6 +561,7 @@ def check_img(img, alpha_color=(255, 255, 255)):
        alpha_color: Background color in images in RGBA format
        return: numpy.array (h, w, 3)


The description of this return type also needs to be updated.

GreatV · 2024-04-25T04:44:49Z

I tested both demos and they work correctly.

GreatV

LGTM

luotao1 · 2024-10-15T06:25:16Z

@guangyunms Thanks for your contribution! You will receive a beautiful PaddlePaddle gift. Please provide your mailing address by filling out the following questionnaire before October 18th.

Looking forward to the future, we will walk further together in the world of open source!
Click here: https://paddle.wjx.cn/vm/h4On9gJ.aspx#

luotao1 · 2024-11-06T12:04:23Z

hi, @guangyunms

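Thank you very much for your contribution to PaddlePaddle. We run an organization called PFCC, which keeps contributing to PaddlePaddle through regular technical sharing sessions and developer-led tasks. See the profile page at https://github.com/luotao1 for details.
If you are interested in PFCC, please send an email to ext_paddle_oss@baidu.com and we will invite you to join.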

Fix the bug where Python scripts fail to execute PDF text recognition…

49646e3

… tasks, optimize the logic of judging PDF files, and add cases to the quickstart document for layout analysis.

paddle-bot bot added the contributor label Apr 24, 2024

paddle-bot bot assigned Sunting78 Apr 24, 2024

GreatV assigned GreatV and unassigned Sunting78 Apr 24, 2024

GreatV reviewed Apr 25, 2024

Add two examples of PDF layout analysis to the quickstart file of pps…

46a94d5

…tructure.

guangyunms requested a review from GreatV April 25, 2024 03:41

GreatV reviewed Apr 25, 2024

Add a return comment for the check_img function

704d8dc

guangyunms requested a review from GreatV April 25, 2024 05:00

GreatV approved these changes Apr 25, 2024

GreatV merged commit f7117ef into PaddlePaddle:main Apr 25, 2024
3 checks passed

github-actions bot locked as resolved and limited conversation to collaborators Nov 11, 2024
