Commit 546f310

Merge pull request #344 from pymupdf/0.2.5
Version 0.2.5
2 parents 681673f + 5f51003 commit 546f310

9 files changed

+260
-87
lines changed

CHANGES.md

Lines changed: 15 additions & 0 deletions
@@ -1,5 +1,19 @@
11
# Change Log
22

3+
## Changes in version 0.2.5
4+
5+
### Fixes:
6+
7+
* [341](https://github.com/pymupdf/RAG/issues/341) - Broken markdown parsing for new line directly followed by 'o'...
8+
9+
### Other Changes:
10+
11+
* New parameter `table_format` in method `to_text()` (PyMuPDF-Layout only). This allows selecting the appearance of tables in plain text outputs. The possible values are defined in the list `tabulate.tabulate_formats`. Default is "grid".
12+
* Installing PyMuPDF4LLM now supports including all optional dependencies in the `pip` command: `pip install --upgrade pymupdf4llm[ocr,layout]`. This will install pymupdf4llm, pymupdf, and pymupdf-layout. The "ocr" extra, when needed, installs opencv-python for automatic OCR support in PyMuPDF-Layout mode. Combine this with the options `--upgrade`, `--force-reinstall` or `--no-cache-dir` as necessary.
13+
* Major rework of the heuristics that determine whether a page should be OCR'd.
14+
15+
------
16+
317
## Changes in version 0.2.4
418

519
### Fixes:
@@ -10,6 +24,7 @@
1024

1125

1226
------
27+
1328
## Changes in version 0.2.3
1429

1530
### Fixes:

pymupdf4llm/README.md

Lines changed: 76 additions & 12 deletions
@@ -1,4 +1,4 @@
1-
# Using PyMuPDF as Data Feeder in LLM / RAG Applications
1+
# Using PyMuPDF as a Data Feeder in LLM / RAG Applications
22

33
This package converts the pages of a PDF to text in Markdown format using [PyMuPDF](https://pypi.org/project/PyMuPDF/).
44

@@ -8,42 +8,105 @@ Header lines are identified via the font size and appropriately prefixed with on
88

99
Bold, italic, mono-spaced text and code blocks are detected and formatted accordingly. Similar applies to ordered and unordered lists.
1010

11-
By default, all document pages are processed. If desired, a subset of pages can be specified by providing a list of 0-based page numbers.
11+
By default, all document pages are processed. If desired, a subset of pages can be specified by providing a sequence of 0-based page numbers.
1212

13+
-----
14+
15+
[PyMuPDF-Layout](https://pypi.org/project/pymupdf-layout/) is an optional extension of PyMuPDF. It offers improved, AI-based page layout analysis, which for instance yields a much higher table recognition rate.
16+
17+
Since version 0.2.0, pymupdf4llm fully supports pymupdf-layout. As part of this, output as plain text or a JSON string is also possible. In addition, pages are automatically OCR'd when a number of criteria are met, provided the package [opencv-python](https://pypi.org/project/opencv-python/) is installed and Tesseract is available on the platform.
18+
19+
Layout mode is activated with a simple modification of the import statements - for details, please see below.
1320

1421
# Installation
1522

1623
```bash
1724
$ pip install -U pymupdf4llm
1825
```
1926

20-
> This command will automatically install [PyMuPDF](https://github.com/pymupdf/PyMuPDF) if required.
27+
> This command will automatically install or upgrade [PyMuPDF](https://github.com/pymupdf/PyMuPDF) as required.
28+
29+
To install all Python packages for full support of the layout feature and automatic OCR, you can use the following command version:
30+
31+
```bash
32+
$ pip install -U pymupdf4llm[ocr,layout]
33+
```
34+
35+
This will install opencv-python and pymupdf-layout in addition to pymupdf4llm and pymupdf.
36+
37+
# Execution
38+
## Legacy Mode
39+
For **_standard (legacy) markdown extraction_**, use the following simple script:
40+
41+
```python
42+
import pymupdf4llm
43+
44+
md_text = pymupdf4llm.to_markdown("input.pdf")
45+
46+
# now work with the markdown text, e.g. store as a UTF8-encoded file
47+
import pathlib
48+
pathlib.Path("output.md").write_bytes(md_text.encode())
49+
```
50+
51+
Instead of the filename string as above, one can also provide a PyMuPDF `Document`.
2152

22-
Then in your script do:
53+
By default, all pages in the PDF will be processed. If desired, the parameter `pages=<sequence>` can be used to provide a sequence of zero-based page numbers to consider.
54+
55+
## Layout Mode
56+
To **_activate layout mode_**, use the following:
2357

2458
```python
59+
import pymupdf.layout # activate PyMuPDF-Layout in pymupdf
2560
import pymupdf4llm
2661

62+
# The remainder of the script is unchanged
2763
md_text = pymupdf4llm.to_markdown("input.pdf")
2864

2965
# now work with the markdown text, e.g. store as a UTF8-encoded file
3066
import pathlib
3167
pathlib.Path("output.md").write_bytes(md_text.encode())
3268
```
3369

34-
Instead of the filename string as above, one can also provide a PyMuPDF `Document`. By default, all pages in the PDF will be processed. If desired, the parameter `pages=[...]` can be used to provide a list of zero-based page numbers to consider.
70+
Here are the JSON and plain text output versions.
71+
72+
### JSON
73+
74+
```python
75+
import pymupdf.layout # activate PyMuPDF-Layout in pymupdf
76+
import pymupdf4llm
77+
78+
json_text = pymupdf4llm.to_json("input.pdf")
79+
80+
# now work with the JSON text, e.g. store as a UTF8-encoded file
81+
import pathlib
82+
pathlib.Path("output.json").write_text(json_text)
83+
```
84+
85+
### Plain Text
86+
87+
```python
88+
import pymupdf.layout # activate PyMuPDF-Layout in pymupdf
89+
import pymupdf4llm
90+
91+
plain_text = pymupdf4llm.to_text("input.pdf")
92+
93+
# now work with the plain text, e.g. store as a UTF8-encoded file
94+
import pathlib
95+
pathlib.Path("output.txt").write_bytes(plain_text.encode())
96+
```
97+
3598

3699
**Feature Overview:**
37100

38101
* Support for pages with **_multiple text columns_**.
39102
* Support for **_image and vector graphics extraction_**:
40103

41-
1. Specify `pymupdf4llm.to_markdown("input.pdf", write_images=True)`. Default is `False`.
42-
2. Each image or vector graphic on the page will be extracted and stored as an image named `"input.pdf-pno-index.extension"` in a folder of your choice. The image `extension` can be chosen to represent a PyMuPDF-supported image format (for instance "png" or "jpg"), `pno` is the 0-based page number and `index` is some sequence number.
43-
3. The image files will have width and height equal to the values on the page. The desired resolution can be chosen via parameter `dpi` (default: `dpi=150`).
44-
4. Any text contained in the images or graphics will be extracted and **also become visible as part of the generated image**. This behavior can be changed via `force_text=False` (text only apears as part of the image).
104+
1. Specify either `write_images=True` or `embed_images=True`. Default is `False`.
105+
2. Images and vector graphics on the page will be stored as images named `"input.pdf-pno-index.extension"` in a folder of your choice or be embedded in the markdown text as base64-encoded strings. The image `extension` can be chosen to represent a PyMuPDF-supported image format (for instance "png" or "jpg"), `pno` is the 0-based page number and `index` is some sequence number.
106+
3. The image files will have width and height equal to the values on the page. The desired resolution can be chosen via parameter `dpi` (default: `dpi=150`). So this is not an actual **_extraction_** but rather a rendering of the respective page area.
107+
4. Any standard text written in image areas will become a visible part of the generated image and otherwise be ignored. This behavior can be changed via `force_text=True` which causes the text to also become part of the output.
45108

46-
* Support for **page chunks**: Instead of returning one large string for the whole document, a list of dictionaries can be generated: one for each page. Specify `data = pymupdf4llm.to_markdown("input.pdf", page_chunks=True)`. Then, for instance the first item, `data[0]` will contain a dictionary for the first page with the text and some metadata.
109+
* Support for **page chunks**: Instead of returning one large string for the whole document, a list of dictionaries can be generated: one for each page. Specify `data = pymupdf4llm.to_markdown("input.pdf", page_chunks=True)`. Then, for instance the first item, `data[0]` will contain a dictionary for the first page with its text and some metadata.
47110

48111
* As a first example for directly supporting LLM / RAG consumers, this version can output **LlamaIndex documents**:
49112

@@ -57,6 +120,7 @@ Instead of the filename string as above, one can also provide a PyMuPDF `Documen
57120
# Every list item contains metadata and the markdown text of 1 page.
58121
```
59122

60-
* A LlamaIndex document essentially corresponds to Python dictionary, where the markdown text of the page is one of the dictionary values. For instance the text of the first page is the the value of `data[0].to_dict().["text"]`.
123+
* A LlamaIndex document essentially corresponds to a Python dictionary, where the markdown text of the page is one of the dictionary values. For instance, the text of the first page is the value of `data[0].to_dict()["text"]`.
61124
* For details, please consult LlamaIndex documentation.
62-
* Upon creation of the `LlamaMarkdownReader` all necessary LlamaIndex-related imports are executed. Required related package installations must have been done independently and will not be checked during pymupdf4llm installation.
125+
* Upon creation of the `LlamaMarkdownReader` all necessary LlamaIndex-related imports are executed. Required related package installations must have been done independently and will not be checked during pymupdf4llm installation.
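
As a rough illustration of the `embed_images=True` option described above, the sketch below shows how image bytes can be embedded in markdown as a base64 data URI using only the standard library. The helper name `embed_image_markdown` and the exact link syntax are assumptions made for this example; the markup that pymupdf4llm actually emits may differ.

```python
import base64


def embed_image_markdown(image_bytes, image_format="png", alt_text=""):
    """Return a markdown image link carrying the image as a base64 data URI.

    Hypothetical helper for illustration only, not part of pymupdf4llm.
    """
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"![{alt_text}](data:image/{image_format};base64,{encoded})"


# Stand-in payload: the 8-byte PNG file signature instead of a real image.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
snippet = embed_image_markdown(PNG_SIGNATURE)
print(snippet)
```

Embedding keeps the markdown self-contained, at the cost of a noticeably larger text output than when images are written to separate files.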
126+

pymupdf4llm/pymupdf4llm/__init__.py

Lines changed: 2 additions & 0 deletions
@@ -146,6 +146,7 @@ def to_text(
146146
force_text=True,
147147
ocr_dpi=400,
148148
use_ocr=True,
149+
table_format="grid",
149150
# unsupported options for pymupdf layout:
150151
**kwargs,
151152
):
@@ -164,6 +165,7 @@ def to_text(
164165
footer=footer,
165166
ignore_code=ignore_code,
166167
show_progress=show_progress,
168+
table_format=table_format,
167169
)
168170

169171

pymupdf4llm/pymupdf4llm/helpers/check_ocr.py

Lines changed: 81 additions & 25 deletions
@@ -107,8 +107,48 @@
107107
--------------------------------------------------------------------------
108108
"""
109109

110+
"""
111+
Functions detecting general photos versus text-heavy images.
112+
"""
113+
114+
115+
def entropy_check(img_gray, threshold=4.5):
116+
"""Compute Shannon entropy of grayscale image."""
117+
hist = cv2.calcHist([img_gray], [0], None, [256], [0, 256])
118+
hist = hist.ravel() / hist.sum()
119+
hist = hist[hist > 0]
120+
entropy = -np.sum(hist * np.log2(hist))
121+
return entropy < threshold, entropy
122+
123+
124+
def fft_check(img_gray, threshold=0.15):
125+
"""Check share of magnitude in the central (low-frequency) region of the shifted FFT spectrum."""
126+
# Downsample for speed
127+
small = cv2.resize(img_gray, (128, 128))
128+
f = np.fft.fft2(small)
129+
fshift = np.fft.fftshift(f)
130+
magnitude = np.abs(fshift)
131+
h, w = magnitude.shape
132+
center = magnitude[h // 4 : 3 * h // 4, w // 4 : 3 * w // 4]
133+
ratio = center.sum() / magnitude.sum()
134+
return ratio < threshold, ratio
110135

111-
def get_span_ocr(page, bbox, dpi=300):
136+
137+
def components_check(img_gray, min_components=50):
138+
"""Count connected components after thresholding."""
139+
_, bw = cv2.threshold(img_gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
140+
num_labels, _ = cv2.connectedComponents(bw)
141+
return num_labels < min_components, num_labels
142+
143+
144+
def edge_density_check(img_gray, threshold=0.01):
145+
"""Compute edge density using Canny."""
146+
edges = cv2.Canny(img_gray, 100, 200)
147+
density = edges.sum() / 255.0 / edges.size
148+
return density < threshold, density
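
To make the entropy test above concrete, here is a minimal pure-Python sketch of the same Shannon entropy computation, without OpenCV or NumPy. The names `shannon_entropy` and `entropy_flag` are hypothetical; only the default threshold of 4.5 is taken from the diff.

```python
import math
from collections import Counter


def shannon_entropy(pixels):
    """Return the Shannon entropy (in bits) of a sequence of gray values."""
    counts = Counter(pixels)
    total = sum(counts.values())
    # Only gray levels that actually occur contribute to the sum.
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def entropy_flag(pixels, threshold=4.5):
    """Mirror the (flag, value) return shape of entropy_check in the diff."""
    entropy = shannon_entropy(pixels)
    return entropy < threshold, entropy


flat = [255] * 1024            # a single gray level
two_tone = [0, 255] * 512      # two equally frequent gray levels
ramp = list(range(256)) * 4    # all 256 gray levels, equally frequent

for name, pixels in (("flat", flat), ("two_tone", two_tone), ("ramp", ramp)):
    flag, value = entropy_flag(pixels)
    print(f"{name}: entropy={abs(value):.3f} below_threshold={flag}")
```

An image with few distinct gray levels stays well below the threshold, while one that uses all 256 levels evenly reaches the maximum of 8 bits.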
149+
150+
151+
def get_span_ocr(page, bbox, dpi=400):
112152
"""Return OCR'd span text using Tesseract.
113153
114154
Args:
@@ -127,7 +167,7 @@ def get_span_ocr(page, bbox, dpi=300):
127167
return text
128168

129169

130-
def repair_blocks(input_blocks, page):
170+
def repair_blocks(input_blocks, page, dpi=400):
131171
"""Repair text blocks with missing glyphs using OCR.
132172
133173
TODO: Support non-linear block structure.
@@ -148,7 +188,7 @@ def repair_blocks(input_blocks, page):
148188
if REPLACEMENT_CHARACTER not in span_text:
149189
continue
150190
span_text_len = len(span_text)
151-
new_text = get_span_ocr(page, span["bbox"])[:span_text_len]
191+
new_text = get_span_ocr(page, span["bbox"], dpi=dpi)[:span_text_len]
152192
if "chars" in span:
153193
# rebuild chars array
154194
new_chars = []
@@ -177,25 +217,48 @@ def get_page_image(page, dpi=150, covered=None):
177217
if covered is None:
178218
covered = page.rect
179219
covered = covered.irect
180-
pix = page.get_pixmap(dpi=dpi)
181-
matrix = pymupdf.Rect(pix.irect).torect(page.rect)
182-
183-
# make a sub-pixmap of the covered area
184-
pix_covered = pymupdf.Pixmap(pymupdf.csRGB, covered)
185-
pix_covered.copy(pix, covered) # copy over covered area
220+
# make a gray pixmap of the covered area
221+
pix_covered = page.get_pixmap(colorspace=pymupdf.csGRAY, clip=covered)
186222
# convert to numpy array
187-
img = np.frombuffer(pix_covered.samples, dtype=np.uint8).reshape(
223+
gray = np.frombuffer(pix_covered.samples, dtype=np.uint8).reshape(
188224
pix_covered.height, pix_covered.width, pix_covered.n
189225
)
190-
# cv2 needs the gray image version of this
191-
gray = cv2.cvtColor(img, cv2.COLOR_RGB2GRAY)
192-
return gray, matrix, pix
226+
photo_entropy, entropy_val = entropy_check(gray)
227+
photo_fft, fft_val = fft_check(gray)
228+
photo_components, comp_val = components_check(gray)
229+
photo_edges, edge_val = edge_density_check(gray)
230+
231+
# print(f"Entropy: {entropy_val:.3f} → {photo_entropy}")
232+
# print(f"FFT ratio: {fft_val:.3f} → {photo_fft}")
233+
# print(f"Components: {comp_val} → {photo_components}")
234+
# print(f"Edge density: {edge_val:.6f} → {photo_edges}")
235+
236+
# Weighted decision logic
237+
score = 0
238+
if photo_components:
239+
score += 2
240+
if photo_edges:
241+
score += 2
242+
if photo_entropy:
243+
score += 1
244+
if photo_fft:
245+
score += 1
246+
# print(f"{score=}")
247+
if score >= 3:
248+
pix = None
249+
matrix = pymupdf.Identity
250+
photo = True
251+
else:
252+
pix = page.get_pixmap(dpi=dpi)
253+
matrix = pymupdf.Rect(pix.irect).torect(page.rect)
254+
photo = False
255+
256+
return matrix, pix, photo
193257

194258

195259
def should_ocr_page(
196260
page,
197261
dpi=150,
198-
edge_thresh=0.02,
199262
vector_thresh=0.9,
200263
image_coverage_thresh=0.9,
201264
text_readability_thresh=0.9,
@@ -207,7 +270,6 @@ def should_ocr_page(
207270
Parameters:
208271
page: PyMuPDF page object
209272
dpi: DPI used for rasterization
210-
edge_thresh: minimum edge density to suggest text presence
211273
vector_thresh: minimum number of vector paths to suggest glyph simulation
212274
image_coverage_thresh: fraction of page area covered by images to trigger OCR
213275
text_readability_thresh: fraction of readable characters to skip OCR
@@ -225,7 +287,6 @@ def should_ocr_page(
225287
"has_vector_chars": False,
226288
"transform": pymupdf.Identity,
227289
"pixmap": None,
228-
"edge_density": 0.0,
229290
}
230291
page_rect = page.rect
231292
page_area = abs(page_rect) # size of the full page
@@ -279,21 +340,16 @@ def should_ocr_page(
279340
assert decision["should_ocr"] is True
280341

281342
if not decision["has_text"]:
282-
# Rasterize and analyze edge density
283-
img, matrix, pix = get_page_image(page, dpi=dpi, covered=analysis["covered"])
343+
# Rasterize and check for photo versus text-heaviness
344+
matrix, pix, photo = get_page_image(page, dpi=dpi, covered=analysis["covered"])
284345

285-
# Analyze edge density
286-
edges = cv2.Canny(img, 100, 200)
287-
decision["edge_density"] = float(np.sum(edges > 0) / edges.size)
288-
if decision["edge_density"] <= edge_thresh:
346+
if photo:
289347
# this seems to be a non-text picture page
290348
decision["should_ocr"] = False
349+
decision["pixmap"] = None
291350
else:
292351
decision["should_ocr"] = True
293352
decision["transform"] = matrix
294353
decision["pixmap"] = pix
295354

296-
if decision["should_ocr"]:
297-
decision["transform"] = matrix
298-
decision["pixmap"] = pix
299355
return decision
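
The weighted decision in `get_page_image` can be sketched in isolation as follows. The weights (2 for components and edge density, 1 for entropy and FFT) and the cutoff of 3 are taken from the diff; the names `WEIGHTS`, `photo_score` and `looks_like_photo` are hypothetical and used for illustration only.

```python
# Weights as used in the diff: two "strong" checks and two "weak" ones.
WEIGHTS = {"components": 2, "edges": 2, "entropy": 1, "fft": 1}


def photo_score(flags, weights=WEIGHTS):
    """Sum the weights of all checks whose flag is set."""
    return sum(weight for name, weight in weights.items() if flags.get(name))


def looks_like_photo(flags, min_score=3):
    """Return True if the combined score reaches the cutoff."""
    return photo_score(flags) >= min_score


examples = [
    {"components": True},                  # one strong check only
    {"entropy": True, "fft": True},        # both weak checks
    {"components": True, "fft": True},     # strong plus weak
    {"components": True, "edges": True},   # both strong checks
]
for flags in examples:
    print(sorted(flags), photo_score(flags), looks_like_photo(flags))
```

With these weights, a single check never reaches the cutoff on its own: a page is only treated as a photo when at least two checks agree and at least one of them is the components or edge-density check.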
