pymupdf
diff --git a/CHANGES.md b/CHANGES.md
Lines changed: 15 additions & 0 deletions
diff --git a/pymupdf4llm/README.md b/pymupdf4llm/README.md
Lines changed: 76 additions & 12 deletions
diff --git a/pymupdf4llm/pymupdf4llm/__init__.py b/pymupdf4llm/pymupdf4llm/__init__.py
Lines changed: 2 additions & 0 deletions
diff --git a/pymupdf4llm/pymupdf4llm/helpers/check_ocr.py b/pymupdf4llm/pymupdf4llm/helpers/check_ocr.py
Lines changed: 81 additions & 25 deletions
@@ -1,5 +1,19 @@
 # Change Log
 
+## Changes in version 0.2.5
+
+### Fixes:
+
+* [341](https://github.com/pymupdf/RAG/issues/341) - Broken markdown parsing for new line directly followed by 'o'...
+
+### Other Changes:
+
+* New parameter `table_format` in method `to_text()` (PyMuPDF-Layout only). This allows selecting the appearance of tables in plain text outputs. The possible values are defined in the list `tabulate.tabulate_formats`. Default is "grid".
+* Installing PyMuPDF4LLM now supports including all optional dependencies in the `pip` command: `pip install --upgrade pymupdf4llm[ocr,layout]`. This will install pymupdf4llm, pymupdf, and pymupdf-layout. The "ocr" extra, when requested, installs opencv-python for automatic OCR support in PyMuPDF-Layout mode. Combine this with the options `--upgrade`, `--force-reinstall` or `--no-cache-dir` as necessary.
+* Major rework of the heuristics that determine whether a page should be OCR'd.
+
+------
+
 ## Changes in version 0.2.4
 
 ### Fixes:
@@ -10,6 +24,7 @@
 
 
 ------
+
 ## Changes in version 0.2.3
 
 ### Fixes:
 
@@ -1,4 +1,4 @@
-# Using PyMuPDF as Data Feeder in LLM / RAG Applications
+# Using PyMuPDF as a Data Feeder in LLM / RAG Applications
 
 This package converts the pages of a PDF to text in Markdown format using [PyMuPDF](https://pypi.org/project/PyMuPDF/).
 
@@ -8,42 +8,105 @@ Header lines are identified via the font size and appropriately prefixed with on
 
 Bold, italic, mono-spaced text and code blocks are detected and formatted accordingly. Similar applies to ordered and unordered lists.
 
-By default, all document pages are processed. If desired, a subset of pages can be specified by providing a list of 0-based page numbers.
+By default, all document pages are processed. If desired, a subset of pages can be specified by providing a sequence of 0-based page numbers.
 
+-----
+
+[PyMuPDF-Layout](https://pypi.org/project/pymupdf-layout/) is an optional extension of PyMuPDF. It offers improved, AI-based page layout analysis, which among other things yields a much higher table recognition rate.
+
+Since version 0.2.0, pymupdf4llm fully supports pymupdf-layout. As part of this, output as plain text or a JSON string is also possible. In addition, each page is checked against a number of criteria and automatically OCR'd when they call for it, provided that the package [opencv-python](https://pypi.org/project/opencv-python/) is installed and Tesseract is available on the platform.
+
+Layout mode is activated with a simple modification of the import statements - for details, please see below.
 
 # Installation
 
 ```bash
 $ pip install -U pymupdf4llm
 ```
 
-> This command will automatically install [PyMuPDF](https://github.com/pymupdf/PyMuPDF) if required.
+> This command will automatically install or upgrade [PyMuPDF](https://github.com/pymupdf/PyMuPDF) as required.
+
+To install all Python packages for full support of the layout feature and automatic OCR, you can use the following command version:
+
+```bash
+$ pip install -U pymupdf4llm[ocr,layout]
+```
+
+This will install opencv-python and pymupdf-layout in addition to pymupdf4llm and pymupdf.
+
+# Execution
+## Legacy Mode
+For **_standard (legacy) markdown extraction_**, use the following simple script:
+
+```python
+import pymupdf4llm
+
+md_text = pymupdf4llm.to_markdown("input.pdf")
+
+# now work with the markdown text, e.g. store as a UTF8-encoded file
+import pathlib
+pathlib.Path("output.md").write_bytes(md_text.encode())
+```
+
+Instead of the filename string as above, one can also provide a PyMuPDF `Document`.
 
-Then in your script do:
+By default, all pages in the PDF will be processed. If desired, the parameter `pages=<sequence>` can be used to provide a sequence of zero-based page numbers to consider.
+
+## Layout Mode
+To **_activate layout mode_**, add one import to the script:
 
 ```python
+import pymupdf.layout  # activate PyMuPDF-Layout in pymupdf
 import pymupdf4llm
 
+# The remainder of the script is unchanged
 md_text = pymupdf4llm.to_markdown("input.pdf")
 
 # now work with the markdown text, e.g. store as a UTF8-encoded file
 import pathlib
 pathlib.Path("output.md").write_bytes(md_text.encode())
 ```
 
-Instead of the filename string as above, one can also provide a PyMuPDF `Document`. By default, all pages in the PDF will be processed. If desired, the parameter `pages=[...]` can be used to provide a list of zero-based page numbers to consider.
+Here are the JSON and plain text output versions.
+
+### JSON
+
+```python
+import pymupdf.layout  # activate PyMuPDF-Layout in pymupdf
+import pymupdf4llm
+
+json_text = pymupdf4llm.to_json("input.pdf")
+
+# now work with the JSON text, e.g. store as a UTF8-encoded file
+import pathlib
+pathlib.Path("output.json").write_text(json_text, encoding="utf-8")
+```
+
+### Plain Text
+
+```python
+import pymupdf.layout  # activate PyMuPDF-Layout in pymupdf
+import pymupdf4llm
+
+plain_text = pymupdf4llm.to_text("input.pdf")
+
+# now work with the plain text, e.g. store as a UTF8-encoded file
+import pathlib
+pathlib.Path("output.txt").write_bytes(plain_text.encode())
+```
+
 
 **Feature Overview:**
 
 * Support for pages with **_multiple text columns_**.
 * Support for **_image and vector graphics extraction_**:
 
-    1. Specify `pymupdf4llm.to_markdown("input.pdf", write_images=True)`. Default is `False`.
-    2. Each image or vector graphic on the page will be extracted and stored as an image named `"input.pdf-pno-index.extension"` in a folder of your choice. The image `extension` can be chosen to represent a PyMuPDF-supported image format (for instance "png" or "jpg"),  `pno` is the 0-based page number and `index` is some sequence number.
-    3. The image files will have width and height equal to the values on the page. The desired resolution can be chosen via parameter `dpi` (default: `dpi=150`).
-    4. Any text contained in the images or graphics will be extracted and **also become visible as part of the generated image**. This behavior can be changed via `force_text=False` (text only apears as part of the image).
+    1. Specify either `write_images=True` or `embed_images=True`. Both default to `False`.
+    2. Images and vector graphics on the page will be stored as images named `"input.pdf-pno-index.extension"` in a folder of your choice or be embedded in the markdown text as base64-encoded strings. The image `extension` can be chosen to represent a PyMuPDF-supported image format (for instance "png" or "jpg"),  `pno` is the 0-based page number and `index` is some sequence number.
+    3. The image files will have width and height equal to the values on the page. The desired resolution can be chosen via parameter `dpi` (default: `dpi=150`). So this is not an actual **_extraction_** but rather rendering of the respective page area.
+    4. Any standard text written in image areas will become a visible part of the generated image and otherwise be ignored. This behavior can be changed via `force_text=True` which causes the text to also become part of the output.
 
-* Support for **page chunks**: Instead of returning one large string for the whole document, a list of dictionaries can be generated: one for each page. Specify `data = pymupdf4llm.to_markdown("input.pdf", page_chunks=True)`. Then, for instance the first item, `data[0]` will contain a dictionary for the first page with the text and some metadata.
+* Support for **page chunks**: Instead of returning one large string for the whole document, a list of dictionaries can be generated, one for each page. Specify `data = pymupdf4llm.to_markdown("input.pdf", page_chunks=True)`. Then, for instance, the first item `data[0]` will contain a dictionary for the first page with its text and some metadata.
 
 * As a first example for directly supporting LLM / RAG consumers, this version can output **LlamaIndex documents**:
 
@@ -57,6 +120,7 @@ Instead of the filename string as above, one can also provide a PyMuPDF `Documen
     # Every list item contains metadata and the markdown text of 1 page.
     ```
 
-    * A LlamaIndex document essentially corresponds to Python dictionary, where the markdown text of the page is one of the dictionary values. For instance the text of the first page is the the value of `data[0].to_dict().["text"]`.
+    * A LlamaIndex document essentially corresponds to a Python dictionary, where the markdown text of the page is one of the dictionary values. For instance, the text of the first page is the value of `data[0].to_dict()["text"]`.
     * For details, please consult LlamaIndex documentation.
-    * Upon creation of the `LlamaMarkdownReader` all necessary LlamaIndex-related imports are executed. Required related package installations must have been done independently and will not be checked during pymupdf4llm installation.
+    * Upon creation of the `LlamaMarkdownReader` all necessary LlamaIndex-related imports are executed. Required related package installations must have been done independently and will not be checked during pymupdf4llm installation.
+    
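As an aside on the `embed_images=True` option described in the README hunk above: embedding an image "as a base64-encoded string" typically means placing a `data:` URI inside a Markdown image link. The following sketch uses only the standard library to show that general technique. The helper name `markdown_data_uri` and its parameters are hypothetical, and the exact string that pymupdf4llm emits may differ.

```python
import base64


def markdown_data_uri(image_bytes, mime="image/png", alt=""):
    """Build a Markdown image link whose target is a base64 data URI.

    Hypothetical helper for illustration only; not part of pymupdf4llm.
    """
    payload = base64.b64encode(image_bytes).decode("ascii")
    return f"![{alt}](data:{mime};base64,{payload})"


# Only the 8-byte PNG signature, not a complete image; enough to show the shape.
fake_png = b"\x89PNG\r\n\x1a\n"
link = markdown_data_uri(fake_png)
print(link)

# The original bytes can be recovered from the link text.
encoded = link.split("base64,", 1)[1].rstrip(")")
recovered = base64.b64decode(encoded)
```

Embedding keeps the Markdown output self-contained at the cost of size, since base64 inflates the image data by roughly one third.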
@@ -146,6 +146,7 @@ def to_text(
         force_text=True,
         ocr_dpi=400,
         use_ocr=True,
+        table_format="grid",
         # unsupported options for pymupdf layout:
         **kwargs,
     ):
@@ -164,6 +165,7 @@ def to_text(
             footer=footer,
             ignore_code=ignore_code,
             show_progress=show_progress,
+            table_format=table_format,
         )
 
 
 
@@ -107,8 +107,48 @@
 --------------------------------------------------------------------------
 """
 
+"""
+Functions detecting general photos versus text-heavy images.
+"""
+
+
+def entropy_check(img_gray, threshold=4.5):
+    """Compute Shannon entropy of grayscale image."""
+    hist = cv2.calcHist([img_gray], [0], None, [256], [0, 256])
+    hist = hist.ravel() / hist.sum()
+    hist = hist[hist > 0]
+    entropy = -np.sum(hist * np.log2(hist))
+    return entropy < threshold, entropy
+
+
+def fft_check(img_gray, threshold=0.15):
+    """Check ratio of central (low-frequency) energy in the shifted FFT spectrum."""
+    # Downsample for speed
+    small = cv2.resize(img_gray, (128, 128))
+    f = np.fft.fft2(small)
+    fshift = np.fft.fftshift(f)
+    magnitude = np.abs(fshift)
+    h, w = magnitude.shape
+    center = magnitude[h // 4 : 3 * h // 4, w // 4 : 3 * w // 4]
+    ratio = center.sum() / magnitude.sum()
+    return ratio < threshold, ratio
 
-def get_span_ocr(page, bbox, dpi=300):
+
+def components_check(img_gray, min_components=50):
+    """Count connected components after thresholding."""
+    _, bw = cv2.threshold(img_gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
+    num_labels, _ = cv2.connectedComponents(bw)
+    return num_labels < min_components, num_labels
+
+
+def edge_density_check(img_gray, threshold=0.01):
+    """Compute edge density using Canny."""
+    edges = cv2.Canny(img_gray, 100, 200)
+    density = edges.sum() / 255.0 / edges.size
+    return density < threshold, density
+
+
+def get_span_ocr(page, bbox, dpi=400):
     """Return OCR'd span text using Tesseract.
 
     Args:
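To make the `entropy_check` helper added in this hunk easier to review, here is a dependency-free sketch of the same quantity: the Shannon entropy of the gray-level distribution, H = -sum(p_i * log2(p_i)). The function name `shannon_entropy` and the sample pixel lists are illustrative assumptions; the diff itself builds the histogram with `cv2.calcHist` and NumPy.

```python
import math
from collections import Counter


def shannon_entropy(pixels):
    """Return the Shannon entropy (in bits) of a flat sequence of gray values.

    Illustrative stand-in for the histogram-based computation in the diff.
    """
    counts = Counter(pixels)
    total = len(pixels)
    # Levels that never occur are simply absent from the counter,
    # which mirrors the `hist[hist > 0]` filter in the diff.
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


flat = [128] * 1024              # a single gray level: no uncertainty
two_tone = [0, 255] * 512        # two equally frequent levels: 1 bit
uniform = list(range(256)) * 4   # all 256 levels equally frequent: 8 bits

for name, sample in (("flat", flat), ("two_tone", two_tone), ("uniform", uniform)):
    print(name, shannon_entropy(sample))
```

With the default `threshold=4.5` from the diff, the first two samples fall below the threshold and the third does not.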
@@ -127,7 +167,7 @@ def get_span_ocr(page, bbox, dpi=300):
     return text
 
 
-def repair_blocks(input_blocks, page):
+def repair_blocks(input_blocks, page, dpi=400):
     """Repair text blocks with missing glyphs using OCR.
 
     TODO: Support non-linear block structure.
@@ -148,7 +188,7 @@ def repair_blocks(input_blocks, page):
                 if not REPLACEMENT_CHARACTER in span_text:
                     continue
                 span_text_len = len(span_text)
-                new_text = get_span_ocr(page, span["bbox"])[:span_text_len]
+                new_text = get_span_ocr(page, span["bbox"], dpi=dpi)[:span_text_len]
                 if "chars" in span:
                     # rebuild chars array
                     new_chars = []
@@ -177,25 +217,48 @@ def get_page_image(page, dpi=150, covered=None):
     if covered is None:
         covered = page.rect
     covered = covered.irect
-    pix = page.get_pixmap(dpi=dpi)
-    matrix = pymupdf.Rect(pix.irect).torect(page.rect)
-
-    # make a sub-pixmap of the covered area
-    pix_covered = pymupdf.Pixmap(pymupdf.csRGB, covered)
-    pix_covered.copy(pix, covered)  # copy over covered area
+    # make a gray pixmap of the covered area
+    pix_covered = page.get_pixmap(colorspace=pymupdf.csGRAY, clip=covered)
     # convert to numpy array
-    img = np.frombuffer(pix_covered.samples, dtype=np.uint8).reshape(
+    gray = np.frombuffer(pix_covered.samples, dtype=np.uint8).reshape(
         pix_covered.height, pix_covered.width, pix_covered.n
     )
-    # cv2 needs the gray image version of this
-    gray = cv2.cvtColor(img, cv2.COLOR_RGB2GRAY)
-    return gray, matrix, pix
+    photo_entropy, entropy_val = entropy_check(gray)
+    photo_fft, fft_val = fft_check(gray)
+    photo_components, comp_val = components_check(gray)
+    photo_edges, edge_val = edge_density_check(gray)
+
+    # print(f"Entropy: {entropy_val:.3f} → {photo_entropy}")
+    # print(f"FFT ratio: {fft_val:.3f} → {photo_fft}")
+    # print(f"Components: {comp_val} → {photo_components}")
+    # print(f"Edge density: {edge_val:.6f} → {photo_edges}")
+
+    # Weighted decision logic
+    score = 0
+    if photo_components:
+        score += 2
+    if photo_edges:
+        score += 2
+    if photo_entropy:
+        score += 1
+    if photo_fft:
+        score += 1
+    # print(f"{score=}")
+    if score >= 3:
+        pix = None
+        matrix = pymupdf.Identity
+        photo = True
+    else:
+        pix = page.get_pixmap(dpi=dpi)
+        matrix = pymupdf.Rect(pix.irect).torect(page.rect)
+        photo = False
+
+    return matrix, pix, photo
 
 
 def should_ocr_page(
     page,
     dpi=150,
-    edge_thresh=0.02,
     vector_thresh=0.9,
     image_coverage_thresh=0.9,
     text_readability_thresh=0.9,
@@ -207,7 +270,6 @@ def should_ocr_page(
     Parameters:
         page: PyMuPDF page object
         dpi: DPI used for rasterization
-        edge_thresh: minimum edge density to suggest text presence
         vector_thresh: minimum number of vector paths to suggest glyph simulation
         image_coverage_thresh: fraction of page area covered by images to trigger OCR
         text_readability_thresh: fraction of readable characters to skip OCR
@@ -225,7 +287,6 @@ def should_ocr_page(
         "has_vector_chars": False,
         "transform": pymupdf.Identity,
         "pixmap": None,
-        "edge_density": 0.0,
     }
     page_rect = page.rect
     page_area = abs(page_rect)  # size of the full page
@@ -279,21 +340,16 @@ def should_ocr_page(
     assert decision["should_ocr"] is True
 
     if not decision["has_text"]:
-        # Rasterize and analyze edge density
-        img, matrix, pix = get_page_image(page, dpi=dpi, covered=analysis["covered"])
+        # Rasterize and check for photo versus text-heaviness
+        matrix, pix, photo = get_page_image(page, dpi=dpi, covered=analysis["covered"])
 
-        # Analyze edge density
-        edges = cv2.Canny(img, 100, 200)
-        decision["edge_density"] = float(np.sum(edges > 0) / edges.size)
-        if decision["edge_density"] <= edge_thresh:
+        if photo:
             # this seems to be a non-text picture page
             decision["should_ocr"] = False
+            decision["pixmap"] = None
         else:
             decision["should_ocr"] = True
             decision["transform"] = matrix
             decision["pixmap"] = pix
 
-    if decision["should_ocr"]:
-        decision["transform"] = matrix
-        decision["pixmap"] = pix
     return decision
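The weighted vote that `get_page_image` now uses to classify a page image can be restated compactly. This is a sketch of the decision rule only, using the weights and cutoff visible in the diff (2 each for the component and edge-density checks, 1 each for the entropy and FFT checks, photo if the score reaches 3). The function names `photo_score` and `is_photo` and the `cutoff` parameter are hypothetical and do not exist in the package.

```python
def photo_score(components_vote, edges_vote, entropy_vote, fft_vote):
    """Sum the weights of all checks that voted 'photo'.

    Weights follow the diff: components and edges count 2, entropy and FFT count 1.
    """
    weighted_votes = (
        (components_vote, 2),
        (edges_vote, 2),
        (entropy_vote, 1),
        (fft_vote, 1),
    )
    return sum(weight for vote, weight in weighted_votes if vote)


def is_photo(score, cutoff=3):
    """A page image is treated as a photo (no OCR) once the score reaches the cutoff."""
    return score >= cutoff


# Enumerate a few vote combinations to see where the cutoff bites.
cases = {
    "components + edges": (True, True, False, False),
    "components + entropy": (True, False, True, False),
    "entropy + fft": (False, False, True, True),
    "edges only": (False, True, False, False),
}
for label, votes in cases.items():
    score = photo_score(*votes)
    print(f"{label}: score={score}, photo={is_photo(score)}")
```

One consequence of these weights: the entropy and FFT checks cannot classify an image as a photo on their own, whereas either of the two heavier checks combined with any other positive vote can.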