chore: change table extraction defaults (Unstructured-IO#2588)
Change the default values for table extraction. This works in tandem with
[this](Unstructured-IO/unstructured-api#370)
`unstructured-api` PR.

We want to move away from the `pdf_infer_table_structure` parameter. In this
PR:
- We change how it interacts with the `skip_infer_table_types` parameter.
Whether tables are extracted from PDFs now follows the rule:
`pdf_infer_table_structure && "pdf" not in skip_infer_table_types`
- We set the defaults to `pdf_infer_table_structure=True` and
`skip_infer_table_types=[]`
- We remove it from the examples in the documentation
- We describe it as deprecated in favor of `skip_infer_table_types` in
the documentation

More detailed description of how we want the parameters to interact:
- if `pdf_infer_table_structure` is False, tables will never be extracted
from PDFs
- if `pdf_infer_table_structure` is True, tables will be extracted from
PDFs unless PDF is skipped via `skip_infer_table_types`
- by default, `pdf_infer_table_structure=True` and
`skip_infer_table_types=[]`
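
The interaction described above can be sketched as a small predicate. This is only an illustration of the stated rule, not the code in `unstructured`; the function name `should_extract_pdf_tables` is hypothetical, and the defaults mirror the ones named in this message.

```python
from typing import Optional, Sequence


def should_extract_pdf_tables(
    pdf_infer_table_structure: bool = True,
    skip_infer_table_types: Optional[Sequence[str]] = None,
) -> bool:
    """Hypothetical helper mirroring the rule stated in the commit message.

    Tables are extracted from a PDF only when the deprecated flag is True
    and "pdf" is not listed among the file types to skip.
    """
    # Treat None as the documented default: an empty skip list.
    skip = [] if skip_infer_table_types is None else skip_infer_table_types
    return pdf_infer_table_structure and "pdf" not in skip


if __name__ == "__main__":
    # Defaults: extraction is enabled.
    print(should_extract_pdf_tables())
    # The deprecated flag set to False always disables extraction.
    print(should_extract_pdf_tables(pdf_infer_table_structure=False))
    # Preferred opt-out: list "pdf" in skip_infer_table_types.
    print(should_extract_pdf_tables(skip_infer_table_types=["pdf"]))
```

Under this reading, `skip_infer_table_types` is the only knob a caller needs; `pdf_infer_table_structure=False` remains a hard override until the parameter is removed.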

---------

Co-authored-by: Filip Knefel <filip@unstructured.io>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: ds-filipknefel <ds-filipknefel@users.noreply.github.com>
Co-authored-by: Ronny H <138828701+ron-unstructured@users.noreply.github.com>
5 people committed Mar 22, 2024
1 parent 4ff6a5b commit bdfd975
Showing 16 changed files with 55 additions and 41 deletions.
6 changes: 4 additions & 2 deletions CHANGELOG.md
@@ -1,6 +1,6 @@
## 0.12.7-dev9
## 0.13.0-dev10

### Enhancements
### Enhancements

* **Add `.metadata.is_continuation` to text-split chunks.** `.metadata.is_continuation=True` is added to second-and-later chunks formed by text-splitting an oversized `Table` element but not to their counterpart `Text` element splits. Add this indicator for `CompositeElement` to allow text-split continuation chunks to be identified for downstream processes that may wish to skip intentionally redundant metadata values in continuation chunks.
* **Add `compound_structure_acc` metric to table eval.** Add a new property to `unstructured.metrics.table_eval.TableEvaluation`: `composite_structure_acc`, which is computed from the element level row and column index and content accuracy scores
@@ -13,6 +13,7 @@
### Fixes

* **Clarify IAM Role Requirement for GCS Platform Connectors**. The GCS Source Connector requires Storage Object Viewer and GCS Destination Connector requires Storage Object Creator IAM roles.
* **Change table extraction defaults** Change table extraction defaults in favor of using `skip_infer_table_types` parameter and reflect these changes in documentation.
* **Fix OneDrive dates with inconsistent formatting** Adds logic to conditionally support dates returned by office365 that may vary in date formatting or may be a datetime rather than a string. See previous fix for SharePoint
* **Adds tracking for AstraDB** Adds tracking info so AstraDB can see what source called their api.
* **Support AWS Bedrock Embeddings in ingest CLI** The configs required to instantiate the bedrock embedding class are now exposed in the api and the version of boto being used meets the minimum requirement to introduce the bedrock runtime required to hit the service.
@@ -66,6 +67,7 @@
* **Rename `OpenAiEmbeddingConfig` to `OpenAIEmbeddingConfig`.**
* **Fix partition_json() doesn't chunk.** The `@add_chunking_strategy` decorator was missing from `partition_json()` such that pre-partitioned documents serialized to JSON did not chunk when a chunking-strategy was specified.


## 0.12.4

### Enhancements
2 changes: 1 addition & 1 deletion docs/source/apis/api_parameters.rst
@@ -60,7 +60,7 @@ languages
pdf_infer_table_structure
-------------------------
- **Type**: boolean
- **Description**: If True and strategy=hi_res, any Table Elements extracted from a PDF will include an additional metadata field, 'text_as_html'.
- **Description**: Deprecated! Use skip_infer_table_types to opt out of table extraction for any file type. If False and strategy=hi_res, no Table Elements will be extracted from pdf files regardless of skip_infer_table_types contents.

skip_infer_table_types
----------------------
2 changes: 1 addition & 1 deletion docs/source/apis/usage_methods.rst
@@ -24,7 +24,7 @@ Method 1: Partition via API (``partition_via_api``)

filename = "example-docs/DA-1p.pdf"
elements = partition_via_api(
filename=filename, api_key="MY_API_KEY", strategy="auto", pdf_infer_table_structure="true"
filename=filename, api_key="MY_API_KEY", strategy="auto"
)

- **Self-Hosting or Local API**::
8 changes: 1 addition & 7 deletions docs/source/best_practices/table_extraction_pdf.rst
@@ -33,7 +33,7 @@ To extract the tables from PDF files using the `partition_pdf <https://unstructu
Method 2: Using Auto Partition or Unstructured API
--------------------------------------------------

By default, table extraction from ``pdf``, ``jpg``, ``png``, ``xls``, and ``xlsx`` file types is disabled. To enable table extraction from PDFs and other file types using `Auto Partition <https://unstructured-io.github.io/unstructured/core/partition.html#partition>`__ or `Unstructured API parameters <https://unstructured-io.github.io/unstructured/apis/api_parameters.html>`__ , you can set the ``skip_infer_table_types`` parameter to ``'[]'`` and ``strategy`` parameter to ``hi_res``.
By default, table extraction from all file types is enabled. To extract tables from PDFs and images using `Auto Partition <https://unstructured-io.github.io/unstructured/core/partition.html#partition>`__ or `Unstructured API parameters <https://unstructured-io.github.io/unstructured/apis/api_parameters.html>`__ simply set ``strategy`` parameter to ``hi_res``.


**Usage: Auto Partition**
@@ -46,7 +46,6 @@ By default, table extraction from ``pdf``, ``jpg``, ``png``, ``xls``, and ``xlsx
elements = partition(filename=filename,
strategy='hi_res',
skip_infer_table_types='[]', # don't forget to include apostrophe around the square bracket
)
tables = [el for el in elements if el.category == "Table"]
@@ -65,9 +64,4 @@ By default, table extraction from ``pdf``, ``jpg``, ``png``, ``xls``, and ``xlsx
-H 'Content-Type: multipart/form-data' \
-F 'files=@sample-docs/layout-parser-paper-with-table.jpg' \
-F 'strategy=hi_res' \
-F 'skip_infer_table_types=[]' \
| jq -C . | less -R
.. warning::

You may get a warning when the ``pdf_infer_table_structure`` parameter is set to **True** AND **pdf** is included in the list of ``skip_infer_table_types`` parameter. However, this function will still extract the tables from PDF despite the conflict.
2 changes: 1 addition & 1 deletion docs/source/core/partition.rst
@@ -872,7 +872,7 @@ settings supported by the API.
filename = "example-docs/DA-1p.pdf"
elements = partition_via_api(
filename=filename, api_key=api_key, strategy="auto", pdf_infer_table_structure="true"
filename=filename, api_key=api_key, strategy="auto"
)
If you are using the `Unstructured SaaS API <https://unstructured-io.github.io/unstructured/apis/saas_api.html>`__, you can use the ``api_url`` kwarg to point the ``partition_via_api`` function at your Unstructured SaaS API URL.
1 change: 0 additions & 1 deletion docs/source/examples/databricks.rst
@@ -47,7 +47,6 @@ Extracting PDF Using Unstructured Python SDK
),
# Other partition params
strategy="hi_res",
pdf_infer_table_structure=True,
chunking_strategy="by_title",
)
1 change: 0 additions & 1 deletion docs/source/examples/dict_to_elements.rst
@@ -74,7 +74,6 @@ Configure and run the S3Runner for processing the data.
api_key=UNSTRUCTURED_API_KEY,
strategy="hi_res",
hi_res_model_name="yolox",
pdf_infer_table_structure=True,
),
fsspec_config=FsspecConfig(
remote_url=S3_URL,
2 changes: 1 addition & 1 deletion docs/source/ingest/configs/partition_config.rst
@@ -9,7 +9,7 @@ responsible for coordinating data after processing, including the dynamic metada
Configs for Partitioning
-------------------------

* ``pdf_infer_table_structure``: If True and strategy=hi_res, any Table Elements extracted from a PDF will include an additional metadata field, "text_as_html," where the value (string) is a just a transformation of the data into an HTML <table>. The "text" field for a partitioned Table Element is always present, whether True or False.
* ``pdf_infer_table_structure``: Deprecated! Use skip_infer_table_types to opt out of table extraction for any file type. If False and strategy=hi_res, no Table Elements will be extracted from pdf files regardless of skip_infer_table_types contents.
* ``skip_infer_table_types``: List of document types that you want to skip table extraction with.
* ``strategy (default auto)``: The strategy to use for partitioning PDF/image. Uses a layout detection model if set to 'hi_res', otherwise partition simply extracts the text from the document and processes it.
* ``ocr_languages``: The languages present in the document, for use in partitioning and/or OCR. For partitioning image or pdf documents with Tesseract, you'll first need to install the appropriate Tesseract language pack if running via local unstructured library. For other partitions, language is detected using naive Bayesian filter via `langdetect`. Multiple languages indicates text could be in either language.
2 changes: 1 addition & 1 deletion test_unstructured/partition/test_auto.py
@@ -356,7 +356,7 @@ def test_auto_partition_pdf_with_fast_strategy(monkeypatch):
languages=None,
metadata_filename=None,
include_page_breaks=False,
infer_table_structure=False,
infer_table_structure=True,
extract_images_in_pdf=False,
extract_image_block_types=None,
extract_image_block_output_dir=None,
@@ -18,7 +18,8 @@
"eng"
],
"page_name": "Stanley Cups",
"page_number": 1
"page_number": 1,
"text_as_html": "<table border=\"1\" class=\"dataframe\">\n <tbody>\n <tr>\n <td>Team</td>\n <td>Location</td>\n <td>Stanley Cups</td>\n </tr>\n <tr>\n <td>Blues</td>\n <td>STL</td>\n <td>1</td>\n </tr>\n <tr>\n <td>Flyers</td>\n <td>PHI</td>\n <td>2</td>\n </tr>\n <tr>\n <td>Maple Leafs</td>\n <td>TOR</td>\n <td>13</td>\n </tr>\n </tbody>\n</table>"
},
"text": "Stanley Cups",
"type": "Title"
@@ -42,7 +43,8 @@
"eng"
],
"page_name": "Stanley Cups",
"page_number": 1
"page_number": 1,
"text_as_html": "<table border=\"1\" class=\"dataframe\">\n <tbody>\n <tr>\n <td>Team</td>\n <td>Location</td>\n <td>Stanley Cups</td>\n </tr>\n <tr>\n <td>Blues</td>\n <td>STL</td>\n <td>1</td>\n </tr>\n <tr>\n <td>Flyers</td>\n <td>PHI</td>\n <td>2</td>\n </tr>\n <tr>\n <td>Maple Leafs</td>\n <td>TOR</td>\n <td>13</td>\n </tr>\n </tbody>\n</table>"
},
"text": "\n\n\nTeam\nLocation\nStanley Cups\n\n\nBlues\nSTL\n1\n\n\nFlyers\nPHI\n2\n\n\nMaple Leafs\nTOR\n13\n\n\n",
"type": "Table"
@@ -66,7 +68,8 @@
"eng"
],
"page_name": "Stanley Cups Since 67",
"page_number": 2
"page_number": 2,
"text_as_html": "<table border=\"1\" class=\"dataframe\">\n <tbody>\n <tr>\n <td>Team</td>\n <td>Location</td>\n <td>Stanley Cups</td>\n </tr>\n <tr>\n <td>Blues</td>\n <td>STL</td>\n <td>1</td>\n </tr>\n <tr>\n <td>Flyers</td>\n <td>PHI</td>\n <td>2</td>\n </tr>\n <tr>\n <td>Maple Leafs</td>\n <td>TOR</td>\n <td>0</td>\n </tr>\n </tbody>\n</table>"
},
"text": "Stanley Cups Since 67",
"type": "Title"
@@ -90,7 +93,8 @@
"eng"
],
"page_name": "Stanley Cups Since 67",
"page_number": 2
"page_number": 2,
"text_as_html": "<table border=\"1\" class=\"dataframe\">\n <tbody>\n <tr>\n <td>Team</td>\n <td>Location</td>\n <td>Stanley Cups</td>\n </tr>\n <tr>\n <td>Blues</td>\n <td>STL</td>\n <td>1</td>\n </tr>\n <tr>\n <td>Flyers</td>\n <td>PHI</td>\n <td>2</td>\n </tr>\n <tr>\n <td>Maple Leafs</td>\n <td>TOR</td>\n <td>0</td>\n </tr>\n </tbody>\n</table>"
},
"text": "\n\n\nTeam\nLocation\nStanley Cups\n\n\nBlues\nSTL\n1\n\n\nFlyers\nPHI\n2\n\n\nMaple Leafs\nTOR\n0\n\n\n",
"type": "Table"
@@ -18,7 +18,8 @@
"eng"
],
"page_name": "Stanley Cups",
"page_number": 1
"page_number": 1,
"text_as_html": "<table border=\"1\" class=\"dataframe\">\n <tbody>\n <tr>\n <td>Team</td>\n <td>Location</td>\n <td>Stanley Cups</td>\n </tr>\n <tr>\n <td>Blues</td>\n <td>STL</td>\n <td>1</td>\n </tr>\n <tr>\n <td>Flyers</td>\n <td>PHI</td>\n <td>2</td>\n </tr>\n <tr>\n <td>Maple Leafs</td>\n <td>TOR</td>\n <td>13</td>\n </tr>\n </tbody>\n</table>"
},
"text": "Stanley Cups",
"type": "Title"
@@ -42,7 +43,8 @@
"eng"
],
"page_name": "Stanley Cups",
"page_number": 1
"page_number": 1,
"text_as_html": "<table border=\"1\" class=\"dataframe\">\n <tbody>\n <tr>\n <td>Team</td>\n <td>Location</td>\n <td>Stanley Cups</td>\n </tr>\n <tr>\n <td>Blues</td>\n <td>STL</td>\n <td>1</td>\n </tr>\n <tr>\n <td>Flyers</td>\n <td>PHI</td>\n <td>2</td>\n </tr>\n <tr>\n <td>Maple Leafs</td>\n <td>TOR</td>\n <td>13</td>\n </tr>\n </tbody>\n</table>"
},
"text": "\n\n\nTeam\nLocation\nStanley Cups\n\n\nBlues\nSTL\n1\n\n\nFlyers\nPHI\n2\n\n\nMaple Leafs\nTOR\n13\n\n\n",
"type": "Table"
@@ -66,7 +68,8 @@
"eng"
],
"page_name": "Stanley Cups Since 67",
"page_number": 2
"page_number": 2,
"text_as_html": "<table border=\"1\" class=\"dataframe\">\n <tbody>\n <tr>\n <td>Team</td>\n <td>Location</td>\n <td>Stanley Cups</td>\n </tr>\n <tr>\n <td>Blues</td>\n <td>STL</td>\n <td>1</td>\n </tr>\n <tr>\n <td>Flyers</td>\n <td>PHI</td>\n <td>2</td>\n </tr>\n <tr>\n <td>Maple Leafs</td>\n <td>TOR</td>\n <td>0</td>\n </tr>\n </tbody>\n</table>"
},
"text": "Stanley Cups Since 67",
"type": "Title"
@@ -90,7 +93,8 @@
"eng"
],
"page_name": "Stanley Cups Since 67",
"page_number": 2
"page_number": 2,
"text_as_html": "<table border=\"1\" class=\"dataframe\">\n <tbody>\n <tr>\n <td>Team</td>\n <td>Location</td>\n <td>Stanley Cups</td>\n </tr>\n <tr>\n <td>Blues</td>\n <td>STL</td>\n <td>1</td>\n </tr>\n <tr>\n <td>Flyers</td>\n <td>PHI</td>\n <td>2</td>\n </tr>\n <tr>\n <td>Maple Leafs</td>\n <td>TOR</td>\n <td>0</td>\n </tr>\n </tbody>\n</table>"
},
"text": "\n\n\nTeam\nLocation\nStanley Cups\n\n\nBlues\nSTL\n1\n\n\nFlyers\nPHI\n2\n\n\nMaple Leafs\nTOR\n0\n\n\n",
"type": "Table"
@@ -17,7 +17,8 @@
"eng"
],
"page_name": "Stanley Cups",
"page_number": 1
"page_number": 1,
"text_as_html": "<table border=\"1\" class=\"dataframe\">\n <tbody>\n <tr>\n <td>Team</td>\n <td>Location</td>\n <td>Stanley Cups</td>\n </tr>\n <tr>\n <td>Blues</td>\n <td>STL</td>\n <td>1</td>\n </tr>\n <tr>\n <td>Flyers</td>\n <td>PHI</td>\n <td>2</td>\n </tr>\n <tr>\n <td>Maple Leafs</td>\n <td>TOR</td>\n <td>13</td>\n </tr>\n </tbody>\n</table>"
},
"text": "Stanley Cups",
"type": "Title"
@@ -40,7 +41,8 @@
"eng"
],
"page_name": "Stanley Cups",
"page_number": 1
"page_number": 1,
"text_as_html": "<table border=\"1\" class=\"dataframe\">\n <tbody>\n <tr>\n <td>Team</td>\n <td>Location</td>\n <td>Stanley Cups</td>\n </tr>\n <tr>\n <td>Blues</td>\n <td>STL</td>\n <td>1</td>\n </tr>\n <tr>\n <td>Flyers</td>\n <td>PHI</td>\n <td>2</td>\n </tr>\n <tr>\n <td>Maple Leafs</td>\n <td>TOR</td>\n <td>13</td>\n </tr>\n </tbody>\n</table>"
},
"text": "\n\n\nTeam\nLocation\nStanley Cups\n\n\nBlues\nSTL\n1\n\n\nFlyers\nPHI\n2\n\n\nMaple Leafs\nTOR\n13\n\n\n",
"type": "Table"
@@ -63,7 +65,8 @@
"eng"
],
"page_name": "Stanley Cups Since 67",
"page_number": 2
"page_number": 2,
"text_as_html": "<table border=\"1\" class=\"dataframe\">\n <tbody>\n <tr>\n <td>Team</td>\n <td>Location</td>\n <td>Stanley Cups</td>\n </tr>\n <tr>\n <td>Blues</td>\n <td>STL</td>\n <td>1</td>\n </tr>\n <tr>\n <td>Flyers</td>\n <td>PHI</td>\n <td>2</td>\n </tr>\n <tr>\n <td>Maple Leafs</td>\n <td>TOR</td>\n <td>0</td>\n </tr>\n </tbody>\n</table>"
},
"text": "Stanley Cups Since 67",
"type": "Title"
@@ -86,7 +89,8 @@
"eng"
],
"page_name": "Stanley Cups Since 67",
"page_number": 2
"page_number": 2,
"text_as_html": "<table border=\"1\" class=\"dataframe\">\n <tbody>\n <tr>\n <td>Team</td>\n <td>Location</td>\n <td>Stanley Cups</td>\n </tr>\n <tr>\n <td>Blues</td>\n <td>STL</td>\n <td>1</td>\n </tr>\n <tr>\n <td>Flyers</td>\n <td>PHI</td>\n <td>2</td>\n </tr>\n <tr>\n <td>Maple Leafs</td>\n <td>TOR</td>\n <td>0</td>\n </tr>\n </tbody>\n</table>"
},
"text": "\n\n\nTeam\nLocation\nStanley Cups\n\n\nBlues\nSTL\n1\n\n\nFlyers\nPHI\n2\n\n\nMaple Leafs\nTOR\n0\n\n\n",
"type": "Table"