Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
* ESL-155 Add table bbox annotations to tabby reader (#354)
* Add table bbox annotations to tabby reader
* Fix tests
* Review fixes
---------
Co-authored-by: Nasty <bogatenkova.anastasiya@mail.ru>
* TLDR-476 change swagger (#357)
* Use fastapi swagger, add pydantic classes and documentation
* Fix documentation and examples
* TLDR-465 pdf miner new params (#356)
* set char_margin to 3
* add pdf miner test script
* fix test_pdf_miner script
* fix TestApiPdfWithText
* add chaching
* rename test to benchmark
* add benchmark script again
* change name
* change name
* Try to fix documentation pipeline
* fix benchmark
---------
Co-authored-by: Nikita Shevtsov <shevtsov@ispras.ru>
Co-authored-by: Nasty <bogatenkova.anastasiya@mail.ru>
* ESL-165 table bboxes bug (#358)
* ESL-165 Added test with hard tables
* ESL-165 fixed bug box extraction in payment_order
* ESL-165 after rebase
* ESL-165 update README.md
* ESL-165 after review
---------
Co-authored-by: Nasty <bogatenkova.anastasiya@mail.ru>
* TLDR-367 refactor metadata extractor (#359)
* change add_metadata to extract_metadata in metadata readers
* fix usage of extract_metadata
* fix docs
* change output type to dict
* fix code style
* fix pr
---------
Co-authored-by: Nikita Shevtsov <shevtsov@ispras.ru>
* ESL-167 extract only word boxes (#360)
* ESL-167 extract only word boxes
* ESL-167 extract only words bboxes for tabby reader
---------
Co-authored-by: Nasty <bogatenkova.anastasiya@mail.ru>
* TLDR-502 increase converter timeout (#361)
* new version 1.1.0 (#362)
---------
Co-authored-by: Andrey Mikhailov <mikhailov@icc.ru>
Co-authored-by: Nikita Shevtsov <61932814+Travvy88@users.noreply.github.com>
Co-authored-by: Nikita Shevtsov <shevtsov@ispras.ru>
Co-authored-by: Oksana Belyaeva <belyaeva@ispras.ru>
1 parent ff26829, commit b79dd4c
Showing 96 changed files with 719 additions and 765 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1 +1 @@ | ||
1.0 | ||
1.1.0 |
Empty file.
This file was deleted.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,13 @@ | ||
from .annotation import Annotation | ||
from .cell_with_meta import CellWithMeta | ||
from .document_content import DocumentContent | ||
from .document_metadata import DocumentMetadata | ||
from .line_metadata import LineMetadata | ||
from .line_with_meta import LineWithMeta | ||
from .parsed_document import ParsedDocument | ||
from .table import Table | ||
from .table_metadata import TableMetadata | ||
from .tree_node import TreeNode | ||
|
||
__all__ = ["Annotation", "CellWithMeta", "DocumentContent", "DocumentMetadata", "LineMetadata", "LineWithMeta", "ParsedDocument", "Table", "TableMetadata", | ||
"TreeNode"] |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,12 @@ | ||
from pydantic import BaseModel, Field | ||
|
||
|
||
class Annotation(BaseModel): | ||
""" | ||
The piece of information about the text line: its appearance or links to another document object. | |
For example, Annotation(start=1, end=13, name="italic", value="True") says that the text from index 1 up to (but not including) index 13 is written in italic. | |
""" | ||
start: int = Field(description="Start of the annotated text", example=0) | ||
end: int = Field(description="End of the annotated text (end isn't included)", example=5) | ||
name: str = Field(description="Annotation name", example="italic") | ||
value: str = Field(description="Annotation value. For example, it may be font size value for size type", example="True") |
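
The `end` field is documented as exclusive, so an annotation covers the half-open range `[start, end)`. The following is a minimal sketch of reading annotations under that convention. It works on plain dictionaries shaped like the schema above rather than on dedoc's own classes, and the helper name `annotated_fragments` and the sample line are hypothetical.

```python
from typing import Dict, List, Tuple


def annotated_fragments(text: str, annotations: List[Dict]) -> List[Tuple[str, str, str]]:
    """Return (fragment, name, value) for every annotation.

    Assumes each annotation is a dict with the `start`, `end`, `name` and
    `value` keys from the schema, and that `end` is exclusive.
    """
    fragments = []
    for annotation in annotations:
        # Python slicing is also end-exclusive, so the indices map directly.
        fragment = text[annotation["start"]:annotation["end"]]
        fragments.append((fragment, annotation["name"], annotation["value"]))
    return fragments


# Hypothetical sample data, not taken from the repository.
line_text = "Some text in italic"
line_annotations = [
    {"start": 0, "end": 4, "name": "bold", "value": "True"},
    {"start": 13, "end": 19, "name": "italic", "value": "True"},
]
fragments = annotated_fragments(line_text, line_annotations)
```

Because the range is half-open, `end - start` is the length of the annotated fragment and adjacent annotations can share a boundary index without overlapping.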
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,15 @@ | ||
from typing import List | ||
|
||
from pydantic import BaseModel, Field | ||
|
||
from dedoc.api.schema.line_with_meta import LineWithMeta | ||
|
||
|
||
class CellWithMeta(BaseModel): | ||
""" | ||
Holds the information about the cell: list of lines and cell properties (rowspan, colspan, invisible). | ||
""" | ||
lines: List[LineWithMeta] = Field(description="Textual lines of the cell with annotations") | ||
rowspan: int = Field(description="Number of rows to span like in HTML format", example=1) | ||
colspan: int = Field(description="Number of columns to span like in HTML format", example=2) | ||
invisible: bool = Field(description="Indicator for displaying or hiding cell text", example=False) |
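
Since `rowspan` and `colspan` are described as behaving "like in HTML format", one way to picture a cell is as a `<td>` element. The sketch below is only an illustration on a plain dictionary with the fields above. The function name `cell_to_html` is hypothetical, and hiding an invisible cell with an inline style is an assumption about how a consumer might treat the `invisible` flag, not something the schema prescribes.

```python
from html import escape
from typing import Dict


def cell_to_html(cell: Dict) -> str:
    """Render one cell dict as an HTML <td> element.

    Assumption: an invisible cell keeps its place in the grid but is hidden
    with an inline style; a consumer could equally skip it altogether.
    """
    # A cell may hold several lines; join their text with newlines.
    text = "\n".join(line["text"] for line in cell["lines"])
    style = ' style="display: none"' if cell["invisible"] else ""
    return f'<td rowspan="{cell["rowspan"]}" colspan="{cell["colspan"]}"{style}>{escape(text)}</td>'


# Hypothetical sample cell spanning two columns.
cell = {
    "lines": [{"text": "Total & tax", "annotations": []}],
    "rowspan": 1,
    "colspan": 2,
    "invisible": False,
}
html_cell = cell_to_html(cell)
```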
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,14 @@ | ||
from typing import List | ||
|
||
from pydantic import BaseModel, Field | ||
|
||
from dedoc.api.schema.table import Table | ||
from dedoc.api.schema.tree_node import TreeNode | ||
|
||
|
||
class DocumentContent(BaseModel): | ||
""" | ||
Content of the document - structured text and tables. | ||
""" | ||
structure: TreeNode = Field(description="Tree structure where content of the document is organized") | ||
tables: List[Table] = Field(description="List of document tables") |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,20 @@ | ||
from typing import Optional | ||
|
||
from pydantic import BaseModel, ConfigDict, Field | ||
|
||
|
||
class DocumentMetadata(BaseModel): | ||
""" | ||
Document metadata like its name, size, author, etc. | ||
""" | ||
model_config = ConfigDict(extra="allow") | ||
|
||
uid: str = Field(description="Document unique identifier (useful for attached files)", example="doc_uid_auto_ba73d76a-326a-11ec-8092-417272234cb0") | ||
file_name: str = Field(description="Original document name before rename and conversion", example="example.odt") | ||
temporary_file_name: str = Field(description="File name during parsing (unique name after rename and conversion)", example="123.odt") | ||
size: int = Field(description="File size in bytes", example=20060) | ||
modified_time: int = Field(description="Modification time of the document in the UnixTime format", example=1590579805) | ||
created_time: int = Field(description="Creation time of the document in the UnixTime format", example=1590579805) | ||
access_time: int = Field(description="File access time in the UnixTime format", example=1590579805) | ||
file_type: str = Field(description="Mime type of the file", example="application/vnd.oasis.opendocument.text") | ||
other_fields: Optional[dict] = Field(description="Other optional fields") |
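
The three time fields are plain integers in Unix time. As a small sketch of how a client might make them readable, the code below converts them to timezone-aware UTC datetimes using only the standard library. It operates on a plain dictionary with the field names and example values from the schema; the names `TIME_FIELDS` and `readable_times` are hypothetical.

```python
from datetime import datetime, timezone
from typing import Dict

# The metadata fields that the schema documents as Unix time.
TIME_FIELDS = ("modified_time", "created_time", "access_time")


def readable_times(metadata: Dict) -> Dict[str, datetime]:
    """Convert the Unix-time fields of a metadata dict to aware UTC datetimes."""
    return {field: datetime.fromtimestamp(metadata[field], tz=timezone.utc) for field in TIME_FIELDS}


# Values copied from the `example=` arguments in the schema.
metadata = {
    "uid": "doc_uid_auto_ba73d76a-326a-11ec-8092-417272234cb0",
    "file_name": "example.odt",
    "temporary_file_name": "123.odt",
    "size": 20060,
    "modified_time": 1590579805,
    "created_time": 1590579805,
    "access_time": 1590579805,
    "file_type": "application/vnd.oasis.opendocument.text",
}
times = readable_times(metadata)
```

Passing `tz=timezone.utc` avoids results that depend on the local time zone of the machine reading the response.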
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,15 @@ | ||
from typing import Optional | ||
|
||
from pydantic import BaseModel, ConfigDict, Field | ||
|
||
|
||
class LineMetadata(BaseModel): | ||
""" | ||
Holds information about document node/line metadata, such as page number or line type. | ||
""" | ||
model_config = ConfigDict(extra="allow") | ||
|
||
paragraph_type: str = Field(description="Type of the document line/paragraph (header, list_item, list, etc.)", example="raw_text") | |
page_id: int = Field(description="Page number of the line/paragraph beginning", example=0) | ||
line_id: Optional[int] = Field(description="Line number", example=1) | ||
other_fields: Optional[dict] = Field(description="Some other fields") |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,13 @@ | ||
from typing import List | ||
|
||
from pydantic import BaseModel, Field | ||
|
||
from dedoc.api.schema.annotation import Annotation | ||
|
||
|
||
class LineWithMeta(BaseModel): | ||
""" | ||
Textual line with text annotations. | ||
""" | ||
text: str = Field(description="Text of the line", example="Some text") | ||
annotations: List[Annotation] = Field(description="Text annotations (font, size, bold, italic, etc.)") |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,17 @@ | ||
from typing import List | ||
|
||
from pydantic import BaseModel, Field | ||
|
||
from dedoc.api.schema.document_content import DocumentContent | ||
from dedoc.api.schema.document_metadata import DocumentMetadata | ||
|
||
|
||
class ParsedDocument(BaseModel): | ||
""" | ||
Holds information about the document content, metadata and attachments. | ||
""" | ||
content: DocumentContent = Field(description="Document text and tables") | ||
metadata: DocumentMetadata = Field(description="Document metadata such as size, creation date and so on") | ||
version: str = Field(description="Version of the program that parsed this document", example="0.9.1") | ||
warnings: List[str] = Field(description="List of warnings and possible errors that arose during document parsing") | |
attachments: List["ParsedDocument"] = Field(description="Result of analysis of attached files - list of `ParsedDocument`") |
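
Because `attachments` is itself a list of `ParsedDocument`, the structure is recursive and warnings can sit at any nesting level. A minimal sketch of gathering them all, assuming the response has been loaded into plain dictionaries with the field names above: the function name `iter_warnings`, the file names and the warning strings are hypothetical, and the sample is trimmed to the keys the function reads.

```python
from typing import Dict, Iterator, Tuple


def iter_warnings(document: Dict) -> Iterator[Tuple[str, str]]:
    """Yield (file_name, warning) for a parsed document and all nested attachments."""
    file_name = document["metadata"]["file_name"]
    for warning in document["warnings"]:
        yield file_name, warning
    # Each attachment has the same shape as the document, so recurse into it.
    for attachment in document["attachments"]:
        yield from iter_warnings(attachment)


# Hypothetical sample: one document with a single attached file.
document = {
    "metadata": {"file_name": "report.docx"},
    "warnings": ["some warning"],
    "attachments": [
        {"metadata": {"file_name": "scan.png"}, "warnings": ["another warning"], "attachments": []},
    ],
}
all_warnings = list(iter_warnings(document))
```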
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,16 @@ | ||
from typing import List | ||
|
||
from pydantic import BaseModel, Field | ||
|
||
from dedoc.api.schema.cell_with_meta import CellWithMeta | ||
from dedoc.api.schema.table_metadata import TableMetadata | ||
|
||
|
||
class Table(BaseModel): | ||
""" | ||
Holds information about tables in the document. | ||
We assume that a table is rectangular (each row has the same number of columns). | |
The table representation is row-based, i.e. the outer list contains the rows. | |
""" | ||
cells: List[List[CellWithMeta]] = Field(description="List of lists of table cells (cell has text, colspan and rowspan attributes)") | ||
metadata: TableMetadata = Field(description="Table meta information") |
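
The docstring states two properties: the outer list of `cells` holds rows, and every row has the same number of cells. The sketch below checks the second property and flattens a table into rows of plain text. It works on plain dictionaries shaped like the schema; `table_to_text_rows` and `make_cell` are hypothetical names, and raising `ValueError` on a ragged table is an illustrative choice rather than dedoc's behaviour.

```python
from typing import Dict, List


def table_to_text_rows(table: Dict) -> List[List[str]]:
    """Flatten a table dict into rows of plain cell text.

    Raises ValueError if the rows differ in length, i.e. the table is not
    rectangular as the schema assumes.
    """
    rows = table["cells"]
    if len({len(row) for row in rows}) > 1:
        raise ValueError("table rows have different numbers of cells")
    return [["\n".join(line["text"] for line in cell["lines"]) for cell in row] for row in rows]


def make_cell(text: str) -> Dict:
    # Hypothetical helper that builds a minimal cell dict for the example.
    return {"lines": [{"text": text, "annotations": []}], "rowspan": 1, "colspan": 1, "invisible": False}


table = {
    "cells": [[make_cell("Item"), make_cell("Price")], [make_cell("Pen"), make_cell("2")]],
    "metadata": {"page_id": 0, "uid": "e8ba5523-8546-4804-898c-2f4835a1804f", "rotated_angle": 0.0},
}
text_rows = table_to_text_rows(table)
```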
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,12 @@ | ||
from typing import Optional | ||
|
||
from pydantic import BaseModel, Field | ||
|
||
|
||
class TableMetadata(BaseModel): | ||
""" | ||
Holds the information about table unique identifier, rotation angle (if table has been rotated - for images) and so on. | ||
""" | ||
page_id: Optional[int] = Field(description="Number of the page where the table starts", example=0) | ||
uid: str = Field(description="Unique identifier of the table", example="e8ba5523-8546-4804-898c-2f4835a1804f") | ||
rotated_angle: float = Field(description="Value of the rotation angle (in degrees) by which the table was rotated during recognition", example=1.0) |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,20 @@ | ||
from typing import List | ||
|
||
from pydantic import BaseModel, Field | ||
|
||
from dedoc.api.schema.annotation import Annotation | ||
from dedoc.api.schema.line_metadata import LineMetadata | ||
|
||
|
||
class TreeNode(BaseModel): | ||
""" | ||
Helps to represent a document as a recursive tree structure. | |
It has a list of child `TreeNode` nodes (an empty list for a leaf node). | |
""" | ||
node_id: str = Field(description="Document element identifier. It is unique within a document content tree. " | |
"The identifier consists of numbers separated by dots where each number " | |
"means node's number among nodes with the same level in the document hierarchy.", example="0.2.1") | |
text: str = Field(description="Text of the node", example="Some text") | ||
annotations: List[Annotation] = Field(description="Some metadata related to the part of the text (as font size)") | ||
metadata: LineMetadata = Field(description="Metadata for the entire node (as node type)") | ||
subparagraphs: List["TreeNode"] = Field(description="List of children of this node, each child is `TreeNode`") |
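
The recursive `subparagraphs` field lends itself to a depth-first walk. The following sketch yields every node in pre-order from a plain-dictionary tree with the field names above. Deriving the depth from the number of dots in `node_id` is an assumption based on the identifier description (for example, `0.2.1` would be two levels below the root `0`). The names `walk` and `make_node` and the sample tree are hypothetical.

```python
from typing import Dict, Iterable, Iterator, Tuple


def walk(node: Dict) -> Iterator[Tuple[int, str, str]]:
    """Depth-first, pre-order traversal yielding (depth, node_id, text)."""
    # Assumption: one dot per level below the root, as in "0.2.1".
    depth = node["node_id"].count(".")
    yield depth, node["node_id"], node["text"]
    for child in node["subparagraphs"]:
        yield from walk(child)


def make_node(node_id: str, text: str, children: Iterable[Dict] = ()) -> Dict:
    # Hypothetical helper that builds a minimal node dict for the example.
    metadata = {"paragraph_type": "raw_text", "page_id": 0, "line_id": None}
    return {"node_id": node_id, "text": text, "annotations": [], "metadata": metadata, "subparagraphs": list(children)}


tree = make_node("0", "", [
    make_node("0.0", "Title"),
    make_node("0.1", "Chapter", [make_node("0.1.0", "Some text")]),
])
outline = list(walk(tree))
```

A leaf node has an empty `subparagraphs` list, so the recursion stops there without a separate base case.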