BCUL ABBYY importer

This importer is written to accommodate the ABBYY OCR format. It was developed to handle OCR newspaper data provided by the Bibliothèque Cantonale Universitaire de Lausanne (BCUL - Lausanne Cantonal University Library), which are part of the Scriptorium interface <https://scriptorium.bcu-lausanne.ch/page/home> and collection.

BCUL Custom classes

This module contains the definition of the BCUL importer classes.

The classes define newspaper Issues and Pages objects which convert OCR data in the ABBYY format to a unified canonical format.

class text_importer.importers.bcul.classes.BculNewspaperIssue(issue_dir)

Bases: NewspaperIssue

Newspaper Issue in BCUL (ABBYY) format.

Parameters:

issue_dir (IssueDir) – Identifying information about the issue.

id

Canonical Issue ID (e.g. GDL-1900-01-02-a).

Type:

str

edition

Lowercase letter ordering issues of the same day.

Type:

str

journal

Newspaper unique identifier or name.

Type:

str

path

Path to the directory containing the issue's OCR data.

Type:

str

date

Publication date of the issue.

Type:

datetime.date

issue_data

Issue data according to the canonical format.

Type:

dict[str, Any]

pages

List of NewspaperPage instances from this issue.

Type:

list

rights

Access rights applicable to this issue.

Type:

str

mit_file

Path to the ABBYY 'mit' file that contains the OLR.

Type:

str

is_json

Whether the mit_file has the json file extension.

Type:

bool

is_xml

Whether the mit_file has the xml file extension.

Type:

bool

iiif_manifest

Presentation IIIF manifest for this issue.

Type:

str

content_items

List of content items in this issue.

Type:

list[dict]

query_iiif_api(num_tries: int = 0, max_retries: int = 3) → dict[str, Any]

Query the Scriptorium IIIF API for the issue's manifest data.

TODO: implement the retry approach with celery package or similar.

Parameters:

  • num_tries (int, optional) – Number of retry attempts. Defaults to 0.

  • max_retries (int, optional) – Maximum number of attempts. Defaults to 3.

Returns:

Issue's IIIF "canvases" for each page.

Return type:

dict[str, Any]

Raises:

Exception – If the maximum number of retry attempts is reached.
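
For illustration only, a minimal sketch of the retry pattern described above, assuming a requests-based fetch; the function name, the manifest_url argument and the URL handling are assumptions, not the module's actual implementation:

import requests

def fetch_manifest(manifest_url: str, num_tries: int = 0,
                   max_retries: int = 3) -> dict:
    # Hypothetical sketch only: fetch a IIIF Presentation manifest,
    # retrying on transient network errors up to max_retries times.
    if num_tries >= max_retries:
        raise Exception(f"Reached {max_retries} attempts for {manifest_url}.")
    try:
        resp = requests.get(manifest_url, timeout=30)
        resp.raise_for_status()
        return resp.json()  # manifest with one "canvas" per page
    except requests.RequestException:
        return fetch_manifest(manifest_url, num_tries + 1, max_retries)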

class text_importer.importers.bcul.classes.BculNewspaperPage(_id: str, number: int, page_path: str, iiif_uri: str)

Bases: NewspaperPage

Newspaper page in BCUL (ABBYY) format.

Parameters:

  • _id (str) – Canonical page ID.

  • number (int) – Page number.

  • page_path (str) – Path to the ABBYY XML page file.

  • iiif_uri (str) – URI to the IIIF image of this page.

id

Canonical Page ID (e.g. GDL-1900-01-02-a-p0004).

Type:

str

number

Page number.

Type:

int

page_data

Page data according to the canonical format.

Type:

dict[str, Any]

issue

Issue this page is from.

Type:

NewspaperIssue

path

Path to the ABBYY XML page file.

Type:

str

iiif_base_uri

URI to the IIIF image of this page.

Type:

str

add_issue(issue: NewspaperIssue) → None

Add to a page object its parent, i.e. the newspaper issue.

This allows each page to preserve contextual information coming from the newspaper issue.

Parameters:

issue (NewspaperIssue) – Newspaper issue containing this page.

property ci_id: str

Create and return the content item ID of the page.

Given that BCUL data do not entail article-level segmentation, each page is considered as a content item. Thus, to mint the content item ID we take the canonical page ID and simply replace the "p" prefix with "i".

Returns:

Content item ID.

Return type:

str
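
For example, the minting described above:

>>> page_id = 'GDL-1900-01-02-a-p0004'
>>> prefix, suffix = page_id.rsplit('-p', 1)
>>> f'{prefix}-i{suffix}'
'GDL-1900-01-02-a-i0004'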

get_ci_divs() → list[Tag]

Fetch and return the divs of tables and pictures from this page.

While BCUL does not entail article-level segmentation, tables and pictures are still segmented. They can thus have their own content item objects.

Returns:

List of segmented table and picture elements.

Return type:

list[Tag]

parse() → None

Process the page XML file and transform it into the canonical Page format.

Note

This lazy behavior means that the page contents are not processed upon creation of the page object, but only once the parse() method is called.
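
A sketch of the resulting usage pattern (variable names are illustrative):

>>> page = BculNewspaperPage(page_id, 4, page_path, iiif_uri)  # no OCR read here
>>> page.add_issue(issue)  # attach the parent issue first
>>> page.parse()           # only now is the XML file processed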

property xml: BeautifulSoup

BCUL Detect functions


This module contains helper functions to find BCUL OCR data to import.

text_importer.importers.bcul.detect.BculIssueDir

A light-weight data structure to represent a newspaper issue.

This named tuple contains basic metadata about a newspaper issue. They can then be used to locate the relevant data in the filesystem or to create canonical identifiers for the issue and its pages.

Note

In case of newspapers published multiple times per day, a lowercase letter is used to indicate the edition number: 'a' for the first, 'b' for the second, etc.

Parameters:

  • journal (str) – Newspaper ID.

  • date (datetime.date) – Publication date of the issue.

  • edition (str) – Edition of the newspaper issue ('a', 'b', 'c', etc.).

  • path (str) – Path to the directory containing the issue's OCR data.

  • rights (str) – Access rights on the data (open, closed, etc.).

  • mit_file_type (str) – Type of mit file for this issue (json or xml).

>>> from datetime import date
>>> i = BculIssueDir(
...     journal='FAL',
...     date=date(1762, 12, 7),
...     edition='a',
...     path='./BCUL/46165',
...     rights='open_public',
...     mit_file_type='json'
... )

text_importer.importers.bcul.detect.detect_issues(base_dir: str, access_rights: str) → list[IssueDirectory]

Detect BCUL newspaper issues to import within the filesystem.

This function expects the directory structure that BCUL used to organize the dump of ABBYY files.

Parameters:

  • base_dir (str) – Path to the base directory of newspaper data.

  • access_rights (str) – Path to the access_rights_and_aliases.json file.

Returns:

List of BculIssueDir instances, to be imported.

Return type:

list[BculIssueDir]
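
A possible invocation (paths are illustrative):

>>> from text_importer.importers.bcul.detect import detect_issues
>>> issues = detect_issues(
...     base_dir='./BCUL',
...     access_rights='./BCUL/access_rights_and_aliases.json',
... )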

text_importer.importers.bcul.detect.dir2issue(path: str, journal_info: dict[str, str]) → IssueDirectory | None

Create a BculIssueDir object from a directory.

Note

This function is called internally by detect_issues.

Parameters:

  • path (str) – The path of the issue.

  • journal_info (dict[str, str]) – Dictionary of journal information, including access rights.

Returns:

New BculIssueDir object.

Return type:

BculIssueDir | None

text_importer.importers.bcul.detect.select_issues(base_dir: str, config: dict, access_rights: str) → list[IssueDirectory] | None

Detect newspaper issues to import selectively.

The behavior is very similar to detect_issues() with the only difference that config specifies some rules to filter the data to import. See this section for further details on how to configure filtering.

Parameters:

  • base_dir (str) – Path to the base directory of newspaper data.

  • config (dict) – Config dictionary for filtering.

  • access_rights (str) – Not used for this importer, but the argument is kept for uniformity.

Returns:

List of BculIssueDir to import.

Return type:

list[BculIssueDir] | None
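
For orientation, a filtering config might look like the sketch below; the keys shown are assumptions carried over from the filtering conventions of the other importers, not confirmed for BCUL:

>>> config = {
...     'newspapers': {'FAL': []},   # assumed key: journals to include
...     'exclude_newspapers': [],    # assumed key: journals to skip
...     'year_only': False,          # assumed key: filter on year alone
... }
>>> issues = select_issues('./BCUL', config, access_rights='')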


BCUL Helper functions


Helper functions to parse BCUL OCR files.

+
+
+text_importer.importers.bcul.helpers.find_mit_file(_dir: str) str
+

Given a directory, search for a file with a name ending with mit.

+
+
Parameters:
+

_dir (str) – Directory to look into.

+
+
Returns:
+

Path to the mit file once found.

+
+
Return type:
+

str
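
A minimal sketch of what such a lookup could look like (an assumed implementation, not necessarily the helper's actual logic):

import os
from glob import glob

def find_mit_file_sketch(_dir: str) -> str:
    # Assumed logic: the mit file's name, extension aside, ends in 'mit'.
    matches = [f for f in glob(os.path.join(_dir, '*'))
               if os.path.splitext(f)[0].endswith('mit')]
    return matches[0]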

text_importer.importers.bcul.helpers.find_page_file_in_dir(base_path: str, file_id: str) → str | None

Find the page file in a directory given the name it should have.

Parameters:

  • base_path (str) – The base path of the directory.

  • file_id (str) – The name of the page file if present.

Returns:

The path to the page file if found, otherwise None.

Return type:

str | None

text_importer.importers.bcul.helpers.get_div_coords(div: Tag) → list[int]

Extract the coordinates from the given element and format them for IIIF.

In the ABBYY format, the coordinates are denoted by the bottom, top (y-axis), left and right (x-axis) values. But IIIF coordinates should be formatted as [x, y, width, height], where (x, y) denotes the box's top left corner: (l, t). Thus they need conversion.

Parameters:

div (Tag) – Element to extract the coordinates from.

Returns:

Coordinates converted to the IIIF format.

Return type:

list[int]
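
Concretely, the conversion amounts to the following (illustrative values, assuming the left, top, right and bottom values have already been read from the element):

>>> l, t, r, b = 100, 50, 400, 80      # left, top, right, bottom
>>> [l, t, r - l, b - t]               # iiif: [x, y, width, height]
[100, 50, 300, 30]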

text_importer.importers.bcul.helpers.get_page_number(exif_file: str) → int

Given an exif file, look for the page number inside.

This is for the JSON 'flavour' of BCUL, in which metadata about the pages are in JSON files which contain the substring exif.

Parameters:

exif_file (str) – Path to the exif file.

Raises:

ValueError – The page number could not be extracted from the file.

Returns:

Page number extracted from the file.

Return type:

int

text_importer.importers.bcul.helpers.parse_char_tokens(char_tokens: list[Tag]) → list[dict[str, list[int] | str]]

Parse a list of div Tags to extract the tokens and coordinates within a line.

Parameters:

char_tokens (list[Tag]) – div Tags corresponding to a line of tokens to parse.

Returns:

List of reconstructed parsed tokens.

Return type:

list[dict[str, list[int] | str]]

text_importer.importers.bcul.helpers.parse_date(mit_filename: str) → date

Given the mit filename, parse the date and ensure it is valid.

Parameters:

mit_filename (str) – Filename of the 'mit' file.

Returns:

Publication date of the issue.

Return type:

date

text_importer.importers.bcul.helpers.parse_textblock(block: Tag, page_ci_id: str) → dict[str, Any]

Parse the given textblock element into a canonical region element.

Parameters:

  • block (Tag) – Text block div element to parse.

  • page_ci_id (str) – Canonical ID of the CI corresponding to this page.

Returns:

Parsed region object in canonical format.

Return type:

dict[str, Any]
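
For orientation, a parsed region has roughly the following shape (a sketch based on the impresso canonical page schema; field values are illustrative):

>>> region = {
...     'c': [100, 50, 300, 120],         # region coordinates [x, y, w, h]
...     'pOf': 'GDL-1900-01-02-a-i0004',  # content item the region belongs to
...     'p': [{                           # paragraphs
...         'l': [{                       # lines
...             'c': [100, 50, 280, 14],  # line coordinates
...             't': [                    # tokens
...                 {'c': [100, 50, 40, 14], 'tx': 'Exemple'},
...             ],
...         }],
...     }],
... }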

text_importer.importers.bcul.helpers.parse_textline(line: Tag) → dict[str, list[Any]]

Parse the div element corresponding to a textline.

Parameters:

line (Tag) – Textline div element Tag.

Returns:

Parsed line of text.

Return type:

dict[str, list[Any]]

text_importer.importers.bcul.helpers.verify_issue_has_ocr_files(path: str) → None

Ensure the path to the issue considered contains XML files.

Parameters:

path (str) – Path to the issue considered.

Raises:

FileNotFoundError – No XML OCR files were found in the path.