BCUL ABBYY importer

This importer is written to accommodate the ABBYY OCR format. It was developed to handle OCR newspaper data provided by the Bibliothèque Cantonale Universitaire de Lausanne (BCUL - Lausanne Cantonal University Library), which are part of the Scriptorium interface <https://scriptorium.bcu-lausanne.ch/page/home> and collection.

BCUL Custom classes

This module contains the definition of the BCUL importer classes.

The classes define newspaper Issue and Page objects which convert OCR data in the ABBYY format to a unified canonical format.
class text_importer.importers.bcul.classes.BculNewspaperIssue(issue_dir)

    Bases: NewspaperIssue

    Newspaper Issue in BCUL (ABBYY) format.

    Parameters:
        issue_dir (IssueDir) – Identifying information about the issue.
    id
        Canonical Issue ID (e.g. GDL-1900-01-02-a).
        Type: str

    edition
        Lowercase letter ordering the issues of the same day.
        Type: str

    journal
        Newspaper unique identifier or name.
        Type: str

    path
        Path to the directory containing the issue's OCR data.
        Type: str

    date
        Publication date of the issue.
        Type: datetime.date

    issue_data
        Issue data according to the canonical format.
        Type: dict[str, Any]

    pages
        List of NewspaperPage instances from this issue.
        Type: list

    rights
        Access rights applicable to this issue.
        Type: str

    mit_file
        Path to the ABBYY 'mit' file that contains the OLR.
        Type: str

    is_json
        Whether the mit_file has the json file extension.
        Type: bool

    is_xml
        Whether the mit_file has the xml file extension.
        Type: bool

    iiif_manifest
        IIIF Presentation manifest for this issue.
        Type: str

    content_items
        List of content items in this issue.
        Type: list[dict]
    query_iiif_api(num_tries: int = 0, max_retries: int = 3) -> dict[str, Any]

        Query the Scriptorium IIIF API for the issue's manifest data.

        TODO: implement the retry approach with the celery package or similar.

        Parameters:
            num_tries (int, optional) – Number of retry attempts so far. Defaults to 0.
            max_retries (int, optional) – Maximum number of attempts. Defaults to 3.

        Returns:
            The issue's IIIF "canvases" for each page.

        Return type:
            dict[str, Any]

        Raises:
            Exception – If the maximum number of retry attempts is reached.
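The num_tries/max_retries pattern above can be sketched with a generic helper. This is an illustration only: query_with_retries and flaky_fetch are hypothetical names, and the real method issues an HTTP request to the Scriptorium IIIF API rather than calling an injected fetcher.

```python
import time
from typing import Any, Callable

def query_with_retries(
    fetch: Callable[[], dict[str, Any]],
    num_tries: int = 0,
    max_retries: int = 3,
    backoff: float = 0.0,
) -> dict[str, Any]:
    """Call `fetch`, retrying on failure until max_retries is reached."""
    try:
        return fetch()
    except Exception:
        if num_tries >= max_retries:
            # Give up once the maximum number of attempts is reached.
            raise Exception("Maximum number of retry attempts reached")
        time.sleep(backoff)  # optional pause between attempts
        return query_with_retries(fetch, num_tries + 1, max_retries, backoff)

# A fake fetcher that fails twice before succeeding.
attempts = {"n": 0}

def flaky_fetch() -> dict[str, Any]:
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ConnectionError("transient failure")
    return {"canvases": ["p1", "p2"]}

manifest = query_with_retries(flaky_fetch)  # succeeds on the third attempt
```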
class text_importer.importers.bcul.classes.BculNewspaperPage(_id: str, number: int, page_path: str, iiif_uri: str)

    Bases: NewspaperPage

    Newspaper page in BCUL (ABBYY) format.

    Parameters:
        _id (str) – Canonical page ID.
        number (int) – Page number.
        page_path (str) – Path to the ABBYY XML page file.
        iiif_uri (str) – IIIF image URI of this page.
    id
        Canonical Page ID (e.g. GDL-1900-01-02-a-p0004).
        Type: str

    number
        Page number.
        Type: int

    page_data
        Page data according to the canonical format.
        Type: dict[str, Any]

    issue
        Issue this page is from.
        Type: NewspaperIssue

    path
        Path to the ABBYY XML page file.
        Type: str

    iiif_base_uri
        IIIF image URI of this page.
        Type: str
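Judging from the example ID above, canonical page IDs append a 'p' prefix and a zero-padded page number to the issue ID. A minimal sketch, where mint_page_id is a hypothetical helper and the four-digit padding is inferred from the GDL-1900-01-02-a-p0004 example:

```python
def mint_page_id(issue_id: str, number: int) -> str:
    # Append the 'p' prefix and a four-digit, zero-padded page number
    # to the canonical issue ID (padding width inferred from the docs).
    return f"{issue_id}-p{number:04}"

page_id = mint_page_id("GDL-1900-01-02-a", 4)  # -> 'GDL-1900-01-02-a-p0004'
```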
    add_issue(issue: NewspaperIssue) -> None

        Add its parent newspaper issue to a page object.

        This allows each page to preserve contextual information coming from the newspaper issue.

        Parameters:
            issue (NewspaperIssue) – Newspaper issue containing this page.
    property ci_id: str

        Create and return the content item ID of the page.

        Given that BCUL data do not entail article-level segmentation, each page is considered as a content item. Thus, to mint the content item ID we take the canonical page ID and simply replace the "p" prefix with "i".

        Returns:
            Content item ID.

        Return type:
            str
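The prefix replacement described above amounts to the following sketch (page_id_to_ci_id is an illustrative helper, not the actual property):

```python
def page_id_to_ci_id(page_id: str) -> str:
    # Split on the last '-p' and re-join with '-i' to mint the
    # content item ID from the canonical page ID.
    prefix, suffix = page_id.rsplit("-p", 1)
    return f"{prefix}-i{suffix}"

ci_id = page_id_to_ci_id("GDL-1900-01-02-a-p0004")  # -> 'GDL-1900-01-02-a-i0004'
```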
-
+
- +get_ci_divs() list[Tag] +
Fetch and return the divs of tables and pictures from this page.
+While BCUL does not entail article-level segmentation, tables and +pictures are still segmented. They can thus have their own content item +objects.
+-
+
- Returns: +
List of segmented table and picture elements.
+
+- Return type: +
list[Tag]
+
+
-
+
    parse() -> None

        Process the page XML file and transform it into the canonical Page format.

        Note: This lazy behavior means that the page contents are not processed upon creation of the page object, but only once the parse() method is called.

    property xml: BeautifulSoup
BCUL Detect functions

This module contains helper functions to find BCUL OCR data to import.
text_importer.importers.bcul.detect.BculIssueDir

    A light-weight data structure to represent a newspaper issue.

    This named tuple contains basic metadata about a newspaper issue. They can then be used to locate the relevant data in the filesystem or to create canonical identifiers for the issue and its pages.

    Note: In the case of a newspaper published multiple times per day, a lowercase letter is used to indicate the edition number: 'a' for the first, 'b' for the second, etc.

    Parameters:
        journal (str) – Newspaper ID.
        date (datetime.date) – Publication date of the issue.
        edition (str) – Edition of the newspaper issue ('a', 'b', 'c', etc.).
        path (str) – Path to the directory containing the issue's OCR data.
        rights (str) – Access rights on the data (open, closed, etc.).
        mit_file_type (str) – Type of mit file for this issue (json or xml).

    >>> from datetime import date
    >>> i = BculIssueDir(
    ...     journal='FAL',
    ...     date=date(1762, 12, 7),
    ...     edition='a',
    ...     path='./BCUL/46165',
    ...     rights='open_public',
    ...     mit_file_type='json'
    ... )
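A minimal stand-in for the named tuple shows how its fields combine into a canonical issue ID. IssueDirSketch is a hypothetical class, and the journal-date-edition ID scheme is inferred from the GDL-1900-01-02-a example earlier in this document:

```python
from datetime import date
from typing import NamedTuple

class IssueDirSketch(NamedTuple):
    # Hypothetical stand-in mirroring the fields documented above.
    journal: str
    date: date
    edition: str
    path: str
    rights: str
    mit_file_type: str

i = IssueDirSketch("FAL", date(1762, 12, 7), "a", "./BCUL/46165", "open_public", "json")

# Canonical issue ID inferred from the examples in this document.
issue_id = f"{i.journal}-{i.date:%Y-%m-%d}-{i.edition}"  # -> 'FAL-1762-12-07-a'
```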
text_importer.importers.bcul.detect.detect_issues(base_dir: str, access_rights: str) -> list[IssueDirectory]

    Detect BCUL newspaper issues to import within the filesystem.

    This function expects the directory structure that BCUL used to organize the dump of ABBYY files.

    Parameters:
        base_dir (str) – Path to the base directory of newspaper data.
        access_rights (str) – Path to the access_rights_and_aliases.json file.

    Returns:
        List of BculIssueDir instances, to be imported.

    Return type:
        list[BculIssueDir]
text_importer.importers.bcul.detect.dir2issue(path: str, journal_info: dict[str, str]) -> IssueDirectory | None

    Create a BculIssueDir object from a directory.

    Note: This function is called internally by detect_issues.

    Parameters:
        path (str) – The path of the issue.
        journal_info (dict[str, str]) – Dictionary for access rights.

    Returns:
        New BculIssueDir object.

    Return type:
        BculIssueDir | None
text_importer.importers.bcul.detect.select_issues(base_dir: str, config: dict, access_rights: str) -> list[IssueDirectory] | None

    Selectively detect newspaper issues to import.

    The behavior is very similar to detect_issues(), with the only difference that config specifies some rules to filter the data to import. See this section for further details on how to configure filtering.

    Parameters:
        base_dir (str) – Path to the base directory of newspaper data.
        config (dict) – Config dictionary for filtering.
        access_rights (str) – Not used for this importer, but the argument is kept for uniformity.

    Returns:
        List of BculIssueDir to import.

    Return type:
        list[BculIssueDir] | None
BCUL Helper functions

Helper functions to parse BCUL OCR files.

text_importer.importers.bcul.helpers.find_mit_file(_dir: str) -> str

    Given a directory, search for a file with a name ending with mit.

    Parameters:
        _dir (str) – Directory to look into.

    Returns:
        Path to the mit file once found.

    Return type:
        str
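A possible implementation sketch, assuming "ending with mit" refers to the file stem (so both the json and xml flavours match) and that a missing file is an error; find_mit_file_sketch is illustrative, not the actual code:

```python
from pathlib import Path

def find_mit_file_sketch(_dir: str) -> str:
    # Return the first file whose stem ends with 'mit'; matches both
    # the json and xml flavours mentioned in this document.
    for path in sorted(Path(_dir).iterdir()):
        if path.is_file() and path.stem.endswith("mit"):
            return str(path)
    raise FileNotFoundError(f"No 'mit' file found in {_dir}")
```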
text_importer.importers.bcul.helpers.find_page_file_in_dir(base_path: str, file_id: str) -> str | None

    Find the page file in a directory given the name it should have.

    Parameters:
        base_path (str) – The base path of the directory.
        file_id (str) – The name of the page file if present.

    Returns:
        The path to the page file if found, otherwise None.

    Return type:
        str | None
text_importer.importers.bcul.helpers.get_div_coords(div: Tag) -> list[int]

    Extract the coordinates from the given element and format them for IIIF.

    In the ABBYY format, the coordinates are denoted by the bottom, top (y-axis), left and right (x-axis) values. But IIIF coordinates should be formatted as [x, y, width, height], where (x, y) denotes the box's top-left corner: (l, t). Thus they need conversion.

    Parameters:
        div (Tag) – Element to extract the coordinates from.

    Returns:
        Coordinates converted to the IIIF format.

    Return type:
        list[int]
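The l/t/r/b to [x, y, width, height] conversion described above can be written directly. A sketch assuming integer pixel bounds (abbyy_to_iiif_coords is a hypothetical helper that takes the bounds already extracted from the element):

```python
def abbyy_to_iiif_coords(left: int, top: int, right: int, bottom: int) -> list[int]:
    # IIIF regions are [x, y, width, height] with (x, y) the top-left
    # corner, so the ABBYY l/t/r/b bounds are converted accordingly.
    return [left, top, right - left, bottom - top]

coords = abbyy_to_iiif_coords(100, 50, 400, 250)  # -> [100, 50, 300, 200]
```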
text_importer.importers.bcul.helpers.get_page_number(exif_file: str) -> int

    Given an exif file, look for the page number inside.

    This is for the JSON 'flavour' of BCUL, in which metadata about the pages are in JSON files which contain the substring exif.

    Parameters:
        exif_file (str) – Path to the exif file.

    Raises:
        ValueError – The page number could not be extracted from the file.

    Returns:
        Page number extracted from the file.

    Return type:
        int
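A sketch under stated assumptions: the exact layout of the exif JSON is not documented here, so the 'SourceFile' key and the trailing-digit-run convention below are hypothetical; only the read-parse-or-ValueError shape matches the documented behavior.

```python
import json
import re
from pathlib import Path

def get_page_number_sketch(exif_file: str) -> int:
    # Hypothetical layout: assumes the exif JSON records the source image
    # name under a 'SourceFile' key and that the page number is the
    # trailing digit run of that name. Real key names may differ.
    data = json.loads(Path(exif_file).read_text())
    match = re.search(r"(\d+)\D*$", str(data.get("SourceFile", "")))
    if match is None:
        raise ValueError(f"Could not extract page number from {exif_file}")
    return int(match.group(1))
```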
text_importer.importers.bcul.helpers.parse_char_tokens(char_tokens: list[Tag]) -> list[dict[str, list[int] | str]]

    Parse a list of div Tags to extract the tokens and coordinates within a line.

    Parameters:
        char_tokens (list[Tag]) – div Tags corresponding to a line of tokens to parse.

    Returns:
        List of reconstructed parsed tokens.

    Return type:
        list[dict[str, list[int] | str]]
text_importer.importers.bcul.helpers.parse_date(mit_filename: str) -> date

    Given the mit filename, parse the date and ensure it is valid.

    Parameters:
        mit_filename (str) – Filename of the 'mit' file.

    Returns:
        Publication date of the issue.

    Return type:
        date
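A sketch of such parsing, assuming the 'mit' filename embeds an 8-digit YYYYMMDD date; the FAL_17621207_mit.json naming used below is hypothetical, not the documented scheme.

```python
import re
from datetime import date, datetime

def parse_date_sketch(mit_filename: str) -> date:
    # Assumption: the filename embeds an 8-digit YYYYMMDD date,
    # e.g. 'FAL_17621207_mit.json'.
    match = re.search(r"(\d{8})", mit_filename)
    if match is None:
        raise ValueError(f"No date found in {mit_filename}")
    # strptime validates month and day ranges as a side effect.
    return datetime.strptime(match.group(1), "%Y%m%d").date()
```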
text_importer.importers.bcul.helpers.parse_textblock(block: Tag, page_ci_id: str) -> dict[str, Any]

    Parse the given textblock element into a canonical region element.

    Parameters:
        block (Tag) – Text block div element to parse.
        page_ci_id (str) – Canonical ID of the CI corresponding to this page.

    Returns:
        Parsed region object in canonical format.

    Return type:
        dict[str, Any]
text_importer.importers.bcul.helpers.parse_textline(line: Tag) -> dict[str, list[Any]]

    Parse the div element corresponding to a textline.

    Parameters:
        line (Tag) – Textline div element Tag.

    Returns:
        Parsed line of text.

    Return type:
        dict[str, list[Any]]
text_importer.importers.bcul.helpers.verify_issue_has_ocr_files(path: str) -> None

    Ensure the path to the issue considered contains XML files.

    Parameters:
        path (str) – Path to the issue considered.

    Raises:
        FileNotFoundError – No XML OCR files were found in the path.
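A minimal sketch of this check (verify_issue_has_ocr_files_sketch is illustrative; it assumes the XML files sit directly in the issue directory):

```python
from pathlib import Path

def verify_issue_has_ocr_files_sketch(path: str) -> None:
    # Raise if the issue directory contains no XML OCR files.
    if not any(Path(path).glob("*.xml")):
        raise FileNotFoundError(f"No XML OCR files were found in {path}")
```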