Allow running with incomplete descriptions #58

Merged 35 commits on Jan 10, 2022
Changes from 21 commits
Commits
61c35bd
fix MODS name without roles, ht@kba #51
bertsky Dec 3, 2021
499c3cc
fallback to empty publicationStmt/date and encodingDesc if metsHdr is…
bertsky Dec 3, 2021
8984b1b
get_text_in_line: append HYP content if available
bertsky Dec 3, 2021
7b136c8
log to stderr instead of stdout (to prevent mixing with TEI)
bertsky Dec 3, 2021
6545b16
improve makefile
bertsky Dec 3, 2021
711025a
improve CI
bertsky Dec 3, 2021
605dd89
mets.fromfile: allow missing logical structmap
bertsky Dec 5, 2021
3bfa7c2
mets.fromfile: allow missing mods originInfo
bertsky Dec 5, 2021
559e4c1
mets.fromfile: allow missing mods physicalDescription
bertsky Dec 5, 2021
1a7fe59
mets.fromfile: allow missing mets amdSec provenance dv
bertsky Dec 5, 2021
af1740e
mets.fromfile: simplify physical struct map, allow missing @ORDER
bertsky Dec 5, 2021
18a2dde
mets.fromfile: allow missing struct link
bertsky Dec 5, 2021
dbcc1fe
teil.fill_from_mets: allow empty logical struct map and struct link
bertsky Dec 5, 2021
61c4624
METS to TEI structure: comment urging for more+better mappings
bertsky Dec 5, 2021
15022f5
rename changelog
bertsky Dec 6, 2021
553e0fd
improve+update changelog
bertsky Dec 6, 2021
27dffe8
differentiate image number and page number
bertsky Dec 6, 2021
c39b6c7
allow passing image fileGrp other than DEFAULT
bertsky Dec 6, 2021
71fd269
add params for image fileGrp and output file, more logging
bertsky Dec 6, 2021
5c20f90
update changelog
bertsky Dec 6, 2021
ad261ff
generalize passing URN and VD ID to all identifiers
bertsky Dec 12, 2021
93fb684
improve level, title and idno metadata…
bertsky Dec 13, 2021
9a5f486
fall back to biblFull title level u
bertsky Dec 13, 2021
55353e5
keep going if there is no author and div type
bertsky Dec 14, 2021
0bf8bd3
fix tei:collection
bertsky Dec 20, 2021
7962b8c
fix tei:repository (from list-valued mods:physicalLocation), add tei:…
bertsky Dec 20, 2021
073f2b1
fix 7962b8c5
bertsky Dec 20, 2021
8d2fc41
add tei:notesStmt/tei:note from mods:note
bertsky Dec 20, 2021
06f1ccf
fix tei:editionStmt (does not belong under titleStmt)
bertsky Dec 20, 2021
c49c2a4
add tei:keywords | tei:classCode under tei:textClass (for mods:subjec…
bertsky Dec 20, 2021
27127fe
chdir to METS dir if not URL
bertsky Dec 20, 2021
8ac0747
fix mods:location (only once, but multiple contents)
bertsky Dec 20, 2021
20546af
fix regression in 27127febd
bertsky Dec 20, 2021
f33a4ca
drop Python 3.5
bertsky Jan 6, 2022
8204bfc
Revert regression fix in README.md
wrznr Jan 6, 2022
45 changes: 40 additions & 5 deletions .circleci/config.yml
@@ -1,19 +1,54 @@
 # Python CircleCI 2.1 configuration file
 # for mets-mods2tei
 #
-# Check https://circleci.com/docs/2.1/language-python/ for more details
+# Check https://circleci.com/docs/2.0/language-python/ for more details
 #
 version: 2.1
 orbs:
   codecov: codecov/codecov@1.0.5
 jobs:
-  build:
+  test:
+    parameters:
+      version:
+        type: string
     docker:
-      - image: python:3.6
+      - image: circleci/python:<< parameters.version >>
     working_directory: ~/repo
     steps:
       - checkout
-      - run: pip install -r requirements-test.txt
-      - run: pip install .
+      - run: make deps deps-test
+      - run: make install
       - run: make test
       - run: make coverage
       - codecov/upload
+  pypi:
+    docker:
+      - image: circleci/python:3.6
+    working_directory: ~/repo
+    steps:
+      - checkout
+      - setup_remote_docker
+      - run: make install
+      - run: python setup.py sdist
+      - run: |
+          pip install cibuildwheel
+          cibuildwheel --output-dir dist
+      - store_artifacts:
+          path: dist/
+          destination: artifacts
+      # later: upload to PyPI...
+
+workflows:
+  version: 2
+  test-all:
+    jobs:
+      - test:
+          matrix:
+            parameters:
+              version: [3.5.10, 3.6.15, 3.7.12, 3.8.12, 3.9.9]
+  deploy:
+    jobs:
+      - pypi:
+          filters:
+            branches:
+              only: master
59 changes: 59 additions & 0 deletions CHANGELOG.md
@@ -0,0 +1,59 @@
# Changelog
All notable changes to this project will be documented in this file.

The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## [Unreleased]
### Added
- tests for TEI API
- tests for insertion index identification
- more logging
- CLI param for output file
- CLI param for image fileGrp

### Changed
- Add `front`, `body` and `back` per default
- Log to stderr instead of stdout
- Differentiate between (physical) image nr and (logical) page nr

### Fixed
- Evaluate texts from all struct types but `binding` and `colour_checker`, #43
- Handle errors during language code expansion, and fallback to `Unbekannt`, #47
- Add ALTO `HYP` text content if available, #52
- Allow empty logical structMap and structLink, fallback to physical, or empty, #57
- Allow partial dmdSec (MODS) or amdSec, fallback to empty, #46, #51

## [0.1.1] - 2020-05-11
### Added
- Make full text file group selectable by user
- Add poor man's namespace versioning handling

### Changed
- Make extraction of subtitles conditional on their presence
- Use "licence" for all types of licences (even unknown ones), #39

### Fixed
- Handle nested `@ADMID="AMD"` divs in logical `structMap` (i.e. newspaper case), #43
- Allow for local path entries (in addition to URLs) in METS, #41
- Add special treatment for URNs and VD IDs, #37

## [0.1.0] - 2019-12-04
### Added
- Correctly place structures which are not on top of a page
- Set `corresp` and `facs` attributes of `pb` elements
- Store links to `DEFAULT` images in METS
- Tests for new functionality
- Add Changelog file, #28

### Changed
- Retrieve ALTO files via a dedicated struct link member of the class `Mets`
- Move text retrieval to `Alto` class

### Removed
- Get rid of code artifacts carried over from `tocrify`

<!-- link-labels -->
[unreleased]: ../../compare/v0.1.1...master
[0.1.1]: ../../compare/v0.1.0...v0.1.1
[0.1.0]: ../../compare/v1.0...v0.1.0
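
The "Allow partial dmdSec (MODS) or amdSec, fallback to empty" entry can be illustrated with a minimal sketch. It is not the project's implementation: the helper names `first_text` and `read_origin` are hypothetical, and it uses the standard library's `xml.etree.ElementTree` to stay self-contained. It only shows the general pattern of treating a missing element as an empty value instead of raising.

```python
import xml.etree.ElementTree as ET

# Namespace URI of MODS version 3.
NS = {"mods": "http://www.loc.gov/mods/v3"}


def first_text(parent, path, default=""):
    """Return the stripped text of the first match for `path`, or `default`.

    `parent` may itself be None, so lookups on optional elements can be chained.
    (Hypothetical helper, not part of mets-mods2tei.)
    """
    if parent is None:
        return default
    node = parent.find(path, NS)
    if node is None or node.text is None:
        return default
    return node.text.strip()


def read_origin(mods):
    """Collect place and date from mods:originInfo, tolerating its absence."""
    origin = mods.find("mods:originInfo", NS)  # None if the element is missing
    return {
        "place": first_text(origin, "mods:place/mods:placeTerm"),
        "date": first_text(origin, "mods:dateIssued"),
    }


# A MODS record without any originInfo: both fields fall back to "".
partial = ET.fromstring(
    '<mods:mods xmlns:mods="http://www.loc.gov/mods/v3">'
    "<mods:titleInfo><mods:title>Example</mods:title></mods:titleInfo>"
    "</mods:mods>"
)

# A MODS record that does carry originInfo.
complete = ET.fromstring(
    '<mods:mods xmlns:mods="http://www.loc.gov/mods/v3">'
    "<mods:originInfo>"
    '<mods:place><mods:placeTerm type="text">Dresden</mods:placeTerm></mods:place>'
    "<mods:dateIssued>1850</mods:dateIssued>"
    "</mods:originInfo>"
    "</mods:mods>"
)

print(read_origin(partial))
print(read_origin(complete))
```

Checking `parent is None` explicitly matters here, because an `Element` without children evaluates as falsy and a plain truthiness test would misclassify it.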
52 changes: 0 additions & 52 deletions Changelog

This file was deleted.

18 changes: 16 additions & 2 deletions Makefile
@@ -1,26 +1,40 @@
 # Python interpreter. Default: '$(PYTHON)'
-PYTHON = python
+PYTHON ?= python
+PIP ?= pip

 # BEGIN-EVAL makefile-parser --make-help Makefile

 help:
 	@echo ""
 	@echo " Targets"
 	@echo ""
+	@echo " install Install this package"
+	@echo " deps Install dependencies only"
+	@echo " deps-test Install dependencies for testing only"
 	@echo " test Run all unit tests"
 	@echo " coverage Run coverage tests"
 	@echo ""
 	@echo " Variables"
 	@echo ""
 	@echo " PYTHON Python interpreter. Default: '$(PYTHON)'"
+	@echo " PIP Python packager. Default: '$(PIP)'"

 # END-EVAL

 #
 # Tests
 #

-.PHONY: test coverage
+.PHONY: install test coverage deps deps-test
+
+install:
+	$(PIP) install .
+
+deps:
+	$(PIP) install -r requirements.txt
+
+deps-test:
+	$(PIP) install -r requirements-test.txt

 # Run all unit tests
 test:
18 changes: 14 additions & 4 deletions README.md
@@ -106,17 +106,27 @@ Usage: mm2tei [OPTIONS] METS

   METS: File containing or URL pointing to the METS/MODS XML to be converted

+  Parse given METS and its meta-data, and convert it to TEI.
+
+  If `--ocr` is given, then also read the ALTO full-text files from the
+  fileGrp in `--text-group`, and convert page contents accordingly (in
+  physical order). Decorate page boundaries with image and page numbers, and
+  reference the corresponding base image files from `--img-group`.
+
+  Output XML to `--output (use '-' for stdout), log to stderr.`
+
 Options:
+  -O, --output FILENAME           File path to write TEI output to
   -o, --ocr                       Serialize OCR into resulting TEI
-  -T, --text-group TEXT           File group which contains the full text
+  -T, --text-group TEXT           File group which contains the full-text
+  -I, --img-group TEXT            File group which contains the images
   -l, --log-level [DEBUG|INFO|WARN|ERROR|OFF]
-  --help                          Show this message and exit.
+  -h, --help                      Show this message and exit.
 ```

-It reads METS XML via URL or file argument and prints the resulting TEI,
-including the extracted information from the MODS part of the METS.

 Example:

     mm2tei "https://digital.slub-dresden.de/oai/?verb=GetRecord&metadataPrefix=mets&identifier=oai:de:slub-dresden:db:id-453779263"

+    mm2tei -O tei.xml "https://digital.slub-dresden.de/oai/?verb=GetRecord&metadataPrefix=mets&identifier=oai:de:slub-dresden:db:id-453779263"
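
The help text says that XML goes to `--output` (with `-` meaning stdout) while log messages go to stderr, so the two streams never mix. The following is a rough sketch of that separation using only the standard library. The function names and the logger name are hypothetical, and this is not how `mm2tei` itself is written.

```python
import contextlib
import logging
import sys


def make_logger(level="INFO"):
    """Create a logger whose records go to stderr only (hypothetical helper)."""
    logger = logging.getLogger("mm2tei-sketch")  # assumed name, for illustration
    logger.handlers.clear()
    handler = logging.StreamHandler(sys.stderr)
    handler.setFormatter(logging.Formatter("%(levelname)s: %(message)s"))
    logger.addHandler(handler)
    logger.setLevel(level)
    logger.propagate = False  # keep records away from the root logger
    return logger


@contextlib.contextmanager
def open_output(path):
    """Yield stdout for '-', otherwise a freshly opened text file."""
    if path == "-":
        yield sys.stdout
    else:
        with open(path, "w", encoding="utf-8") as handle:
            yield handle


def write_tei(tei_xml, path, logger):
    """Write the serialized TEI to `path`, reporting progress on stderr."""
    logger.info("writing TEI to %s", path)
    with open_output(path) as out:
        out.write(tei_xml)
```

With this split, a call such as `write_tei(xml, "-", make_logger())` can be piped into another tool, since only the XML reaches stdout.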
6 changes: 5 additions & 1 deletion mets_mods2tei/api/alto.py
@@ -92,7 +92,11 @@ def get_text_in_line(self, line):
         Returns the ALTO-encoded text .
         :param Element line: The line to extract the text from.
         """
-        return " ".join(element.get("CONTENT") for element in line.xpath("./alto:String", namespaces=ns))
+        text = " ".join(element.get("CONTENT") for element in line.xpath("./alto:String", namespaces=ns))
+        hyp = line.find("alto:HYP", namespaces=ns)
+        if hyp is not None:
+            text += hyp.get("CONTENT")
+        return text

     def __compute_fuzzy_distance(self, text1, text2):
         """
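
The `get_text_in_line` change can be tried out in isolation with a small standalone sketch. It makes several assumptions: it uses the standard library's `xml.etree.ElementTree` where the project code calls `line.xpath`, it hard-codes the ALTO v2 namespace, and it is a free function, not a method. Unlike the diff, it also falls back to an empty string when a `CONTENT` attribute is absent, so a bare `HYP` element cannot raise a `TypeError`.

```python
import xml.etree.ElementTree as ET

# ALTO v2 namespace, assumed for this sketch; other ALTO versions use other URIs.
NS = {"alto": "http://www.loc.gov/standards/alto/ns-v2#"}


def get_text_in_line(line):
    """Join the String contents of a TextLine and append HYP content, if any."""
    words = [string.get("CONTENT", "") for string in line.findall("alto:String", NS)]
    text = " ".join(words)
    hyp = line.find("alto:HYP", NS)  # direct child only, as in the diff
    if hyp is not None:
        # Default to "" so a HYP without CONTENT does not break concatenation.
        text += hyp.get("CONTENT", "")
    return text


# A line ending in a hyphenated word fragment.
SAMPLE = ET.fromstring(
    '<TextLine xmlns="http://www.loc.gov/standards/alto/ns-v2#">'
    '<String CONTENT="ein"/><SP/><String CONTENT="Bei"/><HYP CONTENT="-"/>'
    "</TextLine>"
)

print(get_text_in_line(SAMPLE))
```

The hyphen is appended without a separating space, so the fragment at the end of the line stays attached to its hyphen.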