Allow running with incomplete descriptions #58

Merged
merged 35 commits into from
Jan 10, 2022
61c35bd
fix MODS name without roles, ht@kba #51
bertsky Dec 3, 2021
499c3cc
fallback to empty publicationStmt/date and encodingDesc if metsHdr is…
bertsky Dec 3, 2021
8984b1b
get_text_in_line: append HYP content if available
bertsky Dec 3, 2021
7b136c8
log to stderr instead of stdout (to prevent mixing with TEI)
bertsky Dec 3, 2021
6545b16
improve makefile
bertsky Dec 3, 2021
711025a
improve CI
bertsky Dec 3, 2021
605dd89
mets.fromfile: allow missing logical structmap
bertsky Dec 5, 2021
3bfa7c2
mets.fromfile: allow missing mods originInfo
bertsky Dec 5, 2021
559e4c1
mets.fromfile: allow missing mods physicalDescription
bertsky Dec 5, 2021
1a7fe59
mets.fromfile: allow missing mets amdSec provenance dv
bertsky Dec 5, 2021
af1740e
mets.fromfile: simplify physical struct map, allow missing @ORDER
bertsky Dec 5, 2021
18a2dde
mets.fromfile: allow missing struct link
bertsky Dec 5, 2021
dbcc1fe
tei.fill_from_mets: allow empty logical struct map and struct link
bertsky Dec 5, 2021
61c4624
METS to TEI structure: comment urging for more+better mappings
bertsky Dec 5, 2021
15022f5
rename changelog
bertsky Dec 6, 2021
553e0fd
improve+update changelog
bertsky Dec 6, 2021
27dffe8
differentiate image number and page number
bertsky Dec 6, 2021
c39b6c7
allow passing image fileGrp other than DEFAULT
bertsky Dec 6, 2021
71fd269
add params for image fileGrp and output file, more logging
bertsky Dec 6, 2021
5c20f90
update changelog
bertsky Dec 6, 2021
ad261ff
generalize passing URN and VD ID to all identifiers
bertsky Dec 12, 2021
93fb684
improve level, title and idno metadata…
bertsky Dec 13, 2021
9a5f486
fall back to biblFull title level u
bertsky Dec 13, 2021
55353e5
keep going if there is no author and div type
bertsky Dec 14, 2021
0bf8bd3
fix tei:collection
bertsky Dec 20, 2021
7962b8c
fix tei:repository (from list-valued mods:physicalLocation), add tei:…
bertsky Dec 20, 2021
073f2b1
fix 7962b8c5
bertsky Dec 20, 2021
8d2fc41
add tei:notesStmt/tei:note from mods:note
bertsky Dec 20, 2021
06f1ccf
fix tei:editionStmt (does not belong under titleStmt)
bertsky Dec 20, 2021
c49c2a4
add tei:keywords | tei:classCode under tei:textClass (for mods:subjec…
bertsky Dec 20, 2021
27127fe
chdir to METS dir if not URL
bertsky Dec 20, 2021
8ac0747
fix mods:location (only once, but multiple contents)
bertsky Dec 20, 2021
20546af
fix regression in 27127febd
bertsky Dec 20, 2021
f33a4ca
drop Python 3.5
bertsky Jan 6, 2022
8204bfc
Revert regression fix in README.md
wrznr Jan 6, 2022
2 changes: 1 addition & 1 deletion mets_mods2tei/api/mets.py
@@ -492,7 +492,7 @@ def get_div_structure(self):
         for struct_map in self.mets.get_structMap():
             if struct_map.get_TYPE() == "LOGICAL":
                 return struct_map.get_div()
-        return []
+        return None

     def get_struct_links(self, log_id):
         """
34 changes: 26 additions & 8 deletions mets_mods2tei/api/tei.py
@@ -132,7 +132,18 @@ def fill_from_mets(self, mets, ocr=True):
         # text part

         # div structure
-        self.add_div_structure(mets.get_div_structure())
+        div = mets.get_div_structure()
+        if div is not None:
+            self.logger.info("Found logical structMap for %s", div.get_TYPE())
+            self.add_div_structure(div)
+        elif any(mets.alto_map):
+            self.logger.warning("Found no logical structMap div, falling back to physical")
+            pages = mets.alto_map.keys()
+            if any(mets.order_map.values()):
+                pages = sorted(pages, key=mets.get_order)
+            self.add_div_structure(None, map(mets.page_map.get, pages))
+        else:
+            self.logger.error("Found no logical or physical structMap div")

         # OCR
         if ocr:
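The fallback order in the hunk above (logical structMap first, then physical pages, then an error) can be sketched in isolation. This is only an illustration: `choose_div_source` is a hypothetical helper, and the stand-in object assumes that `alto_map`, `order_map` and `page_map` are dicts keyed by physical page ID, which the diff suggests but does not confirm.

```python
import logging
from types import SimpleNamespace

logger = logging.getLogger("fallback_sketch")


def choose_div_source(mets):
    """Return ("logical", div), ("physical", pages) or ("none", None).

    `mets` is a hypothetical stand-in that exposes get_div_structure(),
    alto_map, order_map, page_map and get_order(), as used in the diff.
    """
    div = mets.get_div_structure()
    if div is not None:
        return "logical", div
    if any(mets.alto_map):
        logger.warning("no logical structMap div, falling back to physical")
        pages = list(mets.alto_map.keys())
        # Sort only if at least one page carries an ORDER value.
        if any(mets.order_map.values()):
            pages = sorted(pages, key=mets.get_order)
        return "physical", [mets.page_map.get(page) for page in pages]
    logger.error("no logical or physical structMap div")
    return "none", None


if __name__ == "__main__":
    # Assumed layout: every map is keyed by physical page ID.
    stub = SimpleNamespace(
        get_div_structure=lambda: None,
        alto_map={"PHYS_0002": "ocr_0002.xml", "PHYS_0001": "ocr_0001.xml"},
        order_map={"PHYS_0001": 1, "PHYS_0002": 2},
        page_map={"PHYS_0001": "page div 1", "PHYS_0002": "page div 2"},
    )
    stub.get_order = stub.order_map.get
    print(choose_div_source(stub))
```

Returning `None` rather than `[]` from `get_div_structure` is what makes the `is not None` test meaningful here, since the caller goes on to call `div.get_TYPE()` on the result.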
@@ -597,6 +608,9 @@ def __add_ocr_to_node(self, node, mets):
         for childnode in node.iterchildren():
             self.__add_ocr_to_node(childnode, mets)
         struct_links = mets.get_struct_links(node.get("id"))
+        if not struct_links and node.get("id") in mets.page_map:
+            # already physical
+            struct_links = [node.get("id")]

         # a header will always be on the first page of a div
         first = True
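A minimal sketch of the "already physical" fallback in the hunk above: when a div has no struct links but its ID is itself a known physical page, it is treated as linking to itself. `resolve_struct_links` and the plain dicts are hypothetical stand-ins for `mets.get_struct_links` and `mets.page_map`, not the project's API.

```python
def resolve_struct_links(node_id, struct_links_by_log_id, page_map):
    """Return the physical page IDs linked to a div node.

    struct_links_by_log_id: assumed dict, logical div ID -> physical page IDs.
    page_map: assumed dict keyed by physical page ID.
    """
    links = struct_links_by_log_id.get(node_id, [])
    if not links and node_id in page_map:
        # The div was built from a physical page, so it links to itself.
        links = [node_id]
    return links
```

A div that is neither linked nor a physical page still yields an empty list, so the OCR step would simply find nothing to add for it.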
@@ -678,26 +692,30 @@ def __add_ocr_to_node(self, node, mets):
                 node.insert(0, par)
                 first = False

-    def add_div_structure(self, div):
+    def add_div_structure(self, div, pages=None):
         """
-        Add div elements to the text body according to the given list of divs
+        Add logical div elements to the text front/body/back according to the given div hierarchy
         """

         # div structure has to be added to text
         text = self.tree.xpath('//tei:text', namespaces=ns)[0]
+        front = etree.SubElement(text, "%sfront" % TEI)
+        body = etree.SubElement(text, "%sbody" % TEI)
+        back = etree.SubElement(text, "%sback" % TEI)
+
+        if pages:
+            for page in pages:
+                self.__add_div(body, page, 1)
+            return

-        # decent to the deepest AMD
+        # descend to the deepest AMD
         while div.get_ADMID() is None:
             div = div.get_div()[0]
         start_div = div.get_div()[0]
         while start_div.get_div() and start_div.get_div()[0].get_ADMID() is not None:
             div = start_div
             start_div = start_div.get_div()[0]

-        front = etree.SubElement(text, "%sfront" % TEI)
-        body = etree.SubElement(text, "%sbody" % TEI)
-        back = etree.SubElement(text, "%sback" % TEI)
-
         entry_point = front

         for sub_div in div.get_div():
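To see which div the descent loop above selects, here is a small standalone sketch. The `Div` dataclass is a hypothetical stand-in that offers only `get_ADMID()` and `get_div()`, and `find_entry_div` copies the two loops from the hunk. Like the original, it assumes every div it visits before stopping has at least one child.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Div:
    """Hypothetical stand-in for a METS div with the two accessors used above."""
    label: str
    admid: Optional[str] = None
    children: List["Div"] = field(default_factory=list)

    def get_ADMID(self):
        return self.admid

    def get_div(self):
        return self.children


def find_entry_div(div):
    # Skip wrapper divs that carry no administrative metadata reference.
    while div.get_ADMID() is None:
        div = div.get_div()[0]
    start_div = div.get_div()[0]
    # Keep descending while the first child of start_div still has an ADMID.
    while start_div.get_div() and start_div.get_div()[0].get_ADMID() is not None:
        div = start_div
        start_div = start_div.get_div()[0]
    return div


if __name__ == "__main__":
    chapter = Div("chapter")
    issue = Div("issue", "AMD3", [chapter])
    volume = Div("volume", "AMD2", [issue])
    series = Div("series", "AMD1", [volume])
    root = Div("root", None, [series])
    print(find_entry_div(root).label)
```

In this sketch the loop stops at the parent of the deepest div that has an ADMID, and the children of that parent are what the subsequent `for sub_div in div.get_div():` loop iterates over.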