Get the language right

Giacomo Marchioro edited this page Nov 22, 2021 · 11 revisions

pyIIIFpres checks the language subtags of labels, summaries and other text content against the language subtag registry.

NOTE: only language subtags are checked, not variants or composite strings.

The registry contains more than 190 two-letter subtags and 8022 three-letter subtags. Since there are only 26² = 676 possible two-letter strings and 26³ = 17576 possible three-letter strings, a random two-letter string has roughly a 28% chance of being a valid subtag, and a random three-letter string roughly a 45% chance. A typo can therefore easily pass the check.
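
The percentages follow from simple counting. A minimal sketch of the arithmetic, assuming lowercase ASCII letters only and using the subtag counts quoted above (190 is a lower bound, so the two-letter figure is one as well); `hit_probability` is a name chosen for this illustration:

```python
# Subtag counts quoted in the text (190 is "more than", so a lower bound).
TWO_LETTER_SUBTAGS = 190
THREE_LETTER_SUBTAGS = 8022


def hit_probability(valid_subtags: int, length: int, alphabet_size: int = 26) -> float:
    """Chance that a random string of `length` letters is a registered subtag."""
    return valid_subtags / alphabet_size ** length


p_two = hit_probability(TWO_LETTER_SUBTAGS, 2)
p_three = hit_probability(THREE_LETTER_SUBTAGS, 3)
print(f"two letters: {p_two:.1%}, three letters: {p_three:.1%}")
```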

To avoid accepting such typos as valid subtags, you might want to limit the check to the subset of languages you know are in your document. This can be achieved by reassigning the LANGUAGES global variable:

from IIIFpres import iiifpapi3, BCP47lang
iiifpapi3.LANGUAGES = [BCP47lang.english, BCP47lang.spanish]
# all the rest of your script

Add subtags with variants and composite strings

(Or how to solve AssertionError: Language must be a valid BCP47 language tag or none)

By default, pyIIIFpres accepts only plain language subtags. If a single subtag is not enough to describe the language of the document, you can add a custom language string in this way:

from IIIFpres import iiifpapi3, BCP47lang
iiifpapi3.LANGUAGES.append("de-DE-u-co-phonebk")
# all the rest of your script

But keep in mind the W3C golden rule:

Always bear in mind that the golden rule is to keep your language tag as short as possible. Only add further subtags to your language tag if they are needed to distinguish the language from something else in the context where your content is used.

(Before adding a custom tag, you can validate it using, for instance, https://schneegans.de/lv/)
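
To judge whether the extra subtags are really needed, it can help to split a tag into its primary language subtag and the remainder. The helper below is a hypothetical illustration in plain Python, not part of pyIIIFpres, and it does not validate anything against the registry:

```python
def split_language_tag(tag: str) -> tuple[str, list[str]]:
    """Split a BCP 47 tag into (primary language subtag, remaining subtags).

    Hypothetical helper for illustration only: it performs no validation.
    """
    primary, *rest = tag.split("-")
    return primary.lower(), rest


primary, extras = split_language_tag("de-DE-u-co-phonebk")
print(primary, extras)
```

If the remainder does not distinguish the language from anything else in your context, the primary subtag alone is the better choice according to the rule quoted above.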

Language map object

Remember that add_metadata and set_requiredStatement, when called without arguments, return a languagemap object that can help you build multilingual support.

reqst = manifest.set_requiredStatement()
reqst.add_label('Provided by','en') 
reqst.add_value('University of Verona','en')
reqst.add_label('Contenuto fornito da','it')
reqst.add_value('Università di Verona','it')
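
In the serialized manifest, the IIIF Presentation API 3.0 represents multilingual text as a language map: a JSON object whose keys are language tags and whose values are lists of strings. The plain-Python sketch below builds the structure that the calls above are meant to describe. `add_to_language_map` is a hypothetical helper written for this illustration, not the pyIIIFpres implementation:

```python
import json


def add_to_language_map(language_map: dict, text: str, language: str) -> dict:
    """Append `text` to the list stored under `language`, creating it if needed."""
    language_map.setdefault(language, []).append(text)
    return language_map


# A requiredStatement has a label and a value, each of which is a language map.
required_statement = {"label": {}, "value": {}}
add_to_language_map(required_statement["label"], "Provided by", "en")
add_to_language_map(required_statement["value"], "University of Verona", "en")
add_to_language_map(required_statement["label"], "Contenuto fornito da", "it")
add_to_language_map(required_statement["value"], "Università di Verona", "it")

print(json.dumps(required_statement, ensure_ascii=False, indent=2))
```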

Using a language detector

Another possible approach is to use a language detector. There are many alternatives for this task; this StackOverflow answer gives a good overview of the landscape.

This example shows a basic implementation using langdetect.

Running the language detector on the manifest iiifpapi3 object of the 0065-opera-multiple-canvases recipe gives the following output:

In [3]: check_languages(manifest)                                               
❌  L'Elisir D'Amore seems not to be: it
✅  The Elixir of Love is : en
✅  Date Issued is : en
⚠️  Could not detect language for 2019 but is set to: en
✅  Publisher is : en
✅  Indiana University Jacobs School of Music is : en
❌  Atto Primo seems not to be: en
❌  Atto Secondo seems not to be: en
✅  Gaetano Donizetti, L'Elisir D'Amore is : it
❌  Atto Primo seems not to be: en
✅  Preludio e Coro d'introduzione – Bel conforto al mietitore is : it
✅  Remainder of Atto Primo is : en
❌  Atto Secondo seems not to be: en
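
A report like this can be produced with a small comparison loop. The sketch below is not the `check_languages` function used above: `report_languages` and `toy_detect` are hypothetical names, the input is assumed to be a list of (text, declared language) pairs already collected from the manifest, and the detector is passed in as a function so the example runs without any third-party package.

```python
from typing import Callable, Iterable, List, Tuple


def report_languages(
    entries: Iterable[Tuple[str, str]],
    detect: Callable[[str], str],
) -> List[str]:
    """Compare the declared language of each text with the detected one.

    `entries` holds (text, declared_language) pairs; `detect` is any callable
    that returns a language code for a text and raises when it cannot decide.
    """
    lines = []
    for text, declared in entries:
        try:
            detected = detect(text)
        except Exception:  # broad on purpose: detector-specific errors vary
            lines.append(f"⚠️  Could not detect language for {text} but is set to: {declared}")
            continue
        if detected == declared:
            lines.append(f"✅  {text} is : {declared}")
        else:
            lines.append(f"❌  {text} seems not to be: {declared}")
    return lines


def toy_detect(text: str) -> str:
    """Stand-in detector for the demo only; replace it with a real one."""
    if not any(ch.isalpha() for ch in text):
        raise ValueError("no letters to analyse")
    return "it" if "Atto" in text else "en"


demo_entries = [("The Elixir of Love", "en"), ("2019", "en"), ("Atto Primo", "en")]
for line in report_languages(demo_entries, toy_detect):
    print(line)
```

With langdetect installed, its `detect` function can be passed in place of `toy_detect`. As the output above shows, short strings such as titles are often misclassified, so the report is best treated as a list of entries to review by hand and not as a verdict.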