
Get the language right

Giacomo Marchioro edited this page Jul 21, 2021 · 11 revisions

pyIIIFpres checks the language subtags of labels, summaries and other text content against the language subtag registry.

NOTE: only language subtags are checked, not variants.

This registry contains more than 190 two-letter subtags and 8022 three-letter subtags. Since there are 26² = 676 possible two-letter strings and 26³ = 17576 possible three-letter strings, a random two-letter string has about a 28% chance of being a valid subtag, and a random three-letter string about a 45% chance.
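The percentages above can be reproduced with a few lines of arithmetic. This is only a sketch: `random_hit_chance` is a hypothetical helper, it assumes subtags are drawn uniformly from lowercase ASCII letters, and it uses 190 as a lower bound for the two-letter count.

```python
import string


def random_hit_chance(valid_subtags: int, length: int) -> float:
    """Chance that a random lowercase ASCII string of `length` letters is a registered subtag."""
    return valid_subtags / len(string.ascii_lowercase) ** length


two = random_hit_chance(190, 2)     # 190 / 676 (lower bound, the registry has "more than 190")
three = random_hit_chance(8022, 3)  # 8022 / 17576
print(f"two-letter: {two:.1%}, three-letter: {three:.1%}")
# prints: two-letter: 28.1%, three-letter: 45.6%
```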

A typo can therefore easily pass as a valid subtag. To reduce this risk, you might want to limit the check to the subset of languages you know are in your document. This can be achieved by reassigning the LANGUAGES global variable:

from IIIFpres import iiifpapi3, BCP47lang
iiifpapi3.LANGUAGES = [BCP47lang.english, BCP47lang.spanish]
# all the rest of your script
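To see why a short allow-list helps, here is a stand-alone illustration that does not use pyIIIFpres and does not reflect its internals. `primary_subtag` and `is_allowed` are hypothetical names, and the allow-list is assumed to hold plain subtag strings such as "en" and "es".

```python
def primary_subtag(tag: str) -> str:
    """Return the language subtag of a tag, e.g. 'en' for 'en-GB'."""
    return tag.split("-")[0].lower()


def is_allowed(tag: str, allowed: set) -> bool:
    """Accept a tag only if its language subtag is in the allow-list."""
    return primary_subtag(tag) in allowed


allowed = {"en", "es"}  # assumed: only English and Spanish occur in the document

print(is_allowed("en-GB", allowed))  # True: the variant is ignored, only 'en' is checked
print(is_allowed("it", allowed))     # False: valid in the registry, but not expected here
```

With only two accepted subtags, a mistyped two-letter tag is far more likely to be caught than when it is checked against the whole registry.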

Using a language detector

Another possible approach is to use a language detector. There are many alternatives for this task; this StackOverflow answer gives a good overview of the available options.

This example shows a basic implementation using langdetect.
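The idea can be sketched as follows. This is a minimal illustration, not the code that produced the output shown on this page. It assumes the text to check is available as an IIIF language map (a dict from language tag to a list of strings). `check_language_map`, `langdetect_detector` and the `detector` parameter are hypothetical names and are not part of pyIIIFpres. The only langdetect calls used are `detect`, `DetectorFactory.seed` and `LangDetectException`.

```python
from typing import Callable, Dict, List, Optional


def langdetect_detector(text: str) -> Optional[str]:
    """Return the language code guessed by langdetect, or None if it cannot decide."""
    # Imported lazily so the rest of the sketch works without langdetect installed.
    from langdetect import DetectorFactory, detect
    from langdetect.lang_detect_exception import LangDetectException

    DetectorFactory.seed = 0  # make langdetect's results reproducible
    try:
        return detect(text)
    except LangDetectException:
        # Raised, for example, when the text contains no usable features.
        return None


def check_language_map(
    language_map: Dict[str, List[str]],
    detector: Callable[[str], Optional[str]] = langdetect_detector,
) -> List[str]:
    """Compare each declared language tag with the detector's guess."""
    report = []
    for declared, values in language_map.items():
        if declared == "none":
            continue  # IIIF uses "none" when no language is declared
        primary = declared.split("-")[0].lower()
        for text in values:
            guess = detector(text)
            if guess is None:
                report.append(f"⚠️  Could not detect language for {text} but is set to: {declared}")
            elif guess.split("-")[0].lower() == primary:
                report.append(f"✅  {text} is : {declared}")
            else:
                report.append(f"❌  {text} seems not to be: {declared}")
    return report
```

Calling `check_language_map({"en": ["The Elixir of Love"]})` uses langdetect by default. Passing a different callable as `detector` lets you swap in another library. Short strings such as titles are hard to classify, so treat a mismatch as a hint to review the label rather than as a definite error.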

Running the language detector on the iiifpapi3 manifest object of the 0065-opera-multiple-canvases recipe produces the following output:

In [3]: check_languages(manifest)                                               
❌  L'Elisir D'Amore seems not to be: it
✅  The Elixir of Love is : en
✅  Date Issued is : en
⚠️  Could not detect language for 2019 but is set to: en
✅  Publisher is : en
✅  Indiana University Jacobs School of Music is : en
❌  Atto Primo seems not to be: en
❌  Atto Secondo seems not to be: en
✅  Gaetano Donizetti, L'Elisir D'Amore is : it
❌  Atto Primo seems not to be: en
✅  Preludio e Coro d'introduzione – Bel conforto al mietitore is : it
✅  Remainder of Atto Primo is : en
❌  Atto Secondo seems not to be: en