Add Le Figaro corpus definition #1692

BeritJanssen · 2024-11-06T15:18:03Z

Related to #1089, this branch adds a general Gallica corpus definition, as well as a subcorpus, Le Figaro.

Right now, the branch still has custom requirements; these will be adjusted once the corresponding branch in ianalyzer-readers is merged and released.

NB: the failing test uses the existing Docker image, which doesn't have the ianalyzer-readers update, so the failure is expected.

lukavdplas · 2024-11-06T15:37:48Z

backend/corpora/gallica/images/figaro.png

Won't this crop to "ga" in the interface?

By the way, this image might be a good alternative? https://commons.wikimedia.org/wiki/File:Mary_Cassatt_Reading_Le_Figaro.jpg

Meesch

Two minor things in the comments, looks good! I am excited to see this implementation of a generalized corpus definition for a set of corpora instead of a folder with utils.

Meesch · 2024-11-13T13:55:25Z

backend/corpora/gallica/gallica.py

+
+    languages = ["fr"]
+    data_url = "https://gallica.bnf.fr"
+    corpus_ark = ""


Is corpus_ark the subdirectory of a specific corpus? It might be useful to include some documentation on what this variable is and what function it serves within Gallica, so that other developers know exactly what to put here.

Meesch · 2024-11-13T14:00:07Z

backend/corpora/gallica/gallica.py

+            display_name="Publication ID",
+            description="Identifier of the publication on Gallica",
+            es_mapping=keyword_mapping(),
+            extractor=XML(Tag("dc:identifier"), transform=lambda x: x.split("/")[-1]),


I think this might result in an IndexError if it cannot split the contents of the dc:identifier element.
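A minimal sketch of a more defensive transform, for illustration only. In Python, str.split always returns at least one element, so [-1] on a string should not raise an IndexError; the likelier failure is a missing dc:identifier element, in which case the lambda would receive None (an assumption about the extractor's behaviour) and raise an AttributeError. The function name extract_publication_id and the sample identifiers are hypothetical and do not come from the PR.

```python
from typing import Optional


def extract_publication_id(identifier: Optional[str]) -> Optional[str]:
    """Return the last path segment of a dc:identifier value, or None.

    Hypothetical helper: assumes the extractor passes None (or an empty
    string) when the element is missing.
    """
    if not identifier:
        return None
    # rstrip guards against a trailing slash yielding an empty last segment.
    last_segment = identifier.rstrip("/").split("/")[-1]
    return last_segment or None
```

Such a function could be passed as transform=extract_publication_id in place of the lambda, so that a missing identifier yields an empty field instead of an exception.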

JeltevanBoheemen · 2024-11-14T09:22:29Z

backend/corpora/gallica/gallica.py

+            if int(year.string) >= start.year and int(year.string) <= end.year
+        ]
+        for year in years:
+            response = requests.get(


This is half question half comment:

What happens if the server is unresponsive, the internet connection fails for a split second, and so on?
Ideally, harvesting and indexing would be separate operations, so that one failing does not mean starting over completely. I'm unsure how we approached this in previous API-exposed corpora, hence the question.

Discussed this a bit with Luka, and this is how we usually approach online corpora. So ignore the comment!
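For reference, one common way to tolerate brief outages is to retry a request a few times with a pause in between. The sketch below is not the code from this branch: fetch_with_retry and its parameters are hypothetical names, and the request itself is injected as a callable so the example stays independent of any HTTP library. Note that requests raises its own requests.exceptions.ConnectionError, which is not a subclass of the built-in ConnectionError, so with requests that class would need to be passed via retry_on.

```python
import time
from typing import Callable, Optional, Tuple, Type, TypeVar

T = TypeVar("T")


def fetch_with_retry(
    fetch: Callable[[], T],
    retries: int = 3,
    wait_seconds: float = 1.0,
    retry_on: Tuple[Type[BaseException], ...] = (ConnectionError,),
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[T]:
    """Call fetch(); on a listed exception, wait and try again.

    Returns None when every attempt fails, so the caller can skip the
    item instead of aborting the whole run.
    """
    for attempt in range(1, retries + 1):
        try:
            return fetch()
        except retry_on:
            if attempt < retries:
                # Wait a little longer after each failed attempt.
                sleep(wait_seconds * attempt)
    return None
```

A caller might wrap each per-year request as fetch_with_retry(lambda: requests.get(url), retry_on=(requests.exceptions.ConnectionError,)), where url is whatever the loop would otherwise request, and skip the year when the result is None.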

lukavdplas · 2024-11-14T14:31:45Z

backend/addcorpus/validation/indexing.py

-        raise CorpusNotIndexableError(
-            'Configured data directory does not exist.'
-        )
+    if corpus.data_dircetory and not os.path.isdir(config.data_directory):


@BeritJanssen note the typo here

lukavdplas · 2024-11-14T14:46:21Z

backend/addcorpus/validation/indexing.py

+    if corpus.data_dircetory and not os.path.isdir(config.data_directory):
+        raise CorpusNotIndexableError('Configured data directory does not exist.')
+
+    if corpus.data_url:


This function body opens with if corpus.has_python_definition: return True, so this block only runs for corpora without a Python definition - which don't support API sources.
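The control flow described above can be illustrated with a simplified stand-in. FakeCorpus and validate_source_data are hypothetical names, and the function body is a reduced approximation of the validation, not the code in backend/addcorpus/validation/indexing.py; only has_python_definition, data_directory and CorpusNotIndexableError are taken from the diff and the comment.

```python
import os
from dataclasses import dataclass
from typing import Optional


class CorpusNotIndexableError(Exception):
    """Stand-in for the exception raised in the diff."""


@dataclass
class FakeCorpus:
    """Hypothetical stand-in for the corpus object; not the real model."""
    has_python_definition: bool
    data_directory: Optional[str] = None


def validate_source_data(corpus: FakeCorpus) -> bool:
    # Early return: corpora with a Python definition never reach the checks below.
    if corpus.has_python_definition:
        return True
    # Only corpora without a Python definition are checked here.
    if corpus.data_directory and not os.path.isdir(corpus.data_directory):
        raise CorpusNotIndexableError("Configured data directory does not exist.")
    return True
```

With this shape, a Python-defined corpus with a nonexistent data directory still passes validation, so any check added below the early return has no effect on it.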

lukavdplas · 2024-11-15T12:02:32Z

@BeritJanssen I've reverted the merge commit because it's causing errors that are interfering with other development. Can you fix them and then re-open the PR?

These errors are caused by issues in backend/addcorpus/validation/indexing.py. See the test output.

The test output reports a typo, but the same tests will also fail if you correct it. The lines that follow it seem to expect that the corpus argument is a CorpusDefinition (i.e. the custom Python class defining the corpus), rather than a Corpus (i.e. the database object).

However, these lines are run after verifying that the corpus does not have a CorpusDefinition class, which also suggests a deeper issue with what the code was meant to do. (This makes the fix less straightforward, which is why I reverted the merge instead of pushing a quick fix.)

The reason this validation is skipped for corpora with a Python definition is that it functions as a pre-check for reading source files. Since Python corpora implement a custom script for that, it's hard to set universal expectations.

In my view, the validation logic added here describes a decent default, but it will cause issues if it offers no options for customisation. It includes some assumptions that may not be universal, such as data_url using the HTTP protocol.

If we do want to enforce these conditions, they should at least be documented clearly. The current documentation describes how to implement sources() and data_directory but does not suggest the kind of restrictions that are implemented here.

BeritJanssen added 5 commits October 4, 2024 20:39

add Figaro

a105a0d

add test data and test

05744d0

update requirements.txt

0a23863

fix issue retrieval, add contributor field

24defa2

add unit test

c29058e

BeritJanssen mentioned this pull request Nov 6, 2024

Allow source as bytes when first part of tuple CentreForDigitalHumanities/ianalyzer-readers#26

Merged

Merge branch 'develop' into feature/gallica

55cb1be

BeritJanssen requested a review from Meesch November 6, 2024 15:33

lukavdplas reviewed Nov 6, 2024

update figaro image

12fa435

Meesch approved these changes Nov 13, 2024

JeltevanBoheemen reviewed Nov 14, 2024

BeritJanssen added 7 commits November 14, 2024 11:23

rename and document corpus identifier; catch ConnectionErrors

097dc3c

catch failure to extraction publication id

d0da697

adjust requirements.txt

2ae9b06

bugfix: rename corpus_ark -> corpus_id

3e7bf73

fix: allow data_url in lieu of data_directory

827d133

fix: let API requests sleep to avoid 429 response

5419166

Merge branch 'develop' into feature/gallica

a87817a

BeritJanssen merged commit a9cec96 into develop Nov 14, 2024
2 of 3 checks passed

BeritJanssen deleted the feature/gallica branch November 14, 2024 14:24

lukavdplas reviewed Nov 14, 2024

lukavdplas added a commit that referenced this pull request Nov 15, 2024

Revert "Merge pull request #1692 from CentreForDigitalHumanities/feat…

bc3194a

…ure/gallica" This reverts commit a9cec96, reversing changes made to bb5d3f1.

lukavdplas restored the feature/gallica branch November 15, 2024 12:00

BeritJanssen added a commit that referenced this pull request Nov 25, 2024

Revert "Revert "Merge pull request #1692 from CentreForDigitalHumanit…

13a4bf7

…ies/feature/gallica"" This reverts commit bc3194a.

lukavdplas mentioned this pull request Nov 26, 2024

Fixed Gallica / Figaro corpus definition #1715

Closed

BeritJanssen added a commit that referenced this pull request Dec 4, 2024

Revert "Revert "Merge pull request #1692 from CentreForDigitalHumanit…

f568a00

…ies/feature/gallica"" This reverts commit bc3194a.
