Add Le Figaro corpus definition #1692

Merged
merged 14 commits into from
Nov 14, 2024
Conversation

@BeritJanssen BeritJanssen commented Nov 6, 2024

Related to #1089, this branch adds a general Gallica corpus definition, as well as a subcorpus, Le Figaro.

Right now, it still has custom requirements; these will be adjusted once the corresponding branch in ianalyzer-readers is merged and released.

NB: the failing test uses the existing Docker image, which doesn't have the ianalyzer-readers update, so that failure is expected.

@BeritJanssen BeritJanssen requested a review from Meesch November 6, 2024 15:33

Won't this crop to "ga" in the interface?

By the way, this image might be a good alternative? https://commons.wikimedia.org/wiki/File:Mary_Cassatt_Reading_Le_Figaro.jpg

@Meesch Meesch left a comment

Two minor points in the comments; otherwise this looks good! I am excited to see a generalized corpus definition for a set of corpora instead of a folder with utils.


languages = ["fr"]
data_url = "https://gallica.bnf.fr"
corpus_ark = ""

Is corpus_ark the subdirectory of a specific corpus? It might be useful to include some documentation on what this variable is and what function it serves within Gallica, so that other developers know exactly what to put here.

display_name="Publication ID",
description="Identifier of the publication on Gallica",
es_mapping=keyword_mapping(),
extractor=XML(Tag("dc:identifier"), transform=lambda x: x.split("/")[-1]),

I think this might result in an IndexError if it cannot split the contents of the dc:identifier element.
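A minimal sketch for checking this concern, assuming the value handed to the transform is either a string or None (an assumption; the extractor's actual behaviour for a missing element is not shown in this thread). `last_path_segment` is a hypothetical helper name, and the sample URL is a placeholder. For strings, `str.split` always returns at least one element, so indexing with `[-1]` does not raise an IndexError; the case that does need guarding is a missing value.

```python
from typing import Optional


def last_path_segment(value: Optional[str]) -> Optional[str]:
    """Return the text after the final "/" in value, or None if value is missing.

    Hypothetical helper: a guarded version of lambda x: x.split("/")[-1].
    """
    if value is None:
        # Calling .split on None would raise AttributeError, so bail out early.
        return None
    # str.split always yields at least one item, so [-1] is safe here.
    return value.split("/")[-1]


# Placeholder identifier, not a real Gallica record.
print(last_path_segment("https://gallica.bnf.fr/ark:/12148/example"))
# A value without any "/" comes back unchanged.
print(last_path_segment("example"))
```

If the guard is wanted, the helper could be passed as `transform=last_path_segment` in place of the lambda.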

    if int(year.string) >= start.year and int(year.string) <= end.year
]
for year in years:
    response = requests.get(

@JeltevanBoheemen JeltevanBoheemen Nov 14, 2024

This is half question, half comment:

What happens if the server is unresponsive, the internet connection drops for a split second, or something similar?
Ideally, harvesting and indexing would be separate operations, so that a failure in one does not mean starting over completely. I'm unsure how we approached this in previous API-exposed corpora, hence the question.
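One possible way to soften brief outages, sketched here under assumptions: wrap each request in a small retry loop with exponential backoff. `fetch_with_retry` and its parameters are hypothetical names, not part of this PR or of ianalyzer-readers, and the sketch uses only the standard library.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def fetch_with_retry(
    fetch: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 1.0,
    retry_on: Tuple[Type[BaseException], ...] = (ConnectionError, TimeoutError),
) -> T:
    """Call fetch(), retrying with exponential backoff on transient errors."""
    for attempt in range(attempts):
        try:
            return fetch()
        except retry_on:
            if attempt == attempts - 1:
                # Out of attempts: let the caller see the original error.
                raise
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            time.sleep(base_delay * 2 ** attempt)
    raise ValueError("attempts must be at least 1")
```

With `requests`, the call could look like `fetch_with_retry(lambda: requests.get(url, timeout=30), retry_on=(requests.exceptions.RequestException,))`. The `retry_on` argument matters there, because the exceptions raised by `requests` do not derive from the built-in `ConnectionError`. This only covers short interruptions; it does not separate harvesting from indexing.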

I discussed this briefly with Luka, and this is how we usually approach online corpora, so please ignore the comment!

@BeritJanssen BeritJanssen merged commit a9cec96 into develop Nov 14, 2024
2 of 3 checks passed
@BeritJanssen BeritJanssen deleted the feature/gallica branch November 14, 2024 14:24
raise CorpusNotIndexableError(
'Configured data directory does not exist.'
)
if corpus.data_dircetory and not os.path.isdir(config.data_directory):

@lukavdplas lukavdplas Nov 14, 2024

@BeritJanssen note the typo here

if corpus.data_dircetory and not os.path.isdir(config.data_directory):
    raise CorpusNotIndexableError('Configured data directory does not exist.')

if corpus.data_url:

This function body opens with if corpus.has_python_definition: return True, so this block only runs for corpora without a Python definition, and those don't support API sources.
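A small sketch of the control flow described above. Everything here is a stand-in: `FakeCorpus` and `validate_source` are hypothetical names, the exception class only mimics the one in the diff, and the HTTP check is an assumed shape for the new validation rather than the code from this PR.

```python
from dataclasses import dataclass
from typing import Optional


class CorpusNotIndexableError(Exception):
    """Stand-in for the exception of the same name in the diff."""


@dataclass
class FakeCorpus:
    # Hypothetical stand-in for the database object; only the attributes
    # mentioned in this thread are modelled.
    has_python_definition: bool
    data_url: Optional[str] = None


def validate_source(corpus: FakeCorpus) -> bool:
    if corpus.has_python_definition:
        # Early return: nothing below runs for corpora with a Python definition.
        return True
    if corpus.data_url and not corpus.data_url.startswith("http"):
        # Assumed check, for illustration only.
        raise CorpusNotIndexableError("data_url is not an HTTP(S) URL.")
    return True


# A Python corpus passes regardless of its data_url, because the check is skipped.
print(validate_source(FakeCorpus(has_python_definition=True, data_url="ftp://example.org")))
```

In this sketch, the `data_url` branch can only be reached by a corpus without a Python definition, which is the mismatch the comment points out.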

lukavdplas added a commit that referenced this pull request Nov 15, 2024
…ure/gallica"

This reverts commit a9cec96, reversing
changes made to bb5d3f1.
@lukavdplas lukavdplas restored the feature/gallica branch November 15, 2024 12:00
@lukavdplas

@BeritJanssen I've reverted the merge commit because it's causing errors that are interfering with other development. Can you fix them and then re-open the PR?

These errors are caused by issues in backend/addcorpus/validation/indexing.py. See the test output.

The test output reports a typo, but the same tests will also fail if you correct it. The lines that follow it seem to expect that the corpus argument is a CorpusDefinition (i.e. the custom Python class defining the corpus), rather than a Corpus (i.e. the database object).

However, these lines are run after verifying that the corpus does not have a CorpusDefinition class, which also suggests a deeper issue with what the code was meant to do. (This makes the fix less straightforward, which is why I reverted the merge instead of pushing a quick fix.)

This validation is skipped for corpora with a Python definition because it functions as a pre-check for reading source files. Since Python corpora implement a custom script for that, it's hard to set universal expectations.

In my view, the validation logic added here describes a decent default, but it will cause issues if it offers no options for customisation. It includes some assumptions that may not be universal, such as data_url using the HTTP protocol.

If we do want to enforce these conditions, they should at least be documented clearly. The current documentation describes how to implement sources() and data_directory but does not suggest the kind of restrictions that are implemented here.

BeritJanssen added a commit that referenced this pull request Nov 25, 2024
BeritJanssen added a commit that referenced this pull request Dec 4, 2024
4 participants