-
Notifications
You must be signed in to change notification settings - Fork 3
Add Le Figaro corpus definition #1692
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Won't this crop to "ga" in the interface?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By the way, this image might be a good alternative? https://commons.wikimedia.org/wiki/File:Mary_Cassatt_Reading_Le_Figaro.jpg
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Two minor things in the comments, looks good! I am excited to see this implementation of a generalized corpus definition for a set of corpora instead of a folder with utils.
backend/corpora/gallica/gallica.py
Outdated
|
||
languages = ["fr"] | ||
data_url = "https://gallica.bnf.fr" | ||
corpus_ark = "" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is corpus_ark
the subdirectory of a specific corpus? It might be useful to include some documentation on what this variable is and what function it serves within Gallica, so that other developers know exactly what to put here.
backend/corpora/gallica/gallica.py
Outdated
display_name="Publication ID", | ||
description="Identifier of the publication on Gallica", | ||
es_mapping=keyword_mapping(), | ||
extractor=XML(Tag("dc:identifier"), transform=lambda x: x.split("/")[-1]), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this might result in an IndexError if it cannot split the contents of the dc:identifier
element.
backend/corpora/gallica/gallica.py
Outdated
if int(year.string) >= start.year and int(year.string) <= end.year | ||
] | ||
for year in years: | ||
response = requests.get( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is half question half comment:
What happens in case of server being unresponsive, internet connection failing for a split second, stuff like that?
In the ideal case harvesting and indexing would be split operations so one failing does not lead to starting over completely. Unsure how we approached this in previous API-exposed corpora, hence the question.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Discussed this a bit with Luka, and this is how we usually approach online corpora. So ignore the comment!
raise CorpusNotIndexableError( | ||
'Configured data directory does not exist.' | ||
) | ||
if corpus.data_dircetory and not os.path.isdir(config.data_directory): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@BeritJanssen note the typo here
if corpus.data_dircetory and not os.path.isdir(config.data_directory): | ||
raise CorpusNotIndexableError('Configured data directory does not exist.') | ||
|
||
if corpus.data_url: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function body opens with if corpus.has_python_definition: return True
, so this block only runs for corpora without a Python definition - which don't support API sources.
@BeritJanssen I've reverted the merge commit because it's causing errors that are interfering with other development. Can you fix them and then re-open the PR? These errors are caused by issues in The test output reports a typo, but the same tests will also fail if you correct it. The lines that follow it seem to expect that the However, these lines are run after verifying that the corpus does not have a The reason why this validation is skipped for corpora with a Python definition, is that it functions as a pre-check for reading source files. But since Python corpora implement a custom script for that, it's hard to set universal expectations. In my view, the validation logic that is added here describes a decent default, but will cause issues if it has no options for customisation. It includes some assumptions that may not be universal, like If we do want to enforce these conditions, they should at least be documented clearly. The current documentation describes how to implement |
…ies/feature/gallica"" This reverts commit bc3194a.
…ies/feature/gallica"" This reverts commit bc3194a.
Related to #1089, this branch adds a general
gallica
corpus definition, as well as a subcorpus, Le Figaro.Right now, still has custom requirements, will be adjusted once the corresponding branch in ianalyzer-readers is merged and released.
NB: failing test is using the existing Docker image, which doesn't have the ianalyzer-readers update. So that's as expected.