3 changes: 2 additions & 1 deletion backend/corpora/gallica/conftest.py
Original file line number Diff line number Diff line change
@@ -12,7 +12,8 @@ def gallica_corpus_settings(settings):
settings.CORPORA = {
"caricature": os.path.join(here, "caricature.py"),
"figaro": os.path.join(here, "figaro.py"),
"journauxresistance": os.path.join(here, "resistance.py"),
"journaux-resistance": os.path.join(here, "resistance.py"),
"journaux-tranchees": os.path.join(here, "tranchees.py"),
}


11 changes: 11 additions & 0 deletions backend/corpora/gallica/description/tranchees.md
@@ -0,0 +1,11 @@
This corpus contains journals that circulated in the trenches of the French military during World War 1. They were written and distributed by soldiers for their comrades.

Many of these journals published only a few issues; others, such as *Crapouillot* and *Canard enchaîné*, continued after the war. Some journals were printed, while others were copied using makeshift methods. Printed journals ended up in the legal deposit, but the record of the makeshift, "ephemeral" journals may be incomplete, as it depended on donations.

The journals were digitised and made publicly available by [Gallica](https://gallica.bnf.fr); they are part of Gallica's [Journaux de tranchées collection](https://gallica.bnf.fr/selections/fr/html/journaux-de-tranchees).

Only journals that could be found through the overview of Gallica's "Journaux de tranchées" collection were added, and only if plain text was available for them.

## Corpus image

First World War Sketchbook Volume 1 - Communication Trench: a view along a trench with a soldier (1915). Source: [Wikimedia Commons](https://commons.wikimedia.org/wiki/File:First_World_War_Sketchbook_Volume_1_-_Communication_Trench_Art.IWMART16707A26a.jpg)
2 changes: 1 addition & 1 deletion backend/corpora/gallica/gallica.py
@@ -39,7 +39,7 @@ def get_publication_id(identifier: str) -> str:

def join_issue_strings(issue_description: Union[list[str], None]) -> Union[str, None]:
if issue_description:
return "".join(issue_description[:2])
return " ".join(issue_description[:2])


class Gallica(XMLCorpusDefinition):
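
The change from `"".join` to `" ".join` decides whether the two issue descriptions run together in the `issue` field. A minimal standalone sketch follows; the helper is restated from the diff so the snippet runs on its own, and the third list item is a made-up value that only shows that entries beyond the second are ignored.

```python
from typing import Union


def join_issue_strings(issue_description: Union[list[str], None]) -> Union[str, None]:
    # Keep only the first two descriptions and separate them with a space.
    if issue_description:
        return " ".join(issue_description[:2])
    return None


# The first two values are taken from the test expectations in this diff;
# "ignored" is a hypothetical extra entry.
descriptions = ["01 janvier 1930", "1930/01/01 (Numéro 1).", "ignored"]
print(join_issue_strings(descriptions))
print(join_issue_strings(None))
```

With the old empty separator the same input would have produced `01 janvier 19301930/01/01 (Numéro 1).`, which is the value the updated tests replace.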
Binary file added backend/corpora/gallica/images/tranchees.jpg
5 changes: 5 additions & 0 deletions backend/corpora/gallica/images/tranchees.jpg.license
@@ -0,0 +1,5 @@
First World War Sketchbook Volume 1 - Communication Trench: a view along a trench with a soldier (1915)

This image was created and released by the Imperial War Museum on the IWM Non Commercial Licence. Photographs taken, or artworks created, by a member of the forces during their active service duties are covered by Crown Copyright provisions. Faithful reproductions may be reused under that licence, which is considered expired 50 years after their creation.

Source: https://commons.wikimedia.org/wiki/File:First_World_War_Sketchbook_Volume_1_-_Communication_Trench_Art.IWMART16707A26a.jpg
1 change: 0 additions & 1 deletion backend/corpora/gallica/resistance.py
@@ -1,5 +1,4 @@
from datetime import datetime
from itertools import chain

from django.conf import settings

Original file line number Diff line number Diff line change
@@ -0,0 +1,3 @@
<issues compile_time="0:00:01.435" date="1917" listType="issue" parentArk="cb32735055z/date">
<issue ark="bpt6k57549079" dayOfYear="46">15 février 1917</issue>
</issues>
Original file line number Diff line number Diff line change
@@ -0,0 +1,51 @@
<results ResultsGenerationSearchTime="0:00:00.008" countResults="1" resultType="CVOAIRecordSearchService" searchTime="">
<visibility_rights>all</visibility_rights>
<notice>
<record>
<header>
<identifier>oai:bnf.fr:gallica/ark:/12148/bpt6k57549079</identifier>
<datestamp>2023-09-22</datestamp>
<setSpec>gallica:corpus:Aquit1</setSpec>
<setSpec>gallica:theme:9:90</setSpec>
<setSpec>gallica:typedoc:periodiques:fascicules</setSpec>
</header>
<metadata>
<oai_dc:dc xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
<dc:identifier>https://gallica.bnf.fr/ark:/12148/bpt6k57549079</dc:identifier>
<dc:date>1917-02-15</dc:date>
<dc:description>15 février 1917</dc:description>
<dc:description>1917/02/15 (A1,N1).</dc:description>
<dc:description/>
<dc:title>Le Cafard muselé : organe des foyers du soldat n° 23, 27, 42 et 43</dc:title>
<dc:publisher>[s.n.] (Bordeaux)</dc:publisher>
<dc:type xml:lang="fre">texte</dc:type>
<dc:type xml:lang="eng">text</dc:type>
<dc:type xml:lang="fre">publication en série imprimée</dc:type>
<dc:type xml:lang="eng">printed serial</dc:type>
<dc:language>fre</dc:language>
<dc:relation>Notice du catalogue : http://catalogue.bnf.fr/ark:/12148/cb32735055z</dc:relation>
<dc:source>Bibliothèque nationale de France</dc:source>
<dc:rights xml:lang="fre">domaine public</dc:rights>
<dc:rights xml:lang="eng">public domain</dc:rights>
<dc:relation>http://gallica.bnf.fr/ark:/12148/cb32735055z/date</dc:relation>
<dc:description>Appartient à l’ensemble documentaire : Aquit1</dc:description>
<dc:format>Nombre total de vues : 709</dc:format>
</oai_dc:dc>
</metadata>
</record>
</notice>
<provenance>bnf.fr</provenance>
<sdewey>90</sdewey>
<dewey>9</dewey>
<source>Bibliothèque nationale de France</source>
<typedoc>fascicule</typedoc>
<nqamoyen>84.59</nqamoyen>
<mode_indexation>text</mode_indexation>
<title>Le Cafard muselé : organe des foyers du soldat n° 23, 27, 42 et 43</title>
<date nbIssue="1">1917-02-15</date>
<first_indexation_date>19/01/2011</first_indexation_date>
<streamable>false</streamable>
<listBibVirt>
<label>gallica</label>
</listBibVirt>
</results>
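
To see how a record like the one above relates to the expected `issue` value in the tests, here is a rough sketch using only the standard library. It is not the project's extraction code (which is not shown in this diff); `issue_label` and `SAMPLE` are hypothetical names, and `SAMPLE` is a trimmed copy of the fixture's Dublin Core fields. Skipping empty `dc:description` elements is an assumption of this sketch.

```python
import xml.etree.ElementTree as ET
from typing import Optional

# Namespace URI as declared in the fixture above.
DC_NS = {"dc": "http://purl.org/dc/elements/1.1/"}

# Trimmed copy of the fixture's metadata, for illustration only.
SAMPLE = """<metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:date>1917-02-15</dc:date>
  <dc:description>15 février 1917</dc:description>
  <dc:description>1917/02/15 (A1,N1).</dc:description>
  <dc:description/>
  <dc:description>Appartient à l’ensemble documentaire : Aquit1</dc:description>
</metadata>"""


def issue_label(xml_text: str) -> Optional[str]:
    root = ET.fromstring(xml_text)
    # Collect the non-empty dc:description values in document order.
    descriptions = [
        el.text for el in root.findall(".//dc:description", DC_NS) if el.text
    ]
    # Join the first two with a space, as join_issue_strings does.
    return " ".join(descriptions[:2]) if descriptions else None


print(issue_label(SAMPLE))
```

For this record the first two descriptions are the human-readable date and the numbering string, so the result matches the `issue` value expected for `bpt6k57549079` in `test_import.py`.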


Original file line number Diff line number Diff line change
@@ -0,0 +1,3 @@
<issues compile_time="0:00:00.001" listType="years" parentArk="cb32735055z/date" totalIssues="45" uc3="no">
<year>1917</year>
</issues>
26 changes: 21 additions & 5 deletions backend/corpora/gallica/tests/test_import.py
@@ -19,7 +19,7 @@
],
"date": "1930-01-01",
"id": "bpt6k296099q",
"issue": "01 janvier 19301930/01/01 (Numéro 1).",
"issue": "01 janvier 1930 1930/01/01 (Numéro 1).",
"url": "https://gallica.bnf.fr/ark:/12148/bpt6k296099q",
}
],
@@ -38,30 +38,46 @@
],
"date": "1830-11-04",
"id": "bpt6k1048832g",
"issue": "04 novembre 18301830/11/04 (T1,N1).",
"issue": "04 novembre 1830 1830/11/04 (T1,N1).",
"publisher": "Aubert (Paris)[s.n.][s.n.] (Paris)",
"url": "https://gallica.bnf.fr/ark:/12148/bpt6k1048832g",
}
],
},
'journauxresistance': {
'journaux-resistance': {
'n_documents': 10,
'documents': [
{
'content': 'CENTRE D f INFORMATION ET DE DOCUMENTATION cid/ïïx Document fi* 2 LM!',
'contributor': [],
'date': '1943-09-01',
'id': 'bpt6k8724474',
'issue': '01 septembre 19431943/09/01 (N2)-1943/09/30.',
'issue': '01 septembre 1943 1943/09/01 (N2)-1943/09/30.',
'title': "Document n°... / Centre d'information et de documentation ; Éditions MUR",
'url': 'https://gallica.bnf.fr/ark:/12148/bpt6k8724474',
}
],
},
'journaux-tranchees': {
'n_documents': 10,
'documents': [
{
'content': 'Première Année. No 1. 15 Février 1917; BABmiiflGB > Salut, mon vieux !',
'contributor': [],
'date': '1917-02-15',
'id': 'bpt6k57549079',
'issue': '15 février 1917 1917/02/15 (A1,N1).',
'title': 'Le Cafard muselé : organe des foyers du soldat n° 23, 27, 42 et 43',
'url': 'https://gallica.bnf.fr/ark:/12148/bpt6k57549079',
}
],
},
}


@pytest.mark.parametrize('corpus_name', ['caricature', 'figaro', 'journauxresistance'])
@pytest.mark.parametrize(
'corpus_name', ['caricature', 'figaro', 'journaux-resistance', 'journaux-tranchees']
)
def test_gallica_import(corpus_name, monkeypatch, gallica_corpus_settings):
mock = MockResponseFactory(corpus_name)
monkeypatch.setattr(requests, "get", mock.mock_response)
120 changes: 120 additions & 0 deletions backend/corpora/gallica/tranchees.py
@@ -0,0 +1,120 @@
from datetime import datetime

from django.conf import settings

from corpora.gallica.gallica import Gallica


class JournauxTranchees(Gallica):
title = "Journaux de tranchées"
description = "Publications distributed in the trenches of World War 1"
min_date = datetime(year=1914, month=1, day=1)
max_date = datetime(year=1924, month=12, day=31)
publication_ids = [
"cb32735055z", # Cafard muselé
"cb32736738j", # Le Camouflet
"cb327369759", # Le Canard du biffin
"cb32736976n", # Le Canard du boyau
"cb32737002v", # Le Canard poilu
"cb327404552", # Le Chat pelottant
"cb34384155m", # Le Crapouillot
"cb32752343g", # Le Cri de ralliement des Gromort
"cb44415603t", # La Colonne double
"cb32754250z", # De la lumière
"cb327615837", # L'Écho des guitounes
"cb327619882", # L'Écho du boyau
"cb32715529c", # La Bourguignotte
"cb32715529c", # Le Filon (Blois)
"cb32775728t", # Flambeau
"cb32775703g", # Le Flambeau
"cb32775703g", # La Fourragère
"cb32778911c", # Le Front
"cb32779223p", # La Fusée à retards
"cb32779220n", # La Fusée: journal anti-boche
"cb327803261", # Gazette de l'Académie Julian
"cb327803470", # Gazette de l'Atelier André
"cb32780348b", # Gazette de l'Atelier Bernier
"cb32780349p", # Gazette de l'Atelier Defrasse
"cb327803579", # Gazette de l'École régionale
"cb32780350w", # Gazette de l'Atelier Héraud
"cb327813028", # Gazette de l'Atelier Laloux
"cb327803517", # Gazette de l'Atelier Lambert
"cb32780358n", # Gazette de l'École des Beaux-Arts
"cb32780494s", # Gazette de Lemar
"cb32780668s", # Gazette Deglane
"cb32780698q", # Gazette des arts déco
"cb43639008g", # Gazette des classes de composition
"cb32780766c", # Gazette des Cormon, Collin, Flameng
"cb42750804k", # La Gazette des JPL
"cb32781243k", # Gazette Godefroy-Freynet
"cb32781466s", # Gazette Pauline
"cb32781583n", # Gazette Woillez de la Bouglise
"cb32783911c", # La Greffe générale
"cb444156285", # La Gazette de nos Poilus
"cb32787828s", # Hurle obus
"cb327882825", # Les idées noires
"cb32803692w", # Le Klaxon
"cb32804817x", # La Lacrymogène
"cb32809265b", # Le Looping
"cb32811709h", # Marmita
"cb38688428k", # La Marmite
"cb42429408w", # Le Marsouin du 53e
"cb32817015z", # La Mitraille (Nancy)
"cb32821051k", # La Musette (Toulouse)
"cb32823952w", # Le Nonante
"cb32824500s", # Nos filleuls
"cb32808132v", # Le Lion d'Arras
"cb32796106h", # Les Jeunes patriotes
"cb32824610c", # Nos tanks
"cb444157409", # Jaussely's gazette
"cb444157064", # Journal des soldats du "Choral moderne"
"cb328350287", # Le Pépère
"cb32835057t", # Le Perco
"cb328352410", # Le Périscope
"cb328406026", # Le Poilu sans poil
"cb32836282t", # Le Petit écho du 21e
"cb328362835", # Le "Petit écho" en campagne
"cb32840414n", # Le Plus-que-Torial
"cb32840486h", # Le Poilu du 37
"cb32840582d", # Le Poilu marmité
"cb32840601v", # Le Poilu st-émilionnais
"cb44415720p", # Le Poilu du Petit Parisien
"cb38688438w", # Le Panseur
"cb328405216", # Le Poilu (Châlons-sur-Marne)
"cb32840513k", # Le Poil civil
"cb328465460", # Le Quatrième vitrier
"cb32848663b", # Le Rayon
"cb32849865t", # Le Redon
"cb328638993", # Le Sac à terre
"cb328709646", # Le Son du cor
"cb328713356", # Le Souvenir
"cb328637943", # L'S. P... rance : journal gai
"cb328764559", # Télé-mail
"cb32876691q", # Le Temps buté
"cb32879433q", # Le Trait d'union
"cb444157618", # La Trompette des marécage
]
category = "periodical"
es_index = getattr(settings, 'TRANCHEES_INDEX', 'tranchees')
image = "tranchees.jpg"
description_page = "tranchees.md"

def sources(self, start, end):
for pub_id in self.publication_ids:
self.corpus_id = pub_id
docs = super().sources(start, end)
if not docs:
continue
for doc in docs:
yield doc

def __init__(self):
self.fields = [
self.content(),
self.contributor(),
self.date(self.min_date, self.max_date),
self.identifier(),
self.issue(),
self.periodical_title(),
self.url(),
]
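
A long, hand-maintained list such as `publication_ids` is easy to get subtly wrong: a missing comma makes Python silently concatenate two adjacent string literals, and an ID can be pasted twice under different titles. The sketch below shows one way to catch both. `check_publication_ids` and `ID_PATTERN` are hypothetical, and the pattern (`cb`, eight digits, one trailing digit or lowercase letter) is only inferred from the IDs listed in this file, not taken from any Gallica specification.

```python
import re
from collections import Counter

# Assumed ID shape, inferred from the IDs in this diff.
ID_PATTERN = re.compile(r"cb\d{8}[0-9a-z]")


def check_publication_ids(ids):
    """Return (malformed, duplicated) entries of an ID list."""
    malformed = [i for i in ids if not ID_PATTERN.fullmatch(i)]
    duplicated = sorted(i for i, n in Counter(ids).items() if n > 1)
    return malformed, duplicated


ids = [
    "cb32780668s",
    "cb32780698q"  # no comma here: this literal is merged with the next one
    "cb43639008g",
    "cb32715529c",
    "cb32715529c",  # same ID listed twice
]

malformed, duplicated = check_publication_ids(ids)
print(malformed)   # ['cb32780698qcb43639008g']
print(duplicated)  # ['cb32715529c']
```

A check like this could run as a small unit test next to `test_import.py`, so that a merged or repeated ID fails fast rather than surfacing as a missing or double-indexed publication.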
4 changes: 2 additions & 2 deletions backend/corpora/parliament/utils/parlamint_v4.py
@@ -2,10 +2,10 @@
from string import punctuation
from typing import Iterable

from ianalyzer_readers.extract import XML, Combined, Metadata
from ianalyzer_readers.xml_tag import Tag
from bs4.element import NavigableString, Tag as Node
from bs4 import BeautifulSoup
from ianalyzer_readers.extract import Combined, Metadata, XML
from ianalyzer_readers.xml_tag import Tag

from addcorpus.es_mappings import non_indexed_text_mapping, keyword_mapping
from addcorpus.python_corpora.corpus import FieldDefinition