Merge pull request #129 from impresso/BCUL-acquisition
Bcul acquisition
piconti authored May 1, 2024
2 parents 213e304 + e31ff5b commit 999b06a
Showing 68 changed files with 57,265 additions and 79 deletions.
3 changes: 3 additions & 0 deletions .gitignore
Original file line number Diff line number Diff line change
Expand Up @@ -3,6 +3,9 @@
*pyc
text_importer/data/tmp/
text_importer/data/temp/
text_importer/data/out_pauline/
text_importer/data/run_logs/
text_importer/data/sample_data/BCUL/171722
.pytest_cache
*.egg-info
.ipynb_checkpoints/
Binary file modified docs/_build/doctrees/architecture.doctree
Binary file not shown.
Binary file modified docs/_build/doctrees/environment.pickle
Binary file not shown.
Binary file modified docs/_build/doctrees/importers.doctree
Binary file not shown.
Binary file added docs/_build/doctrees/importers/bcul.doctree
Binary file not shown.
Binary file modified docs/_build/doctrees/install.doctree
Binary file not shown.
15 changes: 9 additions & 6 deletions docs/_build/html/_sources/importers.rst.txt
Original file line number Diff line number Diff line change
Expand Up @@ -11,16 +11,18 @@ The following importer CLI scripts are already available:

- :py:mod:`text_importer.scripts.oliveimporter`: importer for the *Olive XML format*, used by
`RERO <https://www.rero.ch/>`_ to encode and deliver the majority of its newspaper data.
- :py:mod:`text_importer.scripts.reroimporter`: importer for the Mets/ALTO flavor used by `RERO <https://www.rero.ch/>`_
- :py:mod:`text_importer.scripts.reroimporter`: importer for the *Mets/ALTO flavor* used by `RERO <https://www.rero.ch/>`_
to encode and deliver part of its data.
- :py:mod:`text_importer.scripts.luximporter`: importer for the Mets/ALTO flavor used by the `Bibliothèque nationale de Luxembourg (BNL)
- :py:mod:`text_importer.scripts.luximporter`: importer for the *Mets/ALTO flavor* used by the `Bibliothèque nationale de Luxembourg (BNL)
<https://bnl.public.lu/>`_ to encode and deliver its newspaper data.
- :py:mod:`text_importer.scripts.bnfimporter`: importer for the Mets/ALTO flavor used by the `Bibliothèque nationale de France (BNF)
- :py:mod:`text_importer.scripts.bnfimporter`: importer for the *Mets/ALTO flavor* used by the `Bibliothèque nationale de France (BNF)
<https://www.bnf.fr/en/>`_ to encode and deliver its newspaper data.
- :py:mod:`text_importer.scripts.bnfen_importer`: importer for the Mets/ALTO flavor used by the `Bibliothèque nationale de France (BNF)
- :py:mod:`text_importer.scripts.bnfen_importer`: importer for the *Mets/ALTO flavor* used by the `Bibliothèque nationale de France (BNF)
<https://www.bnf.fr/en/>`_ to encode and deliver its newspaper data for the Europeana collection.
- :py:mod:`text_importer.scripts.swaimporter`: ALTO flavor of the `Basel University Library`.
- :py:mod:`text_importer.scripts.blimporter`: importer for the Mets/ALTO flavor used by the `British Library (BL) <https://www.bl.uk/>`_
- :py:mod:`text_importer.scripts.bcul_importer`: importer for the *ABBYY format* used by the `Bibliothèque Cantonale Universitaire de Lausanne (BCUL)
<https://www.bcu-lausanne.ch/en/>`_ to encode and deliver the newspaper data available on the `Scriptorium interface <https://scriptorium.bcu-lausanne.ch/page/home>`_.
- :py:mod:`text_importer.scripts.swaimporter`: *ALTO flavor* of the `Basel University Library`.
- :py:mod:`text_importer.scripts.blimporter`: importer for the *Mets/ALTO flavor* used by the `British Library (BL) <https://www.bl.uk/>`_
to encode and deliver its newspaper data.
- :py:mod:`text_importer.scripts.tetml`: generic importer for the *TETML format*, produced by `PDFlib TET <https://www.pdflib.com/products/tet/overview/>`_.
- :py:mod:`text_importer.scripts.fedgaz`: importer for the *TETML format* with separate metadata file and a heuristic article segmentation,
Expand All @@ -40,6 +42,7 @@ For further details on any of these implementations, please do refer to its docu
importers/bl
importers/bnf
importers/bnf-en
importers/bcul
importers/tetml
importers/fedgaz

30 changes: 30 additions & 0 deletions docs/_build/html/_sources/importers/bcul.rst.txt
Original file line number Diff line number Diff line change
@@ -0,0 +1,30 @@
BCUL ABBYY importer
=========================

This importer is written to accommodate the ABBYY OCR format.
It was developed to handle OCR newspaper data provided by the `Bibliothèque Cantonale Universitaire de Lausanne
(BCUL - Lausanne Cantonal University Library) <https://www.bcu-lausanne.ch/en/>`_, which are part of the `Scriptorium interface <https://scriptorium.bcu-lausanne.ch/page/home>`_ and collection.

BCUL Custom classes
---------------------

.. automodule:: text_importer.importers.bcul.classes
:members:
:undoc-members:
:show-inheritance:

BCUL Detect functions
-----------------------

.. automodule:: text_importer.importers.bcul.detect
:members:
:undoc-members:
:show-inheritance:

BCUL Helper functions
-----------------------

.. automodule:: text_importer.importers.bcul.helpers
:members:
:undoc-members:
:show-inheritance:
2 changes: 1 addition & 1 deletion docs/_build/html/_sources/install.rst.txt
Original file line number Diff line number Diff line change
Expand Up @@ -21,4 +21,4 @@ General installation:

.. code-block:: bash
pip install text-importer
pip install impresso-text-importer
24 changes: 20 additions & 4 deletions docs/_build/html/architecture.html
Original file line number Diff line number Diff line change
Expand Up @@ -62,6 +62,7 @@
</ul>
</li>
<li class="toctree-l2"><a class="reference internal" href="#module-text_importer.importers.core">Processing</a><ul>
<li class="toctree-l3"><a class="reference internal" href="#text_importer.importers.core.cleanup"><code class="docutils literal notranslate"><span class="pre">cleanup()</span></code></a></li>
<li class="toctree-l3"><a class="reference internal" href="#text_importer.importers.core.compress_issues"><code class="docutils literal notranslate"><span class="pre">compress_issues()</span></code></a></li>
<li class="toctree-l3"><a class="reference internal" href="#text_importer.importers.core.compress_pages"><code class="docutils literal notranslate"><span class="pre">compress_pages()</span></code></a></li>
<li class="toctree-l3"><a class="reference internal" href="#text_importer.importers.core.dir2issue"><code class="docutils literal notranslate"><span class="pre">dir2issue()</span></code></a></li>
Expand Down Expand Up @@ -186,6 +187,21 @@ <h3>Image data<a class="headerlink" href="#image-data" title="Link to this headi
<p>The function <a class="reference internal" href="#text_importer.importers.core.import_issues" title="text_importer.importers.core.import_issues"><code class="xref py py-func docutils literal notranslate"><span class="pre">import_issues()</span></code></a> is the most important in this module
as it keeps everything together, by calling all other functions.</p>
</div>
<dl class="py function">
<dt class="sig sig-object py" id="text_importer.importers.core.cleanup">
<span class="sig-prename descclassname"><span class="pre">text_importer.importers.core.</span></span><span class="sig-name descname"><span class="pre">cleanup</span></span><span class="sig-paren">(</span><em class="sig-param"><span class="n"><span class="pre">upload_success</span></span><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="n"><span class="pre">bool</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">filepath</span></span><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="n"><span class="pre">str</span></span></em><span class="sig-paren">)</span> <span class="sig-return"><span class="sig-return-icon">&#x2192;</span> <span class="sig-return-typehint"><span class="pre">None</span></span></span><a class="headerlink" href="#text_importer.importers.core.cleanup" title="Link to this definition"></a></dt>
<dd><p>Remove a file if it has been successfully uploaded to S3.</p>
<p>Copied and adapted from impresso-pycommons.</p>
<dl class="field-list simple">
<dt class="field-odd">Parameters<span class="colon">:</span></dt>
<dd class="field-odd"><ul class="simple">
<li><p><strong>upload_success</strong> (<em>bool</em>) – Whether the upload was successful</p></li>
<li><p><strong>filepath</strong> (<em>str</em>) – Path to the uploaded file</p></li>
</ul>
</dd>
</dl>
</dd></dl>
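
For readers of this diff, the behaviour described for `cleanup()` above can be illustrated with a minimal sketch. It is not the code from `text_importer.importers.core` or impresso-pycommons; the name `cleanup_sketch` and the existence check are assumptions made for illustration, and only the two parameters come from the documented signature.

```python
import os
import tempfile


def cleanup_sketch(upload_success: bool, filepath: str) -> None:
    """Hypothetical stand-in for the documented ``cleanup()``.

    Removes ``filepath`` only when the upload is reported as successful.
    The existence check is an assumption, added so that a repeated call
    does not raise.
    """
    if upload_success and os.path.exists(filepath):
        os.remove(filepath)


# Usage: a temporary file stands in for a locally written archive.
fd, demo_path = tempfile.mkstemp(suffix=".jsonl.bz2")
os.close(fd)

cleanup_sketch(False, demo_path)  # failed upload: the file is kept
kept_after_failure = os.path.exists(demo_path)

cleanup_sketch(True, demo_path)   # successful upload: the file is removed
removed_after_success = not os.path.exists(demo_path)
```

Keeping the file when the upload fails means the local copy remains available for a retry, which is presumably why the function takes the upload status as an argument.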

<dl class="py function">
<dt class="sig sig-object py" id="text_importer.importers.core.compress_issues">
<span class="sig-prename descclassname"><span class="pre">text_importer.importers.core.</span></span><span class="sig-name descname"><span class="pre">compress_issues</span></span><span class="sig-paren">(</span><em class="sig-param"><span class="n"><span class="pre">key</span></span><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="n"><span class="pre">Tuple</span><span class="p"><span class="pre">[</span></span><span class="pre">str</span><span class="p"><span class="pre">,</span></span><span class="w"> </span><span class="pre">int</span><span class="p"><span class="pre">]</span></span></span></em>, <em class="sig-param"><span class="n"><span class="pre">issues</span></span><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="n"><span class="pre">list</span><span class="p"><span class="pre">[</span></span><a class="reference internal" href="custom_importer.html#text_importer.importers.classes.NewspaperIssue" title="text_importer.importers.classes.NewspaperIssue"><span class="pre">NewspaperIssue</span></a><span class="p"><span class="pre">]</span></span></span></em>, <em class="sig-param"><span class="n"><span class="pre">output_dir</span></span><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="n"><span class="pre">str</span><span class="w"> </span><span class="p"><span class="pre">|</span></span><span class="w"> </span><span class="pre">None</span></span><span class="w"> </span><span class="o"><span class="pre">=</span></span><span class="w"> </span><span class="default_value"><span class="pre">None</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">failed_log</span></span><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="n"><span class="pre">str</span><span class="w"> </span><span class="p"><span class="pre">|</span></span><span class="w"> </span><span class="pre">None</span></span><span class="w"> 
</span><span class="o"><span class="pre">=</span></span><span class="w"> </span><span class="default_value"><span class="pre">None</span></span></em><span class="sig-paren">)</span> <span class="sig-return"><span class="sig-return-icon">&#x2192;</span> <span class="sig-return-typehint"><span class="pre">Tuple</span><span class="p"><span class="pre">[</span></span><span class="pre">str</span><span class="p"><span class="pre">,</span></span><span class="w"> </span><span class="pre">str</span><span class="p"><span class="pre">,</span></span><span class="w"> </span><span class="pre">list</span><span class="p"><span class="pre">[</span></span><span class="pre">dict</span><span class="p"><span class="pre">[</span></span><span class="pre">str</span><span class="p"><span class="pre">,</span></span><span class="w"> </span><span class="pre">int</span><span class="p"><span class="pre">]</span></span><span class="p"><span class="pre">]</span></span><span class="p"><span class="pre">]</span></span></span></span><a class="headerlink" href="#text_importer.importers.core.compress_issues" title="Link to this definition"></a></dt>
Expand Down Expand Up @@ -435,7 +451,7 @@ <h3>Image data<a class="headerlink" href="#image-data" title="Link to this headi
<dl class="field-list simple">
<dt class="field-odd">Parameters<span class="colon">:</span></dt>
<dd class="field-odd"><ul class="simple">
<li><p><strong>sort_key</strong> (<em>str</em>) – the key used to group articles (e.g. “GDL-1900”).</p></li>
<li><p><strong>sort_key</strong> (<em>str</em>) – the key used to group articles (e.g. “GDL-1900-01-01-a”).</p></li>
<li><p><strong>filepath</strong> (<em>str</em>) – Path of the file to upload to S3.</p></li>
<li><p><strong>bucket_name</strong> (<em>str</em><em> | </em><em>None</em><em>, </em><em>optional</em>) – Name of S3 bucket where to upload
the file. Defaults to None.</p></li>
Expand All @@ -457,13 +473,13 @@ <h3>Image data<a class="headerlink" href="#image-data" title="Link to this headi

<dl class="py function">
<dt class="sig sig-object py" id="text_importer.importers.core.write_error">
<span class="sig-prename descclassname"><span class="pre">text_importer.importers.core.</span></span><span class="sig-name descname"><span class="pre">write_error</span></span><span class="sig-paren">(</span><em class="sig-param"><span class="n"><span class="pre">thing</span></span><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="n"><a class="reference internal" href="custom_importer.html#text_importer.importers.classes.NewspaperIssue" title="text_importer.importers.classes.NewspaperIssue"><span class="pre">NewspaperIssue</span></a><span class="w"> </span><span class="p"><span class="pre">|</span></span><span class="w"> </span><a class="reference internal" href="custom_importer.html#text_importer.importers.classes.NewspaperPage" title="text_importer.importers.classes.NewspaperPage"><span class="pre">NewspaperPage</span></a><span class="w"> </span><span class="p"><span class="pre">|</span></span><span class="w"> </span><span class="pre">IssueDir</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">error</span></span><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="n"><span class="pre">Exception</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">failed_log</span></span><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="n"><span class="pre">str</span><span class="w"> </span><span class="p"><span class="pre">|</span></span><span class="w"> </span><span class="pre">None</span></span></em><span class="sig-paren">)</span> <span class="sig-return"><span class="sig-return-icon">&#x2192;</span> <span class="sig-return-typehint"><span class="pre">None</span></span></span><a class="headerlink" href="#text_importer.importers.core.write_error" title="Link to this definition"></a></dt>
<span class="sig-prename descclassname"><span class="pre">text_importer.importers.core.</span></span><span class="sig-name descname"><span class="pre">write_error</span></span><span class="sig-paren">(</span><em class="sig-param"><span class="n"><span class="pre">thing</span></span><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="n"><a class="reference internal" href="custom_importer.html#text_importer.importers.classes.NewspaperIssue" title="text_importer.importers.classes.NewspaperIssue"><span class="pre">NewspaperIssue</span></a><span class="w"> </span><span class="p"><span class="pre">|</span></span><span class="w"> </span><a class="reference internal" href="custom_importer.html#text_importer.importers.classes.NewspaperPage" title="text_importer.importers.classes.NewspaperPage"><span class="pre">NewspaperPage</span></a><span class="w"> </span><span class="p"><span class="pre">|</span></span><span class="w"> </span><span class="pre">IssueDir</span><span class="w"> </span><span class="p"><span class="pre">|</span></span><span class="w"> </span><span class="pre">str</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">error</span></span><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="n"><span class="pre">Exception</span><span class="w"> </span><span class="p"><span class="pre">|</span></span><span class="w"> </span><span class="pre">str</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">failed_log</span></span><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="n"><span class="pre">str</span><span class="w"> </span><span class="p"><span class="pre">|</span></span><span class="w"> </span><span class="pre">None</span></span></em><span class="sig-paren">)</span> <span class="sig-return"><span class="sig-return-icon">&#x2192;</span> <span class="sig-return-typehint"><span class="pre">None</span></span></span><a 
class="headerlink" href="#text_importer.importers.core.write_error" title="Link to this definition"></a></dt>
<dd><p>Write the given error of a failed import to the <cite>failed_log</cite> file.</p>
<dl class="field-list simple">
<dt class="field-odd">Parameters<span class="colon">:</span></dt>
<dd class="field-odd"><ul class="simple">
<li><p><strong>thing</strong> (<a class="reference internal" href="custom_importer.html#text_importer.importers.classes.NewspaperIssue" title="text_importer.importers.classes.NewspaperIssue"><em>NewspaperIssue</em></a><em> | </em><a class="reference internal" href="custom_importer.html#text_importer.importers.classes.NewspaperPage" title="text_importer.importers.classes.NewspaperPage"><em>NewspaperPage</em></a><em> | </em><em>IssueDir</em>) – Object for which
the error occurred.</p></li>
<li><p><strong>thing</strong> (<a class="reference internal" href="custom_importer.html#text_importer.importers.classes.NewspaperIssue" title="text_importer.importers.classes.NewspaperIssue"><em>NewspaperIssue</em></a><em> | </em><a class="reference internal" href="custom_importer.html#text_importer.importers.classes.NewspaperPage" title="text_importer.importers.classes.NewspaperPage"><em>NewspaperPage</em></a><em> | </em><em>IssueDir</em><em> | </em><em>str</em>) – Object for which
the error occurred, or corresponding canonical ID.</p></li>
<li><p><strong>error</strong> (<em>Exception</em>) – Error that occurred and should be logged.</p></li>
<li><p><strong>failed_log</strong> (<em>str</em>) – Path to log file for failed imports.</p></li>
</ul>
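
The widened `write_error()` signature in this hunk accepts either an object or a canonical ID string for `thing`, and either an `Exception` or a string for `error`. The sketch below shows one way such a function could handle both cases. It is an assumption-laden illustration, not the project's implementation: the name `write_error_sketch`, the lookup of an `id` attribute, and the message format are all hypothetical.

```python
from __future__ import annotations

import os
import tempfile
from typing import Any


def write_error_sketch(
    thing: Any | str, error: Exception | str, failed_log: str | None
) -> str:
    """Hypothetical illustration of logging a failed import.

    ``thing`` may already be a canonical ID string; otherwise an ``id``
    attribute is looked up (an assumption), falling back to ``repr``.
    """
    if isinstance(thing, str):
        identifier = thing
    else:
        identifier = getattr(thing, "id", repr(thing))

    note = f"Error when processing {identifier}: {error}"

    # Append to the log only when a path was provided.
    if failed_log is not None:
        with open(failed_log, "a", encoding="utf-8") as log_file:
            log_file.write(note + "\n")
    return note


# Usage with a canonical ID and a plain string error message.
fd, log_path = tempfile.mkstemp(suffix=".log")
os.close(fd)
message = write_error_sketch("GDL-1900-01-01-a", "missing page file", log_path)
with open(log_path, encoding="utf-8") as handle:
    logged = handle.read()
os.remove(log_path)
```

Accepting a plain ID lets a caller report a failure even when no issue or page object could be instantiated, which matches the updated parameter description ("Object for which the error occurred, or corresponding canonical ID").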