Supervised by Prof. Dr. Jean-Yves RAMEL and conducted by Théo BOISSEAU at Ecole Polytechnique de l'Université de Tours.
Overhaul of Agora from the PaRADIIT Project: Analyzing Pattern Redundancy in texts of document images using Incremental Segmentation.
PaRADIIT is a project initiated and sponsored by two successive Google DH awards. It aims to turn ancient books, especially from the Renaissance, into accessible digital libraries.
The collaboration with the CESR resulted in the Agora software which simultaneously performs page layout analysis, text/graphics separation and pattern extraction.
The objective of this project is to start an overhaul of the Agora software with a new approach oriented towards deep learning.
deep-agora
├── deep_learning/ # working directory for development of the data science project
│ ├── deep_learning_lab/ # package for deep learning lab
│ │ ├── data_preparation/ # subpackage of deep learning lab for data preparation
│ │ │ ├── __init__.py # file to indicate this directory can be used as a package
│ │ │ ├── orchestration.py # module for coordinating the data preparation process
│ │ │ ├── patch.py # module for applying patches to images
│ │ │ └── xml_parser.py # module for parsing xml files
│ │ ├── __init__.py # file to indicate this directory can be used as a package
│ │ ├── gpu_setup.py # module for setting up GPU
│ │ ├── logging.py # module for logging information
│ │ └── model.py # module for defining deep learning model
│ ├── raw_datasets/ # location to download raw data sets
│ ├── tests/ # tests of deep_learning_lab designed for Pytest
│ │ ├── integration/ # integration tests
│ │ └── unit/ # unit tests
│ ├── download_data.sh # example of script to download data (incomplete)
│ └── segmentation.ipynb # Jupyter notebook for semantic segmentation
├── ... # future working directories (e.g. software development)
├── dependencies/ # project dependencies
│ ├── dhSegment-torch/ # sub-module and framework for semantic segmentation
│ ├── environment.yml # conda environment file adapted to sm_86 CUDA architecture
│ └── setup.py # setup file adapted to sm_86 CUDA architecture
├── .gitignore # specifies files to ignore when committing to git
├── .gitmodules # specifies submodules in dependencies/
└── README.md # readme file for the project
- deep_learning/ is a data science working directory. It is designed to develop the deep-learning models that will be used in the future software-development working directory deep_agora/. It includes the deep_learning_lab package, which allows a data scientist to prepare data, train deep neural networks and use them for inference on images. segmentation.ipynb can be used as an example or as an application of the package.
- dhSegment-torch/ is an external deep-learning framework cloned from the GitHub repository dhSegment-torch. Its environment files have been edited to adapt to the sm_86 CUDA architecture. dhSegment is a tool for historical document processing; its generic approach allows regions to be segmented and content to be extracted from different types of documents.
You need to use a Linux or WSL machine to work in the deep_learning directory, and we highly recommend a machine with a GPU, as processing times can be very long (many hours).
Check whether you have a GPU and the CUDA driver installed using the NVIDIA System Management Interface (nvidia-smi) by entering the following in your terminal:
nvidia-smi
You must also have Conda installed in order to perform the following installation.
The deep_learning_lab
package uses the sub-module and framework dhSegment-torch [2].
Go to dependencies/
and clone the sub-module(s) as follows:
cd dependencies/
git submodule update --init --recursive
We edited its environment files (environment.yml
and setup.py
) for compatibility with the sm_86 CUDA architecture of our machine.
You can look up your GPU name to determine its CUDA architecture and check whether these changes apply to your hardware.
To apply such changes, do as follows:
cp environment.yml dhSegment-torch/
cp setup.py dhSegment-torch/
Now, to install the package of the sub-module, go to dhSegment-torch/
as follows:
cd dhSegment-torch
And follow its installation guide:
dhSegment will not work properly if the dependencies are not respected. In particular, inaccurate dependencies may result in an inability to converge, even if no error is displayed. Therefore, we highly recommend creating a dedicated environment as follows:
conda env create --name dhs --file environment.yml
source activate dhs
python setup.py install
Raw datasets can be placed in a raw_datasets/
folder, located in the working directory.
They usually contain images and XML files containing their annotations.
These annotation files cannot be used directly to train a model; they must first be converted to masks.
For this reason, the deep_learning_lab.data_preparation
package allows the developer to select and use labels from their raw datasets to build masks.
A dataset to be patched must be specified by its main directory, its image directory and its annotation directory.
For the moment, some default datasets are implemented inside the source code of the deep_learning_lab.data_preparation.orchestration
module.
To be used, these default datasets must already have been downloaded.
Additional datasets can either be added to the default datasets in the source code of the module or via the Orchestrator.ingestDatasets
method.
In addition, to easily analyse the contents of a raw dataset, the Orchestrator.ingestLabels
method provides a prompt
parameter that allows the user to choose their labels and the Orchestrator.validate
method prints statistics on each dataset and its contents.
By default, the patched dataset is written to the results folder, in a sub-folder named after the specified labels, under the name training_data. For example, if you patched a dataset with the TextLine label, the dataset will be located at results/TextLine/training_data/.
Note that the deep_learning_lab.data_preparation.patch
module running in the backend of the Orchestrator
class has not been validated for multiple labels at the moment.
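The following sketch illustrates how such a data-preparation run might look. Only the Orchestrator class and its ingestDatasets, ingestLabels and validate methods are named in this README; the constructor call, the dictionary keys and the exact signatures are assumptions for illustration.

```python
# Hypothetical data-preparation run with deep_learning_lab.data_preparation.
# Method names come from this README; argument names and structures are assumed.
from deep_learning_lab.data_preparation.orchestration import Orchestrator

orchestrator = Orchestrator()

# Register an additional raw dataset by its main, image and annotation
# directories (the dictionary keys are hypothetical).
orchestrator.ingestDatasets([{
    "directory": "raw_datasets/pinkas_dataset",
    "images": "raw_datasets/pinkas_dataset/images",
    "annotations": "raw_datasets/pinkas_dataset/page",
}])

# Interactively choose the labels to build masks for.
orchestrator.ingestLabels(prompt=True)

# Print statistics on each dataset and its contents.
orchestrator.validate()

# The patched masks end up in results/<label>/training_data/ by default.
```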
Before training, you must specify whether you want to use a CPU or a GPU, and which one.
To do this, the deep_learning_lab.gpu_setup
module allows the selection of a GPU/CPU in the backend when instantiating the Trainer
class.
The Trainer
class from the deep_learning_lab.model
module can then be instantiated with a specified set of labels to segment and the dataset to use.
The trainer can be configured with many parameters relating to the split of the validation and test sets or to the training of the model itself.
Note that as few labels as possible should be specified at a time so that a model can be developed for each of them.
This allows greater modularity for future Deep-Agora software.
By default, the dataset used is training_data (results/"specified labels"/training_data). The model and tensorboard directories are located in the same place: model contains the best serialized models and tensorboard contains the logs of the metrics recorded during training.
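As a rough illustration, here is a minimal training sketch. Only the Trainer class and the deep_learning_lab.model and deep_learning_lab.gpu_setup modules are named in this README; the keyword arguments and the train() call are assumptions.

```python
# Hypothetical training run; argument names and the train() method are assumed.
from deep_learning_lab.model import Trainer

# GPU/CPU selection is handled by deep_learning_lab.gpu_setup in the backend
# when the Trainer is instantiated.
trainer = Trainer(
    labels=["TextLine"],                        # keep the label set as small as possible
    dataset="results/TextLine/training_data",   # default patched dataset location
    # ...plus optional parameters for the validation/test split and the training itself
)

# Writes the model/ and tensorboard/ directories alongside the dataset results.
trainer.train()
```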
The Predictor
class from the deep_learning_lab.model
module can be instantiated with a specified set of labels to segment.
By default, the input data is inference_data and the output directory is predictions.
The output directory contains the vignettes extracted from the images.
The output of the Predictor.start
method returns additional data such as the original image on which the bounding boxes and polygons are drawn.
Note that the inference post-processing has only been validated for one label at the moment.
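Here is a minimal inference sketch. Only the Predictor class and its start method are named in this README; the constructor argument and the shape of the returned data are assumptions.

```python
# Hypothetical inference run; the constructor argument is assumed.
from deep_learning_lab.model import Predictor

predictor = Predictor(labels=["TextLine"])

# Reads images from inference_data/ and writes the extracted vignettes to
# predictions/ by default. start() also returns additional data, such as the
# original images with bounding boxes and polygons drawn on them.
results = predictor.start()
```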
download_data.sh
is just an example of how to download data for the project.
It is unlikely that such a script could include all the datasets needed for good model performance, as many datasets cannot be downloaded as easily.
Most of the datasets below have been chosen from A survey of historical document image datasets [3]:
Some sources of datasets to patch are:
And here is their content:
Dataset | TextLine | TextRegion | Word | ImageRegion |
---|---|---|---|---|
FCR_500/data | 32177 | 1701 | - | - |
ABP_FirstTestCollection | 961 | 226 | - | - |
Bohisto_Bozen_SetP | 815 | 152 | - | - |
EPFL_VTM_FirstTestCollection | 252 | 38 | - | - |
HUB_Berlin_Humboldt | 693 | 81 | - | - |
NAF_FirstTestCollection | 930 | 164 | - | - |
StAM_Marburg_Grimm_SetP | 857 | 214 | - | - |
UCL_Bentham_SetP | 1024 | 191 | - | - |
unibas_e-Manuscripta | 848 | 96 | - | - |
ABP_FirstTestCollection | 4230 | 30 | - | - |
Bohisto_Bozen_SetP | 910 | 26 | - | - |
BHIC_Akten | 2339 | 30 | - | - |
EPFL_VTM_FirstTestCollection | 2790 | 28 | - | - |
HUB_Berlin_Humboldt | 885 | 28 | - | - |
NAF_FirstTestCollection | 6147 | 29 | - | - |
StAM_Marburg_Grimm_SetP | 1064 | 30 | - | - |
UCL_Bentham_SetP | 2294 | 31 | - | - |
unibas_e-Manuscripta | 1081 | 20 | - | - |
pinkas_dataset | 1013 | 175 | 13744 | - |
IEHHR-XMLpages | 3070 | 968 | 31501 | - |
ImageCLEF 2016 pages_train_jpg | 9645 | 765 | - | - |
REID2019 | - | 454 | - | 3 |
Some datasets that are already pixel-labeled (with arbitrary labels and colors):
- HBA (very diverse)
- SynDoc (text lines, red)
- IllusHisDoc (illustrations, red)
- DIVA-HisDB (various labels, red)
Some datasets that are accessible via IIIF (not supported yet):
Pixel-labeled datasets have not yet been used because they require manual intervention: their colours and labels must match those used in the patching step.
For the moment, only integration tests of the classes of the deep_learning_lab
package have been implemented.
To execute them, simply do as follows:
cd deep_learning/
pytest