
Writer Identification and Writer Retrieval using Vision Transformer

Source code of

Koepf, M., Kleber, F., Sablatnig, R. (2022). Writer Identification and Writer Retrieval Using Vision Transformer for Forensic Documents. In: Uchida, S., Barney, E., Eglin, V. (eds) Document Analysis Systems. DAS 2022. Lecture Notes in Computer Science, vol 13237. Springer, Cham. https://doi.org/10.1007/978-3-031-06555-2_24

We present the first offline writer identification and writer retrieval method based on a Vision Transformer (ViT) and show that a ViT with 3.7M parameters trained from scratch competes with state-of-the-art methods. The paper was accepted for an oral presentation at DAS 2022.

Important

Please note that this repository contains the source code of a research paper. Publishing the code allows anyone to verify and reproduce our results. We acknowledge that the code is far from perfect and can certainly be improved from a software engineering perspective.

Project timeline:

  • The experiments for this work were originally conducted in spring and summer 2021.
  • The source code was made publicly available after the results were published in 2022.

Methodology

(Figure: overview of the methodology; see the images in the img directory.)

Results on Publicly Available Datasets

Note: The retrieval results listed below use the standardized Euclidean distance, which is a solid default choice. We also analyzed whether other distance metrics yield better results; for further details, please consult the published paper.
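
For readers unfamiliar with this metric, the sketch below (illustrative only, not code from this repository) shows how a query descriptor can be ranked against gallery descriptors using SciPy's standardized Euclidean distance; the array names and sizes are made up for the example.

    # Illustrative sketch: ranking gallery descriptors for one query with the
    # standardized Euclidean distance ('seuclidean') as implemented in SciPy.
    import numpy as np
    from scipy.spatial.distance import cdist

    rng = np.random.default_rng(0)
    gallery = rng.normal(size=(500, 384))   # one global descriptor per gallery document (made-up sizes)
    query = rng.normal(size=(1, 384))       # descriptor of the query document

    # 'seuclidean' scales each squared component difference by the per-dimension
    # variance, here estimated from all descriptors.
    variances = np.var(np.vstack([gallery, query]), axis=0, ddof=1)
    distances = cdist(query, gallery, metric='seuclidean', V=variances)[0]
    ranking = np.argsort(distances)         # gallery indices, nearest document first
    print(ranking[:10])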

CVL w/ enrollment (Classification)

Top k    Accuracy (%)
1        99.0
2        99.3
3        99.6
5        99.9
10       99.9

CVL w/o enrollment (Retrieval)

Measure        Score (%)
Top 1          97.4
Soft top 2     97.9
Soft top 3     98.2
Soft top 5     98.4
Soft top 10    98.7
Hard top 2     95.0
mAP            92.8

ICDAR 2013 (Retrieval)

Measure        Score (%)
Top 1          96.7
Soft top 2     98.4
Soft top 3     98.6
Soft top 5     98.8
Soft top 10    99.2
Hard top 2     76.7
Hard top 3     54.7
mAP            85.2

Repository Structure

├── data             # default directory for datasets
├── dataset_splits   # dataset metadata
├── img              # images used in README.md
├── requirements     # dependencies
├── runs             # default directory for TensorBoard
├── saved_models     # default directory for model saving during training
├── src              # source code for preprocessing, training and evaluation
├── Dockerfile       # run training and evaluation in a containerized environment
├── eval.py          # CLI tool for model evaluation
├── README.md        # this file
└── train.py         # CLI tool for model training

Reproducing Results

Setup

We used the following setup:

  • Python 3.8
  • CUDA 11.1

In the following, we give instructions on how to get up and running with a minimal setup, either (1) locally (directly on a host machine) or (2) with Docker. We recommend using Docker.

Local Setup (Directly on a Host Machine)

  1. Create a new virtual environment and activate it

    python3 -m venv ./venv && source venv/bin/activate
  2. Choose an appropriate constraints file (-c option) for installing PyTorch:

    • CUDA 11.1:
    pip3 install -r requirements/requirements.txt -c requirements/constraints-cuda111.txt
    • CPU:
    pip3 install -r requirements/requirements.txt -c requirements/constraints-cpu.txt

    In case the installation fails, use the --no-cache-dir option. For further information, please consult the official PyTorch documentation. After installation, you can verify the environment with the sanity check shown below.
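
The following snippet (illustrative only, not part of the repository) prints the relevant versions and checks whether a GPU is visible to PyTorch:

    # Quick sanity check of the Python/PyTorch/CUDA setup (illustrative only).
    import sys
    import torch

    print("Python:", sys.version.split()[0])        # expected: 3.8.x
    print("PyTorch:", torch.__version__)            # CUDA 11.1 builds are typically tagged +cu111
    print("CUDA available:", torch.cuda.is_available())
    if torch.cuda.is_available():
        print("CUDA version:", torch.version.cuda)  # expected: 11.1
        print("GPU:", torch.cuda.get_device_name(0))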

Docker Image

We also provide a Dockerfile based on the python:3.8-slim-buster image.

CPU
  • Build the image

    docker build --build-arg HOST_GID=$(id -g) --build-arg HOST_UID=$(id -u) --build-arg CONSTRAINTS_FILE=constraints-cpu.txt -t wi-wr-vit . 
  • Run a container in interactive mode (all necessary directories are mounted and the container is deleted on shutdown)

    docker run --rm -v $(pwd)/data:/app/data -v $(pwd)/runs:/app/runs -v $(pwd)/saved_models:/app/saved_models -it wi-wr-vit
GPU
  • Build the image

    docker build --build-arg HOST_GID=$(id -g) --build-arg HOST_UID=$(id -u) --build-arg CONSTRAINTS_FILE=constraints-cuda111.txt -t wi-wr-vit . 
  • Run a container in interactive mode with GPU support (all necessary directories are mounted and the container is deleted on shutdown)

    docker run --rm -v $(pwd)/data:/app/data -v $(pwd)/runs:/app/runs -v $(pwd)/saved_models:/app/saved_models --gpus all -it wi-wr-vit
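
    Note: the --gpus all option requires the NVIDIA Container Toolkit to be installed on the host; see the NVIDIA documentation for installation instructions.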

Training and Evaluation

Warning

When using an IDE (e.g., PyCharm, IntelliJ) or a desktop environment with indexing functionality, make sure to exclude the data directory from indexing before preprocessing any files. For GNOME, the folder is excluded automatically (a .trackerignore file is already placed in data). This also applies when you are using custom directories.

For training, use train.py (see ./train.py -h for further information).

Example:

python3 train.py --model vit-lite-7-4  --optim adamw --lr 0.0005 --num-epochs-warmup 5 --batch-size 128 --num-epochs 60 --num-epochs-patience 10 --num-workers 4 cvl-1-1_with-enrollment_pages
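
Training progress can be monitored with TensorBoard, as logs are written to the runs directory by default:

    tensorboard --logdir runs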

For evaluating a trained model, use eval.py (see ./eval.py -h for further information).

Examples:

  • CVL w/ enrollment split
python3 eval.py --classification --skip-retrieval --soft-top-k 1 2 3 4 5 6 7 8 9 10 --weights <path to trained model> --num-workers 4 -- cvl-1-1_with-enrollment_pages cvl-1-1_with-enrollment_pages
  • CVL w/o enrollment split using the standardized Euclidean distance for retrieval
python3 eval.py --soft-top-k 1 2 3 4 5 6 7 8 9 10 --hard-top-k 1 2 --weights <path to trained model> --metrics seuclidean --num-workers 4 -- icdar-2013_pages cvl-1-1_without-enrollment_pages

Acknowledgements

The computational results presented have been achieved in part using the Vienna Scientific Cluster (VSC).

License

The code in this repository is licensed under Apache 2.0.

Citing this Work

@inproceedings{koepf_writer-identification_2022,
   author = {Koepf, Michael and Kleber, Florian and Sablatnig, Robert},
   title = {Writer Identification and Writer Retrieval Using Vision Transformer for Forensic Documents},
   year = {2022},
   isbn = {978-3-031-06554-5},
   publisher = {Springer-Verlag},
   address = {Berlin, Heidelberg},
   doi = {10.1007/978-3-031-06555-2_24},
   abstract = {Writer identification and writer retrieval deal with the analysis of handwritten documents regarding the authorship and are used, for example, in forensic investigations. In this paper, we present a writer identification and writer retrieval method based on Vision Transformers. This is in contrast to the current state of the art, which mainly uses traditional Convolutional-Neural-Network-approaches. The evaluation of our self-attention-based and convolution-free method is done on two public datasets (CVL Database and dataset of the ICDAR 2013 Competition on Writer Identification) as well as a forensic dataset (WRITE dataset). The proposed system achieves a top-1 accuracy up to 99\% (CVL) and 97\% (ICDAR 2013). In addition, the impact of the used script (Latin and Greek) and the used writing style (cursive handwriting and block letters) on the recognition rate are analyzed and presented.},
   booktitle = {Document Analysis Systems: 15th IAPR International Workshop, DAS 2022, La Rochelle, France, May 22–25, 2022, Proceedings},
   pages = {352–366},
   numpages = {15},
   keywords = {Forensics, Vision Transformer, Writer identification, Writer retrieval},
   location = {La Rochelle, France}
}

References

Datasets

@inproceedings{louloudis_icdar_2013,
	address = {Washington, DC, USA},
	title = {{ICDAR} 2013 {Competition} on {Writer} {Identification}},
	isbn = {978-0-7695-4999-6},
	doi = {10.1109/ICDAR.2013.282},
	booktitle = {2013 12th {International} {Conference} on {Document} {Analysis} and {Recognition}},
	publisher = {IEEE},
	author = {Louloudis, G. and Gatos, B. and Stamatopoulos, N. and Papandreou, A.},
	month = aug,
	year = {2013},
	keywords = {dataset},
	pages = {1397--1401},
}
@inproceedings{kleber_cvl-database_2013,
	address = {Washington, DC, USA},
	title = {{CVL}-{DataBase}: {An} {Off}-{Line} {Database} for {Writer} {Retrieval}, {Writer} {Identification} and {Word} {Spotting}},
	isbn = {978-0-7695-4999-6},
	shorttitle = {{CVL}-{DataBase}},
	doi = {10.1109/ICDAR.2013.117},
	booktitle = {2013 12th {International} {Conference} on {Document} {Analysis} and {Recognition}},
	publisher = {IEEE},
	author = {Kleber, Florian and Fiel, Stefan and Diem, Markus and Sablatnig, Robert},
	month = aug,
	year = {2013},
	keywords = {dataset},
	pages = {560--564}
}
@article{he_fragnet_2020,
	title = {{FragNet}: {Writer} {Identification} {Using} {Deep} {Fragment} {Networks}},
	volume = {15},
	issn = {1556-6013, 1556-6021},
	shorttitle = {{FragNet}},
	doi = {10.1109/TIFS.2020.2981236},
	journal = {IEEE Transactions on Information Forensics and Security},
	author = {He, Sheng and Schomaker, Lambert},
	year = {2020},
	pages = {3013--3022},
}

Used Vision Transformer (ViT) Model Variant (ViT-Lite-7/4)

@article{hassani_escaping_2021,
   author = {Hassani, Ali and Walton, Steven and Shah, Nikhil and Abuduweili, Abulikemu and Li, Jiachen and Shi, Humphrey},
   title = {Escaping the {Big} {Data} {Paradigm} with {Compact} {Transformers}},
   url = {http://arxiv.org/abs/2104.05704},
   abstract = {With the rise of Transformers as the standard for language processing, and their advancements in computer vision, along with their unprecedented size and amounts of training data, many have come to believe that they are not suitable for small sets of data. This trend leads to great concerns, including but not limited to: limited availability of data in certain scientific domains and the exclusion of those with limited resource from research in the field. In this paper, we dispel the myth that transformers are "data hungry" and therefore can only be applied to large sets of data. We show for the first time that with the right size and tokenization, transformers can perform head-to-head with state-of-the-art CNNs on small datasets. Our model eliminates the requirement for class token and positional embeddings through a novel sequence pooling strategy and the use of convolutions. We show that compared to CNNs, our compact transformers have fewer parameters and MACs, while obtaining similar accuracies. Our method is flexible in terms of model size, and can have as little as 0.28M parameters and achieve reasonable results. It can reach an accuracy of 95.29 \% when training from scratch on CIFAR-10, which is comparable with modern CNN based approaches, and a significant improvement over previous Transformer based models. Our simple and compact design democratizes transformers by making them accessible to those equipped with basic computing resources and/or dealing with important small datasets. Our method works on larger datasets, such as ImageNet (80.28\% accuracy with 29\% parameters of ViT), and NLP tasks as well. Our code and pre-trained models are publicly available at https://github.com/SHI-Labs/Compact-Transformers.},
   urldate = {2021-07-19},
   journal = {arXiv:2104.05704 [cs]},
   month = jun,
   year = {2021},
   note = {arXiv: 2104.05704},
}
