vayvi/HDV

Historical Diagram Vectorization

This repo is the official implementation for Historical Astronomical Diagrams Decomposition in Geometric Primitives.

This repo builds on the code for DINO-DETR, the official implementation of the paper "DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection".

Introduction

We present a model which modifies DINO-DETR to perform historical astronomical diagram vectorization by predicting simple geometric primitives, such as lines, circles, and arcs.

[Figure: overview of the method]

Getting Started

1. Installation

The model was trained with Python 3.11.0, PyTorch 2.1.0, and CUDA 11.8. It builds on the DETR variants DINO/DN/DAB and Deformable-DETR.

  1. Clone this repository and create a virtual environment
    git clone git@github.com:vayvi/HDV.git
    cd HDV/
    python3 -m venv venv
    source venv/bin/activate
  2. Follow the instructions to install a PyTorch version compatible with your system and CUDA version
  3. Install the other dependencies
    pip install -r requirements.txt
  4. Compile the CUDA operators
    # if the build reports that CUDA is not available, first run: export CUDA_HOME=/usr/local/cuda-<version>
    python src/models/dino/ops/setup.py build install
    # unit test: every check should print True (this may raise an out-of-memory error)
    python src/models/dino/ops/test.py
  5. Install the local package for synthetic data generation
    pip install -e synthetic/.

2. Annotated Dataset and Model Checkpoint

Our annotated dataset and our main model checkpoints can be found here. Annotations are in SVG format. If you would like to process a custom annotated dataset, we provide helper functions for parsing SVG files in Python.
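
As a rough illustration of what parsing such annotations involves, the sketch below reads `line` and `circle` elements from an SVG string using only the Python standard library. It is not the repository's helper code: the function name `read_primitives` is hypothetical, and so is the assumption that primitives are stored as plain `<line>` and `<circle>` elements (arcs, for instance, may well be stored as `<path>` elements and are not handled here).

```python
import xml.etree.ElementTree as ET

SVG_NS = "{http://www.w3.org/2000/svg}"


def read_primitives(svg_text):
    """Return the lines and circles found in an SVG document (hypothetical helper)."""
    root = ET.fromstring(svg_text)
    lines, circles = [], []
    for elem in root.iter():
        # Strip the SVG namespace so tags compare as "line", "circle", ...
        tag = elem.tag.replace(SVG_NS, "")
        if tag == "line":
            lines.append(tuple(float(elem.get(k, 0)) for k in ("x1", "y1", "x2", "y2")))
        elif tag == "circle":
            circles.append(tuple(float(elem.get(k, 0)) for k in ("cx", "cy", "r")))
    return {"lines": lines, "circles": circles}


example = """<svg xmlns="http://www.w3.org/2000/svg" width="100" height="100">
  <line x1="0" y1="0" x2="50" y2="50"/>
  <circle cx="20" cy="30" r="10"/>
</svg>"""

primitives = read_primitives(example)
print(primitives)
```

For real annotation files, check which SVG elements and attributes the dataset actually uses before relying on a parser like this.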

To download the manually annotated dataset, run:

bash scripts/download_eida_data.sh

Datasets should be organized as follows:

HDV/
  data/
    ├── eida_dataset/
    │   └── images_and_svgs/
    └── custom_dataset/
        └── images_and_svgs/

To download the pretrained models, run:

bash scripts/download_pretrained_models.sh

Checkpoints should be organized as follows:

HDV/
  logs/
    ├── main_model/
    │   ├── checkpoint0012.pth
    │   ├── checkpoint0036.pth
    │   └── config_cfg.py
    ├── other_model/
    │   ├── checkpoint0044.pth
    │   └── config_cfg.py
    └── ...
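
The scripts in this README refer to checkpoints by model folder and zero-padded epoch number. The sketch below shows one way to list the epochs available in a model folder. It assumes that checkpoint files follow the `checkpoint<epoch>.pth` naming shown in the layout above; the helper name `list_checkpoint_epochs` is hypothetical and not part of the repository.

```python
import re
import tempfile
from pathlib import Path

CHECKPOINT_RE = re.compile(r"^checkpoint(\d+)\.pth$")


def list_checkpoint_epochs(model_dir):
    """Return the epoch strings of the checkpoint files in model_dir, in order."""
    epochs = []
    for path in Path(model_dir).iterdir():
        match = CHECKPOINT_RE.match(path.name)
        if match:
            epochs.append(match.group(1))
    # Sort numerically while keeping the zero-padded form used in file names.
    return sorted(epochs, key=int)


# Demo on a temporary folder that mimics logs/main_model/.
with tempfile.TemporaryDirectory() as tmp:
    model_dir = Path(tmp) / "logs" / "main_model"
    model_dir.mkdir(parents=True)
    for name in ("checkpoint0012.pth", "checkpoint0036.pth", "config_cfg.py"):
        (model_dir / name).touch()
    epochs = list_checkpoint_epochs(model_dir)

print(epochs)      # ['0012', '0036']
print(epochs[-1])  # 0036
```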

You can process the ground-truth data for evaluation using:

bash scripts/process_annotated_data.sh "eida_dataset" # or "custom_dataset", etc.

3. Synthetic Dataset

Generate Synthetic Dataset

The synthetic dataset generation process requires text and document background resources. We use the resources available in docExtractor and diagram-extraction. The code for generating the synthetic data is also heavily based on docExtractor.

To download the synthetic resources (backgrounds) for the synthetic dataset, run:

bash scripts/download_synthetic_resource.sh

Alternatively, download the synthetic resource folder here and unzip it in the data folder.

Evaluation and Testing

1. Evaluate our pretrained models

After downloading a model checkpoint and downloading and processing the evaluation dataset, you can evaluate a pretrained model with the following script, where:

  • model_name is the folder inside logs/ where the checkpoint file is located
  • epoch_number is the epoch number of the checkpoint file to use
  • data_folder_name is the name of the folder inside data/ where the evaluation dataset is located (defaults to eida_dataset)

bash scripts/evaluate_on_eida_final.sh <model_name> <epoch_number> <data_folder_name>

# for logs/main_model/checkpoint0036.pth on eida_dataset
bash scripts/evaluate_on_eida_final.sh main_model 0036 eida_dataset

# for logs/eida_demo_model/checkpoint0044.pth on eida_dataset
bash scripts/evaluate_on_eida_final.sh eida_demo_model 0044 eida_dataset

You should get the average precision (AP) for each primitive type at different distance thresholds.
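
For readers unfamiliar with the metric, the sketch below computes average precision from a list of scored predictions that have already been marked as matched or unmatched at one distance threshold. It is a generic illustration and not the repository's evaluation code: the function name, the all-point interpolation of the precision-recall curve, and the matching step that would produce the `is_match` flags are all assumptions.

```python
def average_precision(scores, is_match, num_ground_truth):
    """Area under the precision-recall curve (all-point interpolation)."""
    if num_ground_truth == 0:
        return 0.0
    # Rank predictions from most to least confident.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    precisions, recalls = [], []
    true_pos = 0
    for rank, idx in enumerate(order, start=1):
        if is_match[idx]:
            true_pos += 1
        precisions.append(true_pos / rank)
        recalls.append(true_pos / num_ground_truth)
    # Replace each precision by the best precision at any later rank (envelope).
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    # Sum rectangle areas wherever recall increases.
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


# Three predictions, two ground-truth primitives; the second prediction is a miss.
ap = average_precision([0.9, 0.8, 0.7], [True, False, True], num_ground_truth=2)
print(round(ap, 4))  # 0.8333
```

Repeating this computation with match flags obtained at several distance thresholds gives one AP value per threshold.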

If you want to run evaluation on all checkpoints available for a given model, you can use the following script (arguments prefixed with ? are optional):

bash scripts/evaluate_models_on_gt.sh <ground_truth> <?model_name> <?device_nb> <?batch_size> <?max_size>

# to evaluate all available models on ground truth (cf. svg_to_train.py script)
bash scripts/evaluate_models_on_gt.sh eida_dataset/groundtruth

# to evaluate only one model
bash scripts/evaluate_models_on_gt.sh eida_dataset/groundtruth main_model

2. Inference and Visualization

For inference and visualizing results over custom images, you can use this notebook.

You can also use the following script to run inference on a whole dataset (jpg images located in data/<data_set>/images/):

bash scripts/run_inference.sh <model_name> <epoch_number> <data_set> <export_formats>

# for logs/main_model/checkpoint0036.pth on eida_dataset with svg and npz export formats
bash scripts/run_inference.sh main_model 0036 eida_dataset svg+npz

Results will be saved in data/<data_set>/<export_format>_preds_<model_name><epoch_number>/.
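
To make the naming pattern concrete, the sketch below builds the expected output folders for a given run. It assumes that formats are joined with `+` as in the example above and that each format gets its own folder; the helper name `prediction_dirs` is hypothetical and not part of the repository.

```python
from pathlib import Path


def prediction_dirs(data_set, model_name, epoch_number, export_formats):
    """Return one output folder per export format, e.g. for "svg+npz"."""
    return [
        Path("data") / data_set / f"{fmt}_preds_{model_name}{epoch_number}"
        for fmt in export_formats.split("+")
    ]


dirs = prediction_dirs("eida_dataset", "main_model", "0036", "svg+npz")
for d in dirs:
    print(d.as_posix())
# data/eida_dataset/svg_preds_main_model0036
# data/eida_dataset/npz_preds_main_model0036
```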

To compare different inference runs on the same dataset, run the following script, which outputs an HTML file data/<data_set>/<filename>.html:

python src/util/html.py --data_set <data_set> --filename <filename>

Training

1. Training from scratch on synthetic data

To retrain the model from scratch on the synthetic dataset (generated on the fly), run:

bash scripts/train_model.sh

2. Training on a custom dataset

Turn SVG files into COCO-like annotations with src/svg_to_train.py, which takes the following options:

  • data_set is the folder inside data/ where the dataset is located (defaults to eida_dataset)
  • sanity_check is a flag to add if you want to visualize the processed annotations (the images will be saved in data/<data_set>/svgs/)
  • train_portion is a float between 0 and 1 used to split the dataset into train and val (defaults to 0.8)

The dataset folder should be organized as follows:

  data/
    └── <dataset_name>/
        ├── images/     # folder containing the images annotated by the files in svgs/
        └── svgs/       # folder containing the SVG files with the ground truth for training

python src/svg_to_train.py --data_set <dataset_name> --sanity_check

# for eida_dataset
python src/svg_to_train.py --data_set eida_dataset --sanity_check

Training data will be created in data/<dataset_name>/groundtruth/ and can be used to run the fine-tuning script. To train on a custom dataset, the ground-truth annotations must be in a COCO-like format and structured as follows:

  data/
    └── <groundtruth_data>/
        ├── annotations/     # folder containing JSON files (one for train, one for val) in COCO-like format
        ├── train/           # train images (corresponding to train.json)
        └── val/             # val images (corresponding to val.json)
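
The sketch below illustrates the two ideas involved: splitting a list of images according to train_portion, and the top-level keys of a COCO-style JSON file (`images`, `annotations`, `categories`). It is only an approximation under stated assumptions. The helper names, the shuffling strategy, the category names, and the empty annotation list are hypothetical, and the fields actually written by src/svg_to_train.py may differ.

```python
import json
import random


def split_train_val(image_names, train_portion=0.8, seed=0):
    """Shuffle image names and split them into train and val lists."""
    names = sorted(image_names)
    random.Random(seed).shuffle(names)
    cut = int(len(names) * train_portion)
    return names[:cut], names[cut:]


def coco_skeleton(image_names):
    """Build the top-level COCO keys; annotations are left empty here."""
    return {
        "images": [{"id": i, "file_name": n} for i, n in enumerate(image_names)],
        "annotations": [],  # one entry per primitive in a real dataset
        # Category names are illustrative, based on the primitives named in the introduction.
        "categories": [
            {"id": 1, "name": "line"},
            {"id": 2, "name": "circle"},
            {"id": 3, "name": "arc"},
        ],
    }


images = [f"diagram_{i:02d}.jpg" for i in range(10)]
train, val = split_train_val(images, train_portion=0.8)
print(len(train), len(val))  # 8 2
print(json.dumps(coco_skeleton(val), indent=2))
```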

Run the following script to train the model on the custom dataset:

  • model_name is the folder inside logs/ where the checkpoint file is located (the last checkpoint is used)
  • groundtruth_dir is the relative path to a folder inside data/ where the ground-truth dataset is located
  • device_nb is the GPU device number to use for training (defaults to 0)
  • batch_size is the batch size for training (defaults to 2)
  • max_size is the maximum image size for data augmentation (defaults to 1000), to prevent out-of-memory errors
  • learning_rate is the learning rate for training (defaults to 0.0001)
  • epoch_nb is the number of epochs to train (defaults to 50)

bash scripts/finetune_model.sh <model_name> <groundtruth_dir> <device_nb> <batch_size> <max_size> <learning_rate> <epoch_nb>

# to fine-tune main_model on device 2 using the data generated by the previous script
bash scripts/finetune_model.sh main_model eida_dataset/groundtruth 2
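
When launching several fine-tuning runs, it can help to assemble the command programmatically. The sketch below builds the argument list using the positional order and default values documented above. The helper name `finetune_command` is hypothetical, and passing every argument explicitly (rather than relying on the script's own defaults) is a choice made for this example.

```python
import shlex


def finetune_command(model_name, groundtruth_dir, device_nb=0, batch_size=2,
                     max_size=1000, learning_rate=0.0001, epoch_nb=50):
    """Build the argument list for scripts/finetune_model.sh."""
    return [
        "bash", "scripts/finetune_model.sh",
        model_name, groundtruth_dir,
        str(device_nb), str(batch_size), str(max_size),
        str(learning_rate), str(epoch_nb),
    ]


cmd = finetune_command("main_model", "eida_dataset/groundtruth", device_nb=2)
print(shlex.join(cmd))
# bash scripts/finetune_model.sh main_model eida_dataset/groundtruth 2 2 1000 0.0001 50
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from the repository root.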

The outputs of your run will be logged with wandb.

Bibtex

If you find this work useful, please consider citing:

@misc{kalleli2024historical,
    title={Historical Astronomical Diagrams Decomposition in Geometric Primitives},
    author={Syrine Kalleli and Scott Trigg and Ségolène Albouy and Matthieu Husson and Mathieu Aubry},
    year={2024},
    eprint={2403.08721},
    archivePrefix={arXiv},
    primaryClass={cs.CV}
}
