This repository contains the artifacts and source code for MaGRiTTE, a project to Machine-Generate Representations of Tabular files with Transformer Encoders.
To reproduce the results of the paper, please follow the instructions in the following sections.
To set up the environment, we recommend using a virtual environment to avoid conflicts with other packages. If using conda, run `conda env create -f environment.yml` to create the environment, and then `conda activate magritte` to activate it. If using pip, run `pip3 install -r requirements.txt`.
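Note that the pip route does not itself create a virtual environment; a minimal sketch using the standard `venv` module (the directory name is illustrative):

```bash
# Create and activate a virtual environment, then install the requirements
# (the directory name .venv-magritte is illustrative)
python3 -m venv .venv-magritte
source .venv-magritte/bin/activate
pip3 install -r requirements.txt
```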
The input data to reproduce the use case presented in the paper is available in the `data/massbay` folder.
Two scripts can be used to integrate these files.
- A hand-crafted script can be launched with `python3 use_case_manual.py`. The script will generate the result file `results/massbay/manual_integrated.csv`.
- The script that uses MaGRiTTE can be launched with `python3 use_case_magritte.py`. The script will generate the result file `results/massbay/magritte_integrated.csv`. A combined example is sketched after this list.
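As a sketch, both variants can be run back to back and their outputs compared (the `diff` step is purely illustrative; the MaGRiTTE run assumes the model weights have already been downloaded, as described below):

```bash
# Run both integration scripts; the MaGRiTTE version needs the model
# weights (see the Model Weights instructions below)
python3 use_case_manual.py
python3 use_case_magritte.py
# Compare the two integrated outputs (illustrative only)
diff results/massbay/manual_integrated.csv results/massbay/magritte_integrated.csv | head
```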
Running the MaGRiTTE version of the script requires downloading the corresponding weights for the model; see the Model Weights section below for download instructions.
The folder `data` contains the datasets used for the experiments. The datasets are organized in subfolders, each containing the data for a specific task.
Due to the space policy of GitHub, we publish some of our datasets as compressed folders in `tar.gz` format, or on external servers. To simplify download and extraction, we provide a script (`download_data.sh`) that automatically downloads and extracts the data. Run the `download_data.sh` script in the root repository folder to download and extract the data (requires a *nix system with `megatools` installed), as shown below.
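For example (the `apt` line is an assumption for Debian/Ubuntu-style systems; use your own package manager to install megatools):

```bash
# Install megatools first if needed (Debian/Ubuntu example; adapt as needed)
# sudo apt install megatools
# Then, from the repository root:
bash download_data.sh
```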
The datasets are organized as follows:
- `gittables`: contains the data from the GitTables dataset, organized for the pretraining task. To download and extract the dataset, run `download.sh` in the folder.
- `dialect_detection`: contains the data for the finetuning task for dialect detection. To extract the files automatically, run `extract.sh` in the folder.
- `row_classification`: contains the data for the finetuning task for row classification. Every dataset is contained in a separate `tar.gz` file, and their annotations are contained in the jsonl files `train_dev_annotations.jsonl` and `test_annotations.jsonl` (see the sketch after this list).
- `estimate`: contains the data for the estimate finetuning task.
- `columntype`: contains the data for the finetuning task for column type classification. There are three versions of the dataset, one for each scenario: (1) the `unprepared` folder contains raw files, (2) the `autoclean` folder contains the files after automated cleaning with MaGRiTTE, and (3) the `clean` folder contains the ground-truth cleaned files. The annotations for the column types of each dataset are contained in `csv` files in each folder.
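As a sketch of working with the row-classification data, assuming a placeholder archive name (`<dataset>.tar.gz` stands for one of the per-dataset archives; the actual names may differ):

```bash
cd data/row_classification
# <dataset>.tar.gz is a placeholder for one of the per-dataset archives
tar -xzf <dataset>.tar.gz
# The annotations are JSON Lines: one JSON object per line
head -n 2 train_dev_annotations.jsonl
```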
The folder `weights` contains the weights for the model used for the experiments.
Due to the space policy of GitHub, we publish the weights on external servers.
The main weights can be found at https://mega.nz/file/wJMhDbIa#Zmr23xd67xktcZtpvuu781Om2uwb1PWtQina6a_zwKg.
To simplify download and extraction, we provide a script (`download_weights.sh`) that automatically downloads and extracts the weights. Run the `download_weights.sh` script in the root repository folder to download and extract the weights (requires a *nix system with `megatools` installed). Alternatively, the weights can be manually downloaded from the links provided in the `links.txt` file.
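For a manual download, the `megadl` utility shipped with megatools can fetch a public mega.nz link directly; a minimal sketch using the main-weights URL listed above:

```bash
# Download the main weights with megadl (part of megatools); the quotes
# keep the shell from interpreting the '#' in the URL
megadl 'https://mega.nz/file/wJMhDbIa#Zmr23xd67xktcZtpvuu781Om2uwb1PWtQina6a_zwKg'
```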
The `data` folder contains the data used for the experiments, arranged in several subfolders. Once the environment is set up and the data has been downloaded, the three folders `configs`, `embedder`, and `training` can be used to pretrain/finetune MaGRiTTE on several tasks:

- `configs`: contains the configuration files in .jsonnet format used to train the models and run the experiments.
- `embedder`: contains the source code of the MaGRiTTE model, organized in subfolders depending on the pretraining/finetuning tasks.
- `training`: contains the scripts to train the models.
- `data`: contains the datasets with the ground truths for the finetuning tasks.
- `experiments`: contains the scripts to run the experiments for testing purposes after finetuning.
- `weights`: contains the weights of the models used for the paper.
For example, to finetune the model on the dialect detection task, it is sufficient to run `python3 training/train_dialect.py`, which reads the configuration file stored in `configs/dialect.jsonnet`.
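The other tasks presumably follow the same pattern. The script and config names below are assumptions inferred from the `results/{pretrain,dialect,rowclass,estimate}` layout mentioned below, so verify them against the `training` and `configs` folders:

```bash
# Assumed naming pattern (hypothetical; check training/ and configs/):
python3 training/train_rowclass.py   # would read configs/rowclass.jsonnet
python3 training/train_estimate.py   # would read configs/estimate.jsonnet
```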
Each training run saves intermediate artifacts in a corresponding `tensorboard` folder under `results/{pretrain,dialect,rowclass,estimate}`. To visualize the state of training, you can run e.g. `tensorboard --logdir results/pretrain/tensorboard` and open the browser at `localhost:6006`.
The model weights are saved in the folder `weights`. If you would like to skip the training phase, we provide the weights of the models used for the paper; refer to the Model Weights section for instructions on how to download them.
The `experiments` folder contains the scripts to run the experiments for testing purposes after finetuning. The scripts are organized in subfolders depending on the finetuning tasks. Each script loads the corresponding trained model from the `weights` folder and runs the experiments on the dev/test set.
To run the experiments, run e.g. `python3 experiments/dialect/magritte.py`. The results for each task are saved in the folder `results`, under a corresponding subfolder.
The folder `plots` can be used after the experiments to generate the plots for the paper. The plots are generated using Jupyter notebooks, one for each of the tasks, which read the results from the `results` folder and save the corresponding image in `.png` format.
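To run the notebooks, Jupyter can be launched from the repository root (assuming Jupyter is available in the environment; it may need to be installed separately):

```bash
# Launch Jupyter and open the plotting notebooks from the browser UI
# (assumes jupyter is installed, e.g. pip3 install notebook)
jupyter notebook plots/
```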