This repository contains the Python code used to generate Tough Tables (2T), a dataset for benchmarking table annotation algorithms on the CEA and CTA tasks (as defined in the SemTab challenge). The target KG is DBpedia 2016-10.
The 2T dataset is available on Zenodo.
The 2T dataset is compliant with the SemTab 2019 format. Any annotation algorithm that produces a results file compatible with the SemTab challenge submission file format can be evaluated. For details, see SemTab 2019 (CEA, CTA).
This work is based on the following paper:
Cutrona, V., Bianchi, F., Jimenez-Ruiz, E. and Palmonari, M. (2020). Tough Tables: Carefully Evaluating Entity Linking for Tabular Data. ISWC 2020, LNCS 12507, pp. 1–16.
The code is developed for Python 3.8.
Install all the required packages listed in the `requirements.txt` file:
```bash
virtualenv -p python3.8 venv  # we suggest creating a virtual environment
source venv/bin/activate
pip install -r requirements.txt
```
The following command reads the tables under the `control` and `tough` directories and generates the gold standard (GS):
```bash
python tough_tables.py make_gs --output_folder ./gs \
    --endpoint http://dbpedia.org/sparql
```
Note: the resulting GS may differ across executions, due to the unordered results of SPARQL queries.
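The order in which a SPARQL endpoint returns solutions is unspecified unless the query imposes an `ORDER BY`; here is a minimal illustration (not repository code, assuming the `SPARQLWrapper` package):

```python
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("http://dbpedia.org/sparql")
# Without ORDER BY, LIMIT 10 may pick a different (or differently ordered)
# subset of solutions on each execution.
sparql.setQuery(
    "SELECT ?city WHERE { ?city a <http://dbpedia.org/ontology/City> } LIMIT 10"
)
sparql.setReturnFormat(JSON)
bindings = sparql.query().convert()["results"]["bindings"]

# Sorting client-side (or adding ORDER BY ?city) makes a single run stable.
cities = sorted(b["city"]["value"] for b in bindings)
```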
Starting from the GS tables, the following command generates (a) the set of tables to annotate and (b) the ground truth file:
```bash
python tough_tables.py to_cea --input_folder ./gs \
    --output_tables_folder ./2T/tables \
    --output_gs_folder ./2T/gt \
    --output_target_folder ./2T/targets \
    --endpoint http://dbpedia.org/sparql \
    --sameas_file dbp_sameas.json
```
The `resources/dbp_sameas.json` file contains the collection of all the sameAs links used to build 2T.
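The exact schema of the file is not documented here; a plausible way to load it (assuming it maps each DBpedia URI to a list of equivalent URIs — verify against the actual file):

```python
import json

# Assumed structure: {dbpedia_uri: [equivalent_uri, ...]}; check the real file.
with open("resources/dbp_sameas.json") as f:
    sameas = json.load(f)

equivalents = sameas.get("http://dbpedia.org/resource/Rome", [])
```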
It is possible to derive the CTA ground truth from the CEA ground truth using a majority voting strategy (sketched after the command below).
```bash
python tough_tables.py cta_from_cea --cea_gs_file ./2T/gt/CEA_2T_gt.csv \
    --output_gs_folder ./2T/gt \
    --output_target_folder ./2T/targets \
    --instance_types_file ./instance_types_en.ttl \
    --ontology_file ./dbpedia_2016-10.nt
```
The command requires two external sources:

- the `instance_types_en.ttl` file, containing the list of all the DBpedia instances and their types
- the DBpedia ontology (`.nt` file)
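For intuition, here is a minimal sketch of the majority-voting idea; the helper below is hypothetical, not the repository's implementation (see `tough_tables.py` for the real one). For each column, the types of the entities annotated in the CEA ground truth are tallied, and the most frequent type wins:

```python
from collections import Counter

def cta_by_majority(column_entities, entity_types):
    """Pick the most frequent type among a column's annotated entities.

    column_entities: entity URIs annotated in the column (from the CEA GT)
    entity_types: dict mapping entity URI -> list of its DBpedia type URIs
    """
    votes = Counter(t for e in column_entities
                    for t in entity_types.get(e, []))
    return votes.most_common(1)[0][0] if votes else None
```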
To score an algorithm, run:
```bash
python tough_tables.py score_cea --annotations_file <your_annotation_file.csv> \
    --gs_file ./2T_cea/2T_gt.csv
```
The annotations file format must be the same as used in the SemTab 2019 challenge: (tab_id, col_id, row_id, annotation).
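For example, such a file can be produced with the standard `csv` module (illustrative values only):

```python
import csv

# Illustrative rows in the (tab_id, col_id, row_id, annotation) layout.
rows = [
    ("table_001", 0, 1, "http://dbpedia.org/resource/Rome"),
    ("table_001", 0, 2, "http://dbpedia.org/resource/Milan"),
]
with open("my_annotations.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)
```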
Along with the overall result (ALL), all performance metrics are computed for each category of tables.
A radar plot (`<your_annotation_file>.pdf`) is saved in the submission file directory.
Other utility commands are available in the script. See the full list by executing:
```bash
python tough_tables.py --help
```
The 2T dataset has been converted into a corresponding Wikidata version, which has been adopted as part of the SemTab 2020 challenge (Round 4).
NOTE: the new format for CEA is <tab_id, row_id, col_id, entity>. Check out the SemTab 2020 website for more details.
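If your annotations are in the original (tab_id, col_id, row_id, entity) layout, here is a small sketch to reorder them (assuming `pandas` and a headerless CSV; file names are illustrative):

```python
import pandas as pd

# Reorder columns from (tab_id, col_id, row_id, entity)
# to the SemTab 2020 layout (tab_id, row_id, col_id, entity).
df = pd.read_csv("my_annotations.csv", header=None,
                 names=["tab_id", "col_id", "row_id", "entity"])
df[["tab_id", "row_id", "col_id", "entity"]].to_csv(
    "my_annotations_2020.csv", header=False, index=False)
```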
The conversion script `to_wikidata.py` requires the following files to be downloaded and put in the `resources` directory to generate a conversion map:
NOTE: commented lines (e.g., "# started 2017-07-06T12:05:32Z") must be removed from the above files.
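A generic way to strip such lines (not repository code; file names are illustrative):

```python
# Drop comment lines (e.g., "# started 2017-07-06T12:05:32Z") from a dump.
with open("dump.ttl") as src, open("dump_clean.ttl", "w") as dst:
    for line in src:
        if not line.startswith("#"):
            dst.write(line)
```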
A pre-computed conversion map is available under the `resources` directory (`db_wd_conversion_map.pickle`).
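A sketch for inspecting the map (assuming it unpickles to a plain dict from DBpedia URIs to Wikidata URIs — verify the actual structure before relying on it):

```python
import pickle

with open("resources/db_wd_conversion_map.pickle", "rb") as f:
    conversion_map = pickle.load(f)

# Assumed structure: {dbpedia_uri: wikidata_uri}.
wd_uri = conversion_map.get("http://dbpedia.org/resource/Rome")
```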
Along with the packages listed in the requirements, this repository uses the tabular-data-semantics-py package to query SPARQL endpoints. We slightly adapted the package to meet our needs; the adapted version is available under the `tabular_semantics` directory.
In previous versions, we used the py-sparql-transformer package to query the DBpedia SPARQL endpoint.