(Code) | (Axioms & Assertions)
Cat2Ax is an approach for the extraction of axioms and assertions for Wikipedia categories to enrich the ontology of a Wikipedia-based knowledge graph.
In this package you can also find implementations of the two most closely related approaches Catriple and C-DF.
How to cite?
@inproceedings{heist2019uncovering,
title={Uncovering the Semantics of Wikipedia Categories},
author={Heist, Nicolas and Paulheim, Heiko},
booktitle={International Semantic Web Conference},
pages={219--236},
year={2019},
organization={Springer}
}
A preprint of the paper can be found on arXiv.
- Python 3
- pipenv (https://pipenv.readthedocs.io/en/latest/)
Note: If you have problems with your pipenv installation, you can also run the code directly via python. Just make sure to install all the dependencies given in Pipfile
and Pipfile.lock
.
- You need a machine with at least 100 GB of RAM as we load most of DBpedia in memory to speed up the extraction
- If that is not possible for you and you nevertheless want to run the extraction, you can change the functionalities in
impl.category.store
andimpl.dbpedia.store
to use a database instead
- If that is not possible for you and you nevertheless want to run the extraction, you can change the functionalities in
- During the first execution of an extraction you need a stable internet connection as the required DBpedia files are downloaded automatically
- In the project source directory, create and initialize a virtual environment with pipenv (run in terminal):
pipenv install
- If you have not downloaded them already, you have to fetch the corpora for spaCy and nltk-wordnet (run in terminal):
pipenv run python -m spacy download en_core_web_lg # download the most recent corpus of spaCy
pipenv run python -c 'import nltk; nltk.download("wordnet")' # download the wordnet corpus of ntlk
You can configure the application-specific parameters as well as logging- and file-related parameters in config.yaml
.
Run the extraction methods with pipenv:
(Hint: A parallel execution of the extraction scripts is not recommended and might lead to incorrect results or errors)
pipenv run python3 cat2ax.py # approx. 7h runtime
pipenv run python3 catriple.py # approx. 8h runtime
pipenv run python3 cdf.py # approx. 12h runtime
All the required resources, like DBpedia files, will be downloaded automatically during execution.
The extracted axioms and assertions are placed in the results
folder.
If you want to extract the type lexicalisations (not necessary as we provide them as cache file), run the following:
pipenv run type_lexicalisations.py # runtime of several days!
To reproduce the main graphs of the paper, first run the extraction (or download the results and place them in the results
folder) and then run the following:
pipenv run graphs.py # approx. 0.5h runtime
The produced graphs are placed in the results
folder.
If you don't want to run the extraction yourself, you can find the results here.
The code is documented on function-level. For an overview of the architecture of the extraction tool, refer to this diagram: