---
license: apache-2.0
pretty_name: English Wiktionary Data in JSONL
---
wiktionary-data is a sub-data extraction of the English Wiktionary that currently supports the following languages:
- Deutsch - German
- Latinum - Latin
- Ἑλληνική - Ancient Greek
- 한국어 - Korean
- 𐎠𐎼𐎹 - Old Persian
- 𒀝𒅗𒁺𒌑(𒌝) - Akkadian
- Elamite
- संस्कृतम् - Sanskrit, or Classical Sanskrit
wiktionary-data was originally a sub-module of wilhelm-graphdb. As the dataset grew, I noticed it opened up exciting possibilities that stretch beyond the scope of the containing project. I therefore decided to promote it to a dedicated module, and hence this repo.
The Wiktionary language data is available on 🤗 Hugging Face Datasets.
```python
from datasets import load_dataset

dataset = load_dataset("QubitPi/wiktionary-data")
```
There are two data subsets:

- **Languages** subset that contains the extraction of a subset of supported languages:

  ```python
  dataset = load_dataset("QubitPi/wiktionary-data", "Wiktionary")
  ```

  The subset contains the following splits:

  - `German`
  - `Latin`
  - `AncientGreek`
  - `Korean`
  - `OldPersian`
  - `Akkadian`
  - `Elamite`
  - `Sanskrit`
- **Graph** subset that is useful for constructing knowledge graphs:

  ```python
  dataset = load_dataset("QubitPi/wiktionary-data", "Knowledge Graph")
  ```

  The subset contains the following splits:

  - `AllLanguage`: all the languages listed above in one giant graph
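As a sketch of how the Graph subset could feed a knowledge graph, the snippet below builds a simple adjacency-list graph from word pairs. The pairs and field layout here are illustrative assumptions, not the dataset's actual schema:

```python
from collections import defaultdict

# Hypothetical (source, target) word pairs standing in for
# related-term edges extracted from the Graph subset.
edges = [
    ("Hund", "hound"),
    ("Hund", "Hündin"),
    ("hound", "dog"),
]

# Adjacency list; relations are treated as undirected in this sketch.
graph = defaultdict(set)
for source, target in edges:
    graph[source].add(target)
    graph[target].add(source)

print(sorted(graph["Hund"]))  # neighbors of "Hund"
```

A real pipeline would read the edge endpoints from the loaded `Knowledge Graph` subset instead of a hard-coded list.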
The Graph data ontology is the following:
> [!TIP]
> Two words are *structurally similar* if and only if they share the same stem.
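A minimal sketch of this notion, using a toy suffix-stripping stemmer (purely illustrative; real stemming for the languages above is far more involved):

```python
def toy_stem(word: str) -> str:
    """Strip a few common English suffixes; a toy stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def structurally_similar(a: str, b: str) -> bool:
    """Two words are structurally similar iff they share the same stem."""
    return toy_stem(a) == toy_stem(b)

print(structurally_similar("walks", "walking"))  # True: both stem to "walk"
print(structurally_similar("walk", "talk"))      # False: different stems
```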
Although the original Wiktionary dump is available, parsing it from scratch involves a rather complicated process. For example, acquiring the inflection data of most Indo-European languages on Wiktionary has already triggered research-level efforts. We may do that ourselves in the future; at present, however, we simply build on the awesome work by tatuylonen, which has already processed the dump and presented it in JSONL format. wiktionary-data sources its data from the "raw Wiktextract data (JSONL, one object per line)" option there.
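JSONL keeps one JSON object per line, which makes the data easy to stream entry by entry. A minimal reading sketch (the keys below are made up for illustration, not the real Wiktextract schema):

```python
import json
from io import StringIO

# A two-line stand-in for a Wiktextract-style JSONL file;
# the "word"/"lang" keys are illustrative assumptions.
raw = StringIO(
    '{"word": "Hund", "lang": "German"}\n'
    '{"word": "canis", "lang": "Latin"}\n'
)

# Each line is an independent JSON object, so the file can be
# processed line by line without loading it all into memory.
entries = [json.loads(line) for line in raw]
for entry in entries:
    print(entry["word"], "->", entry["lang"])
```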
Get the source code:

```bash
git clone git@github.com:QubitPi/wiktionary-data.git
cd wiktionary-data
```
It is strongly recommended to work in an isolated environment. Install virtualenv and create an isolated Python environment by
```bash
python3 -m pip install --user -U virtualenv
python3 -m virtualenv .venv
```
To activate this environment:

```bash
source .venv/bin/activate
```

or, on Windows:

```bash
.venv\Scripts\activate
```
> [!TIP]
> To deactivate this environment, use
>
> ```bash
> deactivate
> ```
Then install the dependencies:

```bash
pip3 install -r requirements.txt
```
The use and distribution terms for wiktionary-data are covered by the Apache License, Version 2.0.