Açık Seminer 2020 - Turkish NLP Seminar and Workshop
This repo includes the notebooks and slides for the Turkish Natural Language Processing workshop. The implemented modules are:
- Text preprocessing
- Named Entity Recognition with SpaCy
- Unsupervised text classification with K-Means
TWNERTC (Turkish Wikipedia Named-Entity Recognition and Text Categorization Dataset) by Sahin et al. is used for Named Entity Recognition. The dataset contains approximately 300K named entities across 77 domains, with more than 1,000 fine-grained entity types. A subset of the dataset (the astronomy domain) is provided in the repo, and the full cleaned version of the dataset in JSON format can be downloaded here.
JSON schema
[
  {
    TOPIC_1: {
      SENTENCE_1: {
        "entities": [
          [START_INDEX, END_INDEX, ENTITY_LABEL],
          ...
        ]
      },
      SENTENCE_2: {...}
    },
    TOPIC_2: {...}
  }
]
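The schema above can be walked with plain Python. The following is a minimal sketch, not the workshop's own loading code: it assumes each sentence key is the sentence text itself and that END_INDEX is exclusive (as in spaCy's character offsets). The sample data, the `planet` label, and the helper names `iter_examples` and `to_spacy_format` are made up for illustration.

```python
import json

# Tiny hand-made sample that mirrors the schema; it is NOT taken from TWNERTC.
SAMPLE = """
[
  {
    "astronomy": {
      "Mars bir gezegendir.": {
        "entities": [[0, 4, "planet"]]
      }
    }
  }
]
"""


def iter_examples(data):
    """Yield (topic, sentence, entities) for every sentence in the parsed JSON."""
    for block in data:  # top-level list of topic dictionaries
        for topic, sentences in block.items():
            for sentence, annotation in sentences.items():
                # Convert each [start, end, label] list into a tuple.
                entities = [
                    (start, end, label)
                    for start, end, label in annotation["entities"]
                ]
                yield topic, sentence, entities


def to_spacy_format(data):
    """Build (text, {"entities": [...]}) pairs, the tuple layout used by spaCy v2 training loops."""
    return [
        (sentence, {"entities": entities})
        for _, sentence, entities in iter_examples(data)
    ]


data = json.loads(SAMPLE)
train_data = to_spacy_format(data)

for text, annotation in train_data:
    for start, end, label in annotation["entities"]:
        # Assumes END_INDEX is exclusive, so slicing recovers the entity text.
        print(label, "->", text[start:end])
```

To use the real file, replace `json.loads(SAMPLE)` with `json.load(...)` on the downloaded dataset, and check a few slices by hand to confirm the offset convention before training.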
A small Turkish news dataset crawled from various news websites is used for text clustering. It contains news articles in 5 categories (economy, arts, politics, sports, technology), with 100 samples per category.
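As a rough illustration of clustering news text with K-Means, the sketch below vectorizes a few documents with TF-IDF and groups them using scikit-learn. It is an assumption about one reasonable setup, not the notebook's actual pipeline: the four sentences are invented stand-ins for the crawled dataset, and the choice of TF-IDF features and 2 clusters is purely illustrative (the real dataset has 5 categories).

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented example sentences (ASCII-only), standing in for the news dataset.
docs = [
    "Merkez bankasi faiz oranlarini artirdi",
    "Borsa gune yukselisle basladi ve dolar geriledi",
    "Takim ligde son macini kazandi",
    "Futbolcu macta iki gol atti",
]

# Turn each document into a sparse TF-IDF vector.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

# Fit K-Means; random_state makes the run repeatable.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

for label, doc in zip(labels, docs):
    print(label, doc)
```

Cluster ids are arbitrary integers, so they have to be matched to category names by inspecting the documents in each cluster. With so few documents the grouping is not meaningful; the point is only the shape of the pipeline.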
Clone the repo and install the requirements before running the notebooks:
git clone https://github.com/alaradirik/TR-NLP-workshop.git
cd TR-NLP-workshop
pip install -r requirements.txt