Language | ISO Code | Websites | Data Download Link | Links Referencing These Websites |
---|---|---|---|---|
Balochi | bal | sunnionline, kissah, baask | GlotSparse | PALI, mlops.systems |
Gilaki | glk | gilaki_twitter | GlotSparse | - |
Brahui | brh | talarbrahui | GlotSparse | PALI |
Southern-Kurdish | sdh | shafaq.com/ku (Feyli) | GlotSparse | PALI |
Gurani | hac | anfsorani.com/هۆرامی (Hawrami) | GlotSparse | Zaza–Gorani |
Kirmanjki | kiu | anfkirmancki.com | GlotSparse | Zaza–Gorani |
Fanti | fat | akannews.com/fante/ | GlotSparse | - |
Twi | twi | akannews.com/asante-twi/ | GlotSparse | - |
South-Azerbaijani | azb | trt.net.tr/turki | GlotSparse | - |
Southern Uzbek | uzs | trt.net.tr/afghaniuzbek | GlotSparse | - |
We do not own any of the text from which these data have been extracted. We license the packaging, the metadata, and the annotations of these data under cc0-1.0 (waiving all rights under copyright law).
If you are a website or dataset owner and do not want your data to be included in this table, please send us an email at amir@cis.lmu.de.
- Biases: The text corpus may reflect the perspectives, opinions, or demographics of its sources or creators. Users should critically evaluate the text in context, especially for news sources and social media.
- Representativeness: While we have aimed for diversity and inclusivity, the text corpus may not fully represent all native speakers. Users should be mindful of any potential underrepresentation.
- Ethics: We acknowledge that the collection and use of text data can have ethical implications. We have strived to handle the data responsibly, but we encourage users to consider the broader ethical implications of their own research or applications.
- We respect robots.txt; see https://palewi.re/docs/news-homepages/openai-gptbot-robotstxt.html
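
For readers unfamiliar with how a crawler can honor robots.txt, the following is a minimal sketch using Python's standard `urllib.robotparser`. It is not the pipeline used to build GlotSparse. The robots.txt content, the `example-crawler` user agent, the `example.org` URLs, and the `is_allowed` helper are all hypothetical and chosen only for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used here instead of a network fetch.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if `user_agent` may fetch `url` under the given rules."""
    parser = RobotFileParser()
    # parse() accepts the robots.txt file as an iterable of lines.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical user agent and URLs.
print(is_allowed(ROBOTS_TXT, "example-crawler", "https://example.org/news/article-1"))
print(is_allowed(ROBOTS_TXT, "example-crawler", "https://example.org/private/page"))
```

In a real crawler, one would typically call `set_url()` and `read()` on the parser to load a site's live robots.txt, then check each URL with `can_fetch()` before requesting it.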
If you use any part of this metadata or the GlotSparse data in your research, please cite it using the following BibTeX entry. This work is compiled as part of the GlotLID project and may be extended as the GlotWeb project progresses.
@inproceedings{
kargaran2023glotlid,
title={{GlotLID}: Language Identification for Low-Resource Languages},
author={Kargaran, Amir Hossein and Imani, Ayyoob and Yvon, Fran{\c{c}}ois and Sch{\"u}tze, Hinrich},
booktitle={The 2023 Conference on Empirical Methods in Natural Language Processing},
year={2023},
url={https://openreview.net/forum?id=dl4e3EBz5j}
}