GlotSparse: Building Corpora in Under-Resourced Languages

| Language | ISO Code | Websites | Download Link of Data | Links that referenced these websites |
| --- | --- | --- | --- | --- |
| Balochi | bal | sunnionline, kissah, baask | GlotSparse | PALI, mlops.systems |
| Gilaki | glk | gilaki_twitter | GlotSparse | - |
| Brahui | brh | talarbrahui | GlotSparse | PALI |
| Southern-Kurdish | sdh | shafaq.com/ku (Feyli) | GlotSparse | PALI |
| Gurani | hac | anfsorani.com/هۆرامی (Hawrami) | GlotSparse | Zaza–Gorani |
| Kirmanjki | kiu | anfkirmancki.com | GlotSparse | Zaza–Gorani |
| Fanti | fat | akannews.com/fante/ | GlotSparse | - |
| Twi | twi | akannews.com/asante-twi/ | GlotSparse | - |
| South-Azerbaijani | azb | trt.net.tr/turki | GlotSparse | - |
| Southern Uzbek | uzs | trt.net.tr/afghaniuzbek | GlotSparse | - |

License

We do not own any of the text from which these data have been extracted. We license the packaging, the metadata, and the annotations of these data under CC0-1.0 (waiving all rights under copyright law).

If you are a website or dataset owner and do not want your data to be included in this table, please email us at amir@cis.lmu.de.

Ethical Considerations

  1. Biases: The text corpus may reflect the perspectives, opinions, or demographics of its sources or creators. Users should critically evaluate the text in context, especially for news sources and social media.

  2. Representativeness: While we have aimed for diversity and inclusivity, the text corpus may not fully represent all native speakers. Users should be mindful of any potential underrepresentation.

  3. Ethics: We acknowledge that the collection and use of text data can have ethical implications. We have strived to handle the data responsibly, but we encourage users to consider the broader ethical implications of their own research or applications.

  4. Robots.txt: We respect the robots.txt of the websites we collect from. For background, see https://palewi.re/docs/news-homepages/openai-gptbot-robotstxt.html
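
For readers who want to apply the same robots.txt courtesy when extending the corpus, the following is a minimal sketch using Python's standard `urllib.robotparser`. It is not the crawler used to build GlotSparse. The robots.txt content, the `example.org` URLs, and the user-agent name `my-corpus-crawler` are all hypothetical and only serve the illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the example runs offline.
# A real crawler would fetch https://<site>/robots.txt instead.
EXAMPLE_ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """Return True if robots.txt permits user_agent to fetch url."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(EXAMPLE_ROBOTS_TXT)

# GPTBot is blocked site-wide by the first rule group.
print(is_allowed(parser, "GPTBot", "https://example.org/news/1"))

# Other agents fall under the wildcard group: only /private/ is off limits.
print(is_allowed(parser, "my-corpus-crawler", "https://example.org/news/1"))
print(is_allowed(parser, "my-corpus-crawler", "https://example.org/private/x"))
```

Against a live site, `RobotFileParser.set_url(...)` followed by `read()` would replace the inlined text; the check itself stays the same.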

Citation

If you use any part of this metadata or the GlotSparse data in your research, please cite it using the following BibTeX entry. This work is compiled as part of the GlotLID project and may be extended as the GlotWeb project progresses.

@inproceedings{
  kargaran2023glotlid,
  title={{GlotLID}: Language Identification for Low-Resource Languages},
  author={Kargaran, Amir Hossein and Imani, Ayyoob and Yvon, Fran{\c{c}}ois and Sch{\"u}tze, Hinrich},
  booktitle={The 2023 Conference on Empirical Methods in Natural Language Processing},
  year={2023},
  url={https://openreview.net/forum?id=dl4e3EBz5j}
}