Paper: Dataset for Automatic Summarization of Russian News
- v1, Full dataset: gazeta_jsonl.tar.gz
- v1, Train: gazeta_train.jsonl
- v1, Validation: gazeta_val.jsonl
- v1, Test: gazeta_test.jsonl
- v1, spaCy analyses: gazeta_v1_spacy.tar.gz
- v1, Raw version without any cleaning (not recommended): gazeta_raw.txt
- v1, Train with oracle summaries: gazeta_train_oracle.jsonl
- v1, Validation with oracle summaries: gazeta_val_oracle.jsonl
- v1, Test with oracle summaries: gazeta_test_oracle.jsonl
- v1, Preprocessed data for mBART: gazeta_data_mbart_600_160_v2.tar.gz
UPDATE:
- v2, Full dataset: gazeta_jsonl_v2.tar.gz
- v2, spaCy analyses: gazeta_v2_spacy.tar.gz
- v2, Preprocessed data for mBART: gazeta_v2_data_mbart_600_160_v2.tar.gz
- v2/v1 diff: gazeta_new.jsonl
- v2/v1 raw diff without any cleaning: gazeta_raw_new.jsonl
- GitHub: https://github.com/IlyaGusev/gazeta/releases/tag/2.0
- Kaggle: https://www.kaggle.com/phoenix120/gazeta-summaries
- Huggingface: https://huggingface.co/datasets/IlyaGusev/gazeta
- Huggingface model (mBART): https://huggingface.co/IlyaGusev/mbart_ru_sum_gazeta
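A minimal sketch for reading the `.jsonl` files listed above, assuming they follow the usual JSON Lines layout (one UTF-8 JSON object per line). The helper names `read_jsonl` and `preview` are illustrative and not part of the dataset's tooling; no field names are assumed.

```python
import itertools
import json
from pathlib import Path
from typing import Iterator, List


def read_jsonl(path: str) -> Iterator[dict]:
    """Yield one record (a dict) per non-empty line of a JSON Lines file."""
    with Path(path).open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines instead of failing on them
                yield json.loads(line)


def preview(path: str, n: int = 3) -> List[dict]:
    """Return the first n records without loading the whole file."""
    return list(itertools.islice(read_jsonl(path), n))
```

Once a file such as `gazeta_train.jsonl` has been downloaded, `preview("gazeta_train.jsonl", 1)[0].keys()` shows which fields each record carries, which is a safer starting point than hard-coding field names.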
- Legal basis for distribution of the dataset:
- Gazeta Terms of Use, paragraph 2.1.2.
- Russian Civil Code, Article 1274, part 1
- All rights belong to "www.gazeta.ru". This dataset can be removed at the request of the copyright holder. It may be used only for non-commercial purposes.
- Cleaning:
- Data analysis:
- Summarization methods:
- Other Russian summarization datasets:
- Russian part of XL-Sum, parsed from www.bbc.com/russian, 77803 samples
- Russian part of MLSUM, parsed from www.mk.ru, 27063 samples
- Russian part of WikiLingua, parsed from WikiHow, 52928 samples
- Telegram: @YallenGusev
@InProceedings{Gusev2020gazeta,
author="Gusev, Ilya",
title="Dataset for Automatic Summarization of Russian News",
booktitle="Artificial Intelligence and Natural Language",
year="2020",
publisher="Springer International Publishing",
address="Cham",
pages="{122--134}",
isbn="978-3-030-59082-6",
doi={10.1007/978-3-030-59082-6_9}
}