Due to the lack of German multi-document summarization data, we have created a new multi-document summarization (MDS) dataset called 'Multi-GeNews'. The dataset contains news articles sourced from the news portal of SRF, a Swiss media company; the included articles span January to March 2020.
The dataset includes hyperlinks to the webpages of the original source articles. We provide a Python script, `source_article_text_downloader.py`, which downloads the text from these linked webpages and automatically integrates it into the existing dataset.
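For illustration only, the download step for a single article could look roughly like the sketch below. It assumes the article body can be scraped from the paragraph tags of the linked page using `requests` and BeautifulSoup; the bundled `source_article_text_downloader.py` may use different parsing logic.

```python
# Illustrative sketch, not the actual implementation of source_article_text_downloader.py.
# Assumes the article body can be recovered from the <p> tags of the linked SRF page.
import requests
from bs4 import BeautifulSoup

def fetch_article_text(article_link: str) -> str:
    """Download an article page and return its concatenated paragraph text."""
    response = requests.get(article_link, timeout=30)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    paragraphs = [p.get_text(strip=True) for p in soup.find_all("p")]
    return "\n".join(paragraphs)
```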
- The dataset can be found under `Multi-GeNews.jsonl`.
- Each line of the jsonl file corresponds to a single cluster with the following JSON format (see the loading sketch below):

      {
        "articles": [
          { "title": "Title of Article 1", "text": "Text of Article 1", "article_link": "Link to Article 1" },
          { "title": "Title of Article 2", "text": "Text of Article 2", "article_link": "Link to Article 2" },
          ...
        ],
        "summary": "Summary of the cluster"
      }
- Clone this repository to your local machine.
- Install the requirements: `pip install -r requirements.txt`
- Run the download script: `python3 source_article_text_downloader.py`
- The command takes some time to download all the data. Once it finishes, a file called `Multi-GeNews-With-Text.jsonl` is created, containing the downloaded source article texts.
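To verify that the download completed, one can check that every article in the resulting file has a non-empty `text` field. This is a small sanity check written for this README, not part of the provided tooling.

```python
import json

# Count articles in Multi-GeNews-With-Text.jsonl whose text field is still empty.
empty = 0
with open("Multi-GeNews-With-Text.jsonl", encoding="utf-8") as f:
    for line in f:
        if not line.strip():
            continue
        cluster = json.loads(line)
        empty += sum(1 for article in cluster["articles"] if not article["text"].strip())
print(f"Articles with empty text: {empty}")
```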
When using the Multi-GeNews dataset, please cite:
    @inproceedings{mascarell-etal-2023-entropy,
        title = "Entropy-based Sampling for Abstractive Multi-document Summarization in Low-resource Settings",
        author = "Mascarell, Laura and Chalumattu, Ribin and Heitmann, Julien",
        booktitle = "Proceedings of the 16th International Conference on Natural Language Generation",
        month = sep,
        year = "2023",
        address = "Prague, Czech Republic",
        publisher = "Association for Computational Linguistics",
        url = "https://doi.org/10.3929/ethz-b-000624074"
    }