Commit

Merge pull request #114 from Yale-LILY/nick/xlsum
Add XLSum and Massivesumm datasets
niansong1996 authored Feb 20, 2022
2 parents f086816 + 639842e commit e6207c2
Showing 9 changed files with 988 additions and 7 deletions.
28 changes: 24 additions & 4 deletions README.md
@@ -173,6 +173,9 @@ SummerTime supports different summarization datasets across different domains (e
| ScisummNet | Scientific articles | 1k | 4.7k | 150 | | | | |
| SummScreen | TV shows | 26.9k | 6.6k | 337.4 | | | :heavy_check_mark: | |
| XSum | News | 226k | 431 | 23.3 | | | | |
+| XLSum | News | 1.35m | ??? | ??? | | | | 45 languages ([see documentation](https://huggingface.co/datasets/csebuetnlp/xlsum)) |
+| MassiveSumm | News | 12m+ | ??? | ??? | | | | 78 languages (see Multilingual Summarization section of README for details) |
+

To see all supported datasets, run:

@@ -346,10 +349,27 @@ corpus = itertools.islice(mlsum.train_set, 5)
corpus = [instance.source for instance in train_set]

# mt5 model will automatically detect Spanish as the language and indicate that this is supported!
-mt5_model.summarize()
-```
-
-Soon to come: a simple pipeline model to first translate input text to English and then use monolingual models!
+mt5_model.summarize(corpus)
+```
+
+The following languages are currently supported in our implementation of the MassiveSumm dataset:
+Afrikaans, Amharic, Arabic, Assamese, Aymara,
+Azerbaijani, Bambara, Bengali, Tibetan, Bosnian,
+Bulgarian, Catalan, Czech, Welsh, Danish, German,
+Greek, English, Esperanto, Persian, Filipino, French,
+Fulah, Irish, Gujarati, Haitian, Hausa, Hebrew,
+Hindi, Croatian, Hungarian, Armenian, Igbo, Indonesian,
+Icelandic, Italian, Japanese, Kannada, Georgian, Khmer,
+Kinyarwanda, Kyrgyz, Korean, Kurdish, Lao, Latvian,
+Lingala, Lithuanian, Malayalam, Marathi, Macedonian,
+Malagasy, Mongolian, Burmese, South Ndebele, Nepali,
+Dutch, Oriya, Oromo, Punjabi, Polish, Portuguese,
+Dari, Pashto, Romanian, Rundi, Russian, Sinhala,
+Slovak, Slovenian, Shona, Somali, Spanish, Albanian,
+Serbian, Swahili, Swedish, Tamil, Telugu, Tetum,
+Tajik, Thai, Tigrinya, Turkish, Ukrainian, Urdu,
+Uzbek, Vietnamese, Xhosa, Yoruba, Yue Chinese,
+Chinese, Bislama, and Gaelic.
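
A minimal sketch of the corpus-building pattern from this hunk's context lines (slice the first few instances off a split, then keep each instance's `source` text), assuming instances expose a `source` attribute as those lines suggest. `Instance`, `fake_train_set`, and `first_sources` are hypothetical stand-ins introduced for illustration, not SummerTime APIs:

```python
import itertools
from dataclasses import dataclass
from typing import Iterator, List


@dataclass
class Instance:
    """Hypothetical stand-in for a dataset instance; not a SummerTime class."""
    source: str
    summary: str


def fake_train_set() -> Iterator[Instance]:
    """Hypothetical generator standing in for a lazily loaded train split."""
    for i in itertools.count():
        yield Instance(source=f"document {i}", summary=f"summary {i}")


def first_sources(split: Iterator[Instance], n: int) -> List[str]:
    """Take the first n instances lazily, then keep only their source text."""
    head = itertools.islice(split, n)
    # Iterate over the sliced iterator, not the full (possibly unbounded) split.
    return [instance.source for instance in head]


corpus = first_sources(fake_train_set(), 5)
print(corpus)
```

Because the comprehension consumes the sliced iterator rather than the split itself, only the first five instances are ever produced, so the same pattern should work on a large or streaming split.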

## Evaluation
SummerTime supports different evaluation metrics including: BertScore, Bleu, Meteor, Rouge, RougeWe
4 changes: 4 additions & 0 deletions requirements.txt
@@ -18,6 +18,9 @@ gdown
gensim==3.8.3
sklearn
gdown
+readability-lxml
+beautifulsoup4
+orjson
py7zr==0.16.1
prettytable==2.2.1
mpi4py==3.0.3
@@ -28,3 +31,4 @@ easynmt==2.0.1
black
flake8
progressbar

7 changes: 5 additions & 2 deletions setup.py
@@ -36,15 +36,18 @@
"sentencepiece~=0.1.95",
"summ_eval==0.70",
"jupyter",
-"gdown",
+"gdown~=4.2.0",
+"readability-lxml",
+"beautifulsoup4",
+"orjson",
"gensim~=3.8.3",
"sklearn",
"py7zr~=0.16.1",
"tqdm~=4.49.0",
"tensorboard~=2.4.1",
"fasttext~=0.9.2",
+"black~=21.12b0",
"easynmt~=2.0.1",
-"black",
"flake8",
"progressbar",
"prettytable",
4 changes: 4 additions & 0 deletions summertime/dataset/__init__.py
@@ -5,10 +5,12 @@
XsumDataset,
PubmedqaDataset,
MlsumDataset,
+XlsumDataset,
ScisummnetDataset,
SummscreenDataset,
QMsumDataset,
ArxivDataset,
+MassivesummDataset,
)

from summertime.dataset.st_dataset import CustomDataset
@@ -20,10 +22,12 @@
XsumDataset,
PubmedqaDataset,
MlsumDataset,
+XlsumDataset,
ScisummnetDataset,
SummscreenDataset,
QMsumDataset,
ArxivDataset,
+MassivesummDataset,
]

