Download: Link1
Since the start of the COVID-19 pandemic, several relevant corpora containing millions of data points have been presented in the literature. While these corpora are valuable for many analyses of this specific pandemic, researchers need additional benchmark corpora covering other epidemics to facilitate cross-epidemic pattern recognition and trend analysis. In the course of our other COVID-19 work, we found very few disease-related corpora in the literature that are sizable and rich enough to support such cross-epidemic analysis.
Here we present EPIC, a large-scale epidemic corpus that contains over 20 million tweets spanning 2006 to 2020. The corpus has two subsets, general and outbreaks. The general set contains 3 epidemics, namely:
The outbreak set contains 6 epidemic outbreaks, as follows:
- 2009 H1N1 Swine Flu
- 2010 Haiti Cholera
- 2012 Middle East Respiratory Syndrome (MERS)
- 2013 West African Ebola
- 2016 Yemen Cholera
- 2018 Kivu Ebola
Each class of data is stored in a single CSV file named after the respective event. Each line of a file contains comma-separated fields, and not every field necessarily has a value. The fields are as follows:
Field | Type | Description |
---|---|---|
date | datetime | The date and time (in UTC) that the tweet was posted. Example: 4/2/09 17:06 |
username | string | The unique username of the account that posted the tweet. Example: douance_quebec |
to | string | The username of the Twitter account that the tweet was addressed to. Example: CedricFontaine |
replies | integer | The number of replies that the tweet has. A reply is a response to another person’s Tweet. You can reply by clicking or tapping the reply icon from a Tweet. Example: 3 |
retweets | integer | The number of retweets that the tweet has. A tweet that a user shares publicly with his/her followers is known as a Retweet, which is a conventional way to pass along news and interesting discoveries on Twitter. Example: 3 |
favorites | integer | The number of favorites the tweet has received. Favorites are represented by a small heart and are used to show appreciation for a Tweet. Example: 3 |
text | string | The content of a tweet that contains up to 280 characters. Example: H1N1 + H1N5 = Trouble... |
mentions | string | Another account's Twitter username preceded by the "@" symbol. A mention is a Tweet that contains another person’s username anywhere in the body of the Tweet. Example: @cyberlou33 |
hashtags | string | The hashtags that the tweet includes. A hashtag is formed by the # symbol followed by a relevant keyword or phrase. Hashtags are commonly used to categorize Tweets and help them show up more easily in Twitter search. Example: #panflu |
id | string | A unique identifier of a tweet. Example: 1136281607 |
permalink | string | The unique URL of a tweet. Whenever you view a Tweet's permanent link, you can see the exact time and date the Tweet was posted and the number of likes and Retweets the Tweet received. Example: https://twitter.com/douance_quebec/status/1096080744 |
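
As a rough illustration of how the fields above might be read, here is a minimal Python sketch that uses only the standard library. It rests on several assumptions not confirmed by this README: that each file has a header row with the field names listed in the table, that dates follow a month/day/year layout like the example `4/2/09 17:06`, and that missing values appear as empty strings. The helper names `parse_row` and `read_tweets` are made up for this example.

```python
import csv
import io
from datetime import datetime, timezone

# Assumptions: the date layout is month/day/year (inferred from the example
# "4/2/09 17:06") and the count columns are the three integer fields in the
# table. Check both against the actual files before relying on them.
DATE_FORMAT = "%m/%d/%y %H:%M"
INT_FIELDS = ("replies", "retweets", "favorites")


def parse_row(row):
    """Convert one raw CSV row (a dict of strings) into typed values."""
    parsed = dict(row)
    # Not every field has a value, so map empty strings to None.
    for key, value in parsed.items():
        if value == "":
            parsed[key] = None
    if parsed.get("date"):
        naive = datetime.strptime(parsed["date"], DATE_FORMAT)
        parsed["date"] = naive.replace(tzinfo=timezone.utc)  # dates are in UTC
    for key in INT_FIELDS:
        if parsed.get(key) is not None:
            parsed[key] = int(parsed[key])
    return parsed


def read_tweets(fileobj):
    """Yield typed rows from an open CSV file that has a header row."""
    for row in csv.DictReader(fileobj):
        yield parse_row(row)


# Tiny in-memory sample assembled from the example values in the table;
# the "to" field is left empty to show how a missing value is handled.
SAMPLE = (
    "date,username,to,replies,retweets,favorites,text,mentions,hashtags,id,permalink\n"
    "4/2/09 17:06,douance_quebec,,3,3,3,H1N1 + H1N5 = Trouble...,@cyberlou33,#panflu,"
    "1136281607,https://twitter.com/douance_quebec/status/1096080744\n"
)

tweets = list(read_tweets(io.StringIO(SAMPLE)))
print(tweets[0]["date"].isoformat(), tweets[0]["replies"], tweets[0]["to"])
```

To read a real file instead of the in-memory sample, pass a file object opened with `newline=""` and a suitable encoding to `read_tweets`. The `id` field is kept as a string, matching its type in the table.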
This data is intended to support academic research purposes only and may not be used for any commercial purposes, by any commercial entity, or by any party, unless otherwise authorized by the authors.
If your publication uses the data, either in full or in part, you should cite the paper below:
@article{liu2020epic,
title={EPIC30M: An Epidemics Corpus Of Over 30 Million Relevant Tweets},
author={Liu, Junhua and Singhal, Trisha and Blessing, Lucienne T.M. and Wood, Kristin L. and Lim, Kwan Hui},
journal={arXiv preprint arXiv:2006.08369},
year={2020}
}