The data is structured as follows:
.
├── NICE
│   ├── NICE
│   ├── NICE_binary
│   └── source
├── R8
├── STOPS
│   ├── STOPS
│   ├── STOPS-2
│   └── source
│       ├── mave
│       └── yelp
├── TREC
├── corpus
├── data-web-snippets
├── mr
├── nltk_data
│   └── corpora
│       └── twitter_samples
└── sst2
Due to uncertainties regarding licensing, the data for Twitter, SearchSnippets, NICE and STOPS is not included in this repository.
For instructions on how to obtain the data, see the README files in the respective folders.
The data for R8, MR and TREC was retrieved from here.
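Since some datasets have to be obtained separately, it can help to check which folders are present before running anything. The following is a minimal sketch, not part of this repository: the helper `missing_dirs` and the placeholder path `"data"` are assumptions for illustration, and only the top-level folder names are taken from the tree above.

```python
from pathlib import Path

# Top-level folders as listed in the directory tree above.
EXPECTED_DIRS = [
    "NICE", "R8", "STOPS", "TREC", "corpus",
    "data-web-snippets", "mr", "nltk_data", "sst2",
]


def missing_dirs(data_root):
    """Return the expected top-level folders that are absent under data_root."""
    root = Path(data_root)
    return [name for name in EXPECTED_DIRS if not (root / name).is_dir()]


if __name__ == "__main__":
    # "data" is a placeholder (an assumption); point it at your data folder.
    for name in missing_dirs("data"):
        print(f"missing: {name}")
```

Any folder reported as missing then needs to be downloaded following the README in the respective folder.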