The directory contains all the data collected and parsed throughout the WSDL group's news similarity project. We retrieved stories from the following websites:
- https://www.washingtonpost.com/
- http://www.foxnews.com
- http://abcnews.go.com/
- https://www.nytimes.com/
- https://www.usatoday.com/
- https://www.cbsnews.com/
- http://www.chicagotribune.com/
- https://www.nbcnews.com/
- http://www.latimes.com/
- https://www.npr.org/
- https://www.wsj.com/
The directories are described as follows:
- The `timemaps` directory contains the timemap of each of the news sites.
- The `mementos` directory contains the mementos closest to 1AM GMT every day from 2016-05-01 to 2017-05-31, collected from the Internet Archive. The directories are named according to a website's md5 hash which can be seen in `news-websites-hashes.json`.
- The `stories/if_/` directory contains the news stories retrieved from the Internet Archive without banner/HTML injections.
- The `col_sim` directory contains the similarity calculations per day for the links where k = 1, 3, 10.
- The `error` directory contains files related to failed requests to the Internet Archive.
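
The naming and date conventions described above can be sketched in Python. This is only an illustration, not the project's actual collection script: the function names are hypothetical, and the exact string that was hashed for each site (for example, with or without a trailing slash) is an assumption, so treat `news-websites-hashes.json` as the authoritative mapping. The `if_` URL modifier is the Wayback Machine option that serves a capture without the injected banner, and the archive redirects a requested timestamp to the nearest available capture.

```python
import hashlib
from datetime import date, datetime, timedelta, timezone


def site_dir_name(url: str) -> str:
    """Return the MD5 hex digest of a site URL.

    Assumption: directory names are the digest of the URL string as listed
    in this README; check news-websites-hashes.json for the real values.
    """
    return hashlib.md5(url.encode("utf-8")).hexdigest()


def daily_targets(start: date, end: date, hour: int = 1):
    """Yield one UTC datetime per day, inclusive of both endpoints."""
    day = start
    while day <= end:
        yield datetime(day.year, day.month, day.day, hour, tzinfo=timezone.utc)
        day += timedelta(days=1)


def wayback_url(site: str, when: datetime) -> str:
    """Build a Wayback Machine URL for the capture nearest to `when`.

    The `if_` suffix asks for the page without the archive banner.
    """
    return f"https://web.archive.org/web/{when:%Y%m%d%H%M%S}if_/{site}"


# One target per day across the collection window stated in this README.
targets = list(daily_targets(date(2016, 5, 1), date(2017, 5, 31)))
print(len(targets))
print(site_dir_name("http://www.foxnews.com"))
print(wayback_url("http://www.foxnews.com", targets[0]))
```

The collection window covers 396 daily targets per site, so each site's directory under `mementos` should hold at most that many captures, fewer where requests failed.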
Aside from those directories, there are also some JSON and CSV files that are subsets of the data used by other parts of this project, usually summarizing it.
For example, `links_per_day.json` describes the links used per day to find the similarity for k = 10 stories from each news site.
The `col_sim` directories also contain summary files of the similarity values, named `col_sim_summary.csv`.
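
As a rough sketch of how the summary files might be gathered, the snippet below walks a directory tree and reads every `col_sim_summary.csv` it finds. The function name `load_summaries` is hypothetical, and the code assumes each CSV starts with a header row; the column names are not documented here, so the rows are returned as generic dictionaries rather than typed records.

```python
import csv
from pathlib import Path


def load_summaries(root: str) -> dict:
    """Map each col_sim_summary.csv path under `root` to its rows.

    Assumption: every summary file has a header row, so each data row
    becomes a dict keyed by whatever column names the file declares.
    """
    summaries = {}
    for path in sorted(Path(root).rglob("col_sim_summary.csv")):
        with path.open(newline="", encoding="utf-8") as handle:
            summaries[str(path)] = list(csv.DictReader(handle))
    return summaries
```

Calling `load_summaries("col_sim")` from the top of this directory would return one entry per summary file, keyed by its path.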