This fork modifies the preprocessing script to write its output as JSON, so that non-TensorFlow libraries can work with the CNN/DailyMail summarization dataset.
Note: requires Python 3
This fork is primarily developed to work with this repository, which uses PyTorch.
---
Download and unzip the stories directories from here for both CNN and Daily Mail.
Warning: These files contain a few (114, in a dataset of over 300,000) examples for which the article text is missing - see for example cnn/stories/72aba2f58178f2d19d3fae89d5f3e9a4686bc4bb.story. The PyTorch code handles these fine, except in the extreme case where every example sampled in a batch is empty.
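If you want to check how many such examples your copy of the data contains, a minimal sketch is below. It assumes the stories have been unzipped to `cnn/stories` (adjust the path as needed) and relies on the story-file layout, where everything before the first `@highlight` marker is the article body:

```python
import os

# Count .story files whose article text is missing, i.e. files that
# contain only "@highlight" summary lines. Path below is illustrative.
stories_dir = "cnn/stories"

empty = []
for name in os.listdir(stories_dir):
    if not name.endswith(".story"):
        continue
    with open(os.path.join(stories_dir, name), encoding="utf-8") as f:
        text = f.read()
    # Everything before the first "@highlight" marker is the article body.
    article = text.split("@highlight")[0].strip()
    if not article:
        empty.append(name)

print(f"{len(empty)} stories with missing article text")
```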
We will need Stanford CoreNLP to tokenize the data. Download it here and unzip it. Then add the following command to your bash_profile:
export CLASSPATH=/path/to/stanford-corenlp-full-2016-10-31/stanford-corenlp-3.7.0.jar
replacing /path/to/ with the path to where you saved the stanford-corenlp-full-2016-10-31 directory. You can check if it's working by running
echo "Please tokenize this text." | java edu.stanford.nlp.process.PTBTokenizer
You should see something like:
Please
tokenize
this
text
.
PTBTokenizer tokenized 5 tokens at 68.97 tokens per second.
Run
python make_datafiles.py /path/to/cnn/stories /path/to/dailymail/stories
replacing /path/to/cnn/stories with the path to where you saved the cnn/stories directory that you downloaded; similarly for dailymail/stories.
This script will do several things:
- The directories `cnn_stories_tokenized` and `dm_stories_tokenized` will be created and filled with tokenized versions of `cnn/stories` and `dailymail/stories`. This may take some time. Note: you may see several `Untokenizable:` warnings from Stanford Tokenizer. These seem to be related to Unicode characters in the data; so far it seems OK to ignore them.
- For each of the url lists `all_train.txt`, `all_val.txt` and `all_test.txt`, the corresponding tokenized stories are read from file, lowercased and written to tarball files `train.tar`, `val.tar` and `test.tar`. These will be placed in the newly-created `finished_files` directory. This may take some time.
- Additionally, a `vocab_cnt.pkl` file is created from the training data. This is also placed in `finished_files`. It is a Python `Counter` of all words, which could be useful for determining the vocabulary by word-appearance count.
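For reference, here is a minimal sketch of how the finished files might be consumed from Python. It assumes each tarball member is a JSON file whose `article` and `abstract` fields hold lists of tokenized sentences, and that the output lives under `finished_files/`; adjust the field names and paths to whatever this fork actually produces:

```python
import json
import pickle
import tarfile
from collections import Counter

# Read one example out of the training tarball.
# Assumed layout: each file member (e.g. "0.json") is a JSON object with
# "article" and "abstract" fields holding lists of tokenized sentences.
with tarfile.open("finished_files/train.tar") as tar:
    member = next(m for m in tar.getmembers() if m.isfile())
    example = json.load(tar.extractfile(member))
    print(example["article"][:2])   # first two article sentences
    print(example["abstract"][:1])  # first summary sentence

# Build a fixed-size vocabulary from the word-count Counter
# by keeping the most frequent words (size 30000 is illustrative).
with open("finished_files/vocab_cnt.pkl", "rb") as f:
    vocab_cnt: Counter = pickle.load(f)
vocab = [word for word, _ in vocab_cnt.most_common(30000)]
print(len(vocab), "words in vocabulary")
```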