This project analyses textual data. It first downloads a set of published papers from a given list of URLs. The papers are then pre-processed, after which a set of features is derived from them. These features are numerical representations of each paper that can be used in downstream modelling tasks.
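A minimal sketch of the download-and-extract step is shown below. It assumes the paper URLs are already available as a Python list (the URL shown is a placeholder), and the function name is illustrative rather than the project's actual code; pdfminer.six's `extract_text` is used for the text extraction.

```python
import io
import urllib.request

from pdfminer.high_level import extract_text  # pdfminer.six on Python 3


def download_and_extract(url: str) -> str:
    """Download one paper PDF and return its raw text."""
    with urllib.request.urlopen(url) as response:
        pdf_bytes = response.read()
    # pdfminer.six can read from any binary file-like object.
    return extract_text(io.BytesIO(pdf_bytes))


if __name__ == "__main__":
    urls = ["https://example.com/paper_001.pdf"]  # placeholder URL
    for url in urls:
        raw_text = download_and_extract(url)
        print(url, len(raw_text), "characters extracted")
```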
The dataset Urls.pdf contains 200 URLs of published papers from a popular AI conference. The features extracted from the processed documents are stored in:
- vocab.txt : contains the unigrams and bigrams. Tokens are stored alphabetically, one per line, in the format token_string:token_index.
- count_vectors.txt : each row contains the sparse representation of one paper, in the format: paper id, token 1 index, token 1 count, token 2 index, token 2 count, and so on (see the sketch after this list).
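The sketch below shows one way these two files could be produced. It assumes each paper has already been tokenised into a list of lowercase tokens; the paper ids, the comma separator, and the underscore used to join bigrams are assumptions for illustration only.

```python
from collections import Counter

papers = {
    "PP0001": ["deep", "learning", "for", "vision"],   # toy token lists
    "PP0002": ["deep", "networks", "for", "speech"],
}


def ngrams(tokens):
    """Return the unigrams plus the bigrams of a token list."""
    bigrams = ["_".join(pair) for pair in zip(tokens, tokens[1:])]
    return tokens + bigrams


counts = {pid: Counter(ngrams(toks)) for pid, toks in papers.items()}

# vocab.txt: tokens sorted alphabetically, one "token_string:token_index" per line.
vocab = sorted({tok for c in counts.values() for tok in c})
index = {tok: i for i, tok in enumerate(vocab)}
with open("vocab.txt", "w") as f:
    for tok in vocab:
        f.write(f"{tok}:{index[tok]}\n")

# count_vectors.txt: "paper_id, index, count, index, count, ..." per paper.
with open("count_vectors.txt", "w") as f:
    for pid, c in counts.items():
        pairs = []
        for tok in sorted(c, key=index.get):
            pairs.extend([str(index[tok]), str(c[tok])])
        f.write(", ".join([pid] + pairs) + "\n")
```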
A preliminary analysis of the processed data is performed and the results are stored in stats.csv.
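A minimal sketch of such an analysis is given below. It assumes count_vectors.txt uses the comma-separated "paper id, index, count, ..." layout described above; the columns written to stats.csv (distinct tokens and total tokens per paper) are illustrative assumptions, not the project's actual statistics.

```python
import csv

rows = []
with open("count_vectors.txt") as f:
    for line in f:
        parts = [p.strip() for p in line.strip().split(",") if p.strip()]
        paper_id, numbers = parts[0], list(map(int, parts[1:]))
        counts = numbers[1::2]  # indices and counts alternate; counts sit at odd positions
        rows.append({
            "paper_id": paper_id,
            "distinct_tokens": len(counts),
            "total_tokens": sum(counts),
        })

with open("stats.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["paper_id", "distinct_tokens", "total_tokens"])
    writer.writeheader()
    writer.writerows(rows)
```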
Please note: for Python 2, pdfminer needs to be installed; for Python 3, pdfminer.six needs to be installed (both via pip).