# Text Pre-Processing and Feature Extraction

This project analyses textual data. It first downloads a set of published papers from a given list of URLs, then pre-processes them, and finally derives a set of features from each one. These features are simply a numerical representation of a paper that can be used in downstream modelling tasks.
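As a minimal sketch of the download-and-extract step, assuming Python 3 with `pdfminer.six` installed; the function names and the simple tokeniser below are illustrative assumptions, not the project's actual code:

```python
import re
import urllib.request
from pdfminer.high_level import extract_text  # provided by pdfminer.six

def download_and_extract(url, dest="paper.pdf"):
    # Fetch the PDF to a local file, then pull out its raw text.
    urllib.request.urlretrieve(url, dest)
    return extract_text(dest)

def preprocess(text):
    # Toy normalisation: lowercase and keep purely alphabetic tokens.
    return re.findall(r"[a-z]+", text.lower())
```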

The dataset `Urls.pdf` contains 200 URLs of published papers from a popular AI conference. The features extracted from the processed documents are stored in two files:

  1. `vocab.txt`: contains the unigrams and bigrams. These tokens are sorted alphabetically and stored as `token_string:token_index`.
  2. `count_vectors.txt`: each row contains the sparse representation of a particular paper, in the format `paper id, token1 index, token1 count, token2 index, token2 count, ...` (see the sketch after this list).
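A minimal sketch of how these two files could be generated from the tokenised documents of the previous step; the bigram joiner (`_`) and all function names here are assumptions for illustration, not the project's actual conventions:

```python
from collections import Counter

def ngrams(tokens):
    # Unigrams plus bigrams; joining bigrams with "_" is an assumption.
    return tokens + ["_".join(p) for p in zip(tokens, tokens[1:])]

def build_vocab(docs):
    # docs: list of token lists. Maps each token string to an
    # alphabetically-assigned index, as described for vocab.txt.
    grams = sorted({g for tokens in docs for g in ngrams(tokens)})
    return {g: i for i, g in enumerate(grams)}

def write_outputs(paper_ids, docs, vocab):
    # vocab.txt: one "token_string:token_index" entry per line.
    with open("vocab.txt", "w") as f:
        for token, idx in vocab.items():
            f.write(f"{token}:{idx}\n")
    # count_vectors.txt: "paper_id,idx,count,idx,count,..." per paper.
    with open("count_vectors.txt", "w") as f:
        for pid, tokens in zip(paper_ids, docs):
            counts = Counter(ngrams(tokens))
            pairs = ",".join(f"{vocab[g]},{n}" for g, n in
                             sorted(counts.items(), key=lambda kv: vocab[kv[0]]))
            f.write(f"{pid},{pairs}\n")
```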

A preliminary analysis of the processed data is also performed, and the results are stored in `stats.csv`.

Please note: for Python 2, `pdfminer` needs to be installed, and for Python 3, `pdfminer.six` (both through pip).
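For reference, the corresponding install commands are:

```sh
pip install pdfminer.six   # Python 3
pip install pdfminer       # Python 2 (legacy)
```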