# Text Pre-Processing and Feature Extraction

This project analyses textual data. It first downloads a set of published papers from a given list of URLs, then pre-processes them, and finally derives a set of features from each one. These features are simply a numerical representation of a paper that can be used in downstream modelling tasks.
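As a minimal sketch of the download-and-extract step, assuming Python 3 with `pdfminer.six` installed; the function names and the simple tokeniser below are illustrative assumptions, not the project's actual code:

```python
import re
import urllib.request
from pdfminer.high_level import extract_text  # provided by pdfminer.six

def download_and_extract(url, dest="paper.pdf"):
    # Fetch the PDF to a local file, then pull out its raw text.
    urllib.request.urlretrieve(url, dest)
    return extract_text(dest)

def preprocess(text):
    # Toy normalisation: lowercase and keep purely alphabetic tokens.
    return re.findall(r"[a-z]+", text.lower())
```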

The dataset `Urls.pdf` contains 200 URLs of published papers from a popular AI conference. The features extracted from the processed documents are stored in two files:

  1. `vocab.txt`: contains the unigrams and bigrams. These tokens are sorted alphabetically and stored as `token_string:token_index`.
  2. `count_vectors.txt`: each row contains the sparse representation of a particular paper, in the format `paper id, token1 index, token1 count, token2 index, token2 count, ...` (see the sketch after this list).
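A minimal sketch of how these two files could be generated from the tokenised documents of the previous step; the bigram joiner (`_`) and all function names here are assumptions for illustration, not the project's actual conventions:

```python
from collections import Counter

def ngrams(tokens):
    # Unigrams plus bigrams; joining bigrams with "_" is an assumption.
    return tokens + ["_".join(p) for p in zip(tokens, tokens[1:])]

def build_vocab(docs):
    # docs: list of token lists. Maps each token string to an
    # alphabetically-assigned index, as described for vocab.txt.
    grams = sorted({g for tokens in docs for g in ngrams(tokens)})
    return {g: i for i, g in enumerate(grams)}

def write_outputs(paper_ids, docs, vocab):
    # vocab.txt: one "token_string:token_index" entry per line.
    with open("vocab.txt", "w") as f:
        for token, idx in vocab.items():
            f.write(f"{token}:{idx}\n")
    # count_vectors.txt: "paper_id,idx,count,idx,count,..." per paper.
    with open("count_vectors.txt", "w") as f:
        for pid, tokens in zip(paper_ids, docs):
            counts = Counter(ngrams(tokens))
            pairs = ",".join(f"{vocab[g]},{n}" for g, n in
                             sorted(counts.items(), key=lambda kv: vocab[kv[0]]))
            f.write(f"{pid},{pairs}\n")
```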

A preliminary analysis of the processed data is also performed, and the results are stored in `stats.csv`.

Please note: for Python 2, `pdfminer` needs to be installed, and for Python 3, `pdfminer.six` (both through pip).
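For reference, the corresponding install commands are:

```sh
pip install pdfminer.six   # Python 3
pip install pdfminer       # Python 2 (legacy)
```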