This project analyses textual data. It first downloads a set of published papers from a given list of URLs. The papers are then pre-processed, after which a set of features is derived from them. These features are numerical representations of each paper that can be used in downstream modelling tasks.
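A minimal sketch of the download-and-extract step is shown below. It assumes the paper URLs are already available as a Python list (the URL shown is a placeholder), and the function name is illustrative rather than the project's actual code; pdfminer.six's `extract_text` is used for the text extraction.

```python
import io
import urllib.request

from pdfminer.high_level import extract_text  # pdfminer.six on Python 3


def download_and_extract(url: str) -> str:
    """Download one paper PDF and return its raw text."""
    with urllib.request.urlopen(url) as response:
        pdf_bytes = response.read()
    # pdfminer.six can read from any binary file-like object.
    return extract_text(io.BytesIO(pdf_bytes))


if __name__ == "__main__":
    urls = ["https://example.com/paper_001.pdf"]  # placeholder URL
    for url in urls:
        raw_text = download_and_extract(url)
        print(url, len(raw_text), "characters extracted")
```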
The dataset Urls.pdf contains 200 URLs of published papers from a popular AI conference. The features extracted from the processed documents are stored in:
- vocab.txt : contains the unigrams and bigrams. Tokens are stored alphabetically, one per line, in the format token_string:token_index.
- count_vectors.txt : each row contains the sparse representation of one paper, in the format: paper id, token 1 index, token 1 count, token 2 index, token 2 count, and so on (see the sketch after this list).
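The sketch below shows one way these two files could be produced. It assumes each paper has already been tokenised into a list of lowercase tokens; the paper ids, the comma separator, and the underscore used to join bigrams are assumptions for illustration only.

```python
from collections import Counter

papers = {
    "PP0001": ["deep", "learning", "for", "vision"],   # toy token lists
    "PP0002": ["deep", "networks", "for", "speech"],
}


def ngrams(tokens):
    """Return the unigrams plus the bigrams of a token list."""
    bigrams = ["_".join(pair) for pair in zip(tokens, tokens[1:])]
    return tokens + bigrams


counts = {pid: Counter(ngrams(toks)) for pid, toks in papers.items()}

# vocab.txt: tokens sorted alphabetically, one "token_string:token_index" per line.
vocab = sorted({tok for c in counts.values() for tok in c})
index = {tok: i for i, tok in enumerate(vocab)}
with open("vocab.txt", "w") as f:
    for tok in vocab:
        f.write(f"{tok}:{index[tok]}\n")

# count_vectors.txt: "paper_id, index, count, index, count, ..." per paper.
with open("count_vectors.txt", "w") as f:
    for pid, c in counts.items():
        pairs = []
        for tok in sorted(c, key=index.get):
            pairs.extend([str(index[tok]), str(c[tok])])
        f.write(", ".join([pid] + pairs) + "\n")
```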
A preliminary analysis of the processed data is performed and the results are stored in stats.csv.
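A minimal sketch of such an analysis is given below. It assumes count_vectors.txt uses the comma-separated "paper id, index, count, ..." layout described above; the columns written to stats.csv (distinct tokens and total tokens per paper) are illustrative assumptions, not the project's actual statistics.

```python
import csv

rows = []
with open("count_vectors.txt") as f:
    for line in f:
        parts = [p.strip() for p in line.strip().split(",") if p.strip()]
        paper_id, numbers = parts[0], list(map(int, parts[1:]))
        counts = numbers[1::2]  # indices and counts alternate; counts sit at odd positions
        rows.append({
            "paper_id": paper_id,
            "distinct_tokens": len(counts),
            "total_tokens": sum(counts),
        })

with open("stats.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["paper_id", "distinct_tokens", "total_tokens"])
    writer.writeheader()
    writer.writerows(rows)
```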
Please note: for Python 2, pdfminer needs to be installed; for Python 3, pdfminer.six needs to be installed (both via pip).