This repository gives a brief introduction to feature extraction from text-based data.
The textual data is in the resort.txt file.
The pre-processing steps for the textual data are explained in the Pre-processing of Data.py file. The basic pre-processing steps include tokenization of words and sentences, removal of punctuation, removal of stop-words, stemming of words, and lemmatization of words.
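As a rough illustration of these steps, here is a minimal sketch in plain Python. It is not the code from Pre-processing of Data.py: the stop-word list, the helper names, and the sample text are all hypothetical, and `naive_stem` is a toy suffix stripper standing in for a real stemmer. Lemmatization is left out because it needs a dictionary-backed library.

```python
import re
import string

# Hypothetical, tiny stop-word list for illustration only.
STOP_WORDS = {"the", "is", "a", "an", "and", "of", "to", "in", "are", "was", "were"}

def split_sentences(text):
    # Naive sentence tokenization: split after '.', '!' or '?' followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def tokenize(sentence):
    # Lowercase, delete punctuation characters, then split on whitespace.
    cleaned = sentence.lower().translate(str.maketrans("", "", string.punctuation))
    return cleaned.split()

def remove_stop_words(tokens):
    # Keep only tokens that are not in the stop-word list.
    return [t for t in tokens if t not in STOP_WORDS]

def naive_stem(token):
    # Toy stemmer (an assumption, not the Porter algorithm): strip one common
    # suffix as long as at least three characters remain.
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

# Hypothetical sample text, not taken from resort.txt.
text = "The rooms were cleaned daily. Guests loved the pools!"
sentences = split_sentences(text)
processed = [
    [naive_stem(t) for t in remove_stop_words(tokenize(s))]
    for s in sentences
]
print(processed)
# [['room', 'clean', 'daily'], ['guest', 'lov', 'pool']]
```

Note that a stem such as "lov" need not be a dictionary word; mapping "loved" to "love" is what lemmatization is for.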
The binary features of the data (1 if a particular word exists in a sentence, 0 if it does not) are explained in Binary Features.py.
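A minimal sketch of binary features, assuming the sentences have already been tokenized. The function names and the sample tokens are hypothetical and are not taken from Binary Features.py.

```python
def build_vocabulary(tokenized_sentences):
    # Sorted list of every distinct token, so column order is reproducible.
    return sorted({t for sent in tokenized_sentences for t in sent})

def binary_features(tokenized_sentences, vocabulary):
    # One row per sentence, one column per vocabulary word:
    # 1 if the word occurs in the sentence, 0 otherwise.
    rows = []
    for sent in tokenized_sentences:
        present = set(sent)
        rows.append([1 if word in present else 0 for word in vocabulary])
    return rows

# Hypothetical tokenized sentences.
binary_docs = [["pool", "clean", "pool"], ["room", "clean"]]
binary_vocab = build_vocabulary(binary_docs)
binary_matrix = binary_features(binary_docs, binary_vocab)
print(binary_vocab)    # ['clean', 'pool', 'room']
print(binary_matrix)   # [[1, 1, 0], [1, 0, 1]]
```

The repeated "pool" in the first sentence still yields 1, since only presence is recorded.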
The computation of the count vector, which stores the frequency of each word in a sentence, is explained in CountVector.py.
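A count vector can be sketched with `collections.Counter` from the standard library. This is only an illustration under the same assumptions as before (pre-tokenized sentences, hypothetical names and data), not the code in CountVector.py.

```python
from collections import Counter

def count_vectors(tokenized_sentences):
    # Columns follow the sorted vocabulary; each cell holds how many times
    # the word occurs in that sentence.
    vocabulary = sorted({t for sent in tokenized_sentences for t in sent})
    rows = []
    for sent in tokenized_sentences:
        counts = Counter(sent)  # missing words count as 0
        rows.append([counts[word] for word in vocabulary])
    return vocabulary, rows

# Hypothetical tokenized sentences.
count_docs = [["pool", "clean", "pool"], ["room", "clean"]]
count_vocab, count_matrix = count_vectors(count_docs)
print(count_vocab)    # ['clean', 'pool', 'room']
print(count_matrix)   # [[1, 2, 0], [1, 0, 1]]
```

Unlike the binary features, the repeated "pool" is counted twice here.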
The calculation of the TF (term frequency) matrix and the TF-IDF (term frequency-inverse document frequency) matrix is explained in TF_matrix.py and TF-IDF_Matrix.py, respectively.
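TF and IDF have several common definitions, so the sketch below fixes one as an assumption: TF is a word's count divided by the sentence length, and IDF is the natural log of (number of sentences / number of sentences containing the word), with no smoothing. The scripts in this repository may use a different variant; the names and data here are hypothetical.

```python
import math
from collections import Counter

def tf_matrix(tokenized_sentences, vocabulary):
    # Term frequency: occurrences of the word divided by the sentence length.
    rows = []
    for sent in tokenized_sentences:
        counts = Counter(sent)
        rows.append([counts[word] / len(sent) for word in vocabulary])
    return rows

def idf_vector(tokenized_sentences, vocabulary):
    # Inverse document frequency (unsmoothed): log(N / df), where df is the
    # number of sentences that contain the word. Every vocabulary word occurs
    # in at least one sentence, so df is never zero.
    n = len(tokenized_sentences)
    sets = [set(sent) for sent in tokenized_sentences]
    return [math.log(n / sum(word in s for s in sets)) for word in vocabulary]

def tfidf_matrix(tokenized_sentences, vocabulary):
    # TF-IDF: each TF value scaled by the IDF of its column.
    idf = idf_vector(tokenized_sentences, vocabulary)
    tf = tf_matrix(tokenized_sentences, vocabulary)
    return [[value * weight for value, weight in zip(row, idf)] for row in tf]

# Hypothetical tokenized sentences.
tfidf_docs = [["pool", "clean", "pool"], ["room", "clean"]]
tfidf_vocab = sorted({t for sent in tfidf_docs for t in sent})
tf = tf_matrix(tfidf_docs, tfidf_vocab)
tfidf = tfidf_matrix(tfidf_docs, tfidf_vocab)
print(tfidf_vocab)  # ['clean', 'pool', 'room']
print(tf)
print(tfidf)
```

With this definition, "clean" appears in both sentences, so its IDF is log(2/2) = 0 and its TF-IDF weight is zero everywhere, while "pool" and "room" each appear in only one sentence and keep a positive weight.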