In this task, we use semantically close books from Project Gutenberg and aim to classify text segments by their source book.
The versions are the defaults set by Colab:
- Python
- NLTK
- scikit-learn
- ELI5
- LIME
- Matplotlib
- Jupyter/Spyder/Colab
Figure: an example of the ELI5 output showing the top 10 words for each of the 5 books.

The pipeline:
- We start with 5 books that are semantically close to each other.
- Extract 200 samples from each book; each sample comprises 100 words (a sampling sketch is given after this list).
- Data preprocessing is performed on these segments (see the preprocessing sketch below):
- Tokenization
- Punctuation and stop-word removal
- Lowercasing
- Lemmatization
- Feature engineering on the cleaned data from the preprocessing step (see the vectorization sketch below):
- Bag of Words
- TF-IDF
- Splitting the data into train/test sets (80/20) and running 10-fold cross-validation (see the sketch below).
- Modelling on the TF-IDF features (see the modelling sketch below):
- Decision Tree
- KNN
- SVM
- Logistic Regression
- Evaluation: accuracy and the bias-variance tradeoff.
- Error analysis for misclassified segments (see the explanation sketch below):
- ELI5
- LIME
- Draw insights from the analysis, then modify some hyperparameters (e.g. the number of words per segment) and retrain.
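A minimal sketch of the sampling step, assuming the books are read from NLTK's Gutenberg corpus; the five fileids, the `sample_book` helper, and the fixed seed are hypothetical choices, not the project's actual setup:

```python
import random
from nltk.corpus import gutenberg  # requires a one-time nltk.download("gutenberg")

# Hypothetical choice of 5 semantically close books.
BOOKS = ["austen-emma.txt", "austen-persuasion.txt", "austen-sense.txt",
         "chesterton-ball.txt", "chesterton-brown.txt"]

def sample_book(fileid, n_samples=200, sample_len=100, seed=42):
    """Draw n_samples non-overlapping segments of sample_len words each."""
    words = list(gutenberg.words(fileid))
    starts = random.Random(seed).sample(
        range(0, len(words) - sample_len, sample_len), n_samples)
    return [" ".join(words[s:s + sample_len]) for s in starts]

segments, labels = [], []
for book in BOOKS:
    book_segments = sample_book(book)
    segments.extend(book_segments)
    labels.extend([book] * len(book_segments))
```

Because `sample_len` is a parameter, the last pipeline step (retraining with a different number of words per segment) only needs this sampler re-run.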
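A sketch of the four preprocessing steps using NLTK; the `preprocess` function name is ours:

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

for resource in ("punkt", "stopwords", "wordnet"):  # one-time downloads
    nltk.download(resource, quiet=True)

STOP_WORDS = set(stopwords.words("english"))
LEMMATIZER = WordNetLemmatizer()

def preprocess(segment):
    tokens = word_tokenize(segment)               # tokenization
    tokens = [t.lower() for t in tokens]          # lowercasing
    tokens = [t for t in tokens                   # punctuation and stop-word removal
              if t.isalpha() and t not in STOP_WORDS]
    return " ".join(LEMMATIZER.lemmatize(t) for t in tokens)  # lemmatization

clean_segments = [preprocess(s) for s in segments]
```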
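Both feature sets map directly onto scikit-learn vectorizers; a sketch (for a stricter setup, the vectorizers would be fit on the training split only):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

bow_vectorizer = CountVectorizer()    # Bag of Words: raw term counts
X_bow = bow_vectorizer.fit_transform(clean_segments)

tfidf_vectorizer = TfidfVectorizer()  # TF-IDF: counts reweighted by document rarity
X_tfidf = tfidf_vectorizer.fit_transform(clean_segments)
```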
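The split and cross-validation sketched with scikit-learn; logistic regression is only a placeholder estimator for the CV scores, and the stratified split and fixed random_state are our assumptions:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

# 80/20 train/test split, stratified so each book is equally represented.
X_train, X_test, y_train, y_test = train_test_split(
    X_tfidf, labels, test_size=0.2, stratify=labels, random_state=42)

# 10-fold cross-validation on the training portion.
cv_scores = cross_val_score(LogisticRegression(max_iter=1000),
                            X_train, y_train, cv=10)
print(f"CV accuracy: {cv_scores.mean():.3f} +/- {cv_scores.std():.3f}")
```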
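A sketch of the four classifiers on the TF-IDF features, with mostly default hyperparameters assumed. Reporting train and test accuracy side by side gives a rough read on the bias-variance tradeoff: a large gap suggests high variance (overfitting), while two low scores suggest high bias:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "KNN": KNeighborsClassifier(),
    "SVM": SVC(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    train_acc = accuracy_score(y_train, model.predict(X_train))
    test_acc = accuracy_score(y_test, model.predict(X_test))
    print(f"{name}: train={train_acc:.3f}, test={test_acc:.3f}")
```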
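Finally, a sketch of the two explanation tools. ELI5 gives the global view (the top-weighted words per book, as in the figure above), while LIME explains a single segment; the pipeline wrapper is needed because LIME perturbs raw text, and picking clean_segments[0] as the segment to explain is arbitrary (in practice it would be a misclassified one):

```python
import eli5
from lime.lime_text import LimeTextExplainer
from sklearn.pipeline import make_pipeline

logreg = models["Logistic Regression"]

# Global explanation: top 10 weighted words per class.
print(eli5.format_as_text(
    eli5.explain_weights(logreg, vec=tfidf_vectorizer, top=10)))

# Local explanation of one segment with LIME.
pipeline = make_pipeline(tfidf_vectorizer, logreg)  # raw text -> class probabilities
explainer = LimeTextExplainer(class_names=list(logreg.classes_))
explanation = explainer.explain_instance(
    clean_segments[0], pipeline.predict_proba, num_features=10, top_labels=1)
print(explanation.as_list(label=explanation.available_labels()[0]))
```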