In this task, we use semantically close books from Project Gutenberg and aim to classify text segments by their source book.
The versions are the defaults set by Colab:
- Python
- NLTK
- scikit-learn
- ELI5
- LIME
- Matplotlib
- Jupyter/Spyder/Colab
Figure: an example of the ELI5 output showing the top 10 words for each of the 5 books.

The pipeline:
- We start with 5 books that are semantically close to each other.
- Extract 200 samples from each book; each sample comprises 100 words (a sampling sketch is given after this list).
- Data preprocessing is performed on these segments (see the preprocessing sketch below):
- Tokenization
- Punctuation and stop-word removal
- Lowercasing
- Lemmatization
- Feature engineering on the cleaned data from the preprocessing step (see the vectorization sketch below):
- Bag of Words
- TF-IDF
- Splitting the data into train/test sets (80/20) and running 10-fold cross-validation (see the sketch below).
- Modelling on the TF-IDF features (see the modelling sketch below):
- Decision Tree
- KNN
- SVM
- Logistic Regression
- Evaluation: accuracy and the bias-variance tradeoff.
- Error analysis for misclassified segments (see the explanation sketch below):
- ELI5
- LIME
- Draw insights from the analysis, then modify some hyperparameters (e.g. the number of words per segment) and retrain.
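A minimal sketch of the sampling step, assuming the books are read from NLTK's Gutenberg corpus; the five fileids, the `sample_book` helper, and the fixed seed are hypothetical choices, not the project's actual setup:

```python
import random
from nltk.corpus import gutenberg  # requires a one-time nltk.download("gutenberg")

# Hypothetical choice of 5 semantically close books.
BOOKS = ["austen-emma.txt", "austen-persuasion.txt", "austen-sense.txt",
         "chesterton-ball.txt", "chesterton-brown.txt"]

def sample_book(fileid, n_samples=200, sample_len=100, seed=42):
    """Draw n_samples non-overlapping segments of sample_len words each."""
    words = list(gutenberg.words(fileid))
    starts = random.Random(seed).sample(
        range(0, len(words) - sample_len, sample_len), n_samples)
    return [" ".join(words[s:s + sample_len]) for s in starts]

segments, labels = [], []
for book in BOOKS:
    book_segments = sample_book(book)
    segments.extend(book_segments)
    labels.extend([book] * len(book_segments))
```

Because `sample_len` is a parameter, the last pipeline step (retraining with a different number of words per segment) only needs this sampler re-run.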
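A sketch of the four preprocessing steps using NLTK; the `preprocess` function name is ours:

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

for resource in ("punkt", "stopwords", "wordnet"):  # one-time downloads
    nltk.download(resource, quiet=True)

STOP_WORDS = set(stopwords.words("english"))
LEMMATIZER = WordNetLemmatizer()

def preprocess(segment):
    tokens = word_tokenize(segment)               # tokenization
    tokens = [t.lower() for t in tokens]          # lowercasing
    tokens = [t for t in tokens                   # punctuation and stop-word removal
              if t.isalpha() and t not in STOP_WORDS]
    return " ".join(LEMMATIZER.lemmatize(t) for t in tokens)  # lemmatization

clean_segments = [preprocess(s) for s in segments]
```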
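Both feature sets map directly onto scikit-learn vectorizers; a sketch (for a stricter setup, the vectorizers would be fit on the training split only):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

bow_vectorizer = CountVectorizer()    # Bag of Words: raw term counts
X_bow = bow_vectorizer.fit_transform(clean_segments)

tfidf_vectorizer = TfidfVectorizer()  # TF-IDF: counts reweighted by document rarity
X_tfidf = tfidf_vectorizer.fit_transform(clean_segments)
```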
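The split and cross-validation sketched with scikit-learn; logistic regression is only a placeholder estimator for the CV scores, and the stratified split and fixed random_state are our assumptions:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

# 80/20 train/test split, stratified so each book is equally represented.
X_train, X_test, y_train, y_test = train_test_split(
    X_tfidf, labels, test_size=0.2, stratify=labels, random_state=42)

# 10-fold cross-validation on the training portion.
cv_scores = cross_val_score(LogisticRegression(max_iter=1000),
                            X_train, y_train, cv=10)
print(f"CV accuracy: {cv_scores.mean():.3f} +/- {cv_scores.std():.3f}")
```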
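A sketch of the four classifiers on the TF-IDF features, with mostly default hyperparameters assumed. Reporting train and test accuracy side by side gives a rough read on the bias-variance tradeoff: a large gap suggests high variance (overfitting), while two low scores suggest high bias:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "KNN": KNeighborsClassifier(),
    "SVM": SVC(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    train_acc = accuracy_score(y_train, model.predict(X_train))
    test_acc = accuracy_score(y_test, model.predict(X_test))
    print(f"{name}: train={train_acc:.3f}, test={test_acc:.3f}")
```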
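Finally, a sketch of the two explanation tools. ELI5 gives the global view (the top-weighted words per book, as in the figure above), while LIME explains a single segment; the pipeline wrapper is needed because LIME perturbs raw text, and picking clean_segments[0] as the segment to explain is arbitrary (in practice it would be a misclassified one):

```python
import eli5
from lime.lime_text import LimeTextExplainer
from sklearn.pipeline import make_pipeline

logreg = models["Logistic Regression"]

# Global explanation: top 10 weighted words per class.
print(eli5.format_as_text(
    eli5.explain_weights(logreg, vec=tfidf_vectorizer, top=10)))

# Local explanation of one segment with LIME.
pipeline = make_pipeline(tfidf_vectorizer, logreg)  # raw text -> class probabilities
explainer = LimeTextExplainer(class_names=list(logreg.classes_))
explanation = explainer.explain_instance(
    clean_segments[0], pipeline.predict_proba, num_features=10, top_labels=1)
print(explanation.as_list(label=explanation.available_labels()[0]))
```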