Text clustering, an unsupervised ML technique in NLP, groups similar texts based on content. Techniques like hierarchical, k-means, or density-based clustering categorize unstructured data, unveiling insights and patterns in diverse datasets. This exploration was part of the NLP course in my University of Ottawa master's program in 2023.
- Required libraries: scikit-learn, pandas, matplotlib.
- Execute cells in a Jupyter Notebook environment.
- The uploaded code has been executed and tested successfully within the Google Colab environment.
Text clustering involves grouping comparable texts based on content similarity, a crucial unsupervised technique. Five books were chosen, each by a different author and from a different genre.
selected_books = ['austen-emma.txt', 'whitman-leaves.txt', 'milton-paradise.txt', 'melville-moby_dick.txt', 'chesterton-thursday.txt']
-
Data Preparation, Preprocessing, and Cleaning:
-
Listing all the books in Gutenberg’s library.
{'austen-emma.txt': 'Jane Austen', 'austen-persuasion.txt': 'Jane Austen', 'austen-sense.txt': 'Jane Austen', 'carroll-alice.txt': 'Lewis Carroll', 'chesterton-ball.txt': 'G. K. Chesterton', 'chesterton-brown.txt': 'G. K. Chesterton', 'chesterton-thursday.txt': 'G. K. Chesterton', 'edgeworth-parents.txt': 'Maria Edgeworth', 'melville-moby_dick.txt': 'Herman Melville', 'shakespeare-caesar.txt': 'William Shakespeare', 'shakespeare-hamlet.txt': 'William Shakespeare', 'whitman-leaves.txt': 'Walt Whitman'}
-
Choose five different books by five different authors that belong to the same category (History).
-
Data preparation:
- Removing stop words.
- Converting all words to lowercase.
- Tokenizing the text.
- Lemmatizing each token to reduce it to its base form.
-
Data Partitioning: partition each book into 200 documents, each a 100-word record.
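A minimal partitioning sketch, assuming each book is already a list of preprocessed tokens:

```python
def partition(words, n_docs=200, doc_len=100):
    """Split a token list into n_docs documents of doc_len words each."""
    return [" ".join(words[i * doc_len:(i + 1) * doc_len]) for i in range(n_docs)]

# Toy example: 20,000 dummy tokens -> 200 documents of 100 words each
tokens = ["word"] * 20_000
docs = partition(tokens)
print(len(docs), len(docs[0].split()))  # 200 100
```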
-
-
Feature Engineering:
- Transformation
- Bag of Words (BoW): represents the occurrence of words within a document; it involves two things:
- A vocabulary of known words.
- A measure of the presence of known words.
- Term Frequency - Inverse Document Frequency (TF-IDF): a technique to quantify words in a set of documents; a score is computed for each word to signify its importance in the document and the corpus.
- Latent Dirichlet Allocation (LDA): Perform topic modeling to extract latent topics from the text data. Each document is represented as a mixture of topics.
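An LDA sketch with scikit-learn (toy documents and two topics for illustration): each document comes out as a probability distribution over the latent topics.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = ["whale sea ship harpoon", "ball dance gown party",
        "sea ship sailor whale", "party dance music gown"]
counts = CountVectorizer().fit_transform(docs)

# Each document is represented as a mixture over n_components latent topics
lda = LatentDirichletAllocation(n_components=2, random_state=42)
doc_topics = lda.fit_transform(counts)
print(doc_topics.shape)  # (4, 2); each row sums to 1
```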
- Word Embedding (Word2Vec): represents each word as a dense vector, so that words used in similar contexts end up with similar vectors.
-
Modeling: for each of the feature engineering techniques above, the following models are trained and tested.
- K-Means
- Expectation Maximization (EM)
- Hierarchical clustering (Agglomerative)
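All three models are available in scikit-learn; a sketch on synthetic feature vectors standing in for the transformed documents (EM clustering is realized here via `GaussianMixture`):

```python
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.mixture import GaussianMixture

# Two well-separated synthetic groups of 10-dimensional feature vectors
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 10)), rng.normal(5, 1, (50, 10))])

kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
em_labels = GaussianMixture(n_components=2, random_state=0).fit_predict(X)
hier_labels = AgglomerativeClustering(n_clusters=2).fit_predict(X)

print(len(set(kmeans_labels)), len(set(em_labels)), len(set(hier_labels)))
```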
-
Model Evaluation
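Since the true author of each record is known, clusterings can be scored both internally (silhouette score) and against the true labels (Adjusted Rand Index). An illustrative sketch on synthetic data, not the notebook's exact evaluation:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, adjusted_rand_score

# Two well-separated synthetic groups with known ground-truth labels
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 10)), rng.normal(5, 1, (50, 10))])
true = np.array([0] * 50 + [1] * 50)

pred = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

print(silhouette_score(X, pred))        # internal: cohesion vs. separation
print(adjusted_rand_score(true, pred))  # external: agreement with true labels
```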
-
Champion Model
-
Error Analysis of Champion Model: