Text clustering, an unsupervised ML technique in NLP, groups similar texts based on content. Techniques like hierarchical, k-means, or density-based clustering categorize unstructured data, unveiling insights and patterns in diverse datasets. This exploration was part of the NLP course in my University of Ottawa master's program in 2023.
- Required libraries: scikit-learn, pandas, matplotlib.
- Execute cells in a Jupyter Notebook environment.
- The uploaded code has been executed and tested successfully within the Google Colab environment.
Text clustering involves grouping comparable texts based on content similarity, a crucial unsupervised technique. Five books were chosen, each by a different author and from a different genre.
selected_books = ['austen-emma.txt', 'whitman-leaves.txt', 'milton-paradise.txt', 'melville-moby_dick.txt', 'chesterton-thursday.txt']
-
Data Preparation, Preprocessing, and Cleaning:
-
Listing all the books in Gutenberg’s library.
{'austen-emma.txt': 'Jane Austen', 'austen-persuasion.txt': 'Jane Austen', 'austen-sense.txt': 'Jane Austen', 'carroll-alice.txt': 'Lewis Carroll', 'chesterton-ball.txt': 'G. K. Chesterton', 'chesterton-brown.txt': 'G. K. Chesterton', 'chesterton-thursday.txt': 'G. K. Chesterton', 'edgeworth-parents.txt': 'Maria Edgeworth', 'melville-moby_dick.txt': 'Herman Melville', 'shakespeare-caesar.txt': 'William Shakespeare', 'shakespeare-hamlet.txt': 'William Shakespeare', 'whitman-leaves.txt': 'Walt Whitman'}
-
Choose five different books by five different authors that belong to the same category (History).
-
Data preparation:
- Removing stop words.
- Converting all words to lowercase.
- Tokenizing the text.
- Lemmatizing each token to reduce it to its base form.
-
Data Partitioning: partition each book into 200 documents, each a 100-word record.
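A minimal partitioning sketch, assuming each book is already a list of preprocessed tokens:

```python
def partition(words, n_docs=200, doc_len=100):
    """Split a token list into n_docs documents of doc_len words each."""
    return [" ".join(words[i * doc_len:(i + 1) * doc_len]) for i in range(n_docs)]

# Toy example: 20,000 dummy tokens -> 200 documents of 100 words each
tokens = ["word"] * 20_000
docs = partition(tokens)
print(len(docs), len(docs[0].split()))  # 200 100
```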
-
-
Feature Engineering:
- Transformation
- Bag of Words (BoW): represents the occurrence of words within a document; it involves two things:
- A vocabulary of known words.
- A measure of the presence of known words.
- Term Frequency - Inverse Document Frequency (TF-IDF): a technique to quantify words in a set of documents; a score is computed for each word to signify its importance in the document and the corpus.
- Latent Dirichlet Allocation (LDA): Perform topic modeling to extract latent topics from the text data. Each document is represented as a mixture of topics.
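An LDA sketch with scikit-learn (toy documents and two topics for illustration): each document comes out as a probability distribution over the latent topics.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = ["whale sea ship harpoon", "ball dance gown party",
        "sea ship sailor whale", "party dance music gown"]
counts = CountVectorizer().fit_transform(docs)

# Each document is represented as a mixture over n_components latent topics
lda = LatentDirichletAllocation(n_components=2, random_state=42)
doc_topics = lda.fit_transform(counts)
print(doc_topics.shape)  # (4, 2); each row sums to 1
```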
- Word Embedding (Word2Vec): represents each word as a dense vector, so that words used in similar contexts end up with similar vectors.
-
Modeling: for each of the feature engineering techniques above, the following models are trained and tested.
- K-Means
- Expectation Maximization (EM)
- Hierarchical clustering (Agglomerative)
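All three models are available in scikit-learn; a sketch on synthetic feature vectors standing in for the transformed documents (EM clustering is realized here via `GaussianMixture`):

```python
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.mixture import GaussianMixture

# Two well-separated synthetic groups of 10-dimensional feature vectors
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 10)), rng.normal(5, 1, (50, 10))])

kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
em_labels = GaussianMixture(n_components=2, random_state=0).fit_predict(X)
hier_labels = AgglomerativeClustering(n_clusters=2).fit_predict(X)

print(len(set(kmeans_labels)), len(set(em_labels)), len(set(hier_labels)))
```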
-
Model Evaluation
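Since the true author of each record is known, clusterings can be scored both internally (silhouette score) and against the true labels (Adjusted Rand Index). An illustrative sketch on synthetic data, not the notebook's exact evaluation:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, adjusted_rand_score

# Two well-separated synthetic groups with known ground-truth labels
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 10)), rng.normal(5, 1, (50, 10))])
true = np.array([0] * 50 + [1] * 50)

pred = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

print(silhouette_score(X, pred))        # internal: cohesion vs. separation
print(adjusted_rand_score(true, pred))  # external: agreement with true labels
```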
-
Champion Model
-
Error Analysis of Champion Model: