In the last post, we talked about Topic Modeling, or a way to identify several topics from a corpus of documents. The method used there was Latent Dirichlet Allocation or LDA. In this article, we're going to perform a similar task but through the unsupervised machine learning method of clustering. While the method is different, the outcome is several groups (or topics) of words related to each other.
For this example, we will use the Wine Spectator reviews dataset from Kaggle[^KAGGLE]. It contains a little over 100,000 different wine reviews of varietals worldwide. The descriptions of the wines as tasting notes are the text-based variable that we're going to use to cluster and interpret the results.
Read more here: https://www.dataknowsall.com/textclustering.html