This repository contains code for text processing, clustering, and visualization using Python. The workflow involves cleaning and normalizing text data, generating embeddings with BERT, applying K-means clustering, reducing dimensionality, and visualizing the results using t-SNE and word clouds.
To get started, install the Python libraries used for text preprocessing, embedding generation, clustering, and visualization:
pip install nltk transformers torch scikit-learn matplotlib seaborn wordcloud
- NLTK: Used for text processing tasks such as tokenization, stopword removal, and lemmatization.
- Transformers and Torch: Leverage the BERT model to generate high-quality text embeddings.
- Scikit-learn: Provides tools for scaling data, dimensionality reduction (PCA, t-SNE), and clustering (K-means).
- Matplotlib and Seaborn: Used for creating visualizations such as t-SNE scatter plots and silhouette score plots.
- WordCloud: Generates word clouds to visually represent word frequencies in each cluster.
The preprocessing involves cleaning and normalizing text data to prepare it for further analysis.
- Cleaning: Removes punctuation, numbers, and stopwords.
- Normalizing: Lemmatizes words to their base forms.
- Preprocessing Function: Applies both cleaning and normalizing steps to the text data.
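def clean_text(text):
    # Implementation: remove punctuation, numbers, and stopwords
    ...
def normalize_text(text):
    # Implementation: lemmatize words to their base forms
    ...
def preprocess_text(text):
    # Implementation: apply clean_text, then normalize_text
    ...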
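The three steps above can be sketched as follows. This is a minimal, dependency-free illustration and not the repository's actual implementation: the `DEFAULT_STOPWORDS` set and the optional `stopwords` and `lemmatize` parameters are assumptions added so the sketch runs without downloading NLTK data.

```python
import re

# Tiny illustrative stopword set (assumption); use NLTK's English list for real data.
DEFAULT_STOPWORDS = frozenset(
    {"a", "an", "and", "the", "is", "are", "of", "to", "in", "on"}
)


def clean_text(text, stopwords=DEFAULT_STOPWORDS):
    """Lowercase the text, drop punctuation and digits, and remove stopwords."""
    letters_only = re.sub(r"[^a-z\s]", " ", text.lower())
    return " ".join(tok for tok in letters_only.split() if tok not in stopwords)


def normalize_text(text, lemmatize=None):
    """Map each token to its base form with the supplied lemmatizer callable."""
    if lemmatize is None:
        return text  # no lemmatizer supplied: tokens are left unchanged
    return " ".join(lemmatize(tok) for tok in text.split())


def preprocess_text(text, stopwords=DEFAULT_STOPWORDS, lemmatize=None):
    """Apply cleaning first, then normalization."""
    return normalize_text(clean_text(text, stopwords), lemmatize)
```

To use NLTK as described above, one option is to pass `set(nltk.corpus.stopwords.words("english"))` as `stopwords` and `nltk.stem.WordNetLemmatizer().lemmatize` as `lemmatize`, after running `nltk.download("stopwords")` and `nltk.download("wordnet")` once.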
Precomputed BERT embeddings are loaded and standardized to ensure consistent scaling across features.
- Hugging Face Dataset: The dataset used for embedding generation is available on Hugging Face.
- Embeddings File: The precomputed embeddings can be downloaded from this link.
import numpy as np
from sklearn.preprocessing import StandardScaler
train_embeddings = np.load('/path/to/train_embeddings.npy')
train_embeddings_scaled = StandardScaler().fit_transform(train_embeddings)
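To illustrate what standardization does, the sketch below applies `StandardScaler` to a small random array whose columns have very different scales. The array `demo_embeddings` is a synthetic stand-in for the real embeddings file, which is an assumption made so the example runs on its own.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the embeddings: three features with very different spreads.
rng = np.random.default_rng(0)
demo_embeddings = rng.normal(loc=5.0, scale=[1.0, 10.0, 100.0], size=(500, 3))

scaler = StandardScaler()
demo_scaled = scaler.fit_transform(demo_embeddings)

# After scaling, every feature has (approximately) zero mean and unit variance.
feature_means = demo_scaled.mean(axis=0)
feature_stds = demo_scaled.std(axis=0)
print(feature_means.round(6), feature_stds.round(6))
```

Keeping the fitted `scaler` object around makes it possible to apply the same transformation to new embeddings later with `scaler.transform`.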
The number of clusters is chosen by computing the silhouette score for each candidate value in a range and keeping the one with the highest score.
silhouette_scores = []
cluster_range = range(4, 8)
# Loop to calculate silhouette scores
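One way the loop might look is sketched below. It runs on synthetic blobs from `make_blobs` instead of the real embeddings, and picking the maximum score with `np.argmax` is an assumption about how the best value is selected.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for train_embeddings_scaled (assumption).
X, _ = make_blobs(
    n_samples=300, centers=5, n_features=8, cluster_std=0.5, random_state=42
)

silhouette_scores = []
cluster_range = range(4, 8)
for k in cluster_range:
    # Fit K-means for this candidate k and score the resulting partition.
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    silhouette_scores.append(silhouette_score(X, labels))

# Keep the candidate with the highest silhouette score.
optimal_clusters = cluster_range[int(np.argmax(silhouette_scores))]
print(optimal_clusters, [round(s, 3) for s in silhouette_scores])
```

Silhouette scores lie between -1 and 1, with higher values indicating tighter, better separated clusters. On large embedding sets, the `sample_size` argument of `silhouette_score` can reduce the cost of each evaluation.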
Apply K-means clustering with the optimal number of clusters to group similar text embeddings.
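from sklearn.cluster import KMeans
# optimal_clusters is the cluster count with the best silhouette score
kmeans = KMeans(n_clusters=optimal_clusters, random_state=42)
kmeans_labels = kmeans.fit_predict(train_embeddings_scaled)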
Dimensionality reduction is applied to make embeddings more manageable for visualization:
- PCA: Reduces the embeddings to 50 components while retaining as much of the variance as possible.
- t-SNE: Further reduces dimensions to 2 for visualization.
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
pca_result = PCA(n_components=50).fit_transform(train_embeddings_scaled)
tsne_result = TSNE(n_components=2, random_state=42).fit_transform(pca_result)
- t-SNE Plot: Visualizes clusters in 2D to assess clustering quality.
- Word Clouds: Generates word clouds for each cluster to summarize frequent terms.
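A t-SNE scatter plot colored by cluster label could be drawn roughly as follows. The helper name `plot_tsne` and the random `demo_points` and `demo_labels` are hypothetical stand-ins for `tsne_result` and `kmeans_labels`; the sketch uses Matplotlib only.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np


def plot_tsne(points, labels, title="t-SNE of clustered embeddings"):
    """Scatter 2D points, coloring each point by its cluster label."""
    fig, ax = plt.subplots(figsize=(8, 6))
    scatter = ax.scatter(points[:, 0], points[:, 1], c=labels, cmap="tab10", s=10)
    # One legend entry per distinct cluster label.
    ax.legend(*scatter.legend_elements(), title="Cluster")
    ax.set_xlabel("t-SNE 1")
    ax.set_ylabel("t-SNE 2")
    ax.set_title(title)
    return fig


# Random stand-ins for tsne_result and kmeans_labels (assumption).
rng = np.random.default_rng(42)
demo_points = rng.normal(size=(100, 2))
demo_labels = rng.integers(0, 4, size=100)
demo_fig = plot_tsne(demo_points, demo_labels)
```

Calling `demo_fig.savefig("tsne_clusters.png")` writes the figure to disk. Clusters that form compact, well separated groups in the plot suggest the K-means partition is meaningful, although distances in a t-SNE projection should be read qualitatively.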
def plot_wordcloud(text, title):
    # Implementation
    ...
The script prints details of each cluster, including the number of texts and a few sample texts for inspection.
print("Cluster Details:")
# Implementation
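The cluster summary could be produced along these lines. The function name `print_cluster_details` and the `n_samples` parameter are hypothetical; `texts` and `labels` stand for the original texts and the K-means labels in matching order.

```python
from collections import defaultdict


def print_cluster_details(texts, labels, n_samples=3):
    """Print the size of each cluster and a few sample texts from it."""
    clusters = defaultdict(list)
    for text, label in zip(texts, labels):
        clusters[int(label)].append(text)

    print("Cluster Details:")
    for label in sorted(clusters):
        members = clusters[label]
        print(f"Cluster {label}: {len(members)} texts")
        for sample in members[:n_samples]:
            print(f"  - {sample}")
    return dict(clusters)


# Toy inputs standing in for the dataset texts and kmeans_labels (assumption).
demo_clusters = print_cluster_details(
    ["stocks fall", "team wins final", "bank raises rates"], [0, 1, 0]
)
```

Returning the grouped texts as a dictionary makes it easy to reuse them, for example to build the per-cluster input for the word clouds.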
The dataset used for processing is available on Hugging Face. The file used in this project is train.parquet, which contains the text data for processing.
The output includes:
- t-SNE plots visualizing the clusters.
- Word clouds summarizing each cluster's content.
- Cluster details with sample texts and counts.
- Ensure the paths to the embeddings and dataset files are correct.
- Adjust num_rows to process a subset of data as needed.
- Modify the cluster range and other parameters according to your data and requirements.