This project uses KMeans clustering with the Euclidean distance measure to group similar data points into 8 clusters, and then reports the sum of squared errors (SSE) of the resulting clusters.
The objective is to run the analysis algorithm on an OpenStack cloud by automating the major steps with Ansible. Ansible scripts are used to create the VMs, set up the Hadoop cluster, install the required software, retrieve the dataset and upload it into HDFS, and copy the analysis code to the master node of the Hadoop cluster. We then log in to the master node, run the analysis code on the data in HDFS, retrieve the results, and show the output of the algorithm.
Results:
When the KMeans algorithm is run for 30 iterations on 13,700+ records with 8 clusters, the resulting sum of squared errors (SSE) comes out to around 6300 ± 500. We ran the code multiple times from scratch to confirm this range.
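For reference, the SSE reported above corresponds to what Spark MLlib calls the clustering cost. Below is a minimal sketch of such a job, assuming the emotion vectors are stored as whitespace-separated numbers in HDFS; the object name, HDFS path, and parsing logic are illustrative assumptions, not the project's actual code:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors

    object KMeansDemo {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("KMeansDemo"))

        // Hypothetical HDFS path; each line is assumed to hold one
        // emotion vector as whitespace-separated doubles.
        val data = sc.textFile("hdfs:///user/hadoop/tweets/emotion_vectors.txt")
        val parsed = data
          .map(line => Vectors.dense(line.trim.split("\\s+").map(_.toDouble)))
          .cache()

        // Cluster into 8 groups, running at most 30 iterations,
        // matching the parameters used in the Results section.
        val model = KMeans.train(parsed, 8, 30)

        // computeCost returns the sum of squared Euclidean distances
        // from each point to its nearest cluster center, i.e. the SSE.
        val sse = model.computeCost(parsed)
        println(s"Sum of squared errors = $sse")

        sc.stop()
      }
    }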
Implementation:
The entry point for running this project is launch.sh, located at /src. The /src/twitter/ directory contains the main source code:
site.yml
|-- software.yml   // install the necessary software on the VMs
|-- dataset.yml    // retrieve the dataset and upload it to HDFS
|-- analysis.yml   // copy the analysis code base
which installs the necessary software on the VMs, retrieves the dataset and uploads it to HDFS, and copies the following analysis code base:
main.sh
|-- twitter.sbt
|-- kmeans.demo.scala
to the master node.
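For context, twitter.sbt is the sbt build definition that main.sh would use to compile and package kmeans.demo.scala before running it on the cluster. A hypothetical sketch is shown below; the project name, version numbers, and Spark/Scala versions are assumptions, not taken from the project:

    // Hypothetical twitter.sbt; names and versions are assumptions.
    name := "twitter"

    version := "1.0"

    scalaVersion := "2.10.4"

    libraryDependencies += "org.apache.spark" %% "spark-core" % "1.2.0"

    libraryDependencies += "org.apache.spark" %% "spark-mllib" % "1.2.0"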
For instructions on how to run this project, refer to the installation.rst file. A sample video demo of this project is available at -
References:
- Academic learnings from CSCI-I590 Topics in Informatics: Projects on Big Data Software, taught by Professor Geoffrey Charles Fox
- The sample dataset of emotion vectors for tweets is obtained from my previous work
- The KMeans usage is based on Spark MLlib's KMeans