This project uses KMeans clustering with the Euclidean distance measure to group similar data points into 8 clusters, and then reports the sum of squared errors (SSE) of the resulting clusters.
The objective is to run the analysis algorithm on an OpenStack cloud by automating the major steps with Ansible. Ansible scripts are used to create the VMs, set up the Hadoop cluster, install the required software, retrieve the dataset and upload it into HDFS, and copy the analysis code to the master node of the Hadoop cluster. We then log in to the master node, run the analysis code on the data in HDFS, retrieve the results, and show the output of the algorithm.
Results:
When the KMeans algorithm is run for 30 iterations on 13,700+ records with 8 clusters, the resulting sum of squared errors (SSE) comes out to around 6300 ± 500. We ran the code multiple times from scratch to confirm this range.
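For reference, the SSE reported above corresponds to what Spark MLlib calls the clustering cost. Below is a minimal sketch of such a job, assuming the emotion vectors are stored as whitespace-separated numbers in HDFS; the object name, HDFS path, and parsing logic are illustrative assumptions, not the project's actual code:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors

    object KMeansDemo {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("KMeansDemo"))

        // Hypothetical HDFS path; each line is assumed to hold one
        // emotion vector as whitespace-separated doubles.
        val data = sc.textFile("hdfs:///user/hadoop/tweets/emotion_vectors.txt")
        val parsed = data
          .map(line => Vectors.dense(line.trim.split("\\s+").map(_.toDouble)))
          .cache()

        // Cluster into 8 groups, running at most 30 iterations,
        // matching the parameters used in the Results section.
        val model = KMeans.train(parsed, 8, 30)

        // computeCost returns the sum of squared Euclidean distances
        // from each point to its nearest cluster center, i.e. the SSE.
        val sse = model.computeCost(parsed)
        println(s"Sum of squared errors = $sse")

        sc.stop()
      }
    }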
Implementation:
The entry point for running this project is launch.sh, located at /src. The /src/twitter/ directory contains the main source code:
site.yml
|-- software.yml   // install the necessary software on the VMs
|-- dataset.yml    // retrieve the dataset and upload it to HDFS
|-- analysis.yml   // copy the analysis code base
which installs the necessary software on the VMs, retrieves the dataset and uploads it to HDFS, and copies the following analysis code base:
main.sh
|-- twitter.sbt
|-- kmeans.demo.scala
to the master node.
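For context, twitter.sbt is the sbt build definition that main.sh would use to compile and package kmeans.demo.scala before running it on the cluster. A hypothetical sketch is shown below; the project name, version numbers, and Spark/Scala versions are assumptions, not taken from the project:

    // Hypothetical twitter.sbt; names and versions are assumptions.
    name := "twitter"

    version := "1.0"

    scalaVersion := "2.10.4"

    libraryDependencies += "org.apache.spark" %% "spark-core" % "1.2.0"

    libraryDependencies += "org.apache.spark" %% "spark-mllib" % "1.2.0"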
For instructions on how to run this project, refer to the installation.rst file. A sample video demo of this project is available at -
References:
- Academic learnings from CSCI-I590 Topics in Informatics: Projects on Big Data Software, taught by Professor Geoffrey Charles Fox
- The sample dataset of emotion vectors for tweets is obtained from my previous work
- The KMeans usage is based on Spark MLlib's KMeans