# A data processing pipeline for clustering weather data
Marios Souroulla, Philip Andreadis, Lorenzo Rota
The purpose of this project is to cluster large amounts of weather data (Big Data). We run a modified k-means algorithm over historical data to compute the K best centers that describe the various weather profiles. We then stream data and use the computed centers to classify each incoming record into one of the clusters.
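As an illustration of the classification step, the minimal sketch below assigns an incoming record to the nearest precomputed center using Euclidean distance. The feature layout and the center values are placeholders for illustration, not the actual output of the k-means run:

```python
import numpy as np

# Hypothetical centers produced by the batch k-means step
# (K rows, one per weather profile; the feature order is an assumption).
centers = np.array([
    [14.2, 65.0, 3.1],   # e.g. temperature, humidity, wind speed
    [28.7, 40.5, 1.2],
    [2.4, 80.3, 7.8],
])

def classify(record: np.ndarray) -> int:
    """Assign a streamed record to the index of the nearest center."""
    distances = np.linalg.norm(centers - record, axis=1)
    return int(np.argmin(distances))

print(classify(np.array([15.0, 60.0, 2.5])))  # -> index of the closest profile
```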
For data ingestion we use Apache Kafka, we store the data on HDFS, and we process it using Apache Spark in Python.
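The sketch below shows what this path can look like, under some assumptions: a Kafka broker on `localhost:9092`, a topic named `weather`, and an HDFS namenode at `hdfs://localhost:9000` (these names are illustrative, not taken from the project configuration; see the layer READMEs for the real values). It reads the stream with Spark Structured Streaming and persists the raw records to HDFS as Parquet. Running it also requires the `spark-sql-kafka` connector package on the Spark classpath.

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("weather-ingestion")
         .getOrCreate())

# Subscribe to the (assumed) "weather" topic on a local broker.
stream = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")
          .option("subscribe", "weather")
          .load())

# Persist the raw records to HDFS so the batch (k-means) layer can read them.
query = (stream.selectExpr("CAST(value AS STRING)")
         .writeStream
         .format("parquet")
         .option("path", "hdfs://localhost:9000/data/weather")
         .option("checkpointLocation", "hdfs://localhost:9000/checkpoints/weather")
         .start())

query.awaitTermination()
```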
The project runs as a pipeline and is structured into 3 consecutive layers:
- Data Acquisition: This layer is responsible for emulating the weather data source by fetching data from multiple weather stations and publishing it to a real-time data streaming API (a producer sketch follows the list below). For more information on what this entails and how to set it up, read the instructions in `data_acquisition/README.md`. It is important to note that this service must be running continuously.
- Data Ingestion: For any data processing to take place, there needs to be a bridge between the data source and the data sink. This is where the data ingestion layer comes in. In this project, data ingestion comprises batch ingestion and streaming ingestion (following a lambda architecture). More information on how to run this can be found in `data_ingestion/README.md`.
- Application: Here the core of the data processing takes place, namely two versions of k-means clustering applied to the historical data. The application also supports streaming processing, where new data records are assigned to the clusters computed from the historical data. More information on this can be found in `application/README.md`.
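To make the Data Acquisition layer concrete, here is a minimal producer sketch, assuming the `kafka-python` client, a broker at `localhost:9092`, and a hypothetical `weather` topic; the real service in `data_acquisition/` fetches actual station data rather than the placeholder reading used here:

```python
import json
import time
from kafka import KafkaProducer  # pip install kafka-python

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

while True:
    # Placeholder reading; the real service fetches from weather stations.
    reading = {"station": "demo", "temperature": 15.0, "humidity": 60.0}
    producer.send("weather", reading)
    time.sleep(5)  # emulate a continuous, periodic data source
```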
While the Data Acquisition layer can run independently, the other two layers require that Kafka, Hadoop, and Spark are up and running. For this, follow the instructions in `docker_containers/README.md`.
The instructions on how to run each layer are explained in the corresponding READMEs.
The correct sequence to run the pipeline properly is:
Data Acquisition -> Data Ingestion -> Application (Data Processing).
For more details, see `Scalable_Computing_Project.pdf`.