AnirudhPenmatcha/Scalable-Cloud-Computing-for-Efficient-Big-Data-Analytics-A-Dask-Integration-Approach

 
 

Scalable Cloud-Based Distributed Computing for Efficient Big Data Analytics: A Dask Integration Approach

In this project, we are building a platform that scales dynamically with user demand. At its core, it integrates Dask, a parallel execution framework, with JupyterHub; the combined stack is containerized and deployed on a cloud instance. We intend to benchmark Dask as a distributed computing framework on our cluster by running computationally intensive hyperparameter tuning of the tree-based XGBoost algorithm on big data. By systematically varying the input format, chunk size, task scheduler, worker nodes, clusters, and threading configuration, we seek to quantify performance and compare it with baseline values obtained by running the same program on the instance without distributing the workload. The benchmark serves two purposes: (1) to compare the performance of computationally intensive ML algorithms run with and without Dask parallelization in the cloud, and (2) to understand in depth the components of distributed computing that affect its performance.
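The benchmark design described above can be sketched as a small harness that enumerates the configuration grid and computes speedup against the non-distributed baseline. This is only an illustrative sketch using the Python standard library, not the repository's actual code: the names `BenchmarkConfig`, `config_grid`, `time_run`, and `speedup`, as well as every axis value in `AXES`, are assumptions chosen for illustration, and the workload is a placeholder for the real XGBoost tuning job.

```python
import itertools
import time
from dataclasses import dataclass
from typing import Callable, Dict, Iterator


@dataclass(frozen=True)
class BenchmarkConfig:
    """One point in the benchmark grid (field names are hypothetical)."""
    input_format: str
    chunk_size: int
    scheduler: str
    n_workers: int
    threads_per_worker: int


def config_grid(axes: Dict[str, list]) -> Iterator[BenchmarkConfig]:
    """Yield every combination of the supplied axis values."""
    names = list(axes)
    for values in itertools.product(*(axes[name] for name in names)):
        yield BenchmarkConfig(**dict(zip(names, values)))


def time_run(workload: Callable[[], object], repeats: int = 3) -> float:
    """Return the best wall-clock time in seconds over `repeats` runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        workload()
        best = min(best, time.perf_counter() - start)
    return best


def speedup(baseline_seconds: float, distributed_seconds: float) -> float:
    """Speedup of a distributed run relative to the non-distributed baseline."""
    return baseline_seconds / distributed_seconds


# Hypothetical axis values, chosen only for illustration.
AXES = {
    "input_format": ["csv", "parquet"],
    "chunk_size": [10_000, 100_000],
    "scheduler": ["threads", "processes", "distributed"],
    "n_workers": [1, 2, 4],
    "threads_per_worker": [1, 2],
}

if __name__ == "__main__":
    configs = list(config_grid(AXES))
    print(f"{len(configs)} configurations to benchmark")
    # Placeholder workload; the real one would tune XGBoost under each config.
    elapsed = time_run(lambda: sum(range(100_000)))
    print(f"placeholder workload: {elapsed:.6f} s")
```

Taking the best of several repeats is one common way to reduce timing noise; reporting the mean or median would be an equally reasonable choice. In an actual run, the placeholder lambda would be replaced by a function that starts the cluster with the given configuration and runs the tuning job.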

About

A comparative study of the XGBoost algorithm's performance on a Dask-based cluster and a vanilla cluster

Languages

  • HTML 54.7%
  • Jupyter Notebook 45.3%