Scalable Cloud-Based Distributed Computing for Efficient Big Data Analytics: A Dask Integration Approach
In this project, we are building a platform designed to scale dynamically with user demand. At its core is the integration of Dask, a parallel execution framework, with JupyterHub, containerized and deployed on a cloud instance. We intend to benchmark the performance of Dask as a distributed computing framework on our cluster by conducting computationally intensive hyperparameter tuning of the tree-based XGBoost algorithm on big data. Through systematic variations in input format, chunk size, task schedulers, worker nodes, cluster configuration, and threading configuration, we seek to quantify performance and compare it against baseline values obtained by running the program on the instance without distributing the workload. Our benchmarking serves two purposes: 1) to compare the performance of running computationally intensive ML algorithms with and without parallelizing the workload via Dask on the cloud, and 2) to understand in depth the many components of distributed computing that impact its performance.
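As a concrete illustration of one configuration point in such a benchmark, the sketch below shows distributed XGBoost training on a Dask cluster with timing instrumentation. This is not the project's actual code: the scheduler address, dataset path, block size, and hyperparameter values are hypothetical placeholders, and a full sweep would iterate over the configurations listed above.

```python
# Minimal sketch: one timed run of distributed XGBoost training on Dask.
# Assumes a Dask scheduler is already reachable; address and path are
# hypothetical placeholders.
import time

import dask.dataframe as dd
import xgboost as xgb
from dask.distributed import Client

client = Client("tcp://scheduler:8786")  # hypothetical scheduler address

# Load training data as a Dask DataFrame; blocksize sets the chunk size,
# one of the parameters varied in the benchmark.
df = dd.read_csv("s3://bucket/training-data-*.csv", blocksize="64MB")
X = df.drop(columns=["label"])
y = df["label"]

# DaskDMatrix distributes the data across workers for training.
dtrain = xgb.dask.DaskDMatrix(client, X, y)

# One hyperparameter setting; a grid search would vary e.g. max_depth
# and learning_rate across repeated timed runs.
params = {"objective": "binary:logistic", "tree_method": "hist", "max_depth": 6}

start = time.perf_counter()
result = xgb.dask.train(client, params, dtrain, num_boost_round=100)
print(f"trained in {time.perf_counter() - start:.1f}s")
```

The single-machine baseline would replace the Dask pieces with plain `xgb.train` on an in-memory dataset, so the timing comparison isolates the effect of distributing the workload.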