The repository contains the source code and dataset to reproduce the parallel computing exercise described in the paper:
This paper provides a practical programming guide to setting up a minimum working example of a distributed system for parallel time series analysis. The system is built in Apache Spark on top of Amazon's Hadoop-based service Elastic MapReduce (EMR). A simple forecasting exercise with 1,000 time series illustrates the proposed parallelization scheme, which reduces total runtime by about 95% relative to a single-core, single-machine setting. The ease of implementing this scheme makes this guide a useful reference for econometricians with a limited background in parallel programming. To facilitate reproducibility of the practical steps in this guide, the PySpark/Python code is available for download on GitHub.
Link to the paper: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3226976
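
The forecasting exercise in the abstract is parallel across series: each time series can be forecast independently of the others, so the work can be spread over Spark executors with a plain `map`. The sketch below illustrates that idea only and is not the code from this repository. The function names (`make_series`, `forecast_one`, `forecast_all`), the synthetic random-walk data, and the moving-average forecast are all assumptions made for illustration; the paper's actual model and dataset may differ.

```python
import random


def make_series(n_series=1000, length=100, seed=0):
    """Generate synthetic random-walk series keyed by an integer id.

    Illustrative data only; this is not the dataset shipped with the repository.
    """
    rng = random.Random(seed)
    data = {}
    for series_id in range(n_series):
        level, path = 0.0, []
        for _ in range(length):
            level += rng.gauss(0.0, 1.0)
            path.append(level)
        data[series_id] = path
    return data


def forecast_one(item, window=5):
    """One-step-ahead forecast for a single (id, values) pair.

    Uses the mean of the last `window` observations as a stand-in model
    (an assumption, chosen to keep the example dependency-free).
    """
    series_id, values = item
    tail = values[-window:]
    return series_id, sum(tail) / len(tail)


def forecast_all(series, sc=None):
    """Forecast every series, serially or on Spark.

    With sc=None the series are processed one by one on a single core.
    With a SparkContext, the same function is mapped over an RDD so that
    each executor handles a subset of the series.
    """
    items = list(series.items())
    if sc is None:
        return dict(map(forecast_one, items))
    return dict(sc.parallelize(items).map(forecast_one).collect())
```

Calling `forecast_all(make_series())` runs the single-core baseline. On a machine or EMR cluster with PySpark installed, passing a context, for example `forecast_all(make_series(), sc=SparkContext.getOrCreate())` after `from pyspark import SparkContext`, distributes the same per-series function across executors.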