Philosophy: Write your code on your local machine, then scale it out in the cloud when you want to use the whole dataset or run 10,000,000 epochs of your neural network. The cluster is provisioned, the data is processed and stored to external storage, and the cluster is destroyed. Minimal cost, maximal usage. Everything is automated.
What is Apache Spark?
The Spark cluster is provisioned using the HashiStack - Terraform and Consul - and Ansible. The cluster is provisioned on AWS; the work environment is Docker on Windows.
The idea is to automate the Spark cluster provisioning: clone the code from GitHub and run it on data from an external source (for example S3). Once the data processing is done, the results are saved to external storage and the cluster is destroyed.
- The Virtual Private Cloud is provisioned using Terraform. The VPC's parameters are stored in Consul. This is a long-lived provision and it serves multiple cluster provisions.
- Data is loaded into S3. This is optional, depending on the data storage solution; Hadoop/Hive on top of S3 or a NoSQL store are also options. The Data Scientist handles the data connection inside the code. If S3 is used, make sure you have access to the S3 storage. For more on this, check out this post.
- The code (or JAR application) to be run in Spark is put in a GitHub repository. More in [Data Science examples](#data-science-examples).
- Cluster provisioning creates the cluster, clones the repository and runs the code, which points to the data storage (an S3 bucket, for example) - see the PySpark sketch after this list.
- Data is processed and the results are saved to an external storage (S3).
- The cluster is destroyed after the processing is done.
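A minimal sketch of such a job in PySpark, assuming the S3A connector is available on the cluster; the bucket names, paths and column names are placeholders, not the project's actual examples:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# All parameters are hardcoded in the script - no command-line arguments.
# The bucket names, paths and column names below are placeholders.
INPUT_PATH = "s3a://my-input-bucket/raw/events.csv"
OUTPUT_PATH = "s3a://my-output-bucket/results/daily_counts"

spark = SparkSession.builder.appName("short-lived-cluster-job").getOrCreate()

# Read the raw data from S3.
events = spark.read.option("header", True).csv(INPUT_PATH)

# A trivial transformation: count events per day.
daily_counts = events.groupBy("event_date").agg(F.count("*").alias("events"))

# Save the results to external storage - after this the cluster holds nothing
# of value and can be destroyed.
daily_counts.write.mode("overwrite").parquet(OUTPUT_PATH)

spark.stop()
```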
- AWS credentials are needed. This is described in the [VPC on AWS](#vpc-on-aws) section. One way of passing the credentials to the Spark job is sketched after this list.
- This project depends on two other GitHub projects: Provision VPC on AWS, which builds the mandatory VPC, and the Docker project, which creates the container for development and testing.
- Configuration in [local Consul](#configuration-to-consul)
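A sketch of one way to hand the AWS credentials to Spark's S3A connector, assuming they are exported as environment variables; on EC2, an instance profile or the default credential chain can make this unnecessary:

```python
import os

from pyspark.sql import SparkSession

# Pass the AWS credentials to the S3A connector explicitly.
# Assumes AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY are set in the environment.
spark = (
    SparkSession.builder
    .appName("s3-access-check")
    .config("spark.hadoop.fs.s3a.access.key", os.environ["AWS_ACCESS_KEY_ID"])
    .config("spark.hadoop.fs.s3a.secret.key", os.environ["AWS_SECRET_ACCESS_KEY"])
    .getOrCreate()
)

# Quick sanity check that the bucket is readable (placeholder bucket name).
spark.read.text("s3a://my-input-bucket/raw/").limit(1).show()
```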
The GitHub project Provision VPC on AWS sets up the Virtual Private Cloud in AWS. One can first build a Docker container on Windows to prepare the development environment and the Consul agent where the configuration parameters are stored. The documentation in that project walks you through both steps.
The cluster provisioning uses Consul to fetch the parameters for the provisioned cluster. Externalization of the parameters is still work in progress. The configuration project stores YAML files, and these files are used to feed the global Consul in AWS. At runtime, Terraform connects to the global Consul through a local Consul agent.
The Spark cluster configuration lives in the spark.yml YAML file. The GitHub repository with the code has to be entered in the configuration; this code is cloned to the Spark client (which is also the master). If you plan to do some demanding work locally (for example with pandas), choose an instance type with more resources.
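As a rough illustration of how such a YAML file could feed the global Consul, here is a sketch using Consul's HTTP KV API; the spark.yml keys and the Consul key prefix are assumptions, not the project's actual schema:

```python
import requests  # HTTP client for Consul's KV API
import yaml      # PyYAML

# Hypothetical spark.yml layout - the real file's keys may differ:
# spark:
#   github_repo: https://github.com/example/spark-jobs.git
#   instance_type: t3.large
#   worker_count: 3

CONSUL_KV_URL = "http://localhost:8500/v1/kv"  # local Consul agent
KEY_PREFIX = "spark"                           # assumed key prefix

with open("spark.yml") as f:
    config = yaml.safe_load(f)

# Flatten the YAML into individual Consul keys, e.g. spark/github_repo.
for key, value in config.get("spark", {}).items():
    response = requests.put(f"{CONSUL_KV_URL}/{KEY_PREFIX}/{key}", data=str(value))
    response.raise_for_status()
```

Terraform can then read the same keys at provisioning time, for example through the Consul provider's key/value data sources.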
The Python script takes no arguments; all parameters have to be written in the code and are the Data Scientist's responsibility. The above-mentioned spark.yml file has a block describing all the parameters used.
This project holds some PySpark and Scala examples to test the automation - the provisioning process: once the Spark cluster is established, the GitHub repository is cloned to the Spark master and the code is run.
Even though this repository focuses on Apache Spark, the idea stays the same for any other service, distributed or single-server: pay as you go. Instead of a short-lived Spark cluster, a Hadoop cluster can be provisioned, or just an R server with huge resources to do the job.