rupeshtiwari/learning-apache-spark

Install & Run Spark on your Mac and on AWS Cloud: a step-by-step guide

http://www.rupeshtiwari.com/learning-apache-spark/

You will be able to install Spark and run both the Spark shell and the PySpark shell on your Mac.

Step 1: Install Java on your Mac

Follow the steps to install Java on your Mac machine; Spark needs a JVM to run.

Step 2: Install Spark on your Mac

  • Go to the Apache Spark site and download the latest version: https://spark.apache.org/downloads.html

  • Create a new folder named spark3 in your home directory

  • Move the downloaded tar file into the newly created spark3 folder

  • Untar the archive from inside spark3: tar -zxvf spark-3.2.1-bin-hadoop3.2.tgz

  • Set SPARK_HOME to point to the extracted folder: export SPARK_HOME=~/spark3/spark-3.2.1-bin-hadoop3.2

  • Also add this export line to your shell startup file: run vim ~/.zshrc, press “i” to edit, then press Escape and type :wq to save the file

  • Open a new terminal and check the Spark home path

    • echo $SPARK_HOME
  • Next, add the Spark bin folder to your default $PATH

    • Update $PATH with: export PATH=$PATH:$SPARK_HOME/bin
    • Add this line to your startup file as well
  • Now you can start the Spark shell by typing spark-shell
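
If spark-shell is not found after these steps, a small check of the environment can help narrow down which step was missed. The following is a minimal sketch, not part of the original guide: the function name check_spark_env is hypothetical, and it only assumes the layout described above (a bin folder containing spark-shell under SPARK_HOME).

```python
import os
from pathlib import Path


def check_spark_env(env):
    """Return a list of problems found in the Spark-related environment variables."""
    spark_home = env.get("SPARK_HOME")
    if not spark_home:
        return ["SPARK_HOME is not set"]

    problems = []
    bin_dir = Path(spark_home).expanduser() / "bin"
    if not (bin_dir / "spark-shell").is_file():
        problems.append(f"spark-shell not found in {bin_dir}")

    # PATH is a list of folders separated by os.pathsep (":" on macOS).
    path_entries = [Path(p).expanduser() for p in env.get("PATH", "").split(os.pathsep) if p]
    if bin_dir not in path_entries:
        problems.append(f"{bin_dir} is not on PATH")
    return problems


# Inspect the environment of the current process and report anything missing.
for problem in check_spark_env(os.environ):
    print("WARNING:", problem)
```

Run it from a new terminal so that it sees the variables set by your startup file. No output means all three checks passed.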

Step 3: Installing Python 3

If you already have Python 3, skip this step. To check, type python3 in your terminal.

  1. Install Python 3: brew install python3

  2. Check that Python 3 is installed: python3

  3. Next, set the PYSPARK_PYTHON environment variable to point to python3: export PYSPARK_PYTHON=python3

  4. Check the value: echo $PYSPARK_PYTHON

  5. Also add this export line to your shell startup file (.zshrc in my case).
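
PYSPARK_PYTHON holds a command name, so it is worth confirming which interpreter that name resolves to. Here is a small sketch under that assumption; the helper interpreter_version is a hypothetical name, and it falls back to python3 when the variable is not set.

```python
import os
import shutil
import subprocess


def interpreter_version(name):
    """Resolve an interpreter name via PATH and return (full_path, version_string)."""
    path = shutil.which(name)
    if path is None:
        return None, None
    result = subprocess.run([path, "--version"], capture_output=True, text=True, check=True)
    # Python 3 prints its version to stdout; very old interpreters used stderr.
    return path, (result.stdout or result.stderr).strip()


# Report the interpreter that PYSPARK_PYTHON (or python3 by default) points to.
print(interpreter_version(os.environ.get("PYSPARK_PYTHON", "python3")))
```

A result of (None, None) means the name is not on your PATH, in which case step 1 or your startup file needs another look.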

Step 4: Running the PySpark shell on your Mac laptop

Now run pyspark to start the Spark shell in Python.

Running a small program with the PySpark shell on your Mac laptop

You will learn about the Spark shell, local cluster, driver, executor, and Spark context UI.

Cluster | Mode        | Tool
Local   | Client Mode | spark-shell
# Navigate to the Spark bin folder
cd $SPARK_HOME/bin

# 1. Start the PySpark shell
pyspark

# 2. Inside the shell: read and display a JSON file whose records span multiple lines (pretty-printed), as in my example
df = spark.read.option("multiline","true").json("/Users/rupeshti/workdir/git-box/learning-apache-spark/src/test.json")
df.show()

👉 option("multiline","true") is important if your JSON spans multiple lines, for example after being formatted by Prettier or any other formatter
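
To see why the option matters: by default Spark's JSON reader expects one complete JSON record per line (the JSON Lines layout). The sketch below imitates that line-by-line behaviour with Python's standard json module. It illustrates the idea and is not Spark's implementation; the sample records are made up.

```python
import json


def parse_one_record_per_line(text):
    """Parse text the JSON Lines way: every non-empty line must be a complete JSON value."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


# Hypothetical sample data: the same kind of record in two layouts.
json_lines = '{"name": "a", "age": 1}\n{"name": "b", "age": 2}\n'
pretty_printed = '{\n  "name": "a",\n  "age": 1\n}\n'

print(parse_one_record_per_line(json_lines))  # both records parse

try:
    parse_one_record_per_line(pretty_printed)
except json.JSONDecodeError as err:
    # The first line is just "{", which is not a complete JSON value on its own.
    print("line-by-line parsing failed:", err)

# Treating the whole text as one document works; this is what multiline asks Spark to do.
print(json.loads(pretty_printed))
```

In Spark itself the failure is usually quieter: in the default permissive mode, lines that cannot be parsed tend to show up in a _corrupt_record column instead of raising an error.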

Analyzing Spark jobs using the Spark context web UI on your Mac laptop

To monitor and investigate your Spark application, you can use the Spark context web UI.

  • Go to http://localhost:4040/jobs/
  • Check the Event Timeline: it shows that Spark started and executed the driver process
  • We do not see a separate executor process because we are in a local cluster. Everything runs in a single JVM, which acts as both driver and executor.
  • When the cluster was created we did not pass a number of threads, so the shell used its default master, local[*], which gave 8 threads based on the hardware resources available on my laptop.
  • Storage Memory shows a maximum of 434.4 MB. This is the total for the whole JVM.
  • You can access the Spark context UI only while your Spark shell is open. Once you quit the Spark shell, you lose access to this UI.
  • On a real cluster, each executor is a JVM that runs on a worker machine. You do not control which executor runs on which worker machine; the cluster manager assigns executors to worker machines.
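
The same information is also available programmatically, because the web UI serves a REST API under /api/v1 on the same port. The sketch below is one way to read the executor list from it. The helper names are hypothetical, and the JSON field names (id, totalCores, maxMemory) are taken from Spark's monitoring documentation as I understand it, so verify them against your Spark version.

```python
import json
from urllib.request import urlopen

UI_BASE = "http://localhost:4040/api/v1"  # same host and port as the web UI


def fetch_json(url, timeout=5):
    """GET a URL and decode the JSON body."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


def summarize_executors(executors):
    """Reduce the executor list to id, core count, and maximum memory in MB."""
    return [
        {
            "id": e.get("id"),
            "cores": e.get("totalCores"),
            "max_memory_mb": round(e.get("maxMemory", 0) / (1024 * 1024), 1),
        }
        for e in executors
    ]


def describe_running_app(base=UI_BASE):
    """Print executor details for the first application served by the UI."""
    apps = fetch_json(f"{base}/applications")
    app_id = apps[0]["id"]
    for row in summarize_executors(fetch_json(f"{base}/applications/{app_id}/executors")):
        print(row)


# Call describe_running_app() while the pyspark shell is still open.
```

In local mode you should expect a single entry, for the driver, which matches the single-JVM observation above.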

Running Jupyter Notebook (in local cluster and client mode) on your Mac laptop

Data scientists use Jupyter Notebook to develop and explore applications step by step. Spark programming in Python requires Python on your machine.

Cluster | Mode        | Tool
Local   | Client Mode | Notebook

If you install the Anaconda environment, you get a Python development environment that can also be used with Spark. You can download and install the Anaconda community edition. Anaconda comes with a pre-configured Jupyter notebook.

How to use Spark from a Jupyter Notebook?

A notebook is a shell-based environment: you type your code in a cell and run it.

  1. Set the SPARK_HOME environment variable
  2. Install the findspark package
  3. Initialize findspark, which makes the connection between the Anaconda Python environment and your Spark installation

Step 1: Setting environment variable and starting notebook

After installing Anaconda, type jupyter notebook in the terminal. Your default browser opens at http://localhost:8888/tree, and the notebook server keeps running in the terminal. You now have a Jupyter notebook environment.

Go to the desired folder and create a new Python 3 notebook.

import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.read.option("multiline","true").json("/Users/rupeshti/workdir/git-box/learning-apache-spark/src/test.json").show()

You get the error ModuleNotFoundError: No module named 'pyspark' because the notebook's Python environment is not yet connected to your Spark installation.

Step 2: Installing findspark

# 1. Install pipx
brew install pipx

# 2. Install findspark
pip3 install findspark

Step 3: Connecting Spark with the notebook shell

The script below connects the notebook to Spark.

import findspark
findspark.init()

Final notebook code is:

import findspark
findspark.init()

import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
spark.read.option("multiline","true").json("/Users/rupeshti/workdir/git-box/learning-apache-spark/src/test.json").show()
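
It can help to know roughly what findspark.init() does to fix the ModuleNotFoundError: it locates the Python sources shipped inside SPARK_HOME and puts them on sys.path. The sketch below approximates that idea with the standard library only. It is not findspark's actual source, and the function names are hypothetical.

```python
import glob
import os
import sys


def spark_python_paths(spark_home):
    """Return the folder and archive that make `import pyspark` work."""
    spark_python = os.path.join(spark_home, "python")
    # Spark ships py4j as a zip archive under python/lib; its version varies by release.
    py4j_archives = sorted(glob.glob(os.path.join(spark_python, "lib", "py4j-*.zip")))
    if not py4j_archives:
        raise FileNotFoundError(f"no py4j archive under {spark_python}/lib")
    return [spark_python, py4j_archives[0]]


def add_spark_to_sys_path(spark_home=None):
    """Prepend the Spark Python paths to sys.path, roughly like findspark.init()."""
    spark_home = spark_home or os.environ["SPARK_HOME"]
    for path in reversed(spark_python_paths(spark_home)):
        if path not in sys.path:
            sys.path.insert(0, path)


# In a notebook: add_spark_to_sys_path(), then `import pyspark` should succeed.
```

This also explains why SPARK_HOME must be set before starting the notebook: without it there is nothing to search.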

Installing Multi-Node Spark Cluster in AWS Cloud

On AWS, the Amazon EMR (Elastic MapReduce) service can be used to create a Hadoop cluster with Spark.

Cluster | Mode        | Tool
YARN    | Client Mode | spark-shell, Notebook

This mode is used by data scientists for interactive exploration directly against a production cluster. In most cases we use notebooks for their web-based interface and graphing capability.

Step 1: Creating an EMR cluster on AWS cloud

We will create a Spark shell on a real multi-node YARN cluster.

What is Amazon EMR?

Amazon EMR is the industry-leading cloud big data platform for data processing, interactive analysis, and machine learning using open-source frameworks such as Apache Spark, Apache Hive, and Presto.

Benefits of using Amazon EMR are:

  1. You do not need to manage compute capacity or open-source applications, which saves you time and money.
  2. Amazon EMR lets you set up scaling rules to manage changing compute demand.
  3. You can set up CloudWatch alerts to notify you of changes in your infrastructure so you can take action immediately.
  4. EMR has an optimized runtime that speeds up your analysis and saves both time and money.
  5. You can submit your workload to either EC2 or EKS using EMR.

  • I am creating a cluster with 1 master and 3 worker nodes.

  • Use Spark version 2.4.8

  • We will use a notebook, so let's also select Zeppelin 0.10.0

  • Create the cluster

  • Note that 3 worker (executor) and 1 master (driver) EC2 instances are created.

  • Go to the security group of the master node, add a new rule, and allow all traffic from your IP address.

  • SSH to the master instance: ssh -i "fsm01.pem" hadoop@ec2-18-209-11-152.compute-1.amazonaws.com

👉 Make sure to log in as the hadoop user
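
The console steps above can also be scripted. The following is a sketch of the equivalent request for boto3's EMR run_job_flow call, with several assumptions that are not in the original: the instance type m5.xlarge, the default EMR role names, and the helper names are all illustrative. The release label is left as a parameter because it must be an EMR release that ships Spark 2.4.8 and Zeppelin 0.10.0, which you should look up in the EMR release notes.

```python
def build_cluster_request(name, release_label, key_name, instance_type="m5.xlarge"):
    """Build keyword arguments for boto3's EMR run_job_flow call."""
    return {
        "Name": name,
        "ReleaseLabel": release_label,  # assumption: chosen by you from the EMR release notes
        "Applications": [{"Name": "Spark"}, {"Name": "Zeppelin"}],
        "Instances": {
            "InstanceGroups": [
                {"Name": "Master", "InstanceRole": "MASTER",
                 "InstanceType": instance_type, "InstanceCount": 1},
                {"Name": "Workers", "InstanceRole": "CORE",
                 "InstanceType": instance_type, "InstanceCount": 3},
            ],
            "Ec2KeyName": key_name,  # key pair used later for SSH
            "KeepJobFlowAliveWhenNoSteps": True,  # keep the cluster up for interactive use
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",  # assumption: default roles already exist
        "ServiceRole": "EMR_DefaultRole",
    }


def create_cluster(request, region="us-east-1"):
    """Send the request to EMR; needs boto3 installed and AWS credentials configured."""
    import boto3  # imported lazily so the request can be built without AWS access
    return boto3.client("emr", region_name=region).run_job_flow(**request)["JobFlowId"]


# Example: request = build_cluster_request("spark-demo", "<release-label>", "fsm01")
#          cluster_id = create_cluster(request)
```

The security group rule and the SSH login still have to be done as described above; this sketch only covers cluster creation.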

Step 2: Running PySpark on EMR cluster in AWS cloud using Spark-Shell

  • Run pyspark to create a Spark shell. When you want to quit the shell, press Ctrl+D.
  • My Spark shell is running. My driver and executors are already created and waiting for me to submit Spark commands.
  • You can open the Spark context UI to analyze jobs by clicking on the Spark history server link in the EMR cluster page on AWS.
  • Go to the Spark history server URL
    • It shows the list of applications that you executed in the past.
    • If it is not showing any application yet, close your pyspark shell. Each time you open and close pyspark, the session is recorded as one application.
    • I opened and closed the pyspark shell 4 times.
    • Open any one of them and go to the event timeline.
    • Note that you got 1 driver and 3 executors, which is what you asked for when you created your cluster.
    • Click on the Executors tab, note the 3 executors, and check their memory allocation.

Step 3: Running PySpark on Notebook on EMR cluster at AWS cloud using Zeppelin

👉 Note: in the real world you will mostly not use the pyspark shell; people use notebooks. Therefore, we are going to use a Zeppelin notebook next.

Visit the Zeppelin URL

In a secured enterprise setup, you have to ask your cluster operations team to provide the URL and grant you access to it.

A notebook is not like the Spark shell: it is not connected to Spark by default, so you have to run a Spark command to connect. Simply running spark.version is enough.

Create a new notebook and run spark.version. The default Zeppelin notebook shell is a Scala shell. Therefore, use the interpreter directive %pyspark so that you can run Python code.

Working with spark-submit on EMR cluster

Cluster | Mode         | Tool
YARN    | Cluster Mode | spark-submit

This mode of operation is mostly used for executing Spark applications on your production cluster. Run spark-submit --help to see all options.

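Let's create and submit a Spark application.

  • Create main.py on the master node
import sys

# Read the two operands from the command line
x = int(sys.argv[1])
y = int(sys.argv[2])
total = x + y  # avoid shadowing the built-in sum()
print("The addition is :", total)
  • Submit the application: spark-submit main.py 1 3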
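
The command above relies on the cluster's default master and deploy mode. To request YARN cluster mode explicitly, spark-submit accepts the --master and --deploy-mode flags. The sketch below assembles such a command as an argument list; the helper name build_spark_submit is hypothetical.

```python
import shlex


def build_spark_submit(app, app_args, master="yarn", deploy_mode="cluster"):
    """Assemble a spark-submit command line as an argument list."""
    return ["spark-submit", "--master", master, "--deploy-mode", deploy_mode,
            app, *map(str, app_args)]


command = build_spark_submit("main.py", [1, 3])
print(shlex.join(command))
# prints: spark-submit --master yarn --deploy-mode cluster main.py 1 3

# To launch it from Python on the master node:
#   import subprocess; subprocess.run(command, check=True)
```

In cluster mode the driver runs inside YARN, so the output of print does not appear in your terminal. Look for it in the application's YARN logs, for example with yarn logs -applicationId followed by the application ID that spark-submit reports.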

Todo

https://learning.oreilly.com/videos/strata-hadoop/9781491924143/9781491924143-video210705/

git pull && git add . && git commit -m 'adding new notes' && git push