- Homebrew:
If not installed, run this in the terminal:
/usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"
- Xcode command line tools:
If not installed, run
xcode-select --install
You probably have both of these installed already from earlier work in this course. They are not Spark specific.
- Use Homebrew to install java (Spark relies on the Java JVM to run). Run this command in the terminal:
brew cask install homebrew/cask-versions/java8
On earlier versions of OSX (<= 10.9) you may also need to run
brew cask install java
- Now install scala and apache-spark using Homebrew (note there is no cask this time):
brew install scala apache-spark
- First we need to get a Java install that's up to date
sudo apt-get install default-jdk
- Next, we need to get Scala, the language that Spark is built in
sudo apt-get install scala
- Now we're ready to download Spark. Go here: Spark DL. Once it's downloaded, you can uncompress it with
tar -xzvf spark-2.3.1-bin-hadoop2.7.tgz
- Now move to the bin folder so that the other commands below will work.
cd spark-2.3.1-bin-hadoop2.7/bin
Test if this worked by loading a spark shell in the terminal:
spark-shell
If that errors out, on Ubuntu you may need to run it as:
./spark-shell
After a bunch of warning messages, you should see
2018-05-31 13:33:19 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://172.16.0.216:4040
Spark context available as 'sc' (master = local[*], app id = local-1527791607666).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.0.1
      /_/
Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_102)
Type in expressions to have them evaluated.
Type :help for more information.
scala>
If you see something similar to this, congratulations! You have Spark installed. In the terminal, you will also see the line
Spark context Web UI available at http://172.16.0.216:4040
If you type the URL into your browser (in my case http://172.16.0.216:4040, but yours will be different), you will get a Spark status screen that reports on the status of the different jobs Spark is running.
Quit the spark-shell by typing :quit at the prompt:
scala> :quit
- Install pyspark:
conda install pyspark
You have now installed Spark and PySpark on your computer!
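As a quick, optional sanity check that the conda install worked, you can import the package in Python and print its version (the exact number will depend on what conda installed):
>>> import pyspark
>>> print(pyspark.__version__)   # prints the installed PySpark version, e.g. 2.3.1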
- Install the Python package
findspark
to help Jupyter Notebooks ..... find Spark
conda install -c conda-forge findspark
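For reference, this is roughly how findspark gets used at the top of a notebook. It's a minimal sketch, assuming Spark is discoverable via SPARK_HOME or a standard install location (you can also pass the install path to findspark.init() explicitly):
import findspark
findspark.init()   # locate the Spark installation and add it to sys.path
import pyspark     # this import now works inside the notebook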
3a. (ONLY ON MACs) PySpark will load, but it will have problems actually connecting to the Java Virtual Machine. You need to set the environment variable JAVA_HOME. Add the following line to ~/.bash_profile:
export JAVA_HOME=`/usr/libexec/java_home -v 1.8`
Either open a new terminal, or run the command
source ~/.bash_profile
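If you want to confirm the variable is actually visible to Python before starting Spark, a quick check (purely illustrative) is:
>>> import os
>>> print(os.environ.get("JAVA_HOME"))   # should print a Java 1.8 path, not None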
3b. (ONLY ON SOME UBUNTU) If you're getting errors when trying to use pyspark, check your version of Java with java -version.
If it's not a Java 8 version, then you'll need to switch to Java 8:
sudo add-apt-repository ppa:webupd8team/java
sudo apt update
sudo apt install oracle-java8-installer
Then open your ~/.bashrc
and add the following line
export JAVA_HOME=/usr/lib/jvm/java-8-oracle/jre
then run source ~/.bashrc
to make that active. Now your pyspark should find the right version of Java.
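If you want to double-check which Java a running PySpark session actually picked up, one debugging trick (a sketch only; _jvm is an internal attribute, not a stable public API) is to ask the JVM for its version property:
>>> import pyspark
>>> spark = pyspark.sql.SparkSession.builder.getOrCreate()
>>> print(spark.sparkContext._jvm.java.lang.System.getProperty("java.version"))   # expect something like 1.8.0_...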
- Test that everything worked
In the terminal, run
pyspark
You should see the same Spark welcome banner as before, followed by a Python >>> prompt.
- From the terminal, you can run
pyspark
to load Python with a Spark session and context already created for you:
pyspark
OR
- If you load
python
from the terminal, you can import pyspark as a package and create a context manually:
$ python
Python 3.6.5 |Anaconda custom (64-bit)| (default, Apr 26 2018, 08:42:37)
[GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyspark
>>> spark = pyspark.sql.SparkSession.builder.getOrCreate() # your context!
Either way, you can then run:
>>> a = spark.createDataFrame([[1,2,3],[4,5,6]])
>>> a.show()
You should see something like
+---+---+---+
| _1| _2| _3|
+---+---+---+
|  1|  2|  3|
|  4|  5|  6|
+---+---+---+
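The _1, _2, _3 headers are just Spark's default column names. If you want named columns, a small sketch (the names x, y, z here are placeholders, not anything required by Spark) is to pass a list of names as the second argument to createDataFrame:
>>> b = spark.createDataFrame([[1, 2, 3], [4, 5, 6]], ["x", "y", "z"])   # placeholder column names
>>> b.show()
+---+---+---+
|  x|  y|  z|
+---+---+---+
|  1|  2|  3|
|  4|  5|  6|
+---+---+---+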