Materials for course: Introduction to Big Data with Apache Spark
- `core` - Apache Spark core examples
- `data` - data for the exercises
- `docker` - Docker used in training
- `exercises` - exercise questions
- `notebooks` - Jupyter notebooks
- `sql` - Apache Spark SQL examples
- `streaming` - Apache Spark Streaming examples
The following software packages are needed for this course:
- Git
- Python 3.4+, installed via Anaconda (contains the majority of necessary packages)
- PySpark (1.6.0+)
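To confirm that your interpreter meets the Python 3.4+ requirement before going further, a quick check using only the standard library:

```python
import sys

# The course materials assume Python 3.4 or newer (installed via Anaconda).
required = (3, 4)
if sys.version_info[:2] < required:
    raise SystemExit(
        "Python %d.%d+ is required, found %s"
        % (required + (sys.version.split()[0],))
    )
print("Python version OK:", sys.version.split()[0])
```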
The Docker setup requires moderate resources but ensures that everyone has a working environment for the training.
Setup steps:
- Download and install Git https://git-scm.com/downloads
- Download and install Docker following the instructions:
- https://docs.docker.com/windows/
- https://docs.docker.com/linux/
- https://docs.docker.com/mac/
- Use Docker Toolbox for Windows and Mac OS X https://www.docker.com/products/docker-toolbox
- (OS X / Win) Open Docker Quickstart Terminal (use `Terminal`, not `iTerm`)
- Go into this repository
- Build the Docker image: `docker-compose build`
- To start Docker, run: `docker-compose up`
- If one of the above docker commands fails, run `eval "$(docker-machine env default)"` and then retry the command, e.g. `docker-compose build`
- Jupyter runs on port 8888: on `localhost` on Linux, and on the Docker VM IP (available from `docker-machine ip`) on Mac OS X and Windows
- The `data` and `notebooks` directories are mounted directly from the host file system
- Note that the container will shut down when the current terminal session is closed
Potential issues:
- Setup can take some time as Docker pulls a number of images from the network
- Docker Toolbox with VirtualBox does not work well with Microsoft Hyper-V, which is used by newer Docker releases; remove Hyper-V before installing Docker Toolbox
- Docker sometimes has problems obtaining IP addresses on restrictive networks
- Put this repository in your home directory, as Docker can have issues mounting folders placed outside of the home directory
This local (non-Docker) setup requires the fewest resources but can be difficult to get working on Windows machines.
Setup steps:
- Download and install Git https://git-scm.com/downloads
- Download and install Anaconda Python 3.4+ https://www.continuum.io/downloads
- Download Spark from http://spark.apache.org/downloads.html
- You should add Spark to your PYTHONPATH
- You can also use Findspark package https://github.com/minrk/findspark
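findspark locates a Spark installation and puts its Python bindings on `sys.path`; the effect can be sketched by hand as below (the `/opt/spark` fallback location is only an example, point it at wherever you unpacked Spark):

```python
import glob
import os
import sys

def add_spark_to_path(spark_home):
    """Put Spark's Python bindings on sys.path, similar to what findspark does."""
    python_dir = os.path.join(spark_home, "python")
    # The py4j bridge that PySpark needs ships as a zip inside the Spark distribution.
    py4j_zips = glob.glob(os.path.join(python_dir, "lib", "py4j-*.zip"))
    sys.path[:0] = [python_dir] + py4j_zips
    return python_dir

# Hypothetical install location; adjust to your environment.
add_spark_to_path(os.environ.get("SPARK_HOME", "/opt/spark"))
```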
Most of the examples are written in Java 8 apart from Word Count examples, which are written in Java 7 and 8 and Scala; see the file suffixes.
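All of the Word Count examples implement the same computation; as a point of reference, that logic in plain Python (without Spark) is simply:

```python
from collections import Counter

def word_count(lines):
    """Count word occurrences across an iterable of text lines."""
    counts = Counter()
    for line in lines:
        # Split on whitespace, as the basic Word Count examples do.
        counts.update(line.split())
    return dict(counts)

print(word_count(["to be or not", "to be"]))
# {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

The Spark versions distribute exactly this split-and-count work across a cluster.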
The project is built with Apache Maven (http://maven.apache.org):
mvn clean
mvn install