Johnny Foulds edited this page Aug 12, 2019 · 10 revisions

For this project, the dataset is explored with Apache Spark using the Scala programming language. Because data exploration is experimental by nature, Apache Zeppelin was chosen to work with the data interactively and to produce data visualizations on the fly.

Installing Zeppelin

The Zeppelin notebooks are run from a Docker container, which allows a development environment to be spun up quickly as needed to work on data that does not require massive processing power.

Instead of building a custom Dockerfile for Zeppelin, the official Apache Zeppelin image from Docker Hub is used.

docker pull apache/zeppelin:0.8.1

The container is run with the following external volumes:

  • logs - The Zeppelin logs, so that they are easy to view from the working machine.
  • notebooks - Keeping the notebooks in an external volume simplifies source control for the project and backing up work.
  • data - The external data to work on in notebooks.

docker run -p 8080:8080 --rm \
-v "$PWD/logs":/logs \
-v "$PWD/code/nyc-job-exploration/zeppelin/notebook":/notebook \
-v "$PWD/code/nyc-job-exploration/data":/data \
-e ZEPPELIN_LOG_DIR='/logs' \
-e ZEPPELIN_NOTEBOOK_DIR='/notebook' \
--name zeppelin apache/zeppelin:0.8.1
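
To confirm that the data volume is visible inside the container, the mounted directory can be listed from a notebook paragraph. The following is a minimal sketch using only the JVM standard library; the helper name `listDataFiles` is hypothetical, and `/data` is the container-side mount point from the command above.

```scala
import java.io.File

// Hypothetical helper: return the names of the regular files directly
// under `dir`, sorted. Returns an empty list if the directory is
// missing or cannot be read.
def listDataFiles(dir: String): List[String] = {
  val entries = Option(new File(dir).listFiles()).getOrElse(Array.empty[File])
  entries.filter(_.isFile).map(_.getName).sorted.toList
}

// "/data" is the mount point used in the docker run command.
listDataFiles("/data").foreach(println)
```

An empty result usually means the host path given to `-v` does not exist or contains no files.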

If Docker is running inside a virtual machine, port forwarding must also be configured so that Zeppelin can be opened as if it were running on the local machine.
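
One way to check from the local machine that the forwarded port is reachable is a plain TCP connection attempt. The sketch below is an assumption-laden illustration, not part of the original setup: the helper name `isPortOpen` and the 2000 ms timeout are hypothetical, and it only shows that something is listening on the port, not that it is Zeppelin.

```scala
import java.io.IOException
import java.net.{InetSocketAddress, Socket}

// Hypothetical helper: true if a TCP connection to host:port succeeds
// within timeoutMs milliseconds, false if it is refused or times out.
def isPortOpen(host: String, port: Int, timeoutMs: Int = 2000): Boolean = {
  val socket = new Socket()
  try {
    socket.connect(new InetSocketAddress(host, port), timeoutMs)
    true
  } catch {
    case _: IOException => false
  } finally {
    socket.close()
  }
}

// 8080 is the host port published by the docker run command.
println(s"Port 8080 reachable: ${isPortOpen("localhost", 8080)}")
```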

Once the container is running, Zeppelin can be opened at http://localhost:8080/ and the following code executed to make sure everything is up and running:

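// sc and spark are provided by Zeppelin's Spark interpreter
val nums = Array(1, 2, 3, 5, 6)
val rdd = sc.parallelize(nums)

// required for the toDF conversion
import spark.implicits._
val df = rdd.toDF("num")

// prints a single-column table containing the five numbers
df.show()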

Notebook Sharing

Zepl is an online "Data science notebook hub" that can be used to share Zeppelin notebooks: https://www.zepl.com

Linking JavaScript files from GitHub

Statically can be used to link to files in a GitHub repository. The following example shows a GitHub file URL followed by the equivalent generated Statically URL:

https://github.com/JohnnyFoulds/nyc-job-exploration/blob/master/zeppelin/notebook/word-cloud/d3.layout.cloud.js
https://cdn.statically.io/gh/JohnnyFoulds/nyc-job-exploration/700bbf40/zeppelin/notebook/word-cloud/d3.layout.cloud.js
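
The mapping between the two URLs can be sketched as a small Scala function. This is an illustration rather than Statically's own tooling: the name `toStaticallyUrl` is hypothetical, and it assumes the `https://cdn.statically.io/gh/<user>/<repo>/<ref>/<path>` layout seen in the example. It keeps whatever ref appears in the GitHub URL (here `master`), whereas the generated URL above is pinned to a commit hash. Branch names containing a slash are not handled.

```scala
// Hypothetical helper: convert a GitHub "blob" URL into a Statically CDN
// URL with the same ref. Returns None if the input does not look like a
// GitHub blob URL.
def toStaticallyUrl(githubUrl: String): Option[String] = {
  val BlobUrl = """https://github\.com/([^/]+)/([^/]+)/blob/([^/]+)/(.+)""".r
  githubUrl match {
    case BlobUrl(user, repo, ref, path) =>
      Some(s"https://cdn.statically.io/gh/$user/$repo/$ref/$path")
    case _ => None
  }
}

val github = "https://github.com/JohnnyFoulds/nyc-job-exploration/blob/master/zeppelin/notebook/word-cloud/d3.layout.cloud.js"
toStaticallyUrl(github).foreach(println)
```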
