
User Guide for DDS 2.3.1

Importing DDS

In order to use DDS in your Spark shell, you need to add it to your classpath and import the DDS core functions. Depending on the Scala version your Spark is compiled with, select the correct jar. Assuming your Spark is built with Scala 2.10, start your Spark shell with the following parameter and import the core functionality:

spark-shell --jars spawncamping-dds-2.3.1_2.10.jar
import de.frosner.dds.core.DDS._

The Web UI

DDS comes with a lightweight web server that serves the results and charts to your browser. It pushes JSON objects to the JavaScript front-end that will then display them using HTML, CSS and SVG. The server needs to be started once after the Spark shell has loaded. It can be used for the entire session. However, you can stop and restart it as often as you like.

Note that a visualization is served only once, so currently only one browser window can consume a given result (the one that requests it first). However, if you want to keep the current visualization from being refreshed, there is a small lock icon in the top right corner of the screen. Clicking this icon prevents the front-end from talking to the back-end, which allows you to have multiple visualizations open at once.

You can change the visualization title in the top left corner to save it for later reference or show it to your colleagues.

Starting the server

The server can be started by calling the start() function in the Spark shell. You can also specify the interface and port the server should listen to. To start the server listening to 192.168.0.5:8081, execute the following command:

start("192.168.0.5", 8081)

You can also protect your web UI with a password:

start("192.168.0.5", 8081, "pasw1234")

Note that this mechanism uses basic HTTP authentication, which transmits the password unencrypted. You should therefore not reuse any important passwords here, and you should not rely on the password staying secure when it is transmitted over the internet.

Stopping the server

The server can be stopped by calling the stop() function in the Spark shell.

Available Functions

To get a list of all available functions, use the help() function. To get help for a particular function, pass the function name to help(name: String).
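
For example:

help()        // list all available functions
help("bar")   // detailed help for the bar functions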

Besides the functions based on RDDs, DDS offers a bunch of generic plotting functions not related to Spark. They can be used to plot charts or display tables in the browser from any data source you have.

Scala Functions

line

line[N](values: Seq[N])(implicit num: Numeric[N])

Prints a line chart with the given values and a default label.
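
For example, plotting four values as a single line:

line(List(1, 4, 2, 3))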

lines

lines[N](labels: Seq[String], values: Seq[Seq[N]])(implicit num: Numeric[N])

Prints a line chart with multiple lines. The label labels(x) corresponds to the value sequence values(x).
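
For example, two labeled lines over the same x indices:

lines(List("min", "max"), List(List(1, 2, 3), List(2, 4, 6)))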

bar (indexed)

bar[N](values: Seq[N])(implicit num: Numeric[N])

Prints a bar chart showing the given counts on an indexed axis and a default label.

bar (categorical)

bar[N](values: Seq[N], categories: Seq[String])(implicit num: Numeric[N])

Prints a bar chart showing the given counts on a categorical axis with a default label. There must be as many categories as there are counts.
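
For example, three counts with three matching categories:

bar(List(5, 3, 8), List("apples", "bananas", "cherries"))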

bars (indexed)

bars[N](labels: Seq[String], values: Seq[Seq[N]])(implicit num: Numeric[N])

Prints a bar chart with multiple bars on an indexed axis. The label labels(x) corresponds to the value sequence values(x).

bars (categorical)

bars[N](labels: Seq[String], values: Seq[Seq[N]], categories: Seq[String])
       (implicit num: Numeric[N])

Prints a bar chart with multiple bars on a categorical axis. The label labels(x) corresponds to the value sequence values(x). There must be a category for each value.

pie

pie[K, V](keyValuePairs: Iterable[(K, V)])(implicit num: Numeric[V])

Prints a pie chart with one segment for each key-value pair in the input collection. The keys are assumed to be unique (i.e. the values should already be reduced).
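
For example, plotting already reduced key-value pairs:

pie(List(("yes", 10), ("no", 5), ("maybe", 2)))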

scatter

scatter[N1, N2](values: Seq[(N1, N2)])
               (implicit num1: Numeric[N1] = null, num2: Numeric[N2] = null)

Prints a scatter plot of the given (x, y) sequence. X and Y axes can be either numeric or non-numeric.
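
For example, combining a numeric x axis with a non-numeric y axis:

scatter(List((1, "a"), (2, "b"), (3, "a")))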

heatmap

heatmap[N](values: Seq[Seq[N]], rowNames: Seq[String] = null, 
           colNames: Seq[String] = null)(implicit num: Numeric[N])

Draws a heat map visualizing the given matrix. values(i)(j) corresponds to the element in the ith row and jth column of the matrix. If no row or column names are specified, they will be numbered.
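
For example, a 2x2 matrix with named rows and columns:

heatmap(List(List(1, 2), List(3, 4)), List("row1", "row2"), List("col1", "col2"))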

histogram

histogram[N1, N2](bins: Seq[N1], frequencies: Seq[N2])
                 (implicit num1: Numeric[N1], num2: Numeric[N2])

Plots a histogram chart visualizing the given bins and frequencies. The bins are defined by their borders. To specify n bins, you need to pass n+1 borders and n frequencies.

Example:

  • 5 people are between 0 and 18 years old, 10 people between 18 and 25
  • bins = [0, 18, 25], frequencies = [5, 10]
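
Expressed as a call:

histogram(List(0, 18, 25), List(5, 10))   // 2 bins require 3 borders and 2 frequencies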

show

show[V](sequence: Seq[V])(implicit tag: TypeTag[V])

Shows the given value sequence in a table. DDS can show sequences of simple values (e.g. strings, numbers) or composite ones (collections, case classes). The values of that sequence need to have a type tag, i.e. if they are custom classes they need to be defined top level.
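
A minimal sketch, using a hypothetical case class (which must be defined top level):

case class Person(name: String, age: Int)   // hypothetical example class
show(List(Person("Ann", 31), Person("Bob", 25)))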

table

table(head: Seq[String], rows: Seq[Seq[Any]])

Prints a table with the given row-wise data. head(x) is the header of column x, i.e. it labels the values rows(i)(x).

If the table contains optional values (represented by Scala's Option[T] monad), missing values are treated as JavaScript null by the front-end and displayed as empty cells.
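
For example, with a missing optional value in the second row:

table(List("id", "name"), List(List(1, "Ann"), List(2, None)))   // None is shown as an empty cell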

graph

graph[ID, VL, EL](vertices: Seq[(ID, VL)], edges: Iterable[(ID, ID, EL)])

Displays a graph defined by the given vertices and edges. A vertex is represented by a pair of an identifier and a vertex label (VL). An edge is represented by a triple of source vertex identifier, target vertex identifier, and edge label (EL).
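
For example, two labeled vertices connected by one labeled edge:

graph(List((1, "Alice"), (2, "Bob")), List((1, 2, "knows")))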

Spark Core Functions

show

show[V](rdd: RDD[V], sampleSize: Int = 100)(implicit tag: TypeTag[V])

Prints the first lines of the given RDD in a table. The second (optional) argument determines the number of rows to show. DDS can show RDDs of simple values (e.g. strings, numbers) or composite ones (collections, case classes). The values of that RDD need to have a type tag, i.e. if they are custom classes they need to be defined top level.
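
For example, in the Spark shell (where sc is the Spark context):

show(sc.parallelize(1 to 1000), sampleSize = 10)   // table of the first 10 elements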

bar

bar[V](values: RDD[V])

Prints a bar chart visualizing the count of all distinct values in this RDD. It is recommended to execute it only on non-numeric value RDDs. Use the histogram function for numeric RDDs instead.

pie

pie[V](values: RDD[V])

Prints a pie chart visualizing the count of all distinct values in this RDD. It is recommended to execute it only on non-numeric value RDDs. Use the histogram function for numeric RDDs instead.

histogram (fixed number of buckets)

histogram[N](values: RDD[N], (optional) numBuckets: Int)(implicit num: Numeric[N])

Prints a histogram visualizing the given numerical RDD. The second (optional) parameter specifies the number of evenly distributed buckets. If no number is given, DDS will apply Sturges' formula to compute the optimal number of bins (assuming a Gaussian distribution).

histogram (fixed bucket borders)

histogram[N1, N2](values: RDD[N1], buckets: Seq[N2])
                 (implicit num1: Numeric[N1], num2: Numeric[N2])

Prints a histogram visualizing the given numerical RDD. The second parameter specifies the buckets to use. Example: To get two buckets, one from 0 to 5 and another one from 5 to 10, pass buckets = List(0, 5, 10).
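
The example from the description as a call:

histogram(sc.parallelize(List(1, 4, 6, 8, 9)), List(0, 5, 10))   // two buckets: 0 to 5 and 5 to 10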

groupAndPie

groupAndPie[K, N](readyToGroup: RDD[(K, N)])
                 (reduceFunction: (N, N) => N)
                 (implicit num: Numeric[N])

Computes a pie chart visualizing the numeric values per group. It is assumed that there are key-value pairs in each input row, where the key can be used for grouping. DDS will apply the given reduce function to the values of each group before plotting the reduced value in a segment.
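
For example, summing the values within each key:

groupAndPie(sc.parallelize(List(("a", 1), ("a", 2), ("b", 3))))(_ + _)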

pieGroups

pieGroups[K, N](grouped: RDD[(K, Iterable[N])])
               (reduceFunction: (N, N) => N)
               (implicit num: Numeric[N])

Computes a pie chart visualizing the numeric values per group. It is assumed that there is one input row per group; such an RDD is usually the result of a group-by operation. DDS will apply the given reduce function to the values of each group before plotting the reduced value in a segment.
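
For example, with input that has already been grouped:

pieGroups(sc.parallelize(List(("a", 1), ("a", 2), ("b", 3))).groupByKey())(_ + _)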

summarize

summarize[N](values: RDD[N])(implicit num: Numeric[N] = null)

Shows some basic summary statistics for the given RDD. If the RDD has numeric values, count, sum, min, max, mean, stdev, and variance are calculated. If the RDD has nominal values, DDS will calculate mode and cardinality for you.

groupAndSummarize

groupAndSummarize[K, N](readyToGroup: RDD[(K, N)])(implicit num: Numeric[N])

Shows some basic summary statistics for each of the groups defined by the given key. It is assumed that there are key-value pairs in each input row, where the key can be used for grouping.

summarizeGroups

summarizeGroups[K, N](grouped: RDD[(K, Iterable[N])])
                     (implicit num: Numeric[N])

Shows some basic summary statistics for each of the given groups. It is assumed that there is one input row per group; such an RDD is usually the result of a group-by operation.

median

median[N: ClassTag](values: RDD[N])
                   (implicit num: Numeric[N] = null)

Calculates the median of a given numeric data set. Note that this operation can be computationally expensive, as it requires a distributed sort and lookup.

Spark SQL Functions

show

show(dataFrame: DataFrame, sampleSize: Int = 100)

Prints the first lines of the given data frame in a table. The second (optional) argument determines the number of rows to show.

DDS will display the column type and the nullable flag in the column header. A nullable age column would show Age [Double*] (the asterisk marking nullability), while a non-nullable id column might show ID [String].

bar

bar(dataFrame: DataFrame, (optional) nullValue: Any)

Prints a bar chart visualizing the count of all distinct values in a data frame column. You can optionally specify a null value to replace missing values with. Otherwise, DDS will use Scala's Option to represent nullable columns. It is recommended to execute it only on non-numeric columns. Use the histogram function for numeric columns instead.
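
A minimal sketch, assuming a hypothetical data frame df with a nominal, nullable color column:

bar(df.select("color"), "unknown")   // missing colors are counted as "unknown"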

pie

pie(dataFrame: DataFrame, (optional) nullValue: Any)

Prints a pie chart visualizing the count of all distinct values in a data frame column. You can optionally specify a null value to replace missing values with. Otherwise, DDS will use Scala's Option to represent nullable columns. It is recommended to execute it only on non-numeric columns. Use the histogram function for numeric columns instead.

histogram (fixed number of buckets)

histogram(dataFrame: DataFrame, (optional) numBuckets: Int)

Prints a histogram visualizing the given numerical data frame column. The second (optional) parameter specifies the number of evenly distributed buckets. If no number is given, DDS will apply Sturges' formula to compute the optimal number of bins (assuming a Gaussian distribution).

histogram (fixed bucket borders)

histogram[N](dataFrame: DataFrame, buckets: Seq[N])
            (implicit num: Numeric[N])

Prints a histogram visualizing the given numerical data frame column. The second parameter specifies the buckets to use. Example: To get two buckets, one from 0 to 5 and another one from 5 to 10, pass buckets = List(0, 5, 10).

median

median(dataFrame: DataFrame)

Calculates the median of a given numeric data frame column. Note that this operation can be computationally expensive, as it requires a distributed sort and lookup.

correlation

correlation(dataFrame: DataFrame)

Computes and shows the Pearson correlation matrix of all numerical columns in the given table. If there are missing values, DDS will only use pairwise complete observations for calculating the correlation of two columns / variables.

mutualInformation

mutualInformation(dataFrame: DataFrame, (optional) normalization: String)

Computes and shows a mutual information matrix of all columns in the given table. Numerical values are binned automatically, but date values are not binned yet, so it is strongly advised to bin them yourself before applying this function.

The mutual information will be rescaled using the maximum entropy of both variables. If no normalization is desired, the second parameter can be set to "none". Please note that in case the entropy of a column is 0, the normalization might not be defined. This is reflected in a black cell in the heatmap.

summarize

summarize(dataFrame: DataFrame)

Computes a summary of the given data frame by showing summary statistics of each column, as well as an aggregated graphical representation (bar chart or histogram) based on the data type. Supported data types are numeric, date and nominal.

dashboard (alpha)

dashboard(dataFrame: DataFrame)

Gives an overview of the given data set in the form of a dashboard. The dashboard contains a sample of 100 rows, a correlation and a mutual information matrix, as well as summary statistics for each column.

However, this function is still in alpha stage as it uses mutual information without binning on numerical columns and is not very efficient in terms of layout 👎

Spark GraphX Functions

showVertexSample

showVertexSample[VD, ED](graph: Graph[VD, ED],
                         sampleSize: Int = 20,
                         vertexFilter: (VertexId, VD) => Boolean =
                           (id: VertexId, attr: VD) => true)

Plots a sample of the given GraphX graph based on the vertex set. DDS will sample the graph by taking a subset of the vertices and discarding any edges whose source and target vertices are not both part of the subset. You can also pass a function to filter vertices before the sampling is applied, giving you some control over the reduction.

It is advised to choose small samples first because the graph layout algorithm and the SVG drawing can be quite slow for huge graphs.
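
For example, using the enron graph from the datasets package described below:

val network = de.frosner.dds.datasets.enron(sc)
showVertexSample(network, 10)   // sample of 10 vertices plus the edges between them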

showEdgeSample

showEdgeSample[VD, ED](graph: Graph[VD, ED],
                       sampleSize: Int = 20,
                       edgeFilter: (Edge[ED]) => Boolean = 
                         (edge: Edge[ED]) => true)

Plots a sample of the given GraphX graph based on the edge set. DDS will sample the graph by taking a subset of the edges and discarding any vertices that do not appear as source or target of an edge in the subset. You can also pass a function to filter edges before the sampling is applied, giving you some control over the reduction.

It is advised to choose small samples first because the graph layout algorithm and the SVG drawing can be quite slow for huge graphs.

connectedComponents

connectedComponents[VD, ED](graph: graphx.Graph[VD, ED])

Plots the number of nodes and the number of edges for each connected component present in the given graph.

Example Data Sets

To quickly get started with Spark and DDS, you can load one of the available example data sets either as an RDD of case classes or a data frame.

To access a data set you need to pass your Spark context (sc). If you want to apply a schema, you also need to pass a SQL context (sql). You may also provide them as implicits.

golf

The golf data set is a small artificial example used for explaining classification problems. It was obtained from LearningJS.

// RDD of case classes
val golf = de.frosner.dds.datasets.golf(sc)
// DataFrame
val golf = de.frosner.dds.datasets.golf(sc, sql)

flights

In the flights data set you will find a subset of rows and columns taken from the Bureau of Transportation Statistics. They contain information about flights and their corresponding delays.

// RDD of case classes
val flights = de.frosner.dds.datasets.flights(sc)
// DataFrame
val flights = de.frosner.dds.datasets.flights(sc, sql)

enron

This data set contains a network of e-mail communications between (mostly) senior managers of Enron. It is available at the Stanford Large Network Dataset Collection.

// Graph
val network = de.frosner.dds.datasets.enron(sc)

Interactive Visualizations

DDS comes with a set of interactive visualizations on the front-end side. The following sections explain only the visualizations that are truly interactive, i.e. the user can modify them by clicking and pressing buttons. Simple visualizations like tables or line charts are not covered here.

Grid View

Tabular data is displayed in an interactive grid. The table contents can be sorted by clicking on the column header of the attribute to sort by. Each element is assigned an ID which links the contents of the grid with the accompanying plot (e.g. parallel coordinates).

If the table has many entries, it will be paginated. You can find the controls for pagination in the menu bar above the grid.

Parallel Coordinates

The parallel coordinates plot is a visualization technique for exploring high-dimensional data. Instead of drawing the axes orthogonally as in a 2D scatter plot, they are drawn in parallel. One row / data point corresponds to a polyline through all the axes. DDS supports interactions such as axis manipulation, filtering, and color coding, and offers parallel coordinates for all tabular data.

Axis Manipulations

To reorder the axes, drag one axis on its label to a different position. This can be helpful to detect patterns like correlations or clusters in the data.

To change the ordering of the values on the axis, double click on the axis label.

To hide / show the tick labels, press the hide ticks button in the top right corner.

Filtering

You can filter the values by selecting a range in one of the dimensions. Just click and drag to span a filter region on an axis. DDS supports filters on multiple dimensions simultaneously. To change a filter, just move it around or change its size. To delete a filter, click on a part of the axis that is outside the filter's range.

Color Coding

You can dynamically color the polylines in the parallel coordinates plot based on one dimension. To select the dimension to use for color coding, just click on the small grey circle above the axis. If the attribute is numerical, a linear scale from orange to red will be used. Otherwise, a multi-color scale will be used that does not imply any ordering.

Scatter Plots

When visualizing data in a scatter plot, you can choose whether to apply some random noise (jitter) on non-numerical axes. Toggle jitter by clicking on the enable jitter / disable jitter button in the top right corner.

Graph Drawing

When displaying a graph in DDS, you can choose to hide / show edge and vertex labels. Toggle their visibility using the navigation buttons in the top right corner. Additionally, you can choose between drawing directed and undirected edges.

In order to interact with the layout, you can drag and drop vertices to change their position.

Heat Maps

When visualizing matrices as heat maps, it sometimes makes sense to try different color scales. As of today, DDS offers one two-color scale and one three-color scale. You can select the corresponding scale by clicking the colored buttons in the header menu.

It is also possible to customize the range of the color scale (min, max). For example, it makes sense to use a three color scale and (-1, 1) range when looking at a correlation matrix. Specify min and max values using the two input fields labeled "z", residing in the header menu.

Logging

DDS currently uses the Apache log4j logger, which is also used by Spark, so you can configure DDS logging in the same way you configure logging in Spark. The logger is called DDS.
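
For example, to change the log level for DDS, a line like the following can be added to the log4j configuration (a sketch, assuming Spark's standard conf/log4j.properties setup):

log4j.logger.DDS=DEBUG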