```{r include=FALSE}
knitr::opts_chunk$set(eval = FALSE)
knitr::opts_chunk$set(message = FALSE)
knitr::opts_chunk$set(warning = FALSE)
source("r/render.R")
```
# Modeling {#modeling}
> I’ve trusted in your visions, in your prophecies, for years.
>
> --- Stannis Baratheon
In [Chapter 3](#analysis) you learned how to scale up data analysis to large datasets using Spark. In this chapter, we detail the steps required to build<!--((("prediction models", seealso="modeling")))((("Spark using R", "modeling", id="SURmodle04")))--> prediction models in Spark. We explore `MLlib`, the component of Spark that allows you to write high-level code to perform predictive modeling on distributed data, and use data wrangling in the context of feature engineering and exploratory data analysis.
We will start this chapter by introducing modeling in the context of Spark and the dataset you will use throughout the chapter. We then demonstrate a supervised learning workflow that includes exploratory data analysis, feature engineering, and model building. Then we move on to an unsupervised topic modeling example using unstructured text data. Keep in mind that our goal is to show various techniques of executing data science tasks on large data rather than conducting a rigorous and coherent analysis. There are also many other models available in Spark that won’t be covered in this chapter, but by the end of the chapter, you will have the right tools to experiment with additional ones on your own.
While predicting datasets manually is often a reasonable approach (by "manually," we mean someone imports a dataset into Spark and uses the fitted model to enrich or predict values), it does beg the question: could we automate this process into systems that anyone can use? For instance, how can we build a system that automatically identifies an email as spam without having to manually analyze each email account? [Chapter 5](#pipelines) presents the tools to automate data analysis and modeling with pipelines, but to get there, we need to first understand how to train models "by hand".
## Overview
The R interface to Spark provides modeling algorithms that should be familiar to R users, and we'll go into detail in the chapter. For instance, we've already used `ml_linear_regression(cars, mpg ~ .)`, but we could run `ml_logistic_regression(cars, am ~ .)` just as easily.
Take a moment to look at the long list of `MLlib` functions included in the appendix of this book; a quick glance at this list shows that Spark supports Decision Trees, Gradient-Boosted Trees, Accelerated Failure Time Survival Regression, Isotonic Regression, _K_-Means Clustering, Gaussian Mixture Clustering, and more.
As you can see, Spark provides a wide range of algorithms and feature transformers, and here we touch on a representative portion of the functionality. A complete treatment of predictive modeling concepts is beyond the scope of this book, so we recommend complementing this discussion with [_R for Data Science_](https://r4ds.had.co.nz/) by Hadley Wickham and Garrett Grolemund (O'Reilly) and _Feature Engineering and Selection: A Practical Approach for Predictive Models_,^[Kuhn M, Johnson K (2019). _Feature Engineering and Selection: A Practical Approach for Predictive Models_. CRC Press.] from which we adapted (sometimes verbatim) some of the examples and visualizations in this chapter.
This chapter focuses on<!--((("modeling", "predictive modeling")))--> predictive modeling, since Spark aims to enable machine learning as opposed to statistical inference. Machine learning<!--((("machine learning")))--> is often more concerned with forecasting the future than with inferring the process by which our data is generated,^[We acknowledge that the terms here might mean different things to different people and that there is a continuum between the two approaches, however they are defined.] which is then used to create automated systems. Machine learning can be categorized into<!--((("supervised learning", "process of")))--> _supervised learning_ (predictive modeling) and _unsupervised learning_. In supervised learning, we try to learn a function that maps from X to Y, from a dataset of (x, y) examples. In<!--((("unsupervised learning", "process of")))--> unsupervised learning, we have only X and not the Y labels, so instead we try to learn something about the structure of X. Some practical use cases for supervised learning include forecasting tomorrow's weather, determining whether a credit card transaction is fraudulent, and coming up with a quote for your car insurance policy. With unsupervised learning, examples include automated grouping of photos of individuals, segmenting customers based on their purchase history, and clustering of documents.
The<!--((("sparklyr package", "machine learning interface")))--> ML interface in `sparklyr` has been designed to minimize the cognitive effort for moving from a local, in-memory, native-R workflow to the cluster, and back. While the Spark ecosystem<!--((("Spark ecosystem")))--> is very rich, there is still a tremendous number of packages from CRAN, with some implementing functionality that you might require for a project. Also, you might want to leverage your skills and experience working in R to maintain productivity. What we learned in [Chapter 3](#analysis) also applies here—it is important to keep track of where you are performing computations and move between the cluster and your R session as appropriate.
The examples in this chapter utilize the [`OkCupid` dataset](https://oreil.ly/Uv9r_).^[Kim AY, Escobedo-Land A (2015). “OKCupid data for introductory statistics and data science courses.” _Journal of Statistics Education_, 23(2).] The dataset consists of user profile data from an online dating site and contains a diverse set of features, including biographical characteristics such as gender and profession, as well as free text fields related to personal interests. There are about 60,000 profiles in the dataset, which fits comfortably into memory on a modern laptop and wouldn’t be considered "big data", so you can easily follow along running Spark in local mode.
You can download this dataset as follows:
```{r modeling-download-data}
download.file(
  "https://github.com/r-spark/okcupid/raw/master/profiles.csv.zip",
  "okcupid.zip")
unzip("okcupid.zip", exdir = "data")
unlink("okcupid.zip")
```
We don’t recommend sampling this dataset since the model won’t be nearly as rich; however, if you have limited hardware resources, you are welcome to sample it as follows:
```{r modeling-sample-data}
profiles <- read.csv("data/profiles.csv")
write.csv(dplyr::sample_n(profiles, 10^3),
          "data/profiles.csv", row.names = FALSE)
```
**Note:** The examples in this chapter utilize small datasets so that you can easily follow along in local mode. In practice, if your dataset fits comfortably in memory on your local machine, you might be better off using an efficient, nondistributed implementation of the modeling algorithm. For example, you might want to use the `ranger` package instead of `ml_random_forest_classifier()`.
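For instance, a minimal local-memory sketch with `ranger` (the column choices, helper objects, and `num.trees` value here are ours, for illustration only; `ranger()` does not accept missing values, so we keep complete rows):

```{r modeling-ranger-sketch}
# Hypothetical nondistributed alternative to ml_random_forest_classifier()
local_df <- na.omit(read.csv("data/profiles.csv")[, c("age", "height", "job")])
local_df$not_working <- factor(
  ifelse(local_df$job %in% c("student", "unemployed", "retired"), 1, 0))
rf <- ranger::ranger(not_working ~ age + height, data = local_df,
                     num.trees = 100)
```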
To follow along, you will also need to install a few additional packages:
```{r modeling-install-packages, eval=FALSE, exercise=TRUE}
install.packages("ggmosaic")
install.packages("forcats")
install.packages("FactoMineR")
```
To motivate the examples, we consider the following problem:
> Predict whether someone is actively working—that is, not retired, a student, or unemployed.
Next up, we explore this dataset.
## Exploratory Data Analysis
Exploratory<!--((("modeling", "exploratory data analysis", id="Mexplor04")))((("exploratory data analysis (EDA)", id="explordata04")))--> data analysis (EDA), in the context of predictive modeling, is the exercise of looking at excerpts and summaries of the data. The specific goals of the EDA stage are informed by the business problem, but here are some common objectives:
- Check for data quality; confirm meaning and prevalence of missing values and reconcile statistics against existing controls.
- Understand univariate relationships between variables.
- Perform an initial assessment on what variables to include and what transformations need to be done on them.
To begin, we connect to Spark, load libraries, and read in the data:
```{r modeling-connect}
library(sparklyr)
library(ggplot2)
library(dbplot)
library(dplyr)
sc <- spark_connect(master = "local", version = "2.3")
okc <- spark_read_csv(
  sc,
  "data/profiles.csv",
  escape = "\"",
  memory = FALSE,
  options = list(multiline = TRUE)
) %>%
  mutate(
    height = as.numeric(height),
    income = ifelse(income == "-1", NA, as.numeric(income))
  ) %>%
  mutate(sex = ifelse(is.na(sex), "missing", sex)) %>%
  mutate(drinks = ifelse(is.na(drinks), "missing", drinks)) %>%
  mutate(drugs = ifelse(is.na(drugs), "missing", drugs)) %>%
  mutate(job = ifelse(is.na(job), "missing", job))
```
We specify `escape = "\""` and `options = list(multiline = TRUE)` here to accommodate embedded quote characters and newlines in the essay fields. We also convert the `height` and `income` columns to numeric types and recode missing values in the string columns. Note that it might very well take a few tries of specifying different parameters to get the initial data ingest correct, and sometimes you might need to revisit this step after you learn more about the data during modeling.
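One quick sanity check on the ingest (ours, not part of the original workflow) is to count the rows and compare against the roughly 60,000 profiles we expect:

```{r modeling-ingest-check}
# Row count of the Spark DataFrame; should be close to 60,000
sdf_nrow(okc)
```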
We can now take a quick look at our<!--((("commands", "glimpse()")))--> data by using `glimpse()`:
```{r modeling-eda-glimpse}
glimpse(okc)
```
```
Observations: ??
Variables: 31
Database: spark_connection
$ age <int> 22, 35, 38, 23, 29, 29, 32, 31, 24, 37, 35…
$ body_type <chr> "a little extra", "average", "thin", "thin…
$ diet <chr> "strictly anything", "mostly other", "anyt…
$ drinks <chr> "socially", "often", "socially", "socially…
$ drugs <chr> "never", "sometimes", "missing", "missing"…
$ education <chr> "working on college/university", "working …
$ essay0 <chr> "about me:<br />\n<br />\ni would love to …
$ essay1 <chr> "currently working as an international age…
$ essay2 <chr> "making people laugh.<br />\nranting about…
$ essay3 <chr> "the way i look. i am a six foot half asia…
$ essay4 <chr> "books:<br />\nabsurdistan, the republic, …
$ essay5 <chr> "food.<br />\nwater.<br />\ncell phone.<br…
$ essay6 <chr> "duality and humorous things", "missing", …
$ essay7 <chr> "trying to find someone to hang out with. …
$ essay8 <chr> "i am new to california and looking for so…
$ essay9 <chr> "you want to be swept off your feet!<br />…
$ ethnicity <chr> "asian, white", "white", "missing", "white…
$ height <dbl> 75, 70, 68, 71, 66, 67, 65, 65, 67, 65, 70…
$ income <dbl> NaN, 80000, NaN, 20000, NaN, NaN, NaN, NaN…
$ job <chr> "transportation", "hospitality / travel", …
$ last_online <chr> "2012-06-28-20-30", "2012-06-29-21-41", "2…
$ location <chr> "south san francisco, california", "oaklan…
$ offspring <chr> "doesn&rsquo;t have kids, but might want t…
$ orientation <chr> "straight", "straight", "straight", "strai…
$ pets <chr> "likes dogs and likes cats", "likes dogs a…
$ religion <chr> "agnosticism and very serious about it", "…
$ sex <chr> "m", "m", "m", "m", "m", "m", "f", "f", "f…
$ sign <chr> "gemini", "cancer", "pisces but it doesn&r…
$ smokes <chr> "sometimes", "no", "no", "no", "no", "no",…
$ speaks <chr> "english", "english (fluently), spanish (p…
$ status <chr> "single", "single", "available", "single",…
```
Now, we add our response variable as a column in the dataset and look at its distribution:
```{r modeling-eda-tally}
okc <- okc %>%
  mutate(
    not_working = ifelse(job %in% c("student", "unemployed", "retired"), 1, 0)
  )

okc %>%
  group_by(not_working) %>%
  tally()
```
```
# Source: spark<?> [?? x 2]
not_working n
<dbl> <dbl>
1 0 54541
2 1 5405
```
Before we proceed further, let's perform an initial split of our data into a training set and a testing set and put away the latter. In practice, this is a crucial step because we would like to have a holdout set that we set aside at the end of the modeling process to evaluate model performance. If we were to include the entire dataset during EDA, information from the testing set could "leak" into the visualizations and summary statistics and bias our model-building process even though the data is not used directly in a learning algorithm. This would undermine the credibility of our performance metrics. We can easily split the data by using the `sdf_random_split()` function:
```{r modeling-eda-splits}
data_splits <- sdf_random_split(okc, training = 0.8, testing = 0.2, seed = 42)
okc_train <- data_splits$training
okc_test <- data_splits$testing
```
We can quickly look at the distribution of our response variable:
```{r modeling-eda-dist}
okc_train %>%
  group_by(not_working) %>%
  tally() %>%
  mutate(frac = n / sum(n))
```
```
# Source: spark<?> [?? x 3]
not_working n frac
<dbl> <dbl> <dbl>
1 0 43785 0.910
2 1 4317 0.0897
```
Using the `sdf_describe()` function, we can obtain numerical summaries of specific columns:
```{r modeling-describe}
sdf_describe(okc_train, cols = c("age", "income"))
```
```
# Source: spark<?> [?? x 3]
summary age income
<chr> <chr> <chr>
1 count 48102 9193
2 mean 32.336534863415245 104968.99815076689
3 stddev 9.43908920033797 202235.2291773537
4 min 18 20000.0
5 max 110 1000000.0
```
Like we saw in [Chapter 3](#analysis), we can also utilize the `dbplot` package to plot distributions of these variables. In Figure \@ref(fig:age-histogram) we show a histogram of the distribution of the `age` variable, which is the result of the following code:
```{r modeling-eda-bins, echo=FALSE, eval=FALSE}
db_compute_bins(okc_train, age) %>%
  ggplot() +
  geom_col(aes(age, count), fill = "#878787") +
  geom_hline(yintercept = 0, size = 1, colour = "#333333") +
  labs(title = "Distribution of Age",
       subtitle = "Age histogram in OKCupid dataset") +
  ggsave("images/modeling-okc-histogram-age.png", width = 10, height = 5)
```
```{r modeling-eda-hist}
dbplot_histogram(okc_train, age)
```
```{r age-histogram, eval=TRUE, echo=FALSE, fig.cap='Distribution of age', fig.align = 'center'}
render_image("images/modeling-okc-histogram-age.png", "Distribution of age")
```
A common EDA exercise is to look at the relationships between the response and the individual predictors. Often, you might have prior business knowledge of what these relationships should be, so this can serve as a data quality check. Also, unexpected trends can inform variable interactions that you might want to include in the model. As an example, we can explore the `religion` variable:
```{r modeling-eda-prop}
prop_data <- okc_train %>%
  mutate(religion = regexp_extract(religion, "^\\\\w+", 0)) %>%
  group_by(religion, not_working) %>%
  tally() %>%
  group_by(religion) %>%
  summarize(
    count = sum(n),
    prop = sum(not_working * n) / sum(n)
  ) %>%
  mutate(se = sqrt(prop * (1 - prop) / count)) %>%
  collect()

prop_data
```
```{r modeling-eda-save-prop, echo = FALSE}
saveRDS(prop_data, "data/modeling-okc-prop-data.rds")
```
```
# A tibble: 10 x 4
religion count prop se
<chr> <dbl> <dbl> <dbl>
1 judaism 2520 0.0794 0.00539
2 atheism 5624 0.118 0.00436
3 christianity 4671 0.120 0.00480
4 hinduism 358 0.101 0.0159
5 islam 115 0.191 0.0367
6 agnosticism 7078 0.0958 0.00346
7 other 6240 0.0841 0.00346
8 missing 16152 0.0719 0.002
9 buddhism 1575 0.0851 0.007
10 catholicism 3769 0.0886 0.00458
```
Note that `prop_data` is a small<!--((("DataFrames", "visualization using ggplot2")))--> DataFrame that has been collected into memory in our R session, so we can take advantage of `ggplot2` to create an informative visualization (see Figure \@ref(fig:modeling-eda-prop-render)):
```{r modeling-eda-prop-plot, echo=FALSE, eval=FALSE}
prop_data <- readRDS("data/modeling-okc-prop-data.rds")
prop_data %>%
  ggplot(aes(x = religion, y = prop)) +
  theme(axis.text.x = element_text(size = 12)) +
  geom_errorbar(aes(ymin = prop - 1.96 * se, ymax = prop + 1.96 * se),
                width = .1) +
  geom_point(size = 2) +
  geom_hline(yintercept = sum(prop_data$prop * prop_data$count) /
               sum(prop_data$count),
             linetype = "dashed", color = "#888888") +
  ggsave("images/modeling-okc-prop.png", device = "png", width = 10, height = 5)
```
```{r modeling-eda-prop-code}
prop_data %>%
  ggplot(aes(x = religion, y = prop)) + geom_point(size = 2) +
  geom_errorbar(aes(ymin = prop - 1.96 * se, ymax = prop + 1.96 * se),
                width = .1) +
  geom_hline(yintercept = sum(prop_data$prop * prop_data$count) /
               sum(prop_data$count))
```
```{r modeling-eda-prop-render, echo = FALSE, eval = TRUE, fig.cap='Proportion of individuals not currently employed, by religion', fig.align = 'center'}
render_image("images/modeling-okc-prop.png", "Proportion of individuals not currently employed, by religion")
```
Next, we take a look at the relationship between a couple of predictors: alcohol use and drug use. We would expect there to be some correlation between them. You can compute a contingency table via `sdf_crosstab()`:
```{r modeling-eda-contingency}
contingency_tbl <- okc_train %>%
  sdf_crosstab("drinks", "drugs") %>%
  collect()

contingency_tbl
```
```
# A tibble: 7 x 5
drinks_drugs missing never often sometimes
<chr> <dbl> <dbl> <dbl> <dbl>
1 very often 54 144 44 137
2 socially 8221 21066 126 4106
3 not at all 146 2371 15 109
4 desperately 72 89 23 74
5 often 1049 1718 69 1271
6 missing 1121 1226 10 59
7 rarely 613 3689 35 445
```
```{r modeling-eda-contingency-save, echo = FALSE}
saveRDS(contingency_tbl, "data/modeling-okc-contingency-table.rds")
```
We can visualize this contingency table using a mosaic plot (see Figure \@ref(fig:modeling-eda-mosaic)):
```{r modeling-eda-contingency-code}
library(ggmosaic)
library(forcats)
library(tidyr)
contingency_tbl %>%
  rename(drinks = drinks_drugs) %>%
  gather("drugs", "count", missing:sometimes) %>%
  mutate(
    drinks = as_factor(drinks) %>%
      fct_relevel("missing", "not at all", "rarely", "socially",
                  "very often", "desperately"),
    drugs = as_factor(drugs) %>%
      fct_relevel("missing", "never", "sometimes", "often")
  ) %>%
  ggplot() +
  geom_mosaic(aes(x = product(drinks, drugs), fill = drinks,
                  weight = count))
```
```{r modeling-eda-contingency-plot, echo=FALSE, eval=FALSE}
library(ggmosaic)
library(forcats)
library(tidyr)
contingency_tbl <- readRDS("data/modeling-okc-contingency-table.rds")
contingency_tbl %>%
  rename(drinks = drinks_drugs) %>%
  gather("drugs", "count", missing:sometimes) %>%
  mutate(
    drinks = as_factor(drinks) %>%
      forcats::fct_relevel("missing", "not at all", "rarely", "socially",
                           "very often", "desperately"),
    drugs = as_factor(drugs) %>%
      forcats::fct_relevel("missing", "never", "sometimes", "often")
  ) %>%
  ggplot() + guides(fill = FALSE) +
  geom_mosaic(aes(x = product(drinks, drugs), fill = drinks, weight = count)) +
  xlab("Drugs") + ylab("Drinks") +
  scale_fill_grey(start = 0.6, end = 0.2) +
  labs(title = "Contingency Table",
       subtitle = "Mosaic plot of drug and alcohol use") +
  theme(
    axis.text.x = element_text(vjust = 1, hjust = 1)
  ) +
  ggsave("images/modeling-okc-mosaic.png", width = 10, height = 5)
```
```{r modeling-eda-mosaic, echo = FALSE, eval = TRUE, fig.cap='Mosaic plot of drug and alcohol use', fig.align = 'center'}
render_image("images/modeling-okc-mosaic.png", "Mosaic plot of drug and alcohol use")
```
To further explore the relationship between these two variables, we can perform correspondence analysis^[Greenacre M (2017). _Correspondence analysis in practice_. Chapman and Hall/CRC.] using the `FactoMineR` package. This technique enables us to summarize the relationship between the high-dimensional factor levels by mapping each level to a point on the plane. We first obtain the mapping using `FactoMineR::CA()` as follows:
```{r modeling-eda-factominer-map}
dd_obj <- contingency_tbl %>%
  tibble::column_to_rownames(var = "drinks_drugs") %>%
  FactoMineR::CA(graph = FALSE)
```
We can then plot the results using `ggplot`, which you can see in Figure \@ref(fig:modeling-eda-pcs):
```{r modeling-eda-factominer-code}
dd_drugs <-
  dd_obj$row$coord %>%
  as.data.frame() %>%
  mutate(
    label = gsub("_", " ", rownames(dd_obj$row$coord)),
    Variable = "Drugs"
  )

dd_drinks <-
  dd_obj$col$coord %>%
  as.data.frame() %>%
  mutate(
    label = gsub("_", " ", rownames(dd_obj$col$coord)),
    Variable = "Alcohol"
  )

ca_coord <- rbind(dd_drugs, dd_drinks)

ggplot(ca_coord, aes(x = `Dim 1`, y = `Dim 2`,
                     col = Variable)) +
  geom_vline(xintercept = 0) +
  geom_hline(yintercept = 0) +
  geom_text(aes(label = label)) +
  coord_equal()
```
```{r modeling-eda-factominer-plot, echo=FALSE, eval=FALSE}
# From FES
dd_obj <- contingency_tbl %>%
  tibble::column_to_rownames(var = "drinks_drugs") %>%
  as.matrix() %>%
  as.table() %>%
  FactoMineR::CA(graph = FALSE)

dd_drugs <-
  dd_obj$row$coord %>%
  as.data.frame() %>%
  mutate(
    label = gsub("_", " ", rownames(dd_obj$row$coord)),
    Variable = "Drugs"
  )

dd_drinks <-
  dd_obj$col$coord %>%
  as.data.frame() %>%
  mutate(
    label = gsub("_", " ", rownames(dd_obj$col$coord)),
    Variable = "Alcohol"
  )

# ------------------------------------------------------------------------------

dd_rng <- extendrange(c(dd_drinks$`Dim 1`, dd_drinks$`Dim 2`))
dd_x <- paste0("Dimension #1 (",
               round(dd_obj$eig["dim 1", "percentage of variance"], 0),
               "%)")
dd_y <- paste0("Dimension #2 (",
               round(dd_obj$eig["dim 2", "percentage of variance"], 0),
               "%)")

ca_coord <- rbind(dd_drugs, dd_drinks)

# ------------------------------------------------------------------------------

ggplot(ca_coord, aes(x = `Dim 1`, y = `Dim 2`, col = Variable)) +
  scale_color_grey(start = 0.6, end = 0.2) +
  geom_vline(xintercept = 0) +
  geom_hline(yintercept = 0) +
  geom_text(aes(label = label)) +
  xlim(dd_rng) + ylim(dd_rng) +
  xlab(dd_x) + ylab(dd_y) +
  labs(title = "Correspondence Analysis",
       subtitle = "Principal coordinates for drugs and alcohol use") +
  ggsave("images/modeling-okc-correspondence-analysis.png", width = 10, height = 5)
```
```{r modeling-eda-pcs, echo = FALSE, eval = TRUE, fig.cap='Correspondence analysis principal coordinates for drug and alcohol use', fig.align = 'center'}
render_image("images/modeling-okc-correspondence-analysis.png")
```
In Figure \@ref(fig:modeling-eda-pcs), we<!--((("principal coordinates")))--> see that the correspondence analysis procedure has transformed the factors into variables called _principal coordinates_, which correspond to the axes in the plot and represent how much information in the contingency table they contain. We can, for example, interpret the proximity of "drinking often" and "using drugs very often" as indicating association.<!--((("", startref="Mexplor04")))((("", startref="explordata04")))-->
This concludes our discussion on EDA. Let's proceed to feature engineering.
## Feature Engineering
The<!--((("modeling", "feature engineering")))((("feature engineering")))--> feature engineering exercise comprises transforming the data to increase the performance of the model. This can include things like centering and scaling numerical values and performing string manipulation to extract meaningful variables. It also often includes variable selection—the process of selecting which predictors are used in the model.
In Figure \@ref(fig:age-histogram) we see that the `age` variable has a range from 18 to over 60. Some algorithms, especially neural networks, train faster if we normalize our inputs so that they are of the same magnitude. Let’s now normalize the `age` variable by removing the mean and scaling to unit variance, beginning by calculating its mean and standard deviation:
```{r modeling-eda-scale}
scale_values <- okc_train %>%
  summarize(
    mean_age = mean(age),
    sd_age = sd(age)
  ) %>%
  collect()

scale_values
```
```
# A tibble: 1 x 2
mean_age sd_age
<dbl> <dbl>
1 32.3 9.44
```
We can then use these to transform the dataset:
```{r modeling-eda-scale-age}
okc_train <- okc_train %>%
  mutate(scaled_age = (age - !!scale_values$mean_age) /
           !!scale_values$sd_age)
```
```{r modeling-eda-scale-code}
dbplot_histogram(okc_train, scaled_age)
```
```{r modeling-eda-scale-plot, echo=FALSE, eval=FALSE}
db_compute_bins(okc_train, scaled_age) %>%
  ggplot() +
  geom_col(aes(scaled_age, count), fill = "#878787") +
  geom_hline(yintercept = 0, size = 1, colour = "#333333") +
  labs(title = "Distribution of Scaled Age",
       subtitle = "Scaled age histogram in OKCupid dataset") +
  ggsave("images/modeling-okc-histogram-scaled-age.png", width = 10, height = 5)
```
In Figure \@ref(fig:modeling-eda-scaled-dist), we see that the scaled age variable has values that are closer to zero. We now move on to discussing other types of transformations, but during your feature engineering workflow you might want to perform the normalization for all numeric variables that you want to include in the model.
```{r modeling-eda-scaled-dist, echo = FALSE, eval = TRUE, fig.cap='Distribution of scaled age', fig.align = 'center'}
render_image("images/modeling-okc-histogram-scaled-age.png")
```
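Spark itself also ships feature transformers that can perform this kind of scaling. A minimal sketch, assuming a `ft_vector_assembler()` step first since `ft_standard_scaler()` operates on vector columns (the column names here are ours, for illustration):

```{r modeling-ft-scaler-sketch}
# When called on a Spark DataFrame, the estimator is fit and applied in one step
okc_train %>%
  ft_vector_assembler(input_cols = "age", output_col = "age_vec") %>%
  ft_standard_scaler(input_col = "age_vec", output_col = "age_scaled",
                     with_mean = TRUE)
```

Keep in mind that, unlike the manual approach above, this fits the scaler on whatever table you pass it, so when validating you would still fit on the analysis data only.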
Since some of the profile features are multiple-select—in other words, a person can choose to associate multiple options for a variable—we need to process them before we can build meaningful models. If we take a look at the ethnicity column, for example, we see that there are many different combinations:
```{r modeling-eda-ethnicity}
okc_train %>%
  group_by(ethnicity) %>%
  tally()
```
```
# Source: spark<?> [?? x 2]
ethnicity n
<chr> <dbl>
1 hispanic / latin, white 1051
2 black, pacific islander, hispanic / latin 2
3 asian, black, pacific islander 5
4 black, native american, white 91
5 middle eastern, white, other 34
6 asian, other 78
7 asian, black, white 12
8 asian, hispanic / latin, white, other 7
9 middle eastern, pacific islander 1
10 indian, hispanic / latin 5
# … with more rows
```
One way to proceed would be to treat each combination of races as a separate level, but that would lead to a very large number of levels, which becomes problematic in many algorithms. To better encode this information, we can create dummy variables for each race, as follows:
```{r modeling-eda-ethnicity-glimpse}
ethnicities <- c("asian", "middle eastern", "black", "native american",
                 "indian", "pacific islander", "hispanic / latin",
                 "white", "other")

ethnicity_vars <- ethnicities %>%
  purrr::map(~ expr(ifelse(like(ethnicity, !!.x), 1, 0))) %>%
  purrr::set_names(paste0("ethnicity_", gsub("\\s|/", "", ethnicities)))

okc_train <- mutate(okc_train, !!!ethnicity_vars)

okc_train %>%
  select(starts_with("ethnicity_")) %>%
  glimpse()
```
```
Observations: ??
Variables: 9
Database: spark_connection
$ ethnicity_asian <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ ethnicity_middleeastern <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ ethnicity_black <dbl> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0…
$ ethnicity_nativeamerican <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ ethnicity_indian <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ ethnicity_pacificislander <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ ethnicity_hispaniclatin <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ ethnicity_white <dbl> 1, 0, 1, 0, 1, 1, 1, 0, 1, 0…
$ ethnicity_other <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
```
For the free text fields, a straightforward way to extract features is counting the total number of characters. We will store the training data in Spark's memory with `compute()` to speed up computation:
```{r modeling-eda-essay-compute}
okc_train <- okc_train %>%
  mutate(
    essay_length = char_length(paste(!!!syms(paste0("essay", 0:9))))
  ) %>%
  compute()
```
```{r modeling-eda-essay-plot, echo=FALSE, eval=FALSE}
db_compute_bins(okc_train, essay_length, bins = 100) %>%
  ggplot() +
  geom_col(aes(essay_length, count), fill = "#878787") +
  geom_hline(yintercept = 0, size = 1, colour = "#333333") +
  labs(title = "Distribution of Essay Length",
       subtitle = "Essay length histogram in OKCupid dataset")
ggsave("images/modeling-okc-histogram-essay-length.png", width = 10, height = 5)
```
```{r modeling-eda-essay-render}
dbplot_histogram(okc_train, essay_length, bins = 100)
```
We can see the distribution of the `essay_length` variable in Figure \@ref(fig:modeling-essay-length-distribution).
```{r modeling-essay-length-distribution, echo=FALSE, eval=TRUE, fig.cap='Distribution of essay length', fig.align = 'center'}
render_image("images/modeling-okc-histogram-essay-length.png", "Distribution of essay length")
```
We use this dataset in [Chapter 5](#pipelines), so let’s save it first as a Parquet file—an efficient file format ideal for numeric data:
```{r modeling-eda-write-parquet}
spark_write_parquet(okc_train, "data/okc-train.parquet")
```
Now that we have a few more features to work with, we can begin running some supervised learning algorithms.
## Supervised Learning
Once<!--((("modeling", "supervised learning", id="Msuperlearn04")))((("supervised learning", "overview of")))--> we have a good grasp on our dataset, we can start building some models. Before we do so, however, we need to come up with a plan to tune and validate the "candidate" models—in modeling projects, we often try different types of models and ways to fit them to see which ones perform the best. Since we are dealing with a binary classification problem, the metrics we can use include accuracy, precision, sensitivity, and area under the<!--((("receiver operating characteristic curve (ROC AUC)")))--> receiver operating characteristic curve (ROC AUC), among others. The metric you optimize depends on your specific business problem, but for this exercise, we will focus on the ROC AUC.
It is important that we don’t peek at the testing holdout set until the very end, because any information we obtain could influence our modeling decisions, which would in turn make our estimates of model performance less credible. For tuning and validation, we perform 10-fold cross-validation, which is a standard approach for model tuning. The scheme works as follows: we first divide our dataset into 10 approximately equal-sized subsets. We take the 2nd to 10th sets together as the training set for an algorithm and validate the resulting model on the 1st set. Next, we reserve the 2nd set as the validation set and train the algorithm on the 1st and 3rd to 10th sets. In total, we train 10 models and average the performance. If time and resources allow, you can also perform this procedure multiple times with different random partitions of the data. In our case, we will demonstrate how to perform the cross-validation once. Hereinafter, we refer to the training set associated with each<!--((("analysis data")))((("assessment data")))--> split as the _analysis_ data, and the validation set as _assessment_ data.
Using<!--((("commands", "sdf_random_split()")))--> the `sdf_random_split()` function, we can create a list of subsets from our `okc_train` table:
```{r modeling-super-split}
vfolds <- sdf_random_split(
  okc_train,
  weights = purrr::set_names(rep(0.1, 10), paste0("fold", 1:10)),
  seed = 42
)
```
We then create our first analysis/assessment split as follows:
```{r modeling-super-split-analysis}
analysis_set <- do.call(rbind, vfolds[2:10])
assessment_set <- vfolds[[1]]
```
One item we need to treat carefully here is the scaling of variables. We need to make sure that we do not leak any information from the assessment set to the analysis set, so we calculate the mean and standard deviation on the analysis set only and apply the same transformation to both sets. Here is how we would handle this for the `age` variable:
```{r modeling-super-scale-age}
make_scale_age <- function(analysis_data) {
  scale_values <- analysis_data %>%
    summarize(
      mean_age = mean(age),
      sd_age = sd(age)
    ) %>%
    collect()

  function(data) {
    data %>%
      mutate(scaled_age = (age - !!scale_values$mean_age) /
               !!scale_values$sd_age)
  }
}

scale_age <- make_scale_age(analysis_set)
train_set <- scale_age(analysis_set)
validation_set <- scale_age(assessment_set)
```
For brevity, here we show only how to transform the `age` variable. In practice, however, you would want to normalize each one of your continuous predictors, such as the `essay_length` variable we derived in the previous section.
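In practice, you might wrap the same fit-on-analysis, apply-anywhere pattern for several columns at once. A minimal sketch (the `make_scaler()` helper and the column list are ours, for illustration):

```{r modeling-make-scaler-sketch}
make_scaler <- function(analysis_data, cols) {
  # Per-column means and standard deviations, computed on the analysis data only
  scale_values <- analysis_data %>%
    summarize_at(cols, list(mean = ~ mean(.), sd = ~ sd(.))) %>%
    collect()

  function(data) {
    for (col in cols) {
      data <- data %>%
        mutate(!!paste0("scaled_", col) :=
                 (!!sym(col) - !!scale_values[[paste0(col, "_mean")]]) /
                 !!scale_values[[paste0(col, "_sd")]])
    }
    data
  }
}

scale_numeric <- make_scaler(analysis_set, c("age", "essay_length"))
analysis_scaled <- scale_numeric(analysis_set)
assessment_scaled <- scale_numeric(assessment_set)
```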
Logistic regression<!--((("logistic regression")))--> is often a reasonable starting point for binary classification problems, so let's give it a try. Suppose also that our domain knowledge provides us with an initial set of predictors. We can then fit a model using the formula interface:
```{r modeling-super-logistic}
lr <- ml_logistic_regression(
  analysis_set, not_working ~ scaled_age + sex + drinks + drugs + essay_length
)
lr
```
```
Formula: not_working ~ scaled_age + sex + drinks + drugs + essay_length
Coefficients:
(Intercept) scaled_age sex_m drinks_socially
-2.823517e+00 -1.309498e+00 -1.918137e-01 2.235833e-01
drinks_rarely drinks_often drinks_not at all drinks_missing
6.732361e-01 7.572970e-02 8.214072e-01 -4.456326e-01
drinks_very often drugs_never drugs_missing drugs_sometimes
8.032052e-02 -1.712702e-01 -3.995422e-01 -7.483491e-02
essay_length
3.664964e-05
```
To<!--((("commands", "ml_evaluate()")))--> obtain a summary of performance metrics on the assessment set, we can use the `ml_evaluate()` function:
```{r modeling-super-logistic-eval}
validation_summary <- ml_evaluate(lr, assessment_set)
```
You can print `validation_summary` to see the available metrics:
```{r modeling-super-logistic-summary}
validation_summary
```
```
BinaryLogisticRegressionSummaryImpl
Access the following via `$` or `ml_summary()`.
- features_col()
- label_col()
- predictions()
- probability_col()
- area_under_roc()
- f_measure_by_threshold()
- pr()
- precision_by_threshold()
- recall_by_threshold()
- roc()
- prediction_col()
- accuracy()
- f_measure_by_label()
- false_positive_rate_by_label()
- labels()
- precision_by_label()
- recall_by_label()
- true_positive_rate_by_label()
- weighted_f_measure()
- weighted_false_positive_rate()
- weighted_precision()
- weighted_recall()
- weighted_true_positive_rate()
```
We can plot the ROC curve by collecting the output of `validation_summary$roc()` and using `ggplot2`:
```{r modeling-super-roc-code}
roc <- validation_summary$roc() %>%
  collect()

ggplot(roc, aes(x = FPR, y = TPR)) +
  geom_line() + geom_abline(lty = "dashed")
```
```{r modeling-super-roc-save, echo=FALSE}
saveRDS(roc, "data/modeling-okc-roc-1.rds")
```
```{r modeling-super-roc-plot, echo=FALSE, eval=FALSE}
roc <- readRDS("data/modeling-okc-roc-1.rds")
ggplot(roc, aes(x = FPR, y = TPR)) +
  geom_line() +
  geom_abline(lty = "dashed") +
  geom_hline(yintercept = 0, size = 1, colour = "#333333") +
  labs(title = "ROC Curve",
       subtitle = "Receiver operating characteristic curve for the logistic regression model") +
  ggsave("images/modeling-okc-roc1.png", device = "png", width = 10, height = 5)
```
Figure \@ref(fig:modeling-super-roc1) shows the results of the plot.
```{r modeling-super-roc1, echo = FALSE, eval = TRUE, fig.cap='ROC curve for the logistic regression model', fig.align = 'center'}
render_image("images/modeling-okc-roc1.png")
```
The ROC curve plots the true positive rate (sensitivity) against the false positive rate (1 − specificity) for varying values of the classification threshold. In practice, the business problem helps to determine where on the curve to set the threshold for classification. The AUC is a summary measure for determining the quality of a model, and we can compute it by calling the `area_under_roc()` function:
```{r modeling-super-area-roc}
validation_summary$area_under_roc()
```
```
[1] 0.7872754
```
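As a quick sanity check, you can approximate the same quantity yourself from the collected `roc` points with the trapezoidal rule (a sketch; expect small numerical differences from Spark's value):

```{r modeling-auc-trapezoid}
# Trapezoidal-rule AUC over the (FPR, TPR) points returned by $roc()
roc_sorted <- roc[order(roc$FPR), ]
sum(diff(roc_sorted$FPR) *
      (head(roc_sorted$TPR, -1) + tail(roc_sorted$TPR, -1)) / 2)
```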
**Note:** Spark provides `ml_evaluate()` methods only for generalized linear models (including linear models and logistic regression). For other algorithms, you can use the evaluator functions (e.g., `ml_binary_classification_evaluator()` on the prediction DataFrame) or compute your own metrics.
Now, we can easily repeat the logic we already have and apply it to each analysis/assessment split:
```{r modeling-super-map-roc}
cv_results <- purrr::map_df(1:10, function(v) {
  analysis_set <- do.call(rbind, vfolds[setdiff(1:10, v)]) %>% compute()
  assessment_set <- vfolds[[v]]

  scale_age <- make_scale_age(analysis_set)
  train_set <- scale_age(analysis_set)
  validation_set <- scale_age(assessment_set)

  model <- ml_logistic_regression(
    analysis_set, not_working ~ scaled_age + sex + drinks + drugs + essay_length
  )
  s <- ml_evaluate(model, assessment_set)
  roc_df <- s$roc() %>%
    collect()
  auc <- s$area_under_roc()

  tibble(
    Resample = paste0("Fold", stringr::str_pad(v, width = 2, pad = "0")),
    roc_df = list(roc_df),
    auc = auc
  )
})
```
This gives us 10 ROC curves:
```{r modeling-super-unnest-roc}
unnest(cv_results, roc_df) %>%
  ggplot(aes(x = FPR, y = TPR, color = Resample)) +
  geom_line() + geom_abline(lty = "dashed")
```
```{r modeling-super-unnest-roc-save, echo=FALSE, eval=FALSE}
cv_roc <- cv_results %>%
  unnest(roc_df)
saveRDS(cv_roc, "data/modeling-okc-cv-roc.rds")
```
```{r modeling-super-unnest-roc-render, echo=FALSE, eval=FALSE}
cv_roc <- readRDS("data/modeling-okc-cv-roc.rds")
cv_roc %>%
  ggplot(aes(x = FPR, y = TPR, color = Resample)) +
  geom_line() +
  geom_abline(lty = "dashed") +
  scale_color_grey() +
  labs(title = "ROC Curves",
       subtitle = "Cross-validated ROC curves for the logistic regression model") +
  geom_hline(yintercept = 0, size = 1, colour = "#333333") +
  ggsave("images/modeling-okc-roc2.png", device = "png", width = 10, height = 5)
```
Figure \@ref(fig:modeling-super-roc2) shows the results of the plot.
```{r modeling-super-roc2, echo = FALSE, eval = TRUE, fig.cap='Cross-validated ROC curves for the logistic regression model', fig.align = 'center'}
render_image("images/modeling-okc-roc2.png")
```
And we can obtain the average AUC metric:
```{r modeling-super-auc-mean}
mean(cv_results$auc)
```
```
[1] 0.7715102
```
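Alongside the mean, the spread across folds gives a rough sense of how stable the estimate is:

```{r modeling-auc-sd}
# Fold-to-fold variability of the AUC estimates
sd(cv_results$auc)
```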
### Generalized Linear Regression
If<!--((("supervised learning", "generalized linear regression")))((("generalized linear regression")))--> you are interested in generalized linear model (GLM) diagnostics, you can also fit a logistic regression via the generalized linear regression interface by specifying `family = "binomial"`. Because the result is a regression model, the `ml_predict()` method does not give class probabilities. However, it includes confidence intervals for coefficient estimates:
```{r modeling-super-glr-save, echo=FALSE, eval=FALSE}
glr <- ml_generalized_linear_regression(
  analysis_set,
  not_working ~ scaled_age + sex + drinks + drugs + essay_length,
  family = "binomial"
)
tidy_glr <- tidy(glr)
saveRDS(tidy_glr, "data/modeling-okc-tidy-glr.rds")
```
```{r modeling-super-glr-code}
glr <- ml_generalized_linear_regression(
  analysis_set,
  not_working ~ scaled_age + sex + drinks + drugs + essay_length,
  family = "binomial"
)
tidy_glr <- tidy(glr)
```
```{r modeling-super-glr-load, echo=FALSE}
tidy_glr <- readRDS("data/modeling-okc-tidy-glr.rds")
```
We<!--((("coefficient estimates")))((("DataFrames", "extracting coefficient estimates into")))--> can extract the coefficient estimates into a tidy DataFrame, which we can then process further—for example, to create a coefficient plot, which you can see in Figure \@ref(fig:modeling-super-glr-coefs).
```{r modeling-super-glr-plot-code}
tidy_glr %>%
  ggplot(aes(x = term, y = estimate)) +
  geom_point() +
  geom_errorbar(
    aes(ymin = estimate - 1.96 * std.error,
        ymax = estimate + 1.96 * std.error, width = .1)
  ) +
  coord_flip() +
  geom_hline(yintercept = 0, linetype = "dashed")
```
```{r modeling-super-glr-plot-render, echo=FALSE, eval=FALSE}
tidy_glr %>%
  ggplot(aes(x = term, y = estimate)) +
  geom_point() +
  geom_errorbar(aes(ymin = estimate - 1.96 * std.error,
                    ymax = estimate + 1.96 * std.error, width = .1)) +
  coord_flip() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(title = "Confidence Intervals",
       subtitle = "Coefficient estimates with 95% confidence intervals") +
  ggsave("images/modeling-okc-glr-coefs.png", width = 10, height = 5)
```
```{r modeling-super-glr-coefs, eval=TRUE, echo=FALSE, fig.cap='Coefficient estimates with 95% confidence intervals', fig.align = 'center'}
render_image("images/modeling-okc-glr-coefs.png")
```
**Note:** Both<!--((("commands", "ml_logistic_regression()")))((("commands", "ml_linear_regression()")))--> `ml_logistic_regression()` and `ml_linear_regression()` support elastic net regularization^[Zou H, Hastie T (2005). “Regularization and variable selection via the elastic net.” _Journal of the royal statistical society: series B (statistical methodology)_, 67(2), 301–320.] through the `reg_param` and `elastic_net_param` parameters. `reg_param` corresponds to $\lambda$, whereas `elastic_net_param` corresponds to $\alpha$. `ml_generalized_linear_regression()` supports only `reg_param`.
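A minimal sketch of fitting with regularization (the penalty values here are arbitrary placeholders, not tuned):

```{r modeling-elastic-net-sketch}
lr_reg <- ml_logistic_regression(
  analysis_set,
  not_working ~ scaled_age + sex + drinks + drugs + essay_length,
  elastic_net_param = 0.5,  # alpha: 0 = pure ridge, 1 = pure lasso
  reg_param = 0.01          # lambda: overall penalty strength
)
```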
### Other Models
Spark supports<!--((("supervised learning", "other models")))--> many of the standard modeling algorithms, and it's easy to apply these models and tune their hyperparameters (values that control the model-fitting process) for your particular problem. You can find a list of supported ML-related functions in the appendix. The interfaces to access these functionalities are largely identical, so it is easy to experiment with them. For example, to fit a neural network model we can run the following:
```{r modeling-super-other}
nn <- ml_multilayer_perceptron_classifier(
  analysis_set,
  not_working ~ scaled_age + sex + drinks + drugs + essay_length,
  layers = c(12, 64, 64, 2)
)
```
This<!--((("feedforward neural network model")))--> gives us a feedforward neural network model with two hidden layers of 64 nodes each. Note that you have to specify the correct values for the input and output layers in the `layers` argument: here the input size of 12 matches the number of predictor columns after dummy encoding (compare the 12 non-intercept coefficients in the logistic regression output earlier), and the output size of 2 is the number of classes. We can obtain predictions on a validation set using `ml_predict()`:
```{r modeling-super-other-predict}
predictions <- ml_predict(nn, assessment_set)
```
Then, we can compute the AUC via `ml_binary_classification_evaluator()`:
```{r modeling-super-other-eval}
ml_binary_classification_evaluator(predictions)
```
```
[1] 0.7812709
```
Up until now, we have not looked into the unstructured text in the essay fields apart from doing simple character counts. In the next section, we explore the textual data in more depth.<!--((("", startref="Msuperlearn04")))-->
## Unsupervised Learning
Along<!--((("modeling", "unsupervised learning")))((("unsupervised learning", "overview of")))--> with speech, images, and videos, textual data is one of the components of the big data explosion. Prior to modern text-mining techniques and the computational resources to support them, companies had little use for freeform text fields. Today, text is considered a rich source of insights that can be found anywhere from physician’s notes to customer complaints. In this section, we show some basic text analysis capabilities of `sparklyr`. If you would like more background on text-mining techniques, we recommend reading [_Text Mining with R_](https://oreil.ly/OrjWr) by David Robinson and Julie Silge (O'Reilly).
In this section, we show how to perform a basic topic-modeling task on the essay data in the `OKCupid` dataset. Our plan is to concatenate the essay fields (of which there are 10) of each profile and regard each profile as a document, then attempt to discover _topics_ (we define these soon) using Latent Dirichlet Allocation (LDA).
### Data Preparation
As<!--((("unsupervised learning", "data preparation")))--> always, before analyzing a dataset (or a subset of one), we want to take a quick look at it to orient ourselves. In this case, we are interested in the freeform text that the users entered into their dating profiles.
```{r modeling-unsuper-glimpse}
essay_cols <- paste0("essay", 0:9)

essays <- okc %>%
  select(!!essay_cols)

essays %>%
  glimpse()
```
```
Observations: ??
Variables: 10
Database: spark_connection
$ essay0 <chr> "about me:<br />\n<br />\ni would love to think that…
$ essay1 <chr> "currently working as an international agent for a f…
$ essay2 <chr> "making people laugh.<br />\nranting about a good sa…
$ essay3 <chr> "the way i look. i am a six foot half asian, half ca…
$ essay4 <chr> "books:<br />\nabsurdistan, the republic, of mice an…
$ essay5 <chr> "food.<br />\nwater.<br />\ncell phone.<br />\nshelt…
$ essay6 <chr> "duality and humorous things", "missing", "missing",…
$ essay7 <chr> "trying to find someone to hang out with. i am down …
$ essay8 <chr> "i am new to california and looking for someone to w…
$ essay9 <chr> "you want to be swept off your feet!<br />\nyou are …
```
Just from this output, we see the following:
- The text contains HTML tags
- The text contains the newline (`\n`) character
- There are missing values in the data
The HTML tags and special characters pollute the data since they are not directly input by the user and do not provide interesting information. Similarly, since we have encoded missing character fields with the _missing_ string, we need to remove it. (Note that by doing this we are also removing instances of the word "missing" written by the users, but the information lost from this removal is likely to be small.)
As you analyze your own text data, you will quickly come across and become familiar with the peculiarities of the specific dataset. As with tabular numerical data, preprocessing text data is an iterative process, and after a few tries we have the following transformations:
```{r modeling-unsuper-words}
essays <- essays %>%
  # Replace `missing` with empty string.
  mutate_all(list(~ ifelse(. == "missing", "", .))) %>%
  # Concatenate the columns.
  mutate(essay = paste(!!!syms(essay_cols))) %>%
  # Remove miscellaneous characters and HTML tags.
  mutate(words = regexp_replace(essay, "\\n|&nbsp;|<[^>]*>|[^A-Za-z|']", " "))
```
Note that here we are using `regexp_replace()`, which is a Spark SQL function rather than an R function. Next, we discuss LDA and how to apply it to our cleaned dataset.
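To see what the pattern does, here is a quick local illustration with `gsub()`, which supports the same regex (the sample string is made up):

```{r modeling-regex-sketch}
# Newlines, &nbsp; entities, HTML tags, and punctuation all become spaces
gsub("\\n|&nbsp;|<[^>]*>|[^A-Za-z|']", " ",
     "about me:<br />\ni love&nbsp;travel")
```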
### Topic Modeling
LDA<!--((("unsupervised learning", "topic modeling")))((("topic modeling")))--> is a type of topic model for identifying abstract "topics" in a set of documents. It is an unsupervised algorithm in that we do not provide any labels, or topics, for the input documents. LDA posits that each document is a mixture of topics, and each topic is a mixture of words. During training, it attempts to estimate both of these simultaneously. A typical use case for topic models involves categorizing many documents, for which the large number of documents renders manual approaches infeasible. The application domains range from GitHub issues to legal documents.
After we have a reasonably clean dataset following the workflow in the previous section, we can fit an LDA model with `ml_lda()`:
```{r modeling-unsuper-lda}
stop_words <- ml_default_stop_words(sc) %>%
  c(
    "like", "love", "good", "music", "friends", "people", "life",
    "time", "things", "food", "really", "also", "movies"
  )

lda_model <- ml_lda(essays, ~ words, k = 6, max_iter = 1, min_token_length = 4,
                    stop_words = stop_words, min_df = 5)
```
We also include a `stop_words` vector, consisting of commonly used English words together with words that are especially common in our dataset, which instructs the algorithm to ignore these terms. After the model is fit, we can use<!--((("commands", "tidy()")))--> the `tidy()` function to extract the associated betas, the per-topic-per-word probabilities, from the model.
```{r modeling-unsuper-lda-tidy}
betas <- tidy(lda_model)
betas
```
```{r modeling-unsuper-lda-save, echo=FALSE}
saveRDS(betas, "data/modeling-okc-betas-1.rds")
```
```
# A tibble: 256,992 x 3
topic term beta
<int> <chr> <dbl>
1 0 know 303.
2 0 work 250.
3 0 want 367.
4 0 books 211.
5 0 family 213.
6 0 think 291.
7 0 going 160.
8 0 anything 292.
9 0 enjoy 145.
10 0 much 272.
# … with 256,982 more rows
```
We can then visualize this output by looking at word probabilities by topic. In Figure \@ref(fig:modeling-unsuper-lda-betas-topics) and Figure \@ref(fig:modeling-unsuper-lda-betas-topics-2), we show the results at 1 iteration and 100 iterations, respectively. The code that generates Figure \@ref(fig:modeling-unsuper-lda-betas-topics) follows; to generate Figure \@ref(fig:modeling-unsuper-lda-betas-topics-2), you would need to set `max_iter = 100` when running `ml_lda()`, but beware that this can take a really long time on a single machine—this is the kind of big-compute problem that a proper Spark cluster would be able to easily tackle.
```{r modeling-unsuper-lda-betas}
betas %>%
  group_by(topic) %>%
  top_n(10, beta) %>%
  ungroup() %>%
  arrange(topic, -beta) %>%
  mutate(term = reorder(term, beta)) %>%
  ggplot(aes(term, beta, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") +
  coord_flip()
```
```{r modeling-unsuper-lda-betas-save, echo=FALSE, eval=FALSE}
lda_model <- ml_lda(