```{r, spark-pipelines, include = FALSE}
eval_pipe <- FALSE
if (Sys.getenv("GLOBAL_EVAL") != "") eval_pipe <- as.logical(Sys.getenv("GLOBAL_EVAL"))
```
# Spark Pipelines
```{r, eval = eval_pipe, include = FALSE}
library(sparklyr)
library(dplyr)
```
## Build an Estimator (plan)
*Create a simple estimator that transforms data and fits a model*
1. Use the `spark_lineitems` variable to create a new aggregation by `order_id`. Summarize the total sales and the number of items
```{r, eval = eval_pipe}
```
2. Assign the code to a new variable called `orders`
```{r, eval = eval_pipe}
orders <-
```
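A possible solution for steps 1 and 2. The column names `total_sales` and `no_items`, and the `price` column in `spark_lineitems`, are assumptions (chosen to match the formula used later in this chapter):
```{r, eval = eval_pipe}
# Aggregate line items per order; `price` is an assumed column name
orders <- spark_lineitems %>%
  group_by(order_id) %>%
  summarise(
    total_sales = sum(price, na.rm = TRUE),
    no_items = n()
  )
```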
3. Start a new code chunk by calling `ml_pipeline(sc)`
```{r, eval = eval_pipe}
ml_pipeline(sc)
```
4. Pipe the `ml_pipeline()` code into a `ft_dplyr_transformer()` call. Use the `orders` variable as its argument
```{r, eval = eval_pipe}
ml_pipeline(sc) %>%
```
5. Add an `ft_binarizer()` step that determines whether the total sale is above $50. Name the new variable `above_50`
```{r, eval = eval_pipe}
ml_pipeline(sc) %>%
```
6. Using `ft_r_formula()`, add a step that sets the model's formula to `above_50 ~ no_items`
```{r, eval = eval_pipe}
ml_pipeline(sc) %>%
```
7. Finalize the pipeline by adding an `ml_logistic_regression()` step; no arguments are needed
```{r, eval = eval_pipe}
ml_pipeline(sc) %>%
```
8. Assign the code to a new variable called `orders_plan`
```{r, eval = eval_pipe}
orders_plan <- ml_pipeline(sc) %>%
```
9. Call `orders_plan` to confirm that all of the steps are present
```{r, eval = eval_pipe}
orders_plan
```
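Steps 3 through 8 build up a single estimator. A sketch of the finished plan might look like the following; the `"total_sales"` input column passed to `ft_binarizer()` is an assumption based on the aggregation in step 1:
```{r, eval = eval_pipe}
orders_plan <- ml_pipeline(sc) %>%
  # Embed the dplyr aggregation as a pipeline stage
  ft_dplyr_transformer(orders) %>%
  # Flag orders whose total sale exceeds $50
  ft_binarizer("total_sales", "above_50", threshold = 50) %>%
  # Set the modeling formula
  ft_r_formula(above_50 ~ no_items) %>%
  # Final estimator stage
  ml_logistic_regression()
```
Printing `orders_plan` should list all five stages in order.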
## Build a Transformer (fit)
*Execute the planned changes to obtain a new model*
1. Use `ml_fit()` to execute the changes in `orders_plan` using the `spark_lineitems` data. Assign the result to a new variable called `orders_fit`
```{r, eval = eval_pipe}
orders_fit <-
```
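One way to complete this step:
```{r, eval = eval_pipe}
# Fit every stage of the estimator against the Spark data
orders_fit <- ml_fit(orders_plan, spark_lineitems)
```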
2. Call `orders_fit` to see the print-out of the newly fitted model
```{r, eval = eval_pipe}
orders_fit
```
## Predictions using Spark Pipelines
*Overview of how to use a fitted pipeline to run predictions*
1. Use `ml_transform()` with the `orders_fit` model to run predictions over `spark_lineitems`
```{r, eval = eval_pipe}
orders_preds <- ml_transform(orders_fit, spark_lineitems)
```
2. With `count()`, compare the results from `above_50` against the predictions; the variable created by `ml_transform()` is called `prediction`
```{r, eval = eval_pipe}
```
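A possible comparison, cross-tabulating the binarized label against the model's `prediction` column:
```{r, eval = eval_pipe}
# Rows where above_50 and prediction disagree are misclassifications
orders_preds %>%
  count(above_50, prediction)
```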
## Save the pipeline objects
*Overview of how to save the Estimator and the Transformer*
1. Use `ml_save()` to save `orders_plan` in a new folder called "saved_model"
```{r, eval = eval_pipe}
```
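One possible call; `overwrite = TRUE` is an optional addition that lets the chunk be re-run safely:
```{r, eval = eval_pipe}
# Persist the (unfitted) estimator to disk
ml_save(orders_plan, "saved_model", overwrite = TRUE)
```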
2. Navigate to the "saved_model" folder to inspect its contents
3. Use `ml_save()` to save `orders_fit` in a new folder called "saved_pipeline"
```{r, eval = eval_pipe}
```
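And likewise for the fitted pipeline model:
```{r, eval = eval_pipe}
# Persist the fitted transformer to disk
ml_save(orders_fit, "saved_pipeline", overwrite = TRUE)
```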
4. Navigate to the "saved_pipeline" folder to inspect its contents