```{r include=FALSE}
knitr::opts_chunk$set(eval = FALSE)
source("r/render.R")
source("r/plots.R")
library(ggplot2)
```
# Extensions {#extensions}
> I try to know as many people as I can. You never know which one you’ll need.
>
> --- Tyrion Lannister
In [Chapter 9](#tuning), you learned how Spark processes data at large scale by allowing users to configure the cluster resources, partition data implicitly or explicitly, execute commands across distributed compute nodes, shuffle data across them when needed, cache data to improve performance, and serialize data efficiently over the network. You also learned how to configure the different Spark settings used while connecting, submitting a job, and running an application, as well as particular settings applicable only to R and R extensions that we present in this chapter.
Chapters [3](#analysis), [4](#modeling), and [8](#data) provided a foundation to read and understand most datasets. However, the functionality presented was scoped to Spark’s built-in features and tabular datasets. This chapter goes beyond tabular data and explores how to analyze and model networks of interconnected objects through graph processing, analyze genomics datasets, prepare data for deep learning, analyze geographic datasets, and use advanced modeling libraries like H2O and XGBoost over large-scale datasets.
The combination of all the content presented in the previous chapters should take care of most of your large-scale computing needs. However, for those few use cases in which functionality is still lacking, the following chapters provide tools to extend Spark yourself—through custom R transformations, custom Scala code, or a recently introduced execution mode in Spark that enables analyzing real-time datasets. Before reinventing the wheel, though, let's examine some of the extensions already available in Spark.
## Overview
In [Chapter 1](#intro), we<!--((("extensions", "overview of", id="Eover10")))--> presented the R community as a vibrant group of individuals collaborating with each other in many ways—for example, moving open science forward by creating R packages that you can install from CRAN. In a similar way, but at a much smaller scale, the R community has contributed extensions that increase the functionality initially supported in Spark and R. Spark itself also provides support for creating Spark extensions and, in fact, many R extensions make use of Spark extensions.
Extensions are constantly being created, so this section will inevitably become outdated at some point. In addition, we might not even be aware of many Spark and R extensions. However, at the very least we can track the extensions available on CRAN by looking at the ["reverse imports" for `sparklyr` in CRAN](http://bit.ly/2Z568xz). Extensions and R packages published on CRAN tend to be the most stable, since publishing a package on CRAN involves a review process that increases the overall quality of a contribution.
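If you prefer to query this list programmatically, the following sketch uses base R and the `tools` package; the exact set of packages it returns will change over time as new extensions are published:
```{r eval=FALSE}
# List CRAN packages that declare sparklyr in their Imports field
db <- available.packages(repos = "https://cran.r-project.org")
tools::package_dependencies("sparklyr", db = db,
                            which = "Imports", reverse = TRUE)
```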
While we wish we could present all the extensions, we’ve instead scoped this chapter to the extensions that should be the most interesting to you. You can find additional extensions at the [github.com/r-spark](https://github.com/r-spark) organization or by searching repositories on GitHub with the `sparklyr` tag.
rsparkling
: The `rsparkling` extension allows you to use H2O and Spark from R. This extension is what we would consider advanced modeling in Spark. While Spark's built-in modeling library, Spark MLlib, is quite useful in many cases, H2O's modeling capabilities can compute additional statistical metrics and can provide performance and scalability improvements over Spark MLlib. We have not performed detailed comparisons or benchmarks between MLlib and H2O, so this is something you will have to research on your own to build a complete picture of when to use H2O's capabilities.
graphframes
: The `graphframes` extension<!--((("graphframes extensions")))--> adds support to process graphs in Spark. A graph is a structure that describes a set of objects in which some pairs of the objects are in some sense related. As you learned in [Chapter 1](#intro), ranking web pages was an early motivation to develop precursors to Spark powered by MapReduce; web pages happen to form a graph if you consider a link between pages as the relationship between each pair of pages. Computing operations like PageRank over graphs can be quite useful in web search and social networks, for example.
sparktf
: The `sparktf` extension<!--((("sparktf extension")))((("TensorFlow")))--> provides support to write TensorFlow records in Spark. TensorFlow is one of the leading deep learning frameworks, and it is often used with large amounts of numerical data represented as TensorFlow records, a file format optimized for TensorFlow. Spark is often used to process unstructured and large-scale datasets into smaller numerical datasets that can easily fit into a GPU. You can use this extension to save datasets in the TensorFlow record file format.
xgboost
: The `xgboost` extension<!--((("xgboost extension")))--> brings the well-known XGBoost modeling library to the world of large-scale computing. XGBoost is a scalable, portable, and distributed library for gradient boosting. It became well known in the machine learning competition circles after its use in the winning solution of the [Higgs Boson Machine Learning Challenge](http://bit.ly/2YPE2qO) and has remained popular in other Kaggle competitions since then.
variantspark
: The `variantspark` extension<!--((("variantspark extension")))--> provides an interface to use Variant Spark, a scalable toolkit for<!--((("genome-wide association studies (GWAS)")))--> genome-wide association studies (GWAS). It currently provides functionality to build random forest models, estimate variable importance, and read variant call format (VCF) files. While there are other random forest implementations in Spark, most of them are not optimized to deal with GWAS datasets, which usually come with thousands of samples and millions of variables.
geospark
: The `geospark` extension<!--((("geospark extension")))--> enables us to load and query large-scale geographic datasets. Usually, datasets containing latitude and longitude points or complex areas are defined in the<!--((("Well-Known Text (WKT) format")))--> well-known text (WKT) format, a text markup language for representing vector geometry objects on a map.
Before<!--((("extensions", "using with R and Spark")))((("Spark using R", "using extensions with")))--> you learn how and when to use each extension, we should first briefly explain how you can use extensions with R and Spark.
First, a Spark extension is just an R package that happens to be aware of Spark. As with any other R package, you need to install the extension; once installed, you will need to reconnect to Spark before the extension can be used. So, in general, here's the pattern you should follow:
```{r eval=FALSE}
library(sparkextension)
library(sparklyr)
sc <- spark_connect(master = "<master>")
```
Notice that `sparklyr` is loaded after the extension to allow the extension to register properly. If you had to install and load a new extension, you would first need to disconnect using `spark_disconnect(sc)`, restart your R session, and repeat the preceding steps with the new extension.
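For instance, switching to a different extension would look roughly like the following sketch, where `sparkextension2` is a placeholder for whatever extension you want to load next:
```{r eval=FALSE}
spark_disconnect(sc)       # drop the current connection
# ...restart your R session before loading the new extension...
library(sparkextension2)   # placeholder name for the new extension
library(sparklyr)          # load sparklyr after the extension
sc <- spark_connect(master = "<master>")
```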
It’s not difficult to install and use Spark extensions from R; however, each extension can be a world of its own, so most of the time you will need to spend time understanding what the extension is, when to use it, and how to use it properly. The first extension you will learn about is the `rsparkling` extension, which enables you to use H2O in Spark with R.<!--((("", startref="Eover10")))-->
## H2O
[H2O](https://www.h2o.ai/), created by H2O.ai, is open source software<!--((("extensions", "H2O", id="Eh2o10")))((("H2O", id="h2o10")))--> for large-scale modeling that allows you to fit thousands of potential models as part of discovering patterns in data. You can consider using H2O to complement or replace Spark’s default modeling algorithms. It is common to use Spark’s default modeling algorithms and transition to H2O when Spark’s algorithms fall short or when advanced functionality (like additional modeling metrics or automatic model selection) is desired.
We can’t do justice to H2O’s great modeling capabilities in a single paragraph; explaining H2O properly would require a book in and of itself. Instead, we would like to recommend reading Darren Cook's [_Practical Machine Learning with H2O_](https://oreil.ly/l5RHI) (O'Reilly) to explore in-depth H2O’s modeling algorithms and features. In the meantime, you can use this section as a brief guide to get started using H2O in Spark with R.
To use H2O with Spark, it is important to know that there are four components involved: H2O, Sparkling Water, [`rsparkling`](http://bit.ly/2MlFxqa), and Spark. Sparkling Water allows users to combine the fast, scalable machine learning algorithms of H2O with the capabilities of Spark. You can think of Sparkling Water as a component bridging Spark with H2O and `rsparkling` as the R frontend for Sparkling Water, as depicted in Figure \@ref(fig:extensions-h2o-diagram).
```{r extensions-h2o-diagram, eval=TRUE, echo=FALSE, fig.cap='H2O components with Spark and R', out.height = '280pt', out.width = 'auto', fig.align = 'center'}
render_nomnoml("
#spacing: 20
#padding: 16
#lineWidth: 1
[R |
[h2o]
[rsparkling]
[sparklyr]
]->[Spark |
[Sparkling Water]
[H2O]
]
", "images/extensions-h2o-diagram.png")
```
First, install `rsparkling` and `h2o` as specified on the [`rsparkling` documentation site](http://bit.ly/2Z78MD0).
```{r extensions-h2o-install, exercise=TRUE}
install.packages("h2o", type = "source",
repos = "http://h2o-release.s3.amazonaws.com/h2o/rel-yates/5/R")
install.packages("rsparkling", type = "source",
repos = "http://h2o-release.s3.amazonaws.com/sparkling-water/rel-2.3/31/R")
```
It is important to note that you need to use compatible versions of Spark, Sparkling Water, and H2O, as specified in their documentation; we present instructions for Spark 2.3, so using a different Spark version will require installing matching versions of Sparkling Water and H2O. Let's start by checking the installed versions of `h2o` and `rsparkling`:
```{r extensions-h2o-version, eval=TRUE}
packageVersion("h2o")
packageVersion("rsparkling")
```
We then can connect with the supported Spark versions as follows (you will have to adjust the `master` parameter for your particular cluster):
```{r extensions-h2o-connect}
library(rsparkling)
library(sparklyr)
library(h2o)
sc <- spark_connect(master = "local", version = "2.3",
config = list(sparklyr.connect.timeout = 3 * 60))
cars <- copy_to(sc, mtcars)
```
H2O provides a web interface that can help you monitor training and access much of H2O’s functionality. You can access the web interface (called H2O Flow) through `h2o_flow(sc)`, as shown in Figure \@ref(fig:extensions-h2o-flow).
When using H2O, you will have to convert your Spark DataFrame into an H2O DataFrame through `as_h2o_frame()`:
```{r extensions-h2o-copy}
cars_h2o <- as_h2o_frame(sc, cars)
cars_h2o
```
```
mpg cyl disp hp drat wt qsec vs am gear carb
1 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
2 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
3 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
4 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
5 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
6 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
[32 rows x 11 columns]
```
```{r extensions-h2o-flow, eval=TRUE, fig.width=4, fig.align='center', echo=FALSE, fig.cap='The H2O Flow interface using Spark with R'}
render_image("images/extensions-h2o-flow.png")
```
Then, you can use many of the modeling functions available in the `h2o` package. For instance, we can fit a generalized linear model with ease:
```{r extensions-h2o-glm}
model <- h2o.glm(x = c("wt", "cyl"),
y = "mpg",
training_frame = cars_h2o,
lambda_search = TRUE)
```
H2O provides additional metrics not necessarily available in Spark's modeling algorithms. For instance, `Residual Deviance` is reported for the model we just fit, whereas this would not be a standard metric when using Spark MLlib.
```{r extensions-h2o-model}
model
```
```
...
MSE: 6.017684
RMSE: 2.453097
MAE: 1.940985
RMSLE: 0.1114801
Mean Residual Deviance : 6.017684
R^2 : 0.8289895
Null Deviance :1126.047
Null D.o.F. :31
Residual Deviance :192.5659
Residual D.o.F. :29
AIC :156.2425
```
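If you need a single metric programmatically rather than the printed summary, the `h2o` package also exposes accessor functions; as a brief sketch (check your installed `h2o` version's documentation for the exact accessors available):
```{r eval=FALSE}
# Extract individual metrics from the fitted H2O model
h2o.rmse(model)
h2o.r2(model)
h2o.aic(model)
```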
Then, you can run prediction over the generalized linear model (GLM). A similar approach would work for many other models available in H2O:
```{r extensions-h2o-predict}
predictions <- as_h2o_frame(sc, copy_to(sc, data.frame(wt = 2, cyl = 6)))
h2o.predict(model, predictions)
```
```
predict
1 24.05984
[1 row x 1 column]
```
You can also use H2O to perform automatic training and tuning of many models, meaning that H2O can choose which model to use for you using [AutoML](https://oreil.ly/Ck9Ao):
```{r extensions-h2o-automl}
automl <- h2o.automl(x = c("wt", "cyl"), y = "mpg",
training_frame = cars_h2o,
max_models = 20,
seed = 1)
```
For this particular dataset, H2O determines that a deep learning model is a better fit than a GLM.^[Notice that AutoML uses cross-validation, which we did not use in GLM.] Specifically, H2O’s AutoML explored using XGBoost, deep learning, GLM, and a Stacked Ensemble model:
```{r extensions-h2o-automl-print}
automl@leaderboard
```
```
model_id mean_residual_dev… rmse mse mae rmsle
1 DeepLearning_… 6.541322 2.557601 6.541322 2.192295 0.1242028
2 XGBoost_grid_1… 6.958945 2.637981 6.958945 2.129421 0.1347795
3 XGBoost_grid_1_… 6.969577 2.639996 6.969577 2.178845 0.1336290
4 XGBoost_grid_1_… 7.266691 2.695680 7.266691 2.167930 0.1331849
5 StackedEnsemble… 7.304556 2.702694 7.304556 1.938982 0.1304792
6 XGBoost_3_… 7.313948 2.704431 7.313948 2.088791 0.1348819
```
Rather than using the leaderboard, you can focus on the best model through `automl@leader`; for example, you can glance at the particular parameters from this deep learning model as follows:
```{r}
tibble::tibble(parameter = names(automl@leader@parameters),
value = as.character(automl@leader@parameters))
```
```
# A tibble: 20 x 2
parameter values
<chr> <chr>
1 model_id DeepLearning_grid_1_AutoML…
2 training_frame automl_training_frame_rdd…
3 nfolds 5
4 keep_cross_validation_models FALSE
5 keep_cross_validation_predictions TRUE
6 fold_assignment Modulo
7 overwrite_with_best_model FALSE
8 activation RectifierWithDropout
9 hidden 200
10 epochs 10003.6618461538
11 seed 1
12 rho 0.95
13 epsilon 1e-06
14 input_dropout_ratio 0.2
15 hidden_dropout_ratios 0.4
16 stopping_rounds 0
17 stopping_metric deviance
18 stopping_tolerance 0.05
19 x c("cyl", "wt")
20 y mpg
```
You can then predict using the leader as follows:
```{r}
h2o.predict(automl@leader, predictions)
```
```
predict
1 30.74639
[1 row x 1 column]
```
Many additional examples are [available](http://bit.ly/2NdTIwX), and you can also request help from the official [GitHub repository](http://bit.ly/2MlFxqa) for the `rsparkling` package.
The next extension, `graphframes`, allows you to process large-scale relational datasets. Before you start using it, make sure to disconnect with `spark_disconnect(sc)` and restart your R session since using a different extension requires you to reconnect to Spark and reload `sparklyr`.<!--((("", startref="h2o10")))((("", startref="Eh2o10")))-->
```{r extensions-h2o-disconnect, echo=FALSE}
spark_disconnect(sc)
```
## Graphs
The<!--((("extensions", "graphs", id="Egraph10")))((("graphs", "graph theory")))--> first paper in the history of graph theory was written by Leonhard Euler on the Seven Bridges of Königsberg in 1736. The problem was to devise a walk through the city that would cross each bridge once and only once. Figure \@ref(fig:extensions-eulers-paths) presents the original diagram.
```{r extensions-eulers-paths, eval=TRUE, fig.width=4, fig.align='center', echo=FALSE, fig.cap='The Seven Bridges of Königsberg from the Euler archive'}
render_image("images/extensions-eulers-paths.png")
```
Today, a<!--((("graphs", "directed and undirected")))((("directed graphs")))((("undirected graphs")))--> graph is defined as an ordered pair $G=(V,E)$, with $V$ a set of vertices (nodes or points) and $E \subseteq \{\{x, y\} \mid (x, y) \in V^2 \land x \ne y\}$ a set of edges (links or lines), which are either an unordered pair for _undirected graphs_ or an ordered pair for _directed graphs_. The former describes links where the direction does not matter, and the latter describes links where it does.
As a simple example, we can use the `highschool` dataset from the `ggraph` package, which tracks friendship among high school boys. In this dataset, the vertices are the students and the edges describe pairs of students who happen to be friends in a particular year:
```{r eval=FALSE, exercise=TRUE}
install.packages("ggraph")
install.packages("igraph")
```
```{r}
ggraph::highschool
```
```
# A tibble: 506 x 3
from to year
<dbl> <dbl> <dbl>
1 1 14 1957
2 1 15 1957
3 1 21 1957
4 1 54 1957
5 1 55 1957
6 2 21 1957
7 2 22 1957
8 3 9 1957
9 3 15 1957
10 4 5 1957
# … with 496 more rows
```
While<!--((("graphs", "GraphX component")))((("GraphX component")))--> the high school dataset can easily be processed in R, even medium-size graph datasets can be difficult to process without distributing this work across a cluster of machines, for which Spark is well suited. Spark supports processing graphs through the [`graphframes`](http://bit.ly/2Z5hVYB) extension, which in turn uses the [GraphX](http://bit.ly/30cbKU6) Spark component. GraphX is Apache Spark’s API for graphs and<!--((("parallel execution")))--> graph-parallel computation. It’s comparable in performance to the fastest specialized graph-processing systems and provides a growing library of graph algorithms.
A<!--((("DataFrames", "constructing for vertices")))--> graph in Spark is also represented as a DataFrame of edges and vertices; however, the format is slightly different from the `highschool` data frame since we will need to construct a separate DataFrame of vertices. Let's first install the [`graphframes`](http://bit.ly/2Z5hVYB) extension:
```{r eval=FALSE}
install.packages("graphframes")
```
Next, we need to connect, copying the `highschool` dataset and transforming the graph to the format that this extension expects. Here, we scope this dataset to the friendships of the year 1957:
```{r extensions-graphframes}
library(graphframes)
library(sparklyr)
library(dplyr)
sc <- spark_connect(master = "local", version = "2.3")
highschool_tbl <- copy_to(sc, ggraph::highschool, "highschool") %>%
filter(year == 1957) %>%
transmute(from = as.character(as.integer(from)),
to = as.character(as.integer(to)))
from_tbl <- highschool_tbl %>% distinct(from) %>% transmute(id = from)
to_tbl <- highschool_tbl %>% distinct(to) %>% transmute(id = to)
vertices_tbl <- distinct(sdf_bind_rows(from_tbl, to_tbl))
edges_tbl <- highschool_tbl %>% transmute(src = from, dst = to)
```
The `vertices_tbl` table is expected to have a single `id` column:
```{r extensions-graphframes-vertices}
vertices_tbl
```
```
# Source: spark<?> [?? x 1]
id
<chr>
1 1
2 34
3 37
4 43
5 44
6 45
7 56
8 57
9 65
10 71
# … with more rows
```
And the `edges_tbl` is expected to have `src` and `dst` columns:
```{r extensions-graphframes-eedges}
edges_tbl
```
```
# Source: spark<?> [?? x 2]
src dst
<chr> <chr>
1 1 14
2 1 15
3 1 21
4 1 54
5 1 55
6 2 21
7 2 22
8 3 9
9 3 15
10 4 5
# … with more rows
```
You can now create a GraphFrame:
```{r}
graph <- gf_graphframe(vertices_tbl, edges_tbl)
```
We<!--((("graphs", "degree of a vertex")))((("degree of a vertex")))((("valency of a vertex")))--> can now use this graph to start analyzing this dataset. For instance, let's find out how many friends each boy has on average, which is referred to as the _degree_ or _valency_ of a _vertex_:
```{r}
gf_degrees(graph) %>% summarise(friends = mean(degree))
```
```
# Source: spark<?> [?? x 1]
friends
<dbl>
1 6.94
```
We can then find the shortest paths to a specific vertex (a boy, for this dataset). Since the data is anonymized, we can just pick the boy identified as `33` and find how many degrees of separation lie between him and each of the other boys:
```{r}
gf_shortest_paths(graph, 33) %>%
filter(size(distances) > 0) %>%
mutate(distance = explode(map_values(distances))) %>%
select(id, distance)
```
```
# Source: spark<?> [?? x 2]
id distance
<chr> <int>
1 19 5
2 5 4
3 27 6
4 4 4
5 11 6
6 23 4
7 36 1
8 26 2
9 33 0
10 18 5
# … with more rows
```
Finally, we<!--((("graphs", "PageRank algorithm")))((("PageRank algorithm")))--> can also compute PageRank over this graph, which was presented in [Chapter 1](#intro)'s discussion of Google’s page ranking algorithm:
```{r echo=FALSE, eval=FALSE}
model <- gf_graphframe(vertices_tbl, edges_tbl) %>%
gf_pagerank(reset_prob = 0.15, max_iter = 10L)
highschool_tbl %>% collect() %>%
saveRDS("data/09-extensions-graphframes-highschool.rds")
```
```{r extensions-graphframes-code}
gf_graphframe(vertices_tbl, edges_tbl) %>%
gf_pagerank(reset_prob = 0.15, max_iter = 10L)
```
```
GraphFrame
Vertices:
Database: spark_connection
$ id <dbl> 12, 12, 14, 14, 27, 27, 55, 55, 64, 64, 41, 41, 47, 47, 6…
$ pagerank <dbl> 0.3573460, 0.3573460, 0.3893665, 0.3893665, 0.2362396, 0.…
Edges:
Database: spark_connection
$ src <dbl> 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 12, 12, 12,…
$ dst <dbl> 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17,…
$ weight <dbl> 0.25000000, 0.25000000, 0.25000000, 0.25000000, 0.25000000,…
```
To give you some insight into this dataset, Figure \@ref(fig:extensions-graph-pagerank) plots it using the `ggraph` package and highlights the vertices with the highest PageRank scores:
```{r extensions-graph-pagerank-plot, eval=FALSE}
highschool_tbl %>%
igraph::graph_from_data_frame(directed = FALSE) %>%
ggraph(layout = 'kk') +
geom_edge_link(alpha = 0.2,
arrow = arrow(length = unit(2, 'mm')),
end_cap = circle(2, 'mm'),
start_cap = circle(2, 'mm')) +
geom_node_point(size = 2, alpha = 0.4)
```
```{r extensions-graph-pagerank-create, eval=FALSE, echo=FALSE}
library(ggraph)
library(igraph)
highschool_rdf <- readRDS("data/09-extensions-graphframes-highschool.rds")
highschool_rdf %>% igraph::graph_from_data_frame(directed = FALSE) %>%
ggraph(layout = 'kk') +
geom_edge_link(alpha = 0.2,
arrow = arrow(length = unit(2, 'mm')),
end_cap = circle(2, 'mm'),
start_cap = circle(2, 'mm')) +
geom_node_point(size = 2, alpha = 0.4) + theme_light() +
annotate("point", x = -1.18, y = -3.55, size = 3) +
annotate("point", x = 6.25, y = 2.85, size = 3) + xlab("") + ylab("") +
ggsave("images/extensions-graph-pagerank.png", width = 10, height = 5)
```
```{r extensions-graph-pagerank, eval=TRUE, fig.width=4, fig.align='center', echo=FALSE, fig.cap='High school ggraph dataset with highest PageRank highlighted'}
render_image("images/extensions-graph-pagerank.png")
```
There<!--((("graphs", "additional graph algorithms")))--> are many more graph algorithms provided in `graphframes`—for example, breadth-first search, connected components, label propagation for detecting communities, strongly connected components, and triangle count; a couple of these are sketched below. For questions on this extension, refer to the official [GitHub repository](http://bit.ly/2Z5hVYB).
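As a hedged sketch, and assuming the `gf_triangle_count()` and `gf_connected_components()` wrappers are available in your version of `graphframes`, two of these algorithms can be run on the same graph object (connected components also requires a checkpoint directory):
```{r eval=FALSE}
gf_triangle_count(graph)                 # triangles passing through each vertex

spark_set_checkpoint_dir(sc, tempdir())  # required before connected components
gf_connected_components(graph)
```
We now present a popular gradient-boosting framework—make sure to disconnect and restart before trying the next extension.<!--((("", startref="Egraph10")))-->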
```{r echo=FALSE}
spark_disconnect(sc)
```
## XGBoost
A _decision tree_ is<!--((("decision trees")))((("extensions", "XGBoost")))((("xgboost extension")))--> a flowchart-like structure in which each internal node represents a test on an attribute, each branch represents the outcome of the test, and each leaf node represents a class label. For example, Figure \@ref(fig:extensions-decision-diagram) shows a decision tree that could help classify whether an employee is likely to leave given a set of factors like job satisfaction and overtime. When a decision tree is used to predict continuous variables instead of discrete outcomes—say, how likely someone is to leave a company—it is referred to as a<!--((("regression trees")))--> _regression tree_.
```{r extensions-decision-diagram, eval=TRUE, echo=FALSE, fig.align='center', fig.cap='A decision tree to predict job attrition based on known factors', out.height = '280pt', out.width = 'auto', fig.align = 'center'}
render_nomnoml("
[0.5|Job Satisfaction > 0.2?
]-Yes[0.3|Over Time Hours > 10?]
[0.5]-No[0.7|Job Level > 2?]
[0.3]-Yes[0.65]
[0.3]-No[0.20]
[0.7]-Yes[0.15]
[0.7]-No[0.70]
", "images/extensions-decision-diagram.png")
```
While a decision tree representation is quite easy to understand and interpret, finding the decisions in the tree requires mathematical techniques<!--((("gradient descent")))--> like _gradient descent_ to find a local minimum. Gradient descent takes steps proportional to the negative of the gradient of the function at the current point. The gradient is represented by $\nabla$ and the learning rate by $\gamma$. You simply start from a given state $a_n$ and compute the next iteration $a_{n+1}$ by following the direction of the gradient:
<center style="margin-top: 30px; margin-bottom: 30px">
$a_{n+1} = a_n - \gamma \nabla F(a_n)$
</center>
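As a toy illustration of this update rule (plain R, not part of XGBoost or Spark), the following sketch minimizes $F(a) = (a - 3)^2$, whose gradient is $\nabla F(a) = 2(a - 3)$:
```{r eval=FALSE}
gradient <- function(a) 2 * (a - 3)  # gradient of F(a) = (a - 3)^2
a <- 0                               # starting point a_0
gamma <- 0.1                         # learning rate
for (i in 1:100) {
  a <- a - gamma * gradient(a)       # a_{n+1} = a_n - gamma * grad F(a_n)
}
a                                    # approaches the minimum at a = 3
```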
XGBoost is an open source software library that provides a gradient-boosting framework. It aims to provide scalable, portable, and distributed gradient boosting for training gradient-boosted decision trees (GBDT) and gradient-boosted <!--((("gradient-boosted decision trees (GBDT)")))((("gradient-boosted regression trees (GBRT)")))-->regression trees (GBRT). Gradient-boosted means XGBoost uses gradient descent and boosting, which is a technique that chooses each predictor sequentially.
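To make the idea of choosing predictors sequentially more concrete, here is a toy boosting sketch on a local data frame using the `rpart` package (assumed to be installed); it only illustrates the principle and is not how XGBoost is implemented:
```{r eval=FALSE}
library(rpart)

train <- mtcars[, c("mpg", "wt", "cyl")]
prediction <- rep(mean(train$mpg), nrow(train))   # start from the mean
learning_rate <- 0.1

for (i in 1:50) {
  train$resid <- train$mpg - prediction           # residuals of the ensemble
  tree <- rpart(resid ~ wt + cyl, data = train,
                control = rpart.control(maxdepth = 2))
  prediction <- prediction + learning_rate * predict(tree, train)
}

mean((train$mpg - prediction)^2)                  # training error after boosting
```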
`sparkxgb` is an extension that you can use to train XGBoost models in Spark; however, be aware that currently Windows is unsupported. To use this extension, first install it from CRAN:
```{r extensions-xgb-install, eval=FALSE, exercise=TRUE}
install.packages("sparkxgb")
```
Then, you need to import the `sparkxgb` extension followed by your usual Spark connection code, adjusting `master` as needed:
```{r extensions-xgb-connect}
library(sparkxgb)
library(sparklyr)
library(dplyr)
sc <- spark_connect(master = "local", version = "2.3")
```
For this example, we use the `attrition` dataset from the `rsample` package, which you would need to install by using `install.packages("rsample")`. This is a fictional dataset created by IBM data scientists to uncover the factors that lead to employee attrition:
```{r extensions-xgb-copy}
attrition <- copy_to(sc, rsample::attrition)
attrition
```
```
# Source: spark<?> [?? x 31]
Age Attrition BusinessTravel DailyRate Department DistanceFromHome
<int> <chr> <chr> <int> <chr> <int>
1 41 Yes Travel_Rarely 1102 Sales 1
2 49 No Travel_Freque… 279 Research_… 8
3 37 Yes Travel_Rarely 1373 Research_… 2
4 33 No Travel_Freque… 1392 Research_… 3
5 27 No Travel_Rarely 591 Research_… 2
6 32 No Travel_Freque… 1005 Research_… 2
7 59 No Travel_Rarely 1324 Research_… 3
8 30 No Travel_Rarely 1358 Research_… 24
9 38 No Travel_Freque… 216 Research_… 23
10 36 No Travel_Rarely 1299 Research_… 27
# … with more rows, and 25 more variables: Education <chr>,
# EducationField <chr>, EnvironmentSatisfaction <chr>, Gender <chr>,
# HourlyRate <int>, JobInvolvement <chr>, JobLevel <int>, JobRole <chr>,
# JobSatisfaction <chr>, MaritalStatus <chr>, MonthlyIncome <int>,
# MonthlyRate <int>, NumCompaniesWorked <int>, OverTime <chr>,
# PercentSalaryHike <int>, PerformanceRating <chr>,
# RelationshipSatisfaction <chr>, StockOptionLevel <int>,
# TotalWorkingYears <int>, TrainingTimesLastYear <int>,
# WorkLifeBalance <chr>, YearsAtCompany <int>, YearsInCurrentRole <int>,
# YearsSinceLastPromotion <int>, YearsWithCurrManager <int>
```
To build an XGBoost model in Spark, use `xgboost_classifier()`. We will compute attrition against all other features by using the `Attrition ~ .` formula and specify `2` for the number of classes since the attrition attribute tracks only whether an employee leaves or stays. Then, you can use `ml_predict()` to predict over large-scale datasets:
```{r extensions-xgb-model}
xgb_model <- xgboost_classifier(attrition,
Attrition ~ .,
num_class = 2,
num_round = 50,
max_depth = 4)
xgb_model %>%
ml_predict(attrition) %>%
select(Attrition, predicted_label, starts_with("probability_")) %>%
glimpse()
```
```
Observations: ??
Variables: 4
Database: spark_connection
$ Attrition <chr> "Yes", "No", "Yes", "No", "No", "No", "No", "No", "No", …
$ predicted_label <chr> "No", "Yes", "No", "Yes", "Yes", "Yes", "Yes", "Yes", "Y…
$ probability_No <dbl> 0.753938094, 0.024780750, 0.915146366, 0.143568754, 0.07…
$ probability_Yes <dbl> 0.24606191, 0.97521925, 0.08485363, 0.85643125, 0.927375…
```
XGBoost became well known in the competition circles after its use in the winning solution of the Higgs Boson Machine Learning Challenge, which used data from the ATLAS experiment to identify the Higgs boson. Since then, it has become a popular model used in a large number of Kaggle competitions. However, decision trees can prove limiting, especially for datasets with nontabular data like images, audio, and text, which you can better tackle with deep learning models—should we remind you to disconnect and restart?
```{r echo=FALSE}
spark_disconnect(sc)
```
## Deep Learning
A _perceptron_ is<!--((("perceptrons")))((("extensions", "deep learning", id="Edeep10")))((("deep learning", id="deep10")))--> a mathematical model introduced by Frank Rosenblatt,^[Rosenblatt F (1958). “The perceptron: a probabilistic model for information storage and organization in the brain.” _Psychological review_.] who developed it as a theory for a hypothetical nervous system. The perceptron maps stimuli to numeric inputs that are weighted into a threshold function that activates only when enough stimulus is present; mathematically:
$f(x) = \begin{cases} 1 & \sum_{i=1}^m w_i x_i + b > 0\\ 0 & \text{otherwise} \end{cases}$
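As a small, self-contained illustration, the threshold unit can be written as a plain R function; the weights, bias, and inputs below are arbitrary:
```{r eval=FALSE}
# A single perceptron: weighted sum of inputs plus bias, thresholded at zero
perceptron <- function(x, w, b) as.numeric(sum(w * x) + b > 0)

perceptron(x = c(0.5, 0.2), w = c(1.0, -0.6), b = -0.1)  # returns 1
```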
Minsky<!--((("multilayered perceptrons")))--> and Papert found out that a single perceptron can classify only datasets that are linearly separable; however, they also revealed in their book _Perceptrons_ that layering perceptrons would bring additional classification capabilities.^[Minsky M, Papert SA (2017). _Perceptrons: An introduction to computational geometry_. MIT press.] Figure \@ref(fig:extensions-minsky-layered) presents the original diagram showcasing a multilayered perceptron.
```{r extensions-minsky-layered, eval=TRUE, echo=FALSE, fig.cap='Layered perceptrons, as illustrated in the book Perceptrons', fig.align = 'center'}
render_image("images/extensions-minsky-multi-layers.png")
```
Before we start, let’s first install all the packages that we are about to use:
```{r eval=FALSE, exercise=TRUE}
install.packages("sparktf")
install.packages("tfdatasets")
```
Using Spark we can create a multilayer perceptron classifier with `ml_multilayer_perceptron_classifier()` and gradient descent to classify and predict over large datasets. Gradient descent was introduced to layered perceptrons by Geoff Hinton.^[Ackley DH, Hinton GE, Sejnowski TJ (1985). “A learning algorithm for Boltzmann machines.” _Cognitive science_.]
```{r}
library(sparktf)
library(sparklyr)
sc <- spark_connect(master = "local", version = "2.3")
attrition <- copy_to(sc, rsample::attrition)
nn_model <- ml_multilayer_perceptron_classifier(
attrition,
Attrition ~ Age + DailyRate + DistanceFromHome + MonthlyIncome,
layers = c(4, 3, 2),
solver = "gd")
nn_model %>%
ml_predict(attrition) %>%
select(Attrition, predicted_label, starts_with("probability_")) %>%
glimpse()
```
```
Observations: ??
Variables: 4
Database: spark_connection
$ Attrition <chr> "Yes", "No", "Yes", "No", "No", "No", "No", "No", "No"…
$ predicted_label <chr> "No", "No", "No", "No", "No", "No", "No", "No", "No", …
$ probability_No <dbl> 0.8439275, 0.8439275, 0.8439275, 0.8439275, 0.8439275,…
$ probability_Yes <dbl> 0.1560725, 0.1560725, 0.1560725, 0.1560725, 0.1560725,…
```
Notice that the columns must be numeric, so you will need to manually convert them with the feature transformation techniques presented in [Chapter 4](#modeling).
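For example, a character column like `BusinessTravel` can be indexed into a numeric column with `ft_string_indexer()` before it is used in the model formula; a minimal sketch:
```{r eval=FALSE}
# Index a character column into a numeric one before using it as a feature
attrition %>%
  ft_string_indexer("BusinessTravel", "BusinessTravelIndexed") %>%
  select(BusinessTravel, BusinessTravelIndexed)
```
It is natural to try to add more layers to classify more complex datasets; however, adding too many layers causes the gradient to vanish, and other techniques are needed to train these deeply layered networks, also known as _deep learning models_.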
Deep learning models solve the vanishing gradient problem by making use of special activation functions, dropout, data augmentation, and GPUs. You can use Spark to retrieve and preprocess large datasets into numerical-only datasets that can fit in a GPU for deep learning training. TensorFlow is one of the most popular deep learning frameworks. As mentioned previously, it supports a binary format known as TensorFlow records.
You can write TensorFlow records with the `sparktf` extension in Spark and then process them on GPU instances with libraries like Keras or TensorFlow.
Specifically, you can preprocess large datasets in Spark and write them as TensorFlow records using `spark_write_tfrecord()`:
```{r}
copy_to(sc, iris) %>%
ft_string_indexer_model(
"Species", "label",
labels = c("setosa", "versicolor", "virginica")
) %>%
spark_write_tfrecord(path = "tfrecord")
```
To train a model with Keras or TensorFlow, you can then load these records using the `tfdatasets` package. You will also need to install the TensorFlow runtime with `install_tensorflow()` and install Python on your own. To learn more about training deep learning models with Keras, we recommend reading _Deep Learning with R_.^[Chollet F, Allaire J (2018). _Deep Learning with R_. Manning Publications.]
```{r eval=FALSE}
tensorflow::install_tensorflow()
tfdatasets::tfrecord_dataset("tfrecord/part-r-00000")
```
```
<DatasetV1Adapter shapes: (), types: tf.string>
```
Training deep learning models on a single local node with one or more GPUs is often enough for most applications; however, recent state-of-the-art deep learning models are trained using distributed computing frameworks like Apache Spark. Distributed computing frameworks<!--((("OpenAI")))((("artificial intelligence (AI)")))--> are used to reach the ever-growing amounts of compute, measured in petaflop/s-days, spent training these models. [OpenAI](http://bit.ly/2HawofQ) analyzed trends in the field of _artificial intelligence_ (AI) and cluster computing (illustrated in Figure \@ref(fig:extensions-distributed-training)); the figure shows a clear trend in recent years toward distributed computing frameworks.
```{r echo=FALSE, eval=FALSE}
data.frame(
system = c("AlexNet", "Dropout", "DQN", "Conv Nets", "GoogleNet", "Seq2Seq / VGG", "ResNets", "DeepSpeech2",
"Xception", "TI7 Dota 1v1", "Neural Architecture Search", "Neural Machine Translation", "AlphaZero", "AlphaGo Zero"),
petaflops = c(0.0058, 0.003, 0.0002, 0.009, 0.02, 0.1, 0.12, 0.30,
4.5, 8.5, 40, 90, 500, 2000),
year = as.Date(paste0(c("2012-02", "2012-06", "2013-09", "2013-08", "2013-08", "2013-07", "2015-11", "2015-11",
"2016-10", "2017-08", "2016-10", "2016-08", "2017-11", "2017-09"), "-01"), "%Y-%m-%d"),
training = c("local", "local", "local", "local", "local", "local", "local", "local",
"distributed", "distributed", "distributed", "distributed", "distributed", "distributed")) %>%
ggplot(aes(x = year, y = petaflops, label = system, color = training)) +
scale_color_grey(start = 0.2, end = 0.6) +
labs(title = "Distributed Deep Learning",
subtitle = "Training using distributed systems based on OpenAI analysis") +
theme(legend.position = "right") +
geom_point(size = 2.5) +
geom_text(aes(label = system), hjust = -0.2, vjust = 0.4) +
scale_y_log10(
limits = c(1/10^4, 10^4),
breaks = c(1/10^5, 1/10^4, 1/10^3, 1/10^2, 1/10^1, 1, 10^1, 10^2, 10^3, 10^4),
labels = c("0.00001", "0.0001", "0.001", "0.01", "0.1", "1", "10", "100", "1000", "10000")
) +
scale_x_date(name = "",
limits = as.Date(as.character(c(2011, 2020)), "%Y"),
breaks = as.Date(as.character(seq(2013, 2019, by = 2)), "%Y"),
labels = as.character(seq(2013, 2019, by = 2))) +
ggsave("images/extensions-distributed-training.png", width = 10, height = 5)
```
```{r extensions-distributed-training, eval=TRUE, echo=FALSE, fig.cap='Training using distributed systems based on OpenAI analysis', fig.align = 'center'}
render_image("images/extensions-distributed-training.png")
```
Training large-scale deep learning models is possible in Spark and TensorFlow through frameworks like Horovod. Today, it's possible to use Horovod with Spark from R through the `reticulate` package; however, since Horovod requires Python and Open MPI, this goes beyond the scope of this book. Next, we will introduce a different Spark extension in the domain of genomics.<!--((("", startref="Edeep10")))((("", startref="deep10")))-->
```{r echo=FALSE}
spark_disconnect(sc)
```
## Genomics
The [human genome](http://bit.ly/2z2gMqn) consists<!--((("extensions", "genomics", id="Egenomics10")))((("genomics", id="genomics10")))--> of two copies of about 3 billion base pairs of DNA within the 23 chromosome pairs. Figure \@ref(fig:extensions-genomics-diagram) shows the organization of the genome into chromosomes. DNA strands are composed of nucleotides, each composed of one of four nitrogen-containing nucleobases: cytosine (C), guanine (G), adenine (A), or thymine (T). Since the DNA of all humans is nearly identical, we need to store only the differences from the reference genome<!--((("variant call format (VCF)")))--> in the form of a variant call format (VCF) file.
```{r extensions-genomics-diagram, eval=TRUE, echo=FALSE, fig.cap='The idealized human diploid karyotype showing the organization of the genome into chromosomes', fig.align = 'center'}
render_image("images/extensions-genomics-diagram.png")
```
`variantspark`<!--((("variantspark extension")))--> is a framework based on Scala and Spark to analyze genome datasets. It is developed by the CSIRO Bioinformatics team in Australia. `variantspark` was tested on datasets with 3,000 samples, each containing 80 million features, in both unsupervised clustering approaches and supervised applications like classification and regression. `variantspark`<!--((("DataFrames", "reading VCF files and running analyses")))--> can read VCF files and run analyses while using familiar Spark DataFrames.
To get started, install `variantspark` from CRAN, connect to Spark, and retrieve a `vsc` connection to `variantspark`:
```{r extensions-genomics-connect}
library(variantspark)
library(sparklyr)
sc <- spark_connect(master = "local", version = "2.3",
config = list(sparklyr.connect.timeout = 3 * 60))
vsc <- vs_connect(sc)
```
We can start by loading a VCF file:
```{r extensions-genomics-read-vcf}
vsc_data <- system.file("extdata/", package = "variantspark")
hipster_vcf <- vs_read_vcf(vsc, file.path(vsc_data, "hipster.vcf.bz2"))
hipster_labels <- vs_read_csv(vsc, file.path(vsc_data, "hipster_labels.txt"))
labels <- vs_read_labels(vsc, file.path(vsc_data, "hipster_labels.txt"))
```
`variantspark` uses random forests to assign an importance score to each tested variant, reflecting its association with the phenotype of interest; a variant with a higher importance score is more strongly associated with that phenotype. You can compute the importance scores and transform them into a Spark table as follows:
```{r extensions-genomics-importance-code}
importance_tbl <- vs_importance_analysis(vsc, hipster_vcf,
labels, n_trees = 100) %>%
importance_tbl()
importance_tbl
```
```
# Source: spark<?> [?? x 2]
variable importance
<chr> <dbl>
1 2_109511398 0
2 2_109511454 0
3 2_109511463 0.00000164
4 2_109511467 0.00000309
5 2_109511478 0
6 2_109511497 0
7 2_109511525 0
8 2_109511527 0
9 2_109511532 0
10 2_109511579 0
# … with more rows
```
You then can use `dplyr` and `ggplot2` to transform the output and visualize it (see Figure \@ref(fig:extensions-genomics-importance)):
```{r extensions-genomics-plot-code}
library(dplyr)
library(ggplot2)
importance_df <- importance_tbl %>%
arrange(-importance) %>%
head(20) %>%
collect()
ggplot(importance_df) +
aes(x = variable, y = importance) +
geom_bar(stat = 'identity') +
scale_x_discrete(limits =
importance_df[order(importance_df$importance), 1]$variable) +
coord_flip()
```
```{r echo=FALSE, eval=FALSE}
ggplot(importance_df) +
aes(x = variable, y = importance) +
labs(title = "Importance Analysis",
subtitle = "Genomic importance analysis using variantspark") +
scale_color_grey(start = 0.2, end = 0.6) +
theme(legend.position = "right",
axis.text.y = element_text(color = "#999999"),
axis.text = ggplot2::element_text(size=8)) +
geom_bar(stat = 'identity', fill = "#888888") +
scale_x_discrete(limits = importance_df[order(importance_df$importance), 1]$variable) +
coord_flip() +
ggsave("images/extensions-genomics-importance.png", width = 10, height = 5)
```
```{r extensions-genomics-importance, eval=TRUE, echo=FALSE, fig.cap='Genomic importance analysis using variantspark', fig.align = 'center'}
render_image("images/extensions-genomics-importance.png")
```
This concludes a brief introduction to genomic analysis in Spark using the `variantspark` extension. Next, we move away from microscopic genes to macroscopic datasets that contain geographic locations across the world.<!--((("", startref="Egenomics10")))((("", startref="genomics10")))-->
```{r extensions-genomics-disconnect, echo=FALSE}
spark_disconnect(sc)
```
## Spatial
[`geospark`](http://bit.ly/2zbTEW8) enables<!--((("extensions", "spatial")))((("geospark extensions")))((("spatial operations")))--> distributed geospatial computing using a grammar compatible with [`dplyr`](http://bit.ly/2KYKOAC) and [`sf`](http://bit.ly/2ZerAwb), which provides a set of tools for working with geospatial vectors.
You can install `geospark` from CRAN, as follows:
```{r eval=FALSE, exercise=TRUE}
install.packages("geospark")
```
Then, initialize the `geospark` extension and connect to Spark:
```{r}
library(geospark)
library(sparklyr)
sc <- spark_connect(master = "local", version = "2.3")
```
Next, we load a spatial dataset containing polygons and points:
```{r}
polygons <- system.file("examples/polygons.txt", package="geospark") %>%
read.table(sep="|", col.names = c("area", "geom"))
points <- system.file("examples/points.txt", package="geospark") %>%
read.table(sep = "|", col.names = c("city", "state", "geom"))
polygons_wkt <- copy_to(sc, polygons)
points_wkt <- copy_to(sc, points)
```
There are various spatial operations defined in `geospark`, as depicted in Figure \@ref(fig:extensions-geospark-operations). These operations allow you to control how geospatial data should be queried based on overlap, intersection, disjoint sets, and so on.
```{r extensions-geospark-operations, eval=TRUE, fig.width=4, fig.align='center', echo=FALSE, fig.cap='Spatial operations available in geospark'}
render_image("images/extensions-geospark-operations.png")
```
For instance, we can use these operations to find the polygons that contain a given set of points using `st_contains()`:
```{r}
library(dplyr)
polygons_wkt <- mutate(polygons_wkt, y = st_geomfromwkt(geom))
points_wkt <- mutate(points_wkt, x = st_geomfromwkt(geom))
inner_join(polygons_wkt,
points_wkt,
sql_on = sql("st_contains(y,x)")) %>%
group_by(area, state) %>%
summarise(cnt = n())
```
```
# Source: spark<?> [?? x 3]
# Groups: area
area state cnt
<chr> <chr> <dbl>
1 california area CA 10
2 new york area NY 9
3 dakota area ND 10
4 texas area TX 10
5 dakota area SD 1
```
You can also plot these datasets by collecting a subset of the entire dataset, or by aggregating the geometries in Spark before collecting them. One package you should look into for this is `sf`, as sketched below.
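As a minimal sketch (assuming the collected subset fits comfortably in local memory), you could collect the points and let `sf` parse the WKT geometry for plotting:
```{r eval=FALSE}
library(sf)
library(ggplot2)

# Collect a small subset and parse the WKT column locally with sf
points_local <- points_wkt %>%
  select(city, state, geom) %>%
  collect() %>%
  st_as_sf(wkt = "geom")

ggplot(points_local) + geom_sf()
```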
We close this chapter by presenting a couple of troubleshooting techniques applicable to all extensions.
```{r echo=FALSE}
spark_disconnect(sc)
```
## Troubleshooting
When<!--((("extensions", "troubleshooting")))((("troubleshooting", "extensions")))--> you are using a new extension for the first time, we recommend increasing the connection timeout (given that Spark will usually need to download extension dependencies) and changing logging to verbose to help you troubleshoot when the download process does not complete:
```{r eval=FALSE}
config <- spark_config()
config$sparklyr.connect.timeout <- 3 * 60
config$sparklyr.log.console <- TRUE
sc <- spark_connect(master = "local", config = config)
```
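If the connection succeeds but an extension still misbehaves, sparklyr's own helpers for inspecting the Spark log and web UI can also be useful:
```{r eval=FALSE}
spark_log(sc, n = 100)  # inspect the most recent Spark log entries
spark_web(sc)           # open the Spark web UI to check loaded jars and jobs
```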
In addition, you should know that [Apache IVY](http://ant.apache.org/ivy) is a popular dependency manager focusing on flexibility and simplicity, and is used by Apache Spark for installing extensions. When the connection fails while you are using an extension, consider clearing your [IVY cache](http://bit.ly/2Zcubun) by running the following:
```{r extensions-rsparkling-cache, eval=FALSE}
unlink("~/.ivy2", recursive = TRUE)
```
In addition, you can also consider opening GitHub issues from the following extensions repositories to get help from the extension authors:
- *rsparkling*: [github.com/h2oai/sparkling-water](https://github.com/h2oai/sparkling-water).
- *sparkxgb*: [github.com/rstudio/sparkxgb](https://github.com/rstudio/sparkxgb).
- *sparktf*: [github.com/rstudio/sparktf](https://github.com/rstudio/sparktf).
- *variantspark*: [github.com/r-spark/variantspark](https://github.com/r-spark/variantspark).
- *geospark*: [github.com/r-spark/geospark](https://github.com/r-spark/geospark).
## Recap
This chapter provided a brief overview of using some of the Spark extensions available in R, which happens to be as easy as installing a package. You then learned about the `rsparkling` extension, which provides access to H2O in Spark and, with it, additional modeling functionality like enhanced metrics and the ability to automatically select models. We then jumped to `graphframes`, an extension to help you process relational datasets that are formally referred to as graphs. You also learned how to compute simple connection metrics or run complex algorithms like PageRank.
The XGBoost and deep learning sections provided alternate modeling techniques that use gradient descent: the former over decision trees, and the latter over deep multilayered perceptrons where we can use Spark to preprocess datasets into records that later can be consumed by TensorFlow and Keras using the `sparktf` extension. The last two sections introduced extensions to process genomic and spatial datasets through the `variantspark` and `geospark` extensions.
These extensions, and many more, provide a comprehensive library of advanced functionality that, in combination with the analysis and modeling techniques presented, should cover most tasks required to run in computing clusters. However, when functionality is lacking, you can consider writing your own extension, which is what we discuss in [Chapter 13](#contributing), or you can apply custom transformations over each partition using R code, as we describe in [Chapter 11](#distributed).