---
title: "Parallel Computing with the targets Package"
subtitle: "(Embarrassingly) Easy Parallelization"
author: "Jongoh Kim"
date: "`r Sys.Date()`"
date-format: long
institute: "LISER"
format:
  revealjs:
    preview-links: auto
    incremental: false
    theme: [moon, custom.scss]
    pdf-separate-fragments: true
    strip-comments: true
    highlight-style: atom-one
    auto-animate-duration: 0.8
    code-copy: true
    slide-number: true
execute:
  eval: false
  echo: true
editor:
  markdown:
    wrap: 72
---
<!-- Print to PDF -->
<!-- Follow this: https://quarto.org/docs/presentations/revealjs/presenting.html#print-to-pdf -->
<!-- Use chrome and not firefox -->
# Introduction
## Objective
<br>
::: {.callout-important icon="false" appearance="simple"}
This training aims to introduce you to (embarrassingly) simple parallel computing.
:::
## Prerequisite
<br>
::: {.callout-important icon="false" appearance="simple"}
This training is for people who have intermediate knowledge of R
programming!
:::
You should have at least the following experience:

- ***comfortably used the apply functions (`lapply`, `sapply`, `vapply`)***
- basic knowledge of the targets package
## What is parallel computing?
<br>
***Parallel computing is a type of computation in which many calculations or processes are carried out simultaneously.***
## What is embarrassingly parallel?
<br>
- also called embarrassingly parallelizable, perfectly parallel, delightfully parallel or pleasingly parallel
- little or no effort is needed to separate the problem into a number of parallel tasks
## When can I do (embarrassingly) parallel computing?
<br>
1. If you have more than one core in your CPU
2. If little or no dependency exists between the parallel tasks or their results!
   - e.g. for loops whose iterations are independent
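Condition 2 above is easy to check on a toy example: a loop is embarrassingly parallel when no iteration reads the result of another. A minimal sketch using only base R (the function and inputs here are illustrative, not from the training pipeline):

```r
# each iteration depends only on its own input, never on a previous
# iteration, so the loop can be split across cores with no coordination
square_one <- function(x) x^2

# sequential version
seq_res <- lapply(1:4, square_one)

# parallel version: base R's parallel package with 2 worker processes
library(parallel)
cl <- makeCluster(2)
par_res <- parLapply(cl, 1:4, square_one)
stopCluster(cl)

identical(seq_res, par_res)  # TRUE: same results, order preserved
```

Because the iterations are independent, the parallel version returns exactly the same list as the sequential one.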
## Three ways to do simple parallel computing with targets
<br>
(@) Easy setting (also designed for HPC)
- the clustermq package
- the future package
<br>
(@) Hard setting
- the parallel package
# Real Example
## Setting
<br>
Let's say we have a dataset of news articles, comprising real and fake news.
We're interested in calculating a negative/positive sentiment score for each article and looking at its distribution.
## Overall workflow
<!--html_preserve-->
<iframe src = "img/workflow.html" width="900" height="600"> </iframe>
<!--/html_preserve-->
## Without parallelization {auto-animate="true"}
<br>
Top part of the `_targets.R` file:
```{r}
library(targets)
source("scripts/functions/parallel_functions.R")

# Set packages.
tar_option_set(
  packages = c("qs", "dplyr", "stringr", "stringi", "ggplot2",
               "data.table", "parallel", "tidytext", "stopwords"),
  format = "qs"
)
```
## Without parallelization {auto-animate="true"}
<br>
```{.r code-line-numbers="11-22"}
library(targets)
source("scripts/functions/parallel_functions.R")

# Set packages.
tar_option_set(
  packages = c("qs", "dplyr", "stringr", "stringi", "ggplot2",
               "data.table", "parallel", "tidytext", "stopwords"),
  format = "qs"
)

# End this file with a list of target objects.
list(
  # reading in the news data
  tar_target(data,
             read_news()),
  # cleaning the text
  tar_target(cleaning_text,
             clean_text(data)),
  # doing the sentiment analysis without parallelization
  tar_target(sentiment_analysis,
             extract_sentiment(data, cleaning_text))
)
```
## The extract_sentiment function
```{r}
# getting the sentiment scores from a list of texts (each text is a vector of words!)
extract_sentiment <- function(data, clean_text_list){
  print("Doing simple lapply (for-loop)!")
  # creating the final table
  final.df <- data %>%
    select(-text)
  tryCatch(expr = {
    # getting the sentiment score
    final.df[, sentiment_score := sapply(X = clean_text_list,
                                         FUN = get_sentiment_score,
                                         USE.NAMES = FALSE)]
  })
  return(final.df)
}
```
## The output
```{r}
#reading the result
result <- tar_read(sentiment_analysis)
#getting the first 6 rows without the date information
result %>% select(-date) %>% head()
" title subject is_real sentiment_score
1: As U.S. budget fight looms, Republicans flip their fiscal script politicsNews TRUE 12
2: U.S. military to accept transgender recruits on Monday: Pentagon politicsNews TRUE 14
3: Senior U.S. Republican senator: 'Let Mr. Mueller do his job' politicsNews TRUE 6
4: FBI Russia probe helped by Australian diplomat tip-off: NYT politicsNews TRUE 7
5: Trump wants Postal Service to charge 'much more' for Amazon shipments politicsNews TRUE -5
6: White House, Congress prepare for talks on spending, immigration politicsNews TRUE 6"
```
## How long it took
<br>
![](img/lapply.png){fig.align="center"}
# With clustermq
```{r}
library(targets)
library(clustermq)
options(clustermq.scheduler = "multiprocess")
source("scripts/functions/parallel_functions.R")

# Set packages.
tar_option_set(
  packages = c("qs", "dplyr", "stringr", "stringi",
               "ggplot2", "data.table", "parallel",
               "tidytext", "stopwords"),
  format = "qs"
)
```
## With clustermq
<br>
Then you simply type:
```{r}
#without saying how many cores you will use
tar_make_clustermq()
"OR"
#setting how many cores you will use
tar_make_clustermq(workers = 2)
```
## REMEMBER!
<br>
::: {.callout-important icon="false" appearance="simple"}
To be safe, leave at least 33% of your cores to run your computer's OS and other background programs.
For instance, if you have 4 cores, use only 2!
:::
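The rule of thumb above can be written down as a small helper (the function name is illustrative, not from the pipeline):

```r
# illustrative helper: how many workers to request, keeping roughly
# 33% of your cores free for the OS and other background programs
cores_to_use <- function(total_cores) {
  max(1, floor(total_cores * 0.66))  # never fewer than 1 worker
}

cores_to_use(4)  # 2
cores_to_use(8)  # 5
cores_to_use(1)  # 1
```

The same `floor(... * 0.66)` calculation reappears later in the parallel version of `extract_sentiment`.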
## How long it took
<br>
![](img/cluster_mq.png){fig.align="center"}
# With Future
```{r}
library(targets)
library(future)
library(future.callr)
plan(callr)
source("scripts/functions/parallel_functions.R")

# Set packages.
tar_option_set(
  packages = c("qs", "dplyr", "stringr", "stringi", "ggplot2",
               "data.table", "parallel", "tidytext", "stopwords"),
  format = "qs"
)
```
## With Future
<br>
Then you simply type:
```{r}
#without saying how many cores you will use
tar_make_future()
"OR"
#setting how many cores you will use
tar_make_future(workers = 2)
```
## How long it took
<br>
![](img/future.png){fig.align="center"}
# With Parallel
Things are a bit different with parallel:
the top part of `_targets.R` is the same as in the lapply version, with the parallel package already in the packages list.
<br>
```{r}
library(targets)
source("scripts/functions/parallel_functions.R")
# configuring the script to run (run this once and it will create a _targets.yaml file in the project folder)
# tar_config_set(script = "scripts/2._targets_pattern.R")

# Set packages.
tar_option_set(
  packages = c("qs", "dplyr", "stringr", "stringi", "ggplot2",
               "data.table", "parallel", "tidytext", "stopwords"),
  format = "qs"
)
```
## The difference
<br>
::: {.callout-important icon="false" appearance="simple"}
The major difference lies in the function you call to do the parallel computing!
:::
## The extract_sentiment function before
```{r}
# getting the sentiment scores from a list of texts (each text is a vector of words!)
extract_sentiment <- function(data, clean_text_list){
  print("Doing simple lapply (for-loop)!")
  # creating the final table
  final.df <- data %>%
    select(-text)
  tryCatch(expr = {
    # getting the sentiment score
    final.df[, sentiment_score := sapply(X = clean_text_list,
                                         FUN = get_sentiment_score,
                                         USE.NAMES = FALSE)]
  })
  return(final.df)
}
```
## The extract_sentiment function for parallel {auto-animate="true"}
```{r}
# getting the sentiment scores from a list of texts (each text is a vector of words!)
extract_sentiment <- function(data, clean_text_list){
  print("Number of cores that could be used:")
  print(parallel::detectCores(logical = FALSE))
  # declaring the number of cores:
  # leave at least 33% of your cores for your OS & other programs
  num_cores <- floor(parallel::detectCores(logical = FALSE) * 0.66)
  # create the cluster
  cl <- makeCluster(num_cores)
  print("DON'T USE ALL YOUR CORES!")
  print(paste("Currently using", num_cores, "cores!"))
```
## The extract_sentiment function for parallel {auto-animate="true"}
```{.r code-line-numbers="12-27"}
# getting the sentiment scores from a list of texts (each text is a vector of words!)
extract_sentiment <- function(data, clean_text_list){
  print("Number of cores that could be used:")
  print(parallel::detectCores(logical = FALSE))
  # declaring the number of cores:
  # leave at least 33% of your cores for your OS & other programs
  num_cores <- floor(parallel::detectCores(logical = FALSE) * 0.66)
  # create the cluster
  cl <- makeCluster(num_cores)
  print("DON'T USE ALL YOUR CORES!")
  print(paste("Currently using", num_cores, "cores!"))
  # creating the final table
  final.df <- data %>%
    select(-text)
  tryCatch(expr = {
    # getting the sentiment score
    final.df[, sentiment_score := parSapply(cl = cl,
                                            X = clean_text_list,
                                            FUN = get_sentiment_score,
                                            USE.NAMES = FALSE)]
  },
  finally = {
    # stop using the cluster - IMPORTANT!
    stopCluster(cl)
  })
  return(final.df)
}
```
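The `makeCluster` / `parSapply` / `stopCluster` pattern used in the function can be tried in isolation with a toy function (a self-contained sketch using only base R; `count_words` is an illustrative stand-in for `get_sentiment_score`):

```r
library(parallel)

# toy stand-in for get_sentiment_score: just count the words
count_words <- function(words) length(words)

# two "cleaned texts", each a vector of words
texts <- list(c("good", "news"), c("fake", "bad", "news"))

cl <- makeCluster(2)
# wrap the parallel call so the cluster is always released,
# even if an iteration errors (same idea as the tryCatch/finally above)
scores <- tryCatch(
  parSapply(cl = cl, X = texts, FUN = count_words, USE.NAMES = FALSE),
  finally = stopCluster(cl)
)
scores  # 2 3
```

`parSapply` splits `texts` across the workers and simplifies the results into a vector, exactly like `sapply` does sequentially.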
## The get_sentiment_score function for parallel
You have to call the required packages inside the function!
```{r}
# getting the sentiment scores for each text
get_sentiment_score <- function(text){ # text should be a vector of words!
  # calling the packages again, because during parallelization packages need to be reloaded on each worker!
  packages <- c("qs", "dplyr", "stringr", "stringi", "data.table",
                "parallel", "tidytext", "stopwords")
  lapply(packages, require, character.only = TRUE)
  # setting the words related to sentiments
  sentiment_words <- get_sentiments("bing")
```
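The slide shows only the top of the function. The scoring idea itself can be sketched in a self-contained way with a tiny hand-made lexicon standing in for `get_sentiments("bing")`, so it runs without tidytext (all names here are illustrative):

```r
# illustrative two-sentiment lexicon, standing in for get_sentiments("bing")
toy_lexicon <- data.frame(
  word      = c("good", "great", "bad", "awful"),
  sentiment = c("positive", "positive", "negative", "negative")
)

# score = (number of positive words) - (number of negative words)
toy_sentiment_score <- function(text, lexicon = toy_lexicon) { # text: vector of words
  positives <- sum(text %in% lexicon$word[lexicon$sentiment == "positive"])
  negatives <- sum(text %in% lexicon$word[lexicon$sentiment == "negative"])
  positives - negatives
}

toy_sentiment_score(c("good", "great", "news"))  # 2
toy_sentiment_score(c("bad", "awful", "good"))   # -1
```

The real function works the same way, only with the much larger bing lexicon loaded on each worker.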
## How long it took
<br>
![](img/par_vs_simple.png){fig.align="center"}
# Thanks!
<br>
Special thanks to [Etienne Bacher](https://github.com/etiennebacher) for
his slide code!
<br>
Source code for slides:
[https://github.com/jongohkim91/targets_parallelization/blob/master/index.qmd](https://github.com/jongohkim91/targets_parallelization/blob/master/index.qmd){.external
target="_blank"}
<br>
Examples used in this training:
- [link](https://github.com/jongohkim91/targets_parallelization/tree/master/example%20codes%20or%20projects)
# Good resources
The {targets} R package user manual by Will Landau (the creator of the targets package)
1. The parallel computing in an HPC environment part:
[https://books.ropensci.org/targets/hpc.html](https://books.ropensci.org/targets/hpc.html){.external
target="_blank"}
<br>
2. clustermq part
[https://books.ropensci.org/targets/hpc.html#clustermq](https://books.ropensci.org/targets/hpc.html#clustermq){.external
target="_blank"}
<br>
3. future part
[https://books.ropensci.org/targets/hpc.html#future](https://books.ropensci.org/targets/hpc.html#future){.external
target="_blank"}
# Good resources
R Programming for Data Science by Roger D. Peng
- The Parallel Computation part:
[https://bookdown.org/rdpeng/rprogdatascience/parallel-computation.html](https://bookdown.org/rdpeng/rprogdatascience/parallel-computation.html){.external
target="_blank"}
<br>
Parallel Processing in R by Josh Errickson (University of Michigan, Department of Statistics)
Nice examples for parLapply:
[https://dept.stat.lsa.umich.edu/~jerrick/courses/stat701/notes/parallel.html#using-sockets-with-parlapply](https://dept.stat.lsa.umich.edu/~jerrick/courses/stat701/notes/parallel.html#using-sockets-with-parlapply){.external
target="_blank"}