---
title: "Parallel Computing with the targets Package"
subtitle: "(Embarrassingly) Easy Parallelization"
author: "Jongoh Kim"
date: "`r Sys.Date()`"
date-format: long
institute: "LISER"
format:
  revealjs:
    preview-links: auto
    incremental: false
    theme: [moon, custom.scss]
    pdf-separate-fragments: true
    strip-comments: true
    highlight-style: atom-one
    auto-animate-duration: 0.8
    code-copy: true
    slide-number: true
execute:
  eval: false
  echo: true
editor:
  markdown:
    wrap: 72
---
<!-- Print to PDF -->
<!-- Follow this: https://quarto.org/docs/presentations/revealjs/presenting.html#print-to-pdf -->
<!-- Use chrome and not firefox -->
# Introduction
## Objective
<br>
::: {.callout-important icon="false" appearance="simple"}
This training aims to introduce you to (embarrassingly) simple parallel computing.
:::
## Prerequisite
<br>
::: {.callout-important icon="false" appearance="simple"}
This training is for people who have intermediate knowledge of R
programming!
:::
You should have at least the following experience:

- ***comfortably used the apply functions (`lapply`, `sapply`, `vapply`)***
- basic knowledge of the targets package
## What is parallel computing?
<br>
***Parallel computing is a type of computation in which many calculations or processes are carried out simultaneously.***
## What is embarrassingly parallel?
<br>
- also called embarrassingly parallelizable, perfectly parallel, delightfully parallel or pleasingly parallel
- little or no effort is needed to separate the problem into a number of parallel tasks
## When can I do (embarrassingly) parallel computing?
<br>
1. If you have more than one core in your CPU
2. If little or no dependency exists between the parallel tasks or their results!
   - e.g. for loops whose iterations are independent
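Condition 2 above is easy to check on a toy example: a loop is embarrassingly parallel when no iteration reads the result of another. A minimal sketch using only base R (the function and inputs here are illustrative, not from the training pipeline):

```r
# each iteration depends only on its own input, never on a previous
# iteration, so the loop can be split across cores with no coordination
square_one <- function(x) x^2

# sequential version
seq_res <- lapply(1:4, square_one)

# parallel version: base R's parallel package with 2 worker processes
library(parallel)
cl <- makeCluster(2)
par_res <- parLapply(cl, 1:4, square_one)
stopCluster(cl)

identical(seq_res, par_res)  # TRUE: same results, order preserved
```

Because the iterations are independent, the parallel version returns exactly the same list as the sequential one.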
## Three ways to do simple parallel computing with targets
<br>
(@) Easy setting (also designed for HPC)
- the clustermq package
- the future package
<br>
(@) Hard setting
- the parallel package
# Real Example
## Setting
<br>
Let's say we have a dataset of news articles, comprising real and fake news.
We're interested in calculating a negative/positive sentiment score for each article and looking at its distribution.
## Overall workflow
<!--html_preserve-->
<iframe src = "img/workflow.html" width="900" height="600"> </iframe>
<!--/html_preserve-->
## Without parallelization {auto-animate="true"}
<br>
Top part of the `_targets.R` file:
```{r}
library(targets)
source("scripts/functions/parallel_functions.R")

# Set packages.
tar_option_set(
  packages = c("qs", "dplyr", "stringr", "stringi", "ggplot2",
               "data.table", "parallel", "tidytext", "stopwords"),
  format = "qs"
)
```
## Without parallelization {auto-animate="true"}
<br>
```{.r code-line-numbers="11-22"}
library(targets)
source("scripts/functions/parallel_functions.R")

# Set packages.
tar_option_set(
  packages = c("qs", "dplyr", "stringr", "stringi", "ggplot2",
               "data.table", "parallel", "tidytext", "stopwords"),
  format = "qs"
)

# End this file with a list of target objects.
list(
  # reading in the news data
  tar_target(data,
             read_news()),
  # cleaning the text
  tar_target(cleaning_text,
             clean_text(data)),
  # doing the sentiment analysis without parallelization
  tar_target(sentiment_analysis,
             extract_sentiment(data, cleaning_text))
)
```
## The extract_sentiment function
```{r}
# getting the sentiment scores from a list of texts (each text is a vector of words!)
extract_sentiment <- function(data, clean_text_list){
  print("Doing simple lapply (for-loop)!")
  # creating the final table
  final.df <- data %>%
    select(-text)
  tryCatch(expr = {
    # getting the sentiment score
    final.df[, sentiment_score := sapply(X = clean_text_list,
                                         FUN = get_sentiment_score,
                                         USE.NAMES = FALSE)]
  })
  return(final.df)
}
```
## The output
```{r}
#reading the result
result <- tar_read(sentiment_analysis)
#getting the first 6 rows without the date information
result %>% select(-date) %>% head()
" title subject is_real sentiment_score
1: As U.S. budget fight looms, Republicans flip their fiscal script politicsNews TRUE 12
2: U.S. military to accept transgender recruits on Monday: Pentagon politicsNews TRUE 14
3: Senior U.S. Republican senator: 'Let Mr. Mueller do his job' politicsNews TRUE 6
4: FBI Russia probe helped by Australian diplomat tip-off: NYT politicsNews TRUE 7
5: Trump wants Postal Service to charge 'much more' for Amazon shipments politicsNews TRUE -5
6: White House, Congress prepare for talks on spending, immigration politicsNews TRUE 6"
```
## How long it took
<br>
![](img/lapply.png){fig.align="center"}
# With clustermq
```{r}
library(targets)
library(clustermq)
options(clustermq.scheduler = "multiprocess")
source("scripts/functions/parallel_functions.R")

# Set packages.
tar_option_set(
  packages = c("qs", "dplyr", "stringr", "stringi",
               "ggplot2", "data.table", "parallel",
               "tidytext", "stopwords"),
  format = "qs"
)
```
## With clustermq
<br>
Then you simply type:
```{r}
#without saying how many cores you will use
tar_make_clustermq()
"OR"
#setting how many cores you will use
tar_make_clustermq(workers = 2)
```
## REMEMBER!
<br>
::: {.callout-important icon="false" appearance="simple"}
To be safe, leave at least 33% of your cores to run your computer's OS and other background programs.
For instance, if you have 4 cores, use only 2!
:::
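The rule of thumb above can be written down as a small helper (the function name is illustrative, not from the pipeline):

```r
# illustrative helper: how many workers to request, keeping roughly
# 33% of your cores free for the OS and other background programs
cores_to_use <- function(total_cores) {
  max(1, floor(total_cores * 0.66))  # never fewer than 1 worker
}

cores_to_use(4)  # 2
cores_to_use(8)  # 5
cores_to_use(1)  # 1
```

The same `floor(... * 0.66)` calculation reappears later in the parallel version of `extract_sentiment`.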
## How long it took
<br>
![](img/cluster_mq.png){fig.align="center"}
# With Future
```{r}
library(targets)
library(future)
library(future.callr)
plan(callr)
source("scripts/functions/parallel_functions.R")

# Set packages.
tar_option_set(
  packages = c("qs", "dplyr", "stringr", "stringi", "ggplot2",
               "data.table", "parallel", "tidytext", "stopwords"),
  format = "qs"
)
```
## With Future
<br>
Then you simply type:
```{r}
#without saying how many cores you will use
tar_make_future()
"OR"
#setting how many cores you will use
tar_make_future(workers = 2)
```
## How long it took
<br>
![](img/future.png){fig.align="center"}
# With Parallel
Things are a bit different with parallel:
the top part of `_targets.R` is the same as in the lapply version, with the parallel package already in the packages list.
<br>
```{r}
library(targets)
source("scripts/functions/parallel_functions.R")
# configuring the script to run (run this once and it will create a _targets.yaml file in the project folder)
# tar_config_set(script = "scripts/2._targets_pattern.R")

# Set packages.
tar_option_set(
  packages = c("qs", "dplyr", "stringr", "stringi", "ggplot2",
               "data.table", "parallel", "tidytext", "stopwords"),
  format = "qs"
)
```
## The difference
<br>
::: {.callout-important icon="false" appearance="simple"}
The major difference lies in the function you call to do the parallel computing!
:::
## The extract_sentiment function before
```{r}
# getting the sentiment scores from a list of texts (each text is a vector of words!)
extract_sentiment <- function(data, clean_text_list){
  print("Doing simple lapply (for-loop)!")
  # creating the final table
  final.df <- data %>%
    select(-text)
  tryCatch(expr = {
    # getting the sentiment score
    final.df[, sentiment_score := sapply(X = clean_text_list,
                                         FUN = get_sentiment_score,
                                         USE.NAMES = FALSE)]
  })
  return(final.df)
}
```
## The extract_sentiment function for parallel {auto-animate="true"}
```{r}
# getting the sentiment scores from a list of texts (each text is a vector of words!)
extract_sentiment <- function(data, clean_text_list){
  print("Number of cores that could be used:")
  print(parallel::detectCores(logical = FALSE))
  # declaring the number of cores:
  # leave at least 33% of your cores for your OS & other programs
  num_cores <- floor(parallel::detectCores(logical = FALSE) * 0.66)
  # create the cluster
  cl <- makeCluster(num_cores)
  print("DON'T USE ALL YOUR CORES!")
  print(paste("Currently using", num_cores, "cores!"))
```
## The extract_sentiment function for parallel {auto-animate="true"}
```{.r code-line-numbers="12-27"}
# getting the sentiment scores from a list of texts (each text is a vector of words!)
extract_sentiment <- function(data, clean_text_list){
  print("Number of cores that could be used:")
  print(parallel::detectCores(logical = FALSE))
  # declaring the number of cores:
  # leave at least 33% of your cores for your OS & other programs
  num_cores <- floor(parallel::detectCores(logical = FALSE) * 0.66)
  # create the cluster
  cl <- makeCluster(num_cores)
  print("DON'T USE ALL YOUR CORES!")
  print(paste("Currently using", num_cores, "cores!"))
  # creating the final table
  final.df <- data %>%
    select(-text)
  tryCatch(expr = {
    # getting the sentiment score
    final.df[, sentiment_score := parSapply(cl = cl,
                                            X = clean_text_list,
                                            FUN = get_sentiment_score,
                                            USE.NAMES = FALSE)]
  },
  finally = {
    # stop using the cluster - IMPORTANT!
    stopCluster(cl)
  })
  return(final.df)
}
```
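The `makeCluster` / `parSapply` / `stopCluster` pattern used in the function can be tried in isolation with a toy function (a self-contained sketch using only base R; `count_words` is an illustrative stand-in for `get_sentiment_score`):

```r
library(parallel)

# toy stand-in for get_sentiment_score: just count the words
count_words <- function(words) length(words)

# two "cleaned texts", each a vector of words
texts <- list(c("good", "news"), c("fake", "bad", "news"))

cl <- makeCluster(2)
# wrap the parallel call so the cluster is always released,
# even if an iteration errors (same idea as the tryCatch/finally above)
scores <- tryCatch(
  parSapply(cl = cl, X = texts, FUN = count_words, USE.NAMES = FALSE),
  finally = stopCluster(cl)
)
scores  # 2 3
```

`parSapply` splits `texts` across the workers and simplifies the results into a vector, exactly like `sapply` does sequentially.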
## The get_sentiment_score function for parallel
You have to call the required packages inside the function!
```{r}
# getting the sentiment scores for each text
get_sentiment_score <- function(text){ # text should be a vector of words!
  # calling the packages again, because during parallelization packages need to be reloaded on each worker!
  packages <- c("qs", "dplyr", "stringr", "stringi", "data.table",
                "parallel", "tidytext", "stopwords")
  lapply(packages, require, character.only = TRUE)
  # setting the words related to sentiments
  sentiment_words <- get_sentiments("bing")
```
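The slide shows only the top of the function. The scoring idea itself can be sketched in a self-contained way with a tiny hand-made lexicon standing in for `get_sentiments("bing")`, so it runs without tidytext (all names here are illustrative):

```r
# illustrative two-sentiment lexicon, standing in for get_sentiments("bing")
toy_lexicon <- data.frame(
  word      = c("good", "great", "bad", "awful"),
  sentiment = c("positive", "positive", "negative", "negative")
)

# score = (number of positive words) - (number of negative words)
toy_sentiment_score <- function(text, lexicon = toy_lexicon) { # text: vector of words
  positives <- sum(text %in% lexicon$word[lexicon$sentiment == "positive"])
  negatives <- sum(text %in% lexicon$word[lexicon$sentiment == "negative"])
  positives - negatives
}

toy_sentiment_score(c("good", "great", "news"))  # 2
toy_sentiment_score(c("bad", "awful", "good"))   # -1
```

The real function works the same way, only with the much larger bing lexicon loaded on each worker.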
## How long it took
<br>
![](img/par_vs_simple.png){fig.align="center"}
# Thanks!
<br>
Special thanks to [Etienne Bacher](https://github.com/etiennebacher) for
his slide code!
<br>
Source code for slides:
[https://github.com/jongohkim91/targets_parallelization/blob/master/index.qmd](https://github.com/jongohkim91/targets_parallelization/blob/master/index.qmd){.external
target="_blank"}
<br>
Examples used in this training:
- [link](https://github.com/jongohkim91/targets_parallelization/tree/master/example%20codes%20or%20projects)
# Good resources
The {targets} R package user manual by Will Landau (the creator of the targets package)
1. The parallel computing in an HPC environment part:
[https://books.ropensci.org/targets/hpc.html](https://books.ropensci.org/targets/hpc.html){.external
target="_blank"}
<br>
2. clustermq part
[https://books.ropensci.org/targets/hpc.html#clustermq](https://books.ropensci.org/targets/hpc.html#clustermq){.external
target="_blank"}
<br>
3. future part
[https://books.ropensci.org/targets/hpc.html#future](https://books.ropensci.org/targets/hpc.html#future){.external
target="_blank"}
# Good resources
R Programming for Data Science by Roger D. Peng
- The Parallel Computation part:
[https://bookdown.org/rdpeng/rprogdatascience/parallel-computation.html](https://bookdown.org/rdpeng/rprogdatascience/parallel-computation.html){.external
target="_blank"}
<br>
Parallel Processing in R by Josh Errickson (University of Michigan, Department of Statistics)
Nice examples for parLapply:
[https://dept.stat.lsa.umich.edu/~jerrick/courses/stat701/notes/parallel.html#using-sockets-with-parlapply](https://dept.stat.lsa.umich.edu/~jerrick/courses/stat701/notes/parallel.html#using-sockets-with-parlapply){.external
target="_blank"}