forked from edgararuiz-zz/dbplot
-
Notifications
You must be signed in to change notification settings - Fork 0
/
README.Rmd
209 lines (147 loc) · 5.47 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
---
title: "dbplot"
output:
github_document:
toc: true
toc_depth: 3
---
```{r, setup, include = FALSE}
library(dplyr)
library(dbplot)
library(sparklyr)
library(nycflights13)
Sys.setenv("JAVA_HOME" = normalizePath("C:\\Program Files (x86)\\Java\\jre1.8.0_144"))
knitr::opts_chunk$set(fig.height = 3.5, fig.width = 4, fig.align = 'center')
```
[![Build Status](https://travis-ci.org/edgararuiz/dbplot.svg?branch=master)](https://travis-ci.org/edgararuiz/dbplot)
[![CRAN\_Status\_Badge](http://www.r-pkg.org/badges/version/dbplot)](http://cran.r-project.org/package=dbplot)
![CRAN downloads](https://cranlogs.r-pkg.org/badges/grand-total/dbplot)
Leverages `dplyr` to process the calculations of a plot inside a database. This package provides helper functions that abstract the work at three levels:
1. Functions that ouput a `ggplot2` object
2. Functions that outputs a `data.frame` object with the calculations.
3. Creates the formula needed to calculate bins for a Histogram or a Raster plot
## Installation
```{r, eval = FALSE}
# You can install the released version from CRAN
install.packages("dbplot")
# Or the the development version from GitHub:
install.packages("devtools")
devtools::install_github("edgararuiz/dbplot")
```
## Connecting to a data source
- For more information on how to connect to databases, including Hive, please visit http://db.rstudio.com
- To use Spark, please visit the `sparklyr` official website: http://spark.rstudio.com
## Example
In addition to database connections, the functions work with `sparklyr`. A Spark DataFrame will be used for the examples in this README.
```{r}
library(sparklyr)
conf <- spark_config()
sc <- spark_connect(master = "local", version = "2.1.0")
spark_flights <- copy_to(sc, nycflights13::flights, "flights")
```
## `ggplot`
### Histogram
By default `dbplot_histogram()` creates a 30 bin histogram
```{r}
library(ggplot2)
spark_flights %>%
dbplot_histogram(sched_dep_time)
```
Use `binwidth` to fix the bin size
```{r}
spark_flights %>%
dbplot_histogram(sched_dep_time, binwidth = 200)
```
Because it outputs a `ggplot2` object, more customization can be done
```{r}
spark_flights %>%
dbplot_histogram(sched_dep_time, binwidth = 300) +
labs(title = "Flights - Scheduled Departure Time") +
theme_bw()
```
### Raster
To visualize two continuous variables, we typically resort to a Scatter plot. However, this may not be practical when visualizing millions or billions of dots representing the intersections of the two variables. A Raster plot may be a better option, because it concentrates the intersections into squares that are easier to parse visually.
A Raster plot basically does the same as a Histogram. It takes two continuous variables and creates discrete 2-dimensional bins represented as squares in the plot. It then determines either the number of rows inside each square or processes some aggregation, like an average.
- If no `fill` argument is passed, the default calculation will be count, `n()`
```{r}
spark_flights %>%
filter(!is.na(arr_delay)) %>%
dbplot_raster(arr_delay, dep_delay)
```
- Pass an aggregation formula that can run inside the database
```{r}
spark_flights %>%
filter(!is.na(arr_delay)) %>%
dbplot_raster(arr_delay, dep_delay, mean(distance))
```
- Increase or decrease for more, or less, definition. The `resolution` argument controls that, it defaults to 100
```{r}
spark_flights %>%
filter(!is.na(arr_delay)) %>%
dbplot_raster(arr_delay, dep_delay, mean(distance), resolution = 500)
```
### Bar Plot
- `dbplot_bar()` defaults to a tally of each value in a discrete variable
```{r}
spark_flights %>%
dbplot_bar(origin)
```
- Pass a formula that will be operated for each value in the discrete variable
```{r}
spark_flights %>%
dbplot_bar(origin, mean(dep_delay))
```
### Line plot
- `dbplot_line()` defaults to a tally of each value in a discrete variable
```{r}
spark_flights %>%
dbplot_line(month)
```
- Pass a formula that will be operated for each value in the discrete variable
```{r}
spark_flights %>%
dbplot_line(month, mean(dep_delay))
```
### Boxplot
- It expect a discrete variable to group by, and a continuous variable to calculate the percentiles and IQR. It doesn't calculate outliers. Currently, this feature works with sparklyr and Hive connections.
```{r}
spark_flights %>%
dbplot_boxplot(origin, dep_delay)
```
## Calculation functions
If a more customized plot is needed, the data the underpins the plots can also be accessed:
1. `db_compute_bins()` - Returns a data frame with the bins and count per bin
2. `db_compute_count()` - Returns a data frame with the count per discrete value
3. `db_compute_raster()` - Returns a data frame with the results per x/y intersection
4. `db_compute_boxplot()` - Returns a data frame with boxplot calculations
```{r}
spark_flights %>%
db_compute_bins(arr_delay)
```
The data can be piped to a plot
```{r}
spark_flights %>%
filter(arr_delay < 100 , arr_delay > -50) %>%
db_compute_bins(arr_delay) %>%
ggplot() +
geom_col(aes(arr_delay, count, fill = count))
```
## `db_bin()`
Uses 'rlang' to build the formula needed to create the bins of a numeric variable in an un-evaluated fashion. This way, the formula can be then passed inside a dplyr verb.
```{r}
db_bin(var)
```
```{r}
spark_flights %>%
group_by(x = !! db_bin(arr_delay)) %>%
tally
```
```{r}
spark_flights %>%
filter(!is.na(arr_delay)) %>%
group_by(x = !! db_bin(arr_delay)) %>%
tally %>%
collect %>%
ggplot() +
geom_col(aes(x, n))
```