README.md
13 additions & 16 deletions
@@ -7,16 +7,13 @@
In a typical data science workflow there are multiple steps involved, from data aggregation, cleaning, exploratory analysis, modeling and so on. As the data science community matures, we are seeing that there are a variety of languages which provide better capabilities for specific steps in the data science workflow. *R* is typically used for data transformations, statistical models, and visualizations, while *Python* provides more robust functions for machine learning. In addition to this, *Spark* provides an environment to process high volume data - either as one-time/ batch or as streams.
-The job of today's data scientist is changing from one where they are married to a specific tool or language, to one where they are using all these tools for their specialized purposes. The key problem then becomes one of translation between these tools for seamless analysis.
+The job of today's data scientist is changing from one where they are married to a specific tool or language, to one where they are using all these tools for their specialized purposes. The key problem then becomes one of translation between these tools for seamless analysis. Additionally, a data scientist often needs to perform the same task repeatedly, as well as to put certain analysis flows (or pipelines) into production to work on new data periodically, or on streaming data.
-Recently in the data science community, interfaces for using these various tools have been published. In terms of R packages, the *reticulate* package provides an interface to Python, and the *SparkR* and *sparklyr* packages provide an interface to Spark.
+Recently, interfaces for using these various tools have been published. In terms of R packages, the *reticulate* package provides an interface to Python, and the *SparkR* and *sparklyr* packages provide an interface to Spark.
-The *analysisPipelines* package uses these interfaces to enable *Interoperable Pipelines* i.e. the ability define an execute a reusable data science pipeline which can contain functions to be executed in an R environment, in a Python environment or in a Spark environment. These pipelines can saved and loaded, to enable batch operation as datasets get updated with new data.
-
-Additionally, in the work of a data scientist, there is a need to perform the same task repeatedly, as well as put certain analysis flows (or) pipelines into production to work on new data periodically, or work on streaming data.
-
-The goal of the *analysisPipelines* package is to enable data scientists to compose pipelines of analysis which consist of data manipulation, exploratory analysis & reporting, as well as modeling steps. It also aims to enable data scientists to use tools of their choice through an *R* interface, and compose **interoperable** pipelines between *R, Spark, and Python.*
+The *analysisPipelines* package uses these interfaces to enable *Interoperable Pipelines*, i.e. the ability to compose and execute a reusable data science pipeline which can contain functions to be executed in an *R* environment, in a *Python* environment or in a *Spark* environment. These pipelines can be saved and loaded, to enable batch operation as datasets get updated with new data.
+The goal of the *analysisPipelines* package is to make the job of the data scientist easier and help them compose pipelines of analysis which consist of data manipulation, exploratory analysis & reporting, as well as modeling steps. The idea is for data scientists to use tools of their choice through an *R* interface provided by this package.
Essentially, it allows data scientists to:
* Compose **reusable, interoperable** pipelines in a flexible manner
@@ -26,29 +23,29 @@ Essentially, it allows data scientists to:
## Types of pipelines
-This package supports for both **batch/ repeated** pipelines, as well as *streaming pipelines.*
+This package supports both *batch/ repeated* pipelines, as well as *streaming pipelines.*
For *batch* pipelines, the vision is to enable interoperable pipelines which execute efficiently with functions in *R*, *Spark* and *Python*.
-For *streaming* pipelines, the package allows for streaming analyses through *Spark Structured Streaming.*
+For *streaming* pipelines, the package allows for streaming analyses through *Apache Spark Structured Streaming.*
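As a hedged sketch of how a streaming source might feed such a pipeline: the `read.stream` call below is standard SparkR Structured Streaming, while passing the resulting streaming DataFrame to the *StreamingAnalysisPipeline* constructor through an `input` argument is an assumption to be checked against the package reference.

```r
library(analysisPipelines)
library(SparkR)  # assumes an active SparkR session (see the SparkR setup sketch further down)

# Define a streaming source with SparkR Structured Streaming
streamDF <- read.stream(source = "socket", host = "localhost", port = 9999)

# Wrap the streaming DataFrame in a streaming pipeline object
# (constructor argument name assumed, not confirmed API)
streamingPipelineObj <- StreamingAnalysisPipeline(input = streamDF)
```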
## Classes and implementation
-The *analysisPipelines* package uses S4 classes and methods to implement all the core functionality. The fundamental class exposed in this package is the *BaseAnalysisPipeline* class on which most of the core functions are implemented. The user, however, interacts with the *AnalysisPipeline* and *StreamingAnalysisPipeline* classes for batch and streaming analysis respectively. In this vignette, we work with the *AnalysisPipeline* class, with functions solely in R.
+The *analysisPipelines* package uses S4 classes and methods to implement all the core functionality. The fundamental class exposed in this package is the *BaseAnalysisPipeline* class on which most of the core functions are implemented. The user, however, interacts with the *AnalysisPipeline* and *StreamingAnalysisPipeline* classes for batch and streaming analysis respectively.
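For readers who want to poke at the class hierarchy directly, standard S4 tooling works; a minimal sketch, assuming the package is installed and attached:

```r
library(analysisPipelines)

# Inspect the slots of the user-facing batch class and the methods
# defined on the core class it extends
showClass("AnalysisPipeline")
showMethods(classes = "BaseAnalysisPipeline")
```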
## Pipelining semantics
-The package stays true to the *tidyverse* pipelining style which also fits nicely into the idea of creating pipelines. The core mechanism in the package is too instantiate a pipeline with data and then pipeline required functions to the object itself.
+The package stays true to the *tidyverse* pipelining style which also fits nicely into the idea of creating pipelines. The core mechanism in the package is to instantiate a pipeline with data and then pipeline required functions to the object itself.
The package allows the use of either the *magrittr* pipe **(%>%)** or the *pipeR* pipe **(%>>%)**.
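A minimal sketch of these semantics, assuming the *AnalysisPipeline* constructor takes the input dataframe through an `input` argument (as suggested by the `getInput` accessor used later in this README):

```r
library(analysisPipelines)
library(pipeR)  # provides %>>%

# Instantiate a pipeline with a dataframe, then pipe functions onto the object
obj <- AnalysisPipeline(input = iris)

# Retrieve and inspect the data the pipeline was instantiated with
obj %>>% getInput %>>% str
```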
## Supported engines
-As of this version, the package supports functions executed on *R*, or *Spark* through the SparkR interface for batch pipelines. It also supports *Spark Structured Streaming* pipelines for streaming analyses. In subsequent releases, *Python* will also be supported
+As of this version, the package supports functions executed on *R*, or on *Spark* through the SparkR interface, for batch pipelines. It also supports *Apache Spark Structured Streaming* pipelines for streaming analyses. In subsequent releases, *Python* will also be supported.
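Since Spark execution goes through the SparkR interface, a SparkR session needs to exist before Spark functions can be registered or executed. A minimal sketch, assuming `SPARK_HOME` points to a local Spark distribution that bundles the SparkR package:

```r
# Attach SparkR from the Spark distribution and start a local session
library(SparkR, lib.loc = file.path(Sys.getenv("SPARK_HOME"), "R", "lib"))
sparkR.session(master = "local[*]", appName = "analysisPipelinesBatch")
```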
## Available vignettes
-This package contains 5 vignettes:
+This package contains 6 vignettes:
* **Analysis pipelines - Core functionality and working with R data frames and functions** - This is the main vignette describing the package's core functionality, and explaining this through **batch** pipelines in just **R**
* **Analysis pipelines for working with Spark DataFrames for one-time/ batch analyses** - This vignette describes creating **batch** pipelines to execute solely in a *Spark* environment
@@ -84,12 +81,12 @@ obj %>>% getInput %>>% str
getRegistry()
```
-The *getRegistry* function retrieves the set of functions and their metadata available for pipelining. Any *AnalysisPipeline* object comes with a set of pre-registered functions which can be used **out-of-the-box**. Of course, the user can register her own functions, to be used in the pipeline. We will look at this later on in the vignette.
+The *getRegistry* function retrieves the set of functions and their metadata available for pipelining. Any *AnalysisPipeline* object comes with a set of pre-registered functions which can be used **out-of-the-box**. Of course, the user can register her own functions, to be used in the pipeline. We will explore this later on.
There are two types of functions which can be pipelined:
-* **Data functions** - These functions necessarily take their **first** argument as a dataframe. These are functions focused on performing operations on data
-* **Non-data functions** - These are auxiliary helper functions which are required in a pipeline, which do not operate on data.
+* **Data functions** - These functions necessarily take their **first** argument as a dataframe. These are functions focused on performing operations on data. Specifically, the nomenclature *data functions* is used for those functions which work on the input dataframe set to the pipeline object, and perform some transformation or analysis on it. They help form the main *path* in a pipeline, constituting a linear flow from the input.
+* **Non-data functions** - These are auxiliary helper functions which are required in a pipeline, and which may or may not operate on data. However, the *key* difference is that these functions do not operate on the **input (or some direct transformation of it)**. In essence, they help form auxiliary paths in the pipeline, which eventually merge into the main path (a short sketch contrasting the two types follows this list).
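To make the distinction concrete, here is a small sketch with two hypothetical user functions (neither is part of the package's pre-registered set):

```r
# A data function: its FIRST argument is a dataframe, so it sits on the
# main path of the pipeline and works on the input (or a transformation of it)
getTopRecords <- function(dataset, n = 5) {
  head(dataset, n)
}

# A non-data function: it does not take the pipeline's input dataframe;
# its output feeds an auxiliary path that later merges into the main path
generateColourPalette <- function(numColours) {
  grDevices::rainbow(numColours)
}
```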
Both pre-registered and user-defined functions work with the *AnalysisPipeline* object in the same way i.e. regardless of who writes the function, they follow the same semantics.
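As a hedged end-to-end sketch of that workflow: the `registerFunction` and `generateOutput` calls below follow the package's documented batch workflow, but their exact argument lists (and the constructor's `input` argument) should be verified against the current reference manual.

```r
library(analysisPipelines)
library(pipeR)

# A user-defined data function: first argument is the dataframe
summariseColumn <- function(dataset, columnName) {
  summary(dataset[[columnName]])
}

# Register it so it can be pipelined like the pre-registered functions
# (shown with minimal arguments; the full signature may take more metadata)
registerFunction(functionName = "summariseColumn")

# Compose the pipeline; piping only defines the steps
obj <- AnalysisPipeline(input = iris) %>>% summariseColumn(columnName = "Sepal.Length")

# Execution is assumed to be triggered separately via generateOutput()
obj <- generateOutput(obj)
```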
vignettes/Analysis_pipelines_for_working_with_R_dataframes.Rmd
8 additions & 11 deletions
@@ -16,16 +16,13 @@ vignette: >
In a typical data science workflow there are multiple steps involved, from data aggregation, cleaning, exploratory analysis, modeling and so on. As the data science community matures, we are seeing that there are a variety of languages which provide better capabilities for specific steps in the data science workflow. *R* is typically used for data transformations, statistical models, and visualizations, while *Python* provides more robust functions for machine learning. In addition to this, *Spark* provides an environment to process high volume data - either as one-time/ batch or as streams.
-The job of today's data scientist is changing from one where they are married to a specific tool or language, to one where they are using all these tools for their specialized purposes. The key problem then becomes one of translation between these tools for seamless analysis.
+The job of today's data scientist is changing from one where they are married to a specific tool or language, to one where they are using all these tools for their specialized purposes. The key problem then becomes one of translation between these tools for seamless analysis. Additionally, a data scientist often needs to perform the same task repeatedly, as well as to put certain analysis flows (or pipelines) into production to work on new data periodically, or on streaming data.
-Recently in the data science community, interfaces for using these various tools have been published. In terms of R packages, the *reticulate* package provides an interface to Python, and the *SparkR* and *sparklyr* packages provide an interface to Spark.
+Recently, interfaces for using these various tools have been published. In terms of R packages, the *reticulate* package provides an interface to Python, and the *SparkR* and *sparklyr* packages provide an interface to Spark.
-The *analysisPipelines* package uses these interfaces to enable *Interoperable Pipelines* i.e. the ability define an execute a reusable data science pipeline which can contain functions to be executed in an R environment, in a Python environment or in a Spark environment. These pipelines can saved and loaded, to enable batch operation as datasets get updated with new data.
-
-Additionally, in the work of a data scientist, there is a need to perform the same task repeatedly, as well as put certain analysis flows (or) pipelines into production to work on new data periodically, or work on streaming data.
-
-The goal of the *analysisPipelines* package is to enable data scientists to compose pipelines of analysis which consist of data manipulation, exploratory analysis & reporting, as well as modeling steps. It also aims to enable data scientists to use tools of their choice through an *R* interface, and compose **interoperable** pipelines between *R, Spark, and Python.*
+The *analysisPipelines* package uses these interfaces to enable *Interoperable Pipelines*, i.e. the ability to compose and execute a reusable data science pipeline which can contain functions to be executed in an *R* environment, in a *Python* environment or in a *Spark* environment. These pipelines can be saved and loaded, to enable batch operation as datasets get updated with new data.
+The goal of the *analysisPipelines* package is to make the job of the data scientist easier and help them compose pipelines of analysis which consist of data manipulation, exploratory analysis & reporting, as well as modeling steps. The idea is for data scientists to use tools of their choice through an *R* interface provided by this package.
Essentially, it allows data scientists to:
* Compose **reusable, interoperable** pipelines in a flexible manner
@@ -35,11 +32,11 @@ Essentially, it allows data scientists to:
## Types of pipelines
-This package supports for both **batch/ repeated** pipelines, as well as *streaming pipelines.*
+This package supports both *batch/ repeated* pipelines, as well as *streaming pipelines.*
For *batch* pipelines, the vision is to enable interoperable pipelines which execute efficiently with functions in *R*, *Spark* and *Python*.
-For *streaming* pipelines, the package allows for streaming analyses through *Spark Structured Streaming.*
+For *streaming* pipelines, the package allows for streaming analyses through *Apache Spark Structured Streaming.*
## Classes and implementation
@@ -95,8 +92,8 @@ The *getRegistry* function retrieves the set of functions and their metadata ava
There are two types of functions which can be pipelined:
-* **Data functions** - These functions necessarily take their **first** argument as a dataframe. These are functions focused on performing operations on data
-* **Non-data functions** - These are auxiliary helper functions which are required in a pipeline, which do not operate on data.
+* **Data functions** - These functions necessarily take their **first** argument as a dataframe. These are functions focused on performing operations on data. Specifically, the nomenclature *data functions* is used for those functions which work on the input dataframe set to the pipeline object, and perform some transformation or analysis on it. They help form the main *path* in a pipeline, constituting a linear flow from the input.
+* **Non-data functions** - These are auxiliary helper functions which are required in a pipeline, and which may or may not operate on data. However, the *key* difference is that these functions do not operate on the **input (or some direct transformation of it)**. In essence, they help form auxiliary paths in the pipeline, which eventually merge into the main path.
Both pre-registered and user-defined functions work with the *AnalysisPipeline* object in the same way i.e. regardless of who writes the function, they follow the same semantics.