Commit 5a6365e

Merge branch 'dev-rpackage' into 'master'
Minor language improvements in the README and main vignette

See merge request eoc_foundation_wip/analysis-pipelines!10

2 parents: a8d3da9 + 07b8e2d · commit 5a6365e

File tree: 2 files changed (+21 / -27 lines)

README.md

Lines changed: 13 additions & 16 deletions
@@ -7,16 +7,13 @@
In a typical data science workflow there are multiple steps involved from data aggregation, cleaning, exploratory analysis, modeling and so on. As the data science community matures, we are seeing that there are a variety of languages which provide better capabilities for specific steps in the data science workflow. *R* is typically used for data transformations, statistical models, and visualizations, while *Python* provides more robust functions for machine learning. In addition to this, *Spark* provides an environment to process high volume data - both as one-time/ batch or as streams.

- The job of today's data scientist is changing from one where they are married to a specific tool or language, to one where they are using all these tools for their specialized purposes. The key problem then becomes one of translation between these tools for seamless analysis.
+ The job of today's data scientist is changing from one where they are married to a specific tool or language, to one where they are using all these tools for their specialized purposes. The key problem then becomes one of translation between these tools for seamless analysis. Additionally, in the work of a data scientist, there is a need to perform the same task repeatedly, as well as to put certain analysis flows or pipelines into production to work on new data periodically, or to work on streaming data.

- Recently in the data science community, interfaces for using these various tools have been published. In terms of R packages, the *reticulate* package provides an interface to Python, and the *SparkR* and *sparklyr* packages provide an interface to Spark.
+ Recently, interfaces for using these various tools have been published. In terms of R packages, the *reticulate* package provides an interface to Python, and the *SparkR* and *sparklyr* packages provide an interface to Spark.

- The *analysisPipelines* package uses these interfaces to enable *Interoperable Pipelines* i.e. the ability define an execute a reusable data science pipeline which can contain functions to be executed in an R environment, in a Python environment or in a Spark environment. These pipelines can saved and loaded, to enable batch operation as datasets get updated with new data.
- Additionally, in the work of a data scientist, there is a need to perform the same task repeatedly, as well as put certain analysis flows (or) pipelines into production to work on new data periodically, or work on streaming data.
- The goal of the *analysisPipelines* package is to enable data scientists to compose pipelines of analysis which consist of data manipulation, exploratory analysis & reporting, as well as modeling steps. It also aims to enable data scientists to use tools of their choice through an *R* interface, and compose **interoperable** pipelines between *R, Spark, and Python.*
+ The *analysisPipelines* package uses these interfaces to enable *Interoperable Pipelines*, i.e. the ability to compose and execute a reusable data science pipeline which can contain functions to be executed in an *R* environment, in a *Python* environment or in a *Spark* environment. These pipelines can be saved and loaded, to enable batch operation as datasets get updated with new data.

+ The goal of the *analysisPipelines* package is to make the job of the data scientist easier and help them compose pipelines of analysis which consist of data manipulation, exploratory analysis & reporting, as well as modeling steps. The idea is for data scientists to use tools of their choice through an *R* interface, using this package.
Essentially, it allows data scientists to:

* Compose **reusable, interoperable** pipelines in a flexible manner
@@ -26,29 +23,29 @@ Essentially, it allows data scientists to:
## Types of pipelines

- This package supports for both **batch/ repeated** pipelines, as well as *streaming pipelines.*
+ This package supports both *batch/ repeated* pipelines as well as *streaming pipelines.*

For *batch* pipelines, the vision is to enable interoperable pipelines which execute efficiently with functions in *R*, *Spark* and *Python*.

- For *streaming* pipelines, the package allows for streaming analyses through *Spark Structured Streaming.*
+ For *streaming* pipelines, the package allows for streaming analyses through *Apache Spark Structured Streaming.*

## Classes and implementation

- The *analysisPipelines* package uses S4 classes and methods to implement all the core functionality. The fundamental class exposed in this package is the *BaseAnalysisPipeline* class on which most of the core functions are implemented. The user, however, interacts with the *AnalysisPipeline* and *StreamingAnalysisPipeline* classes for batch and streaming analysis respectively. In this vignette, we work with the *AnalysisPipeline* class, with functions solely in R.
+ The *analysisPipelines* package uses S4 classes and methods to implement all the core functionality. The fundamental class exposed in this package is the *BaseAnalysisPipeline* class on which most of the core functions are implemented. The user, however, interacts with the *AnalysisPipeline* and *StreamingAnalysisPipeline* classes for batch and streaming analysis respectively.

## Pipelining semantics

- The package stays true to the *tidyverse* pipelining style which also fits nicely into the idea of creating pipelines. The core mechanism in the package is too instantiate a pipeline with data and then pipeline required functions to the object itself.
+ The package stays true to the *tidyverse* pipelining style, which also fits nicely into the idea of creating pipelines. The core mechanism in the package is to instantiate a pipeline with data and then pipeline the required functions to the object itself.

The package allows the use of both the *magrittr* pipe **(%>%)** and the *pipeR* pipe **(%>>%)**.

## Supported engines

- As of this version, the package supports functions executed on *R*, or *Spark* through the SparkR interface for batch pipelines. It also supports *Spark Structured Streaming* pipelines for streaming analyses. In subsequent releases, *Python* will also be supported
+ As of this version, the package supports functions executed on *R*, or on *Spark* through the SparkR interface, for batch pipelines. It also supports *Apache Spark Structured Streaming* pipelines for streaming analyses. In subsequent releases, *Python* will also be supported.

## Available vignettes

- This package contains 5 vignettes:
+ This package contains 6 vignettes:

* **Analysis pipelines - Core functionality and working with R data frames and functions** - This is the main vignette describing the package's core functionality, and explaining this through **batch** pipelines in just **R**
* **Analysis pipelines for working with Spark DataFrames for one-time/ batch analyses** - This vignette describes creating **batch** pipelines to execute solely in a *Spark* environment
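To make the pipelining semantics described above concrete, the sketch below composes a small batch pipeline in *R*. Only `getInput`, `getRegistry`, and the two pipe operators are confirmed by this diff; the `AnalysisPipeline` constructor argument and the functions `univarCatDistPlots` and `generateOutput` are assumptions drawn from the package's vignettes and may differ in the installed version.

```r
library(analysisPipelines)
library(pipeR)  # provides the %>>% pipe used below

# Instantiate a pipeline with a dataset; the `input` argument name is an
# assumption about the AnalysisPipeline constructor
obj <- AnalysisPipeline(input = iris)

# Inspect the data held by the object and the registered functions
obj %>>% getInput %>>% str
getRegistry()

# Pipeline a pre-registered EDA function onto the object and execute the
# pipeline; these function names are illustrative and should be checked
# against the getRegistry() output
obj <- obj %>>% univarCatDistPlots(uniCol = "Species") %>>% generateOutput()
```

The same chain could equally be written with the *magrittr* pipe **(%>%)** in place of **(%>>%)**.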
@@ -84,12 +81,12 @@ obj %>>% getInput %>>% str
getRegistry()
```

- The *getRegistry* function retrieves the set of functions and their metadata available for pipelining. Any *AnalysisPipeline* object comes with a set of pre-registered functions which can be used **out-of-the-box**. Of course, the user can register her own functions, to be used in the pipeline. We will look at this later on in the vignette.
+ The *getRegistry* function retrieves the set of functions and their metadata available for pipelining. Any *AnalysisPipeline* object comes with a set of pre-registered functions which can be used **out-of-the-box**. Of course, the user can register her own functions to be used in the pipeline. We will explore this later on.

There are two types of functions which can be pipelined:

- * **Data functions** - These functions necessarily take their **first** argument as a dataframe. These are functions focused on performing operations on data
- * **Non-data functions** - These are auxiliary helper functions which are required in a pipeline, which do not operate on data.
+ * **Data functions** - These functions necessarily take their **first** argument as a dataframe. These are functions focused on performing operations on data. Specifically, the nomenclature *data functions* is used for those functions which work on the input dataframe set to the pipeline object, and perform some transformation or analysis on it. They help form the main *path* in a pipeline, constituting a linear flow from the input.
+ * **Non-data functions** - These are auxiliary helper functions which are required in a pipeline and which may or may not operate on data. However, the *key* difference is that these functions do not operate on the **input (or some direct transformation of it)**. In essence, they help form auxiliary paths in the pipeline, which eventually merge into the main path.

Both pre-registered and user-defined functions work with the *AnalysisPipeline* object in the same way, i.e. regardless of who writes the function, they follow the same semantics.
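Since user-defined functions follow the same semantics as pre-registered ones, registering one is straightforward in principle. The sketch below defines a simple data function; the registration call is shown commented out because the helper name `registerFunction` and its arguments are assumptions based on the package's described design rather than on this diff.

```r
library(analysisPipelines)

# A user-defined data function: its FIRST argument is a dataframe, so it can
# sit on the main path of a pipeline
getTopRecords <- function(df, n = 10) {
  head(df, n)
}

# Hypothetical registration call -- verify the exact helper name and
# signature in the package documentation before use
# registerFunction(functionName = "getTopRecords", isDataFunction = TRUE)
```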

vignettes/Analysis_pipelines_for_working_with_R_dataframes.Rmd

Lines changed: 8 additions & 11 deletions
@@ -16,16 +16,13 @@ vignette: >
In a typical data science workflow there are multiple steps involved from data aggregation, cleaning, exploratory analysis, modeling and so on. As the data science community matures, we are seeing that there are a variety of languages which provide better capabilities for specific steps in the data science workflow. *R* is typically used for data transformations, statistical models, and visualizations, while *Python* provides more robust functions for machine learning. In addition to this, *Spark* provides an environment to process high volume data - both as one-time/ batch or as streams.

- The job of today's data scientist is changing from one where they are married to a specific tool or language, to one where they are using all these tools for their specialized purposes. The key problem then becomes one of translation between these tools for seamless analysis.
+ The job of today's data scientist is changing from one where they are married to a specific tool or language, to one where they are using all these tools for their specialized purposes. The key problem then becomes one of translation between these tools for seamless analysis. Additionally, in the work of a data scientist, there is a need to perform the same task repeatedly, as well as to put certain analysis flows or pipelines into production to work on new data periodically, or to work on streaming data.

- Recently in the data science community, interfaces for using these various tools have been published. In terms of R packages, the *reticulate* package provides an interface to Python, and the *SparkR* and *sparklyr* packages provide an interface to Spark.
+ Recently, interfaces for using these various tools have been published. In terms of R packages, the *reticulate* package provides an interface to Python, and the *SparkR* and *sparklyr* packages provide an interface to Spark.

- The *analysisPipelines* package uses these interfaces to enable *Interoperable Pipelines* i.e. the ability define an execute a reusable data science pipeline which can contain functions to be executed in an R environment, in a Python environment or in a Spark environment. These pipelines can saved and loaded, to enable batch operation as datasets get updated with new data.
- Additionally, in the work of a data scientist, there is a need to perform the same task repeatedly, as well as put certain analysis flows (or) pipelines into production to work on new data periodically, or work on streaming data.
- The goal of the *analysisPipelines* package is to enable data scientists to compose pipelines of analysis which consist of data manipulation, exploratory analysis & reporting, as well as modeling steps. It also aims to enable data scientists to use tools of their choice through an *R* interface, and compose **interoperable** pipelines between *R, Spark, and Python.*
+ The *analysisPipelines* package uses these interfaces to enable *Interoperable Pipelines*, i.e. the ability to compose and execute a reusable data science pipeline which can contain functions to be executed in an *R* environment, in a *Python* environment or in a *Spark* environment. These pipelines can be saved and loaded, to enable batch operation as datasets get updated with new data.

+ The goal of the *analysisPipelines* package is to make the job of the data scientist easier and help them compose pipelines of analysis which consist of data manipulation, exploratory analysis & reporting, as well as modeling steps. The idea is for data scientists to use tools of their choice through an *R* interface, using this package.
Essentially, it allows data scientists to:

* Compose **reusable, interoperable** pipelines in a flexible manner
@@ -35,11 +32,11 @@ Essentially, it allows data scientists to:
## Types of pipelines

- This package supports for both **batch/ repeated** pipelines, as well as *streaming pipelines.*
+ This package supports both *batch/ repeated* pipelines as well as *streaming pipelines.*

For *batch* pipelines, the vision is to enable interoperable pipelines which execute efficiently with functions in *R*, *Spark* and *Python*.

- For *streaming* pipelines, the package allows for streaming analyses through *Spark Structured Streaming.*
+ For *streaming* pipelines, the package allows for streaming analyses through *Apache Spark Structured Streaming.*

## Classes and implementation

@@ -95,8 +92,8 @@ The *getRegistry* function retrieves the set of functions and their metadata ava
There are two types of functions which can be pipelined:

- * **Data functions** - These functions necessarily take their **first** argument as a dataframe. These are functions focused on performing operations on data
- * **Non-data functions** - These are auxiliary helper functions which are required in a pipeline, which do not operate on data.
+ * **Data functions** - These functions necessarily take their **first** argument as a dataframe. These are functions focused on performing operations on data. Specifically, the nomenclature *data functions* is used for those functions which work on the input dataframe set to the pipeline object, and perform some transformation or analysis on it. They help form the main *path* in a pipeline, constituting a linear flow from the input.
+ * **Non-data functions** - These are auxiliary helper functions which are required in a pipeline and which may or may not operate on data. However, the *key* difference is that these functions do not operate on the **input (or some direct transformation of it)**. In essence, they help form auxiliary paths in the pipeline, which eventually merge into the main path.

Both pre-registered and user-defined functions work with the *AnalysisPipeline* object in the same way, i.e. regardless of who writes the function, they follow the same semantics.
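To make the distinction concrete, the plain *R* sketch below shows one function of each kind; the function names are purely illustrative and are not part of the package.

```r
# A data function: the FIRST argument is the dataframe flowing along the
# pipeline's main path, and the result is a transformation of that input
summariseByGroup <- function(df, groupCol) {
  aggregate(df[, sapply(df, is.numeric), drop = FALSE],
            by = list(group = df[[groupCol]]),
            FUN = mean)
}

# A non-data function: an auxiliary helper that never touches the pipeline's
# input dataframe; its output can feed into a data function downstream
makeColourPalette <- function(n) {
  grDevices::hcl(h = seq(15, 375, length.out = n + 1)[seq_len(n)],
                 c = 100, l = 65)
}
```

For example, `summariseByGroup(iris, "Species")` would sit on the main path, while `makeColourPalette(3)` could supply colours to a downstream plotting step on an auxiliary path.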
