v0.0.0.9000: First major beta release of the package.
An overview of the package
- Compose reusable, interoperable pipelines in a flexible manner
- Leverage available utility functions for performing different analytical operations
- Put these pipelines into production in order to execute them repeatedly
- Generate analysis reports by executing these pipelines
Supported engines
As of this version, the package supports batch pipelines with functions executed in R, or on Spark through the SparkR interface. It also supports Spark Structured Streaming pipelines for streaming analyses. Python will be supported in subsequent releases.
Available vignettes
This package contains 5 vignettes:
- Analysis pipelines - Core functionality and working with R data frames and functions - This is the main vignette; it describes the package's core functionality and illustrates it through batch pipelines using only R
- Analysis pipelines for working with Spark DataFrames for one-time/ batch analyses - This vignette describes creating batch pipelines to execute solely in a Spark environment
- Interoperable analysis pipelines - This vignette describes creating and executing batch pipelines which are composed of functions executing across supported engines
- Streaming Analysis Pipelines for working with Apache Spark Structured Streaming - This vignette describes setting up streaming pipelines on Apache Spark Structured Streaming
- Using pipelines inside Shiny widgets or apps - A brief vignette which illustrates an example of using a pipeline inside a Shiny widget with reactive elements and changing data
Key features
User-defined functions
You can register your own data or non-data functions; this adds the user-defined function to the registry. The registry is maintained by the package, and once a function is registered, it can be used across pipeline objects.
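As a minimal sketch (the custom function below and the call to getRegistry are illustrative, and it is assumed that registerFunction is the registration API with optional arguments omitted), registering a user-defined data function might look like this:

```r
library(analysisPipelines)

# A user-defined data function: its first argument is the dataset that the
# pipeline operates on (the function body here is purely illustrative)
getTopRows <- function(dataset, n = 10) {
  head(dataset, n)
}

# Add the function to the registry maintained by the package; once registered,
# it can be used across pipeline objects
registerFunction(functionName = "getTopRows")

# Inspect the registry of available functions (assumed accessor)
getRegistry()
```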
Complex pipelines and formula semantics
In addition to simple linear pipelines, more complex pipelines can also be defined. There are cases when the outputs of previous functions in the pipeline need to be passed as inputs to arbitrary parameters of subsequent functions.
The package defines certain formula semantics to accomplish this. To illustrate how this works, we take the example of two simple user-defined functions: one which simply returns the color to be used for a plot, and another which returns the column on which the plot should be generated.
Preceding outputs can be passed to subsequent functions by specifying a formula of the form ~fid against the argument to which the output is to be passed, where id is the ID of the function in the pipeline. For example, to pass the output of the function with ID '1' as an argument to a parameter of a subsequent function, the formula ~f1 is passed to that argument.
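As a sketch of what these two functions might look like (their bodies, and the registration arguments shown, are illustrative assumptions), they are defined and registered before being used:

```r
# Two trivial user-defined functions: one returns the plot color, the other
# the column name to plot (bodies are illustrative)
getColor <- function(color) {
  return(color)
}

getColumnName <- function(columnName) {
  return(columnName)
}

# Register them as non-data functions (argument names beyond the function
# name are assumptions for illustration)
registerFunction(functionName = "getColor", isDataFunction = FALSE,
                 firstArgClass = "character")
registerFunction(functionName = "getColumnName", isDataFunction = FALSE,
                 firstArgClass = "character")
```

In the pipeline below, ~f1 passes the output of getColor (function ID '1') to the priColor arguments, and ~f2 passes the output of getColumnName (function ID '2') to the columnName argument: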
```r
obj %>>% getColor(color = "blue") %>>% getColumnName(columnName = "Occupancy") %>>%
  univarCatDistPlots(uniCol = "building_type", priColor = ~f1,
                     optionalPlots = 0, storeOutput = TRUE) %>>%
  outlierPlot(method = "iqr", columnName = ~f2, cutoffValue = 0.01,
              priColor = ~f1, optionalPlots = 0) -> complexPipeline

complexPipeline %>>% getPipeline
complexPipeline %>>% generateOutput -> op
op %>>% getOutputById("4")
```
Interoperable pipelines
Interoperable pipelines containing functions operating on different engines such as R, Spark and Python can be configured and executed through the analysisPipelines package. Currently, the package supports interoperable pipelines containing R and Spark batch functions.
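For illustration, here is a hedged sketch of composing such a pipeline: summarizeOnSpark and plotInR are hypothetical functions assumed to be already registered against the Spark and R engines respectively, and the AnalysisPipeline constructor call is an assumption.

```r
library(analysisPipelines)

# 'summarizeOnSpark' (a SparkR aggregation) and 'plotInR' (an R plotting
# function) are hypothetical, assumed to be registered against the 'spark'
# and 'r' engines respectively
pipelineObj <- AnalysisPipeline(input = iris)

pipelineObj %>>%
  summarizeOnSpark(groupCol = "Species") %>>%   # executes on Spark
  plotInR(storeOutput = TRUE) %>>%              # executes in R
  generateOutput -> op

op %>>% getOutputById("2")
```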
Pipeline visualization
Pipelines can be visualized as directed graphs, providing information about the engines being used, function dependencies and so on.
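For example, assuming visualizePipeline is the exported visualization function, the complex pipeline defined earlier can be rendered as a graph:

```r
# Render the pipeline as a directed graph, annotated with engines and
# function dependencies (visualizePipeline is assumed to be the exported function)
complexPipeline %>>% visualizePipeline
```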
Report generation
Outputs generated from pipelines can easily be exported to formatted reports, showcasing the results, the pipeline that generated them, as well as a peek at the data.
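A minimal sketch, assuming generateReport is the exported report function and that it accepts the output object and a target path:

```r
# Export the stored outputs, the pipeline definition and a data preview to a
# formatted report (the 'path' argument name is an assumption)
op %>>% generateReport(path = ".")
```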
Efficient execution
The 'analysisPipelines' package internally converts the pipeline defined by the user into a directed graph, which captures the dependencies of each function in the pipeline on data, other arguments, as well as the outputs of other functions.
- Topological sort and ordering - When outputs need to be generated, the pipeline is first prepped by performing a topological sort of the directed graph, identifying sets (or batches) of independent functions and a sequence of batches for execution (see the sketch after this list). A later release of the package will allow for parallel execution of these independent functions
- Memory management & garbage collection - Memory is managed efficiently by storing only those outputs which the user has explicitly specified, and by keeping intermediate outputs required by subsequent functions only until they have been consumed. Garbage collection is performed after the execution of each batch in order to manage memory effectively.
- Type conversions - In the case of interoperable pipelines executing across multiple engines such as R, Spark and Python, conversions between data types in the different engines are minimized by identifying the optimal number of type conversions before execution starts
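To make the batching idea concrete, here is a toy, self-contained sketch of a topological sort (Kahn-style) over a small dependency list; this is purely illustrative and not the package's internal implementation:

```r
# Toy dependency graph: f3 depends on f1 and f2; f4 depends on f1 and f3
deps <- list(f1 = character(0), f2 = character(0),
             f3 = c("f1", "f2"), f4 = c("f1", "f3"))

topoBatches <- function(deps) {
  batches <- list()
  done <- character(0)
  remaining <- deps
  while (length(remaining) > 0) {
    # A function is ready once all of its dependencies have executed
    ready <- names(remaining)[vapply(remaining,
                                     function(d) all(d %in% done), logical(1))]
    if (length(ready) == 0) stop("Cycle detected in pipeline dependencies")
    batches[[length(batches) + 1]] <- ready  # independent; could run in parallel
    done <- c(done, ready)
    remaining <- remaining[setdiff(names(remaining), ready)]
  }
  batches
}

topoBatches(deps)
# Batch 1: f1, f2 (independent)  |  Batch 2: f3  |  Batch 3: f4
```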
Logging & Execution times
The package provides logging capabilities for pipeline execution. By default, logs are written to the console, but the user can instead specify an output file to which the logs should be written, through the setLoggerDetails function.
Logs capture errors, as well as provide information on the steps being performed, execution times and so on.
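For example (a hedged sketch: whether setLoggerDetails operates on a pipeline object, and its argument names, are assumptions for illustration), logs could be redirected to a file like this:

```r
# Redirect execution logs from the console to a file
complexPipeline %>>% setLoggerDetails(target = "file",
                                      targetFile = "pipelineExecution.log") -> complexPipeline
```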
Custom exception-handling
By default, when a function is registered, a generic exception-handling function, which captures the R error message in case of an error, is registered against it in the registry. The user can instead define a custom exception-handling function and provide it at the time of registration.
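A minimal sketch of providing a custom handler for the hypothetical getTopRows function from the earlier example; the handler body, the 'exceptionFunction' argument name, and passing the handler by name are assumptions for illustration:

```r
# A custom exception handler: report the failing step's error message and re-raise
myExceptionHandler <- function(error) {
  message("Pipeline step failed: ", conditionMessage(error))
  stop(error)
}

# Provide the handler when registering the function (argument name assumed)
registerFunction(functionName = "getTopRows",
                 exceptionFunction = "myExceptionHandler")
```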
Upcoming features
- Support for interoperable pipelines with Python
- More utilities from the foundation bricks contributed by the Foundation Bricks team!
- More execution and core functionality improvements