Translates Data Processing Language (DPL) commands to Apache Spark actions and transformations. Uses ANTLR visitors to generate a list of step objects, which contain the actual implementations of the commands using the Apache Spark API.
-
Translates a string-based DPL command using the parse tree generated by the PTH_03 ANTLR-based parser to Apache Spark actions and transformations.
-
Fetch data from a datasource provider (by default, PTH_06 datasource provider) and filter the data with the filters specified in the DPL command.
-
Apply various transformations and actions to the data with simple, easy-to-understand commands.
-
Supports parallel and sequential modes depending on which commands are used. If a command requires batch-based processing, sequential mode is used; otherwise processing remains in parallel mode, allowing stream processing.
-
Spark API implementations are enclosed in so-called Step objects, which take a Dataset as input and return the transformed dataset as the return value, allowing for easy reusability of these objects.
-
ANTLR-based visitor functions purely gather all the necessary parameters for these objects, not containing any implementation logic of the commands themselves.
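The split described above (visitors gather parameters, Step objects hold the implementation, and every step maps a dataset to a dataset) can be illustrated with a minimal, Spark-free sketch. Note that the `Step` interface, method name, and `List<String>` stand-in for `Dataset<Row>` below are illustrative assumptions, not the project's actual API:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;

public class StepSketch {

    // Illustrative stand-in for the project's Step objects (the name and
    // signature are assumptions, not the actual PTH_10 API): each step
    // takes a dataset as input and returns the transformed dataset.
    // List<String> is a placeholder for Spark's Dataset<Row>.
    interface Step {
        List<String> get(List<String> dataset);
    }

    // Because every step maps dataset -> dataset, steps compose into a
    // pipeline, which is what makes them easy to reuse.
    static List<String> runPipeline(List<String> input) {
        // A visitor function would only collect parameters (like this
        // prefix) from the parse tree; the transformation logic itself
        // lives inside the step.
        String prefix = "error";
        Step filterStep = ds -> ds.stream()
                .filter(s -> s.startsWith(prefix))
                .collect(Collectors.toList());
        Step upperStep = ds -> ds.stream()
                .map(s -> s.toUpperCase(Locale.ROOT))
                .collect(Collectors.toList());
        return upperStep.get(filterStep.get(input));
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("error one", "info two", "error three");
        System.out.println(runPipeline(input)); // [ERROR ONE, ERROR THREE]
    }
}
```

In the real project the same composition happens over Spark Datasets, so each step's output can feed the next step's input unchanged.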
See the official documentation on docs.teragrep.com.
Use:
-
Create a new DPLParserCatalystContext. It requires a SparkSession object and a com.typesafe.config.Config. The config is usually provided by the Zeppelin component.
DPLParserCatalystContext catCtx = new DPLParserCatalystContext(sparkSession, config);
-
Create a new DPLParserCatalystVisitor, in which you set the DPLParserCatalystContext.
DPLParserCatalystVisitor catVisitor = new DPLParserCatalystVisitor(catCtx);
-
Visit the parse tree generated by PTH_03 using the visitor functions with the DPLParserCatalystVisitor.visit() function.
CatalystNode n = (CatalystNode) catVisitor.visit(tree);
-
The result of that function is a CatalystNode. It contains a DataStreamWriter, which can be started with start() to begin the execution.
n.getDataStreamWriter().start();
-
Set the visitor’s Consumer to a function of your liking to view the resulting Dataset or move it to the desired component.
catVisitor.setConsumer((ds, id) -> { ds.show(); });
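Putting the steps above together, the whole flow looks roughly like this. This is a sketch only: it assumes a SparkSession (`sparkSession`), a Config (`config`), and a PTH_03 parse tree (`tree`) already exist in the surrounding project, and it will not compile outside a project that depends on PTH_10:

```java
// Sketch: sparkSession, config, and the PTH_03 parse tree are assumed
// to be provided by the surrounding project (e.g. the Zeppelin component).
DPLParserCatalystContext catCtx = new DPLParserCatalystContext(sparkSession, config);
DPLParserCatalystVisitor catVisitor = new DPLParserCatalystVisitor(catCtx);

// Route the resulting Dataset to the desired component.
catVisitor.setConsumer((ds, id) -> { ds.show(); });

// Visit the parse tree, then start the resulting DataStreamWriter
// and block until the streaming query terminates.
CatalystNode n = (CatalystNode) catVisitor.visit(tree);
n.getDataStreamWriter().start().awaitTermination();
```

Blocking with awaitTermination() is optional; start() alone returns a running StreamingQuery that can be managed separately.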
For a more concrete example, check out the PTH_07 Zeppelin DPL Interpreter project.
Compile:
mvn clean install -Pbuild
You can involve yourself with our project by opening an issue or submitting a pull request.
Contribution requirements:
-
All changes must be accompanied by a new or changed test. If you think testing is not required in your pull request, include a sufficient explanation as to why you think so.
-
Security checks must pass.
-
Pull requests must align with the principles and values of extreme programming.
-
Pull requests must follow the principles of Object Thinking and Elegant Objects (EO).
Read more in our Contributing Guideline.
Contributors must sign Teragrep Contributor License Agreement before a pull request is accepted to organization’s repositories.
You need to submit the CLA only once. After submitting the CLA you can contribute to all Teragrep’s repositories.