Forked from teragrep/pth_10.
PTH_10: DPL to Apache Spark Translator

Translates Data Processing Language (DPL) commands into Apache Spark actions and transformations. ANTLR visitors generate a list of Step objects, which contain the actual implementations of the commands built on the Apache Spark API.

Features

  • Translates a string-based DPL command, using the parse tree generated by the ANTLR-based PTH_03 parser, into Apache Spark actions and transformations.

  • Fetches data from a datasource provider (PTH_06 by default) and filters it with the filters specified in the DPL command.

  • Applies various transformations and actions to the data through simple, easy-to-understand commands.

  • Supports parallel and sequential modes, chosen based on which kinds of commands are used. If a command requires batch-based processing, sequential mode is used; otherwise, processing remains in parallel mode, allowing stream processing.

  • Spark API implementations are enclosed in so-called Step objects, which take a Dataset as input and return the transformed Dataset, making these objects easy to reuse.

  • ANTLR-based visitor functions purely gather the necessary parameters for these objects and contain no implementation logic for the commands themselves.
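The design above can be sketched in plain Java. Note that `Step` and `Pipeline` below are illustrative stand-ins, not the project's actual classes: in pth_10 a step operates on a Spark `Dataset`, whereas here a dataset is modeled as a simple list of rows.

```java
import java.util.List;
import java.util.function.UnaryOperator;

// Hypothetical stand-in: in pth_10 a Step takes an org.apache.spark.sql.Dataset
// and returns the transformed Dataset; here a "dataset" is just a list of rows.
interface Step extends UnaryOperator<List<String>> {}

public class Pipeline {
    // Applies each step in order, mirroring how the translator chains Step objects.
    public static List<String> run(List<String> data, List<Step> steps) {
        for (Step step : steps) {
            data = step.apply(data);
        }
        return data;
    }

    public static void main(String[] args) {
        // Two illustrative steps: keep rows containing "error", then uppercase them.
        Step filter = rows -> rows.stream().filter(r -> r.contains("error")).toList();
        Step upper  = rows -> rows.stream().map(String::toUpperCase).toList();

        List<String> out = run(
                List.of("error: disk full", "ok", "error: timeout"),
                List.of(filter, upper));
        System.out.println(out); // [ERROR: DISK FULL, ERROR: TIMEOUT]
    }
}
```

Because each step is a self-contained dataset-to-dataset function, steps can be reused and recombined freely; the visitor's only job is to construct this list in command order.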

Documentation

See the official documentation at docs.teragrep.com.

Limitations

Not all commands in the Data Processing Language are yet implemented.

How to

Use:

  • Create a new DPLParserCatalystContext. It requires a SparkSession object and a com.typesafe.config.Config. The config is usually provided by the Zeppelin component.

DPLParserCatalystContext catCtx = new DPLParserCatalystContext(sparkSession, config);
  • Create a new DPLParserCatalystVisitor, passing it the DPLParserCatalystContext.

DPLParserCatalystVisitor catVisitor = new DPLParserCatalystVisitor(catCtx);
  • Visit the parse tree generated by PTH_03 with the DPLParserCatalystVisitor.visit() function.

CatalystNode n = (CatalystNode) catVisitor.visit(tree);
  • The result of that call is a CatalystNode. It contains a DataStreamWriter, which can be started to begin execution.

n.getDataStreamWriter();
  • Set the visitor’s Consumer to a function of your liking to view the resulting Dataset or move it to the desired component.

catVisitor.setConsumer((ds, id) -> {
    ds.show();
});

For a more concrete example, check out the PTH_07 Zeppelin DPL Interpreter project.

Compile:

mvn clean install -Pbuild

Contributing

You can involve yourself with our project by opening an issue or submitting a pull request.

Contribution requirements:

  1. All changes must be accompanied by a new or changed test. If you think testing is not required for your pull request, include a sufficient explanation as to why.

  2. Security checks must pass.

  3. Pull requests must align with the principles and values of extreme programming.

  4. Pull requests must follow the principles of Object Thinking and Elegant Objects (EO).

Read more in our Contributing Guideline.

Contributor License Agreement

Contributors must sign the Teragrep Contributor License Agreement before a pull request is accepted into the organization’s repositories.

You only need to submit the CLA once. After submitting it, you can contribute to all of Teragrep’s repositories.
