Spark application for streaming transactional transformations between datasets

kamu-data/kamu-engine-spark

Apache Spark Engine

This is the implementation of the Engine contract of Open Data Fabric using the Apache Spark data processing framework. It is currently used by the kamu-cli data management tool.

Features

  • The Spark engine currently provides the richest SQL dialect for map/filter-style transformations
  • Integrates GeoSpark to provide geo-spatial SQL functions
  • It is used by kamu-cli to ingest data into Parquet
  • It is used by kamu-cli, together with Apache Livy, to provide SQL query functionality in Jupyter notebooks

Known Issues

  • Takes a long time to start up, which hurts the user experience
  • Does not support temporal table joins
    • You might be better off using the Flink-based engine for joining and aggregating event streams
  • TODO

Developing

See the Developer Guide