Coming Soon: Open Source Visual Interface for Apache Spark Declarative Pipelines
Get Notified About the Release • Documentation
Ilum is open-sourcing the Spark Declarative Pipelines UI!
We're excited to announce that we'll be releasing a powerful, open-source visual interface for building and managing Apache Spark Declarative Pipelines (SDP). This tool will make it easier than ever to design, visualize, and deploy declarative data pipelines on Spark 4.1+.
Want to be the first to know when we launch? Register here to get notified about the official release and early access opportunities.
Spark Declarative Pipelines (SDP) is a revolutionary framework for building reliable, maintainable, and testable data pipelines on Apache Spark. Donated by Databricks to the Apache Spark open-source project in June 2025, it represents the evolution of Delta Live Tables into a vendor-neutral, community-driven standard.
Traditional Imperative Approach:

```python
from pyspark.sql.functions import sum

df_sales = spark.read.format("csv").load("s3://raw-data/sales.csv")
df_products = spark.read.format("json").load("s3://raw-data/products.json")
df_joined = df_sales.join(df_products, "product_id")
df_aggregated = df_joined.groupBy("product_category").agg(sum("amount").alias("total_sales"))
df_aggregated.write.format("delta").mode("overwrite").save("s3://curated-data/product_sales_summary")
```

Declarative Approach with SDP:
```python
from pyspark import pipelines as dp
from pyspark.sql.functions import sum

# SDP infers the dependency on the "sales" and "products" datasets from the
# table references below; no explicit orchestration code is needed.
@dp.materialized_view
def product_sales_summary():
    sales = spark.table("sales")
    products = spark.table("products")
    return (sales.join(products, "product_id")
                 .groupBy("product_category")
                 .agg(sum("amount").alias("total_sales")))
```

SDP automatically analyzes dependencies between datasets and orchestrates the execution order with maximum parallelism. No manual DAG definition is required.
- Faster Development: Build pipelines up to 90% faster by eliminating boilerplate code for checkpoint management, incremental processing, and error handling.
- Unified Batch and Streaming: A single API for both batch and streaming workloads; toggle between processing modes with minimal code changes (see the sketch after this list).
- Built-in Fault Tolerance: Automatic checkpointing, state management, and multi-level retry logic (task → flow → pipeline) for transient failures.
- Incremental Processing: Only new or changed data is processed, avoiding expensive full table scans.
- Declarative by Design: Define what your pipeline should produce, not how to execute it; Spark handles orchestration, dependency management, and optimization automatically.
- No External Orchestrator: Unlike traditional workflows that require Apache Airflow or similar tools, SDP manages task dependencies internally.
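To illustrate the unified batch and streaming point, here is a minimal sketch of a streaming table feeding a downstream materialized view through the same decorator-based API. It assumes the Spark 4.1 `pyspark.pipelines` module exposes an `@dp.table` decorator for streaming tables (only `@dp.materialized_view` is confirmed by the snippet above), that `spark` is the session available to pipeline source files, and that the path and column names are placeholders.

```python
from pyspark import pipelines as dp
from pyspark.sql.functions import sum as sum_

# Streaming table: incrementally ingests new event files as they arrive.
# (@dp.table as the streaming-table decorator is an assumption about the
#  Spark 4.1 API; the path and schema are illustrative placeholders.)
@dp.table
def raw_events():
    return (spark.readStream
                 .format("json")
                 .load("s3://raw-data/events/"))

# Materialized view: a batch-style aggregate over the streaming table above.
# SDP sees the table reference, orders the two datasets correctly, and can
# refresh the aggregate incrementally instead of recomputing from scratch.
@dp.materialized_view
def revenue_per_category():
    return (spark.table("raw_events")
                 .groupBy("category")
                 .agg(sum_("amount").alias("total_revenue")))
```

The same definitions are meant to serve both processing modes, which is what "toggle between processing modes with minimal code changes" refers to above.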
The Spark Declarative Pipelines UI will bring visual development and management capabilities to SDP, making it accessible to a broader audience. Expected features include:
- Visual Pipeline Designer: Drag-and-drop interface for building data pipelines
- Dependency Graph Visualization: Interactive DAG view showing data flow and dependencies
- Code Generation: Automatically generate Python and SQL pipeline definitions
- Pipeline Configuration: Visual editor for `pipeline.yml` specifications
- Real-time Monitoring: Track pipeline execution, flow status, and performance metrics
- Debug Tools: Inspect checkpoints, view logs, and troubleshoot issues
- Template Library: Pre-built pipeline templates for common use cases
- Access Control: Manage permissions and collaboration features
The Spark Declarative Pipelines UI is designed for:
- Data Engineers building production ETL workflows on Apache Spark
- Data Analysts who want to create data pipelines without deep Spark expertise
- Platform Teams providing self-service data infrastructure
- Organizations migrating from proprietary platforms to open-source solutions
- Teams seeking vendor-neutral alternatives to cloud-specific tools
Core Concepts:

Flows: the foundational data processing unit, supporting both streaming and batch semantics. Flows read data from sources, apply transformations, and write results to target datasets. The datasets a pipeline defines come in three forms:
- Streaming Tables: Incremental processing of streaming data (Kafka, Kinesis, cloud storage)
- Materialized Views: Precomputed batch tables with incremental refresh capabilities
- Temporary Views: Scoped to pipeline execution, useful for reusable transformation logic
Pipelines: the primary unit of development, containing one or more flows, streaming tables, and materialized views. SDP automatically analyzes the dependencies between them and orchestrates execution.

Dependency Graph: the automatically constructed graph of data dependencies, enabling optimization, parallelism, fault tolerance, and transparency.
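As a concrete illustration of these concepts, the sketch below defines a temporary view holding reusable cleansing logic and a materialized view that consumes it. The `@dp.temporary_view` decorator name is an assumption (only `@dp.materialized_view` is confirmed by the snippet earlier in this document), and the table and column names are illustrative.

```python
from pyspark import pipelines as dp
from pyspark.sql.functions import col

# Temporary view: scoped to the pipeline run; keeps cleansing logic reusable
# without persisting an intermediate table.
# (@dp.temporary_view is an assumed decorator name; orders/order_id/amount
#  are placeholder names.)
@dp.temporary_view
def cleaned_orders():
    return (spark.table("orders")
                 .where(col("amount") > 0)
                 .dropDuplicates(["order_id"]))

# Materialized view consuming the temporary view. The reference to
# "cleaned_orders" is what adds an edge to the dependency graph, so SDP
# schedules the cleansing step first without any manual orchestration.
@dp.materialized_view
def daily_order_totals():
    return (spark.table("cleaned_orders")
                 .groupBy("order_date")
                 .agg({"amount": "sum"}))
```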
- Apache Spark 4.1+: Built on Spark Declarative Pipelines framework
- Spark Connect: Leverages the Spark Connect protocol for remote execution (see the connection sketch after this list)
- Python & SQL: Full support for both Python and SQL pipeline definitions
- Open Source: Fully open-source with community-driven development
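Because the UI builds on Spark Connect, a client only needs a reachable Spark Connect endpoint. The sketch below shows the standard PySpark way to attach to one; the host name is a placeholder and 15002 is the default Spark Connect port.

```python
from pyspark.sql import SparkSession

# Attach to a remote Spark cluster over the Spark Connect protocol.
# Replace the endpoint with your own Spark Connect server.
spark = (SparkSession.builder
         .remote("sc://spark-connect.example.com:15002")
         .getOrCreate())

# Any DataFrame work issued through this session executes on the remote
# cluster rather than in the local client process.
spark.range(5).selectExpr("id", "id * 10 AS scaled").show()
```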
- Get Notified About the Release - Register for launch updates
- Ilum Documentation - Learn more about our platform
- Apache Spark - Official Spark documentation
- Spark Declarative Pipelines Guide - SDP documentation (coming with the Spark 4.1 release)
This project will be open source and we welcome contributions! Once released, you'll be able to:
- Report bugs and request features
- Submit pull requests with improvements
- Help with documentation
- Share pipeline templates and examples
Register for updates to be notified when contribution guidelines are published.
The Spark Declarative Pipelines UI is currently under active development as a beta feature within Ilum Enterprise Edition. We're working towards an initial release that will include:
- Visual pipeline designer (complete)
- DAG visualization (complete)
- Code generation for Python and SQL (complete)
- Pipeline execution monitoring (complete)
- Data lineage (in progress)
- Docker packaging and Helm chart (in progress)
- Advanced debugging tools (in progress)
- Template library (planned)
- Multi-user collaboration features (planned)
Stay informed about our progress - register to receive updates on development milestones and release dates.
At Ilum, we believe in the power of open-source collaboration. By open-sourcing the Spark Declarative Pipelines UI, we aim to:
- Accelerate Adoption: Make declarative pipelines accessible to everyone
- Foster Innovation: Enable community-driven feature development
- Ensure Vendor Neutrality: Provide a truly open alternative to proprietary tools
- Build Together: Create the best possible tool through collective expertise
Example Use Cases:

- Real-Time Analytics: Build streaming pipelines that ingest data from Kafka, enrich it with dimension tables, and produce real-time aggregates (see the sketch after this list).
- Batch ETL: Create batch pipelines that extract data from multiple sources, transform it through layered bronze/silver/gold stages, and load the results into data warehouses.
- Slowly Changing Dimensions: Implement SCD Type 2 with automatic change tracking and historical versioning.
- Data Quality: Define data quality expectations and automatically track violations without failing entire pipelines.
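As a sketch of the real-time analytics use case above, a streaming table ingests events from Kafka and a materialized view enriches them with a product dimension table. The `@dp.table` decorator for streaming tables, the broker address, the topic name, and the event schema are all assumptions used purely for illustration.

```python
from pyspark import pipelines as dp
from pyspark.sql.functions import col, from_json, sum as sum_
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

# Illustrative event schema; adjust to match your actual payloads.
event_schema = StructType([
    StructField("product_id", StringType()),
    StructField("amount", DoubleType()),
])

# Streaming table fed from Kafka. Broker and topic names are placeholders.
@dp.table
def kafka_sales_events():
    raw = (spark.readStream
                .format("kafka")
                .option("kafka.bootstrap.servers", "broker-1:9092")
                .option("subscribe", "sales-events")
                .load())
    return (raw.select(from_json(col("value").cast("string"), event_schema).alias("e"))
               .select("e.*"))

# Materialized view joining the event stream with a dimension table and
# producing per-category totals that the pipeline keeps up to date.
@dp.materialized_view
def sales_by_category():
    events = spark.table("kafka_sales_events")
    products = spark.table("products")
    return (events.join(products, "product_id")
                  .groupBy("product_category")
                  .agg(sum_("amount").alias("total_sales")))
```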
- Website: ilum.cloud
- Documentation: ilum.cloud/docs
- Get Access: ilum.cloud/get-access
This project will be released under an open-source license. Specific license details will be announced with the initial release.
Don't miss the launch!
Register now to be notified when we release.
Made with ❤️ by the Ilum team