
Ilum

Spark Declarative Pipelines UI

Coming Soon: Open Source Visual Interface for Apache Spark Declarative Pipelines

🔔 Get Notified About the Release • 📖 Documentation


🎉 Announcement

Ilum is open-sourcing the Spark Declarative Pipelines UI!

We're excited to announce that we'll be releasing a powerful, open-source visual interface for building and managing Apache Spark Declarative Pipelines (SDP). This tool will make it easier than ever to design, visualize, and deploy declarative data pipelines on Spark 4.1+.

📬 Stay Updated

Want to be the first to know when we launch? Register here to get notified about the official release and early access opportunities.


🚀 What is Spark Declarative Pipelines?

Spark Declarative Pipelines (SDP) is a revolutionary framework for building reliable, maintainable, and testable data pipelines on Apache Spark. Donated by Databricks to the Apache Spark open-source project in June 2025, it represents the evolution of Delta Live Tables into a vendor-neutral, community-driven standard.

The Paradigm Shift: From Imperative to Declarative

Traditional Imperative Approach:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

df_sales = spark.read.format("csv").load("s3://raw-data/sales.csv")
df_products = spark.read.format("json").load("s3://raw-data/products.json")
df_joined = df_sales.join(df_products, "product_id")
df_aggregated = df_joined.groupBy("product_category").agg(F.sum("amount").alias("total_sales"))
df_aggregated.write.format("delta").mode("overwrite").save("s3://curated-data/product_sales_summary")

Declarative Approach with SDP:

from pyspark import pipelines as dp
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.active()

@dp.materialized_view
def product_sales_summary():
    sales = spark.table("sales")
    products = spark.table("products")
    return (sales.join(products, "product_id")
            .groupBy("product_category")
            .agg(F.sum("amount").alias("total_sales")))
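
Note the difference in responsibilities: the declarative version never calls write or save. The function name becomes the target dataset, and the framework decides how to materialize and refresh it. Pipelines defined this way are typically executed with the spark-pipelines command-line runner that ships alongside SDP (for example, spark-pipelines run against a project scaffolded by spark-pipelines init).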

✨ Key Features of Spark Declarative Pipelines

🎯 Automatic Orchestration

SDP automatically analyzes dependencies between datasets and orchestrates execution order with maximum parallelism. No manual DAG definition required.
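
As an illustration (table names are hypothetical, and spark is obtained as in the earlier example), simply referencing clean_sales from daily_totals is enough for SDP to schedule clean_sales first:

from pyspark import pipelines as dp
from pyspark.sql import functions as F

@dp.materialized_view
def clean_sales():
    # Reads an upstream source table and drops incomplete rows.
    return spark.table("raw_sales").dropna(subset=["amount"])

@dp.materialized_view
def daily_totals():
    # Referencing clean_sales creates an edge in the dependency graph,
    # so SDP runs clean_sales before this view, with no manual DAG.
    return (spark.table("clean_sales")
            .groupBy("sale_date")
            .agg(F.sum("amount").alias("total")))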

⚡ Reduced Development Time

Build pipelines up to 90% faster by eliminating boilerplate code for checkpoint management, incremental processing, and error handling.

🔄 Unified Batch and Streaming

Single API for both batch and streaming workloads. Toggle between processing modes with minimal code changes.
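
As a sketch (assuming the pyspark.pipelines decorators behave as described in the SDP documentation, with a hypothetical raw_events source table), the same logic can run as a batch materialized view or as an incrementally processed streaming table, and the main change is how the source is read:

from pyspark import pipelines as dp

# Batch semantics: recomputed or refreshed on each pipeline update.
@dp.materialized_view
def events_batch():
    return spark.read.table("raw_events")

# Streaming semantics: only newly arrived records are processed.
@dp.table
def events_stream():
    return spark.readStream.table("raw_events")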

πŸ›‘οΈ Built-in Fault Tolerance

Automatic checkpointing, state management, and multi-level retry logic (task → flow → pipeline) for transient failures.

📈 Incremental Processing

Automatically processes only new or changed data, avoiding expensive full table scans.

🎨 Declarative Programming

Define what your pipeline should produce, not how to execute it. Spark handles the orchestration, dependency management, and optimization automatically.

🔧 No External Orchestrator Required

Unlike traditional workflows that require Apache Airflow or similar tools, SDP manages task dependencies internally.


🖥️ What Will the UI Provide?

The Spark Declarative Pipelines UI will bring visual development and management capabilities to SDP, making it accessible to a broader audience. Expected features include:

  • πŸ“Š Visual Pipeline Designer: Drag-and-drop interface for building data pipelines
  • πŸ” Dependency Graph Visualization: Interactive DAG view showing data flow and dependencies
  • πŸ“ Code Generation: Automatically generate Python and SQL pipeline definitions
  • βš™οΈ Pipeline Configuration: Visual editor for pipeline.yml specifications
  • πŸ“ˆ Real-time Monitoring: Track pipeline execution, flow status, and performance metrics
  • πŸ› Debug Tools: Inspect checkpoints, view logs, and troubleshoot issues
  • πŸ“š Template Library: Pre-built pipeline templates for common use cases
  • πŸ” Access Control: Manage permissions and collaboration features

🎯 Who Is This For?

The Spark Declarative Pipelines UI is designed for:

  • Data Engineers building production ETL workflows on Apache Spark
  • Data Analysts who want to create data pipelines without deep Spark expertise
  • Platform Teams providing self-service data infrastructure
  • Organizations migrating from proprietary platforms to open-source solutions
  • Teams seeking vendor-neutral alternatives to cloud-specific tools

📖 Core Concepts

Flows

The foundational data processing unit supporting both streaming and batch semantics. Flows read data from sources, apply transformations, and write results to target datasets.
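
A sketch of an explicit flow, assuming the create_streaming_table and append_flow helpers in pyspark.pipelines and a hypothetical orders_raw source:

from pyspark import pipelines as dp

# Target dataset that flows append into.
dp.create_streaming_table("orders")

# A flow: incrementally reads a source and appends to the target table.
@dp.append_flow(target="orders")
def ingest_orders():
    return spark.readStream.table("orders_raw")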

Datasets

  • Streaming Tables: Incremental processing of streaming data (Kafka, Kinesis, cloud storage)
  • Materialized Views: Precomputed batch tables with incremental refresh capabilities
  • Temporary Views: Scoped to pipeline execution, useful for reusable transformation logic

Pipelines

The primary unit of development containing one or more flows, streaming tables, and materialized views. SDP automatically analyzes dependencies and orchestrates execution.
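
Putting these pieces together, a single pipeline source file can mix dataset types; a short sketch (dataset names are hypothetical):

from pyspark import pipelines as dp
from pyspark.sql import functions as F

# Temporary view: reusable logic, scoped to the pipeline run.
@dp.temporary_view
def valid_orders():
    return spark.table("orders").where("status = 'complete'")

# Materialized view built on the temporary view; SDP wires up the dependency.
@dp.materialized_view
def revenue_by_day():
    return (spark.table("valid_orders")
            .groupBy("order_date")
            .agg(F.sum("amount").alias("revenue")))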

Dependency Graph (DAG)

Automatically constructed graph representing data dependencies, enabling optimization, parallelism, fault tolerance, and transparency.


🛠️ Technology Stack

  • Apache Spark 4.1+: Built on Spark Declarative Pipelines framework
  • Spark Connect: Leverages Spark Connect protocol for remote execution
  • Python & SQL: Full support for both Python and SQL pipeline definitions
  • Open Source: Fully open-source with community-driven development

📚 Resources


🤝 Contributing

This project will be open source, and we welcome contributions! Once released, you'll be able to:

  • Report bugs and request features
  • Submit pull requests with improvements
  • Help with documentation
  • Share pipeline templates and examples

Register for updates to be notified when contribution guidelines are published.


📋 Roadmap

Current Status: Development

The Spark Declarative Pipelines UI is currently under active development as a beta feature within Ilum Enterprise Edition. We're working towards an initial release that will include:

  • βœ… Visual pipeline designer
  • βœ… DAG visualization
  • βœ… Code generation (Python & SQL)
  • βœ… Pipeline execution monitoring
  • 🚧 Data Lineage
  • 🚧 Packaging (Docker) and Helm chart
  • 🚧 Advanced debugging tools
  • πŸ“… Template library
  • πŸ“… Multi-user collaboration features

Stay informed about our progress: register to receive updates on development milestones and release dates.


🌟 Why Open Source?

At Ilum, we believe in the power of open-source collaboration. By open-sourcing the Spark Declarative Pipelines UI, we aim to:

  • Accelerate Adoption: Make declarative pipelines accessible to everyone
  • Foster Innovation: Enable community-driven feature development
  • Ensure Vendor Neutrality: Provide a truly open alternative to proprietary tools
  • Build Together: Create the best possible tool through collective expertise

💡 Example Use Cases

Real-time Analytics

Build streaming pipelines that ingest data from Kafka, enrich it with dimension tables, and produce real-time aggregates.
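
A sketch of that pattern (broker address, topic, and table names are hypothetical; spark is obtained as in the earlier example), pairing a Kafka-fed streaming table with a materialized view for the aggregate:

from pyspark import pipelines as dp
from pyspark.sql import functions as F

@dp.table
def raw_events():
    # Incremental ingestion: only new Kafka offsets are processed.
    return (spark.readStream.format("kafka")
            .option("kafka.bootstrap.servers", "broker:9092")
            .option("subscribe", "events")
            .load()
            .selectExpr("CAST(value AS STRING) AS payload", "timestamp"))

@dp.materialized_view
def events_per_minute():
    return (spark.table("raw_events")
            .groupBy(F.window("timestamp", "1 minute"))
            .agg(F.count("*").alias("events")))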

ETL Workflows

Create batch pipelines that extract data from multiple sources, transform it through multiple layers (bronze/silver/gold), and load results into data warehouses.
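
For example, the bronze/silver/gold layering can be expressed as a chain of datasets (a sketch with hypothetical paths and names), with SDP inferring the layer order from the table references:

from pyspark import pipelines as dp

@dp.materialized_view
def bronze_orders():
    # Raw landing-zone data, loaded as-is.
    return spark.read.format("json").load("s3://landing/orders/")

@dp.materialized_view
def silver_orders():
    # Cleaned layer: deduplicated, incomplete rows dropped.
    return (spark.table("bronze_orders")
            .dropDuplicates(["order_id"])
            .dropna(subset=["amount"]))

@dp.materialized_view
def gold_order_totals():
    # Business-level aggregates for reporting.
    return spark.table("silver_orders").groupBy("customer_id").sum("amount")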

Change Data Capture (CDC)

Implement slowly changing dimensions (SCD Type 2) with automatic change tracking and historical versioning.

Data Quality Monitoring

Define data quality expectations and automatically track violations without failing entire pipelines.


📞 Contact & Support


📄 License

This project will be released under an open-source license. Specific license details will be announced with the initial release.


🔔 Don't miss the launch!
Register now to be notified when we release

Made with ❀️ by the Ilum team
