This project demonstrates how to build a versioned data lakehouse with an atomic version-controlled ETL pipeline using Project Nessie as the catalog provider, Apache Iceberg for table format, and Apache Spark for processing. It shows how to apply Git-like version control to your data engineering workflows, similar to how Git manages your code.
- **Git-like Version Control for Data**: Create branches and tags, and manage versions of your data
- **Jupyter Integration**: Interactive notebooks to explore and transform data
- **Object Storage**: S3-compatible storage for your data lake
- **Docker-based Setup**: Easy to run locally with all components containerized
- **IMDb Dataset**: Real-world example using movie data
The data lakehouse architecture combines several components:
- Raw IMDb movie data is ingested into the data lake using Apache Spark (a minimal sketch follows this list).
- Data is stored in Apache Iceberg tables within MinIO object storage.
- Project Nessie provides Git-like version control for the data.
- Apache Spark processes and transforms the data.
- Jupyter Notebooks are used as a consumer for interactive data analysis and ETL.
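To make the ingestion step concrete, here is a minimal sketch of loading the raw file into an Iceberg table on a `raw` branch. The file path `data/imdb_movies.csv` and the table name are assumptions for illustration; the tutorial notebook is the authoritative version:

```python
# Minimal ingestion sketch (file path and table name are illustrative;
# see the tutorial notebook for the real pipeline).
from pyspark.sql import SparkSession

# In the bundled Jupyter container the session picks up spark-defaults.conf,
# so the Nessie catalog is already configured.
spark = SparkSession.builder.appName("imdb-ingest").getOrCreate()

# Ingest on the raw branch so main is never touched directly.
spark.sql("CREATE BRANCH IF NOT EXISTS raw IN nessie FROM main")
spark.sql("USE REFERENCE raw IN nessie")

# Create the namespace and load the raw file as-is.
spark.sql("CREATE NAMESPACE IF NOT EXISTS nessie.imdb")
raw_df = spark.read.option("header", True).csv("data/imdb_movies.csv")
raw_df.writeTo("nessie.imdb.movies").createOrReplace()
```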
The workflow mirrors Git's branching strategy:
- Raw Data Ingestion → `raw` branch
- Data Transformation → `dev` branch
- Validation & Quality Checks → `dev` branch
- Promotion to Production → `main` branch
- Time Travel → tags and commit hashes
Note: Creating a branch does not duplicate data; the branch points to the same underlying files until changes are made. This keeps storage usage efficient and makes branching and merging fast.
```mermaid
%%{init: {'theme': 'base', 'themeVariables': { 'background': '#f0f0f0', 'primaryColor': '#ffffff', 'textColor': '#333333' }}}%%
gitGraph
    commit id: "namespace imdb"
    branch raw
    commit id: "movies table"
    commit id: "raw IMDb Data"
    branch dev
    commit id: "clean null directors"
    commit id: "remove low ratings"
    commit id: "validate data quality"
    checkout main
    merge dev id: "merge validated data" type: HIGHLIGHT tag: "report_202501"
```
- Consumers can read from the stable `main` branch.
- Data engineers perform transformations on development branches.
- Validated changes are merged into the `main` branch, ensuring atomic updates (see the sketch below).
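Driven from a notebook, the whole promotion cycle reduces to a handful of Nessie SQL commands. A sketch, with branch and tag names mirroring the diagram above:

```python
# Promotion workflow sketch using Nessie's Spark SQL extensions.

# 1. Branch off main for development work (no data is copied).
spark.sql("CREATE BRANCH IF NOT EXISTS dev IN nessie FROM main")

# 2. Point the session at the dev branch; writes now commit to dev only.
spark.sql("USE REFERENCE dev IN nessie")
# ... transformations and quality checks happen here ...

# 3. Atomically promote all validated dev commits into main.
spark.sql("MERGE BRANCH dev INTO main IN nessie")

# 4. Tag the released state of main so reports can pin to it later.
spark.sql("CREATE TAG IF NOT EXISTS report_202501 IN nessie FROM main")
```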
| Component | Technology | Purpose |
|---|---|---|
| Processing | Apache Spark | Distributed data processing and SQL engine |
| Version Control | Project Nessie | Git-like catalog with branches, tags, and commits |
| Table Format | Apache Iceberg | Open table format with ACID transactions and time travel |
| Storage | MinIO | S3-compatible object storage |
```
├── conf                  # Spark configuration files
├── data                  # Datasets
├── docker-compose.yaml   # Docker setup configuration
├── jars                  # Nessie Spark extensions JAR
└── notebooks             # Jupyter notebooks
```
- Docker and Docker Compose
- At least 8GB of RAM allocated to Docker
- Basic understanding of data lakehouse concepts
```bash
git clone https://github.com/abeltavares/versioned-data-lakehouse.git
cd versioned-data-lakehouse
docker-compose up -d
```

To stop the stack and remove all data volumes:

```bash
docker-compose down --volumes
```
- **Jupyter Notebook**: http://localhost:8888
- **Nessie UI**: http://localhost:19120
- **MinIO Console**: http://localhost:9001
  - Username: `admin`
  - Password: `password`
- **Spark UI**: http://localhost:4041
We use the IMDb movies dataset to demonstrate real-world data management scenarios:
- Comprehensive movie information including titles, genres, directors, and ratings
- Rich enough to showcase complex transformations and versioning
- Perfect for demonstrating branch-based development workflows
- Allows exploration of various Iceberg features like schema evolution and partition optimization
Dataset: IMDb Movies Dataset
Access the tutorial notebook at http://localhost:8888/doc/tree/imdb_movies.ipynb, which implements a complete ETL pipeline with version control:
- Setting up Spark with Nessie integration
- Loading raw IMDb data into a raw branch
- Configuring the data lake structure
- Creating development branches for transformations
- Implementing data cleaning operations
- Performing data quality validations
- Demonstrating time travel capabilities
- Merging validated changes to main
- Creating version tags for reporting
- Managing the promotion workflow
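For a taste of the transformation, validation, and time-travel steps above, here is a compressed sketch. The column names `director` and `rating` are assumptions taken from the commit messages in the diagram; the notebook remains the authoritative walkthrough:

```python
# Transformation and validation sketch on the dev branch.
# Column names (director, rating) are assumptions; see the notebook.
spark.sql("USE REFERENCE dev IN nessie")

# Cleaning: drop rows with null directors and low ratings.
spark.sql("DELETE FROM nessie.imdb.movies WHERE director IS NULL")
spark.sql("DELETE FROM nessie.imdb.movies WHERE rating < 5.0")

# Quality gate: fail fast if any nulls survived, before merging to main.
remaining = spark.sql(
    "SELECT count(*) AS n FROM nessie.imdb.movies WHERE director IS NULL"
).first()["n"]
assert remaining == 0, "quality check failed: null directors remain on dev"

# Time travel: once the tag exists, point the session at it (or at a
# commit hash) and read the table exactly as it was at that version.
spark.sql("USE REFERENCE report_202501 IN nessie")
spark.table("nessie.imdb.movies").show(5)
```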
Open the Nessie UI at http://localhost:19120 to:

- View existing branches and tags
- Explore repository contents
- Track changes as you follow along with the tutorial:
  - Watch new branches appear as they are created
  - Observe commits as data changes are made
  - See how tags mark versions
  - Validate data promotions
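If you prefer to stay in the notebook, the same repository state is queryable through Nessie's Spark SQL extensions:

```python
# The state shown in the Nessie UI is also available from Spark SQL.
spark.sql("LIST REFERENCES IN nessie").show(truncate=False)  # all branches and tags
spark.sql("SHOW REFERENCE IN nessie").show(truncate=False)   # the session's current ref
```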
The project uses the following Spark configuration (`spark-defaults.conf`):
```properties
# Spark Configuration
spark.jars.packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.0,org.projectnessie.nessie-integrations:nessie-spark-extensions-3.5_2.12:0.101.3
spark.sql.extensions org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,org.projectnessie.spark.extensions.NessieSparkSessionExtensions

# Set Nessie as default catalog
spark.sql.defaultCatalog nessie

# Nessie Catalog Configuration
spark.sql.catalog.nessie org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.nessie.catalog-impl org.apache.iceberg.nessie.NessieCatalog
spark.sql.catalog.nessie.io-impl org.apache.iceberg.aws.s3.S3FileIO
spark.sql.catalog.nessie.warehouse s3://warehouse/
spark.sql.catalog.nessie.uri http://nessie:19120/api/v2
spark.sql.catalog.nessie.ref main
spark.sql.catalog.nessie.s3.endpoint http://minio:9000
spark.sql.catalog.nessie.s3.path-style-access true
spark.sql.catalog.nessie.client-api-version 2

# AWS S3 credentials for Nessie catalog
spark.sql.catalog.nessie.s3.access-key-id admin
spark.sql.catalog.nessie.s3.secret-access-key password
```
The `spark-defaults.conf` file provides the configuration needed to get started with these experiments. As you become more comfortable with the basics, you can extend the project to implement more complex data management patterns.
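If you ever need a Spark session outside the provided containers, the same settings translate one-to-one into builder options. A sketch, not something the project itself requires:

```python
# Programmatic equivalent of spark-defaults.conf (the containers rely on
# the config file, so this is only needed when running Spark elsewhere).
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("nessie-lakehouse")
    .config("spark.jars.packages",
            "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.0,"
            "org.projectnessie.nessie-integrations:nessie-spark-extensions-3.5_2.12:0.101.3")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,"
            "org.projectnessie.spark.extensions.NessieSparkSessionExtensions")
    .config("spark.sql.defaultCatalog", "nessie")
    .config("spark.sql.catalog.nessie", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.nessie.catalog-impl", "org.apache.iceberg.nessie.NessieCatalog")
    .config("spark.sql.catalog.nessie.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
    .config("spark.sql.catalog.nessie.warehouse", "s3://warehouse/")
    .config("spark.sql.catalog.nessie.uri", "http://nessie:19120/api/v2")
    .config("spark.sql.catalog.nessie.ref", "main")
    .config("spark.sql.catalog.nessie.s3.endpoint", "http://minio:9000")
    .config("spark.sql.catalog.nessie.s3.path-style-access", "true")
    .config("spark.sql.catalog.nessie.client-api-version", "2")
    .getOrCreate()
)
```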
While the primary goal is to showcase Nessie's features, the project's integrated stack (Nessie, Iceberg, and Spark) provides a solid foundation for exploring:
- Extensions of the version control patterns shown here
- Feature-branch development workflows for data
- Automated data validation in multi-stage transformation pipelines
- Data promotion across environments
- Additional data sources and formats
- Iceberg table format features such as schema evolution (see the sketch below)
- More advanced Spark processing capabilities
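For example, Iceberg's schema evolution composes naturally with Nessie branching: evolve the schema on its own branch, then merge it to `main` like any other change. A sketch, with a hypothetical `box_office` column:

```python
# Schema evolution sketch: change the schema on a dedicated branch, then
# merge. The column name box_office is hypothetical.
spark.sql("CREATE BRANCH IF NOT EXISTS schema_change IN nessie FROM main")
spark.sql("USE REFERENCE schema_change IN nessie")
spark.sql("ALTER TABLE nessie.imdb.movies ADD COLUMNS (box_office BIGINT)")
spark.sql("MERGE BRANCH schema_change INTO main IN nessie")
```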
Common issues and solutions:
- **Memory Issues**: Increase Docker's memory allocation (at least 8GB is recommended)
- **Port Conflicts**: Check for other services using ports 8888, 19120, 9000/9001, or 4041
- **Connection Problems**: Verify network connectivity between containers
Contributions are welcome! Areas to consider:
- Additional notebook examples
- New data processing patterns
- Performance optimization techniques
- Documentation improvements
- [Apache Iceberg Documentation](https://iceberg.apache.org/docs/latest/)
- [Project Nessie Documentation](https://projectnessie.org/)
- [Apache Spark Documentation](https://spark.apache.org/docs/latest/)
- [MinIO Documentation](https://min.io/docs/)
This project is licensed under the MIT License.