This project intends to collect, analyze and synthesize reference material about data lakes, data warehouses and data lakehouses.
Even though the members of the GitHub organization may be employed by some companies, they speak on their personal behalf and do not represent these companies.
- Databricks blog - What is a data lakehouse
- Authors: Ben Lorica, Michael Armbrust, Reynold Xin, Matei Zaharia and Ali Ghodsi
- Date: Jan. 2020
- Snowflake guides - What is a data lakehouse
- Google Cloud - What is a data lakehouse
- arXiv - The Data Lakehouse: Data Warehousing and More - 2023
- Authors: Dipankar Mazumdar, Jason Hughes, JB Onofré (all working at Dremio at the time)
- Date: October 2023
- Link to the article: https://www.linkedin.com/posts/dipankar-mazumdar_dataengineering-softwareengineering-activity-7283666426437980160-A33n
- What is Apache XTable (formerly OneTable) — Interoperability for Apache Hudi, Iceberg & Delta Lake
- Author: Dipankar Mazumdar (Dipankar Mazumdar on LinkedIn, Dipankar Mazumdar on Medium)
- Date: Dec. 2023
- Understanding Parquet, Iceberg and Data Lakehouses at Broad
- Author: David Gomes (David Gomes on LinkedIn, David Gomes profile page on his own blog)
- Date: December 2023
- Understanding Big Data File Formats
- Author: Vladimir Sivcevic (Vladimir Sivcevic on LinkedIn, Vladimir Sivcevic profile page on his own blog)
- Date: April 2022
- Material for the Data platform - Metadata
- Material for the Data platform - Data contracts
- Material for the Data platform - Data quality
- Material for the Data platform - Modern Data Stack (MDS) in a box
- Architecture principles for data engineering pipelines on the Modern Data Stack (MDS)
- Specifications/principles for a data engineering pipeline deployment tool