
test: Support for Multi-Level Partition Tables #88

Conversation

@shamb0 commented Aug 20, 2024

Closes #56

What

Implements a demonstration test for multi-level partition tables, addressing issue #56.

Why

This demo showcases the pg_analytics extension's ability to support multi-level partitioned tables. The implementation organizes data hierarchically, so queries can touch only the partitions relevant to their predicates.

How

  1. Generates a simulated Auto Sales dataset and creates a local Parquet data source.
  2. Uses DataFusion to load the sales data from the Parquet file into a DataFrame.
  3. Partitions the data and uploads the partitions to an S3 bucket.
  4. Configures the necessary tables and the pg_analytics Foreign Data Wrapper (FDW) in PostgreSQL over the S3 data (see the sketch after this list).
  5. Executes analytics test queries against the multi-level partitioned table.
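
Step 4 boils down to a handful of DDL statements issued from the test harness. Below is a minimal sketch, assuming a sqlx Postgres connection; the handler/validator names follow pg_analytics's documented Parquet wrapper, and the server, table, and bucket names are illustrative rather than the PR's actual fixtures:

```rust
use sqlx::PgConnection;

// Hedged sketch of step 4; identifiers and S3 paths are hypothetical.
async fn setup_fdw(conn: &mut PgConnection) -> Result<(), sqlx::Error> {
    // pg_analytics exposes a Parquet FDW; one server can back many foreign tables.
    sqlx::query(
        "CREATE FOREIGN DATA WRAPPER parquet_wrapper
         HANDLER parquet_fdw_handler VALIDATOR parquet_fdw_validator",
    )
    .execute(&mut *conn)
    .await?;

    sqlx::query("CREATE SERVER auto_sales_server FOREIGN DATA WRAPPER parquet_wrapper")
        .execute(&mut *conn)
        .await?;

    // Each leaf foreign table maps to one partition's Parquet prefix in S3;
    // the empty column list lets the FDW infer the schema from the files.
    sqlx::query(
        "CREATE FOREIGN TABLE auto_sales_2020_01 ()
         SERVER auto_sales_server
         OPTIONS (files 's3://demo-bucket/auto_sales/year=2020/month=1/*.parquet')",
    )
    .execute(&mut *conn)
    .await?;

    Ok(())
}
```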

Tests

To run the demonstration test:

RUST_LOG=info \
    cargo test \
    --test \
    test_mlp_auto_sales \
    -- \
    --nocapture

Test traces are available in the attached log file: https://gist.githubusercontent.com/shamb0/2ed909ac9604c610af1d7fa0e87f9a82/raw/02a4203cdc1d675181d9f9700c578c81405becdb/wk2434-pg_analytics-mlp-demo.md

@CLAassistant commented Aug 20, 2024

CLA assistant check: all committers have signed the CLA.

@shamb0 shamb0 marked this pull request as ready for review August 20, 2024 09:13
@shamb0 shamb0 changed the title Demo: Support for Multi-Level Partition Tables test: Support for Multi-Level Partition Tables Aug 20, 2024
@rebasedming (Contributor) left a comment

Could you explain in the PR README how you set up the Postgres tables to match the Parquet partitions? It looks like this is something we'll need to document in the main repo to show users how to do it.

@shamb0 (Author) commented Aug 22, 2024

> Could you explain in the PR README how you set up the Postgres tables to match the Parquet partitions? It looks like this is something we'll need to document in the main repo to show users how to do it.

Hi @rebasedming,

I’ve put together a detailed README for the PR, which you can find here.

I aimed to be thorough in capturing all the necessary details. If you find it too lengthy or in need of restructuring, please let me know, and I’ll make the necessary adjustments.

@rebasedming (Contributor) left a comment

I've put up #91 which brings all our test fixtures into this crate. If possible, I'd appreciate it if you could align your PR with this one once it's merged. For instance, you can now put your fixtures as a new file in the fixtures crate, and auto_sales can go under tables.

An inline review thread is anchored on this diff context:

```rust
}

// Implement Printable for tuples up to 12 elements
impl_printable_for_tuple!(T1);
```
Contributor:
Let's not bundle this into this PR, it's a separate thing that we can discuss whether it's worth bringing into the repo.

Author:
Sure, done. I've removed the macros. Let me know if you want tests/common/print_utils.rs removed as well.

@rebasedming (Contributor) commented

>> Could you explain in the PR README how you set up the Postgres tables to match the Parquet partitions? It looks like this is something we'll need to document in the main repo to show users how to do it.
>
> Hi @rebasedming,
>
> I've put together a detailed README for the PR, which you can find here.
>
> I aimed to be thorough in capturing all the necessary details. If you find it too lengthy or in need of restructuring, please let me know, and I'll make the necessary adjustments.

Nice, thank you for the extremely thorough writeup.

What I was most interested in is your approach of putting partitioned heap tables in front of foreign tables to pass partition keys to the Parquet file string. I wasn't aware this was possible and was hoping you could elaborate on that.

@shamb0 (Author) commented Aug 24, 2024

Thanks for pointing out the missing critical piece. I've introduced the section Partitioned Table Structure and S3 Integration with the necessary details, and I've also created a TL;DR quick overview.
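
For readers following along, the approach under discussion (a partitioned heap table whose leaves are pg_analytics foreign tables) looks roughly like this. This is a hedged sketch, not the PR's actual fixtures: the column list, partition keys, and S3 paths are illustrative, and it reuses the hypothetical auto_sales_server from the earlier sketch:

```rust
use sqlx::PgConnection;

// Hedged sketch of the multi-level layout; all identifiers are hypothetical.
async fn setup_partitions(conn: &mut PgConnection) -> Result<(), sqlx::Error> {
    // The top-level heap table only declares the partition scheme.
    sqlx::query(
        "CREATE TABLE auto_sales (sale_id BIGINT, year INT, month INT, price FLOAT8)
         PARTITION BY LIST (year)",
    )
    .execute(&mut *conn)
    .await?;

    // Mid-level partitions are themselves partitioned, hence multi-level.
    sqlx::query(
        "CREATE TABLE auto_sales_2020 PARTITION OF auto_sales
         FOR VALUES IN (2020) PARTITION BY LIST (month)",
    )
    .execute(&mut *conn)
    .await?;

    // The leaves are foreign tables, so each combination of partition keys
    // maps to exactly one Parquet prefix in S3.
    sqlx::query(
        "CREATE FOREIGN TABLE auto_sales_2020_01 PARTITION OF auto_sales_2020
         FOR VALUES IN (1) SERVER auto_sales_server
         OPTIONS (files 's3://demo-bucket/auto_sales/year=2020/month=1/*.parquet')",
    )
    .execute(&mut *conn)
    .await?;

    Ok(())
}
```

Postgres then routes queries on auto_sales to the matching leaves via ordinary partition pruning, which is what lets the partition keys select specific Parquet paths.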

@shamb0 force-pushed the shamb0/demo-multi-level-partition-table-dset-auto-sales branch from ffc9515 to 50cdc5b on August 24, 2024 04:35
@shamb0 force-pushed the shamb0/demo-multi-level-partition-table-dset-auto-sales branch 2 times, most recently from b8c27f9 to 9491b82 on August 24, 2024 05:15
@shamb0 (Author) commented Aug 24, 2024

Hi @rebasedming,

I've made some changes to address the clippy warnings in this PR:

  1. Created a new crate: pg_analytics_test_helpers
  2. Moved fixtures and common modules from the tests folder to this new crate

These changes were necessary to get the CI build passing. I know this might be outside the scope of our current PR, so let me know if you'd prefer I revert these changes.

Thanks for your input on this.

@philippemnoel (Collaborator) commented

> Hi @rebasedming,
>
> I've made some changes to address the clippy warnings in this PR:
>
>   1. Created a new crate: pg_analytics_test_helpers
>   2. Moved fixtures and common modules from the tests folder to this new crate
>
> These changes were necessary to get the CI build passing. I know this might be outside the scope of our current PR, so let me know if you'd prefer I revert these changes.
>
> Thanks for your input on this.

Can you explain which Clippy warnings you needed a new crate for? I don't think we should complicate the project with an extra crate. We can simply ignore some clippy warnings in the test files if they're causing issues.

@philippemnoel (Collaborator) commented
I also took a look at your CI error, and I'm a bit confused by it. I suspect this has to do with the newly introduced crate. If you keep a single crate, it shouldn't find two different PG versions.

@shamb0 (Author) commented Aug 24, 2024

Hi @philippemnoel,

I've identified an issue with commit 9491b824e5dfffd3d636dcb7bb4b67a3fe9ee858:

  1. cargo test runs successfully
  2. cargo clippy --all-targets fails

The problem:

  • Error occurs in: use crate::common::{execute_query, fetch_results, print_utils};
  • Clippy searches for the common module in the pg_analytics crate instead of the tests folder

I tried several solutions (e.g., using super) to fix the module path, but with no luck. As a temporary fix, I've:

  • Moved the fixtures and common modules out of the tests folder
  • Created a new crate: pg_analytics_test_helpers


@philippemnoel (Collaborator) commented

> Hi @philippemnoel,
>
> I've identified an issue with commit 9491b824e5dfffd3d636dcb7bb4b67a3fe9ee858:
>
>   1. cargo test runs successfully
>   2. cargo clippy --all-targets fails
>
> The problem:
>
>   • Error occurs in: use crate::common::{execute_query, fetch_results, print_utils};
>   • Clippy searches for the common module in the pg_analytics crate instead of the tests folder
>
> I tried several solutions (e.g., using super) to fix the module path, but with no luck. As a temporary fix, I've:
>
>   • Moved the fixtures and common modules out of the tests folder
>   • Created a new crate: pg_analytics_test_helpers

This code is added by your PR right? Can we instead fix the import path and keep things in the same crate?
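
For context, the usual Cargo convention for sharing helpers across integration tests without a separate crate is a tests/common/mod.rs module. A minimal sketch follows, assuming that layout applies here; establish_test_context and the file contents are hypothetical:

```rust
// tests/common/mod.rs
// Using the mod.rs form (rather than tests/common.rs) keeps Cargo from
// compiling the helpers as a standalone integration-test target.
pub fn establish_test_context() {
    // hypothetical shared setup: spawn Postgres, seed fixtures, etc.
}
```

```rust
// tests/test_mlp_auto_sales.rs
// Each file under tests/ is compiled as its own crate, so the test binary
// must declare the module itself for crate::common::... paths to resolve.
mod common;

#[test]
fn smoke() {
    common::establish_test_context();
}
```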

@shamb0 force-pushed the shamb0/demo-multi-level-partition-table-dset-auto-sales branch 2 times, most recently from e947e42 to 06f0b36 on August 25, 2024 02:21
@shamb0 (Author) commented Aug 25, 2024

Hi @philippemnoel,

The PR is ready for the next round of review. I've removed the duplicate code, performed a cleanup, and ensured proper integration with PR #91. Additionally, fixtures/tables/auto_sales.rs now follows the pattern used by other test implementations across the workspace.

@philippemnoel (Collaborator) commented
Hi @shamb0. This looks super clean! Thank you for integrating everything properly, I'm very excited about this PR.

I believe it should also have documentation, so that users know partitions are supported and how to use them. Our documentation is stored in https://github.com/paradedb/paradedb/tree/dev/docs. Would you be willing to submit a PR with documentation to that repository as well? Then I think everything will be complete :). I'll let Ming do a more thorough review.

@shamb0 (Author) commented Aug 25, 2024

Hi @philippemnoel,

I wanted to update you on PR #1568, which includes the recent documentation changes.

Currently, I've placed the new topic under ingest/configuration/multi-level-partitioned-tables. Could you please review and let me know whether this is the most appropriate location, or whether you'd prefer a different path for this topic?


@philippemnoel (Collaborator) left a comment

Hi @shamb0! Thank you for doing this. We appreciate the docs PRs. We'll probably push a commit to it to make it more similar to our existing documentation, and then this should be good. I'll let @rebasedming do a proper review

@philippemnoel (Collaborator) commented

Sorry, we had to merge a few other PRs to get v0.1.1 out. There shouldn't be anything else that introduces a conflict, but could you please rebase it? We'll prioritize it :)

@shamb0 force-pushed the shamb0/demo-multi-level-partition-table-dset-auto-sales branch from 06f0b36 to 3ea0fc0 on August 26, 2024 02:44
@shamb0 (Author) commented Aug 26, 2024

Hi @philippemnoel,

The rebase is complete, and the PR should now be ready for intake review. Please let me know if you encounter any issues.

Thanks again!

@philippemnoel (Collaborator) commented

> Hi @philippemnoel,
>
> The rebase is complete, and the PR should now be ready for intake review. Please let me know if you encounter any issues.
>
> Thanks again!

Thank you! Could you please take a look at the failing test?

@rebasedming (Contributor) commented

Hi @shamb0, I've spent some time testing this PR and I have some bad news.

While the strategy you used of setting partitions to foreign tables works, it comes at a significant performance penalty. In pg_analytics, we push down the entire query to DuckDB by intercepting it in the executor hook. By querying the top-level partitioned table you put in front of the foreign tables, the executor hook is not run, and the query is executed by the Postgres FDW API, which essentially performs a sequential scan of the underlying Parquet file (with predicate/limit pushdown). This scan is significantly slower than a full DuckDB query for lots of use cases and obviates the performance benefits of only scanning one Parquet file.
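
To make the distinction concrete, here is a hedged illustration reusing the hypothetical tables from the earlier sketches (not the PR's actual fixtures): the first query targets a leaf foreign table directly and can be intercepted wholesale by the executor hook, while the second goes through the partitioned parent and is executed leaf-by-leaf via the FDW scan path.

```rust
use sqlx::{PgConnection, Row};

// Hedged illustration; identifiers reuse the hypothetical sketches above.
async fn compare_paths(conn: &mut PgConnection) -> Result<(), sqlx::Error> {
    // Direct foreign-table query: eligible for full pushdown to DuckDB.
    let direct = sqlx::query(
        "SELECT count(*) FROM auto_sales_2020_01 WHERE price > 30000",
    )
    .fetch_one(&mut *conn)
    .await?;

    // Query through the partitioned parent: Postgres prunes to the same leaf,
    // but reads it via the FDW scan path rather than one pushed-down query.
    let via_parent = sqlx::query(
        "SELECT count(*) FROM auto_sales
         WHERE year = 2020 AND month = 1 AND price > 30000",
    )
    .fetch_one(&mut *conn)
    .await?;

    // Both return the same count; only the execution path (and speed) differs.
    assert_eq!(direct.get::<i64, _>(0), via_parent.get::<i64, _>(0));
    Ok(())
}
```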

@shamb0 (Author) commented Aug 27, 2024

> Hi @shamb0, I've spent some time testing this PR and I have some bad news.
>
> While the strategy you used of setting partitions to foreign tables works, it comes at a significant performance penalty. In pg_analytics, we push down the entire query to DuckDB by intercepting it in the executor hook. By querying the top-level partitioned table you put in front of the foreign tables, the executor hook is not run, and the query is executed by the Postgres FDW API, which essentially performs a sequential scan of the underlying Parquet file (with predicate/limit pushdown). This scan is significantly slower than a full DuckDB query for lots of use cases and obviates the performance benefits of only scanning one Parquet file.

Hi @rebasedming,

Thank you for the insightful feedback and for highlighting the performance issue with the strategy used in the PR.

Based on your comments, I understand that the current approach of using partitioned heap tables in front of foreign tables bypasses the executor hook, which normally pushes queries to DuckDB. This results in slower query execution through PostgreSQL's FDW API, as it ends up performing a sequential scan of the underlying Parquet files.

I will investigate this issue further and work on an improved strategy that maintains the performance benefits of DuckDB. I’ll get back to you soon with a better solution.

Thanks for your patience!

@philippemnoel (Collaborator) commented

>> Hi @shamb0, I've spent some time testing this PR and I have some bad news.
>> While the strategy you used of setting partitions to foreign tables works, it comes at a significant performance penalty. In pg_analytics, we push down the entire query to DuckDB by intercepting it in the executor hook. By querying the top-level partitioned table you put in front of the foreign tables, the executor hook is not run, and the query is executed by the Postgres FDW API, which essentially performs a sequential scan of the underlying Parquet file (with predicate/limit pushdown). This scan is significantly slower than a full DuckDB query for lots of use cases and obviates the performance benefits of only scanning one Parquet file.
>
> Hi @rebasedming,
>
> Thank you for the insightful feedback and for highlighting the performance issue with the strategy used in the PR.
>
> Based on your comments, I understand that the current approach of using partitioned heap tables in front of foreign tables bypasses the executor hook, which normally pushes queries to DuckDB. This results in slower query execution through PostgreSQL's FDW API, as it ends up performing a sequential scan of the underlying Parquet files.
>
> I will investigate this issue further and work on an improved strategy that maintains the performance benefits of DuckDB. I'll get back to you soon with a better solution.
>
> Thanks for your patience!

We're excited to see the next iteration :)

@rebasedming (Contributor) commented

Closing this now as per the above discussion.
