test: Support for Multi-Level Partition Tables #115

@shamb0 shamb0 commented Aug 29, 2024

Closes #56

What

Implements a demonstration test for multi-level partition tables, addressing issue #56.

Why

This test demonstrates that the pg_analytics extension can query multi-level partitioned tables. The data is organized with Hive-style partitioning, which encodes partition values (here, year and manufacturer) in the object path, so queries that filter on those columns can be limited to the matching partitions.

How

The implementation involves two key components:

  1. Hive-style Partitioning in S3

    • Organizes data in the S3 bucket using a hierarchical structure:
      s3://{bucket}/year={year}/manufacturer={manufacturer}/data_{index}.parquet
      
    • Implementation in code:
      for (i, batch) in partitioned_batches.iter().enumerate() {
          let key = format!(
              "year={}/manufacturer={}/data_{}.parquet",
              year, manufacturer, i
          );
          s3.put_batch(s3_bucket, &key, batch).await?;
      }
  2. FOREIGN TABLE Configuration in pg_analytics

    • Configures a FOREIGN TABLE with options to access the Hive-style partitioned dataset:
      CREATE FOREIGN TABLE auto_sales (
          sale_id                 BIGINT,
          sale_date               DATE,
          manufacturer            TEXT,
          model                   TEXT,
          price                   DOUBLE PRECISION,
          dealership_id           INT,
          customer_id             INT,
          year                    INT,
          month                   INT
      )
      SERVER auto_sales_server
      OPTIONS (
          files 's3://{s3_bucket}/year=*/manufacturer=*/data_*.parquet',
          hive_partitioning '1'
      );
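The path layout and the `files` glob above can be sketched with the standard library alone. This is a minimal illustration, not code from the PR: `partition_key`, `files_glob`, and `parse_partitions` are hypothetical helper names, and the parser only mimics the idea of reading `name=value` path segments, not how DuckDB implements `hive_partitioning`.

```rust
use std::collections::BTreeMap;

/// Builds an object key in the layout used by the PR:
/// year={year}/manufacturer={manufacturer}/data_{index}.parquet
/// (hypothetical helper, for illustration only)
fn partition_key(year: i32, manufacturer: &str, index: usize) -> String {
    format!(
        "year={}/manufacturer={}/data_{}.parquet",
        year, manufacturer, index
    )
}

/// Builds the glob passed to the `files` option of the foreign table.
/// (hypothetical helper, for illustration only)
fn files_glob(bucket: &str) -> String {
    format!("s3://{}/year=*/manufacturer=*/data_*.parquet", bucket)
}

/// Collects every `name=value` path segment of a key into a map.
/// Segments without '=' (such as the file name) are skipped.
fn parse_partitions(key: &str) -> BTreeMap<String, String> {
    key.split('/')
        .filter_map(|segment| segment.split_once('='))
        .map(|(name, value)| (name.to_string(), value.to_string()))
        .collect()
}
```

For example, `partition_key(2023, "Toyota", 0)` yields a key whose `year` and `manufacturer` segments `parse_partitions` can read back, which is the property that lets the two partition columns be derived from the path rather than only from the file contents.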

Tests

To run the demonstration test, use the following command:

RUST_LOG=info \
    cargo test \
    --test \
    test_mlp_auto_sales \
    -- \
    --nocapture

shamb0 and others added 3 commits August 26, 2024 07:29
Signed-off-by: shamb0 <r.raajey@gmail.com>
Signed-off-by: shamb0 <r.raajey@gmail.com>
@shamb0 shamb0 force-pushed the shamb0/demo-multi-level-partition-table-dset-auto-sales branch from a1ae9c3 to 4d664ed Compare August 29, 2024 17:38
@shamb0
Author

shamb0 commented Aug 29, 2024

Hi @rebasedming,

I'd like to provide an update on the progress of the investigation:

  • I have refactored the execution hook to handle multi-level partition tables, enabling query pushdown to DuckDB.
  • I completed a sanity test using the existing auto sales dataset, and everything is working as expected.

Could you please provide feedback on the approach used for handling multi-level partition table queries in the execution hook? Additionally, I would appreciate any suggestions on how it can be further refined.

@rebasedming
Contributor

Hi @shamb0 -

I admire your perseverance chasing down this issue! It's a tricky one.

I know this is just a draft, but I don't feel good about this implementation. Intercepting and rewriting the query feels pretty unsafe and introduces a lot of complexity to the code base. I'm open to being convinced but I think this is way too much technical overhead for just this one feature.
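To make the safety concern concrete, here is a minimal sketch of why text-level remapping of table names is fragile. It is not the PR's hook code, which is not shown in this thread; `naive_remap`, `remap_identifier`, and the target name `auto_sales_fdw` are all hypothetical.

```rust
/// Replaces every occurrence of `from`, including ones embedded in longer
/// identifiers or string literals. (hypothetical, for illustration only)
fn naive_remap(sql: &str, from: &str, to: &str) -> String {
    sql.replace(from, to)
}

/// Replaces `from` only where it is not adjacent to another identifier
/// character. Still not a parser: it cannot tell a table reference from
/// the same word inside a string literal or a comment.
fn remap_identifier(sql: &str, from: &str, to: &str) -> String {
    fn is_ident(c: char) -> bool {
        c.is_ascii_alphanumeric() || c == '_'
    }
    if from.is_empty() {
        return sql.to_string();
    }
    let mut out = String::with_capacity(sql.len());
    let mut cursor = 0;
    while let Some(offset) = sql[cursor..].find(from) {
        let start = cursor + offset;
        let end = start + from.len();
        // Look at the characters on either side of the match in the original text.
        let before = sql[..start].chars().next_back();
        let after = sql[end..].chars().next();
        let whole = !before.map_or(false, is_ident) && !after.map_or(false, is_ident);
        out.push_str(&sql[cursor..start]);
        out.push_str(if whole { to } else { from });
        cursor = end;
    }
    out.push_str(&sql[cursor..]);
    out
}
```

With the input `SELECT * FROM auto_sales WHERE note = 'auto_sales_server'`, the naive version also rewrites the text inside the literal, while the boundary-aware version leaves it alone. Even the second version would rewrite a literal that is exactly `'auto_sales'`, so a correct rewrite would likely need a real SQL parser, which is part of the complexity the review comment points at.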

@shamb0
Author

shamb0 commented Aug 30, 2024

Hi @rebasedming,

Thank you for your thorough review and candid feedback on the PR code patch. I truly appreciate the time you've taken and your openness to further discussion.

Regarding your concerns:

  1. Safety and Complexity:

    • I understand your reservations about query interception and SQL statement remapping (from root partition table to foreign table names).
    • This approach, while complex, aligns with the core design model of pg_analytics. It's not a new pattern in our system.
  2. Current State and Future Plans:

    • The current patch is in an initial prototype state. I acknowledge there's significant work ahead to refine and improve it.
    • I plan to explore additional possibilities to address the safety and complexity concerns you've raised.
  3. Moving Forward:

    • I'll continue working on this and will update you with any new approaches or improvements.
    • Your input is valuable. If you have any specific suggestions or ideas, I'd be grateful to hear them. They would be extremely helpful in building a more robust solution.

I'm looking forward to our continued collaboration on this.

@shamb0
Author

shamb0 commented Aug 30, 2024

Hi @rebasedming,

  1. Firstly, I'd like to acknowledge that my earlier patches may have been overly complex. Thank you for your patience as I worked through this.

  2. I have some good news to share! Today, I discovered the hive_partitioning option in ParquetOption, which I had previously overlooked. After some experimentation, I found that this leads to a much simpler solution than I initially proposed.

  3. Key point: This simpler approach doesn't require any complex intercepting or SQL statement remapping at the executor hook.

  4. Next steps: I'll be cleaning up the code based on this new approach. I expect to have the updated patch ready by tomorrow morning.

  5. For your reference, I've attached a snapshot of the PostgreSQL server trace.

Thank you for your guidance throughout this process. I appreciate your support.

2024-08-30 22:01:28.251 IST [590249] STATEMENT:  drop database if exists "_sqlx_test_640";
2024-08-30 22:01:33.544 IST [590248] LOG:  statement: 
	            DROP TABLE IF EXISTS auto_sales CASCADE;
	        
2024-08-30 22:01:33.544 IST [590248] NOTICE:  table "auto_sales" does not exist, skipping
2024-08-30 22:01:33.544 IST [590248] LOG:  statement: 
	            DROP SERVER IF EXISTS auto_sales_server CASCADE;
	        
2024-08-30 22:01:33.544 IST [590248] NOTICE:  server "auto_sales_server" does not exist, skipping
2024-08-30 22:01:33.544 IST [590248] LOG:  statement: 
	            DROP FOREIGN DATA WRAPPER IF EXISTS parquet_wrapper CASCADE;
	        
2024-08-30 22:01:33.544 IST [590248] NOTICE:  foreign-data wrapper "parquet_wrapper" does not exist, skipping
2024-08-30 22:01:33.545 IST [590248] LOG:  statement: 
	            DROP USER MAPPING IF EXISTS FOR public SERVER auto_sales_server;
	        
2024-08-30 22:01:33.545 IST [590248] NOTICE:  server "auto_sales_server" does not exist, skipping
2024-08-30 22:01:33.545 IST [590248] LOG:  statement: CREATE FOREIGN DATA WRAPPER parquet_wrapper
	                HANDLER parquet_fdw_handler
	                VALIDATOR parquet_fdw_validator
2024-08-30 22:01:33.545 IST [590248] WARNING:  pga:: *** ParquetFdw::validator() X ***
2024-08-30 22:01:33.545 IST [590248] WARNING:  pga:: *** ParquetFdw::validator() Y ***
2024-08-30 22:01:33.551 IST [590248] LOG:  statement: CREATE SERVER auto_sales_server
	                FOREIGN DATA WRAPPER parquet_wrapper
2024-08-30 22:01:33.551 IST [590248] WARNING:  pga:: *** ParquetFdw::validator() X ***
2024-08-30 22:01:33.551 IST [590248] WARNING:  pga:: *** ParquetFdw::validator() Y ***
2024-08-30 22:01:33.553 IST [590248] LOG:  statement: CREATE USER MAPPING FOR public
	                SERVER auto_sales_server
	                OPTIONS (
	                    type 'S3',
	                    region 'us-east-1',
	                    endpoint 'localhost:33182',
	                    use_ssl 'false',
	                    url_style 'path'
	                )
2024-08-30 22:01:33.553 IST [590248] WARNING:  pga:: *** ParquetFdw::validator() X ***
2024-08-30 22:01:33.553 IST [590248] WARNING:  pga:: *** ParquetFdw::validator() Y ***
2024-08-30 22:01:33.556 IST [590248] LOG:  statement: 
	            CREATE FOREIGN TABLE auto_sales (
	                sale_id                 BIGINT,
	                sale_date               DATE,
	                manufacturer            TEXT,
	                model                   TEXT,
	                price                   DOUBLE PRECISION,
	                dealership_id           INT,
	                customer_id             INT,
	                year                    INT,
	                month                   INT
	            )
	            SERVER auto_sales_server
	            OPTIONS (
	                files 's3://demo-mlp-auto-sales/year=*/manufacturer=*/data_*.parquet',
	                hive_partitioning '1'
	            );

...

2024-08-30 22:01:33.606 IST [590248] WARNING:  pga:: *** fdw::register_duckdb_view() Y ***
2024-08-30 22:01:33.606 IST [590248] LOG:  execute sqlx_s_2: 
	            SELECT year, manufacturer, ROUND(SUM(price)::numeric, 4)::float8 as total_sales
	            FROM auto_sales
	            WHERE year BETWEEN 2020 AND 2024
	            GROUP BY year, manufacturer
	            ORDER BY year, total_sales DESC;
	        
2024-08-30 22:01:33.606 IST [590248] WARNING:  pga:: *** executor_run() X ***
2024-08-30 22:01:33.606 IST [590248] WARNING:  pga:: *** ExtensionHook::executor_run() X ***
2024-08-30 22:01:33.606 IST [590248] WARNING:  pga:: *** get_current_query() X ***
2024-08-30 22:01:33.606 IST [590248] WARNING:  pga:: *** get_current_query() "\n            SELECT year, manufacturer, ROUND(SUM(price)::numeric, 4)::float8 as total_sales\n            FROM auto_sales\n            WHERE year BETWEEN 2020 AND 2024\n            GROUP BY year, manufacturer\n            ORDER BY year, total_sales DESC" ***
2024-08-30 22:01:33.606 IST [590248] WARNING:  pga:: *** get_query_relations() X ***
2024-08-30 22:01:33.606 IST [590248] WARNING:  pga:: *** get_query_relations() 1 ***
2024-08-30 22:01:33.606 IST [590248] WARNING:  query_relations.is_empty() :: false
2024-08-30 22:01:33.606 IST [590248] WARNING:  pga:: *** duckdb::create_arrow() "\n            SELECT year, manufacturer, ROUND(SUM(price)::numeric, 4)::float8 as total_sales\n            FROM auto_sales\n            WHERE year BETWEEN 2020 AND 2024\n            GROUP BY year, manufacturer\n            ORDER BY year, total_sales DESC" ***
2024-08-30 22:01:34.091 IST [590248] WARNING:  pga:: *** duckdb::create_arrow() Y ***
2024-08-30 22:01:34.091 IST [590248] WARNING:  pga:: *** ExtensionHook::write_batches_to_slots() X 4 ***
2024-08-30 22:01:34.091 IST [590248] WARNING:  pga:: *** ExtensionHook::write_batches_to_slots() Y ***
2024-08-30 22:01:34.091 IST [590248] WARNING:  pga:: *** ExtensionHook::executor_run() Y ***
2024-08-30 22:01:34.091 IST [590248] WARNING:  pga:: *** executor_run() Y ***
2024-08-30 22:06:23.245 IST [61966] LOG:  checkpoint starting: time
2024-08-30 22:07:58.164 IST [61966] LOG:  checkpoint complete: wrote 948 buffers (5.8%); 0 WAL file(s) added, 0 removed, 0 recycled; write=94.872 s, sync=0.024 s, total=94.920 s; sync files=304, longest=0.006 s, average=0.001 s; distance=4300 kB, estimate=4306 kB; lsn=0/955A6278, redo lsn=0/955A6240
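The query in the trace filters on `year`, which is one of the partition columns. As a rough sketch of why that matters, the following shows how a reader could narrow the set of files using only their paths. This illustrates the general idea of partition pruning under a Hive-style layout; it is not DuckDB's implementation, and `year_of` and `prune_by_year` are hypothetical names.

```rust
/// Returns the value of the `year=` path segment, if present and numeric.
fn year_of(key: &str) -> Option<i32> {
    key.split('/')
        .find_map(|segment| segment.strip_prefix("year="))
        .and_then(|value| value.parse().ok())
}

/// Keeps only the keys whose `year` partition lies in the inclusive range
/// [from, to], mirroring `WHERE year BETWEEN from AND to`.
/// Keys without a parsable `year=` segment are dropped.
fn prune_by_year<'a>(keys: &[&'a str], from: i32, to: i32) -> Vec<&'a str> {
    keys.iter()
        .copied()
        .filter(|key| year_of(key).map_or(false, |year| (from..=to).contains(&year)))
        .collect()
}
```

Calling `prune_by_year(&keys, 2020, 2024)` on a list of object keys returns only those under `year=2020` through `year=2024`, without opening any Parquet file.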

philippemnoel and others added 3 commits August 30, 2024 15:12
* fixed date functions support

* fixing lint & more testing

* fix & remove unnecessary conversion based on PR comments

* date trunc test

* remove unnecessary code

---------

Co-authored-by: Evance Soumaoro <evanxg852000@gmail.com>
Co-authored-by: Ming Ying <ming.ying.nyc@gmail.com>
Signed-off-by: shamb0 <r.raajey@gmail.com>
Signed-off-by: shamb0 <r.raajey@gmail.com>
@shamb0 shamb0 force-pushed the shamb0/demo-multi-level-partition-table-dset-auto-sales branch from bdf1997 to 0b830ba Compare August 31, 2024 05:39
Signed-off-by: shamb0 <r.raajey@gmail.com>
@shamb0 shamb0 marked this pull request as ready for review August 31, 2024 11:46
@shamb0 shamb0 changed the title from "Shamb0/demo multi level partition table dset auto sales" to "test: Support for Multi-Level Partition Tables" Aug 31, 2024
evanxg852000 and others added 2 commits September 1, 2024 16:34
* fixed json & jsonb cast support

* fixing test

* fixing test

* adding a better test

* refactoring tests

* bug fixes

* json tests passing

* remove debug

* Increase the runner size

Signed-off-by: Philippe Noël <21990816+philippemnoel@users.noreply.github.com>

---------

Signed-off-by: Philippe Noël <21990816+philippemnoel@users.noreply.github.com>
Co-authored-by: Vipul Vaibhaw <vaibhaw.vipul@gmail.com>
Co-authored-by: Ming Ying <ming.ying.nyc@gmail.com>
Co-authored-by: Philippe Noël <21990816+philippemnoel@users.noreply.github.com>
* Rm cargo clean

* Remove unnecessary test

Signed-off-by: Philippe Noël <21990816+philippemnoel@users.noreply.github.com>

---------

Signed-off-by: Philippe Noël <21990816+philippemnoel@users.noreply.github.com>
Signed-off-by: shamb0 <r.raajey@gmail.com>
@shamb0
Author

shamb0 commented Sep 6, 2024

Hi @philippemnoel,

I have resolved the code conflicts, and the branch should now be ready for intake review and potential merge. :)

@philippemnoel
Collaborator

Thank you! I’m personally very excited for this iteration, it feels much better. We’re pretty busy with a few big features right now, but we’ll review ASAP and I’m hopeful we can get this merged next week. @rebasedming is the final authority on any merge, so whenever he has some time he’ll take a look!

Collaborator

@philippemnoel philippemnoel left a comment

lgtm!

Contributor

@Weijun-H Weijun-H left a comment

LGTM! Thanks @shamb0 👍 I left some minors for your consideration.

Cargo.toml (resolved)
tests/fixtures/mod.rs (resolved)
@shamb0
Author

shamb0 commented Sep 23, 2024

@Weijun-H, thanks for your feedback! I’ve addressed all your comments. Let me know if there’s anything else to improve.

Signed-off-by: shamb0 <r.raajey@gmail.com>
Contributor

@Weijun-H Weijun-H left a comment

🚀

@@ -0,0 +1,98 @@
// Copyright (c) 2023-2024 Retake, Inc.
Collaborator

Can we name these tests a bit better? Perhaps something like test_hive_partitioning?

@@ -0,0 +1,614 @@
// Copyright (c) 2023-2024 Retake, Inc.
Collaborator

This file is huge. Do we really need an entire new fixture or can we reuse our existing ones with nyc_trips? If not, is there any way to shorten this file?

Overall I'm quite happy with this PR, but I find that it introduces a lot of new testing utilities. Some of them are built in this file but don't seem to be actual tables. These utilities should probably be moved to a separate file in tests/fixtures/, perhaps parquet.rs for the batch writing, or s3.rs.

@Weijun-H Would love your help thoroughly reviewing this PR. Ming is busy with search-related work.

Once we clean up this file, I am happy to approve and merge this. Thank you for your patience @shamb0.

Collaborator

Tl;dr:

  • If we really need a full separate table, let's keep only that in this file and move the utilities to a properly scoped file
  • If we don't need a full separate table, let's use the NYC trips test fixture

Thank you!

Collaborator

@philippemnoel philippemnoel left a comment

Requested the above change. Once that's done or answered, I am willing to approve this as long as @Weijun-H is also on board.

@shamb0
Author

shamb0 commented Oct 10, 2024

Hi @philippemnoel,

Thank you for the review comments, I really appreciate it. Moving forward, I'll aim for smaller PRs with minimal changes to make them easier to review.

@shamb0 shamb0 closed this Oct 10, 2024
Development

Successfully merging this pull request may close these issues.

Support multi-level partition tables
5 participants