feat(data-warehouse): Use a dynamic chunk size #29419

Open · wants to merge 3 commits into base: master
Conversation

Gilbert09 (Member)

Problem

  • Our chunk size was too large for some tables because those tables were very wide: they contained some very large JSON fields, so we were processing too much data at once, causing pods to OOM

Changes

  • For the Postgres source only, dynamically calculate how large a chunk should be so that we have some guarantee that a chunk won't exceed a given size
  • We take 100 rows from the source DB, get the average size of those rows in bytes, and work out how many rows fit under a ceiling of 150MB (see the sketch after this list)
    • 150MB was chosen by observing which syncs perform well without spiking memory too much
    • The table that was causing OOMs was previously producing ~500MB per chunk
    • We also cap this at the default chunk size, i.e. we take min(default_chunk_size, 150_mb_row_count)
  • This should only affect wide tables; all other tables will stay at the current 20k row limit
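
To make the calculation concrete, here is a minimal sketch of the approach described above. The function name, the constants, and the `pg_column_size` sampling query are assumptions based on this description, not the PR's exact code:

```python
import math
from typing import Any

# Assumed constants mirroring the description above (not the PR's exact definitions).
DEFAULT_CHUNK_SIZE = 20_000                    # existing per-chunk row limit
DEFAULT_TABLE_SIZE_BYTES = 150 * 1024 * 1024   # 150MB ceiling for a single chunk


def _get_table_chunk_size(cursor: Any, schema: str, table: str) -> int:
    """Sample 100 rows, estimate the average row size in bytes, and pick a
    chunk size that keeps a single chunk under the 150MB ceiling."""
    # pg_column_size(t.*) returns the stored size of the whole row in bytes,
    # so averaging it over a 100-row sample approximates the table's row width.
    cursor.execute(
        f'SELECT AVG(pg_column_size(t.*)) FROM (SELECT * FROM "{schema}"."{table}" LIMIT 100) t'
    )
    row = cursor.fetchone()
    avg_row_size_bytes = row[0] if row else None

    if not avg_row_size_bytes:
        # Empty table or failed sample: keep the existing default.
        return DEFAULT_CHUNK_SIZE

    rows_per_ceiling = int(DEFAULT_TABLE_SIZE_BYTES // avg_row_size_bytes)

    # Cap at the default so only wide tables end up with a smaller chunk size.
    return max(1, min(DEFAULT_CHUNK_SIZE, rows_per_ceiling))
```

Capping with `min()` means tables with small rows keep the existing 20k chunk size, which matches the point above that only wide tables are affected.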

Does this work well for both Cloud and self-hosted?

Yes

How did you test this code?

Tested locally

Gilbert09 requested a review from a team on March 3, 2025 at 14:45
greptile-apps bot (Contributor) left a comment


PR Summary

This PR implements dynamic chunk size calculation for PostgreSQL data imports to prevent out-of-memory errors when processing tables with wide rows containing large JSON fields.

  • Added DEFAULT_TABLE_SIZE_BYTES constant (150MB) in posthog/temporal/data_imports/pipelines/sql_database/settings.py as a ceiling for chunk memory usage
  • Implemented _get_table_chunk_size function in posthog/temporal/data_imports/pipelines/postgres/postgres.py that samples 100 rows to calculate average row size
  • Added logic to determine optimal chunk size by dividing the 150MB ceiling by the average row size
  • Implemented fallback to DEFAULT_CHUNK_SIZE (20,000) when the calculation fails or for smaller tables (sketched below)
  • Added detailed logging to track row size calculations and chosen chunk sizes
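
As a rough illustration of how the pieces listed above might fit together, here is a hedged usage sketch. The import paths follow the files named in the summary, but the signature of `_get_table_chunk_size` and the helper `resolve_chunk_size` are assumptions, not the PR's exact code:

```python
import logging
from typing import Any

# Import paths follow the files named in the summary; the exact exported
# names are assumptions for the purpose of this sketch.
from posthog.temporal.data_imports.pipelines.sql_database.settings import (
    DEFAULT_CHUNK_SIZE,
    DEFAULT_TABLE_SIZE_BYTES,
)
from posthog.temporal.data_imports.pipelines.postgres.postgres import _get_table_chunk_size

logger = logging.getLogger(__name__)


def resolve_chunk_size(cursor: Any, schema: str, table: str) -> int:
    """Pick a chunk size for a sync, falling back to the default if sampling fails."""
    try:
        chunk_size = _get_table_chunk_size(cursor, schema, table)
    except Exception:
        # Per the summary, any failure in the calculation falls back to the default.
        logger.exception("Chunk size calculation failed for %s.%s; using default", schema, table)
        return DEFAULT_CHUNK_SIZE

    # Logging mirrors the "detailed logging" bullet above.
    logger.info(
        "Using chunk size %d for %s.%s (ceiling=%d bytes, default=%d rows)",
        chunk_size, schema, table, DEFAULT_TABLE_SIZE_BYTES, DEFAULT_CHUNK_SIZE,
    )
    return chunk_size
```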

2 files reviewed, 1 comment
