feat(data-warehouse): Use a dynamic chunk size #29419

Open · wants to merge 3 commits into base: master
Conversation

Gilbert09 (Member)

Problem

  • Our chunk size was too large for some tables because those tables were very wide: they contained some very large JSON fields, so we were processing too much data at once, causing pods to OOM

Changes

  • For the Postgres source only, dynamically calculate how large a chunk should be so that we have some guarantee that a chunk won't exceed a given size
  • We take 100 rows from the source DB, get the average size of those rows in bytes, and work out how many rows fit under a ceiling of 150MB (see the sketch after this list)
    • 150MB was chosen by observing which syncs perform well without spiking memory too much
    • The table that was causing OOMs was previously producing ~500MB per chunk
    • We also cap this at the default chunk size, i.e. we take min(default_chunk_size, 150_mb_row_count)
  • This should only affect wide tables; all other tables will stay at the current 20k row limit
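
To make the calculation concrete, here is a minimal sketch of the approach described above. The function name, the constants, and the `pg_column_size` sampling query are assumptions based on this description, not the PR's exact code:

```python
import math
from typing import Any

# Assumed constants mirroring the description above (not the PR's exact definitions).
DEFAULT_CHUNK_SIZE = 20_000                    # existing per-chunk row limit
DEFAULT_TABLE_SIZE_BYTES = 150 * 1024 * 1024   # 150MB ceiling for a single chunk


def _get_table_chunk_size(cursor: Any, schema: str, table: str) -> int:
    """Sample 100 rows, estimate the average row size in bytes, and pick a
    chunk size that keeps a single chunk under the 150MB ceiling."""
    # pg_column_size(t.*) returns the stored size of the whole row in bytes,
    # so averaging it over a 100-row sample approximates the table's row width.
    cursor.execute(
        f'SELECT AVG(pg_column_size(t.*)) FROM (SELECT * FROM "{schema}"."{table}" LIMIT 100) t'
    )
    row = cursor.fetchone()
    avg_row_size_bytes = row[0] if row else None

    if not avg_row_size_bytes:
        # Empty table or failed sample: keep the existing default.
        return DEFAULT_CHUNK_SIZE

    rows_per_ceiling = int(DEFAULT_TABLE_SIZE_BYTES // avg_row_size_bytes)

    # Cap at the default so only wide tables end up with a smaller chunk size.
    return max(1, min(DEFAULT_CHUNK_SIZE, rows_per_ceiling))
```

Capping with `min()` means tables with small rows keep the existing 20k chunk size, which matches the point above that only wide tables are affected.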

Does this work well for both Cloud and self-hosted?

Yes

How did you test this code?

Tested locally

Gilbert09 requested a review from a team on March 3, 2025 at 14:45
greptile-apps bot (Contributor) left a comment


PR Summary

This PR implements dynamic chunk size calculation for PostgreSQL data imports to prevent out-of-memory errors when processing tables with wide rows containing large JSON fields.

  • Added DEFAULT_TABLE_SIZE_BYTES constant (150MB) in posthog/temporal/data_imports/pipelines/sql_database/settings.py as a ceiling for chunk memory usage
  • Implemented _get_table_chunk_size function in posthog/temporal/data_imports/pipelines/postgres/postgres.py that samples 100 rows to calculate average row size
  • Added logic to determine optimal chunk size by dividing the 150MB ceiling by the average row size
  • Implemented fallback to DEFAULT_CHUNK_SIZE (20,000) when the calculation fails or for smaller tables (sketched below)
  • Added detailed logging to track row size calculations and chosen chunk sizes
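
As a rough illustration of how the pieces listed above might fit together, here is a hedged usage sketch. The import paths follow the files named in the summary, but the signature of `_get_table_chunk_size` and the helper `resolve_chunk_size` are assumptions, not the PR's exact code:

```python
import logging
from typing import Any

# Import paths follow the files named in the summary; the exact exported
# names are assumptions for the purpose of this sketch.
from posthog.temporal.data_imports.pipelines.sql_database.settings import (
    DEFAULT_CHUNK_SIZE,
    DEFAULT_TABLE_SIZE_BYTES,
)
from posthog.temporal.data_imports.pipelines.postgres.postgres import _get_table_chunk_size

logger = logging.getLogger(__name__)


def resolve_chunk_size(cursor: Any, schema: str, table: str) -> int:
    """Pick a chunk size for a sync, falling back to the default if sampling fails."""
    try:
        chunk_size = _get_table_chunk_size(cursor, schema, table)
    except Exception:
        # Per the summary, any failure in the calculation falls back to the default.
        logger.exception("Chunk size calculation failed for %s.%s; using default", schema, table)
        return DEFAULT_CHUNK_SIZE

    # Logging mirrors the "detailed logging" bullet above.
    logger.info(
        "Using chunk size %d for %s.%s (ceiling=%d bytes, default=%d rows)",
        chunk_size, schema, table, DEFAULT_TABLE_SIZE_BYTES, DEFAULT_CHUNK_SIZE,
    )
    return chunk_size
```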

2 files reviewed, 1 comment
