feat: give more memory to the Delta Lake workers #271

Merged
merged 1 commit into from
Aug 29, 2023
cumulus_etl/formats/deltalake.py (2 changes: 1 addition & 1 deletion)

```diff
@@ -68,7 +68,7 @@ def initialize_class(cls, root: store.Root) -> None:
         builder = (
             pyspark.sql.SparkSession.builder.appName("cumulus-etl")
             .config("spark.databricks.delta.schema.autoMerge.enabled", "true")
-            .config("spark.driver.memory", "2g")
+            .config("spark.driver.memory", "4g")
             .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
             .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
         )
```
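One detail worth knowing here: `spark.driver.memory` is a JVM-level setting, so it generally has to be supplied before the driver JVM launches, which is why the PR changes the builder config rather than adjusting a live session. Below is a minimal sketch, not the ETL's actual startup code, assuming a plain `pyspark` install and a fresh Python process (the Delta Lake configs are omitted so it runs standalone):

```python
import pyspark.sql

# Mirror the builder pattern from the diff above, keeping only the
# memory setting (Delta Lake catalog/extension configs omitted).
spark = (
    pyspark.sql.SparkSession.builder.appName("memory-check")
    .config("spark.driver.memory", "4g")
    .getOrCreate()
)

# Read the configured value back from the live driver; prints "4g".
print(spark.sparkContext.getConf().get("spark.driver.memory"))

spark.stop()
```

Since `getOrCreate()` reuses any existing session, run this in a fresh process if you want to be sure the larger heap was actually applied rather than merely recorded in the conf.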
docs/performance.md (16 changes: 11 additions & 5 deletions)

```diff
@@ -32,14 +32,20 @@ So the larger it is, the more memory will be used and the larger the output file

 ### Recommended Setup

-Use 16GB of memory with `--batch-size=300000`.
-We've never found it to run out of memory that way, and it should still have some leeway for spikes of memory use.
+We've found `--batch-size=100000` works well for 16GB of memory.
+And `--batch-size=500000` works well for 32GB of memory.
+(You might have expected the number to simply double, but there is overhead.)

-If you are using an AWS EC2 instance, we recommend an `m5.xlarge` instance, which has 16GB.
 Mileage may vary though, depending on how your FHIR data is formatted.
 And some resources are naturally smaller than others.
+While the numbers above are a good rule of thumb,
+experiment to find what works for your environment.

 If you have access to more memory, experiment with larger batch sizes and let us know what works.
 (Running out of memory will look like a sudden closing of the app.
 Docker will just immediately shut the container down when it runs out.)

-If you only have access to less memory, try `--batch-size=100000` with 8GB.
+If you are using an AWS EC2 instance, we recommend an `m5.xlarge` instance for 16GB,
+or `m5.2xlarge` for 32GB.

 ## Disk Consumption
```

**Contributor Author** commented on lines +35 to +36:

> These are what I've been using. 🤷
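The parenthetical about overhead can be made concrete: the two documented data points (100k rows at 16GB, 500k rows at 32GB) happen to fit a simple fixed-overhead model. The sketch below is back-of-envelope arithmetic only; the model, the 12GB figure it produces, and the `rough_batch_size` helper are illustrative and not part of the project, and the doc's own advice to experiment still stands.

```python
# Back-of-envelope: assume rows = k * (mem_gb - overhead_gb) and solve
# using the doc's two data points:
#   100_000 = k * (16 - overhead_gb)
#   500_000 = k * (32 - overhead_gb)
# (The 5 below is the ratio of the two batch sizes, 500k / 100k.)
overhead_gb = (5 * 16 - 32) / (5 - 1)  # = 12 GB of fixed overhead
k = 100_000 / (16 - overhead_gb)       # = 25,000 rows per "usable" GB

def rough_batch_size(mem_gb: float) -> int:
    """Hypothetical starting point only; validate against real runs."""
    return int(k * (mem_gb - overhead_gb))

print(rough_batch_size(16))  # -> 100000
print(rough_batch_size(32))  # -> 500000
```

Under this reading, doubling memory from 16GB to 32GB quintuples the usable headroom, which is consistent with the batch size recommendation jumping from 100k to 500k rather than merely doubling.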