
Enable Multithreading on msgpack Chunking in BulkImportWriter #142

Merged
3 commits merged into master from feature/multi_threaded_chunking on Dec 6, 2024

Conversation

@DavidLandup0 (Contributor) commented Dec 5, 2024

Similar to how multithreading is already used for the msgpack temp file uploads, this PR enables multithreading in the chunking step itself, using the same number of max workers as the uploads.
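
Conceptually, the change fans the per-chunk msgpack serialization out over a thread pool, the same way the temp file uploads already do. Below is a minimal sketch of the idea, not the actual BulkImportWriter code; the chunk splitting and the write_chunk_to_msgpack helper are hypothetical stand-ins:

from concurrent.futures import ThreadPoolExecutor

import pandas as pd


def write_chunk_to_msgpack(chunk: pd.DataFrame) -> str:
    # Hypothetical stand-in for the per-chunk msgpack serialization that
    # BulkImportWriter performs; would return the path of the temp file written.
    ...


def chunk_dataframe_parallel(df: pd.DataFrame, chunk_size: int, max_workers: int) -> list:
    # Split the frame into row-wise chunks, then serialize them concurrently,
    # reusing the same max_workers count as the uploads.
    chunks = [df.iloc[i : i + chunk_size] for i in range(0, len(df), chunk_size)]
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # executor.map preserves chunk order, so the upload order stays stable.
        return list(executor.map(write_chunk_to_msgpack, chunks))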

Benchmark Script

import pytd
import os
import numpy as np
import pandas as pd

API_KEY = "xxxxxxxxxxxxxxxx"
ENDPOINT = "xxxxxxxxxxxxx"

os.environ["TD_PRESTO_API"] = "xxxxxxxxxxxx"
os.environ["TD_API_KEY"] = API_KEY
os.environ["TD_API_SERVER"] = ENDPOINT

num_users = 100_000_000

# create a dataframe with 100M rows (num_users) and a few numeric columns
users = np.random.randint(1, num_users, num_users)
recency = np.random.randint(0, 365, num_users)
frequency = np.random.randint(1, 10, num_users)
monetary_value = np.random.randint(1, 1000, num_users)
df = pd.DataFrame({"user": users, "recency": recency, "frequency": frequency, "monetary_value": monetary_value})

client = pytd.Client(database="some_db", retry_post_requests=True, endpoint=ENDPOINT)
table = client.api_client.table("some_db", "some_table")
client.create_database_if_not_exists("some_db")

client.load_table_from_dataframe(
    df,
    "some_db.some_table",
    writer="bulk_import",
    if_exists="overwrite",
    fmt="msgpack",
    keep_list=True,
    max_workers=64
)

Script Results

Progress Bar - 100M rows

The script works nicely with the show_progress flag from #141, since it makes the individual steps easy to observe.
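
For reference, the progress bars below come from the show_progress flag added in #141. Assuming that flag is exposed as a keyword argument on load_table_from_dataframe (the exact plumbing may differ), the benchmark call would look like this:

client.load_table_from_dataframe(
    df,
    "some_db.some_table",
    writer="bulk_import",
    if_exists="overwrite",
    fmt="msgpack",
    keep_list=True,
    max_workers=64,
    show_progress=True,  # assumed kwarg name, per the show_progress flag from #141
)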

On main + feature/show_progress_bulk_import:

python script.py
...
Chunking into msgpack: 100%|██████████| 200/200 [11:38<00:00,  3.49s/it]
...

On feature/multi_threaded_chunking + feature/show_progress_bulk_import:

python script.py
...
Chunking into msgpack: 100%|██████████| 200/200 [00:59<00:00,  3.38it/s]
...

No Progress Bar - 1M rows

Without the progress bar, the variable duration of the "performing a bulk import job" step makes the direct effect of this PR harder to observe: one can only measure the end-to-end time holistically, so the confounding variable (the import time itself) is not accounted for.

On main:

time python script.py
# add...

On feature/multi_threaded_chunking:

time python script.py
# add...
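
As an alternative to timing the whole script, the load call can be timed directly from Python. This is a minimal sketch that reuses the client and df already defined in the benchmark script; note it still measures end-to-end time, so the bulk import step is not factored out:

import time

start = time.perf_counter()
client.load_table_from_dataframe(
    df,
    "some_db.some_table",
    writer="bulk_import",
    if_exists="overwrite",
    fmt="msgpack",
    keep_list=True,
    max_workers=64,
)
# End-to-end duration, including the bulk import job itself.
print(f"load_table_from_dataframe took {time.perf_counter() - start:.1f}s")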

On my local environment:

  • Decreased chunking time from ~55min to ~5min for a 500M row dataframe
  • Similarly, decreased the chunking time from ~10min to ~1min for a 100M row dataframe
  • No effect on small datasets (such as 10k rows), since they fit into a single chunk by default

P.S. The resulting table in the database appears to have no side effects. I'll double-check tomorrow morning with fresh eyes that the results are exactly as they should be.

@DavidLandup0 requested a review from chezou on December 5, 2024, 08:34
@chezou (Member) commented Dec 5, 2024

@DavidLandup0 Thanks for the additional optimization! It's amazing

@tung-vu-td (Contributor) left a comment

LGTM

@DavidLandup0 merged commit 7357c4e into master on Dec 6, 2024
15 checks passed