-
Notifications
You must be signed in to change notification settings - Fork 9
[pandas] faster alternative to .to_sql() for large uploads
Gord Thompson edited this page Mar 20, 2021
·
5 revisions
If you are using .to_sql()
to upload a large DataFrame and are unhappy with the performance then you may want to consider using a different approach. On a very old (and slow) machine, the following code takes just over 400 seconds (almost 7 minutes) to execute:
from time import perf_counter
import urllib
import pandas as pd
from sqlalchemy import create_engine
accdb_path = r"C:\Users\Public\test\sqlalchemy-access\gord_test.accdb"
connection_string = (
"DRIVER={Microsoft Access Driver (*.mdb, *.accdb)};"
f"DBQ={accdb_path};"
"ExtendedAnsiSQL=1;"
)
connection_uri = f"access+pyodbc:///?odbc_connect={urllib.parse.quote_plus(connection_string)}"
engine = create_engine(connection_uri)
num_cols = 127
num_rows = 1000
data = [tuple(list(range(num_cols))) for i in range(num_rows)]
col_names = [f"col{x:03d}" for x in range(num_cols)]
df = pd.DataFrame(data, columns=col_names)
t0 = perf_counter()
table_name = "bulk_upload_test"
df.to_sql(table_name, engine, index=False, if_exists="append")
print(f"{num_rows} rows inserted in {(perf_counter() - t0):0.1f} seconds")
After doing pip install openpyxl
to enable .to_excel()
and pip install pywin32
to enable COM automation we can bypass .to_sql()
(and therefore SQLAlchemy) completely and the same result is achieved in just under 30 seconds:
import os
from time import perf_counter
import pandas as pd
import win32com.client
accdb_path = r"C:\Users\Public\test\sqlalchemy-access\gord_test.accdb"
num_cols = 127
num_rows = 1000
data = [tuple(list(range(num_cols))) for i in range(num_rows)]
col_names = [f"col{x:03d}" for x in range(num_cols)]
df = pd.DataFrame(data, columns=col_names)
t0 = perf_counter()
table_name = "bulk_upload_test"
xlsx_path = r"C:\Users\Gord\Desktop\bulk_upload_test.xlsx"
df.to_excel(xlsx_path, index=False)
msa = win32com.client.Dispatch("Access.Application")
msa.OpenCurrentDatabase(accdb_path)
acImport = 0
acSpreadsheetTypeExcel12Xml = 10
msa.DoCmd.TransferSpreadsheet(
acImport, acSpreadsheetTypeExcel12Xml, table_name, xlsx_path, True
)
msa.Quit()
os.remove(xlsx_path)
print(f"{num_rows} rows inserted in {(perf_counter() - t0):0.1f} seconds")
The one disadvantage of this approach is that it won't work on machines that only have the Access Database Engine redistributable installed; it requires a full install of Access (although installing the Access runtime might suffice).