-
Notifications
You must be signed in to change notification settings - Fork 9
[pandas] faster alternative to .to_sql() for large uploads
Gord Thompson edited this page Jun 8, 2022
·
5 revisions
If you are using .to_sql()
to upload a large DataFrame and are unhappy with the performance then you may want to consider using a different approach. On a very old (and slow) machine, the following code takes about 110 seconds (almost 2 minutes) to execute:
from time import perf_counter
import pandas as pd
from sqlalchemy import create_engine
from sqlalchemy.engine import URL
accdb_path = r"C:\Users\Public\test\sqlalchemy-access\gord_test.accdb"
connection_string = (
"DRIVER={Microsoft Access Driver (*.mdb, *.accdb)};"
f"DBQ={accdb_path};"
"ExtendedAnsiSQL=1;"
)
connection_url = URL.create("access+pyodbc", query={"odbc_connect": connection_string})
engine = create_engine(connection_url)
num_cols = 127
num_rows = 1000
data = [tuple(list(range(num_cols))) for i in range(num_rows)]
col_names = [f"col{x:03d}" for x in range(num_cols)]
df = pd.DataFrame(data, columns=col_names)
t0 = perf_counter()
table_name = "bulk_upload_test"
df.to_sql(table_name, engine, index=False, if_exists="append")
print(f"{num_rows} rows inserted in {(perf_counter() - t0):0.1f} seconds")
After doing pip install openpyxl
to enable .to_excel()
and pip install pywin32
to enable COM automation we can bypass .to_sql()
(and therefore SQLAlchemy) completely and the same result is achieved in just under 8 seconds:
import os
from time import perf_counter
import pandas as pd
import win32com.client
accdb_path = r"C:\Users\Public\test\sqlalchemy-access\gord_test.accdb"
num_cols = 127
num_rows = 1000
data = [tuple(list(range(num_cols))) for i in range(num_rows)]
col_names = [f"col{x:03d}" for x in range(num_cols)]
df = pd.DataFrame(data, columns=col_names)
t0 = perf_counter()
table_name = "bulk_upload_test"
xlsx_path = r"C:\Users\Gord\Desktop\bulk_upload_test.xlsx"
df.to_excel(xlsx_path, index=False)
con = win32com.client.Dispatch("ADODB.Connection")
con.Open(
"Provider=Microsoft.ACE.OLEDB.12.0;"
f"Data Source={accdb_path};"
)
sql = f"""\
INSERT INTO [{table_name}]
SELECT * FROM [Sheet1$] IN \"{xlsx_path}\"'Excel 12.0 Macro;HDR=Yes'
"""
con.Execute(sql)
os.remove(xlsx_path)
print(f"{num_rows} rows inserted in {(perf_counter() - t0):0.1f} seconds")
(Workaround adapted from this Stack Overflow answer.)