Skip to content

[pandas] faster alternative to .to_sql() for large uploads

Gord Thompson edited this page Jun 8, 2022 · 5 revisions

If you are using .to_sql() to upload a large DataFrame and are unhappy with the performance then you may want to consider using a different approach. On a very old (and slow) machine, the following code takes about 110 seconds (almost 2 minutes) to execute:

from time import perf_counter

import pandas as pd
from sqlalchemy import create_engine
from sqlalchemy.engine import URL

accdb_path = r"C:\Users\Public\test\sqlalchemy-access\gord_test.accdb"
connection_string = (
    "DRIVER={Microsoft Access Driver (*.mdb, *.accdb)};"
    f"DBQ={accdb_path};"
    "ExtendedAnsiSQL=1;"
)
connection_url = URL.create("access+pyodbc", query={"odbc_connect": connection_string})
engine = create_engine(connection_url)

num_cols = 127
num_rows = 1000
data = [tuple(list(range(num_cols))) for i in range(num_rows)]

col_names = [f"col{x:03d}" for x in range(num_cols)]

df = pd.DataFrame(data, columns=col_names)
t0 = perf_counter()

table_name = "bulk_upload_test"

df.to_sql(table_name, engine, index=False, if_exists="append")

print(f"{num_rows} rows inserted in {(perf_counter() - t0):0.1f} seconds")

After doing pip install openpyxl to enable .to_excel() and pip install pywin32 to enable COM automation we can bypass .to_sql() (and therefore SQLAlchemy) completely and the same result is achieved in just under 8 seconds:

import os
from time import perf_counter

import pandas as pd
import win32com.client

accdb_path = r"C:\Users\Public\test\sqlalchemy-access\gord_test.accdb"

num_cols = 127
num_rows = 1000
data = [tuple(list(range(num_cols))) for i in range(num_rows)]

col_names = [f"col{x:03d}" for x in range(num_cols)]

df = pd.DataFrame(data, columns=col_names)
t0 = perf_counter()

table_name = "bulk_upload_test"

xlsx_path = r"C:\Users\Gord\Desktop\bulk_upload_test.xlsx"
df.to_excel(xlsx_path, index=False)
con = win32com.client.Dispatch("ADODB.Connection")
con.Open(
    "Provider=Microsoft.ACE.OLEDB.12.0;"
    f"Data Source={accdb_path};"
)
sql = f"""\
INSERT INTO [{table_name}]
SELECT * FROM [Sheet1$] IN \"{xlsx_path}\"'Excel 12.0 Macro;HDR=Yes'
"""
con.Execute(sql)
os.remove(xlsx_path)

print(f"{num_rows} rows inserted in {(perf_counter() - t0):0.1f} seconds")

(Workaround adapted from this Stack Overflow answer.)