-
Notifications
You must be signed in to change notification settings - Fork 41
Description
What happens?
I have an application where I query a mysql table (via the standard ATTACH method for the mysql extension) and build a local reverse index for fast local lookups. It was impossible for two of my columns to be NULL so I updated my duckdb table schema to mirror the mysql schema.
My tests blew up because I got a None value in my arn column on the first and only the first row. Playing around with the WHERE clause and I was able to get the primary key resource_id to come back as null. Regardless of SELECT ordering or WHERE conditions, lowering the resource_id string length < 64 characters and the data populates just fine.
Here are some other observations:
- I get this result using any of the python interfaces for reading the query result like
fetchone,fetchmany, orarrow. - The bug does NOT reproduce when using the DuckDB CLI!
- When using CLI with 65+ char resource_id: ALL VALUES ARE RETURNED CORRECTLY
- This suggests the bug is specific to the Python extension
- The issue is likely in how the Python bindings handle result sets from MySQL
- See kms_debugging.sh for CLI reproduction test
- mysql column varchar length and character set don't appear to have an effect.
- I messed around with table values a bunch so it doesn't feel like an issue with the table and column catalog.
To Reproduce
My example uses pytest because it was easier to create a fast feedback loop with something like watchexec as I was messing with different values.
uv venv
uv pip install pytest duckdb
watchexec -c -- pytest -vvv kms_debugging.pyThe test is setup with a fixture to setup and teardown the db. It expects that you have a mysql instance running locally with a database named test. I'm running an older version of mysql 8.0.42. I'll try testing against a newer version.
Each test is parameterized to run twice. First with a string 64 characters long and the second 65 characters long which will cause the failure. I kept it this way so it was easy to add or modify those inputs..
I want to draw attention to the third test which demonstrates adding arn IS NOT NULL causes the null column to change.
from __future__ import annotations
import itertools
from dataclasses import dataclass
import duckdb
import pytest
@dataclass
class ExampleRow:
# [0, 64] will pass
# [65, 255] will change the value to None
# [256, ~] will exceed the column size
resource_id: str
region_name: str
arn: str
@classmethod
def from_size(cls, index: int, resource_id_length: int) -> ExampleRow:
return cls(
"".join(itertools.repeat(str(index), resource_id_length)),
"us-east-1",
"arn:aws:kms:us-east-1:123456789012:key/00000000-0000-0000-0000-000000000000",
)
def as_tuple(self) -> tuple[str, str, str]:
return (self.resource_id, self.region_name, self.arn)
@pytest.fixture()
def conn():
# Config
host = "localhost"
port = "3306"
user = "root"
database = "test"
# Setup
conn = duckdb.connect("test.db")
conn.sql(f"ATTACH 'host={host} user={user} port={port} database={database}' AS mysql (TYPE mysql)")
conn.sql("SET mysql_experimental_filter_pushdown = true")
conn.sql("SET mysql_debug_show_queries = true")
conn.sql(""" CALL mysql_execute( "mysql", "DROP TABLE IF EXISTS example" ) """)
conn.sql("""
CALL mysql_execute(
"mysql",
"CREATE TABLE example (
resource_id varchar(255) NOT NULL,
region_name varchar(64) NOT NULL,
arn varchar(1024) DEFAULT NULL
)"
)
""")
yield conn
conn.close()
@pytest.mark.parametrize("resource_id_size", [64, 65])
def test_does_param_binding_matter(conn: duckdb.DuckDBPyConnection, resource_id_size: int):
"""
Does using prepared parameter binding matter? It appears not I get the same incorrect result in both cases.
"""
example_row = ExampleRow.from_size(0, resource_id_size)
conn.execute("INSERT INTO mysql.example VALUES (?, ?, ?)", example_row.as_tuple())
bug = (
example_row.resource_id,
# `arn` will be None if len(resource_id) > 64
example_row.arn,
)
# Test
with_binding = "SELECT resource_id, arn FROM mysql.example WHERE region_name = ?"
assert conn.execute(with_binding, [example_row.region_name]).fetchone() == bug
without_binding = f"SELECT resource_id, arn FROM mysql.example WHERE region_name = '{example_row.region_name}'"
assert conn.sql(without_binding).fetchone() == bug
@pytest.mark.parametrize("resource_id_size", [64, 65])
def test_does_it_work_with_mysql_query(conn: duckdb.DuckDBPyConnection, resource_id_size: int):
"""
If I bypass duckdb and go directly to mysql using `mysql_query` it does give me the correct value
"""
example_row = ExampleRow.from_size(0, resource_id_size)
conn.execute("INSERT INTO mysql.example VALUES (?, ?, ?)", example_row.as_tuple())
query = f"""CALL mysql_query("mysql", "SELECT resource_id, arn FROM example WHERE region_name = '{example_row.region_name}'") """
assert conn.sql(query).fetchone() == (example_row.resource_id, example_row.arn)
@pytest.mark.parametrize("resource_id_size", [64, 65])
def test_select_order(conn: duckdb.DuckDBPyConnection, resource_id_size: int):
"""
Changing the select order, still gives you null resource_id.
"""
example_row = ExampleRow.from_size(0, resource_id_size)
conn.execute("INSERT INTO mysql.example VALUES (?, ?, ?)", example_row.as_tuple())
# SELECT `region_name`, `arn`, `resource_id` FROM `test`.`example` WHERE `region_name` = 'us-east-1';
result = conn.execute(
"SELECT arn, resource_id FROM mysql.example WHERE region_name = ?",
[example_row.region_name],
).fetchone()
assert result == (example_row.arn, example_row.resource_id)
# No filtering but resource_id is still null
# SELECT `resource_id`, `arn` FROM `test`.`example`;
cur = conn.sql("SELECT resource_id, arn FROM mysql.example")
assert cur.fetchone() == (
example_row.resource_id,
example_row.arn,
)
# when you include the `arn` in the where clause the null field becomes the resource id and the arn is now non-null
cur = conn.execute(
"SELECT resource_id, arn FROM mysql.example WHERE region_name = ? AND arn IS NOT NULL",
[example_row.region_name],
)
result = cur.fetchone()
assert result == (
example_row.resource_id,
example_row.arn,
)CLI Script
I've also included a CLI repro example to show this doesn't happen when outside of python.
kms_debugging.sh
OS:
MacOS aarch64
DuckDB Package Version:
1.4.1
Python Version:
3.10.15
Full Name:
Peter Snelgrove
Affiliation:
Rapid7
What is the latest build you tested with? If possible, we recommend testing with the latest nightly build.
I have tested with a stable release
Did you include all relevant data sets for reproducing the issue?
Yes
Did you include all code required to reproduce the issue?
- Yes, I have
Did you include all relevant configuration to reproduce the issue?
- Yes, I have