VARCHAR column converted to None when length of string > 64 characters #184

@psnelgrove-r7

Description

What happens?

I have an application where I query a MySQL table (via the standard ATTACH method for the mysql extension) and build a local reverse index for fast local lookups. Two of my columns can never be NULL, so I updated my DuckDB table schema to mirror the MySQL schema.

My tests blew up because I got a None value in my arn column on the first, and only the first, row. Playing around with the WHERE clause, I was also able to get the primary key resource_id to come back as NULL. Regardless of SELECT ordering or WHERE conditions, lowering the resource_id string length to 64 characters or fewer makes the data populate just fine.

Here are some other observations:

  • I get this result using any of the python interfaces for reading the query result like fetchone, fetchmany, or arrow.
  • The bug does NOT reproduce when using the DuckDB CLI!
    • When using CLI with 65+ char resource_id: ALL VALUES ARE RETURNED CORRECTLY
    • This suggests the bug is specific to the Python extension
    • The issue is likely in how the Python bindings handle result sets from MySQL
    • See kms_debugging.sh for CLI reproduction test
  • mysql column varchar length and character set don't appear to have an effect.
  • I messed around with table values a bunch so it doesn't feel like an issue with the table and column catalog.
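Schematically, the symptom can be stated like this (illustrative tuples, not real query output; the arn value mirrors the test data below):

```python
# Illustrative restatement of the bug, not real query output.
arn = "arn:aws:kms:us-east-1:123456789012:key/00000000-0000-0000-0000-000000000000"

ok_row = ("0" * 64, arn)    # len(resource_id) <= 64: row fetched correctly
bad_row = ("0" * 65, None)  # len(resource_id) >= 65: arn comes back as None
```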

To Reproduce

My example uses pytest because it was easier to create a fast feedback loop with something like watchexec as I was messing with different values.

uv venv
uv pip install pytest duckdb
watchexec -c -- pytest -vvv kms_debugging.py

The test is set up with a fixture to set up and tear down the db. It expects that you have a MySQL instance running locally with a database named test. I'm running an older MySQL, 8.0.42; I'll try testing against a newer version.

Each test is parameterized to run twice: first with a string 64 characters long, then with one 65 characters long, which causes the failure. I kept it this way so it is easy to add or modify those inputs.
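The parameterized inputs are built the same way as `ExampleRow.from_size` in the test file below: repeat the index digit until the target length is reached. A minimal sketch of that construction:

```python
import itertools

# Same construction as ExampleRow.from_size: repeat the index digit `length` times.
def make_resource_id(index: int, length: int) -> str:
    return "".join(itertools.repeat(str(index), length))

print(make_resource_id(0, 3))        # → "000"
print(len(make_resource_id(0, 65)))  # → 65 (the failing case)
```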

I want to draw attention to the third test, which demonstrates that adding arn IS NOT NULL changes which column comes back NULL.

from __future__ import annotations

import itertools
from dataclasses import dataclass

import duckdb
import pytest


@dataclass
class ExampleRow:
    # [0, 64] will pass
    # [65, 255] will change the value to None
    # [256, ~] will exceed the column size
    resource_id: str
    region_name: str
    arn: str

    @classmethod
    def from_size(cls, index: int, resource_id_length: int) -> ExampleRow:
        return cls(
            "".join(itertools.repeat(str(index), resource_id_length)),
            "us-east-1",
            "arn:aws:kms:us-east-1:123456789012:key/00000000-0000-0000-0000-000000000000",
        )

    def as_tuple(self) -> tuple[str, str, str]:
        return (self.resource_id, self.region_name, self.arn)


@pytest.fixture()
def conn():
    # Config
    host = "localhost"
    port = "3306"
    user = "root"
    database = "test"
    # Setup
    conn = duckdb.connect("test.db")
    conn.sql(f"ATTACH 'host={host} user={user} port={port} database={database}' AS mysql (TYPE mysql)")
    conn.sql("SET mysql_experimental_filter_pushdown = true")
    conn.sql("SET mysql_debug_show_queries = true")
    conn.sql(""" CALL mysql_execute( "mysql", "DROP TABLE IF EXISTS example" ) """)
    conn.sql("""
        CALL mysql_execute(
            "mysql",
            "CREATE TABLE example (
                resource_id varchar(255) NOT NULL,
                region_name varchar(64) NOT NULL,
                arn varchar(1024) DEFAULT NULL
            )"
      )
    """)
    yield conn
    conn.close()


@pytest.mark.parametrize("resource_id_size", [64, 65])
def test_does_param_binding_matter(conn: duckdb.DuckDBPyConnection, resource_id_size: int):
    """
    Does using prepared parameter binding matter? It appears not: I get the same incorrect result in both cases.
    """
    example_row = ExampleRow.from_size(0, resource_id_size)
    conn.execute("INSERT INTO mysql.example VALUES (?, ?, ?)", example_row.as_tuple())
    bug = (
        example_row.resource_id,
        # `arn` will be None if len(resource_id) > 64
        example_row.arn,
    )
    # Test
    with_binding = "SELECT resource_id, arn FROM mysql.example WHERE region_name = ?"
    assert conn.execute(with_binding, [example_row.region_name]).fetchone() == bug
    without_binding = f"SELECT resource_id, arn FROM mysql.example WHERE region_name = '{example_row.region_name}'"
    assert conn.sql(without_binding).fetchone() == bug


@pytest.mark.parametrize("resource_id_size", [64, 65])
def test_does_it_work_with_mysql_query(conn: duckdb.DuckDBPyConnection, resource_id_size: int):
    """
    If I bypass duckdb and go directly to MySQL using `mysql_query`, it gives me the correct value.
    """
    example_row = ExampleRow.from_size(0, resource_id_size)
    conn.execute("INSERT INTO mysql.example VALUES (?, ?, ?)", example_row.as_tuple())
    query = f"""CALL mysql_query("mysql", "SELECT resource_id, arn FROM example WHERE region_name = '{example_row.region_name}'") """
    assert conn.sql(query).fetchone() == (example_row.resource_id, example_row.arn)


@pytest.mark.parametrize("resource_id_size", [64, 65])
def test_select_order(conn: duckdb.DuckDBPyConnection, resource_id_size: int):
    """
    Changing the select order still gives you a NULL resource_id.
    """
    example_row = ExampleRow.from_size(0, resource_id_size)
    conn.execute("INSERT INTO mysql.example VALUES (?, ?, ?)", example_row.as_tuple())
    # SELECT `region_name`, `arn`, `resource_id` FROM `test`.`example` WHERE `region_name` = 'us-east-1';
    result = conn.execute(
        "SELECT arn, resource_id FROM mysql.example WHERE region_name = ?",
        [example_row.region_name],
    ).fetchone()
    assert result == (example_row.arn, example_row.resource_id)
    # No filtering but resource_id is still null
    # SELECT `resource_id`, `arn` FROM `test`.`example`;
    cur = conn.sql("SELECT resource_id, arn FROM mysql.example")
    assert cur.fetchone() == (
        example_row.resource_id,
        example_row.arn,
    )

    # When you include `arn` in the WHERE clause, the NULL field becomes resource_id and arn is now non-NULL
    cur = conn.execute(
        "SELECT resource_id, arn FROM mysql.example WHERE region_name = ? AND arn IS NOT NULL",
        [example_row.region_name],
    )
    result = cur.fetchone()
    assert result == (
        example_row.resource_id,
        example_row.arn,
    )

CLI Script

I've also included a CLI repro example to show this doesn't happen outside of Python.
kms_debugging.sh

OS:

macOS aarch64

DuckDB Package Version:

1.4.1

Python Version:

3.10.15

Full Name:

Peter Snelgrove

Affiliation:

Rapid7

What is the latest build you tested with? If possible, we recommend testing with the latest nightly build.

I have tested with a stable release

Did you include all relevant data sets for reproducing the issue?

Yes

Did you include all code required to reproduce the issue?

  • Yes, I have

Did you include all relevant configuration to reproduce the issue?

  • Yes, I have
