Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bug: Sampling on MSSQL Table #10142

Closed
1 task done
JonaGeishauser opened this issue Sep 16, 2024 · 1 comment · Fixed by #10173
Closed
1 task done

bug: Sampling on MSSQL Table #10142

JonaGeishauser opened this issue Sep 16, 2024 · 1 comment · Fixed by #10173
Assignees
Labels
bug Incorrect behavior inside of ibis mssql The Microsoft SQL Server backend
Milestone

Comments

@JonaGeishauser
Copy link

JonaGeishauser commented Sep 16, 2024

What happened?

Sampling on a MSSQL table seems to use the passed fraction to either return the entire table data or 0 rows.
I did some local testing against a local Postgres DB, an in memory duckdb instance and an AzureSQL DB.

I ran

test = [len(table.sample(0.9).to_polars()) for _ in range(99)]
print(test)
print(all(x in {0, 1000} for x in test))

Against a table in my AzureSQL DB and against a table in my local Postgres DB. Both tables containing exactly 1000 rows.

AzureSQL result:

>>> [1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 0, 1000, 0, 1000, 0, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 0, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 0, 1000, 1000, 1000, 1000, 0, 1000, 1000, 1000, 1000, 0, 1000, 1000, 1000, 1000, 1000, 0, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 0, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 0, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000]

>>> True

Postgres result:

>>> [893, 896, 901, 899, 895, 905, 895, 889, 897, 905, 902, 877, 905, 886, 914, 910, 904, 907, 919, 899, 898, 905, 897, 885, 890, 898, 885, 912, 885, 907, 909, 899, 900, 910, 896, 906, 919, 896, 895, 892, 896, 920, 899, 890, 892, 898, 914, 891, 900, 893, 901, 927, 892, 882, 896, 900, 896, 891, 910, 892, 919, 908, 899, 913, 910, 903, 904, 914, 917, 896, 868, 894, 900, 898, 906, 910, 897, 903, 891, 892, 903, 904, 908, 897, 910, 886, 887, 915, 907, 901, 894, 886, 912, 900, 922, 885, 894, 899, 882]

>>> False

The result makes it clear that for mssql backends the sampling either returns the entire data or no data at all while, for other backends (only verified postgres and duckdb) the sampling behaves as expected: using the fraction to return a sample of the data

What version of ibis are you using?

9.4.0

What backend(s) are you using, if any?

MSSQL

Relevant log output

No response

Code of Conduct

  • I agree to follow this project's Code of Conduct
@JonaGeishauser JonaGeishauser added the bug Incorrect behavior inside of ibis label Sep 16, 2024
@jcrist
Copy link
Member

jcrist commented Sep 19, 2024

Thanks for opening this, this is indeed a bug in our mssql implementation. Apparently in mssql, multiple calls to RAND in a single query will all result in the same value. See https://web.archive.org/web/20110829015850/http://blogs.lessthandot.com/index.php/DataMgmt/DataDesign/sql-server-set-based-random-numbers. In ibis, we expect ibis.random() to return different results for each call in a query (or each row), and the default implementation of sample() is based on random(). The proper fix here is to fixup how we implement random for mssql, which will also fix sample. Will take care of this.

@jcrist jcrist self-assigned this Sep 19, 2024
@jcrist jcrist added the mssql The Microsoft SQL Server backend label Sep 19, 2024
@github-project-automation github-project-automation bot moved this from backlog to done in Ibis planning and roadmap Sep 19, 2024
@github-actions github-actions bot added this to the 10.0 milestone Sep 19, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Incorrect behavior inside of ibis mssql The Microsoft SQL Server backend
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

2 participants