
SNOW-502111: fetch_pandas_all returns ValueError: Found non-unique column index #970

Closed
mbkupfer opened this issue Nov 15, 2021 · 13 comments

@mbkupfer

Please answer these questions before submitting your issue. Thanks!

  1. What version of Python are you using?

Python 3.8.10

  2. What operating system and processor architecture are you using?

Linux-5.10.16.3-microsoft-standard-WSL2-x86_64-with-glibc2.29

  3. What are the component versions in the environment (pip freeze)?

anyio==3.3.4
argon2-cffi==21.1.0
asn1crypto==1.4.0
astroid==2.8.0
attrs==19.3.0
Automat==0.8.0
azure-common==1.1.27
azure-core==1.15.0
azure-storage-blob==12.8.1
Babel==2.9.1
backcall==0.2.0
black==21.9b0
bleach==4.0.0
blinker==1.4
boto3==1.17.98
botocore==1.20.98
certifi==2021.5.30
cffi==1.14.5
chardet==4.0.0
charset-normalizer==2.0.7
click==8.0.1
cloud-init==21.3
cloudpickle==2.0.0
colorama==0.4.3
command-not-found==0.3
configobj==5.0.6
constantly==15.1.0
cryptography==3.3.2
cycler==0.10.0
dask==2021.10.0
dbus-python==1.2.16
debugpy==1.5.1
decorator==5.0.9
defusedxml==0.7.1
distributed==2021.10.0
distro==1.4.0
distro-info===0.23ubuntu1
entrypoints==0.3
et-xmlfile==1.1.0
fsspec==2021.10.1
greenlet==1.1.1
HeapDict==1.0.1
httplib2==0.14.0
hyperlink==19.0.0
idna==2.10
importlib-metadata==1.5.0
incremental==16.10.1
ipykernel==6.4.2
ipython==7.26.0
ipython-genutils==0.2.0
isodate==0.6.0
isort==5.9.3
jedi==0.18.0
Jinja2==2.10.1
jmespath==0.10.0
joblib==1.1.0
json5==0.9.6
jsonpatch==1.22
jsonpointer==2.0
jsonschema==3.2.0
jupyter-client==7.0.0
jupyter-core==4.7.1
jupyter-server==1.11.1
jupyterlab==3.2.1
jupyterlab-pygments==0.1.2
jupyterlab-server==2.8.2
keyring==18.0.1
kiwisolver==1.3.2
language-selector==0.1
launchpadlib==1.10.13
lazr.restfulclient==0.14.2
lazr.uri==1.0.3
lazy-object-proxy==1.6.0
locket==0.2.1
MarkupSafe==1.1.0
matplotlib==3.4.3
matplotlib-inline==0.1.2
mccabe==0.6.1
mistune==0.8.4
more-itertools==4.2.0
msgpack==1.0.2
msrest==0.6.21
mypy-extensions==0.4.3
nbclassic==0.3.3
nbclient==0.5.4
nbconvert==6.1.0
nbformat==5.1.3
nest-asyncio==1.5.1
netifaces==0.10.4
notebook==6.4.5
numpy==1.21.1
oauthlib==3.1.1
openpyxl==3.0.7
oscrypto==1.2.1
packaging==21.0
pandas==1.3.1
pandocfilters==1.4.3
parso==0.8.2
partd==1.2.0
pathspec==0.9.0
pexpect==4.6.0
pickleshare==0.7.5
Pillow==8.4.0
platformdirs==2.3.0
prometheus-client==0.11.0
prompt-toolkit==3.0.19
psutil==5.8.0
ptyprocess==0.7.0
pyarrow==5.0.0
pyasn1==0.4.2
pyasn1-modules==0.2.1
pycparser==2.20
pycryptodomex==3.10.1
Pygments==2.9.0
PyGObject==3.36.0
PyHamcrest==1.9.0
PyJWT==2.1.0
pylint==2.11.1
pylint-venv==2.1.1
pymacaroons==0.13.0
PyNaCl==1.3.0
pyOpenSSL==21.0.0
pyparsing==2.4.7
pyrsistent==0.15.5
pyserial==3.4
python-apt==2.0.0+ubuntu0.20.4.6
python-dateutil==2.8.1
python-debian===0.1.36ubuntu1
pytz==2021.1
PyYAML==5.3.1
pyzmq==22.2.1
regex==2021.8.28
requests==2.25.1
requests-oauthlib==1.3.0
requests-unixsocket==0.2.0
s3transfer==0.4.2
scikit-learn==1.0.1
scipy==1.7.1
seaborn==0.11.2
SecretStorage==2.3.1
Send2Trash==1.8.0
service-identity==18.1.0
simplejson==3.16.0
six==1.16.0
sniffio==1.2.0
snowflake-connector-python==2.7.0
snowflake-sqlalchemy==1.3.2
sortedcontainers==2.4.0
sos==4.1
SQLAlchemy==1.4.22
ssh-import-id==5.10
systemd-python==234
tblib==1.7.0
terminado==0.12.1
testpath==0.5.0
threadpoolctl==3.0.0
toml==0.10.2
tomli==1.2.1
toolz==0.11.1
tornado==6.1
traitlets==5.0.5
Twisted==18.9.0
typing-extensions==3.10.0.2
ubuntu-advantage-tools==27.2
ufw==0.36
unattended-upgrades==0.1
urllib3==1.26.5
wadllib==1.3.3
wcwidth==0.2.5
webencodings==0.5.1
websocket-client==1.2.1
wrapt==1.12.1
zict==2.0.0
zipp==1.0.0
zope.interface==4.7.1

  4. What did you do?

So sorry, but I'm working with proprietary data and really have no idea how to recreate this using dummy data.

  5. What did you expect to see?

Should have gotten a dataframe.

  6. Can you set logging to DEBUG and collect the logs?

I tried this and it looked like a lot of identifiable information was in there. Happy to DM, but I don't want to make those logs public.

@github-actions github-actions bot changed the title fetch_pandas_all returns ValueError: Found non-unique column index SNOW-502111: fetch_pandas_all returns ValueError: Found non-unique column index Nov 15, 2021
@mbkupfer
Author

OK, I can actually recreate the problem now that I understand it; see below.

sql = "select 'a', 'a'"
cur.execute(sql)
df = cur.fetch_pandas_all()

Is there a way to avoid an error being thrown, as this is in fact valid SQL? For instance, could we just delete duplicated columns and issue a warning that some columns were redundant and couldn't be handled by pyarrow?
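
For readers without a Snowflake connection handy, a minimal sketch that reproduces the same failure at the pyarrow layer (assuming pyarrow 5.0.0 as pinned above; the column names are placeholders):

# Minimal repro at the pyarrow layer, no Snowflake needed.
# Assumes pyarrow 5.0.0, as pinned in the environment above.
import pyarrow as pa

table = pa.Table.from_arrays([pa.array(["a"]), pa.array(["a"])], names=["A", "A"])
df = table.to_pandas()  # raises ValueError: Found non-unique column index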

@sfc-gh-mkeller
Collaborator

While the SQL is valid, it looks like pyarrow.Table.to_pandas deliberately doesn't support duplicate column names.
You could give the columns distinct names in your SQL.

I'd be against issuing warnings and de-duping columns, as this could lead to some users losing information (unknowingly, if they have warnings disabled).
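
A minimal sketch of that workaround applied to the repro above (the aliases col_a and col_b are arbitrary names, not anything the connector requires):

# Aliasing the duplicated expressions gives pyarrow unique column names.
# Snowflake upper-cases the unquoted aliases, so the DataFrame columns
# come back as COL_A and COL_B.
sql = "select 'a' as col_a, 'a' as col_b"
cur.execute(sql)
df = cur.fetch_pandas_all()  # succeeds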

@mbkupfer
Author

The user wouldn't lose any information in that case; it would just throw a runtime error and fail.

What about enhancing the error so that there is more context? E.g., catching ValueError: Found non-unique column index and then reporting the offending columns. Otherwise, it's hard to tell that something as innocuous as a duplicate column name is the reason for the pyarrow stack trace.
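
A hedged sketch of what that enhanced error could look like, written as a wrapper around the conversion rather than a connector patch; to_pandas_with_context is a hypothetical helper, not connector API:

from collections import Counter

import pyarrow as pa

def to_pandas_with_context(table: pa.Table):
    """Hypothetical helper: name the offending columns when conversion fails."""
    try:
        return table.to_pandas()
    except ValueError as exc:
        dupes = [name for name, n in Counter(table.column_names).items() if n > 1]
        if dupes:
            raise ValueError(
                f"result set has duplicate column names {dupes}; "
                "alias them in your SQL before fetching as pandas"
            ) from exc
        raise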

@onerishabh

Hi,

Is anyone working on this issue? I would like to discuss it further and contribute too.

@shacker

shacker commented Jul 21, 2022

This issue comes up when you select * while doing a join on a common column name, a very straightforward use case. It seems like the only way around it is to explicitly request every single column you want back in the select, which can be a pain.

@ibuda

ibuda commented Aug 25, 2022

@shacker I had the same error with select * from, and I can confirm that specifying the column list in the select statement solves the problem.
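
For illustration, a sketch of that workaround on a join; the table and column names below are invented:

# Instead of select *, list the columns explicitly and alias the ones
# that collide. Table and column names here are made up for illustration.
sql = """
select o.id as order_id,
       c.id as customer_id,
       c.name
from orders o
join customers c on o.customer_id = c.id
"""
cur.execute(sql)
df = cur.fetch_pandas_all()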

@Gonzalo933

Any plans to solve this issue in the future? It's really annoying to deal with when working with tables that have a lot of columns.

@sfc-gh-aling
Collaborator

Hey folks, thanks for your feedback.

I'm thinking of two things:

  1. Enhance the current error message: we should be able to catch the error and inspect the pyarrow table's column names to find the duplicates.

  2. Technically we can solve it: we can introduce a new keyword argument to fetch_pandas_all, something like dedup_columns, which defaults to False; if set to True, fetch_pandas_all would do the dedup (see the sketch after this comment). This way we avoid breaking changes while still providing the ability to dedup columns. However, I'm still a bit concerned about the data loss that @sfc-gh-mkeller mentioned. If the flag causes confusion or misunderstanding, and people use it incorrectly and shoot themselves in the foot, then I'd rather not introduce it.

I think we need more information on whether it would lead to real data loss. Do you have samples to share with us showing how your table is defined and what the exact select statement looks like?

On the other hand, the first item is definitely low-hanging fruit we can get done. Is anyone interested in enhancing the error information?
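
A sketch of what the dedup in item 2 might do under the hood, keeping the first occurrence of each name and warning about the rest; dedup_columns is only a proposed argument, and drop_duplicate_columns is a hypothetical helper:

import warnings

import pyarrow as pa

def drop_duplicate_columns(table: pa.Table) -> pa.Table:
    """Hypothetical helper: keep the first occurrence of each column name."""
    seen = set()
    keep = []
    for i, name in enumerate(table.column_names):
        if name in seen:
            warnings.warn(f"dropping duplicate column {name!r} at position {i}")
        else:
            seen.add(name)
            keep.append(i)
    return table.select(keep)  # later duplicates are dropped -- the data-loss concern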

@mbkupfer
Author

mbkupfer commented Apr 5, 2023

To provide some context, as it's been a long time since I submitted this issue: I was encountering this prior to Snowpark being released. Now with Snowpark it seems like this is either much less likely to come up, just from how the API works, or I get a helpful warning if there is a conflict.

My vote would be not to make any changes unless we are 100% sure they wouldn't lead to data loss, and instead let the connector do one thing and do that one thing really well.

@sfc-gh-aalam
Collaborator

Based on @mbkupfer's comments, I am closing this issue for now. If more concerns come up in the future, please consider the above comments by @sfc-gh-aling and @sfc-gh-mkeller. Happy to hear any other ideas/suggestions on how to proceed with this issue.

@breckognize

breckognize commented Oct 12, 2023

I just hit this issue. @mbkupfer, how does Snowpark mitigate it?

My preference would be a new optional flag that resolves column name conflicts by appending '_1', '_2', etc. Then there's no data loss.

In many cases, applications are just showing the results to a user who can deal with the ambiguity.
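
A sketch of that suffixing approach at the pyarrow layer; rename_with_suffixes is a hypothetical helper, not part of the connector:

from collections import Counter

import pyarrow as pa

def rename_with_suffixes(table: pa.Table) -> pa.Table:
    """Hypothetical helper: rename duplicates to name_1, name_2, ..."""
    seen = Counter()
    new_names = []
    for name in table.column_names:
        new_names.append(f"{name}_{seen[name]}" if seen[name] else name)
        seen[name] += 1
    return table.rename_columns(new_names)  # nothing is dropped, so no data loss

Because every column survives under a new name, this variant sidesteps the data-loss objection raised earlier in the thread.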

@mbkupfer
Author

> I just hit this issue. @mbkupfer, how does Snowpark mitigate it?
>
> My preference would be a new optional flag that resolves column name conflicts by appending '_1', '_2', etc. Then there's no data loss.
>
> In many cases, applications are just showing the results to a user who can deal with the ambiguity.

I'd have to see your use case to understand and advise. Happy to take this offline from this thread if you want to DM me personally.

@pblankley

Was there any resolution here to @breckognize's proposal to add _1, _2, etc. to duplicate column names?

That strikes me as the most logical solution, and it also matches the behavior of the SDKs of other vendors like BigQuery.
