
SNOW-502111: fetch_pandas_all returns ValueError: Found non-unique column index #970

Closed
mbkupfer opened this issue Nov 15, 2021 · 13 comments

@mbkupfer

Please answer these questions before submitting your issue. Thanks!

  1. What version of Python are you using?

Python 3.8.10

  2. What operating system and processor architecture are you using?

Linux-5.10.16.3-microsoft-standard-WSL2-x86_64-with-glibc2.29

  3. What are the component versions in the environment (pip freeze)?

anyio==3.3.4
argon2-cffi==21.1.0
asn1crypto==1.4.0
astroid==2.8.0
attrs==19.3.0
Automat==0.8.0
azure-common==1.1.27
azure-core==1.15.0
azure-storage-blob==12.8.1
Babel==2.9.1
backcall==0.2.0
black==21.9b0
bleach==4.0.0
blinker==1.4
boto3==1.17.98
botocore==1.20.98
certifi==2021.5.30
cffi==1.14.5
chardet==4.0.0
charset-normalizer==2.0.7
click==8.0.1
cloud-init==21.3
cloudpickle==2.0.0
colorama==0.4.3
command-not-found==0.3
configobj==5.0.6
constantly==15.1.0
cryptography==3.3.2
cycler==0.10.0
dask==2021.10.0
dbus-python==1.2.16
debugpy==1.5.1
decorator==5.0.9
defusedxml==0.7.1
distributed==2021.10.0
distro==1.4.0
distro-info===0.23ubuntu1
entrypoints==0.3
et-xmlfile==1.1.0
fsspec==2021.10.1
greenlet==1.1.1
HeapDict==1.0.1
httplib2==0.14.0
hyperlink==19.0.0
idna==2.10
importlib-metadata==1.5.0
incremental==16.10.1
ipykernel==6.4.2
ipython==7.26.0
ipython-genutils==0.2.0
isodate==0.6.0
isort==5.9.3
jedi==0.18.0
Jinja2==2.10.1
jmespath==0.10.0
joblib==1.1.0
json5==0.9.6
jsonpatch==1.22
jsonpointer==2.0
jsonschema==3.2.0
jupyter-client==7.0.0
jupyter-core==4.7.1
jupyter-server==1.11.1
jupyterlab==3.2.1
jupyterlab-pygments==0.1.2
jupyterlab-server==2.8.2
keyring==18.0.1
kiwisolver==1.3.2
language-selector==0.1
launchpadlib==1.10.13
lazr.restfulclient==0.14.2
lazr.uri==1.0.3
lazy-object-proxy==1.6.0
locket==0.2.1
MarkupSafe==1.1.0
matplotlib==3.4.3
matplotlib-inline==0.1.2
mccabe==0.6.1
mistune==0.8.4
more-itertools==4.2.0
msgpack==1.0.2
msrest==0.6.21
mypy-extensions==0.4.3
nbclassic==0.3.3
nbclient==0.5.4
nbconvert==6.1.0
nbformat==5.1.3
nest-asyncio==1.5.1
netifaces==0.10.4
notebook==6.4.5
numpy==1.21.1
oauthlib==3.1.1
openpyxl==3.0.7
oscrypto==1.2.1
packaging==21.0
pandas==1.3.1
pandocfilters==1.4.3
parso==0.8.2
partd==1.2.0
pathspec==0.9.0
pexpect==4.6.0
pickleshare==0.7.5
Pillow==8.4.0
platformdirs==2.3.0
prometheus-client==0.11.0
prompt-toolkit==3.0.19
psutil==5.8.0
ptyprocess==0.7.0
pyarrow==5.0.0
pyasn1==0.4.2
pyasn1-modules==0.2.1
pycparser==2.20
pycryptodomex==3.10.1
Pygments==2.9.0
PyGObject==3.36.0
PyHamcrest==1.9.0
PyJWT==2.1.0
pylint==2.11.1
pylint-venv==2.1.1
pymacaroons==0.13.0
PyNaCl==1.3.0
pyOpenSSL==21.0.0
pyparsing==2.4.7
pyrsistent==0.15.5
pyserial==3.4
python-apt==2.0.0+ubuntu0.20.4.6
python-dateutil==2.8.1
python-debian===0.1.36ubuntu1
pytz==2021.1
PyYAML==5.3.1
pyzmq==22.2.1
regex==2021.8.28
requests==2.25.1
requests-oauthlib==1.3.0
requests-unixsocket==0.2.0
s3transfer==0.4.2
scikit-learn==1.0.1
scipy==1.7.1
seaborn==0.11.2
SecretStorage==2.3.1
Send2Trash==1.8.0
service-identity==18.1.0
simplejson==3.16.0
six==1.16.0
sniffio==1.2.0
snowflake-connector-python==2.7.0
snowflake-sqlalchemy==1.3.2
sortedcontainers==2.4.0
sos==4.1
SQLAlchemy==1.4.22
ssh-import-id==5.10
systemd-python==234
tblib==1.7.0
terminado==0.12.1
testpath==0.5.0
threadpoolctl==3.0.0
toml==0.10.2
tomli==1.2.1
toolz==0.11.1
tornado==6.1
traitlets==5.0.5
Twisted==18.9.0
typing-extensions==3.10.0.2
ubuntu-advantage-tools==27.2
ufw==0.36
unattended-upgrades==0.1
urllib3==1.26.5
wadllib==1.3.3
wcwidth==0.2.5
webencodings==0.5.1
websocket-client==1.2.1
wrapt==1.12.1
zict==2.0.0
zipp==1.0.0
zope.interface==4.7.1

  4. What did you do?

So sorry, but I'm working with proprietary data and really have no idea how to recreate this using dummy data.

  5. What did you expect to see?

Should have gotten a dataframe.

  6. Can you set logging to DEBUG and collect the logs?

I tried this and it looked like a lot of identifiable information was in there. Happy to DM, but I don't want to make those logs public.

@github-actions github-actions bot changed the title fetch_pandas_all returns ValueError: Found non-unique column index SNOW-502111: fetch_pandas_all returns ValueError: Found non-unique column index Nov 15, 2021
@mbkupfer
Author

OK, I can actually recreate the problem now that I understand it; see below.

sql = "select 'a', 'a'"
cur.execute(sql)
df = cur.fetch_pandas_all()

Is there a way to avoid an error being thrown, as this is in fact valid SQL? For instance, could we just delete duplicated columns and issue a warning that some columns were redundant and couldn't be handled by pyarrow?
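
For readers without a Snowflake connection handy, a minimal sketch that reproduces the same failure at the pyarrow layer (assuming pyarrow 5.0.0 as pinned above; the column names are placeholders):

# Minimal repro at the pyarrow layer, no Snowflake needed.
# Assumes pyarrow 5.0.0, as pinned in the environment above.
import pyarrow as pa

table = pa.Table.from_arrays([pa.array(["a"]), pa.array(["a"])], names=["A", "A"])
df = table.to_pandas()  # raises ValueError: Found non-unique column index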

@sfc-gh-mkeller
Collaborator

While the SQL is valid, it looks like pyarrow.Table.to_pandas deliberately doesn't support duplicate column names.
You could give the columns distinct names in your SQL.

I'd be against issuing warnings and de-duping columns, as this could lead to some users losing information (unknowingly, if they have warnings disabled).
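
A minimal sketch of that workaround applied to the repro above (the aliases col_a and col_b are arbitrary names, not anything the connector requires):

# Aliasing the duplicated expressions gives pyarrow unique column names.
# Snowflake upper-cases the unquoted aliases, so the DataFrame columns
# come back as COL_A and COL_B.
sql = "select 'a' as col_a, 'a' as col_b"
cur.execute(sql)
df = cur.fetch_pandas_all()  # succeeds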

@mbkupfer
Author

The user wouldn't lose any information in that case; it would just throw a runtime error and fail.

What about enhancing the error so that there is more context? E.g., catching ValueError: Found non-unique column index and then reporting the offending columns. Otherwise, it's hard to tell that something as innocuous as a duplicate column name is the reason for the pyarrow stack trace.
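
A hedged sketch of what that enhanced error could look like, written as a wrapper around the conversion rather than a connector patch; to_pandas_with_context is a hypothetical helper, not connector API:

from collections import Counter

import pyarrow as pa

def to_pandas_with_context(table: pa.Table):
    """Hypothetical helper: name the offending columns when conversion fails."""
    try:
        return table.to_pandas()
    except ValueError as exc:
        dupes = [name for name, n in Counter(table.column_names).items() if n > 1]
        if dupes:
            raise ValueError(
                f"result set has duplicate column names {dupes}; "
                "alias them in your SQL before fetching as pandas"
            ) from exc
        raise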

@onerishabh

Hi,

Is anyone working on this issue? I would like to discuss it further and contribute too.

@shacker

shacker commented Jul 21, 2022

This issue comes up when you select * while doing a join on a common column name, a very straightforward use case. It seems like the only way around it is to explicitly request every single column you want back in the select, which can be a pain.

@ibuda

ibuda commented Aug 25, 2022

@shacker I had the same error with select * from, and I can confirm that specifying the column list in the select statement solves the problem.
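
For illustration, a sketch of that workaround on a join; the table and column names below are invented:

# Instead of select *, list the columns explicitly and alias the ones
# that collide. Table and column names here are made up for illustration.
sql = """
select o.id as order_id,
       c.id as customer_id,
       c.name
from orders o
join customers c on o.customer_id = c.id
"""
cur.execute(sql)
df = cur.fetch_pandas_all()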

@Gonzalo933

Any plans to solve this issue in the future? It's really annoying to deal with when working with tables that have a lot of columns.

@sfc-gh-aling
Collaborator

Hey folks, thanks for your feedback.

I'm thinking of two things:

  1. Enhance the current error message: we should be able to catch the error and inspect the pyarrow table's column names to find the duplicates.

  2. Technically we can solve it: we can introduce a new keyword argument to fetch_pandas_all, something like dedup_columns, which defaults to False; if set to True, fetch_pandas_all would do the dedup (see the sketch after this comment). This way we avoid breaking changes while still providing the ability to dedup columns. However, I'm still a bit concerned about the data loss that @sfc-gh-mkeller mentioned. If the flag causes confusion or misunderstanding, and people use it incorrectly and shoot themselves in the foot, then I'd rather not introduce it.

I think we need more information on whether it would lead to real data loss. Do you have samples to share with us showing how your table is defined and what the exact select statement looks like?

On the other hand, the first item is definitely low-hanging fruit we can get done. Is anyone interested in enhancing the error information?
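
A sketch of what the dedup in item 2 might do under the hood, keeping the first occurrence of each name and warning about the rest; dedup_columns is only a proposed argument, and drop_duplicate_columns is a hypothetical helper:

import warnings

import pyarrow as pa

def drop_duplicate_columns(table: pa.Table) -> pa.Table:
    """Hypothetical helper: keep the first occurrence of each column name."""
    seen = set()
    keep = []
    for i, name in enumerate(table.column_names):
        if name in seen:
            warnings.warn(f"dropping duplicate column {name!r} at position {i}")
        else:
            seen.add(name)
            keep.append(i)
    return table.select(keep)  # later duplicates are dropped -- the data-loss concern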

@mbkupfer
Author

mbkupfer commented Apr 5, 2023

To provide some context, as it's been a long time since I submitted this issue: I was encountering this prior to Snowpark being released. Now with Snowpark it seems like this is either much less likely to come up, just from how the API works, or I get a helpful warning if there is a conflict.

My vote would be not to make any changes unless we are 100% sure they wouldn't lead to data loss, and instead let the connector do one thing and do that one thing really well.

@sfc-gh-aalam
Collaborator

Based on @mbkupfer's comments, I am closing this issue for now. If more concerns come up in the future, please consider the above comments by @sfc-gh-aling and @sfc-gh-mkeller. Happy to hear any other ideas/suggestions on how to proceed with this issue.

@breckognize

breckognize commented Oct 12, 2023

I just hit this issue. @mbkupfer, how does Snowpark mitigate it?

My preference would be a new optional flag that resolves column name conflicts by appending '_1', '_2', etc. Then there's no data loss.

In many cases, applications are just showing the results to a user who can deal with the ambiguity.
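
A sketch of that suffixing approach at the pyarrow layer; rename_with_suffixes is a hypothetical helper, not part of the connector:

from collections import Counter

import pyarrow as pa

def rename_with_suffixes(table: pa.Table) -> pa.Table:
    """Hypothetical helper: rename duplicates to name_1, name_2, ..."""
    seen = Counter()
    new_names = []
    for name in table.column_names:
        new_names.append(f"{name}_{seen[name]}" if seen[name] else name)
        seen[name] += 1
    return table.rename_columns(new_names)  # nothing is dropped, so no data loss

Because every column survives under a new name, this variant sidesteps the data-loss objection raised earlier in the thread.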

@mbkupfer
Author

> I just hit this issue. @mbkupfer, how does Snowpark mitigate it?
>
> My preference would be a new optional flag that resolves column name conflicts by appending '_1', '_2', etc. Then there's no data loss.
>
> In many cases, applications are just showing the results to a user who can deal with the ambiguity.

I'd have to see your use case to understand and advise. Happy to take this offline from this thread if you want to DM me personally.

@pblankley

Was there any resolution here to @breckognize's proposal to add _1, _2, etc. to duplicate column names?

That strikes me as the most logical solution, and it also matches the behavior of the SDKs of other vendors like BigQuery.
