ENH: show the raw unicode in the output formatting of Index/array? #60819

jorisvandenbossche · 2025-01-30T10:48:43Z

I ran into a slightly malformed CSV file. We automatically remove the BOM character from the data, but this file started with two such characters, and right now we keep the second one. So essentially I had a dataframe like this:

>>> df = pd.DataFrame({"\ufeffCol": [1, 2, 3]})
>>> df 
   Col
0     1
1     2
2     3

In the dataframe repr, I think it is expected that we don't show the character (since it is the unicode "zero width no-break space"). In any case, I was also using a notebook, and in the html repr we would certainly render the unicode character as-is.

But to diagnose the issue of df["Col"] failing with a KeyError, I looked at the columns:

>>> df.columns
Index(['Col'], dtype='str')

Here we do show the value as a string (i.e. it is quoted), but still don't show the unicode character, while the python repr of the string or the equivalent numpy array repr both show it:

>>> df.columns[0]
'\ufeffCol'
>>> df.columns.to_numpy()
array(['\ufeffCol'], dtype=object)

(the above is showing with the new "str" dtype, but originally I ran into it with object dtype, so both have the same issue)

It would have been much easier to debug this issue if the Index repr showed the unicode character.
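
As a stopgap for this kind of debugging, a small helper can flag labels that contain non-printable characters. This is only a sketch: `find_hidden_chars` is a hypothetical name, not a pandas function, and it assumes the labels can be passed through `str()`. It accepts any iterable of labels, so `df.columns` should work as well as a plain list.

```python
def find_hidden_chars(labels):
    """Return (repr of label, list of code points) for labels with non-printable characters.

    Hypothetical helper for illustration; not part of pandas.
    """
    report = []
    for label in labels:
        text = str(label)
        # str.isprintable() is False for format characters such as U+FEFF
        hidden = [f"U+{ord(ch):04X}" for ch in text if not ch.isprintable()]
        if hidden:
            report.append((repr(text), hidden))
    return report


columns = ["\ufeffCol", "other"]  # stand-in for df.columns
report = find_hidden_chars(columns)
print(report)
```

Only the first label is reported, together with the code point that makes `df["Col"]` fail.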


jorisvandenbossche · 2025-01-30T10:53:19Z

Of course for other, visible, unicode characters, we do want to show it, and it seems in that case both Python and numpy also show it:

In [51]: pd.Index(["\u2190"])
Out[51]: Index(['←'], dtype='object')

In [52]: pd.Index(["\u2190"])[0]
Out[52]: '←'

In [53]: pd.Index(["\u2190"]).to_numpy()
Out[53]: array(['←'], dtype=object)

So it seems that Python and numpy have some logic to render unicode characters in general, but to escape certain ones. And it is strange that we don't follow the same logic (as I would think we are just printing the python objects under the hood).
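
The logic in question appears to be the one documented for Python's `repr()` of strings: characters for which `str.isprintable()` is false are escaped, the rest are shown as-is. A quick check of the two characters from this thread, using only the standard library (`describe` is a name made up for this sketch):

```python
import unicodedata


def describe(ch):
    """Return code point, Unicode category, printability and repr of one character."""
    return (f"U+{ord(ch):04X}", unicodedata.category(ch), ch.isprintable(), repr(ch))


bom = describe("\ufeff")    # the BOM / zero width no-break space
arrow = describe("\u2190")  # the visible leftwards arrow
print(bom)
print(arrow)
```

The BOM is in category `Cf` (format) and counts as non-printable, so `repr()` escapes it, while the arrow is in category `Sm` (math symbol) and is rendered directly.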

jorisvandenbossche · 2025-01-30T10:57:00Z

> And it is strange that we don't follow the same (as I would think we are just printing the python objects under the hood)

Ah, I suppose that is the difference between str() and repr(). I thought that for the Index repr we used repr and not str to show the individual elements, but apparently not.
For the ExtensionArray._formatter it is the boxed=True/False keyword that controls the use of repr vs str. But I am not entirely sure this is used for Index as well.
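
To make the `str()` vs `repr()` difference concrete, here is a minimal sketch of a formatter factory in plain Python. `make_formatter` and its `use_repr` flag are hypothetical names for illustration and do not mirror the actual pandas internals.

```python
def make_formatter(use_repr):
    """Return the callable used to format individual elements.

    Hypothetical helper for illustration; not the pandas implementation.
    """
    return repr if use_repr else str


labels = ["\ufeffCol", "\u2190"]
as_str = [make_formatter(False)(x) for x in labels]
as_repr = [make_formatter(True)(x) for x in labels]

print(as_str)   # printing the list applies repr to each element again
print(as_repr)
```

With `str`, the BOM stays in the output as an invisible character; with `repr`, it is escaped as `\ufeff`, while the arrow is shown unchanged in both cases.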

purnabunty · 2025-01-30T13:35:21Z

Detect and Remove BOM Characters
Modify the logic to detect and remove all BOM characters from the DataFrame’s columns.

import pandas as pd

data = {'\ufeff\ufeffCol': [1, 2, 3]}
df = pd.DataFrame(data)

def strip_bom(df):
    # Remove every leading BOM character (U+FEFF) from each column label
    df.columns = [col.lstrip('\ufeff') for col in df.columns]
    return df

df = strip_bom(df)
print(df.columns)
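
For comparison, Python's own `utf-8-sig` codec also strips only a single leading BOM, so a doubled BOM leaves one behind in the decoded text. The sketch below uses made-up CSV bytes and the standard library only; it does not go through the pandas reader, whose BOM handling is separate.

```python
# Made-up CSV content that starts with two UTF-8 encoded BOMs
raw = b"\xef\xbb\xbf\xef\xbb\xbfCol\n1\n2\n3\n"

text = raw.decode("utf-8-sig")      # removes only the first BOM
header = text.splitlines()[0]
cleaned = header.lstrip("\ufeff")   # removes any remaining leading BOMs

print(repr(header))
print(repr(cleaned))
```

The decoded header still starts with `\ufeff`, and `lstrip` removes it regardless of how many BOMs were present.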

jbrockmendel · 2025-02-01T17:59:28Z

IIRC the ‘boxed’ parameter is not used consistently in the DTI/TDI cases. I don’t have it in front of me, but I remember it being confusing. I’d be +1 on making its use more consistent, possibly renaming it (e.g. to use_repr) to make it obvious what it does.

jorisvandenbossche added the Output-Formatting (__repr__ of pandas objects, to_string) and Unicode (Unicode strings) labels on Jan 30, 2025
