get_dataframe fails when input DataFrame already contains a column named N_match #12

@technic960183

Summary

When calling get_dataframe1() on a result object from xmatch, a ValueError is raised if the original input DataFrame contains a column named N_match. This appears to be caused by a naming conflict when the method attempts to append its own N_match column to the result DataFrame.

Error log

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[72], line 3
      1 result = xmatch(df_A, df_B, 1)
      2 print(result.number_distribution())
----> 3 result.get_dataframe1()

File spherimatch/result_xmatch.py:67, in XMatchResult.get_dataframe1(self, min_match, coord_columns, retain_all_columns, retain_columns)
     65 if len(append_df.columns) > 0:
     66     data_df = pd.concat([data_df, append_df], axis=1)
---> 67 data_df = data_df[data_df['N_match'] >= min_match]
     68 return data_df
...
ValueError: cannot reindex on an axis with duplicate labels

Root cause

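The method get_dataframe1() internally generates a column named N_match, but does not check whether this column already exists in the input DataFrame. If the input already contains a column with the same name, pd.concat produces two columns labeled N_match. With duplicate labels, data_df['N_match'] selects both columns and returns a DataFrame rather than a Series, so the filter on line 67 of result_xmatch.py builds a two-dimensional boolean mask, and indexing with that mask raises the ValueError shown in the error log.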
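
The mechanism can be reproduced with pandas alone, without spherimatch. The snippet below is only a sketch of the suspected failure path: data_df and append_df are stand-ins whose contents are made up for illustration, not the actual intermediate objects inside get_dataframe1().

```python
import pandas as pd

# Stand-in for a table that already carries an N_match column.
data_df = pd.DataFrame({"RA": [10.0, 20.0], "N_match": [1, 2]})
# Stand-in for a second table contributing a column with the same label.
append_df = pd.DataFrame({"N_match": [1, 1]})

# Column-wise concat keeps both labels, so N_match now appears twice.
combined = pd.concat([data_df, append_df], axis=1)
print(list(combined.columns))              # ['RA', 'N_match', 'N_match']

# Selecting a duplicated label returns a DataFrame, not a Series.
print(type(combined["N_match"]).__name__)  # DataFrame

# Filtering with the resulting 2-D boolean mask fails.
try:
    combined[combined["N_match"] >= 1]
    error_message = None
except ValueError as exc:
    error_message = str(exc)
print(error_message)
```

The exact wording of the message may differ between pandas versions.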

Minimal steps to reproduce

import pandas as pd
from spherimatch import xmatch

df_A = pd.DataFrame({
    'RA': [10.0, 20.0],
    'DEC': [10.0, 20.0],
    'N_match': [1, 2],  # This triggers the bug
})
df_B = pd.DataFrame({
    'RA': [10.01, 19.99],
    'DEC': [10.01, 20.01],
})

result = xmatch(df_A, df_B, 1)
df_out = result.get_dataframe1()
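
Until this is fixed, one possible workaround is to rename the clashing column before matching. This is a sketch under the assumption that the conflict is purely by column name; N_match_input is an arbitrary label chosen here for illustration.

```python
import pandas as pd

df_A = pd.DataFrame({
    "RA": [10.0, 20.0],
    "DEC": [10.0, 20.0],
    "N_match": [1, 2],
})

# Move the user's column to a different (hypothetical) label.
# rename() returns a new DataFrame; df_A itself is left unchanged.
df_A_safe = df_A.rename(columns={"N_match": "N_match_input"})
print(list(df_A_safe.columns))  # ['RA', 'DEC', 'N_match_input']
```

Passing df_A_safe to xmatch in place of df_A should then leave the N_match label free for the generated column.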

Expected behavior

The method should either:

  • Avoid overwriting/conflicting with existing N_match columns, or
  • Raise a clear and informative error if the column name already exists, or
  • Allow users to specify the name of the output column.
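
The options above could be combined in a small name-resolution step. The following is a hedged sketch, not spherimatch code: resolve_count_column and its name and on_conflict parameters are hypothetical and only illustrate one way the behavior might look.

```python
import pandas as pd


def resolve_count_column(existing_columns, name="N_match", on_conflict="raise"):
    """Hypothetical helper: pick a label for the match-count column.

    name        -- desired output column name (user-configurable)
    on_conflict -- "raise" for a clear error, "suffix" to pick a free name
    """
    existing = set(existing_columns)
    if name not in existing:
        return name
    if on_conflict == "raise":
        raise ValueError(
            f"Input DataFrame already has a column named {name!r}; "
            "rename it or choose a different output column name."
        )
    if on_conflict == "suffix":
        # Append the first integer suffix that is not already taken.
        i = 1
        while f"{name}_{i}" in existing:
            i += 1
        return f"{name}_{i}"
    raise ValueError(f"Unknown on_conflict value: {on_conflict!r}")


df_A = pd.DataFrame({"RA": [10.0], "DEC": [10.0], "N_match": [1]})
print(resolve_count_column(df_A.columns, on_conflict="suffix"))  # N_match_1
print(resolve_count_column(df_A.columns, name="n_pairs"))        # n_pairs
```

Raising by default keeps the failure explicit, while the suffix mode and the name parameter cover the other two options.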
