Extraneous columns (e.g., metadata not related to the FeatureBuilder inputs) with different types can cause FB to fail #342

@xehu

Description


When users input their data, columns other than those required by the FeatureBuilder should be ignored. In practice, however, extraneous columns can cause the FeatureBuilder to fail: `merge_conv_data_with_original()` calls `drop_duplicates()` on the merged output, and pandas cannot deduplicate rows containing unhashable cell values. In this case, a metadata column containing 'set'-type data throws a `TypeError` in the final step:
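The failure can be reproduced outside the FeatureBuilder entirely; this is a minimal sketch (the column names here are hypothetical stand-ins for a user's metadata) showing that pandas' `drop_duplicates()` raises the same error on any DataFrame with a set-typed column:

```python
import pandas as pd

# Hypothetical input resembling user data: required columns plus an
# extraneous metadata column whose cells are Python sets.
df = pd.DataFrame({
    "gameID": ["g1", "g1", "g2"],
    "text": ["hi", "bye", "ok"],
    "meta_tags": [{"a"}, {"a"}, {"b"}],  # set-typed metadata column
})

# drop_duplicates() hashes every column to find duplicate rows;
# sets are unhashable, so pandas raises a TypeError.
try:
    df.drop_duplicates()
    raised = None
except TypeError as e:
    raised = str(e)

print(raised)
```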

Stack Trace

Initializing Featurization...
Confirmed that data has conversation_id: gameID, speaker_id: sender_id and message: text columns!
Generating RoBERTa sentiments...
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 207/207 [05:36<00:00,  1.62s/it]
Chat Level Features ...
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 15/15 [02:58<00:00, 11.93s/it]
Generating features for the first 100.0% of messages...
Generating User Level Features ...
Generating Conversation Level Features ...
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[12], line 10
      1 teamcomm_feature_builder = FeatureBuilder(input_df = df_all_chats,
      2                                           conversation_id_col="gameID",
      3                                           speaker_id_col="sender_id",
   (...)
      8                                           turns=True,
      9                                         )
---> 10 teamcomm_feature_builder.featurize()

File ~/anaconda3/envs/msr_env/lib/python3.10/site-packages/team_comm_tools/feature_builder.py:498, in FeatureBuilder.featurize(self)
    496 print("Generating Conversation Level Features ...")
    497 self.conv_level_features()
--> 498 self.merge_conv_data_with_original()
    500 # Step 4. Write the feartures into the files defined in the output paths.
    501 print("All Done!")

File ~/anaconda3/envs/msr_env/lib/python3.10/site-packages/team_comm_tools/feature_builder.py:433, in FeatureBuilder.merge_conv_data_with_original(self)
    425 # Use the 1st item in the row, as they are all the same at the conv level
    426 orig_conv_data = orig_conv_data.groupby([self.conversation_id_col]).nth(0).reset_index()
    428 final_conv_output = pd.merge(
    429     left= self.conv_data,
    430     right = orig_conv_data,
    431     on=[self.conversation_id_col],
    432     how="left"
--> 433 ).drop_duplicates()
    435 self.conv_data = final_conv_output
    437 # drop index column, if present

File ~/anaconda3/envs/msr_env/lib/python3.10/site-packages/pandas/core/frame.py:6818, in DataFrame.drop_duplicates(self, subset, keep, inplace, ignore_index)
   6815 inplace = validate_bool_kwarg(inplace, "inplace")
   6816 ignore_index = validate_bool_kwarg(ignore_index, "ignore_index")
-> 6818 result = self[-self.duplicated(subset, keep=keep)]
   6819 if ignore_index:
   6820     result.index = default_index(len(result))

File ~/anaconda3/envs/msr_env/lib/python3.10/site-packages/pandas/core/frame.py:6958, in DataFrame.duplicated(self, subset, keep)
   6956 else:
   6957     vals = (col.values for name, col in self.items() if name in subset)
-> 6958     labels, shape = map(list, zip(*map(f, vals)))
   6960     ids = get_group_index(labels, tuple(shape), sort=False, xnull=False)
   6961     result = self._constructor_sliced(duplicated(ids, keep), index=self.index)

File ~/anaconda3/envs/msr_env/lib/python3.10/site-packages/pandas/core/frame.py:6926, in DataFrame.duplicated.<locals>.f(vals)
   6925 def f(vals) -> tuple[np.ndarray, int]:
-> 6926     labels, shape = algorithms.factorize(vals, size_hint=len(self))
   6927     return labels.astype("i8", copy=False), len(shape)

File ~/anaconda3/envs/msr_env/lib/python3.10/site-packages/pandas/core/algorithms.py:795, in factorize(values, sort, use_na_sentinel, size_hint)
    792             # Don't modify (potentially user-provided) array
    793             values = np.where(null_mask, na_value, values)
--> 795     codes, uniques = factorize_array(
    796         values,
    797         use_na_sentinel=use_na_sentinel,
    798         size_hint=size_hint,
    799     )
    801 if sort and len(uniques) > 0:
    802     uniques, codes = safe_sort(
    803         uniques,
    804         codes,
   (...)
    807         verify=False,
    808     )

File ~/anaconda3/envs/msr_env/lib/python3.10/site-packages/pandas/core/algorithms.py:595, in factorize_array(values, use_na_sentinel, size_hint, na_value, mask)
    592 hash_klass, values = _get_hashtable_algo(values)
    594 table = hash_klass(size_hint or len(values))
--> 595 uniques, codes = table.factorize(
    596     values,
    597     na_sentinel=-1,
    598     na_value=na_value,
    599     mask=mask,
    600     ignore_na=use_na_sentinel,
    601 )
    603 # re-cast e.g. i8->dt64/td64, uint8->bool
    604 uniques = _reconstruct_data(uniques, original.dtype, original)

File pandas/_libs/hashtable_class_helper.pxi:7281, in pandas._libs.hashtable.PyObjectHashTable.factorize()

File pandas/_libs/hashtable_class_helper.pxi:7195, in pandas._libs.hashtable.PyObjectHashTable._unique()

TypeError: unhashable type: 'set'
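One possible user-side workaround until this is fixed, sketched below (the helper name `make_columns_hashable` is hypothetical, not part of team_comm_tools): convert set-typed cells to `frozenset`, which is hashable, before passing the DataFrame in. `drop_duplicates()` then works because every cell value can be hashed.

```python
import pandas as pd

def make_columns_hashable(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical pre-processing helper: replace set-typed cells
    with frozensets so pandas can hash them in drop_duplicates()."""
    out = df.copy()
    for col in out.columns:
        # Only touch columns that actually contain sets.
        if out[col].map(lambda v: isinstance(v, set)).any():
            out[col] = out[col].map(
                lambda v: frozenset(v) if isinstance(v, set) else v
            )
    return out

# Two identical rows with a set-typed metadata column.
df = pd.DataFrame({"gameID": ["g1", "g1"], "meta_tags": [{"a"}, {"a"}]})
clean = make_columns_hashable(df)
deduped = clean.drop_duplicates()  # no TypeError; duplicates collapse
```

Alternatively, the FeatureBuilder itself could drop or stringify unhashable columns before the final `drop_duplicates()` call.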

Labels: bug (Something isn't working)