Description
When a user inputs their data, columns other than the ones required by our FeatureBuilder are meant to be ignored. However, it appears that these extra columns can sometimes cause the FeatureBuilder to fail. In this case, a metadata column containing 'set'-type data throws an error in the final merge step (merge_conv_data_with_original), because pandas cannot hash sets when dropping duplicates:
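A minimal reproduction of the underlying pandas behavior, independent of FeatureBuilder (the column names here are hypothetical, not taken from the user's data):

```python
import pandas as pd

# Hypothetical chat data where a metadata column holds Python sets.
df = pd.DataFrame({
    "gameID": ["g1", "g1", "g2"],
    "text": ["hello", "world", "hi"],
    "tags": [{"a", "b"}, {"a", "b"}, {"c"}],  # 'set'-type metadata column
})

# drop_duplicates() hashes every column's values; sets are unhashable,
# so this raises TypeError: unhashable type: 'set'.
df.drop_duplicates()
```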
Stack Trace
Initializing Featurization...
Confirmed that data has conversation_id: gameID, speaker_id: sender_id and message: text columns!
Generating RoBERTa sentiments...
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 207/207 [05:36<00:00, 1.62s/it]
Chat Level Features ...
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 15/15 [02:58<00:00, 11.93s/it]
Generating features for the first 100.0% of messages...
Generating User Level Features ...
Generating Conversation Level Features ...
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[12], line 10
1 teamcomm_feature_builder = FeatureBuilder(input_df = df_all_chats,
2 conversation_id_col="gameID",
3 speaker_id_col="sender_id",
(...)
8 turns=True,
9 )
---> 10 teamcomm_feature_builder.featurize()
File ~/anaconda3/envs/msr_env/lib/python3.10/site-packages/team_comm_tools/feature_builder.py:498, in FeatureBuilder.featurize(self)
496 print("Generating Conversation Level Features ...")
497 self.conv_level_features()
--> 498 self.merge_conv_data_with_original()
500 # Step 4. Write the feartures into the files defined in the output paths.
501 print("All Done!")
File ~/anaconda3/envs/msr_env/lib/python3.10/site-packages/team_comm_tools/feature_builder.py:433, in FeatureBuilder.merge_conv_data_with_original(self)
425 # Use the 1st item in the row, as they are all the same at the conv level
426 orig_conv_data = orig_conv_data.groupby([self.conversation_id_col]).nth(0).reset_index()
428 final_conv_output = pd.merge(
429 left= self.conv_data,
430 right = orig_conv_data,
431 on=[self.conversation_id_col],
432 how="left"
--> 433 ).drop_duplicates()
435 self.conv_data = final_conv_output
437 # drop index column, if present
File ~/anaconda3/envs/msr_env/lib/python3.10/site-packages/pandas/core/frame.py:6818, in DataFrame.drop_duplicates(self, subset, keep, inplace, ignore_index)
6815 inplace = validate_bool_kwarg(inplace, "inplace")
6816 ignore_index = validate_bool_kwarg(ignore_index, "ignore_index")
-> 6818 result = self[-self.duplicated(subset, keep=keep)]
6819 if ignore_index:
6820 result.index = default_index(len(result))
File ~/anaconda3/envs/msr_env/lib/python3.10/site-packages/pandas/core/frame.py:6958, in DataFrame.duplicated(self, subset, keep)
6956 else:
6957 vals = (col.values for name, col in self.items() if name in subset)
-> 6958 labels, shape = map(list, zip(*map(f, vals)))
6960 ids = get_group_index(labels, tuple(shape), sort=False, xnull=False)
6961 result = self._constructor_sliced(duplicated(ids, keep), index=self.index)
File ~/anaconda3/envs/msr_env/lib/python3.10/site-packages/pandas/core/frame.py:6926, in DataFrame.duplicated.<locals>.f(vals)
6925 def f(vals) -> tuple[np.ndarray, int]:
-> 6926 labels, shape = algorithms.factorize(vals, size_hint=len(self))
6927 return labels.astype("i8", copy=False), len(shape)
File ~/anaconda3/envs/msr_env/lib/python3.10/site-packages/pandas/core/algorithms.py:795, in factorize(values, sort, use_na_sentinel, size_hint)
792 # Don't modify (potentially user-provided) array
793 values = np.where(null_mask, na_value, values)
--> 795 codes, uniques = factorize_array(
796 values,
797 use_na_sentinel=use_na_sentinel,
798 size_hint=size_hint,
799 )
801 if sort and len(uniques) > 0:
802 uniques, codes = safe_sort(
803 uniques,
804 codes,
(...)
807 verify=False,
808 )
File ~/anaconda3/envs/msr_env/lib/python3.10/site-packages/pandas/core/algorithms.py:595, in factorize_array(values, use_na_sentinel, size_hint, na_value, mask)
592 hash_klass, values = _get_hashtable_algo(values)
594 table = hash_klass(size_hint or len(values))
--> 595 uniques, codes = table.factorize(
596 values,
597 na_sentinel=-1,
598 na_value=na_value,
599 mask=mask,
600 ignore_na=use_na_sentinel,
601 )
603 # re-cast e.g. i8->dt64/td64, uint8->bool
604 uniques = _reconstruct_data(uniques, original.dtype, original)
File pandas/_libs/hashtable_class_helper.pxi:7281, in pandas._libs.hashtable.PyObjectHashTable.factorize()
File pandas/_libs/hashtable_class_helper.pxi:7195, in pandas._libs.hashtable.PyObjectHashTable._unique()
TypeError: unhashable type: 'set'
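As a user-side workaround (not a confirmed fix), converting set-valued cells to frozensets, or dropping such columns, before passing the DataFrame to FeatureBuilder avoids the crash; a possible library-side fix would be to restrict the drop_duplicates() call in merge_conv_data_with_original to hashable columns. A sketch of the workaround, assuming df_all_chats from the notebook above:

```python
import pandas as pd

def make_hashable(df: pd.DataFrame) -> pd.DataFrame:
    """Convert set-valued cells to frozensets so pandas can hash them."""
    out = df.copy()
    for col in out.columns:
        if out[col].map(lambda v: isinstance(v, set)).any():
            out[col] = out[col].map(
                lambda v: frozenset(v) if isinstance(v, set) else v
            )
    return out

# Apply before constructing the FeatureBuilder, then featurize as before.
df_all_chats = make_hashable(df_all_chats)
```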