-
Notifications
You must be signed in to change notification settings - Fork 164
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Numeric updates #1103
Numeric updates #1103
Conversation
df_series_clean = df_series_clean[df_series_clean != np.nan] | ||
if df_series_clean.size == 0: | ||
df_np_series_clean = df_series_clean.to_numpy() | ||
df_np_series_clean = df_np_series_clean[df_np_series_clean != np.nan] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this saying that there is always a 0 or 1 for index values?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
may not be a question for this PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If i am reading it correctly it is just dropping any nans in the array
if self._greater_than_64_bit and type(df_series) is pl.Series: | ||
batch_biased_skewness = profiler_utils.biased_skew(df_series.to_numpy()) | ||
else: | ||
df_series = pl.from_pandas(df_series, nan_to_null=False) | ||
batch_biased_skewness = profiler_utils.biased_skew(df_series) | ||
batch_biased_skewness = profiler_utils.biased_skew(df_series) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there are a couple of places in this code where we are doing checks in a similar way to this. I wonder if we could create a helper function. I don't know exactly how it would work, but just a thought.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
its kind of a weird edge case due to how polars works internally it won't do math on values greater than 64 bits. What would the helper function do exactly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"helper" function would simply be a single definition of this code that is reused throughout .... instead of doing basically the same code in multiple places, just do one definition of that repeatable code and reuse that throughout
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
gotcha, I don't there is a great way to wrap this since the called function is usually different. Would also harm readability I think.
@@ -6,7 +6,7 @@ | |||
from unittest import mock |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
general question about the tests for your implementation. Do you have test cases that cover where df_series.to_numpy()
would be called? For example:
if df_series.is_empty():
num_negatives_value = 0
elif self._greater_than_64_bit:
num_negatives_value = int((df_series.to_numpy() < 0).sum())
else:
num_negatives_value = int((df_series < 0).sum())
do you have test cases that will go in both the elif
and the else
here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yup these are covered by the int_column_profiler when it tests values of 64 bits.
if self._greater_than_64_bit and type(df_series) is pl.Series: | ||
batch_biased_skewness = profiler_utils.biased_skew(df_series.to_numpy()) | ||
else: | ||
df_series = pl.from_pandas(df_series, nan_to_null=False) | ||
batch_biased_skewness = profiler_utils.biased_skew(df_series) | ||
batch_biased_skewness = profiler_utils.biased_skew(df_series) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"helper" function would simply be a single definition of this code that is reused throughout .... instead of doing basically the same code in multiple places, just do one definition of that repeatable code and reuse that throughout
* update profiler utils * finish updates * finish int updates * update float precision * finish float col profile updates * update text_col_profile * update float col profiler completely * finish int col tests * update text profiler tests * fully finished * fix pandas df in update
* update profiler utils * finish updates * finish int updates * update float precision * finish float col profile updates * update text_col_profile * update float col profiler completely * finish int col tests * update text profiler tests * fully finished * fix pandas df in update
* update profiler utils * finish updates * finish int updates * update float precision * finish float col profile updates * update text_col_profile * update float col profiler completely * finish int col tests * update text profiler tests * fully finished * fix pandas df in update
* update profiler utils * finish updates * finish int updates * update float precision * finish float col profile updates * update text_col_profile * update float col profiler completely * finish int col tests * update text profiler tests * fully finished * fix pandas df in update
No description provided.