Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Numeric updates #1103

Merged
merged 11 commits into from
Mar 21, 2024
Merged

Conversation

atl1502
Copy link
Contributor

@atl1502 atl1502 commented Mar 3, 2024

No description provided.

@atl1502 atl1502 requested a review from a team as a code owner March 3, 2024 23:10
df_series_clean = df_series_clean[df_series_clean != np.nan]
if df_series_clean.size == 0:
df_np_series_clean = df_series_clean.to_numpy()
df_np_series_clean = df_np_series_clean[df_np_series_clean != np.nan]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this saying that there is always a 0 or 1 for index values?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

may not be a question for this PR

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If i am reading it correctly it is just dropping any nans in the array

Comment on lines +1912 to +1915
if self._greater_than_64_bit and type(df_series) is pl.Series:
batch_biased_skewness = profiler_utils.biased_skew(df_series.to_numpy())
else:
df_series = pl.from_pandas(df_series, nan_to_null=False)
batch_biased_skewness = profiler_utils.biased_skew(df_series)
batch_biased_skewness = profiler_utils.biased_skew(df_series)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there are a couple of places in this code where we are doing checks in a similar way to this. I wonder if we could create a helper function. I don't know exactly how it would work, but just a thought.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

its kind of a weird edge case due to how polars works internally it won't do math on values greater than 64 bits. What would the helper function do exactly?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"helper" function would simply be a single definition of this code that is reused throughout .... instead of doing basically the same code in multiple places, just do one definition of that repeatable code and reuse that throughout

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

gotcha, I don't there is a great way to wrap this since the called function is usually different. Would also harm readability I think.

@@ -6,7 +6,7 @@
from unittest import mock
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

general question about the tests for your implementation. Do you have test cases that cover where df_series.to_numpy() would be called? For example:

if df_series.is_empty():
    num_negatives_value = 0
elif self._greater_than_64_bit:
    num_negatives_value = int((df_series.to_numpy() < 0).sum())
else:
    num_negatives_value = int((df_series < 0).sum())

do you have test cases that will go in both the elif and the else here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yup these are covered by the int_column_profiler when it tests values of 64 bits.

Comment on lines +1912 to +1915
if self._greater_than_64_bit and type(df_series) is pl.Series:
batch_biased_skewness = profiler_utils.biased_skew(df_series.to_numpy())
else:
df_series = pl.from_pandas(df_series, nan_to_null=False)
batch_biased_skewness = profiler_utils.biased_skew(df_series)
batch_biased_skewness = profiler_utils.biased_skew(df_series)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"helper" function would simply be a single definition of this code that is reused throughout .... instead of doing basically the same code in multiple places, just do one definition of that repeatable code and reuse that throughout

@taylorfturner taylorfturner merged commit 6e2c2fa into capitalone:feature/polars Mar 21, 2024
5 checks passed
taylorfturner pushed a commit that referenced this pull request Mar 22, 2024
* update profiler utils

* finish updates

* finish int updates

* update float precision

* finish float col profile updates

* update text_col_profile

* update float col profiler completely

* finish int col tests

* update text profiler tests

* fully finished

* fix pandas df in update
abajpai15 pushed a commit to abajpai15/DataProfiler that referenced this pull request Apr 11, 2024
* update profiler utils

* finish updates

* finish int updates

* update float precision

* finish float col profile updates

* update text_col_profile

* update float col profiler completely

* finish int col tests

* update text profiler tests

* fully finished

* fix pandas df in update
taylorfturner pushed a commit that referenced this pull request Apr 15, 2024
* update profiler utils

* finish updates

* finish int updates

* update float precision

* finish float col profile updates

* update text_col_profile

* update float col profiler completely

* finish int col tests

* update text profiler tests

* fully finished

* fix pandas df in update
atl1502 added a commit that referenced this pull request Apr 16, 2024
* update profiler utils

* finish updates

* finish int updates

* update float precision

* finish float col profile updates

* update text_col_profile

* update float col profiler completely

* finish int col tests

* update text profiler tests

* fully finished

* fix pandas df in update
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants