Numeric updates #1103

atl1502 · 2024-03-03T23:10:51Z

No description provided.

dataprofiler/profilers/float_column_profile.py

ksneab7 · 2024-03-06T19:30:37Z

dataprofiler/profilers/numerical_column_stats.py

-            df_series_clean = df_series_clean[df_series_clean != np.nan]
-            if df_series_clean.size == 0:
+            df_np_series_clean = df_series_clean.to_numpy()
+            df_np_series_clean = df_np_series_clean[df_np_series_clean != np.nan]


Is this saying that there is always a 0 or 1 for index values?

This may not be a question for this PR.

If I am reading it correctly, it is just dropping any NaNs in the array.
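One detail worth checking in the hunk above: under IEEE 754, NaN compares unequal to everything, itself included, so a `!= np.nan` mask is True for every element and removes nothing. NumPy applies the same rule element-wise, and the usual way to drop NaNs there is a mask such as `~np.isnan(arr)`. A minimal standard-library sketch of the same behaviour, using made-up values rather than anything from the PR:

```python
import math

values = [1.0, math.nan, 3.0]  # illustrative data, not from the PR

# NaN != NaN evaluates to True, so this comparison keeps every element.
kept_by_comparison = [v for v in values if v != math.nan]

# An explicit NaN test is what actually filters the NaN out.
kept_by_isnan = [v for v in values if not math.isnan(v)]

print(len(kept_by_comparison))  # 3
print(kept_by_isnan)            # [1.0, 3.0]
```

If the intent of the line is to drop NaNs, an explicit NaN test would be the safer spelling.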

dataprofiler/profilers/float_column_profile.py

micdavis · 2024-03-06T19:37:37Z

dataprofiler/profilers/numerical_column_stats.py

+        if self._greater_than_64_bit and type(df_series) is pl.Series:
+            batch_biased_skewness = profiler_utils.biased_skew(df_series.to_numpy())
        else:
-            df_series = pl.from_pandas(df_series, nan_to_null=False)
-        batch_biased_skewness = profiler_utils.biased_skew(df_series)
+            batch_biased_skewness = profiler_utils.biased_skew(df_series)


There are a couple of places in this code where we are doing checks in a similar way to this. I wonder if we could create a helper function. I don't know exactly how it would work, but just a thought.

It's kind of a weird edge case: because of how polars works internally, it won't do math on values greater than 64 bits. What would the helper function do exactly?

A "helper" function would simply be a single definition of this code that is reused throughout. Instead of writing basically the same code in multiple places, define the repeatable code once and reuse it.

Gotcha. I don't think there is a great way to wrap this, since the called function is usually different. It would also harm readability, I think.
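For reference, one shape such a helper could take is to pass the called function in as an argument, which addresses the "called function is usually different" concern at some cost to readability. This is only a sketch: `apply_stat` and `FakeSeries` are hypothetical names, and the `hasattr` check is a duck-typed stand-in for the `type(df_series) is pl.Series` check in the diff (it is broader, since a pandas Series also has `to_numpy`).

```python
from typing import Any, Callable


def apply_stat(
    series: Any, stat_fn: Callable[[Any], Any], greater_than_64_bit: bool
) -> Any:
    """Hypothetical helper: run stat_fn on the NumPy view only when flagged."""
    # Duck-typed stand-in for the `type(df_series) is pl.Series` check.
    if greater_than_64_bit and hasattr(series, "to_numpy"):
        return stat_fn(series.to_numpy())
    return stat_fn(series)


class FakeSeries:
    """Tiny stub so the sketch runs without polars installed."""

    def __init__(self, values):
        self._values = list(values)

    def to_numpy(self):
        # The real method returns a NumPy array; a list is enough here.
        return list(self._values)

    def __iter__(self):
        return iter(self._values)


big = FakeSeries([2**70, -(2**70), 5])
total_flagged = apply_stat(big, sum, greater_than_64_bit=True)   # via to_numpy()
total_plain = apply_stat(big, sum, greater_than_64_bit=False)    # series itself
```

A call site in the style of the diff would then read roughly `apply_stat(df_series, profiler_utils.biased_skew, self._greater_than_64_bit)`, assuming the flag and function are available as shown in the hunk.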

micdavis · 2024-03-06T19:44:49Z

dataprofiler/tests/profilers/test_numeric_stats_mixin_profile.py

@@ -6,7 +6,7 @@
 from unittest import mock


General question about the tests for your implementation: do you have test cases that cover where df_series.to_numpy() would be called? For example:

    if df_series.is_empty():
        num_negatives_value = 0
    elif self._greater_than_64_bit:
        num_negatives_value = int((df_series.to_numpy() < 0).sum())
    else:
        num_negatives_value = int((df_series < 0).sum())

Do you have test cases that will go into both the elif and the else here?

Yup, these are covered by the int_column_profiler when it tests values of 64 bits.
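To make the branch-coverage question concrete, here is a sketch of tests that hit the empty, `elif`, and `else` paths and check whether `to_numpy()` was called. `RecordingSeries` and `count_negatives` are hypothetical stand-ins that mirror the branch quoted above; they are not the profiler's own classes or the tests in this PR.

```python
import unittest


class RecordingSeries:
    """Stub series that counts to_numpy() calls (not a polars object)."""

    def __init__(self, values):
        self._values = list(values)
        self.to_numpy_calls = 0

    def is_empty(self):
        return not self._values

    def to_numpy(self):
        self.to_numpy_calls += 1
        return list(self._values)

    def __iter__(self):
        return iter(self._values)


def count_negatives(series, greater_than_64_bit):
    """Stand-in that mirrors the three branches quoted above."""
    if series.is_empty():
        return 0
    if greater_than_64_bit:
        return sum(1 for v in series.to_numpy() if v < 0)
    return sum(1 for v in series if v < 0)


class CountNegativesBranches(unittest.TestCase):
    def test_elif_branch_converts(self):
        series = RecordingSeries([-(2**70), 3])
        self.assertEqual(count_negatives(series, True), 1)
        self.assertEqual(series.to_numpy_calls, 1)

    def test_else_branch_does_not_convert(self):
        series = RecordingSeries([-1, -2, 3])
        self.assertEqual(count_negatives(series, False), 2)
        self.assertEqual(series.to_numpy_calls, 0)

    def test_empty_short_circuits(self):
        series = RecordingSeries([])
        self.assertEqual(count_negatives(series, True), 0)
        self.assertEqual(series.to_numpy_calls, 0)
```

Running the module with `python -m unittest` would exercise all three paths; against the real profiler, the same idea would apply with actual polars series in place of the stub.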

dataprofiler/profilers/float_column_profile.py

taylorfturner · 2024-03-07T12:52:55Z

dataprofiler/profilers/numerical_column_stats.py

+        if self._greater_than_64_bit and type(df_series) is pl.Series:
+            batch_biased_skewness = profiler_utils.biased_skew(df_series.to_numpy())
        else:
-            df_series = pl.from_pandas(df_series, nan_to_null=False)
-        batch_biased_skewness = profiler_utils.biased_skew(df_series)
+            batch_biased_skewness = profiler_utils.biased_skew(df_series)


A "helper" function would simply be a single definition of this code that is reused throughout. Instead of writing basically the same code in multiple places, define the repeatable code once and reuse it.

dataprofiler/tests/profilers/test_float_column_profile.py

dataprofiler/tests/profilers/test_int_column_profile.py

* update profiler utils
* finish updates
* finish int updates
* update float precision
* finish float col profile updates
* update text_col_profile
* update float col profiler completely
* finish int col tests
* update text profiler tests
* fully finished
* fix pandas df in update

atl1502 requested a review from a team as a code owner March 3, 2024 23:10

ksneab7 reviewed Mar 6, 2024

dataprofiler/profilers/float_column_profile.py

ksneab7 reviewed Mar 6, 2024

micdavis reviewed Mar 6, 2024

taylorfturner reviewed Mar 7, 2024

atl1502 added 11 commits March 8, 2024 12:10

ea58aed update profiler utils
f23f701 finish updates
ac888ff finish int updates
02aadef update float precision
8dc5106 finish float col profile updates
117d0aa update text_col_profile
4db07ac update float col profiler completely
cf68568 finish int col tests
c0d90a2 update text profiler tests
a9da02e fully finished
30a4c24 fix pandas df in update

atl1502 force-pushed the numeric_updates branch from 68ed89c to 30a4c24 Compare March 8, 2024 18:10

taylorfturner enabled auto-merge (squash) March 19, 2024 15:10

taylorfturner assigned atl1502 Mar 19, 2024

taylorfturner added the polars label Mar 19, 2024

taylorfturner approved these changes Mar 19, 2024

micdavis approved these changes Mar 21, 2024

taylorfturner merged commit 6e2c2fa into capitalone:feature/polars Mar 21, 2024
5 checks passed


atl1502 commented Mar 3, 2024

ksneab7 Mar 6, 2024

ksneab7 Mar 6, 2024

atl1502 Mar 6, 2024

micdavis Mar 6, 2024

atl1502 Mar 6, 2024

taylorfturner Mar 7, 2024

atl1502 Mar 8, 2024

micdavis Mar 6, 2024

atl1502 Mar 6, 2024

taylorfturner Mar 7, 2024
