Releases: capitalone/DataProfiler

v0.4.3

22 Apr 19:15
2238d32

Runtime Changes

Migrating from v0.4.2 to v0.4.3 should reduce profiling time by 30-90%, depending largely on system resources and data size.

Notes

  • Remove requirement for tensorflow-addons
  • Library now works with tensorflow nightly (Python 3.9)
  • Added example on generating a new data labeler

Profiler

  • Multiprocessing data preprocessing
  • Improved histogram accuracy
  • Reduced histogram generation runtime
  • Option to set the bin count for histogram
  • Expanded precision and switched to precision estimation (as opposed to exact calculation)
  • Limit pool size based on cpu and memory limitations
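
The pool size limit above can be pictured with a minimal sketch. This is not the library's implementation: the helper name suggest_pool_size, its parameters, and the assumption that each worker holds its own copy of the data are all hypothetical.

```python
import os


def suggest_pool_size(data_bytes, available_bytes, cpu_count=None):
    """Hypothetical helper: cap the worker count by CPU count and by memory."""
    # Fall back to the machine's CPU count (at least 1) when none is given.
    cpu_limit = cpu_count if cpu_count is not None else (os.cpu_count() or 1)
    # Assumption: every worker needs room for one full copy of the data.
    per_worker_bytes = max(1, data_bytes)
    memory_limit = available_bytes // per_worker_bytes
    # Always allow at least one worker so profiling can still run serially.
    return max(1, min(cpu_limit, memory_limit))


if __name__ == "__main__":
    # 8 CPUs, but memory only fits two copies of the data.
    print(suggest_pool_size(100, 250, cpu_count=8))
```

The result could then be passed as the process count when creating a multiprocessing pool.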

Data

  • Improved JSON detection method
    • A new option (on by default) pulls metadata and data separately (data.meta and data.data)
    • data.meta is the part of the JSON that contains no records
    • data.data is the part of the JSON that contains records
    • Added option to select keys which represent records
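
As a rough illustration of the metadata/data split described above, the sketch below separates the record-bearing keys of a JSON object from the rest. The function split_json, its record_keys parameter, and the "non-empty list of objects" heuristic are assumptions made for this example, not the library's detection logic.

```python
import json


def split_json(payload, record_keys=None):
    """Hypothetical helper: separate record-bearing keys from metadata."""
    meta, data = {}, {}
    for key, value in payload.items():
        if record_keys is not None:
            # The caller explicitly selected which keys represent records.
            is_records = key in record_keys
        else:
            # Heuristic assumption: a non-empty list of objects holds records.
            is_records = (
                isinstance(value, list)
                and len(value) > 0
                and all(isinstance(item, dict) for item in value)
            )
        (data if is_records else meta)[key] = value
    return meta, data


if __name__ == "__main__":
    raw = '{"source": "api", "count": 2, "rows": [{"id": 1}, {"id": 2}]}'
    meta, data = split_json(json.loads(raw))
    print(meta)
    print(data)
```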

Report

  • Precision report now contains additional details
    "precision": {
        "min": int,
        "max": int,
        "mean": float,
        "var": float,
        "std": float,
        "sample_size": int,
        "margin_of_error": float,
        "confidence_level": float
    }
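
To show how the fields in the precision report relate to one another, here is a sketch that estimates precision from a sample instead of computing it over every value. It rests on assumptions not taken from the release: the helper names, the digit-counting rule (sign, decimal point, and leading zeros are ignored), the default sampling ratio, and the normal-approximation margin of error are all illustrative.

```python
import math
import random
import statistics


def count_digits(text):
    """Simplified significant-digit count: drop non-digits and leading zeros."""
    digits = "".join(ch for ch in text if ch.isdigit())
    return len(digits.lstrip("0"))


def estimate_precision(values, sample_ratio=0.5, confidence_level=0.999, seed=0):
    """Hypothetical sketch: precision statistics estimated from a sample."""
    rng = random.Random(seed)
    sample_size = max(1, int(len(values) * sample_ratio))
    counts = [count_digits(v) for v in rng.sample(values, sample_size)]
    # Sample variance needs at least two points.
    var = float(statistics.variance(counts)) if sample_size > 1 else 0.0
    std = math.sqrt(var)
    # Two-sided z-score for the requested confidence level.
    z = statistics.NormalDist().inv_cdf(1 - (1 - confidence_level) / 2)
    return {
        "min": min(counts),
        "max": max(counts),
        "mean": statistics.fmean(counts),
        "var": var,
        "std": std,
        "sample_size": sample_size,
        "margin_of_error": z * std / math.sqrt(sample_size),
        "confidence_level": confidence_level,
    }


if __name__ == "__main__":
    print(estimate_precision(["1.5", "12.25", "0.003", "7.125"]))
```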

Bug fixes

  • Fixed error in merging options
  • Fixed issue related to merging DateTimeColumns
  • Fixed multiprocessing on OSX
  • Fixed row calculations if min_true_samples is greater than zero

v0.4.2

06 Apr 18:51
f766ce7

Runtime Changes

This update reduces runtime by 50% on average.

Profiler

  • Add support for HistogramOptions
  • Add multiprocessing support
  • Reduced runtime for shuffling indices
  • Vectorized precision function
  • Improved unique set & vocab merging
  • By default histogram only runs 'auto' bin edge detection

Data

  • Add a length attribute to the data class: data.length() or len(data)

Report

  • Added optional omit_keys to the report options, which removes the listed keys from the final report
  • Added row_has_null_count (global): the number of rows containing one or more nulls
  • Added row_is_null_count (global): the number of rows that are entirely null
  • Rename total_samples (global) -> row_count
  • Rename label BACKGROUND -> UNKNOWN (column)
  • Removed covariance (global)
  • Removed data_classification (global)
  • Removed data_label_probability (column)
  • Removed median (column)
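
The difference between row_has_null_count and row_is_null_count can be seen in a small sketch. The function null_row_counts and its set of null markers are assumptions for illustration; the profiler's own null detection may differ.

```python
def null_row_counts(rows, null_values=(None, "", "nan")):
    """Hypothetical helper: count rows with any null and rows that are all null."""
    has_null = 0
    is_null = 0
    for row in rows:
        flags = [cell in null_values for cell in row]
        if any(flags):
            has_null += 1
        # An empty row is not counted as an entirely null row.
        if flags and all(flags):
            is_null += 1
    return {
        "row_count": len(rows),
        "row_has_null_count": has_null,
        "row_is_null_count": is_null,
    }


if __name__ == "__main__":
    rows = [[1, "a"], [None, "b"], [None, ""], [2, "c"]]
    print(null_row_counts(rows))
```

Every entirely null row also counts toward row_has_null_count, so row_is_null_count can never exceed it.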

Bug fixes

  • Accurate null count and total_samples on profile updates
  • Each column now receives the same sampled indices, enabling row_is_null_count

v0.4.1

25 Mar 16:34
d1be6d8

BUGFIX: Enables running data profiler without the TensorFlow library

v0.4.0

25 Mar 03:04
f76ed25

New Features

  • Reduce profiling memory usage by ~50%
  • Reduce profiling runtime by >75%
  • Improve delimiter and header detection in delimited (CSV) data
  • Add progress notifications for profiling

Fixes

  • Adds warnings for sampling
  • Selects proper options on profile mergers
  • Fix repeated tensorflow warnings
  • Thresholds input for large CSV files by bytes or lines (whichever is smaller)
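
The "bytes or lines, whichever is smaller" threshold can be sketched as a reader that stops at the first limit it reaches. The helper read_sample and its default limits are hypothetical and not taken from the library.

```python
import io


def read_sample(stream, max_lines=1000, max_bytes=65536):
    """Hypothetical helper: read lines until either limit would be exceeded."""
    lines = []
    used_bytes = 0
    for line in stream:
        size = len(line.encode("utf-8"))
        # Stop at whichever limit is reached first.
        if len(lines) >= max_lines or used_bytes + size > max_bytes:
            break
        lines.append(line)
        used_bytes += size
    return lines


if __name__ == "__main__":
    text = "a,b\n" * 10  # ten 4-byte lines
    print(len(read_sample(io.StringIO(text), max_lines=3, max_bytes=100)))
    print(len(read_sample(io.StringIO(text), max_lines=100, max_bytes=10)))
```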

v0.3.5

16 Mar 21:06
f63cad6
  • Enhancement: 50-90% reduced profiling time
    • Improved methods for unique row and null-in-row prediction(s)
  • Enhancement: Users can now select header row for delimited files
  • Bug Fix: Added header detection on delimited files with only strings

v0.3.4

12 Mar 19:28
5e5f64e
  • Significantly improved header detection on structured datasets
  • Updated model
    • New entities: DATE, TIME, US_STATE, DRIVERS_LICENSE
    • Removed entities: INTEGER_BIG
  • New [easier] way to extend labels to the model
  • ML requirements are now installed separately via pip install dataprofiler[ml] (required for the labeler)
  • Profiler & Labeler only load TensorFlow when necessary
  • Minor bug fixes & improved testing

v0.3.2

04 Mar 05:09
7c05449
  • TensorFlow only runs when a labeler executes
  • Improved CSV detection
  • 2-8x memory reduction in profiling
  • Various bug fixes

v0.3.1

23 Feb 20:49
93a9b6e
  • Dramatically reduced memory requirements for the data labeler
  • Renamed the module: data_profiler -> dataprofiler
  • Improved delimiter (CSV) file detection

v0.3.0

11 Feb 20:01
07e8b3b

Initial Data Profiler release.
Load a file. Extract profile. Save output.
See README.md for full information regarding release.