22 Apr 19:15

2238d32

v0.4.3

Runtime Changes

Migrating from v0.4.2 to v0.4.3 should reduce profiling time by 30-90%, depending largely on system resources and data size.

Notes

Removed the requirement for tensorflow-addons
Library now works with tensorflow nightly (Python 3.9)
Added example on generating a new data labeler

Profiler

Multiprocessing data preprocessing
Improved histogram accuracy
Reduced histogram generation runtime
Option to set the bin count for histogram
Expanded precision details and switched to precision estimation (as opposed to exact calculation)
Limited pool size based on CPU and memory limitations
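
One way to picture the pool-size limit is the sketch below. It is not the library's implementation: the function name suggest_pool_size, the copies_per_worker factor, and the memory figures are all hypothetical, chosen only to show a worker count bounded by both core count and free memory.

```python
import os


def suggest_pool_size(data_bytes, available_bytes, cpu_count=None, copies_per_worker=2):
    """Return a worker count bounded by both CPU cores and free memory.

    Hypothetical helper: assumes each worker may hold `copies_per_worker`
    copies of the data while it preprocesses its share.
    """
    if cpu_count is None:
        cpu_count = os.cpu_count() or 1
    # Estimated memory footprint of a single worker (at least one byte).
    per_worker = max(1, data_bytes * copies_per_worker)
    # How many such workers fit into the memory that is currently free.
    by_memory = available_bytes // per_worker
    # Never go below one worker and never above the core count.
    return max(1, min(cpu_count, by_memory))


if __name__ == "__main__":
    # 100 MB of data, 1 GB free, 8 cores: memory is the tighter bound.
    print(suggest_pool_size(100 * 2**20, 2**30, cpu_count=8))
```

When memory is plentiful the core count decides the result; when the data barely fits, the function falls back to a single worker.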

Data

Improved JSON detection method
- Option (default) pulls metadata and data separately (data.meta and data.data)
- data.meta is the part of the JSON that contains no records
- data.data is the part of the JSON that contains records
- Added option to select keys which represent records
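
The metadata/records split described above can be illustrated with a small standalone sketch. This is not the detection method the library uses: split_json and its record_keys parameter are hypothetical names, and the rule "a non-empty list of objects is a record set" is an assumption made for the example.

```python
import json


def split_json(payload, record_keys=None):
    """Split a parsed JSON object into (meta, data).

    Hypothetical helper: when `record_keys` is not given, any key whose
    value is a non-empty list of objects is assumed to hold records.
    """
    if record_keys is None:
        record_keys = [
            key for key, value in payload.items()
            if isinstance(value, list) and value
            and all(isinstance(item, dict) for item in value)
        ]
    # Everything that is not a record set is treated as metadata.
    meta = {k: v for k, v in payload.items() if k not in record_keys}
    # Records from all selected keys are flattened into one list.
    data = [row for key in record_keys for row in payload.get(key, [])]
    return meta, data


raw = '{"source": "sensor-a", "version": 2, "readings": [{"t": 1, "v": 0.5}, {"t": 2, "v": 0.7}]}'
meta, data = split_json(json.loads(raw))
print(meta)
print(data)
```

Passing record_keys explicitly mirrors the option to select which keys represent records, and overrides the automatic guess.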

Report

Precision report now contains additional details

'precision': {
    'min': int,
    'max': int,
    'mean': float,
    'var': float,
    'std': float,
    'sample_size': int,
    'margin_of_error': float,
    'confidence_level': float
},
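
To show how fields like these can come out of an estimate rather than an exact pass over every value, here is a minimal sketch. It is not the profiler's code: significant_digits, estimate_precision, the digit-counting rule, and the default confidence level of 0.999 are all assumptions for illustration. The margin of error uses the usual normal approximation, z * std / sqrt(n).

```python
import math
import statistics


def significant_digits(text):
    """Count digits after dropping the sign, decimal point and leading zeros."""
    digits = text.strip().lstrip("+-").replace(".", "")
    return len(digits.lstrip("0"))


def estimate_precision(values, confidence_level=0.999):
    """Summarize digit precision over a sample of numeric strings (illustrative)."""
    digits = [significant_digits(v) for v in values]
    n = len(digits)
    # Sample variance needs at least two points; fall back to zero otherwise.
    var = statistics.variance(digits) if n > 1 else 0.0
    std = math.sqrt(var)
    # Two-sided z-score for the requested confidence level.
    z = statistics.NormalDist().inv_cdf((1 + confidence_level) / 2)
    return {
        "min": min(digits),
        "max": max(digits),
        "mean": statistics.mean(digits),
        "var": var,
        "std": std,
        "sample_size": n,
        "margin_of_error": z * std / math.sqrt(n),
        "confidence_level": confidence_level,
    }


report = estimate_precision(["1.5", "22.75", "0.003", "100"])
print(report)
```

A larger sample shrinks margin_of_error, which is the trade-off an estimate makes against the runtime of an exact calculation.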

Bug fixes

Fixed error in merging options
Fixed issue related to merging DateTimeColumns
Fixed multiprocessing on OSX
Fixed row calculations if min_true_samples is greater than zero

06 Apr 18:51

lettergram

0.4.2

f766ce7

v0.4.2

Runtime Changes

Notes

This update reduces runtime by 50% on average.

Profiler

Add support for HistogramOptions
Add multiprocessing support
Reduced runtime for shuffling indices
Vectorized precision function
Improved unique set & vocab merging
By default histogram only runs 'auto' bin edge detection

Data

Added a length attribute to the data class: data.length() or len(data)

Report

Added optional omit_keys to the report options function to remove keys from the final report
Added row_has_null_count (global), one or more nulls in the row
Added row_is_null_count (global), the entire row is null
Rename total_samples (global) -> row_count
Rename label BACKGROUND -> UNKNOWN (column)
Removed covariance (global)
Removed data_classification (global)
Removed data_label_probability (column)
Removed median (column)
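
The two new row-level counters in the list above differ only in "any" versus "all". The sketch below illustrates that distinction on plain tuples; it is not the profiler's implementation, and row_null_counts is a hypothetical helper that treats None as the only null value.

```python
def row_null_counts(rows):
    """Count rows with at least one null and rows that are entirely null.

    Hypothetical helper: None is the only value treated as null here.
    """
    # A row "has null" when any of its cells is null.
    has_null = sum(1 for row in rows if any(v is None for v in row))
    # A row "is null" when it is non-empty and every cell is null.
    is_null = sum(1 for row in rows if row and all(v is None for v in row))
    return {
        "row_count": len(rows),
        "row_has_null_count": has_null,
        "row_is_null_count": is_null,
    }


rows = [(1, "a"), (None, "b"), (None, None), (2, None)]
counts = row_null_counts(rows)
print(counts)
```

Both counters need whole rows, which is why every column must be sampled with the same indices for them to be meaningful.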

Bug fixes

Accurate null count and total_samples on profile updates
Each column now receives the same sampled indices, enabling row_is_null_count

25 Mar 16:34

lettergram

0.4.1

d1be6d8

v0.4.1

BUGFIX: Enables running data profiler without the TensorFlow library

v0.4.0

New Features

Reduce profiling memory usage by ~50%
Reduce profiling runtime by >75%
Improve delimiter and header detection in delimited (CSV) data
Add progress notifications for profiling

Fixes

Adds warnings for sampling
Selects proper options on profile mergers
Fix repeated tensorflow warnings
Thresholds input for large CSV files by bytes or lines (whichever is smaller)

25 Mar 03:04

lettergram

0.4.0

f76ed25

v0.4.0

New Features

Reduce profiling memory usage by ~50%
Reduce profiling runtime by >75%
Improve delimiter and header detection in delimited (CSV) data
Add progress notifications for profiling

Fixes

Adds warnings for sampling
Selects proper options on profile mergers
Fix repeated tensorflow warnings
Thresholds input for large CSV files by bytes or lines (whichever is smaller)
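
The "bytes or lines, whichever is smaller" threshold can be sketched as a reader that stops at the first limit it hits. This is an illustration only, not the library's loader: read_head and its max_bytes and max_lines parameters are hypothetical.

```python
import io


def read_head(stream, max_bytes, max_lines):
    """Read lines until either the byte budget or the line budget is reached.

    Hypothetical helper: whichever limit is hit first ends the read.
    """
    lines, used = [], 0
    for line in stream:
        size = len(line.encode("utf-8"))
        # Stop before exceeding either budget.
        if len(lines) >= max_lines or used + size > max_bytes:
            break
        lines.append(line)
        used += size
    return lines


csv_text = "a,b\n1,2\n3,4\n5,6\n"
# The line limit is the tighter bound here.
print(read_head(io.StringIO(csv_text), max_bytes=100, max_lines=2))
# The byte limit is the tighter bound here.
print(read_head(io.StringIO(csv_text), max_bytes=9, max_lines=10))
```

Capping the sample this way keeps detection work on a large CSV file bounded regardless of whether its rows are short or very wide.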

16 Mar 21:06

lettergram

0.3.5

f63cad6

v0.3.5

Enhancement: 50-90% reduced profiling time
- Improved methods for unique row and null-in-row prediction(s)
Enhancement: Users can now select header row for delimited files
Bug Fix: Added header detection on delimited files with only strings

12 Mar 19:28

lettergram

0.3.4

5e5f64e

v0.3.4

Significantly improved header detection on structured datasets
Updated model
- New entities: DATE, TIME, US_STATE, DRIVERS_LICENSE
- Removed entities: INTEGER_BIG
New, easier way to extend labels to the model
ML requirements are now installed separately via pip install dataprofiler[ml] (required for the labeler)
Profiler & Labeler only load TensorFlow when necessary
Minor bug fixes & improved testing

04 Mar 05:09

lettergram

0.3.2

7c05449

v0.3.2

TensorFlow only runs when a labeler executes
Improved CSV detection
2-8x memory reduction in profiling
Various bug fixes

23 Feb 20:49

lettergram

0.3.1

93a9b6e

v0.3.1

Dramatically reduced memory requirements for the data labeler
Renamed the module: data_profiler -> dataprofiler
Improved delimiter (CSV) file detection

11 Feb 20:01

lettergram

0.3.0

07e8b3b

v0.3.0

Initial Data Profiler release.
Load a file. Extract profile. Save output.
See README.md for full information regarding release.

Releases: capitalone/DataProfiler
