Releases: capitalone/DataProfiler
v0.4.3
Runtime Changes
Migrating from v0.4.2 to v0.4.3 should result in a 30-90% reduction in profiling time, depending largely on system resources and data size.
Notes
- Remove requirement for tensorflow-addons
- Library now works with tensorflow nightly (Python 3.9)
- Added example on generating a new data labeler
Profiler
- Multiprocessing data preprocessing
- Improved histogram accuracy
- Reduced histogram generation runtime
- Option to set the bin count for histogram
- Expanded precision and switch to precision estimation (as opposed to exact calculations)
- Limit pool size based on cpu and memory limitations
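The pool-size limit above can be pictured with a small sketch. This is not the library's implementation: the function name limit_pool_size, its parameters, and the idea of a fixed per-worker memory estimate are all assumptions made for illustration.

```python
import os


def limit_pool_size(requested, available_bytes, bytes_per_worker, cpu_count=None):
    """Return a worker count capped by CPU cores and by memory headroom.

    Hypothetical helper: the caller supplies the available memory and a
    rough per-worker memory estimate, since the standard library has no
    portable way to query free memory.
    """
    if cpu_count is None:
        cpu_count = os.cpu_count() or 1
    # How many workers fit in memory if each needs about bytes_per_worker.
    by_memory = available_bytes // max(bytes_per_worker, 1)
    # Never drop below one worker; never exceed any of the three limits.
    return max(1, min(requested, cpu_count, by_memory))


GIB = 1024 ** 3
workers = limit_pool_size(requested=8, available_bytes=4 * GIB,
                          bytes_per_worker=GIB, cpu_count=16)
```

Here memory is the binding limit, so only four of the eight requested workers would be started.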
Data
- Improved JSON detection method
  - Option (default) pulls metadata and data separately (data.meta and data.data)
    - data.meta would be part of the JSON which contains no records
    - data.data would be part of the JSON which contains records
  - Added option to select keys which represent records
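As a rough illustration of the metadata/record split described above, the sketch below separates a parsed JSON object into a "no records" part and a "records" part. The function name split_metadata_and_records, the record_keys parameter, and the "list of objects" heuristic are assumptions for this example only, not the library's API or detection logic.

```python
import json


def split_metadata_and_records(payload, record_keys=None):
    """Split a JSON object into (metadata, records).

    Hypothetical helper: when record_keys is not given, any key whose
    value is a non-empty list of objects is treated as holding records.
    """
    if record_keys is None:
        record_keys = [
            key for key, value in payload.items()
            if isinstance(value, list) and value
            and all(isinstance(item, dict) for item in value)
        ]
    # Everything that is not a record key is kept as metadata.
    meta = {k: v for k, v in payload.items() if k not in record_keys}
    records = []
    for key in record_keys:
        records.extend(payload.get(key, []))
    return meta, records


raw = '{"source": "demo", "count": 2, "rows": [{"id": 1}, {"id": 2}]}'
meta, records = split_metadata_and_records(json.loads(raw))
```

Passing record_keys explicitly mirrors the idea of selecting which keys represent records instead of relying on a heuristic.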
Report
- Precision report now contains additional details
"precision": {
'min': int,
'max': int,
'mean': float,
'var': float,
'std': float,
'sample_size': int,
'margin_of_error': float,
'confidence_level': float
},
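To show how fields like these could fit together, here is a minimal sketch that summarizes a sample of per-value precisions. It assumes a normal-approximation margin of error (z * std / sqrt(n)) and sample variance; the release notes do not state the formula the library uses, and summarize_precision and its default confidence level are hypothetical.

```python
import math
import statistics


def summarize_precision(precisions, confidence_level=0.999):
    """Summarize sampled precisions in the shape of the report above.

    Assumption: margin of error uses the normal approximation.
    """
    n = len(precisions)
    var = statistics.variance(precisions) if n > 1 else 0.0
    std = math.sqrt(var)
    # Two-sided z-score for the requested confidence level.
    z = statistics.NormalDist().inv_cdf(1 - (1 - confidence_level) / 2)
    return {
        "min": min(precisions),
        "max": max(precisions),
        "mean": statistics.fmean(precisions),
        "var": var,
        "std": std,
        "sample_size": n,
        "margin_of_error": z * std / math.sqrt(n),
        "confidence_level": confidence_level,
    }


summary = summarize_precision([2, 4, 4, 6], confidence_level=0.95)
```

Estimating from a sample trades exactness for speed, which is why the report carries the sample size and margin of error alongside the statistics.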
Bug fixes
- Fixed error in merging options
- Fixed issue related to merging DateTimeColumns
- Fixed multiprocessing on OSX
- Fixed row calculations if min_true_samples is greater than zero
v0.4.2
Runtime Changes
Notes
This update reduces runtime by 50% on average.
Profiler
- Add support for HistogramOptions
- Add multiprocessing support
- Reduced runtime for shuffling indices
- Vectorized precision function
- Improved unique set & vocab merging
- By default histogram only runs 'auto' bin edge detection
Data
- Add length attribute to the data class: data.length() or len(data)
Report
- Added optional omit_keys to the report options function, remove keys from the final report
- Added row_has_null_count (global), one or more nulls in the row
- Added row_is_null_count (global), the entire row is null
- Rename total_samples (global) -> row_count
- Rename label BACKGROUND -> UNKNOWN (column)
- Removed covariance (global)
- Removed data_classification (global)
- Removed data_label_probability (column)
- Removed median (column)
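The two new row-level counts can be illustrated with a short sketch. It follows the definitions given in the list (one or more nulls in the row versus the entire row being null); the function name count_null_rows and the choice of which values count as null are assumptions, not the library's behavior.

```python
def count_null_rows(rows, null_values=(None, "")):
    """Count rows containing any null and rows that are entirely null.

    Hypothetical helper: null_values is an illustrative choice of
    what counts as a null cell.
    """
    row_has_null_count = 0
    row_is_null_count = 0
    for row in rows:
        flags = [value in null_values for value in row]
        if any(flags):
            row_has_null_count += 1
        # An empty row is not counted as "entirely null".
        if flags and all(flags):
            row_is_null_count += 1
    return {
        "row_has_null_count": row_has_null_count,
        "row_is_null_count": row_is_null_count,
    }


rows = [[1, "a"], [None, "b"], [None, ""], [2, None]]
counts = count_null_rows(rows)
```

Counting per row only works if every column is evaluated on the same rows, which is why consistent sampled indices across columns matter for these statistics.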
Bug fixes
- Accurate null count and total_samples on profile updates
- Each column now receives the same sampled indices, enabling row_is_null_count
v0.4.1
BUGFIX: Enables running data profiler without the TensorFlow library
v0.4.0
New Features
- Reduce profiling memory usage by ~50%
- Reduce profiling runtime by >75%
- Improve delimiter and header detection in delimited (CSV) data
- Add progress notifications for profiling
Fixes
- Adds warnings for sampling
- Selects proper options on profile mergers
- Fix repeated tensorflow warnings
- Thresholds input for large CSV files by bytes or lines (whichever is smaller)
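The "bytes or lines, whichever is smaller" threshold can be sketched as a reader that stops as soon as either limit is reached. This is an illustration only: read_sample, its default limits, and the UTF-8 size accounting are assumptions, not the library's code.

```python
import io


def read_sample(stream, max_lines=1000, max_bytes=1_000_000):
    """Read lines from a text stream until a line or byte limit is hit.

    Hypothetical helper: whichever limit is reached first ends the read,
    so the sample never exceeds either threshold.
    """
    lines = []
    used_bytes = 0
    for line in stream:
        size = len(line.encode("utf-8"))
        if len(lines) >= max_lines or used_bytes + size > max_bytes:
            break
        lines.append(line)
        used_bytes += size
    return lines


text = "a,b\n" * 10  # ten 4-byte lines
by_lines = read_sample(io.StringIO(text), max_lines=3, max_bytes=1000)
by_bytes = read_sample(io.StringIO(text), max_lines=100, max_bytes=10)
```

In the first call the line limit ends the read; in the second the byte limit does.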
v0.3.5
- Enhancement: 50-90% reduced profiling time
- Improved methods for unique row and null-in-row prediction(s)
- Enhancement: Users can now select header row for delimited files
- Bug Fix: Added header detection on delimited files with only strings
v0.3.4
- Significantly improved header detection on structured datasets
- Updated model
  - New entities: DATE, TIME, US_STATE, DRIVERS_LICENSE
  - Removed entities: INTEGER_BIG
- New [easier] way to extend labels to the model
- ML requirements installed separately via pip install dataprofiler[ml] (required for labeler)
- Profiler & Labeler only load TensorFlow when necessary
- Minor bug fixes & improved testing
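Loading a heavy optional dependency only when it is needed is commonly done with a deferred import. The sketch below shows that general pattern; load_optional is a hypothetical helper and is not taken from the library's source. Only the pip install dataprofiler[ml] extra comes from the notes above.

```python
import importlib


def load_optional(module_name, extra):
    """Import a module on first use, with an install hint if it is missing.

    Hypothetical helper illustrating a deferred import of an optional
    dependency such as TensorFlow.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"{module_name} is required for this feature; "
            f"install it with: pip install dataprofiler[{extra}]"
        ) from exc


# The standard-library json module stands in for a heavy dependency here.
json_module = load_optional("json", "ml")
```

Because the import happens inside the function, code paths that never call it do not pay the import cost or require the extra to be installed.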
v0.3.2
- TensorFlow only runs when a labeler executes
- Improved CSV detection
- 2-8x memory reduction in profiling
- Various bug fixes
v0.3.1
- Dramatically reduced memory requirements for the data labeler
- Renamed the module: data_profiler -> dataprofiler
- Improved delimiter (CSV) file detection
v0.3.0
Initial Data Profiler release.
Load a file. Extract profile. Save output.
See README.md for full information regarding release.