[[TOC]]
ncqc is a Python library for performing quality control on netCDF files. It was developed by TU Delft, within the framework of the Ruisdael Observatory for atmospheric science. The library is built around the QualityControl class, to which a netCDF file and a configuration can be added in order to perform quality control checks.
Installing ncqc from source requires two steps: building a wheel file, and then using that wheel to install the library. This is done by running the following commands:
pip install wheel
pip install setuptools
pip install twine
python setup.py bdist_wheel
pip install ./dist/ncqc-0.1.0-py3-none-any.whl
Performing quality control checks involves the following steps:
- Creating a configuration file or dictionary
- Setting up a QualityControl object
- Running checks with a QualityControl object
- Getting a report from a QualityControl object
To remove the manual labor from setting up the configuration for the QualityControl object, there are two methods, create_config_dict_from_yaml and create_config_dict_from_dict, which create the base for a configuration dictionary by parsing an existing .yaml file or dictionary respectively. By specifying the names of the groups containing the dimensions, variables, and global attributes via the parameters dimensions_name, variables_name, and global_attributes_name, these fields get added to the output dictionary with the structure for specifying which checks to perform already set up. The types of all the values are given, but the specific values still need to be filled in. Below is an example of how this can be used; the respective example inputs and outputs are given further below.
Method call:
from ncqc.create_config import create_config_dict_from_yaml, create_config_dict_from_dict
# with a yaml file
output_dict = create_config_dict_from_yaml(
    input_dict=path_to_yaml_file,
    dimensions_name="dims",
    variables_name="vars",
    global_attributes_name="gl_attrs"
)
# with a dictionary
output_dict = create_config_dict_from_dict(
    input_dict=config_dictionary,
    dimensions_name="dims",
    variables_name="vars",
    global_attributes_name="gl_attrs"
)
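Because the returned template only contains type placeholders, it can help to check which values still need to be filled in before using the dictionary. The following is a minimal sketch in plain Python, not part of ncqc: the helper find_unfilled and the PLACEHOLDERS set are hypothetical names, and the placeholder strings are assumed to be the ones shown in the example output ('bool', 'int', 'List[int]').

```python
from typing import Iterator

# Placeholder strings assumed from the example output; not an ncqc constant.
PLACEHOLDERS = {"bool", "int", "List[int]"}


def find_unfilled(config: dict, prefix: str = "") -> Iterator[str]:
    """Yield the dotted path of every value that is still a type placeholder."""
    for key, value in config.items():
        path = f"{prefix}.{key}" if prefix else str(key)
        if isinstance(value, dict):
            # Descend into nested check definitions.
            yield from find_unfilled(value, path)
        elif isinstance(value, str) and value in PLACEHOLDERS:
            yield path


# A small template in the shape of the example output dictionary.
template = {
    "variables": {
        "var1": {
            "existence_check": "bool",
            "data_boundaries_check": {"lower_bound": "int", "upper_bound": "int"},
        }
    }
}

# Fill in some values by hand, leaving upper_bound untouched.
template["variables"]["var1"]["existence_check"] = True
template["variables"]["var1"]["data_boundaries_check"]["lower_bound"] = 0

unfilled = list(find_unfilled(template))
print(unfilled)
```

A string value that happens to equal one of the placeholder strings would be reported as unfilled, so this is only a rough aid.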
Some input dictionaries might have variables where the name is not at the top layer, for example with this structure:
fields:
  '01':
    dimensions:
      - time
    attrs:
      units: 'mm/h'
      long_name: 'Rain intensity'
      short_name: 'rain_intensity'
Setting variables_name=fields would result in 01 getting added as the name of the variable to check. For names at a deeper layer there is other_variable_name_paths, which takes a list of lists, each of which specifies the path to a variable's name. other_variable_name_paths=[['fields', 'attrs', 'short_name']] will cause it to loop over all items in fields and then access each name by following the remainder of the path, so here this would be input_dict['fields']['01']['attrs']['short_name'], resulting in the following output:
'variables': {
    'rain_intensity': {
        'existence_check': 'bool',
        'emptiness_check': 'bool',
        'data_boundaries_check': {
            'lower_bound': 'int',
            'upper_bound': 'int'
        },
        'data_points_amount_check': {
            'minimum': 'int'
        },
        'adjacent_values_difference_check': {
            'over_which_dimension': 'List[int]',
            'maximum_difference': 'List[int]'
        },
        'consecutive_identical_values_check': {
            'maximum': 'int'
        }
    }
}
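The path lookup described above can be illustrated with a short sketch. This is not ncqc's implementation: resolve_variable_names is a hypothetical helper that only mirrors the described behaviour, namely looping over all items in the first key of the path and following the remaining keys inside each item.

```python
from typing import Any, Dict, List


def resolve_variable_names(input_dict: Dict[str, Any], path: List[str]) -> List[str]:
    """Collect one name per item under path[0] by following path[1:] inside each item."""
    group, *remainder = path
    names = []
    for item in input_dict[group].values():
        value = item
        # Walk down the remaining keys, e.g. 'attrs' and then 'short_name'.
        for key in remainder:
            value = value[key]
        names.append(value)
    return names


input_dict = {
    "fields": {
        "01": {
            "dimensions": ["time"],
            "attrs": {
                "units": "mm/h",
                "long_name": "Rain intensity",
                "short_name": "rain_intensity",
            },
        }
    }
}

names = resolve_variable_names(input_dict, ["fields", "attrs", "short_name"])
print(names)
```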
The following methods can be used with a QualityControl object to set up the quality control:
- add_qc_checks_conf/add_qc_checks_dict: adds which dimensions, variables, and global attributes should be checked, and with which checks, by passing a .yaml file or a dictionary
- replace_qc_checks_conf/replace_qc_checks_dict: similar to the previous two functions, but removes any previously added checks
- load_netcdf: stores the netCDF file at the given path in the QualityControl object
Code example:
from ncqc.QCnetCDF import QualityControl
qc_obj = QualityControl()
qc_obj.add_qc_checks_dict(config_dictionary)
# replace the previously added checks with those from a yaml file
qc_obj.replace_qc_checks_conf(path_to_yaml_file)
qc_obj.load_netcdf(nc_path)
These are the quality control checks that can be performed on a QualityControl object with a configuration set up and a netCDF file loaded:
- file_size_check: logs an error if the size of the provided netCDF file falls outside of the specified bounds
- existence_check: logs an error for each dimension, variable, or global attribute which according to the configuration should be present in the netCDF file but is not, and logs info for each category on how many of the checked fields exist
- emptiness_check: logs an error for each variable or global attribute which has one or more missing values, in the case of variables also specifying how many data points are empty, and logs info for each category on how many of the checked fields are fully populated
- data_points_amount_check: logs an error for each variable which has fewer data points than the specified minimum for that variable
- data_boundaries_check: logs an error for each data point which falls outside of the specified bounds for that variable
- consecutive_identical_values_check: logs an error for each variable which has more consecutive identical values than the specified maximum for that variable
- adjacent_values_difference_check: logs an error if the difference between two adjacent data points is greater than the specified maximum difference for that variable

Additionally, calling the method perform_all_checks will run all the previously mentioned checks in the order of that list.
Code example:
# Separately
qc_obj.data_boundaries_check()
qc_obj.file_size_check()
# Chained
qc_obj.existence_check().emptiness_check()
# All checks
qc_obj.perform_all_checks()
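To make the value-based checks more concrete, here is a minimal sketch of the conditions they test, written for a plain one-dimensional Python list. It is not how ncqc implements them (ncqc works on netCDF variables and logs errors); the function names out_of_bounds, longest_identical_run, and max_adjacent_difference are hypothetical.

```python
from typing import List, Sequence


def out_of_bounds(values: Sequence[float], lower: float, upper: float) -> List[int]:
    """Return the indices of values that fall outside [lower, upper]."""
    return [i for i, v in enumerate(values) if v < lower or v > upper]


def longest_identical_run(values: Sequence[float]) -> int:
    """Return the length of the longest run of consecutive identical values."""
    longest = current = 0
    previous = object()  # sentinel that never equals a data value
    for v in values:
        current = current + 1 if v == previous else 1
        longest = max(longest, current)
        previous = v
    return longest


def max_adjacent_difference(values: Sequence[float]) -> float:
    """Return the largest absolute difference between neighbouring values."""
    return max((abs(b - a) for a, b in zip(values, values[1:])), default=0.0)


# Made-up rain intensity samples, for illustration only.
rain = [0.0, 0.0, 0.0, 1.5, 12.0, -1.0]

bad_indices = out_of_bounds(rain, lower=0.0, upper=10.0)
run_length = longest_identical_run(rain)
biggest_jump = max_adjacent_difference(rain)
print(bad_indices, run_length, biggest_jump)
```

In this sketch, a check would flag the data when bad_indices is non-empty, when run_length exceeds the configured maximum, or when biggest_jump exceeds the configured maximum difference.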
Once quality control checks have been performed, it is possible to get a report by accessing the LoggerQC object of the QualityControl object:
- create_report: creates a dictionary containing the logged errors, warnings, and info, in addition to the date and time. This dictionary gets stored in the logger's list of reports. This method also automatically clears the logger's errors, warnings, and info, so future reports won't contain old logs. create_report takes an optional boolean parameter get_all_reports; if it is true, the method returns the list of all reports, otherwise it returns only the most recently created report.
Code example:
# Create a report and access it
latest_report = qc_obj.create_report()
qc_obj.perform_all_checks()
# Create a new report and access all reports
all_reports = qc_obj.create_report(get_all_reports=True)
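The report behaviour described above (store each report, clear the logs, optionally return all reports) can be sketched with a small stand-in class. This is only an illustration under assumptions: SketchLogger is hypothetical and is not ncqc's LoggerQC, and the dictionary keys used here ('report_date', 'errors', 'warnings', 'info') are guesses rather than the library's actual report format.

```python
from datetime import datetime
from typing import Dict, List, Union


class SketchLogger:
    """Hypothetical stand-in that mimics the described create_report behaviour."""

    def __init__(self) -> None:
        self.errors: List[str] = []
        self.warnings: List[str] = []
        self.info: List[str] = []
        self.reports: List[Dict] = []

    def create_report(self, get_all_reports: bool = False) -> Union[Dict, List[Dict]]:
        # Snapshot the current logs together with the date and time.
        report = {
            "report_date": datetime.now().isoformat(),
            "errors": list(self.errors),
            "warnings": list(self.warnings),
            "info": list(self.info),
        }
        self.reports.append(report)
        # Clear the logs so the next report starts empty.
        self.errors.clear()
        self.warnings.clear()
        self.info.clear()
        return self.reports if get_all_reports else report


logger = SketchLogger()
logger.errors.append("variable 'var1' is missing")
first = logger.create_report()

logger.info.append("1/1 checked variables exist")
everything = logger.create_report(get_all_reports=True)
print(len(everything))
```

Because the logs are cleared after each report, the second report in this sketch contains only the info message logged after the first report was created.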
(add something about how to contribute)
ncqc is developed in the context of the Ruisdael Observatory by
- Vasil Chirov
- Mels Lutgerink
- Ella Milinovic
- Noky Soekarman
- Jesse Vleeschdraager
Example yaml file (input):
dims:
  dim1:
    # ...
  dim2:
    # ...
vars:
  var1:
    # ...
gl_attrs:
  glattr1: 'text1'
  glattr2: 'text2'
Example config dictionary (input):
{
    'dims': {
        'dim1': 'value1',
        'dim2': 'value2'
    },
    'vars': {
        'var1': 'value1'
    },
    'gl_attrs': {
        'glattr1': 'value1',
        'glattr2': 'value2'
    }
}
Example output dictionary:
{
    'dimensions': {
        'dim1': {'existence_check': 'bool'},
        'dim2': {'existence_check': 'bool'}
    },
    'variables': {
        'var1': {
            'existence_check': 'bool',
            'emptiness_check': 'bool',
            'data_boundaries_check': {
                'lower_bound': 'int',
                'upper_bound': 'int'
            },
            'data_points_amount_check': {
                'minimum': 'int'
            },
            'adjacent_values_difference_check': {
                'over_which_dimension': 'List[int]',
                'maximum_difference': 'List[int]'
            },
            'consecutive_identical_values_check': {
                'maximum': 'int'
            }
        }
    },
    'global_attributes': {
        'glattr1': {
            'existence_check': 'bool',
            'emptiness_check': 'bool'
        },
        'glattr2': {
            'existence_check': 'bool',
            'emptiness_check': 'bool'
        }
    },
    'file_size': {
        'lower_bound': 'int',
        'upper_bound': 'int'
    }
}
GPLv3. See LICENSE