Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

inconsistent in data rendering and in QARTOD flags creation #106

Open
leilabbb opened this issue Mar 22, 2024 · 17 comments
Open

inconsistent in data rendering and in QARTOD flags creation #106

leilabbb opened this issue Mar 22, 2024 · 17 comments

Comments

@leilabbb
Copy link

I run into issues when using the qartod package to QC the Glider DAC files. I have seen flags being applied inconsistently, which raises concern. Before I file an issue about that, I run into a new issue when reading the data that I am exposing here for some feedback.

Issues: (1) The salinity data in the netCDF file shows different results when read with different packages. (2) The spike_test and rate_of_change_test flags are questionably using the missing-flag and the suspect-flag on existing or non-erroneous data.

Description: The notebook is a walk-through process to help reproduce the above issues.
Files needed to run the notebook:
netCDF file
configuration file
notebook file

Debugging Info Issue (1):

When using netCDF4 to read the salinity array, all values show as nan: [this is the method I use in the Glider DAC QC process] _

with open(nc_path, 'r') as nc:
    nc  = Dataset(nc_path, 'r')
print('number of data point', len(nc.variables['salinity'][:]), '\n', nc.variables['salinity'][:])
number of data points 36 
 [-- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
 -- -- -- -- -- -- -- -- -- -- -- --]

When using the xarray to read the salinity array, not all values show as nan: [this is used in the ioos_qc.streams Class XarrayStream]

nd = xr.open_dataset(
                nc_path,
                decode_cf=True,
                decode_coords=True,
                decode_times=True,
                mask_and_scale=True
            )
print('number of data point', len(nd['salinity'].values), '\n', nd['salinity'].values)
number of data point 36 
 [32.80403       nan       nan       nan 32.771416 32.67339  32.671986
       nan       nan 32.636482 32.141727 32.114475       nan       nan
 32.103085       nan 32.122192       nan       nan 32.09773        nan
       nan 32.050667       nan 32.04136        nan       nan 32.010563
       nan 31.924274       nan 31.775118       nan       nan       nan
 31.886553]

When using ncdump to read the salinity array, not all values are masked: [using another method]

!ncdump -v salinity '/Users/leilabelabassi/Downloads/ru34_20220529T202312Z_rt.nc'
salinity = 32.80403, _, _, _, 32.77142, 32.67339, 32.67199, _, _, 32.63648, 
    32.14173, 32.11448, _, _, 32.10308, _, 32.12219, _, _, 32.09773, _, _, 
    32.05067, _, 32.04136, _, _, 32.01056, _, 31.92427, _, 31.77512, _, _, _, 
    31.88655 ;

Debugging Info Issue (2):

The flag selection for the spike_test shows a missing flag (9) applied to an existing data point.

KEYS for the right column: Missing 9 UNKNOWN 2

    salinity
2  32.804031
9        NaN
9        NaN
9        NaN
9  32.771416
1  32.673389
9  32.671986
9        NaN
9        NaN
9  32.636482
1  32.141727
9  32.114475
9        NaN
9        NaN
9  32.103085
9        NaN
9  32.122192
9        NaN
9        NaN
9  32.097729
9        NaN
9        NaN
9  32.050667
9        NaN
9  32.041359
9        NaN
9        NaN
9  32.010563
9        NaN
9  31.924274
9        NaN
9  31.775118
9        NaN
9        NaN
9        NaN
2  31.886553

The flag selection for the rate_of_change_test shows a suspect flag (3) applied to a good data point.

KEYS for the right column: Missing 9 UNKNOWN 2 SUSPECT 3

    salinity
1  32.804031
9        NaN
9        NaN
9        NaN
1  32.771416
3  32.673389
3  32.671986
9        NaN
9        NaN
1  32.636482
3  32.141727
3  32.114475
9        NaN
9        NaN
1  32.103085
9        NaN
1  32.122192
9        NaN
9        NaN
1  32.097729
9        NaN
9        NaN
1  32.050667
9        NaN
1  32.041359
9        NaN
9        NaN
1  32.010563
9        NaN
1  31.924274
9        NaN
1  31.775118
9        NaN
9        NaN
9        NaN
1  31.886553

Clarification:
The suspect flag may be acceptable, but I am not completely clear on labeling existent data with a missing flag. Also, the data read from the 3 methods may be unrelated to flagging data but any recommendations or best practices on how to work with the ioos_qc package will be appreciated.

@leilabbb
Copy link
Author

@ocefpaf could you look at the issues i am seeing with the QARTOD application? Thanks

@ocefpaf
Copy link
Member

ocefpaf commented Mar 29, 2024

@ocefpaf could you look at the issues i am seeing with the QARTOD application? Thanks

@leilabbb I'm not too familiar with the code and QARTOD but I will take a look as soon as possible.

Can you send the version of the ioos_qc library that you are using?

@leilabbb
Copy link
Author

leilabbb commented Apr 1, 2024

@ocefpaf could you look at the issues i am seeing with the QARTOD application? Thanks

@leilabbb I'm not too familiar with the code and QARTOD but I will take a look as soon as possible.

Can you send the version of the ioos_qc library that you are using?

@ocefpaf Here is the SHA number: 0b1406b

@ocefpaf
Copy link
Member

ocefpaf commented Apr 2, 2024

@leilabbb can you try the latest commit? @iwensu0313 made some improvements and fixed a few bugs that may impact your example. Please let us know what you find.

@leilabbb
Copy link
Author

leilabbb commented Apr 5, 2024

@leilabbb can you try the latest commit? @iwensu0313 made some improvements and fixed a few bugs that may impact your example. Please let us know what you find.

I rerun the code using the latest commits and I do not see any change in the results. The same flags were rendered by the ioos_qc/qartod library.

@ocefpaf
Copy link
Member

ocefpaf commented Apr 5, 2024

I guess we need to wait for #108

@iwensu0313
Copy link
Contributor

#108 is merged now, @leilabbb do you want to try and see if that fixed it?

@leilabbb
Copy link
Author

leilabbb commented Jun 8, 2024

Not sure what is not working but the merge of 108 did not fix it. @iwensu0313 could use this netCDF file and see if you see a different response.

@iwensu0313
Copy link
Contributor

iwensu0313 commented Aug 13, 2024

@leilabbb - apologies, somehow I didn't get an email notification of your response. I need to check to see if it's being filtered into my email quarantine. Thanks for sending along the file, I'll take a look at this again as soon as I can

@iwensu0313
Copy link
Contributor

I haven't had a chance to look at this linked PR #129 in more detail. @ocefpaf is the issue being addressed there

@ocefpaf
Copy link
Member

ocefpaf commented Aug 13, 2024

I haven't had a chance to look at this linked PR #129 in more detail. @ocefpaf is the issue being addressed there

It is. The PR is still in its first draft. If you have some time, please give your input there.

@iwensu0313
Copy link
Contributor

iwensu0313 commented Sep 3, 2024

Hi @leilabbb -

Spike Test Missing Flags
Looks like PR #129 will resolve the issue you're seeing in the spike test, where missing flags were being applied to valid data points. I ran your inconsistent_data_rendering_and_QARTOD_flags_creation.ipynb notebook (using the config_legacy_vars.yml file and the test ru34_20220529T202312Z_rt.nc file you provided above) with the changes in that PR. and this is the salinity results array I get for the spike test. Looks as expected to me:

Screenshot 2024-09-03 at 3 37 02 PM

salinity xarray values copied from your original post, for comparison

 [32.80403       nan       nan       nan 32.771416 32.67339  32.671986
       nan       nan 32.636482 32.141727 32.114475       nan       nan
 32.103085       nan 32.122192       nan       nan 32.09773        nan
       nan 32.050667       nan 32.04136        nan       nan 32.010563
       nan 31.924274       nan 31.775118       nan       nan       nan
 31.886553]

Rate of Change Suspect Flags
Following up on your question regarding the suspect flags in the salinity rate of change. The results seem to be expected based on the threshold you set of 3.334577679634094e-10 salinity units per second.

I ran the debugger in your inconsistent_data...ipynb notebook and put a breakpoint in the rate of change test to investigate. Here are some screenshots showing what the test is doing. Let me know if you have questions

Screenshot 2024-09-03 at 3 58 17 PM

Example rate of change calculation from the 5th to 6th measurement

fifth_meas = 32.77141571044922
sixth_meas = 32.67338943481445
change_in_sec_from_5th_to_6th_meas = 9
roc_val = abs((sixth_meas - fifth_meas) / change_in_sec_from_5th_to_6th_meas)

print(roc_val)
> 0.010891808403862847

Reading in Data using Different Methods
I'm not so sure about this one and why netCDF4.Dataset and xr.open_dataset resulted in different masked values. Will see if my colleagues have any input here.

@leilabbb
Copy link
Author

Reading in Data using Different Methods
I'm not so sure about this one and why netCDF4.Dataset and xr.open_dataset resulted in different masked values. Will see if my colleagues have any input here.

@benjwadams figured out that the valid_min and valid_max of the salinity variable were switched and therefore netCDF4.Dataset masked the salinity values.

@leilabbb
Copy link
Author

Rate of Change Suspect Flags
Following up on your question regarding the suspect flags in the salinity rate of change. The results seem to be expected based on the threshold you set of 3.334577679634094e-10 salinity units per second.

I ran the debugger in your inconsistent_data...ipynb notebook and put a breakpoint in the rate of change test to investigate. Here are some screenshots showing what the test is doing. Let me know if you have questions

I realized that incorrect flag results happen when the time input to the rate_of_change_test function isn't in seconds. See rate_of_change_test flags are corrupted when the time array is in nano seconds #376

@iwensu0313
Copy link
Contributor

iwensu0313 commented Oct 8, 2024

@leilabbb - if helpful, here's what my colleague (@kthyng) said about Reading in Data using Different Methods. I also generally use xarray over netcdf as well, just preference.

  1. xarray interprets a lot of attributes under the hood which is usually really helpful but occasionally can be confusing. You can probably find more info about the details in the xarray docs and/or stack overflow.
  2. A major secret piece of information xarray interprets is the encoding. In xarray you can find it under ds.encoding. It has already interpreted it once you’ve read in the Dataset, but you can still see the info there. Stuff like fill values.
  3. mask values being turned into 9.9692100e+36 looks like the netcdf library using the fill value when it does that function.
  4. I know less about how the netcdf library does things but it is generally less under-the-hood and more just on the surface.

@leilabbb
Copy link
Author

leilabbb commented Nov 21, 2024

@ocefpaf @iwensu0313 When will the commit 5ffa1c6 related to the changes in spike_test be pushed to the main branch?

@iwensu0313
Copy link
Contributor

iwensu0313 commented Nov 22, 2024

@ocefpaf , @leilabbb - the changes I suggested should be incorporated before we can merge that PR #129. The first suggestion makes the code logic more representative of the logic we'd like to implement even though the end result is the same, for readability. The second suggestion (which needs to be updated for the 3 tests I mentioned) is required in order for the tests to pass.

@Sakshamgupta90 - let me know if it's helpful if I push the suggestions to a branch off of yours for your review Sakshamgupta90:fix/inconsistency

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants