RuntimeError: NetCDF: Not a valid ID #80

Open
joelfiddes opened this issue May 17, 2023 · 8 comments

@joelfiddes
Collaborator

This is a strange and somewhat random error that is not always reproducible. From what I have read, and as is often the case with strange random errors, it may be related to multiple threads accessing the same file at the same time. Here is a discussion:

https://forum.access-hive.org.au/t/netcdf-not-a-valid-id-errors/389

Nice find! To summarise in this thread, it looks like a work-around in netcdf4-python to deal with netcdf-c not being thread safe was removed in 1.6.1. The solution (for now) is to [make sure your cluster only uses 1 thread per worker](https://forum.access-hive.org.au/t/netcdf-not-a-valid-id-errors/389/14).

@joelfiddes
Collaborator Author

joelfiddes commented May 17, 2023

I think we only have 1 thread per worker anyway with this?

multithread_pooling(_subset_climate_dataset, fun_param, n_threads=n_core)

@joelfiddes
Collaborator Author

As I understand it, 1 worker = 1 core?

@joelfiddes
Collaborator Author

joelfiddes commented May 17, 2023

Changed

ds_ = xr.open_mfdataset(flist, parallel=True)

to

ds_ = xr.open_mfdataset(flist, parallel=False)

and it ran fine with no errors. I don't fully understand it, so this can't be confidently claimed to be a fix. I will need to run it a bunch more times to see if it really is one.
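For reference, a minimal sketch of how the `parallel` flag could be made switchable instead of hard-coded. The `parallel_open` config key and the `open_climate_files` helper are hypothetical names used only for illustration, not part of the project. `parallel` itself is a real argument of `xr.open_mfdataset` and defaults to `False`.

```python
def mfdataset_kwargs(config):
    """Build keyword arguments for xr.open_mfdataset from a config dict.

    'parallel_open' is a hypothetical config key. It defaults to False
    (files are opened one after another), matching xarray's own default.
    """
    return {"parallel": bool(config.get("parallel_open", False))}


def open_climate_files(flist, config):
    """Open a list of netCDF files as one dataset (hypothetical helper)."""
    import xarray as xr  # imported lazily so mfdataset_kwargs works without xarray

    return xr.open_mfdataset(flist, **mfdataset_kwargs(config))
```

With an empty config this opens the files serially; setting the (assumed) key to `True` restores the previous behaviour for setups where it works.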

@joelfiddes
Collaborator Author

This is on a branch "slurm" where I am developing an embarrassingly parallelisable way of dealing with the time dimension, as the current method only works if the script is run on a multicore machine, NOT under a SLURM scheduler as used on many HPC machines. This problem may be unique to that use case (many workers accessing climate data netCDF files simultaneously). But I think @ArcticSnow mentioned seeing this issue and, as the discussion above shows, it seems to happen with multi-threaded access to nc files.

@ArcticSnow
Owner

The multiprocessing library supports both multithreading and multicore processing. One core can handle multiple threads. This is very convenient, for instance, for sending and handling the download requests (which require little computation). Maybe in the config file we should separate the two and have one n_cores and one n_threads setting to clarify things a bit.

Also, note that v0.2.2 does not parallelise in the time dimension. Parallelisation only happens in space. Each time split is run sequentially, once the previous one is done.
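To illustrate the thread versus process distinction mentioned above, here is a small sketch using only the standard library. The helper names `whoami` and `run_in_threads` are made up for this example. It shows that tasks on a `ThreadPool` all run inside one process, which is why they can collide on a shared netCDF handle.

```python
import os
import threading
from multiprocessing.pool import ThreadPool


def whoami(_):
    """Return the (process id, thread id) that executed this task."""
    return os.getpid(), threading.get_ident()


def run_in_threads(n_tasks, n_threads):
    """Run n_tasks on a thread pool and collect (pid, thread id) pairs."""
    with ThreadPool(n_threads) as pool:
        return pool.map(whoami, range(n_tasks))


if __name__ == "__main__":
    results = run_in_threads(n_tasks=8, n_threads=4)
    pids = {pid for pid, _ in results}
    # Every task ran in the same process: threads share memory,
    # including any open file handles held by the netCDF library.
    print("distinct processes:", len(pids))
```

Swapping `ThreadPool` for `multiprocessing.Pool` would run the tasks in separate processes, each with its own memory, so the number of threads and the number of cores used are two different settings.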

@joelfiddes
Collaborator Author

Of course. So this is actually a more general contribution. I will write up the approach in Discussions and link back here.

@joelfiddes
Collaborator Author

#83 (comment)

joelfiddes added a commit that referenced this issue Jun 1, 2023
…s giving random errors as refernce here #80. If True is needed and works OK in other circumstances we should consider a config item
@joelfiddes
Collaborator Author

Some more info that I think is related to this issue:

ecmwf/cfgrib#110

Basically it seems safer to use parallel=False with open_mfdataset, otherwise there is a chance of conflict between threads doing "stuff" on the nc file at the same time. There used to be "lock" and "autoclose" args to the function, but no longer. Maybe these are somehow implicit in parallel=False (this is also the default setting).
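As a rough sketch of what a "lock" means in this context (this is not how xarray implements anything internally, and all names here are hypothetical): a single process-wide lock can serialise every call into a library that is not thread safe, while the rest of the work still runs on a thread pool.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# One process-wide lock guarding every call into the non-thread-safe library.
NC_LOCK = threading.Lock()


def read_subset(path, reader):
    """Call reader(path) while holding NC_LOCK, so only one thread reads at a time.

    'reader' stands in for whatever function actually opens and reads the file.
    """
    with NC_LOCK:
        return reader(path)


def read_all(paths, reader, n_threads=4):
    """Read all paths on a thread pool; the reads themselves are serialised."""
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        return list(pool.map(lambda p: read_subset(p, reader), paths))
```

The trade-off is that the guarded reads no longer overlap, so this only helps when the threads also do work outside the lock. Note that a `threading.Lock` only protects threads within one process.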
