Better handling of overlapping times #334
Replies: 2 comments
I definitely like the idea to make this an optional additional configuration. Requiring it to be turned on for a dataset removes a major concern of changing existing code behavior. I'd recommend at least one of the options for determining file priority is based on (part of) file name (like a model run timestamp in the name of the file). While you could rely on the OS file last modified time, there can be weird behavior across operating systems for file last modified, and so I think there should be an option besides that. |
@clifford-harms The reason ERDDAP doesn't allow overlapping times is because the CF convention does not. One possible solution to what you are after is to look at what the THREDDS data server (TDS) does with forecast files (mostly GRIB files). It has what it calls the "Best Time Series" or some such name that navigates through the data and creates a series with non-overlapping times. ERDDAP doesn't have to do exactly what TDS does, but it would be a good place otherwise start and the code is already Java, plus I believe it is in netcdf-java already (don't quote me on this) which is what ERDDAP uses to read in a lot of datatypes. |
We use a lot of forecast model data at my organization. Because this data is large and changes daily, we typically want to avoid duplicating it. This results in directories of files with overlapping forecast validity times. ERDDAP currently cannot load this data "as is", because it produces an error when it detects an overlapping time, i.e. data for the same time and location in more than one file. This requires us (and presumably other users) to maintain a curated "erddap copy" of the data with duplicate times removed, or to use other scripting workarounds that leave the dataset unavailable for long stretches while it is updating.
I would like to add the ability for ERDDAP to make a decision when encountering duplicate times, allowing continued ingest of the EDDGrid dataset instead of failure. For forecast model run datasets, one reasonable strategy when there are duplicate validity times is to use the file with the most recent run time. Another strategy might be to keep the file with the most recent modified time.
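To make the "most recent run time" strategy concrete, here is a minimal Java sketch. Everything in it is hypothetical: the class and method names, and the filename convention (a run timestamp like `20240801T06` embedded in the file name) are assumptions for illustration, not existing ERDDAP code.

```java
import java.util.Comparator;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DuplicateTimeResolver {
    // Hypothetical filename convention: e.g. "gfs_20240801T06_f012.nc",
    // where "20240801T06" is the model run timestamp. This fixed-width
    // yyyyMMdd'T'HH form sorts correctly as plain text.
    private static final Pattern RUN_TIME = Pattern.compile("_(\\d{8}T\\d{2})_");

    // Extract the run timestamp embedded in a file name.
    static String runTime(String fileName) {
        Matcher m = RUN_TIME.matcher(fileName);
        if (!m.find())
            throw new IllegalArgumentException("no run time in " + fileName);
        return m.group(1);
    }

    // Among files that cover the same valid time, keep the one from the
    // most recent model run.
    static String pickMostRecentRun(List<String> fileNames) {
        return fileNames.stream()
            .max(Comparator.comparing(DuplicateTimeResolver::runTime))
            .orElseThrow();
    }
}
```

A "most recent modified time" strategy would be the same shape, just comparing `Files.getLastModifiedTime` instead of a parsed name fragment (with the cross-OS caveats noted above).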
My proposal would be to add an optional datasets.xml element for each dataset that selects a "strategy" for dealing with duplicate times (most recent run time, most recent modified time, etc.). The default behavior, when the element is absent, would be exactly what ERDDAP does now, preserving backwards compatibility. This approach would also allow different strategies to be introduced later as new use cases arise.
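For illustration, the optional element might look something like the sketch below. The element name and its values are invented here; none of this exists in ERDDAP today, and the final naming would be settled during review.

```xml
<!-- Hypothetical: "duplicateTimeStrategy" is a proposed element, not an
     existing ERDDAP option. Omitting it keeps current (fail-on-duplicate)
     behavior. -->
<dataset type="EDDGridFromNcFiles" datasetID="forecastModel" active="true">
    <duplicateTimeStrategy>mostRecentRunTime</duplicateTimeStrategy>
    <!-- alternative value: mostRecentModifiedTime -->
</dataset>
```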
I'm essentially looking for any comments, warnings, or pointers to existing efforts or functionality that I'm not aware of before I start diving into this. I intend to start work on this in late August.
edit: grammar/spelling