Refactor dataset initialization #722
Conversation
Interesting! I see advantages to the approach. As noted in my comments, I would hope that it can be combined with a reduction of metadata in the source files.
Thanks! For now I implemented the wrapper idea; I think it nicely cleans up the code. My only slight doubt is a disadvantage my initial implementation also had: it appears as if arguments / keyword arguments are unused.
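To illustrate that drawback (hypothetical names, not the PR's actual wrapper): when arguments are only forwarded via locals(), linters and IDEs see parameters that are never referenced by name and may report them as unused.

```python
class Package:
    """Hypothetical stand-in for the dataset-building base class."""
    def __init__(self, variables):
        self.dataset = dict(variables)

class River(Package):
    def __init__(self, stage, conductance):
        # `stage` and `conductance` never appear by name below, so tooling
        # may flag them as unused even though locals() forwards them.
        args = {k: v for k, v in locals().items() if k != "self"}
        super().__init__(args)

print(River(stage=1.0, conductance=2.0).dataset)
# {'stage': 1.0, 'conductance': 2.0}
```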
Just as an extra check, I verified whether xr.merge causes any loss of performance in this case, as reported in #736 (comment). Apparently, with nicely overlapping coordinates, merging is not such a slow process, and the changes made in this PR should only (very slightly) improve performance.

```python
# %% import
import numpy as np
import xarray as xr

# %% functions
def make_da(nlay, nrow, ncol):
    shape = nlay, nrow, ncol
    dx = 10.0
    dy = -10.0
    xmin = 0.0
    xmax = dx * ncol
    ymin = 0.0
    ymax = abs(dy) * nrow
    dims = ("layer", "y", "x")
    layer = np.arange(1, nlay + 1)
    y = np.arange(ymax, ymin, dy) + 0.5 * dy
    x = np.arange(xmin, xmax, dx) + 0.5 * dx
    coords = {"layer": layer, "y": y, "x": x}
    return xr.DataArray(
        data=np.ones(shape, dtype=float),
        dims=dims,
        coords=coords,
    )

def make_big_data_dict(nvar=4, nlay=3, nrow=1000, ncol=1000):
    varnames = [f"var{i}" for i in np.arange(nvar)]
    return {var: make_da(nlay, nrow, ncol) for var in varnames}

def ds_from_merge_exact(d):
    return xr.merge([d], join="exact")

def ds_from_merge_default(d):
    return xr.merge([d])

def ds_assigned(d):
    ds = xr.Dataset()
    for key, value in d.items():
        ds[key] = value
    return ds

# %% Create data dict
d = make_big_data_dict()

# %% Benchmark in IPython/Jupyter
# %timeit ds_exact = ds_from_merge_exact(d)
## > 424 µs ± 3.53 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
# %timeit ds_default = ds_from_merge_default(d)
## > 864 µs ± 11.4 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
# %timeit ds = ds_assigned(d)
## > 1.75 ms ± 42.5 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
```
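As a side note on why join="exact" comes out fastest here (my reading, not stated in the PR): it only verifies that the indexes are identical and raises otherwise, whereas the default join="outer" computes a joined index. A small illustration:

```python
import numpy as np
import xarray as xr

a = xr.DataArray(np.ones(3), dims=("x",), coords={"x": [0, 1, 2]}, name="a")
b = xr.DataArray(np.ones(3), dims=("x",), coords={"x": [1, 2, 3]}, name="b")

merged = xr.merge([a, b])  # default join="outer": aligns to x = [0, 1, 2, 3]
print(merged.sizes)        # Frozen({'x': 4})

try:
    xr.merge([a, b], join="exact")  # refuses to align differing indexes
except ValueError as err:
    print(err)
```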
looks good, some questions and comments
UPDATE: Though the
Besides that, I had stack overflow errors for certain tests, but those could probably be fixed with some work. For now, I think we are better off forwarding arguments in dictionaries, as I couldn't really find a solution that would fix everything immediately. Relevant Stack Overflow questions:
Based on these Stack Overflow answers, I see three potential solutions, none of which I'm fully happy with:

1. Explicitly:

```python
class River:
    def __init__(
        self,
        stage,
        conductance,
        ...
    ):
        dict_dataset = {
            "stage": stage,
            "conductance": conductance,
            ...
        }
        super(type(self), self).__init__(dict_dataset)
```

This has the advantage that it's explicit and easy to follow, but there is code duplication.

2. Implicitly:

```python
class River:
    def __init__(
        self,
        stage,
        conductance,
        ...
    ):
        super(type(self), self).__init__(locals())
```

This can be made fancier and more precise (one possible approach is sketched after this list).

3. With kwargs:

```python
class River:
    def __init__(
        self,
        **kwargs,
    ):
        super(type(self), self).__init__(kwargs)
```

This is the more classic Pythonic approach; Matplotlib also applies it for some functions. However, there is no input checking anymore, and it is hard to tell from an IDE (e.g. via autocomplete) which arguments are expected; they can only be inferred from the docstring. Also, as @Manangka mentioned: we cannot do any type hinting in this case.
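For option 2, the comment breaks off before naming the tool; a plausible reading is Python's built-in inspect module. A minimal sketch under that assumption, filtering locals() down to the parameters actually declared in the signature (Package, River, stage, and conductance are illustrative names, not the real API):

```python
import inspect

class Package:
    """Hypothetical stand-in for the dataset-building base class."""
    def __init__(self, dict_dataset):
        self.dataset = dict_dataset

class River(Package):
    def __init__(self, stage, conductance):
        # Keep only arguments declared in this signature (minus `self`),
        # so helper locals defined later are never forwarded by accident.
        params = inspect.signature(River.__init__).parameters
        frame = locals()
        dict_dataset = {name: frame[name] for name in params if name != "self"}
        # Zero-argument super() avoids the infinite recursion that
        # super(type(self), self) triggers in subclasses, a likely cause
        # of the stack overflow errors mentioned above.
        super().__init__(dict_dataset)

print(River(stage=1.0, conductance=2.0).dataset)
# {'stage': 1.0, 'conductance': 2.0}
```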
looks good
```python
    # Remove vars inplace
    del self.dataset["concentration"]
    del self.dataset["concentration_boundary_type"]
else:
```
If concentration is not present in allargs.keys(), should we do the expand_transient_auxiliary_variables?
No, this was purely required if concentration was present (at least, that was the old behaviour). Changing that behaviour is out of scope here.
But now expand_transient_auxiliary_variables(self) is in the else branch. It will run when concentration is not present in allargs.keys().
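To make the flagged flow concrete: the full diff is not visible in this excerpt, so the condition below is a reconstruction around the shown context lines, with stubs standing in for the real helper and package:

```python
def expand_transient_auxiliary_variables(pkg):
    print("expanding")  # stub; the real helper lives in the package code

class DummyPkg:
    def __init__(self):
        self.dataset = {"concentration": None, "concentration_boundary_type": None}

def flagged_flow(self, allargs):
    # Reconstructed shape of the code under review (a guess, not the diff):
    if "concentration" in allargs.keys() and allargs["concentration"] is None:
        # Remove vars inplace
        del self.dataset["concentration"]
        del self.dataset["concentration_boundary_type"]
    else:
        # This branch also runs when "concentration" is absent from
        # allargs entirely, which is the behaviour change flagged above.
        expand_transient_auxiliary_variables(self)

flagged_flow(DummyPkg(), allargs={})  # prints "expanding" despite no concentration
```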
Fix #708
The changeset is incomplete, as it has not been rolled out for all packages yet; therefore tests will fail, hence the draft status. This is to keep the review process focused: reviewers are asked to provide feedback on the approach.
Changes:
Update: pkg_init decorator.
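The excerpt does not show the decorator itself; as a sketch of what a pkg_init-style decorator could look like, assuming it binds the declared __init__ arguments to their names and forwards them as one dict to the base initializer (all names here are illustrative):

```python
import functools
import inspect

class Package:
    """Hypothetical stand-in for the dataset-building base class."""
    def __init__(self, dict_dataset):
        self.dataset = dict_dataset

def pkg_init(init):
    """Bind the decorated __init__'s arguments to their parameter names and
    forward them as a single dict to Package.__init__, so each subclass keeps
    an explicit, type-hintable signature while the forwarding lives in one place."""
    @functools.wraps(init)
    def wrapper(self, *args, **kwargs):
        bound = inspect.signature(init).bind(self, *args, **kwargs)
        bound.apply_defaults()
        arguments = dict(bound.arguments)
        arguments.pop("self")
        # Call the base directly instead of super(type(self), self),
        # which would recurse infinitely in subclasses.
        Package.__init__(self, arguments)
    return wrapper

class River(Package):
    @pkg_init
    def __init__(self, stage, conductance=1.0):
        pass  # body unused; the decorator performs the forwarding

print(River(stage=2.0).dataset)
# {'stage': 2.0, 'conductance': 1.0}
```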