Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

numpy arrays read from attributes are not writable #53

Open
kmuehlbauer opened this issue May 13, 2022 · 2 comments
Open

numpy arrays read from attributes are not writable #53

kmuehlbauer opened this issue May 13, 2022 · 2 comments

Comments

@kmuehlbauer
Copy link
Collaborator

I'm trying to wrap my head around the following issue. Arrays read from Datasets are created with numpy.memmap with "copyonwrite" flag set. So these numpy arrays can be changed in memory but not in the file. This is not true for arrays read from attributes. There we get back a numpy.ndarray with the writable-flag set to FALSE. An MCVE is attached below.

I was already looking where the flag might be set and why, but didn't come far. Any hints much appreciated.

MCVE:

import numpy as np
import h5py
import pyfive

with h5py.File("test.h5", mode="w") as f:
    f.attrs["att"] = np.arange(10)
    f["arr"] = np.arange(10)

with pyfive.File("test.h5") as f:
    # dataset
    arr = f["arr"][:]
    print(type(arr))
    print(arr.flags)
    print(arr.base)
    arr += 1
    print(arr)
    # attribute
    att = f.attrs["att"]
    print(type(att))
    print(att.flags)
    print(att.base)
    att += 1
    print(att)
<class 'numpy.memmap'>
  C_CONTIGUOUS : True
  F_CONTIGUOUS : True
  OWNDATA : False
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False
  UPDATEIFCOPY : False

[0 1 2 3 4 5 6 7 8 9]
[ 1  2  3  4  5  6  7  8  9 10]
<class 'numpy.ndarray'>
  C_CONTIGUOUS : True
  F_CONTIGUOUS : True
  OWNDATA : False
  WRITEABLE : False
  ALIGNED : True
  WRITEBACKIFCOPY : False
  UPDATEIFCOPY : False

[0 1 2 3 4 5 6 7 8 9]

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Input In [23], in <cell line: 9>()
     20 print(att.flags)
     21 print(att.base)
---> 22 att += 1
     23 print(att)

ValueError: output array is read-only

@kmuehlbauer
Copy link
Collaborator Author

The main issue is obviously that np.frombuffer in DataObjects._attr_value() returns an immutable np.array as it is created via bytes:

value = np.frombuffer(buf, dtype=dtype, count=count, offset=offset)

I've tested this adding a .copy() to the np.frombuffer call and the everything works fine.

Wouldn't it make sense to use the same approach here as in get_contiguous_data (via np.memmap)?

def _get_contiguous_data(self, property_offset):
data_offset, = struct.unpack_from('<Q', self.msg_data, property_offset)
if data_offset == UNDEFINED_ADDRESS:
# no storage is backing array, return all zeros
return np.zeros(self.shape, dtype=self.dtype)
if not isinstance(self.dtype, tuple):
# return a memory-map to the stored array with copy-on-write
return np.memmap(self.fh, dtype=self.dtype, mode='c',
offset=data_offset, shape=self.shape, order='C')
else:
dtype_class = self.dtype[0]
if dtype_class == 'REFERENCE':
size = self.dtype[1]

Not sure how much will be involved to make that change. @bmaranville Do you have any thoughts on that?

@bmaranville
Copy link
Collaborator

bmaranville commented May 13, 2022

I think the data access methods for attributes and datasets are almost the same, but we are handling them separately, and with different datatypes enabled for each:

Attribute

  • VLEN_SEQUENCE
  • VLEN_STRING
  • REFERENCE (single)
  • simple datatypes {float, int, fixed-length string} (frombuffer)

Dataset

  • REFERENCE(array)
  • simple datatypes (memmap)

I suspect there would be a big benefit from combining the handling of reading data from attributes and datasets, so that the combined feature set would be better for both.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants