Add LGDO format conversion utilities #30

Merged on Dec 30, 2023 (54 commits)

Commits
df224d7
Attempt to implement Issue #4 - Add LGDO format conversion utilities
MoritzNeuberger Nov 2, 2023
f9182e2
forgot to add awkward to `setup.cfg`
MoritzNeuberger Nov 2, 2023
35e1eda
Merge branch 'main' into issue_4_lgdo_format_conversion
gipert Nov 2, 2023
51f9cf7
fixed docstrings
MoritzNeuberger Nov 3, 2023
800bbbe
Merge branch 'issue_4_lgdo_format_conversion' of github.com:MoritzNeu…
MoritzNeuberger Nov 3, 2023
ee3da97
maybe now ...
MoritzNeuberger Nov 3, 2023
660cb5c
Merge branch 'main' into issue_4_lgdo_format_conversion
gipert Nov 24, 2023
52c51ea
Removed ability to controll copy and added option to controll wheter …
MoritzNeuberger Nov 24, 2023
5544e45
Rename convert -> view_as and other fixes (also docs)
gipert Nov 24, 2023
00d464e
working on Struct and Table
MoritzNeuberger Nov 24, 2023
1a4625b
Merge branch 'issue_4_lgdo_format_conversion' of github.com:MoritzNeu…
MoritzNeuberger Nov 24, 2023
f310f9b
Merge
MoritzNeuberger Nov 24, 2023
64cd213
Rename waveform_table module to waveformtable
gipert Nov 24, 2023
2bf36dc
added `view_as` for Struct, Table and WaveformTable
MoritzNeuberger Nov 24, 2023
20ebb2c
Merge branch 'issue_4_lgdo_format_conversion' of github.com:MoritzNeu…
MoritzNeuberger Nov 24, 2023
fbdd04b
Add awkward-pandas and pint-pandas to requirements
gipert Nov 24, 2023
4940ecc
Add tiny module for physical units
gipert Nov 24, 2023
9358f7e
Implement view_as for array types
gipert Nov 24, 2023
0f3ce51
Implement view_as() for VoVs
gipert Nov 24, 2023
f562fb7
Switch attaching units to off by default
gipert Nov 24, 2023
fd4c9bd
Fix failing test
gipert Nov 24, 2023
b98afde
Fix docstrings
gipert Nov 24, 2023
ad802d8
merge while wip on table.py
MoritzNeuberger Nov 30, 2023
e0d5c4f
merge and implemented table pd view_as
MoritzNeuberger Nov 30, 2023
7857cb3
fixed tests to not use get_dataframe anymore as it got a deprication …
MoritzNeuberger Nov 30, 2023
16db16f
at least the get_dataframe errors should be fixed
MoritzNeuberger Nov 30, 2023
8d1e59b
implemented with_units option for view_as of Table, added akpd transf…
MoritzNeuberger Nov 30, 2023
64ab2dd
misc small bug fixes
MoritzNeuberger Nov 30, 2023
e2fcc9c
implemented aoesa tests for view_as
MoritzNeuberger Nov 30, 2023
52bba6a
implemented tests for Table view_as
MoritzNeuberger Nov 30, 2023
7c5eeb1
cleaned up the view_as implementation of Table
MoritzNeuberger Nov 30, 2023
b067820
even more cleaned up the view_as implementation of Table
MoritzNeuberger Nov 30, 2023
fcd3f54
implemented view_as for voev and aoeesa
MoritzNeuberger Nov 30, 2023
cd48dfb
small cleanup
MoritzNeuberger Nov 30, 2023
6b4bfba
typo
MoritzNeuberger Nov 30, 2023
f059584
a much easier implementation of view_as for the encoded types [credit…
MoritzNeuberger Nov 30, 2023
c52599b
the tests might faile because of this pipe?
MoritzNeuberger Nov 30, 2023
4f26d98
implemented awkward based to_aoesa function including a clipping option
MoritzNeuberger Dec 1, 2023
dba303f
updated the doc strings
MoritzNeuberger Dec 1, 2023
9552bc8
Improve view_as() docstrings
gipert Dec 1, 2023
fb7e363
Move _view_table_as_pd() into view_as()
gipert Dec 1, 2023
8d488ae
Implement Struct.__setitem__()
gipert Dec 2, 2023
b11dd24
Implement Struct.__getattr__()
gipert Dec 2, 2023
94a10c8
Just throw a NotImplementedError in Struct.view_as()
gipert Dec 2, 2023
02425cd
Add tests for view_as() in encoded types
gipert Dec 2, 2023
b0bd351
first attempt at refactor of to_aoesa including the missing_value arg…
MoritzNeuberger Dec 4, 2023
b768217
Merge branch 'issue_4_lgdo_format_conversion' of github.com:MoritzNeu…
MoritzNeuberger Dec 4, 2023
67d8c29
second attempt, removing preserve_dtype but keeping it indirectly whe…
MoritzNeuberger Dec 4, 2023
acfc837
Yet another API change to VoV.to_aoesa()
gipert Dec 5, 2023
cef1c66
Fix docs
gipert Dec 5, 2023
a3c84b0
adding copy=False option to astype
MoritzNeuberger Dec 6, 2023
e49ed0c
im confused, why do I still need the equal sign there
MoritzNeuberger Dec 6, 2023
b233a26
Bug fix in WaveformTable.view_as()
gipert Dec 30, 2023
f5a317c
[docs] update LH5Files tutorial
gipert Dec 30, 2023
1 change: 1 addition & 0 deletions docs/source/conf.py
@@ -51,6 +51,7 @@
intersphinx_mapping = {
"python": ("https://docs.python.org/3", None),
"numpy": ("https://numpy.org/doc/stable", None),
"awkward": ("https://awkward-array.org/doc/stable", None),
"numba": ("https://numba.readthedocs.io/en/stable", None),
"pandas": ("https://pandas.pydata.org/docs", None),
"h5py": ("https://docs.h5py.org/en/stable", None),
139 changes: 134 additions & 5 deletions docs/source/notebooks/LH5Files.ipynb
@@ -48,7 +48,7 @@
"metadata": {},
"outputs": [],
"source": [
"from lgdo import ls\n",
"from lgdo.lh5 import ls\n",
"\n",
"ls(lh5_file)"
]
@@ -91,7 +91,7 @@
"metadata": {},
"outputs": [],
"source": [
"from lgdo import show\n",
"from lgdo.lh5 import show\n",
"\n",
"show(lh5_file)"
]
@@ -111,7 +111,7 @@
"metadata": {},
"outputs": [],
"source": [
"from lgdo import LH5Store\n",
"from lgdo.lh5 import LH5Store\n",
"\n",
"store = LH5Store()"
]
@@ -210,12 +210,141 @@
"metadata": {},
"outputs": [],
"source": [
"from lgdo import LH5Iterator\n",
"from lgdo.lh5 import LH5Iterator\n",
"\n",
"for lh5_obj, entry, n_rows in LH5Iterator(lh5_file, \"geds/raw/energy\", buffer_len=20):\n",
" print(f\"entry {entry}, energy = {lh5_obj} ({n_rows} rows)\")"
]
},
{
"cell_type": "markdown",
"id": "684f8530",
"metadata": {},
"source": [
"### Converting LGDO data to alternative formats\n",
"\n",
"Each LGDO is equipped with a method called `view_as()` [[docs]](https://legend-pydataobj.readthedocs.io/en/stable/api/lgdo.types.html#lgdo.types.lgdo.LGDO.view_as), which allows the user to \"view\" the data (i.e. avoiding copying data as much as possible) in a different, third-party format.\n",
"\n",
"LGDOs generally support viewing as NumPy (`np`), Pandas (`pd`) or [Awkward](https://awkward-array.org) (`ak`) data structures, with some exceptions. We strongly recommend having a look at the `view_as()` API docs of each LGDO type for more details (for `Table.view_as()` [[docs]](https://legend-pydataobj.readthedocs.io/en/stable/api/lgdo.types.html#lgdo.types.table.Table.view_as), for example).\n",
"\n",
"<div class=\"alert alert-info\">\n",
"\n",
"**Note:** To obtain a copy of the data in the selected third-party format, the user can call the appropriate third-party copy method on the view (e.g. `pandas.DataFrame.copy()`, if viewing the data as a Pandas dataframe).\n",
"\n",
"</div>\n",
"\n",
"Let's play around with our good old table: can we view it as a Pandas dataframe?"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e2f48391",
"metadata": {},
"outputs": [],
"source": [
"obj, _ = store.read(\"geds/raw\", lh5_file)\n",
"df = obj.view_as(\"pd\")\n",
"df"
]
},
{
"cell_type": "markdown",
"id": "7f476362",
"metadata": {},
"source": [
"Yes! But how are the nested objects being handled?\n",
"\n",
"Nested tables have been flattened by prefixing their column names with the table object name (`obj.waveform.values` becomes `df.waveform_values`) and multi-dimensional columns are represented by Awkward arrays:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6261c8fe",
"metadata": {},
"outputs": [],
"source": [
"df.waveform_values"
]
},
{
"cell_type": "markdown",
"id": "6ed5904a",
"metadata": {},
"source": [
"But what if we wanted to have the waveform values as a NumPy array?"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f4b45112",
"metadata": {},
"outputs": [],
"source": [
"obj.waveform.values.view_as(\"np\")"
]
},
{
"cell_type": "markdown",
"id": "d0c86728",
"metadata": {},
"source": [
"Can we just view the full table as a huge Awkward array? Of course:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "33ae5c21",
"metadata": {},
"outputs": [],
"source": [
"obj.view_as(\"ak\")"
]
},
{
"cell_type": "markdown",
"id": "cd5fa308",
"metadata": {},
"source": [
"Note that viewing a `VectorOfVectors` as an Awkward array is a nearly zero-copy operation, which makes fast Awkward-based computations directly available:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d75c8ff8",
"metadata": {},
"outputs": [],
"source": [
"import awkward as ak\n",
"\n",
"# tracelist is a VoV on disk\n",
"trlist = obj.tracelist.view_as(\"ak\")\n",
"ak.mean(trlist)"
]
},
{
"cell_type": "markdown",
"id": "d8d9ad8c",
"metadata": {},
"source": [
"Last but not least, we support attaching physical units (that might be stored in the `units` attribute of an LGDO) to data views through Pint, if the third-party format allows it:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4007efd4",
"metadata": {},
"outputs": [],
"source": [
"df = obj.view_as(\"pd\", with_units=True)\n",
"df.timestamp.dtype"
]
},
{
"cell_type": "markdown",
"id": "3ab3794c",
@@ -278,7 +407,7 @@
"metadata": {},
"outputs": [],
"source": [
"from lgdo import show\n",
"from lgdo.lh5 import show\n",
"\n",
"show(\"my_objects.lh5\")"
]
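The tutorial text added in this notebook says that nested tables are flattened by prefixing column names with the table name (`obj.waveform.values` becomes `df.waveform_values`). Below is a minimal pure-Python sketch of that naming rule only. Plain dictionaries are used as a hypothetical stand-in for LGDO tables, and `flatten_columns` and the sample data are illustrative names that are not part of `lgdo`; the library's own implementation may well differ.

```python
from __future__ import annotations

from typing import Any


def flatten_columns(table: dict[str, Any], prefix: str = "") -> dict[str, Any]:
    """Flatten nested dict 'tables' into one level, joining names with '_'."""
    flat: dict[str, Any] = {}
    for name, column in table.items():
        full_name = f"{prefix}{name}"
        if isinstance(column, dict):
            # nested table: recurse, prefixing children with the parent name
            flat.update(flatten_columns(column, prefix=f"{full_name}_"))
        else:
            # leaf column: keep the data object itself (no copy is made)
            flat[full_name] = column
    return flat


# hypothetical stand-in for the table read from "geds/raw" in the tutorial
raw = {
    "timestamp": [0.1, 0.2],
    "waveform": {"t0": [0, 0], "dt": [16, 16], "values": [[1, 2, 3], [4, 5, 6]]},
}

flat = flatten_columns(raw)
print(list(flat))
```

Because the leaf columns are stored by reference, the flattened mapping shares its data with the nested one, which mirrors the "view, not copy" intent described in the tutorial.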
3 changes: 3 additions & 0 deletions setup.cfg
@@ -30,6 +30,8 @@ classifiers =
[options]
packages = find:
install_requires =
awkward>=2
awkward-pandas
colorlog
h5py>=3.2
hdf5plugin
@@ -39,6 +41,7 @@ install_requires =
pandas>=1.4.4
parse
pint
pint-pandas
python_requires = >=3.9
include_package_data = True
package_dir =
9 changes: 6 additions & 3 deletions src/lgdo/compression/radware.py
@@ -129,7 +129,9 @@ def encode(
)
# convert VectorOfVectors to ArrayOfEqualSizedArrays so it can be
# directly passed to the low-level encoding routine
sig_out_nda, nbytes = encode(sig_in.to_aoesa(), shift=shift)
sig_out_nda, nbytes = encode(
sig_in.to_aoesa(fill_val=0, preserve_dtype=True), shift=shift
)

# build the encoded LGDO
encoded_data = lgdo.ArrayOfEqualSizedArrays(nda=sig_out_nda).to_vov(
@@ -262,7 +264,7 @@ def decode(
# convert vector of vectors to array of equal sized arrays
# can now decode on the 2D matrix together with number of bytes to read per row
_, siglen = decode(
(sig_in.encoded_data.to_aoesa(preserve_dtype=True).nda, nbytes),
(sig_in.encoded_data.to_aoesa(fill_val=0, preserve_dtype=True).nda, nbytes),
sig_out if isinstance(sig_out, np.ndarray) else sig_out.nda,
shift=shift,
)
@@ -288,7 +290,8 @@ def decode(
# convert vector of vectors to array of equal sized arrays
# can now decode on the 2D matrix together with number of bytes to read per row
sig_out, siglen = decode(
(sig_in.encoded_data.to_aoesa(preserve_dtype=True).nda, nbytes), shift=shift
(sig_in.encoded_data.to_aoesa(fill_val=0, preserve_dtype=True).nda, nbytes),
shift=shift,
)

# sanity check
6 changes: 3 additions & 3 deletions src/lgdo/compression/varlen.py
@@ -103,7 +103,7 @@ def encode(
)
# convert VectorOfVectors to ArrayOfEqualSizedArrays so it can be
# directly passed to the low-level encoding routine
sig_out_nda, nbytes = encode(sig_in.to_aoesa())
sig_out_nda, nbytes = encode(sig_in.to_aoesa(fill_val=0, preserve_dtype=True))

# build the encoded LGDO
encoded_data = lgdo.ArrayOfEqualSizedArrays(nda=sig_out_nda).to_vov(
@@ -227,7 +227,7 @@ def decode(
# convert vector of vectors to array of equal sized arrays
# can now decode on the 2D matrix together with number of bytes to read per row
_, siglen = decode(
(sig_in.encoded_data.to_aoesa(preserve_dtype=True).nda, nbytes),
(sig_in.encoded_data.to_aoesa(fill_val=0, preserve_dtype=True).nda, nbytes),
sig_out if isinstance(sig_out, np.ndarray) else sig_out.nda,
)

@@ -252,7 +252,7 @@ def decode(
# convert vector of vectors to array of equal sized arrays
# can now decode on the 2D matrix together with number of bytes to read per row
sig_out, siglen = decode(
(sig_in.encoded_data.to_aoesa(preserve_dtype=True).nda, nbytes)
(sig_in.encoded_data.to_aoesa(fill_val=0, preserve_dtype=True).nda, nbytes)
)

# sanity check
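The `to_aoesa(fill_val=0, ...)` calls in these hunks turn a vector of variable-length vectors into a rectangular array before it is handed to the low-level codec. As a rough illustration of the idea, here is a pure-Python sketch that pads rows described by a flattened data buffer plus cumulative lengths. `pad_to_equal_size` is a hypothetical helper and plain lists stand in for NumPy arrays, so this is not the `VectorOfVectors.to_aoesa()` implementation.

```python
from __future__ import annotations


def pad_to_equal_size(
    flattened_data: list[int],
    cumulative_length: list[int],
    fill_val: int = 0,
) -> list[list[int]]:
    """Split a flat buffer into rows and pad each row to the longest one."""
    rows = []
    start = 0
    for stop in cumulative_length:
        # row i spans flattened_data[cumulative_length[i-1]:cumulative_length[i]]
        rows.append(flattened_data[start:stop])
        start = stop
    width = max((len(row) for row in rows), default=0)
    # right-pad short rows with fill_val so that all rows have equal length
    return [row + [fill_val] * (width - len(row)) for row in rows]


# three vectors of lengths 2, 1 and 3 stored back to back
padded = pad_to_equal_size([1, 2, 3, 4, 5, 6], [2, 3, 6])
print(padded)
```

Padding with an integer such as 0, rather than NaN, lets the padded rows keep an integer type, which is plausibly why these call sites pass `fill_val=0` together with `preserve_dtype=True`.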
6 changes: 3 additions & 3 deletions src/lgdo/lh5/store.py
@@ -53,7 +53,7 @@ class LH5Store:
>>> store = LH5Store()
>>> obj, _ = store.read("/geds/waveform", "file.lh5")
>>> type(obj)
lgdo.waveform_table.WaveformTable
lgdo.waveformtable.WaveformTable
"""

def __init__(self, base_path: str = "", keep_open: bool = False) -> None:
@@ -890,13 +890,13 @@ def write(
`compression` attribute.

Note
----
----------
The `compression` LGDO attribute takes precedence over the default HDF5
compression settings. The `hdf5_settings` attribute takes precedence
over `compression`. These attributes are not written to disk.

Note
----
----------
HDF5 compression is skipped for the `encoded_data.flattened_data`
dataset of :class:`.VectorOfEncodedVectors` and
:class:`.ArrayOfEncodedEqualSizedArrays`.
2 changes: 1 addition & 1 deletion src/lgdo/types/__init__.py
@@ -9,7 +9,7 @@
from .struct import Struct
from .table import Table
from .vectorofvectors import VectorOfVectors
from .waveform_table import WaveformTable
from .waveformtable import WaveformTable

__all__ = [
"Array",
60 changes: 60 additions & 0 deletions src/lgdo/types/array.py
@@ -8,9 +8,14 @@
from collections.abc import Iterator
from typing import Any

import awkward as ak
import awkward_pandas as akpd
import numpy as np
import pandas as pd
import pint_pandas # noqa: F401

from .. import utils as utils
from ..units import default_units_registry as u
from .lgdo import LGDO

log = logging.getLogger(__name__)
@@ -138,3 +143,58 @@ def __repr__(self) -> str:
)
+ f", attrs={repr(self.attrs)})"
)

def view_as(
self, library: str, with_units: bool = False
) -> pd.Series | np.ndarray | ak.Array:
"""View the Array data as a third-party format data structure.

This is a zero-copy operation. Supported third-party formats are:

- ``pd``: returns a :class:`pandas.Series`
- ``np``: returns the internal `nda` attribute (:class:`numpy.ndarray`)
- ``ak``: returns an :class:`ak.Array` initialized with `self.nda`

Parameters
----------
library
format of the returned data view.
with_units
forward physical units to the output data.

See Also
--------
.LGDO.view_as
"""
# TODO: does attaching units imply a copy?
attach_units = with_units and "units" in self.attrs

if library == "pd":
if attach_units:
if self.nda.ndim == 1:
return pd.Series(
self.nda, dtype=f"pint[{self.attrs['units']}]", copy=False
)
else:
raise ValueError(
"Pint does not support Awkward yet, you must view the data with_units=False"
)
else:
if self.nda.ndim == 1:
return pd.Series(self.nda, copy=False)
else:
return akpd.from_awkward(self.view_as("ak"))
elif library == "np":
if attach_units:
return self.nda * u(self.attrs["units"])
else:
return self.nda
elif library == "ak":
if attach_units:
raise ValueError(
"Pint does not support Awkward yet, you must view the data with_units=False"
)
else:
return ak.Array(self.nda)
else:
raise ValueError(f"{library} is not a supported third-party format.")
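The docstring above describes `view_as()` as a zero-copy operation, and the tutorial notes that an explicit copy can still be made from the view afterwards. The distinction can be sketched with the standard library alone: below, `array.array` and `memoryview` are used as stand-ins for the NumPy buffer held in `nda` and for a third-party view of it. This is an assumption-laden illustration of view-versus-copy semantics, not code from `lgdo`.

```python
from array import array

# the object that owns the buffer (plays the role of the `nda` attribute)
data = array("d", [1.0, 2.0, 3.0])

view = memoryview(data)  # zero-copy: exposes the same underlying buffer
copy = array("d", data)  # explicit copy: allocates an independent buffer

# modify the original data in place
data[0] = 42.0

print(view[0])  # the view reflects the change
print(copy[0])  # the copy still holds the old value
```

The practical consequence is the same as for a real view: writing through the view changes the original object, so take an explicit copy first whenever the original data must stay untouched.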
13 changes: 13 additions & 0 deletions src/lgdo/types/arrayofequalsizedarrays.py
@@ -7,7 +7,9 @@
from collections.abc import Iterator
from typing import Any

import awkward as ak
import numpy as np
import pandas as pd

from .. import utils as utils
from . import vectorofvectors as vov
@@ -131,3 +133,14 @@ def to_vov(self, cumulative_length: np.ndarray = None) -> vov.VectorOfVectors:
cumulative_length=cumulative_length,
attrs=attrs,
)

def view_as(
self, library: str, with_units: bool = False
) -> pd.Series | np.ndarray | ak.Array:
"""View the array as a third-party format data structure.

See Also
--------
.LGDO.view_as
"""
return super().view_as(library, with_units=with_units)