support for complex attributes like pandas dataframes or onnx models #279

zkurtz · 2025-02-09T21:20:00Z

I've been developing classio, an experimental package for adding IO methods to relatively complex data classes that may have attributes like pandas data frames, onnx models, or other complex objects. I recently added support for attributes that are mashumaro yaml & json based classes. What I'm wondering is how hard it would be to add support for, say, a pandas.DataFrame-type attribute directly inside of mashumaro.

Fatal1ty · 2025-02-16T22:18:55Z

Hi @zkurtz

Thank you for your interest in this library!

I looked through your code and found this and this module, which do all the work for dataclasses with mashumaro mixins. If I am not mistaken, two solutions come to mind:

1. Use specific decoders and encoders for JSON and YAML. They let you pass your own dialect, which can define serialization strategies for pandas DataFrames and other types. Something like this:

from mashumaro.codecs.json import JSONEncoder
from mashumaro.dialect import Dialect

...

class DummioSerializationDialect(Dialect):
    serialization_strategy = {
        ... # pandas and other type handlers here
    }


def save(
    data: DataClassJSONMixin,
    *,
    filepath: PathType,
) -> None:
    """Save a mashumaro dataclass instance to a json text file."""
    encoder = JSONEncoder(
        type(data), default_dialect=DummioSerializationDialect
    )
    json_str = encoder.encode(data)
    assert isinstance(json_str, str), "expected a string from encoder.encode()"
    UPath(filepath).write_text(json_str)

However, creating disposable encoders and decoders on every call is unwise for performance, because each construction involves code generation. If you are not in a high-load environment, or you are in no hurry, it could still be a perfect solution for you: it is simple and available right now.

2. The best thing in your case would be to define "global" functions that handle pandas and other data types the way you want. This can be done now, but with some nuances. Currently, there is a global registry for packers and a global registry for unpackers:

mashumaro/mashumaro/core/meta/types/pack.py, lines 81 to 82 in 0e97dd7:

    PackerRegistry = Registry()
    register = PackerRegistry.register

mashumaro/mashumaro/core/meta/types/unpack.py, lines 108 to 109 in 0e97dd7:

    UnpackerRegistry = Registry()
    register = UnpackerRegistry.register

You can register handlers for third-party types in the same way the built-in ones are registered (but this is not a public API, so it may change):

from dataclasses import dataclass
from typing import Optional

from mashumaro.core.meta.types.common import Expression, ValueSpec
from mashumaro.core.meta.types.pack import register as register_pack
from mashumaro.core.meta.types.unpack import register as register_unpack
from mashumaro.mixins.json import DataClassJSONMixin


class MyType:
    def __init__(self, x: int):
        self.x = x

    def __repr__(self):
        return f"MyType(x={self.x})"


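@register_pack
def pack_my_type(spec: ValueSpec) -> Optional[Expression]:
    # Return a source-code expression that turns MyType into a dict.
    # Returning None leaves this type to the other registered handlers.
    if spec.type is MyType:
        return f"{{'x': {spec.expression}.x }}"
    return None


@register_unpack
def unpack_my_type(spec: ValueSpec) -> Optional[Expression]:
    if spec.type is MyType:
        # Make MyType resolvable by name inside the generated code.
        spec.builder.ensure_object_imported(MyType)
        return f"MyType({spec.expression}['x'])"
    return None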


@dataclass
class MyDataClass(DataClassJSONMixin):
    my_type: MyType


x = MyDataClass.from_json('{"my_type": {"x": 1}}')
print(x)  # MyDataClass(my_type=MyType(x=1))
print(x.to_json())  # {"my_type": {"x": 1}}

You could register your handlers this way, as long as it happens before any dataclasses with mashumaro mixins that you rely on are imported.

This register decorator adds your handler to the end of the registry, so if no earlier handler matches your third-party type, you're in luck. But if any previously registered handler decides that it can handle this data type, your handler won't even be called. I'm not sure about data-science-related types, but I guess they could be subclasses of the built-in Python types.

So, to be able to raise the priority of your handlers, we need to make the registries more flexible to external changes. The easiest way would be to add an optional position argument (beginning or end) to the register decorator, but I need to think about it. You can help me if you know other libraries, frameworks, programs, etc. that have the same registry-related problem and a well-regarded, user-friendly API I can take inspiration from.
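The ordering problem and the proposed position argument can be sketched with a toy registry. This is a standalone illustration in plain Python, not mashumaro's actual Registry; the `position` parameter, the `dispatch` method, and the `Celsius` type are assumptions made up for the example:

```python
from typing import Callable, List, Optional

Handler = Callable[[type], Optional[str]]


class Registry:
    """Toy first-match-wins registry (hypothetical, for illustration)."""

    def __init__(self) -> None:
        self._handlers: List[Handler] = []

    def register(self, func: Optional[Handler] = None, *, position: str = "end"):
        def decorator(f: Handler) -> Handler:
            if position == "beginning":
                self._handlers.insert(0, f)  # tried before all others
            elif position == "end":
                self._handlers.append(f)  # tried only if nothing else matches
            else:
                raise ValueError(f"unknown position: {position!r}")
            return f

        # Support both @register and @register(position=...).
        return decorator if func is None else decorator(func)

    def dispatch(self, tp: type) -> Optional[str]:
        # The first handler that returns a non-None result wins.
        for handler in self._handlers:
            result = handler(tp)
            if result is not None:
                return result
        return None


registry = Registry()


class Celsius(float):
    """A third-party-style type that subclasses a builtin."""


@registry.register
def handle_float(tp: type) -> Optional[str]:
    return "generic float handler" if issubclass(tp, float) else None


@registry.register
def handle_celsius_appended(tp: type) -> Optional[str]:
    return "Celsius handler (appended)" if tp is Celsius else None


# The appended handler is shadowed by the generic float handler.
before = registry.dispatch(Celsius)


@registry.register(position="beginning")
def handle_celsius_prepended(tp: type) -> Optional[str]:
    return "Celsius handler (prepended)" if tp is Celsius else None


# The prepended handler is consulted first and now wins.
after = registry.dispatch(Celsius)
```

A two-value position is the simplest option; finer-grained priorities would need a different design.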

zkurtz · 2025-02-17T16:39:31Z

Just trying to digest this, a few notes:

- One key thing I'm looking for is the ability to use pandas' native IO methods like to_parquet or read_parquet, rather than forcing data to pass through another serialization format (i.e. skip the JSON).
- Looking at both your approaches (1) and (2) reminds me that the main focus of mashumaro seems to be serializing classes to an in-memory string, with file IO left to the user as a separate step.
- There seems to be an inherent tension between the above two points.
- I found this more recent issue relevant. IIUC, I could use the method demonstrated there, with some modifications for data frames: question aboud numpy saveing and loading #282

Fatal1ty added the "needs information" label (Further information is requested) · Feb 16, 2025
