support for complex attributes like pandas dataframes or onnx models #279

Open
zkurtz opened this issue Feb 9, 2025 · 2 comments
Labels
needs information Further information is requested

zkurtz commented Feb 9, 2025

I've been developing classio, an experimental package for adding IO methods to relatively complex data classes that may have attributes like pandas data frames, onnx models, or other complex objects. I recently added support for attributes that are mashumaro yaml & json based classes. What I'm wondering is how hard it would be to add support for, say, a pandas.DataFrame-type attribute directly inside of mashumaro.

Fatal1ty (Owner) commented Feb 16, 2025

Hi @zkurtz

Thank you for your interest in this library!

I looked through your code and found the two modules that do all the work for dataclasses with mashumaro mixins. If I am not mistaken, two solutions come to mind:

  1. Use dedicated decoders and encoders for JSON and YAML. They let you pass your own dialect, which can define serialization strategies for pandas dataframes and other types. Something like this:
from mashumaro.codecs.json import JSONEncoder
from mashumaro.dialect import Dialect
from mashumaro.mixins.json import DataClassJSONMixin
from upath import UPath

...

class DummioSerializationDialect(Dialect):
    serialization_strategy = {
        ... # pandas and other type handlers here
    }


def save(
    data: DataClassJSONMixin,
    *,
    filepath: PathType,
) -> None:
    """Save a mashumaro dataclass instance to a json text file."""
    encoder = JSONEncoder(
        type(data), default_dialect=DummioSerializationDialect
    )
    json_str = encoder.encode(data)
    assert isinstance(json_str, str), "expected a string from encoder.encode()"
    UPath(filepath).write_text(json_str)

However, in terms of performance it is not wise to create a disposable encoder or decoder every time you enter the function, because each one involves code generation. But if you are not in a high-load environment, or you are in no hurry, this could be a perfect solution for you: it is simple and available right now.
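One way to avoid regenerating the encoder on every call is to memoize it per dataclass type. This is only a sketch: `StandInEncoder` and `get_encoder` are hypothetical names, and the stand-in class takes the place of `JSONEncoder(cls, default_dialect=DummioSerializationDialect)` so that the caching itself is visible without the library:

```python
import json
from dataclasses import asdict, dataclass
from functools import lru_cache

build_count = 0  # how many times the (expensive) factory has run


class StandInEncoder:
    """Hypothetical stand-in for mashumaro's JSONEncoder."""

    def __init__(self, cls: type) -> None:
        self.cls = cls

    def encode(self, obj) -> str:
        return json.dumps(asdict(obj))


@lru_cache(maxsize=None)
def get_encoder(cls: type) -> StandInEncoder:
    """Build the encoder once per class and reuse it afterwards."""
    global build_count
    build_count += 1
    # Assumption: real code would construct the JSONEncoder with the dialect here.
    return StandInEncoder(cls)


@dataclass
class Point:
    x: int
    y: int


first = get_encoder(Point).encode(Point(1, 2))
second = get_encoder(Point).encode(Point(3, 4))  # reuses the cached encoder
```

With this shape, `save()` would call `get_encoder(type(data))` instead of constructing a new encoder, so the code generation cost is paid once per class rather than once per call.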

  2. The best thing in your case would be to define "global" functions that handle pandas and other data types the way you want. This can be done now, but with some caveats. Currently, there is a global registry for packers and a global registry for unpackers:

# in mashumaro.core.meta.types.pack
PackerRegistry = Registry()
register = PackerRegistry.register

# in mashumaro.core.meta.types.unpack
UnpackerRegistry = Registry()
register = UnpackerRegistry.register

You can register handlers for third-party types in the same way the built-in handlers are registered (but this is not a public API, so it may change):

from typing import Optional

from dataclasses import dataclass

from mashumaro.core.meta.types.common import Expression, ValueSpec
from mashumaro.core.meta.types.pack import register as register_pack
from mashumaro.core.meta.types.unpack import register as register_unpack
from mashumaro.mixins.json import DataClassJSONMixin


class MyType:
    def __init__(self, x: int):
        self.x = x

    def __repr__(self):
        return f"MyType(x={self.x})"


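@register_pack
def pack_my_type(spec: ValueSpec) -> Optional[Expression]:
    # Return the source of an expression that packs a MyType value into a dict;
    # returning None leaves the type to the other handlers.
    if spec.type is MyType:
        return f"{{'x': {spec.expression}.x}}"
    return None


@register_unpack
def unpack_my_type(spec: ValueSpec) -> Optional[Expression]:
    if spec.type is MyType:
        # Make MyType available by name inside the generated code.
        spec.builder.ensure_object_imported(MyType)
        return f"MyType({spec.expression}['x'])"
    return None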


@dataclass
class MyDataClass(DataClassJSONMixin):
    my_type: MyType


x = MyDataClass.from_json('{"my_type": {"x": 1}}')
print(x)  # MyDataClass(my_type=MyType(x=1))
print(x.to_json())  # {"my_type": {"x": 1}}

You could register your handlers this way before importing any dataclasses with mashumaro mixins you rely on.

This register decorator adds your handler to the end of the registry, so if no previously registered handler claims your third-party type, you're in luck. But if any earlier handler decides that it can handle this data type, your handler won't even be called. I'm not sure about data-science-related types, but I guess some of them could be subclasses of built-in Python types.
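The ordering problem can be illustrated with a toy first-match-wins registry. This is a simplified model, not mashumaro's actual `Registry`: the handlers here take a bare type and return a label, and `Frame` and `Model` are hypothetical third-party types.

```python
from typing import Callable, List, Optional

Handler = Callable[[type], Optional[str]]
handlers: List[Handler] = []


def register(func: Handler) -> Handler:
    """Append a handler; earlier handlers are consulted first."""
    handlers.append(func)
    return func


def resolve(tp: type) -> Optional[str]:
    """Return the result of the first handler that claims the type."""
    for handler in handlers:
        result = handler(tp)
        if result is not None:
            return result
    return None


@register
def builtin_dict_handler(tp: type) -> Optional[str]:
    # Claims dict and every subclass of dict.
    return "builtin dict handler" if issubclass(tp, dict) else None


class Frame(dict):
    """Hypothetical third-party type that happens to subclass dict."""


class Model:
    """Hypothetical third-party type with no built-in base class."""


@register
def frame_handler(tp: type) -> Optional[str]:
    return "custom Frame handler" if tp is Frame else None


@register
def model_handler(tp: type) -> Optional[str]:
    return "custom Model handler" if tp is Model else None


frame_result = resolve(Frame)  # shadowed by the earlier dict handler
model_result = resolve(Model)  # no earlier handler claims Model
```

In this model the handler for `Model` is reached, while the one for `Frame` never runs because the dict handler registered earlier already accepted it.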

So, in order to raise the priority of your handlers, we need to make the registries more open to external changes. The easiest way would be to add an optional position argument (beginning or end) to the register decorator, but I need to think about it. You can help me if you know of other libraries, frameworks, programs, etc. that face the same registry problem and have a well-accepted, user-friendly API I can draw inspiration from.
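A rough sketch of what such a `position` argument could look like follows. The `Registry` class, its `resolve` method, and the handler signature are all assumptions made for illustration, not mashumaro's current implementation.

```python
from typing import Callable, List, Optional

Handler = Callable[[type], Optional[str]]


class Registry:
    """Toy registry where a handler can be placed first or last."""

    def __init__(self) -> None:
        self._handlers: List[Handler] = []

    def register(self, func: Optional[Handler] = None, *, position: str = "end"):
        """Usable as @register or as @register(position="beginning")."""
        if position not in ("beginning", "end"):
            raise ValueError(f"unknown position: {position!r}")

        def decorator(f: Handler) -> Handler:
            if position == "beginning":
                self._handlers.insert(0, f)
            else:
                self._handlers.append(f)
            return f

        return decorator(func) if func is not None else decorator

    def resolve(self, tp: type) -> Optional[str]:
        for handler in self._handlers:
            result = handler(tp)
            if result is not None:
                return result
        return None


registry = Registry()


@registry.register
def builtin_dict_handler(tp: type) -> Optional[str]:
    return "builtin dict handler" if issubclass(tp, dict) else None


class Frame(dict):
    """Hypothetical third-party type that subclasses dict."""


@registry.register(position="beginning")
def frame_handler(tp: type) -> Optional[str]:
    return "custom Frame handler" if tp is Frame else None


frame_result = registry.resolve(Frame)
dict_result = registry.resolve(dict)
```

A different design avoids explicit ordering by always preferring the most specific class, which is how the standard library's `functools.singledispatch` chooses an implementation. Whether that fits handlers that inspect a whole `ValueSpec` rather than a bare type is an open question.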

Fatal1ty added the "needs information" label Feb 16, 2025

zkurtz (Author) commented Feb 17, 2025

Just trying to digest this, a few notes:

  • one key thing I'm looking for is the ability to use pandas native IO methods like to_parquet or read_parquet rather than forcing data to pass through any other serialization format (i.e. skip the json).
  • looking at both your approaches (1) and (2) reminds me that the main focus of mashumaro seems to be serializing classes to an in-memory string, with file IO left to the user as a separate step.
  • there seems to be an inherent tension between the above two points.
  • I found this more-recent issue relevant. IIUC I could use the method they demonstrate there with some modifications for data frame: question aboud numpy saveing and loading #282
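One possible way to reconcile the first two points is a "sidecar" layout: simple fields go into a small JSON manifest, while each complex attribute is written to its own file with that type's native IO. The sketch below is stdlib-only and entirely hypothetical (`SIDECAR_IO`, `save`, `load`, and the directory layout are invented for illustration). It uses `array.array` as the complex type; a pandas entry would presumably plug `DataFrame.to_parquet` and `pandas.read_parquet` into the same table.

```python
import json
import tempfile
from array import array
from dataclasses import dataclass, fields
from pathlib import Path


def save_array(obj: array, path: Path) -> None:
    path.write_bytes(obj.tobytes())


def load_array(path: Path) -> array:
    out = array("d")  # assumption: all arrays hold doubles
    out.frombytes(path.read_bytes())
    return out


# type -> (file suffix, writer(obj, path), reader(path))
SIDECAR_IO = {array: (".bin", save_array, load_array)}


def save(obj, directory: Path) -> None:
    """Write native files for complex fields and a JSON manifest for the rest."""
    directory.mkdir(parents=True, exist_ok=True)
    manifest = {}
    for field in fields(obj):
        value = getattr(obj, field.name)
        io = SIDECAR_IO.get(type(value))
        if io is None:
            manifest[field.name] = value  # must be JSON-compatible
        else:
            suffix, writer, _ = io
            writer(value, directory / f"{field.name}{suffix}")
    (directory / "manifest.json").write_text(json.dumps(manifest))


def load(cls, directory: Path):
    """Rebuild an instance from the manifest plus the sidecar files."""
    kwargs = json.loads((directory / "manifest.json").read_text())
    for field in fields(cls):
        io = SIDECAR_IO.get(field.type)
        if io is not None:
            suffix, _, reader = io
            kwargs[field.name] = reader(directory / f"{field.name}{suffix}")
    return cls(**kwargs)


@dataclass
class Experiment:
    name: str
    samples: array


with tempfile.TemporaryDirectory() as tmp:
    save(Experiment("run-1", array("d", [1.0, 2.5])), Path(tmp))
    restored = load(Experiment, Path(tmp))
```

The complex attribute never passes through JSON here; the manifest only carries the fields that are already JSON-friendly.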
