Updated pythonize and pyo3 #401

Open
wants to merge 2 commits into
base: master

Conversation

jamesbraza

Closes #371


Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

@wallies wallies requested a review from Copilot December 17, 2024 11:10


Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.

@cjrh
Collaborator

cjrh commented Jan 8, 2025

The pickle test fails because of a behavior change in the new version of the pythonize package. When a Document object is deserialized, values in the JSON object that were I64 become U64:

Comparing:

{"birth": [Date(2019-08-12T13:00:05Z)], "bytes": [Bytes([97, 98, 99])], "facet": [Facet(Facet(/europe/france))], "float": [F64(1.0)], "integer": [I64(5)], "json": [Object({"a": I64(1), "b": I64(-2)})], "title": [Str("hello world!")], "unsigned": [U64(1)]}

{"birth": [Date(2019-08-12T13:00:05Z)], "bytes": [Bytes([97, 98, 99])], "facet": [Facet(Facet(/europe/france))], "float": [F64(1.0)], "integer": [I64(5)], "json": [Object({"a": U64(1), "b": I64(-2)})], "title": [Str("hello world!")], "unsigned": [U64(1)]}

Look carefully at the types in the JSON object. Before serialization the value is I64(1), but after deserialization it is U64(1). This seems to only happen with the JSON object. If that is commented (.add_json) in the test then the test passes.

@jamesbraza
Author

Yeah, I concur with your breakdown: the serialization/deserialization cycle now treats the value under the JSON's "a" key as unsigned.

If that is commented (.add_json) in the test then the test passes.

Fwiw, changing the assertion to assert orig.to_dict() == pickled.to_dict() also makes the test pass. So it's something within __eq__ specifically.


How are you getting that printed representation with I64/U64? Using repr or to_dict I don't see those values.

Also, Document has no __eq__ method, so I am not yet understanding how the __eq__ check is actually failing.

@cjrh
Collaborator

cjrh commented Feb 17, 2025

If the assert .to_dict() change makes the test pass, I am ok to proceed with that. On the Python side there is no difference between I64 and U64, nor in a JSON representation of those payloads if they ever get serialized for anything. What do you reckon?
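To illustrate the point about the Python side, here is a minimal sketch. The ("I64", 1) and ("U64", 1) tuples are a hypothetical stand-in for the Rust-side tagged values, not how tantivy-py actually represents them; the untagged ints are what Python and JSON see.

```python
import json

# Hypothetical stand-ins for the Rust-side tagged values (not tantivy-py's real types).
before = {"a": ("I64", 1), "b": ("I64", -2)}
after = {"a": ("U64", 1), "b": ("I64", -2)}


def untag(obj):
    """Drop the type tag, keeping only the plain Python int."""
    return {key: value for key, (_tag, value) in obj.items()}


typed_equal = before == after                # False: the tags differ for "a"
plain_equal = untag(before) == untag(after)  # True: the plain ints are identical
json_equal = json.dumps(untag(before)) == json.dumps(untag(after))  # True as well
```

Under this model, a comparison that looks at the tags fails while anything based on plain Python values or JSON text passes, which matches what the thread reports for __eq__ versus to_dict().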

@jamesbraza
Author

I think it's important for __eq__ to work across serialization/deserialization.

What do you think of implementing a Document.__eq__ to invoke to_dict:

class Document:
    ...

    def __eq__(self, other) -> bool:
        if not isinstance(other, type(self)):
            return NotImplemented
        return self.to_dict() == other.to_dict()

Also, it may be preferred to do this in document.rs with the Rust API, as opposed to in Python only.
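A small sketch of how the proposed method would behave, using a hypothetical pure-Python stand-in (FakeDocument) rather than the real PyO3-backed Document. It also shows one side effect worth keeping in mind: in pure Python, a class that defines __eq__ without __hash__ becomes unhashable. Whether that carries over to a PyO3 class depends on how the comparison is implemented there, so treat it as an assumption to check.

```python
class FakeDocument:
    """Hypothetical stand-in for tantivy's Document; only to_dict is modeled."""

    def __init__(self, fields):
        self._fields = dict(fields)

    def to_dict(self):
        return dict(self._fields)

    def __eq__(self, other):
        if not isinstance(other, type(self)):
            return NotImplemented
        return self.to_dict() == other.to_dict()


orig = FakeDocument({"json": [{"a": 1, "b": -2}]})
copy = FakeDocument({"json": [{"a": 1, "b": -2}]})

same = orig == copy          # compared through to_dict()
different_type = orig == 42  # both sides return NotImplemented, so Python falls back to identity
hashable = FakeDocument.__hash__ is not None  # defining __eq__ alone sets __hash__ to None
```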

@cjrh
Collaborator

cjrh commented Feb 17, 2025

Overriding __eq__ makes me nervous.

I think it's important for __eq__ to work across serialization/deserialization.

Good point. Perhaps we should make a reproducer of the issue using only pythonize, and ask for help from that project.

@jamesbraza
Author

I would be happy to do so, if you can point out how you printed the Document such that U64/I64 stuff showed up.

Also, just so I can understand, how do you know the issue is with pythonize vs pyo3?

@cjrh
Collaborator

cjrh commented Feb 18, 2025

if you can point out how you printed the Document such that U64/I64 stuff showed up.

From memory I put print statements somewhere. Hopefully I still have that code somewhere. I will check tomorrow.

@cjrh
Collaborator

cjrh commented Feb 18, 2025

To print out that data, I added a print inside __richcmp__ like this:

impl Document {
    fn __richcmp__<'py>(
        &self,
        other: &Self,
        op: CompareOp,
        py: Python<'py>
    ) -> PyResult<Bound<'py, PyAny>> {
        println!("\n\n\nComparing:\n\n{:?}\n\n{:?}", self.field_values, other.field_values);
        match op {
            CompareOp::Eq => {
                let v = (self == other).into_pyobject(py)?.to_owned().into_any();
                Ok(v)
            },
            CompareOp::Ne => {
                let v = (self != other).into_pyobject(py)?.to_owned().into_any();
                Ok(v)
            },
            _ => {
                let v = PyNotImplemented::get(py).to_owned().into_any();
                Ok(v)
            }
        }
    }
}

Tracing backwards from here, I also added some prints in the _internal_from_pythonized() method:

impl Document {
    #[staticmethod]
    fn _internal_from_pythonized(serialized: &Bound<PyAny>) -> PyResult<Self> {
        println!("\n\n\nDeserializing: {:?}", serialized);
        let out = pythonize::depythonize(serialized).map_err(to_pyerr);
        let out: Document = out.unwrap();
        println!("\n\n\nDeserialized: {:?}", out);
        println!("\n\n\nDeserialized: {:?}", out.field_values);
        Ok(out)
    }
}

So when I run the tests I get this stdout:

------------------------------------- Captured stdout call -------------------------------------



Deserializing: {'birth': [{'Date': 1565614805000000000}], 'bytes': [{'Bytes': [97, 98, 99]}], 'facet': [{'Facet': '/europe/france'}], 'float': [{'F64': 1.0}], 'integer': [{'I64': 5}], 'json': [{'Object': {'a': 1, 'b': 2}}], 'title': [{'Str': 'hello world!'}], 'unsigned': [{'U64': 1}]}



Deserialized: Document(birth=[2019-08-12],bytes=[[97, 98, 9],facet=[/europe/fr],float=[1],integer=[5],json=[{"a":1,"b"],title=[hello worl],unsigned=[1])



Deserialized: {"birth": [Date(2019-08-12T13:00:05Z)], "bytes": [Bytes([97, 98, 99])], "facet": [Facet(Facet(/europe/france))], "float": [F64(1.0)], "integer": [I64(5)], "json": [Object({"a": U64(1), "b": U64(2)})], "title": [Str("hello world!")], "unsigned": [U64(1)]}
orig: Document(birth=[2019-08-12],bytes=[[97, 98, 9],facet=[/europe/fr],float=[1],integer=[5],json=[{"a":1,"b"],title=[hello worl],unsigned=[1])
pickled: Document(birth=[2019-08-12],bytes=[[97, 98, 9],facet=[/europe/fr],float=[1],integer=[5],json=[{"a":1,"b"],title=[hello worl],unsigned=[1])



Comparing:

{"birth": [Date(2019-08-12T13:00:05Z)], "bytes": [Bytes([97, 98, 99])], "facet": [Facet(Facet(/europe/france))], "float": [F64(1.0)], "integer": [I64(5)], "json": [Object({"a": I64(1), "b": I64(2)})], "title": [Str("hello world!")], "unsigned": [U64(1)]}

{"birth": [Date(2019-08-12T13:00:05Z)], "bytes": [Bytes([97, 98, 99])], "facet": [Facet(Facet(/europe/france))], "float": [F64(1.0)], "integer": [I64(5)], "json": [Object({"a": U64(1), "b": U64(2)})], "title": [Str("hello world!")], "unsigned": [U64(1)]}
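One thing the "Deserializing" line above suggests: top-level field values carry an explicit tag ({'I64': 5}, {'U64': 1}), while the integers nested inside 'Object' arrive as bare Python ints, so their integer type has to be inferred during deserialization. The sketch below is a hypothetical model of such an inference rule (non-negative becomes U64, negative becomes I64). It is not pythonize's or tantivy's actual code, only a way to reproduce the pattern seen in the output.

```python
def infer_tag(value):
    """Hypothetical inference rule: prefer U64 for non-negative ints, I64 otherwise."""
    if isinstance(value, bool) or not isinstance(value, int):
        raise TypeError("this sketch only models integers")
    return ("U64", value) if value >= 0 else ("I64", value)


def deserialize_object(obj):
    """Tag every bare int in a flat dict, as an untyped payload would have to be guessed."""
    return {key: infer_tag(value) for key, value in obj.items()}


# Bare ints, as in the 'Object' part of the pythonized payload above.
payload = {"a": 1, "b": 2}
# Tagged values as the document held them before pickling.
original = {"a": ("I64", 1), "b": ("I64", 2)}

roundtripped = deserialize_object(payload)
# Both values come back tagged U64, so the tagged comparison with `original` fails.
```

Under the same rule a negative value such as -2 would stay I64, which is consistent with the earlier comparison where only "a" changed type.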

@jamesbraza
Author

Okay I made jamesbraza#1, which:

  1. PR description states my current understanding of the problem (newer pythonize more aggressively making U64)
  2. Poses a somewhat hacky solution, basically checking if one can convert back to I64

I am not happy with this, but I am really in over my head here, so I made davidhewitt/pythonize#80 basically as an SOS.
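The "convert back to I64" idea from item 2 can be sketched roughly as follows. This is a hypothetical Python model using tagged tuples, not the code in jamesbraza#1: any U64 that fits in the signed 64-bit range is rewritten as I64 before comparing.

```python
I64_MAX = 2**63 - 1  # largest value representable as a signed 64-bit integer


def normalize(tagged):
    """Rewrite ("U64", n) as ("I64", n) when n fits in i64; leave everything else alone."""
    tag, value = tagged
    if tag == "U64" and value <= I64_MAX:
        return ("I64", value)
    return tagged


def normalized_equal(left, right):
    """Compare two flat dicts of tagged ints after normalization."""
    def norm(obj):
        return {key: normalize(val) for key, val in obj.items()}
    return norm(left) == norm(right)


before = {"a": ("I64", 1), "b": ("I64", 2)}
after = {"a": ("U64", 1), "b": ("U64", 2)}
equal_after_fix = normalized_equal(before, after)

# A value above the i64 maximum cannot be converted, so it stays U64.
too_big = normalize(("U64", 2**63))
```

The trade-off in this model is that a value that was genuinely stored as U64 and happens to fit in i64 also compares equal to its I64 counterpart, which is part of why the approach feels hacky.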

@jamesbraza
Author

Perhaps extract_value at https://github.com/quickwit-oss/tantivy-py/blob/0.22.0/src/document.rs#L37-L39 has something to do with it? I don't see a U64 case there, though I'm mostly just thinking out loud.

@cjrh
Collaborator

cjrh commented Mar 3, 2025

@jamesbraza It would be great if you could add a very short reproducer to your issue on pythonize, so that David can easily run it and see the change in type during roundtrip.

@cjrh
Collaborator

cjrh commented Mar 3, 2025

and

but I am really in over my head here

Don't worry, we're all learning all the time. Thanks for your continued interest in helping here ❤️

Development

Successfully merging this pull request may close these issues.

Failed to fetch wheel: tantivy==0.22.0 (Python 3.13.0)
2 participants