-
Notifications
You must be signed in to change notification settings - Fork 36
DNM unvendor all libraries #120
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
srsly/msgpack/_msgpack_numpy.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved to srsly/_msgpack_numpy.py and heavily reworked
| data: The data to serialize. | ||
| RETURNS (bytes): The serialized bytes. | ||
| """ | ||
| return msgpack.dumps(data, use_bin_type=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use_bin_type is true by default
| cython>=0.29.1 | ||
| pytest>=4.6.5 | ||
| pytest-timeout>=1.3.3 | ||
| mock>=2.0.0,<3.0.0 | ||
| numpy>=1.15.0 | ||
| psutil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Only pytest remains, so this file makes little sense anymore
| python_requires = >=3.9,<3.15 | ||
| setup_requires = | ||
| cython>=0.29.1 | ||
| python_requires = >=3.9 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed version cap as there is very little that can break anymore with future versions. Future compatibility is now delegated to upstream libraries.
setup.cfg
Outdated
| catalogue>=2.0.10,<3 | ||
| cloudpickle >=3.1.2,<4 | ||
| msgpack >=1.1,<2 | ||
| ruamel.yaml >=0.18.16,<1 | ||
| ujson >=5.11.0,<6 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lower bounds are as of today. This is simply out of laziness to save the effort of pinpointing which is the minimum version of everything that works with thinc and spacy.
Higher bounds have been set very generously to reduce maintenance burden in srsly.
| shell: bash | ||
| run: | | ||
| python -m pytest --pyargs $MODULE_NAME -Werror | ||
| run: pytest |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On 3.14t this emits
=============================== warnings summary ===============================
<frozen importlib._bootstrap>:491
<frozen importlib._bootstrap>:491: RuntimeWarning: The global interpreter lock (GIL) has been enabled to load module 'ujson', which has not declared that it can run safely without the GIL. To override this behavior and keep the GIL disabled (at your own risk), run with PYTHON_GIL=0 or -Xgil=0.
-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
======================== 85 passed, 1 warning in 0.31s =========================
I'll do a one-liner follow-up that re-adds -Werror as soon as upstream releases of msgpack and ujson become available.
ngoldbaum
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just because I'm not familiar with the package: are the tests all inherited from upstream or did you do the edits to the tests manually?
Overall awesome, I love PRs that delete a ton of code. Hopefully testing spaCy doesn't lead to discovering more issues.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you go into a little more detail than what's in the PR description into what motivated adding _MsgpackExtensions over using the old code?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was wrong. in the opening message. The API
msgpack_encoders.register(name, cb)
msgpack_deecoders.register(name, cb)
was not part of a very old version of msgpack like I thought - it was original in srsly all along.
I had to reimplement it from scratch to make it work with an unpatched msgpack, and to deal with subtle breakages with np.float64 (which is a subclass of builtin float).
srsly/ruamel_yaml.py
Outdated
| @@ -0,0 +1 @@ | |||
| from ruamel.yaml import * | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see how you're getting the same thing that the change in 484f519 provided before. Can you elaborate?
|
FWIW, I mentioned having concerns about the library that led to this decision to vendor it and remove these unsafe branches. I found the message where I explained this to our team at the time:
This wasn't originally written as a public message so it's more flippant than I'd generally be --- no code is perfect, and this relates to stylistic concerns that are matters of judgment. But my concern was that future versions of the library could easily introduce a regression that affected the safe loading feature. As I said on the call I can agree to unvendor the code as the cost/benefit analysis on the maintenance burden is different now. But I definitely want to make sure we maintain the behaviours that we had before, of forcing the library to only work in safe mode. A test that checked that the code execution features are in fact disabled would also be appreciated. |
|
Not sure I understand your worry here. It seems to me that to get a code injection one needs to
I too remember the time where >>> import io
>>> import srsly
>>> from ruamel.yaml import YAML
>>> class C:
... def __new__(cls):
... print("Arbitrary code execution!")
... return object.__new__(cls)
>>> c = C()
Arbitrary code execution!
>>> srsly.yaml_dumps(c)
RepresenterError: cannot represent an object: <__main__.C object at 0x724fd4819e10>
>>> buf = io.StringIO()
>>> yaml = YAML(typ='unsafe')
>>> yaml.dump(c, buf)
>>> buf.getvalue()
'!!python/object:__main__.C {}\n'
>>> buf.seek(0)
>>> yaml.load(buf)
Arbitrary code execution!
<__main__.C at 0x724f75c75050>
>>> yaml2 = YAML()
>>> buf.seek(0)
>>> yaml2.load(buf)
{}
>>> srsly.yaml_loads(buf.getvalue())
ValueError: Invalid YAML: could not determine a constructor for the tag 'tag:yaml.org,2002:python/object:__main__.C'
in "<unicode string>", line 1, column 1:
!!python/object:__main__.C {}
^ (line: 1)I've added a unit test to verify this behaviour. |
All surviving tests are unique to srsly. All changes in them are my own edits. |
ngoldbaum
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did a pass over the tests. IMO it would be better if this PR only added new tests instead of changing or deleting old tests.
| m = Malicious() | ||
| buf = StringIO() | ||
| yaml = YAML(typ="full") | ||
| yaml.dump(m, buf) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be a little more convincing if this test used srsly.write_yaml or srsly.yaml_dumps instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Neither of those functions, by design, allow you to serialize an arbitrary python object.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A test demonstrating that would be more useful than what's here IMO. I searched and there are zero uses of srsly.ruamel_yaml in the explosion stack outside srsly itself. Testing the public API is what's important.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test does not test srsly.ruamel_yaml.
This test verifies that a YAML-legal malicious payload, which can only be crafted either with unwrapped ruamel.yaml.YAML or PyYAML, cannot cause arbitrary code execution in srly.yaml_load.
I've now added a test that shows that you get a meaningful error message in srsly.write_yaml when trying to write arbitrary objects, and removed the lines that use ruamel.yaml.YAML() to load the payload back in order to prevent confusion.
I searched and there are zero uses of srsly.ruamel_yaml in the explosion stack outside srsly itself.
Are you suggesting to not introduce the dummy srsly/ruamel_yaml.py etc. backwards compatibiity modules, incurring in a slight risk that third-party packages may (trivially) break?
| with pytest.raises(TypeError): | ||
| s = json_dumps(f) | ||
| with pytest.raises(TypeError, match="is not JSON serializable"): | ||
| s = json_dumps({1, 2}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why not keep the old content too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe you wanted to get rid of numpy as a dependency? You could leave it as a test dependency.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
numpy was never a runtime dependency. It was however a test dependency and this change (plus changes in the msgpack tests) makes it become optional.
From my point of view there is no difference between a set and a np.float32 - they're both types that json does not understand. So might as well simplify.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did not want to keep it as a test dependency because this PR introduces a use case (see the gh action changes) that proves that you can actually do everything except msgpack-numpy without having numpy installed. This was prompted by noticing that if numpy is not installed the old code was failing to serialize builtin complex.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand and I appreciate that having tests running without numpy installed is good.
I still think it would be easier to review this PR if you only added new tests instead of changing or deleting old ones. As you have it here it's harder to verify that everything still works as it used to.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've reinstated the old test alongside the new one.
| when serializing datetime objects, the error should be msgpack's TypeError, | ||
| not a "'np' is not defined error").""" | ||
| with pytest.raises(TypeError): | ||
| msgpack_loads(msgpack_dumps(datetime.datetime.now())) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why delete this test?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The test was using a very hacky monkey-patch design that really didn't work well with a separate msgpack library outside of our control. Instead of it I'm just running the whole test suite without numpy installed with a new gh actions job.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Couldn't you leave it, remove the monkeypatching, and mark it to be skipped if NumPy isn't importable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This specific test had just been renamed. I've now reinstated the previous name.
| if "__custom__" in obj: | ||
| return CustomObject(obj["__custom__"]) | ||
| return obj if chain is None else chain(obj) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
was chain unused or something? not clear why it got deleted.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It was probably used by the older code but to the best of my understanding it was not useful the way the callback API is used in all examples in the codebase, so I just simplified it. The old callbacks will continue working as long as they have chain=None in the signature. This may be disproved by dependency packages but I have my doubts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again, there's no reason to delete old testing code IMO. It just makes this PR harder to review and verify that everything will continue to work as it used to.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but... chain is always None if you define a callback with chain=None and never pass a chain parameter to it.
| assert_equal(type(self.encode_decode(b"foo")), bytes) | ||
|
|
||
| def test_str(self): | ||
| assert_equal(type(self.encode_decode("foo")), bytes) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why did these get deleted?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is thoroughly tested by the upstream msgpack tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reinstated.
| def test_chain(self): | ||
| x = ThirdParty(foo=b"test marshal/unmarshal") | ||
| x_rec = self.encode_decode_thirdparty(x) | ||
| self.assertEqual(x, x_rec) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why did this get deleted?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is already tested by test_msgpack_api.py::test_msgpack_custom_encoder_decoder
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reinstated.
| assert_equal(type(self.encode_decode("foo")), bytes) | ||
| def encode_decode(self, x): | ||
| x_enc = msgpack_dumps(x) | ||
| return msgpack_loads(x_enc) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I understand why encode_decode changed: now it only uses public APIs. But why did all the other tests get deleted?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The other tests were redundant either with the upstream tests in msgpack or with test_msgpack_api.py.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reinstated.
| def test_chain(self): | ||
| x = ThirdParty(foo=b"test marshal/unmarshal") | ||
| x_rec = self.encode_decode_thirdparty(x) | ||
| self.assertEqual(x, x_rec) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why was this test deleted?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No more chain parameter is necessary because of how _MsgpackExtensions._run works. There may be some very obscure use in the dependencies that disproves me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reinstated.
| @@ -91,10 +141,70 @@ def deserialize_obj(obj, chain=None): | |||
| assert new_data["a"] == 123 | |||
| assert isinstance(new_data["b"], CustomObject) | |||
| assert new_data["b"].value == {"foo": "bar"} | |||
| # Test that it also works with combinations of encoders/decoders (e.g. numpy) | |||
| data = {"a": numpy.zeros((1, 2, 3)), "b": CustomObject({"foo": "bar"})} | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why change this not to use numpy? You could for example change this test to be parameterized by an input argument and have one of the arguments be conditionally marked such that the test is skipped if numpy isn't importable. That way you get both the content of the test as you have it in this PR and the old test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The purpose of this test is to verify that registering a custom encoder/decoder does not break the builtin ones. So complex and np.ndarray are equivalent, with the former being able to run when numpy is not installed.
|
@ngoldbaum I've reinstated a bunch of tests where feasible. |
|
|
||
| def encode_thirdparty(self, obj): | ||
| return dict(__thirdparty__=True, foo=obj.foo) | ||
| if isinstance(obj, ThirdParty): | ||
| return {b"__thirdparty__": True, b"foo": obj.foo} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note: on first sight, this may look like a major breaking change, where str keys are no longer converted to bytes upon serialization.
This however is an artefact of the old unit test, which was using the direct msgpack API msgpack.packb(x, default=self.encode_thirdparty, use_bin_type=use_bin_type) (see old lines 28:31).
These options have never been accessible from srsly.msgpack_dumps. With the srsly API, in this as well as the old version, str round-trips to str and bytes round-trips to bytes.
|
As discussed offline with @ngoldbaum , I've removed the legacy subpackages |
High level changes
In an effort to reduce the maintainance burden, unvendor all libraries except msgpack-python and upgrade them to the latest available version, with sane but reasonable version pins.
This causes the previously vendored libraries to be bumped up several years all at once. For this reason, I'd rather not merge this PR until we gather some confidence that it doesn't introduce regressions down the line.
This PR causes srsly to become a much simpler pure-python package.
Breaking changes
This PR removes
srsly.msgpack,srsly.cloudpickle,srsly.ruamel_yaml,srsly.ujson. This could cause breakages downstream. The trivial fix is to simply usemsgpack,cloudpickle,ruamel.yaml, andujsondirectly.Free-threading (noGIL)
This PR does not, as of today, make srsly compatible with free-threading, but it will automatically do so as soon as the below issues will be fixed upstream and new upstream releases become available:
Other changes
{'x':1}to{'x': 1}. This should be purely cosmetic but may cause some downstream unit tests to trivially fail.msgpack-specific changes
msgpack-python could not be unvendored. The reasons are that the latest upstream version of it
np.float64, due to it being a subclass offloatnumpySo instead I heavily reworked the fork that we have and added extra tests.
I also wrote from scratch a system for third-party msgpack extensions that is compatible with the previously vendored system, which is no longer available upstream.
Additionally:
complexcould not be serialized by msgpack unlessnumpyis installed