@crusaderky
Contributor

@crusaderky crusaderky commented Nov 19, 2025

High level changes

In an effort to reduce the maintenance burden, unvendor all libraries except msgpack-python and upgrade them to the latest available versions, with sane, reasonable version pins.

This causes the previously vendored libraries to be bumped up several years all at once. For this reason, I'd rather not merge this PR until we gather some confidence that it doesn't introduce regressions down the line.

This PR causes srsly to become a much simpler pure-python package.

Breaking changes

This PR removes srsly.msgpack, srsly.cloudpickle, srsly.ruamel_yaml, srsly.ujson. This could cause breakages downstream. The trivial fix is to simply use msgpack, cloudpickle, ruamel.yaml, and ujson directly.

Free-threading (noGIL)

This PR does not, as of today, make srsly compatible with free-threading, but it will automatically do so as soon as the issues below are fixed upstream and new upstream releases become available:

Other changes

  • Bump version to 3.0, in an effort to avoid accidental bumps in downstream projects.
  • Fix ujson segfault in Python 3.14 (see Upgrade Python and GitHub Actions versions #117)
  • Introduce a minor change in json output, where dicts gain a whitespace between key and value (from {'x':1} to {'x': 1}). This should be purely cosmetic but may cause some downstream unit tests to trivially fail.
  • Add CI coverage for the use case where numpy is not installed
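The whitespace change in JSON output can be illustrated with the standard-library json module. This is only a stand-in: the thread does not show which ujson options srsly passes, so the separators below are an assumption chosen to reproduce the two forms. The sketch shows that the strings differ while the decoded payloads are identical, so only tests that compare raw strings should be affected.

```python
import json

data = {"x": 1}

# Compact form: no space after the key/value separator.
compact = json.dumps(data, separators=(",", ":"))
# Spaced form: the json module's default separators are (", ", ": ").
spaced = json.dumps(data)

# The serialized strings differ, but both decode to the same object.
strings_differ = compact != spaced
same_payload = json.loads(compact) == json.loads(spaced)
```

A downstream test that asserts on the decoded object rather than on the serialized string would be unaffected by this change.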

msgpack-specific changes

msgpack-python could not be unvendored. The reasons are that the latest upstream version of it

  • is quite old and uses a hacky patch system that offers no guarantee to remain feasible vs. future versions of msgpack, which is instead actively developed
  • introduces a regression where it fails to round-trip np.float64, due to it being a subclass of float
  • introduces a hard dependency on numpy

So instead I heavily reworked the fork that we have and added extra tests.

I also wrote from scratch a system for third-party msgpack extensions that is compatible with the previously vendored system, which is no longer available upstream.

Additionally:

  • Fix bug where built-in complex could not be serialized by msgpack unless numpy is installed
  • Fix bug where gc would remain permanently disabled if msgpack decode fails for any reason, e.g. in case of corrupted stream or if a third party extension raises.
  • Change exception when trying to decode a numpy object without having numpy installed from a nebulous AttributeError to a clearer ModuleNotFoundError
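The gc bug in the list above is the usual hazard of disabling the collector without a guaranteed restore. The thread does not show how the fix was written; the following is a minimal sketch of one common pattern, with `gc_paused` and `decode` as hypothetical names that are not part of srsly or msgpack.

```python
import gc
from contextlib import contextmanager


@contextmanager
def gc_paused():
    """Disable the garbage collector for the duration of the block and
    restore its previous state even if the block raises."""
    was_enabled = gc.isenabled()
    gc.disable()
    try:
        yield
    finally:
        if was_enabled:
            gc.enable()


def decode(payload: bytes) -> str:
    # Hypothetical decoder standing in for the real msgpack call.
    with gc_paused():
        if not payload:
            raise ValueError("corrupted stream")
        return payload.decode("utf-8")


gc.enable()
try:
    decode(b"")  # simulates a corrupted stream
except ValueError:
    pass
gc_still_enabled = gc.isenabled()
```

Without the `finally` clause, the failed call would leave the collector disabled for the rest of the process.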

Contributor Author

Moved to srsly/_msgpack_numpy.py and heavily reworked

data: The data to serialize.
RETURNS (bytes): The serialized bytes.
"""
return msgpack.dumps(data, use_bin_type=True)
Contributor Author

use_bin_type is true by default

Comment on lines -3 to -8
cython>=0.29.1
pytest>=4.6.5
pytest-timeout>=1.3.3
mock>=2.0.0,<3.0.0
numpy>=1.15.0
psutil
Contributor Author

Only pytest remains, so this file makes little sense anymore

python_requires = >=3.9,<3.15
setup_requires =
cython>=0.29.1
python_requires = >=3.9
Contributor Author

@crusaderky crusaderky Nov 19, 2025

Removed version cap as there is very little that can break anymore with future versions. Future compatibility is now delegated to upstream libraries.

setup.cfg Outdated
Comment on lines 30 to 34
catalogue>=2.0.10,<3
cloudpickle >=3.1.2,<4
msgpack >=1.1,<2
ruamel.yaml >=0.18.16,<1
ujson >=5.11.0,<6
Contributor Author

Lower bounds are the latest versions as of today. This simply saves the effort of pinpointing the minimum version of each dependency that works with thinc and spaCy.

Upper bounds have been set very generously to reduce the maintenance burden in srsly.

shell: bash
run: |
python -m pytest --pyargs $MODULE_NAME -Werror
run: pytest
Contributor Author

@crusaderky crusaderky Nov 19, 2025

On 3.14t this emits

=============================== warnings summary ===============================
<frozen importlib._bootstrap>:491
  <frozen importlib._bootstrap>:491: RuntimeWarning: The global interpreter lock (GIL) has been enabled to load module 'ujson', which has not declared that it can run safely without the GIL. To override this behavior and keep the GIL disabled (at your own risk), run with PYTHON_GIL=0 or -Xgil=0.

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
======================== 85 passed, 1 warning in 0.31s =========================

I'll do a one-liner follow-up that re-adds -Werror as soon as upstream releases of msgpack and ujson become available.
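For context on why -Werror had to be dropped for now: with warnings promoted to errors, a RuntimeWarning like the one above becomes an exception. A minimal sketch using only the standard-library warnings module; `load_extension` is a hypothetical stand-in for importing a module that emits the warning.

```python
import warnings


def load_extension():
    # Hypothetical stand-in for an import that emits a RuntimeWarning.
    warnings.warn("The global interpreter lock (GIL) has been enabled", RuntimeWarning)
    return "loaded"


# Default-style run: the warning is recorded and the call succeeds.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = load_extension()

# Error-style run: the same warning is raised as an exception.
with warnings.catch_warnings():
    warnings.simplefilter("error")
    try:
        load_extension()
        raised = False
    except RuntimeWarning:
        raised = True
```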

@ngoldbaum ngoldbaum left a comment

Just because I'm not familiar with the package: are the tests all inherited from upstream or did you do the edits to the tests manually?

Overall awesome, I love PRs that delete a ton of code. Hopefully testing spaCy doesn't lead to discovering more issues.

Can you go into a little more detail than what's in the PR description into what motivated adding _MsgpackExtensions over using the old code?

Contributor Author

I was wrong in the opening message. The API

msgpack_encoders.register(name, cb)
msgpack_decoders.register(name, cb)

was not part of a very old version of msgpack like I thought; it was original to srsly all along.
I had to reimplement it from scratch to make it work with an unpatched msgpack, and to deal with subtle breakages with np.float64 (which is a subclass of builtin float).
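The np.float64 subtlety mentioned above can be sketched without numpy or msgpack. This is not how _MsgpackExtensions is implemented (the thread does not show that); it is a toy registry under stated assumptions: `Float64` stands in for np.float64, and `encoders`, `decoders` and `run` are hypothetical names. It shows why a serializer that handles every `float` instance natively loses the subclass, while consulting the registered callbacks first preserves it.

```python
class Float64(float):
    """Hypothetical stand-in for np.float64, which subclasses builtin float."""


def encode_float64(obj):
    # Return a tagged dict for objects this callback recognizes,
    # or the object unchanged otherwise.
    if type(obj) is Float64:
        return {"__float64__": float(obj)}
    return obj


def decode_float64(obj):
    if isinstance(obj, dict) and "__float64__" in obj:
        return Float64(obj["__float64__"])
    return obj


# name -> callback, loosely mirroring register(name, cb)
encoders = {"float64": encode_float64}
decoders = {"float64": decode_float64}


def run(callbacks, obj):
    # Try each registered callback; the first one that changes the object wins.
    for cb in callbacks.values():
        out = cb(obj)
        if out is not obj:
            return out
    return obj


x = Float64(1.5)
# A serializer that treats every float instance natively drops the subclass.
naive = float(x) if isinstance(x, float) else run(encoders, x)
# Running the callbacks first preserves the subclass across the round trip.
restored = run(decoders, run(encoders, x))
```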

@@ -0,0 +1 @@
from ruamel.yaml import *

I don't see how you're getting the same thing that the change in 484f519 provided before. Can you elaborate?

@honnibal
Member

FWIW, I mentioned having concerns about the library that led to this decision to vendor it and remove these unsafe branches. I found the message where I explained this to our team at the time:

So YAML has this infamous vulnerability where if you are so foolish as to write yaml.load as opposed to yaml.safe_load, you allow arbitrary code execution, as the YAML message may contain pickled Python objects.
So consider these lines: https://github.com/pycontribs/ruamel-yaml/blob/master/main.py#L1392
if issubclass(Loader, BaseLoader):
BaseConstructor.add_constructor(tag, object_constructor)
elif issubclass(Loader, SafeLoader):
SafeConstructor.add_constructor(tag, object_constructor)

Looks...alarming, right? Surely the order of conditions is wrong, so it will never enter SafeLoader 😱
Ah but wait...silly me. Why would I think SafeLoader is a subclass of BaseLoader? In fact SafeLoader has 99 base classes but BaseLoader ain't one:

class SafeLoader(Reader, Scanner, Parser, Composer, SafeConstructor, VersionedResolver):

This wasn't originally written as a public message so it's more flippant than I'd generally be --- no code is perfect, and this relates to stylistic concerns that are matters of judgment. But my concern was that future versions of the library could easily introduce a regression that affected the safe loading feature.

As I said on the call I can agree to unvendor the code as the cost/benefit analysis on the maintenance burden is different now. But I definitely want to make sure we maintain the behaviours that we had before, of forcing the library to only work in safe mode. A test that checked that the code execution features are in fact disabled would also be appreciated.
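The concern about the issubclass branches quoted above can be reproduced with toy classes. The hierarchy below is a simplified assumption, not ruamel.yaml's real class layout, and `RefactoredSafeLoader` is a hypothetical future change. The sketch shows that the `elif` branch is reachable only as long as the safe loader does not inherit from the base loader.

```python
class BaseConstructor:
    pass


class SafeConstructor(BaseConstructor):
    pass


class BaseLoader(BaseConstructor):
    pass


class SafeLoader(SafeConstructor):
    # Not derived from BaseLoader, so the second branch below is reachable.
    pass


class RefactoredSafeLoader(BaseLoader):
    # Hypothetical refactor that makes the safe loader inherit from BaseLoader.
    pass


def pick_branch(loader):
    if issubclass(loader, BaseLoader):
        return "base"
    elif issubclass(loader, SafeLoader):
        return "safe"
    return "other"


today = pick_branch(SafeLoader)
after_refactor = pick_branch(RefactoredSafeLoader)
```

Here `today` is "safe", while the hypothetical refactor silently falls into the "base" branch, which is the kind of regression a dedicated test is meant to catch.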

@crusaderky
Contributor Author

Not sure I understand your worry here. It seems to me that to get a code injection one needs to

  1. deliberately bypass the srsly API and directly use either ruamel.yaml or the backwards compatibility alias srsly.yaml, and
  2. explicitly opt in with typ='unsafe'.

I too remember the time when yaml.load was unsafe by default, but that was many years ago.

>>> import io
>>> import srsly
>>> from ruamel.yaml import YAML

>>> class C:
...     def __new__(cls):
...         print("Arbitrary code execution!")
...         return object.__new__(cls)

>>> c = C()
Arbitrary code execution!

>>> srsly.yaml_dumps(c)
RepresenterError: cannot represent an object: <__main__.C object at 0x724fd4819e10>

>>> buf = io.StringIO()
>>> yaml = YAML(typ='unsafe')
>>> yaml.dump(c, buf)
>>> buf.getvalue()
'!!python/object:__main__.C {}\n'

>>> buf.seek(0)
>>> yaml.load(buf)
Arbitrary code execution!
<__main__.C at 0x724f75c75050>

>>> yaml2 = YAML()
>>> buf.seek(0)
>>> yaml2.load(buf)
{}

>>> srsly.yaml_loads(buf.getvalue())

ValueError: Invalid YAML: could not determine a constructor for the tag 'tag:yaml.org,2002:python/object:__main__.C'
  in "<unicode string>", line 1, column 1:
    !!python/object:__main__.C {}
    ^ (line: 1)

I've added a unit test to verify this behaviour.

@crusaderky
Contributor Author

Just because I'm not familiar with the package: are the tests all inherited from upstream or did you do the edits to the tests manually?

All surviving tests are unique to srsly. All changes in them are my own edits.
I've deleted all tests that were inherited from upstream.

@ngoldbaum ngoldbaum left a comment

Did a pass over the tests. IMO it would be better if this PR only added new tests instead of changing or deleting old tests.

m = Malicious()
buf = StringIO()
yaml = YAML(typ="full")
yaml.dump(m, buf)

It would be a little more convincing if this test used srsly.write_yaml or srsly.yaml_dumps instead.

Contributor Author

Neither of those functions, by design, allows you to serialize an arbitrary Python object.

A test demonstrating that would be more useful than what's here IMO. I searched and there are zero uses of srsly.ruamel_yaml in the explosion stack outside srsly itself. Testing the public API is what's important.

Contributor Author

This test does not test srsly.ruamel_yaml.

This test verifies that a YAML-legal malicious payload, which can only be crafted with either unwrapped ruamel.yaml.YAML or PyYAML, cannot cause arbitrary code execution in srsly.yaml_loads.

I've now added a test that shows that you get a meaningful error message in srsly.write_yaml when trying to write arbitrary objects, and removed the lines that use ruamel.yaml.YAML() to load the payload back in order to prevent confusion.

I searched and there are zero uses of srsly.ruamel_yaml in the explosion stack outside srsly itself.

Are you suggesting not to introduce the dummy srsly/ruamel_yaml.py etc. backwards compatibility modules, incurring a slight risk that third-party packages may (trivially) break?

with pytest.raises(TypeError):
s = json_dumps(f)
with pytest.raises(TypeError, match="is not JSON serializable"):
s = json_dumps({1, 2})

why not keep the old content too?

Maybe you wanted to get rid of numpy as a dependency? You could leave it as a test dependency.

Contributor Author

numpy was never a runtime dependency. It was however a test dependency and this change (plus changes in the msgpack tests) makes it become optional.
From my point of view there is no difference between a set and a np.float32 - they're both types that json does not understand. So might as well simplify.
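The equivalence argued here can be checked with the standard-library json module, used as a stand-in because srsly.json_dumps is backed by ujson and its exact error text is not confirmed by this sketch. A builtin complex plays the role of np.float32 as "another type json does not understand"; `error_message` is a hypothetical helper.

```python
import json


def error_message(obj):
    """Return the TypeError message raised when serializing obj, or None."""
    try:
        json.dumps(obj)
    except TypeError as exc:
        return str(exc)
    return None


set_error = error_message({1, 2})
complex_error = error_message(1 + 2j)
ok = error_message({"a": 1})
```

Both unsupported types take the same failure path, which is the sense in which one can replace the other in a test of the error branch.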

Contributor Author

I did not want to keep it as a test dependency because this PR introduces a use case (see the gh action changes) that proves that you can actually do everything except msgpack-numpy without having numpy installed. This was prompted by noticing that if numpy is not installed the old code was failing to serialize builtin complex.

I understand and I appreciate that having tests running without numpy installed is good.

I still think it would be easier to review this PR if you only added new tests instead of changing or deleting old ones. As you have it here it's harder to verify that everything still works as it used to.

Contributor Author

I've reinstated the old test alongside the new one.

when serializing datetime objects, the error should be msgpack's TypeError,
not a "'np' is not defined error")."""
with pytest.raises(TypeError):
msgpack_loads(msgpack_dumps(datetime.datetime.now()))

why delete this test?

Contributor Author

The test was using a very hacky monkey-patch design that really didn't work well with a separate msgpack library outside of our control. Instead of it I'm just running the whole test suite without numpy installed with a new gh actions job.

Couldn't you leave it, remove the monkeypatching, and mark it to be skipped if NumPy isn't importable?

Contributor Author

This specific test had just been renamed. I've now reinstated the previous name.

if "__custom__" in obj:
return CustomObject(obj["__custom__"])
return obj if chain is None else chain(obj)

was chain unused or something? not clear why it got deleted.

Contributor Author

It was probably used by the older code but to the best of my understanding it was not useful the way the callback API is used in all examples in the codebase, so I just simplified it. The old callbacks will continue working as long as they have chain=None in the signature. This may be disproved by dependency packages but I have my doubts.

Again, there's no reason to delete old testing code IMO. It just makes this PR harder to review and verify that everything will continue to work as it used to.

Contributor Author

but... chain is always None if you define a callback with chain=None and never pass a chain parameter to it.
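A minimal sketch of that point, reusing the callback shape from the test quoted earlier in this thread. `call_decoder` is a hypothetical caller that never passes `chain`; it is not srsly's dispatch code.

```python
class CustomObject:
    def __init__(self, value):
        self.value = value


def deserialize_obj(obj, chain=None):
    # Old-style callback signature with an optional chain parameter.
    if "__custom__" in obj:
        return CustomObject(obj["__custom__"])
    return obj if chain is None else chain(obj)


def call_decoder(cb, obj):
    # Hypothetical caller that never supplies `chain`, so it stays None.
    return cb(obj)


hit = call_decoder(deserialize_obj, {"__custom__": {"foo": "bar"}})
miss = call_decoder(deserialize_obj, {"a": 123})
```

With `chain` left at its default, the fallback branch always returns the object unchanged, so callbacks written against the old signature keep working.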

assert_equal(type(self.encode_decode(b"foo")), bytes)

def test_str(self):
assert_equal(type(self.encode_decode("foo")), bytes)

why did these get deleted?

Contributor Author

This is thoroughly tested by the upstream msgpack tests.

Contributor Author

Reinstated.

def test_chain(self):
x = ThirdParty(foo=b"test marshal/unmarshal")
x_rec = self.encode_decode_thirdparty(x)
self.assertEqual(x, x_rec)

why did this get deleted?

Contributor Author

This is already tested by test_msgpack_api.py::test_msgpack_custom_encoder_decoder

Contributor Author

Reinstated.

assert_equal(type(self.encode_decode("foo")), bytes)
def encode_decode(self, x):
x_enc = msgpack_dumps(x)
return msgpack_loads(x_enc)

I think I understand why encode_decode changed: now it only uses public APIs. But why did all the other tests get deleted?

Contributor Author

The other tests were redundant either with the upstream tests in msgpack or with test_msgpack_api.py.

Contributor Author

Reinstated.

def test_chain(self):
x = ThirdParty(foo=b"test marshal/unmarshal")
x_rec = self.encode_decode_thirdparty(x)
self.assertEqual(x, x_rec)

why was this test deleted?

Contributor Author

No more chain parameter is necessary because of how _MsgpackExtensions._run works. There may be some very obscure use in the dependencies that disproves me.

Contributor Author

Reinstated.

@@ -91,10 +141,70 @@ def deserialize_obj(obj, chain=None):
assert new_data["a"] == 123
assert isinstance(new_data["b"], CustomObject)
assert new_data["b"].value == {"foo": "bar"}
# Test that it also works with combinations of encoders/decoders (e.g. numpy)
data = {"a": numpy.zeros((1, 2, 3)), "b": CustomObject({"foo": "bar"})}
@ngoldbaum ngoldbaum Nov 20, 2025

why change this not to use numpy? You could for example change this test to be parameterized by an input argument and have one of the arguments be conditionally marked such that the test is skipped if numpy isn't importable. That way you get both the content of the test as you have it in this PR and the old test.
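The suggestion above, keeping the numpy variant but skipping it when numpy is not importable, might look roughly like the sketch below. It uses the standard-library unittest module to stay dependency-free; in the pytest suite discussed in this thread the equivalent would be a skipif marker on the numpy case. `make_payloads` and `TestCombinations` are hypothetical names.

```python
import importlib.util
import unittest

HAS_NUMPY = importlib.util.find_spec("numpy") is not None


def make_payloads():
    # Payloads that need no optional dependency.
    payloads = {"complex": 1 + 2j}
    if HAS_NUMPY:
        import numpy

        payloads["ndarray"] = numpy.zeros((1, 2, 3))
    return payloads


class TestCombinations(unittest.TestCase):
    def test_builtin_payload(self):
        # Always runs, with or without numpy.
        self.assertIn("complex", make_payloads())

    @unittest.skipUnless(HAS_NUMPY, "numpy is not installed")
    def test_numpy_payload(self):
        # Runs only when numpy can be imported.
        self.assertEqual(make_payloads()["ndarray"].shape, (1, 2, 3))


suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestCombinations)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```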

Contributor Author

The purpose of this test is to verify that registering a custom encoder/decoder does not break the builtin ones. So complex and np.ndarray are equivalent, with the former being able to run when numpy is not installed.

@crusaderky
Contributor Author

@ngoldbaum I've reinstated a bunch of tests where feasible.


def encode_thirdparty(self, obj):
return dict(__thirdparty__=True, foo=obj.foo)
if isinstance(obj, ThirdParty):
return {b"__thirdparty__": True, b"foo": obj.foo}
Contributor Author

@crusaderky crusaderky Nov 21, 2025

Note: at first sight, this may look like a major breaking change, where str keys are no longer converted to bytes upon serialization.
This however is an artefact of the old unit test, which was using the direct msgpack API msgpack.packb(x, default=self.encode_thirdparty, use_bin_type=use_bin_type) (see old lines 28:31).
These options have never been accessible from srsly.msgpack_dumps. With the srsly API, in this as well as the old version, str round-trips to str and bytes round-trips to bytes.

@crusaderky
Contributor Author

As discussed offline with @ngoldbaum , I've removed the legacy subpackages srsly.msgpack, srsly.cloudpickle, srsly.ruamel_yaml, and srsly.ujson. This is a breaking change.
