PR: Use fastjsonschema if installed and add tests #191

goanpeca · 2020-09-24T00:06:00Z

Fixes #190

This adds the possibility of using fastjsonschema if installed and other libraries if installed or defined by a NBFORMAT_VALIDATOR environment variable.

The possible values are 1-1 to the module names so, jsonschema, fastjsonschema, eventually (when stable) we could also add jsonschema_rs, the rust implementation.

Added a tests to check things work as expected with both libraries, with one caveat: test that expect a better_error will not work, so those 2 tests are skipped for fastjsonschema. The rest works as expected. This means if using fastjsonschema printed errors will be less readable.
Update CI configuration.
Used pytest parametrize

Related to jupyter-server/jupyter_server#312

Pinging @echarles, @jasongrout, @Zsailer, @mlucool

bollwyvl · 2020-09-24T12:29:53Z

This seems like a good step forward... but I have this feeling, much like with ujson/orjson on jupyterlab/jupyterlab_server#95, that just picking a second implementation and writing multiple code paths in many places will make the effort harder to manage in the future.

I guess, to both ends, a configurable JSONManager (e.g .parser_cls, .validator_cls) which offered a normalized, best-effort API for all the JSON stuff would answer the mail.

An additional thing it could provide would be an async interface to (de)serialization and validation, using (probably) an executor, as to my knowledge, all of these things are blocking, which is fine for, say, nbconvert, but not so fine in a server setting.

Such a manager could be started here in nbformat, as it's as close as anything to the bits-on-disk, and moved at a later date.

goanpeca · 2020-09-24T13:26:48Z

Hi @bollwyvl, thanks for the feedback.

writing multiple code paths in many places will make the effort harder to manage in the future.

Agreed, it does make the code path harder to follow and I actually tried it but for this specific case, that would require making way more involved changes and this was a "low hanging fruit fix". This fix is also kinda "urgent" for the use case we ran into.

We were having jupyter_server taking 40 seconds to actually send a notebook!

I guess, to both ends, a configurable JSONManager (e.g .parser_cls, .validator_cls) which offered a normalized, best-effort API for all the JSON stuff would answer the mail.

Agreed, except that unlike the jupyter_server case we have at least 3-4 (more?) different competing implementations of JSON that are fast in general. Here we have just this one (in python), the other option is slow too. We could explore using other libraries written in other languages, but that would also require calling them via subprocess and rethinking a lot along the way.

I guess, to both ends, a configurable JSONManager

At this point there are 2 things this manager would at least need to do. Reading and validation, and the bottleneck here is validation, not reading, although I could see benefits in that area as well.

Also I think this is an interesting project enough (the JSONManager) that I would prefer starting a new package for it. Could you create a repo in deathbeds and I can start coding in there with more freedom :-)

bollwyvl · 2020-09-24T13:58:39Z

"low hanging fruit fix"

... unless if collecting the low-hanging fruit makes the tree unhealthy.

Once use_fast is in a bunch of places, it gets a lot harder to change, whereas if it's a config-level thing, with a well-defined interface, then no existing code has to change to take advantage of the performance gains.

For example, I frankly wouldn't see running both a jsonschema and fastjsonschema validation during the same import... and indeed, perhaps the copy-pasted validator caching is something this manager could provide, by default, as it's almost always what you'd want, given the warm-up cost of both implementations considered to this point.

Speaking of implementations...

Here we have just this one (in python), the other option is slow too.

Well, there's at least these:

https://pypi.org/project/validr (cython)
https://pypi.org/project/jsonschema-rs (rust)

I have no benchmarks, but I'd wager, for certain workloads, these would be faster.

Could you create a repo in deathbeds and I can start coding in there with more freedom :-)

Not to down-sell our mad science experiments over there, but nobody looks at deathbeds, and a lot of the stuff in there is intentionally not "core ready". "How Jupyter Does JSON Standards" is a pretty big deal, and should probably be incubated more publicly in an official project... to me, that seems like jupyter_core, but it is intentionally lightweight (woulda been reeeeeally nice to solve the win/py38 ProactorEventLoop there, but didn't want to take the tornado opinion, which is far lighter-weight than anything discussed here).

So, like I said, right here in nbformat is probably already the right place, and could start with something far simpler, e.g. a json_compat.py which offers the expected API, and can be overridden at the import level, without digging into traitlets config.

goanpeca · 2020-09-24T14:04:22Z

Sounds good, will take a look ;-)

nbformat/json_compat.py

nbformat/validator.py

echarles

Forgot to submit my morning review, which is I believe deprecated based on recent discussions.

echarles · 2020-09-24T05:13:49Z

nbformat/validator.py

@@ -294,7 +327,8 @@ def iter_validate(nbdict=None, ref=None, version=None, version_minor=None,
    if version is None:
        version, version_minor = get_version(nbdict)

-    validator = get_validator(version, version_minor, relax_add_props=relax_add_props)
+    validator = get_validator(version, version_minor, relax_add_props=relax_add_props,
+                              use_fast=False)


use_fast is always set to False. Anyway that a consumer of the nbformat library would use it with use_fast set to True? I mean, it sounds to me reading the changes that it will not be possible for a consumer to benefit from this enhancement as it would always be disabled.

A consumer can can call validate(nb, use_fast=True), but what about the double validation question we have identified in jupyter_server? The validation is always triggered on load in nbformat?

Hi @echarles, thanks for the feedback. So:

use_fast is always set to False. Anyway that a consumer of the nbformat library would use it with use_fast set to True? I mean, it sounds to me reading the changes that it will not be possible for a consumer to benefit from this enhancement as it would always be disabled.

Yes, this is because there is some nested logic that required the "switch"

validate -> iter_validate -> better_validation_errors -> validate ....

I am looking into decoupling it so the use_fast is no longer needed.

A consumer can can call validate(nb, use_fast=True), but what about the double validation question we have identified in jupyter_server? The validation is always triggered on load in nbformat?

The read method which also validates (but does not raise errors) uses now use_fast=True. Since we use validate on jupyter_server, we could add the use_fas=True there, so we would still have double validation, but now both are fast.

We could also evaluate if there should be an extra parameter on the read method, to actually skip validation, I think it should, but be True by default.

Still, I would prefer no to have the use_fast parameter, so I need to think of a better way to decouple the
validate -> iter_validate -> better_validation_errors -> validate .... chain

bollwyvl · 2020-09-24T14:57:40Z

The force push makes it very hard to follow the work... just lost a comment to the effect of:

I don't think we'd want to mix knowledge of both implementations in a single class. I also don't know how to make PFJS quack exactly like iter_errors...

So there might be a `Validator`, which implements the canonical `jsonschema`, and `FastValidator` which subclasses that. Finally, there's the missing piece of `get_validator` or something which would actually pick one... i guess it could be hidden behind an env var for now, e.g. `NBFORMAT_VALIDATOR=fast` but, again, would need to become end-user configurable to be exactly _one_ value.

goanpeca · 2020-09-24T14:58:14Z

The force push makes it very hard to follow the work... just lost a comment to the effect of:

Sorry!

goanpeca · 2020-09-24T15:01:26Z

I don't think we'd want to mix knowledge of both implementations in a single class.

Yes, it would be cleaner to have different classes, but as you say:

I also don't know how to make PFJS quack exactly like iter_errors...

Yeah this is not possible since this is in not a 1-1 replacement, so it would not make sense to split more at this stage

So there might be a Validator, which implements the canonical jsonschema, and FastValidator which subclasses that.

See para above.

Finally, there's the missing piece of get_validator or something which would actually pick one... i guess it could be hidden behind an env var for now, e.g. NBFORMAT_VALIDATOR=fast but, again, would need to become end-user configurable to be exactly one value.

Yes, in the case of having more validators bundled, yes, but I would say that is out of the scope of this one.

If more validators are bundled, yes, at the current state it would not make much sense since only fast validation is using the fast validator (on read), the iter_errors is only provided by jsonschema, so all those splitting of classes do not make sense in this iteration. For people using the validator func outside (like jupyter_server) we have the option of using fast validation, or "slow" as a parameter.

Also, the NBFORMAT_VALIDATOR does not make sense in this iteration.

Thanks for the review @bollwyvl !

goanpeca · 2020-09-24T15:11:40Z

Will think a bit harder to decouple the better_validation_errors logic, which is the part holding the splitting of classes.

bollwyvl · 2020-09-24T15:27:28Z

iter_errors is only provided by jsonschema

It could just yield the first error.

If someone is use iter_errors and treating "some" errors as warnings, or counting errors, then they are setting up their stuff for failure, anyway.

Also, the NBFORMAT_VALIDATOR does not make sense in this iteration.

Ok, then json_compat.use_validator("fastjsonschema") or something. In the future, this can become whatever else the implementation is.

I feel like adding use_fast to the API is the part that isn't scalable, given the layer of the stack we are in. If, at import/reconfigure time, some other validator we know about a) is available, and b) we want to use it, it should use it, and only it... every downstream should not have to discover this new API flag, and then think about how it doesn't work for their use case.

goanpeca · 2020-09-24T15:28:21Z

I feel like adding use_fast to the API is the part that isn't scalable, given the layer of the stack we are in.

Agreed, I just need to figure out splitting the logic, so working on that :-)

goanpeca · 2020-09-24T18:38:08Z

@MSeal would it be ok to add pytest so I can add fixtures with parameters for the different libraries available?

goanpeca · 2020-09-24T20:20:13Z

This one is ready for review @MSeal, thanks!

MSeal

Looks fine to me, will give @bollwyvl some time to respond in case he has further thoughts on his side.

bollwyvl · 2020-09-25T12:22:16Z

Sorry for the delay on review! Yeah, this is looking great. 🎉

This gets us to where a downstream can cleanly experiment already... that rust one sounds so promising!

The only things missing from this PR that I would think could be fixed in this PR or on subsequent PRs:

changelog entry
a narrative sentence and example showing how to use NBFORMAT_VALIDATOR
- either in the comment of the appropriate method, or in the rst
- probably also a link to fastjsonschema's docs
an extras in setup.py to document the fastjsonschema dependency
- if it makes sense, add a sensible lower and/or upper bound, e.g. >=2,<3

Definitely out of scope, not blocking, but could be investigated:

how might we formalize the extensibility e.g. entry_point, traitlets?
how might we capture and document the performance trade-offs of these approaches (under different loads) e.g. asv?
how could this approach be expanded to include serialization with e.g. orjson, pysmdjson (also rust-based)
how could these could be made async (I owe @goanpeca an issue for this one 😊)?

goanpeca · 2020-09-25T13:17:44Z

Thanks @bollwyvl!

The only things missing from this PR that I would think could be fixed in this PR or on subsequent PRs:

All great suggestions, I think they should be added on this PR, just wanted to do it after agreeing on the workflow.

Definitely out of scope, not blocking, but could be investigated:

Absolutely! And yes you owe me some issues :-p

Will get this done later today!

…variable

goanpeca · 2020-10-04T20:11:26Z

Added some info on the docs (API section)
Update changelog
Updated the README.

MSeal

One minor comment, at this point I'm happy to merge though when that gets touched up.

MSeal · 2020-10-06T16:50:03Z

setup.py

@@ -90,6 +90,7 @@
 ]

 extras_require = setuptools_args['extras_require'] = {
+    'fast': ['fastjsonschema'],


We should also add this to test so tests can use it without issue,

goanpeca · 2020-10-06T18:21:54Z

One minor comment, at this point I'm happy to merge though when that gets touched up.

Made the suggested fixes @MSeal ! Thanks :-)

MSeal

Thanks for following up on the PR reviews quickly!

goanpeca · 2020-10-06T20:28:43Z

Wohooo! Thanks @bollwyvl, @echarles and @MSeal :-) !

Now the follow up question. Could a release be planed soon :-) ?

Thanks!

MSeal · 2020-10-06T21:27:17Z

We also have #189 open and ready for review -- I'd like to get it released too if we can. If that takes longer to merge we can ship a release. Very soon I won't be able to take much action (newborn coming) on these projects so someone else will likely need to do the release itself.

goanpeca · 2020-10-06T21:28:56Z

Very soon I won't be able to take much action (newborn coming) on these projects so

Congratulations!

someone else will likely need to do the release itself.

I would be happy to help anyway I can :-), I can set the release process to be tied to tags and run the pypinpm publication on CI, that would simplify the process for anyone mainting this.

Cheers!

echarles · 2020-10-07T06:09:55Z

Congrats @goanpeca for this great work and improvements. Really loved how things have moved forward. Thanks also to @MSeal and @bollwyvl for the reviews.

I can set the release process to be tied to tags and run the pypinpm publication on CI, that would simplify the process for anyone mainting this.

That would be another great contribution @goanpeca

MSeal · 2020-10-10T22:08:48Z

I went ahead and released 5.0.8 with this change since the other PR is likely going to have to sit until JupyterCon is done to get proper eyes on it

mlucool · 2020-10-14T23:13:33Z

Thanks for releasing @MSeal. It looks like pypi was missed when you published. Could you please update that too?

MSeal · 2020-10-15T01:08:25Z

Huh that's strange, I guess when I did the push originally there was an error message I missed. Looks fixed now.

goanpeca force-pushed the enh/fastjsonschema branch from 9930a44 to 45ee420 Compare September 24, 2020 00:10

goanpeca changed the title ~~PR: Use fastjsonschema if installed and tests.~~ PR: Use fastjsonschema if installed and tests Sep 24, 2020

goanpeca force-pushed the enh/fastjsonschema branch 2 times, most recently from ea9ae9d to 5a51f83 Compare September 24, 2020 04:39

goanpeca marked this pull request as draft September 24, 2020 14:09

bollwyvl reviewed Sep 24, 2020

View reviewed changes

nbformat/json_compat.py Outdated Show resolved Hide resolved

bollwyvl reviewed Sep 24, 2020

View reviewed changes

nbformat/json_compat.py Outdated Show resolved Hide resolved

bollwyvl reviewed Sep 24, 2020

View reviewed changes

nbformat/validator.py Outdated Show resolved Hide resolved

bollwyvl reviewed Sep 24, 2020

View reviewed changes

nbformat/validator.py Outdated Show resolved Hide resolved

echarles reviewed Sep 24, 2020

View reviewed changes

goanpeca force-pushed the enh/fastjsonschema branch 3 times, most recently from a92153e to 2b339e7 Compare September 24, 2020 14:57

goanpeca marked this pull request as ready for review September 24, 2020 14:58

goanpeca changed the title ~~PR: Use fastjsonschema if installed and tests~~ PR: Use fastjsonschema if installed and add tests Sep 24, 2020

goanpeca force-pushed the enh/fastjsonschema branch 2 times, most recently from 60bb71f to 7b4274c Compare September 24, 2020 20:16

goanpeca requested a review from bollwyvl September 24, 2020 20:19

goanpeca requested a review from MSeal September 24, 2020 20:58

MSeal approved these changes Sep 24, 2020

View reviewed changes

bollwyvl approved these changes Sep 25, 2020

View reviewed changes

bollwyvl mentioned this pull request Sep 25, 2020

nbformat asynchronous API? #194

Open

goanpeca added 4 commits October 4, 2020 14:54

Use fastjsonschema if installed and tests.

41bb87d

Refactor code into a common validator

c235609

Split classes and add reading the current validator from environment …

4c708b6

…variable

Add review comments.

49da28f

goanpeca force-pushed the enh/fastjsonschema branch from 066e15c to 49da28f Compare October 4, 2020 19:54

Update docs, readme and changelog and CI for tests

eb74cf7

goanpeca requested review from bollwyvl and MSeal October 4, 2020 20:12

Fix extras

99603b1

MSeal approved these changes Oct 6, 2020

View reviewed changes

Add fastjsonshcema to test deps and update CI

e977d22

MSeal approved these changes Oct 6, 2020

View reviewed changes

MSeal merged commit 23272c1 into jupyter:master Oct 6, 2020

goanpeca deleted the enh/fastjsonschema branch October 6, 2020 20:28

jess-x mentioned this pull request Mar 6, 2022

Meeting Notes 2022 jupyter-server/team-compass#15

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

PR: Use fastjsonschema if installed and add tests #191

PR: Use fastjsonschema if installed and add tests #191

goanpeca commented Sep 24, 2020 •

edited

Loading

bollwyvl commented Sep 24, 2020

goanpeca commented Sep 24, 2020 •

edited

Loading

bollwyvl commented Sep 24, 2020

goanpeca commented Sep 24, 2020

echarles left a comment

echarles Sep 24, 2020

echarles Sep 24, 2020

goanpeca Sep 24, 2020 •

edited

Loading

bollwyvl commented Sep 24, 2020

goanpeca commented Sep 24, 2020

goanpeca commented Sep 24, 2020 •

edited

Loading

goanpeca commented Sep 24, 2020

bollwyvl commented Sep 24, 2020

goanpeca commented Sep 24, 2020

goanpeca commented Sep 24, 2020

goanpeca commented Sep 24, 2020

MSeal left a comment

bollwyvl commented Sep 25, 2020

goanpeca commented Sep 25, 2020

goanpeca commented Oct 4, 2020 •

edited

Loading

MSeal left a comment

MSeal Oct 6, 2020

goanpeca Oct 6, 2020

goanpeca commented Oct 6, 2020

MSeal left a comment

goanpeca commented Oct 6, 2020

MSeal commented Oct 6, 2020

goanpeca commented Oct 6, 2020

echarles commented Oct 7, 2020

MSeal commented Oct 10, 2020

mlucool commented Oct 14, 2020

MSeal commented Oct 15, 2020

PR: Use fastjsonschema if installed and add tests #191

PR: Use fastjsonschema if installed and add tests #191

Conversation

goanpeca commented Sep 24, 2020 • edited Loading

bollwyvl commented Sep 24, 2020

goanpeca commented Sep 24, 2020 • edited Loading

bollwyvl commented Sep 24, 2020

goanpeca commented Sep 24, 2020

echarles left a comment

Choose a reason for hiding this comment

echarles Sep 24, 2020

Choose a reason for hiding this comment

echarles Sep 24, 2020

Choose a reason for hiding this comment

goanpeca Sep 24, 2020 • edited Loading

Choose a reason for hiding this comment

bollwyvl commented Sep 24, 2020

goanpeca commented Sep 24, 2020

goanpeca commented Sep 24, 2020 • edited Loading

goanpeca commented Sep 24, 2020

bollwyvl commented Sep 24, 2020

goanpeca commented Sep 24, 2020

goanpeca commented Sep 24, 2020

goanpeca commented Sep 24, 2020

MSeal left a comment

Choose a reason for hiding this comment

bollwyvl commented Sep 25, 2020

goanpeca commented Sep 25, 2020

goanpeca commented Oct 4, 2020 • edited Loading

MSeal left a comment

Choose a reason for hiding this comment

MSeal Oct 6, 2020

Choose a reason for hiding this comment

goanpeca Oct 6, 2020

Choose a reason for hiding this comment

goanpeca commented Oct 6, 2020

MSeal left a comment

Choose a reason for hiding this comment

goanpeca commented Oct 6, 2020

MSeal commented Oct 6, 2020

goanpeca commented Oct 6, 2020

echarles commented Oct 7, 2020

MSeal commented Oct 10, 2020

mlucool commented Oct 14, 2020

MSeal commented Oct 15, 2020

goanpeca commented Sep 24, 2020 •

edited

Loading

goanpeca commented Sep 24, 2020 •

edited

Loading

goanpeca Sep 24, 2020 •

edited

Loading

goanpeca commented Sep 24, 2020 •

edited

Loading

goanpeca commented Oct 4, 2020 •

edited

Loading