Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PR: Use fastjsonschema if installed and add tests #191

Merged
merged 7 commits into from
Oct 6, 2020

Conversation

goanpeca
Copy link
Contributor

@goanpeca goanpeca commented Sep 24, 2020

Fixes #190

This adds the possibility of using fastjsonschema if installed and other libraries if installed or defined by a NBFORMAT_VALIDATOR environment variable.

The possible values are 1-1 to the module names so, jsonschema, fastjsonschema, eventually (when stable) we could also add jsonschema_rs, the rust implementation.

  • Added a tests to check things work as expected with both libraries, with one caveat: test that expect a better_error will not work, so those 2 tests are skipped for fastjsonschema. The rest works as expected. This means if using fastjsonschema printed errors will be less readable.
  • Update CI configuration.
  • Used pytest parametrize

Related to jupyter-server/jupyter_server#312

Pinging @echarles, @jasongrout, @Zsailer, @mlucool

@goanpeca goanpeca changed the title PR: Use fastjsonschema if installed and tests. PR: Use fastjsonschema if installed and tests Sep 24, 2020
@goanpeca goanpeca force-pushed the enh/fastjsonschema branch 2 times, most recently from ea9ae9d to 5a51f83 Compare September 24, 2020 04:39
@bollwyvl
Copy link
Contributor

This seems like a good step forward... but I have this feeling, much like with ujson/orjson on jupyterlab/jupyterlab_server#95, that just picking a second implementation and writing multiple code paths in many places will make the effort harder to manage in the future.

I guess, to both ends, a configurable JSONManager (e.g .parser_cls, .validator_cls) which offered a normalized, best-effort API for all the JSON stuff would answer the mail.

An additional thing it could provide would be an async interface to (de)serialization and validation, using (probably) an executor, as to my knowledge, all of these things are blocking, which is fine for, say, nbconvert, but not so fine in a server setting.

Such a manager could be started here in nbformat, as it's as close as anything to the bits-on-disk, and moved at a later date.

@goanpeca
Copy link
Contributor Author

goanpeca commented Sep 24, 2020

Hi @bollwyvl, thanks for the feedback.

writing multiple code paths in many places will make the effort harder to manage in the future.

Agreed, it does make the code path harder to follow and I actually tried it but for this specific case, that would require making way more involved changes and this was a "low hanging fruit fix". This fix is also kinda "urgent" for the use case we ran into.

We were having jupyter_server taking 40 seconds to actually send a notebook!

I guess, to both ends, a configurable JSONManager (e.g .parser_cls, .validator_cls) which offered a normalized, best-effort API for all the JSON stuff would answer the mail.

Agreed, except that unlike the jupyter_server case we have at least 3-4 (more?) different competing implementations of JSON that are fast in general. Here we have just this one (in python), the other option is slow too. We could explore using other libraries written in other languages, but that would also require calling them via subprocess and rethinking a lot along the way.


I guess, to both ends, a configurable JSONManager

At this point there are 2 things this manager would at least need to do. Reading and validation, and the bottleneck here is validation, not reading, although I could see benefits in that area as well.


Also I think this is an interesting project enough (the JSONManager) that I would prefer starting a new package for it. Could you create a repo in deathbeds and I can start coding in there with more freedom :-)

@bollwyvl
Copy link
Contributor

"low hanging fruit fix"

... unless if collecting the low-hanging fruit makes the tree unhealthy.

Once use_fast is in a bunch of places, it gets a lot harder to change, whereas if it's a config-level thing, with a well-defined interface, then no existing code has to change to take advantage of the performance gains.

For example, I frankly wouldn't see running both a jsonschema and fastjsonschema validation during the same import... and indeed, perhaps the copy-pasted validator caching is something this manager could provide, by default, as it's almost always what you'd want, given the warm-up cost of both implementations considered to this point.

Speaking of implementations...

Here we have just this one (in python), the other option is slow too.

Well, there's at least these:

I have no benchmarks, but I'd wager, for certain workloads, these would be faster.

Could you create a repo in deathbeds and I can start coding in there with more freedom :-)

Not to down-sell our mad science experiments over there, but nobody looks at deathbeds, and a lot of the stuff in there is intentionally not "core ready". "How Jupyter Does JSON Standards" is a pretty big deal, and should probably be incubated more publicly in an official project... to me, that seems like jupyter_core, but it is intentionally lightweight (woulda been reeeeeally nice to solve the win/py38 ProactorEventLoop there, but didn't want to take the tornado opinion, which is far lighter-weight than anything discussed here).

So, like I said, right here in nbformat is probably already the right place, and could start with something far simpler, e.g. a json_compat.py which offers the expected API, and can be overridden at the import level, without digging into traitlets config.

@goanpeca
Copy link
Contributor Author

Sounds good, will take a look ;-)

@goanpeca goanpeca marked this pull request as draft September 24, 2020 14:09
nbformat/json_compat.py Outdated Show resolved Hide resolved
nbformat/json_compat.py Outdated Show resolved Hide resolved
nbformat/validator.py Outdated Show resolved Hide resolved
nbformat/validator.py Outdated Show resolved Hide resolved
Copy link
Member

@echarles echarles left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Forgot to submit my morning review, which is I believe deprecated based on recent discussions.

@@ -294,7 +327,8 @@ def iter_validate(nbdict=None, ref=None, version=None, version_minor=None,
if version is None:
version, version_minor = get_version(nbdict)

validator = get_validator(version, version_minor, relax_add_props=relax_add_props)
validator = get_validator(version, version_minor, relax_add_props=relax_add_props,
use_fast=False)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use_fast is always set to False. Anyway that a consumer of the nbformat library would use it with use_fast set to True? I mean, it sounds to me reading the changes that it will not be possible for a consumer to benefit from this enhancement as it would always be disabled.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A consumer can can call validate(nb, use_fast=True), but what about the double validation question we have identified in jupyter_server? The validation is always triggered on load in nbformat?

Copy link
Contributor Author

@goanpeca goanpeca Sep 24, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @echarles, thanks for the feedback. So:

use_fast is always set to False. Anyway that a consumer of the nbformat library would use it with use_fast set to True? I mean, it sounds to me reading the changes that it will not be possible for a consumer to benefit from this enhancement as it would always be disabled.

Yes, this is because there is some nested logic that required the "switch"

validate -> iter_validate -> better_validation_errors -> validate ....

I am looking into decoupling it so the use_fast is no longer needed.

A consumer can can call validate(nb, use_fast=True), but what about the double validation question we have identified in jupyter_server? The validation is always triggered on load in nbformat?

The read method which also validates (but does not raise errors) uses now use_fast=True. Since we use validate on jupyter_server, we could add the use_fas=True there, so we would still have double validation, but now both are fast.

We could also evaluate if there should be an extra parameter on the read method, to actually skip validation, I think it should, but be True by default.


Still, I would prefer no to have the use_fast parameter, so I need to think of a better way to decouple the
validate -> iter_validate -> better_validation_errors -> validate .... chain

@goanpeca goanpeca force-pushed the enh/fastjsonschema branch 3 times, most recently from a92153e to 2b339e7 Compare September 24, 2020 14:57
@bollwyvl
Copy link
Contributor

The force push makes it very hard to follow the work... just lost a comment to the effect of:

I don't think we'd want to mix knowledge of both implementations in a single class. I also don't know how to make PFJS quack exactly like iter_errors...

So there might be a `Validator`, which implements the canonical `jsonschema`, and `FastValidator` which subclasses that. Finally, there's the missing piece of `get_validator` or something which would actually pick one... i guess it could be hidden behind an env var for now, e.g. `NBFORMAT_VALIDATOR=fast` but, again, would need to become end-user configurable to be exactly _one_ value.

@goanpeca goanpeca marked this pull request as ready for review September 24, 2020 14:58
@goanpeca
Copy link
Contributor Author

The force push makes it very hard to follow the work... just lost a comment to the effect of:

Sorry!

@goanpeca
Copy link
Contributor Author

goanpeca commented Sep 24, 2020

I don't think we'd want to mix knowledge of both implementations in a single class.

Yes, it would be cleaner to have different classes, but as you say:

I also don't know how to make PFJS quack exactly like iter_errors...

Yeah this is not possible since this is in not a 1-1 replacement, so it would not make sense to split more at this stage

So there might be a Validator, which implements the canonical jsonschema, and FastValidator which subclasses that.

See para above.

Finally, there's the missing piece of get_validator or something which would actually pick one... i guess it could be hidden behind an env var for now, e.g. NBFORMAT_VALIDATOR=fast but, again, would need to become end-user configurable to be exactly one value.

Yes, in the case of having more validators bundled, yes, but I would say that is out of the scope of this one.

If more validators are bundled, yes, at the current state it would not make much sense since only fast validation is using the fast validator (on read), the iter_errors is only provided by jsonschema, so all those splitting of classes do not make sense in this iteration. For people using the validator func outside (like jupyter_server) we have the option of using fast validation, or "slow" as a parameter.

Also, the NBFORMAT_VALIDATOR does not make sense in this iteration.

Thanks for the review @bollwyvl !

@goanpeca
Copy link
Contributor Author

Will think a bit harder to decouple the better_validation_errors logic, which is the part holding the splitting of classes.

@goanpeca goanpeca changed the title PR: Use fastjsonschema if installed and tests PR: Use fastjsonschema if installed and add tests Sep 24, 2020
@bollwyvl
Copy link
Contributor

iter_errors is only provided by jsonschema

It could just yield the first error.

If someone is use iter_errors and treating "some" errors as warnings, or counting errors, then they are setting up their stuff for failure, anyway.

Also, the NBFORMAT_VALIDATOR does not make sense in this iteration.

Ok, then json_compat.use_validator("fastjsonschema") or something. In the future, this can become whatever else the implementation is.

I feel like adding use_fast to the API is the part that isn't scalable, given the layer of the stack we are in. If, at import/reconfigure time, some other validator we know about a) is available, and b) we want to use it, it should use it, and only it... every downstream should not have to discover this new API flag, and then think about how it doesn't work for their use case.

@goanpeca
Copy link
Contributor Author

I feel like adding use_fast to the API is the part that isn't scalable, given the layer of the stack we are in.

Agreed, I just need to figure out splitting the logic, so working on that :-)

@goanpeca
Copy link
Contributor Author

@MSeal would it be ok to add pytest so I can add fixtures with parameters for the different libraries available?

@goanpeca goanpeca force-pushed the enh/fastjsonschema branch 2 times, most recently from 60bb71f to 7b4274c Compare September 24, 2020 20:16
@goanpeca
Copy link
Contributor Author

This one is ready for review @MSeal, thanks!

@goanpeca goanpeca requested a review from MSeal September 24, 2020 20:58
Copy link
Contributor

@MSeal MSeal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks fine to me, will give @bollwyvl some time to respond in case he has further thoughts on his side.

@bollwyvl
Copy link
Contributor

Sorry for the delay on review! Yeah, this is looking great. 🎉

This gets us to where a downstream can cleanly experiment already... that rust one sounds so promising!

The only things missing from this PR that I would think could be fixed in this PR or on subsequent PRs:

  • changelog entry
  • a narrative sentence and example showing how to use NBFORMAT_VALIDATOR
    • either in the comment of the appropriate method, or in the rst
    • probably also a link to fastjsonschema's docs
  • an extras in setup.py to document the fastjsonschema dependency
    • if it makes sense, add a sensible lower and/or upper bound, e.g. >=2,<3

Definitely out of scope, not blocking, but could be investigated:

  • how might we formalize the extensibility e.g. entry_point, traitlets?
  • how might we capture and document the performance trade-offs of these approaches (under different loads) e.g. asv?
  • how could this approach be expanded to include serialization with e.g. orjson, pysmdjson (also rust-based)
  • how could these could be made async (I owe @goanpeca an issue for this one 😊)?

@goanpeca
Copy link
Contributor Author

Thanks @bollwyvl!

The only things missing from this PR that I would think could be fixed in this PR or on subsequent PRs:

All great suggestions, I think they should be added on this PR, just wanted to do it after agreeing on the workflow.

Definitely out of scope, not blocking, but could be investigated:

Absolutely! And yes you owe me some issues :-p


Will get this done later today!

@goanpeca
Copy link
Contributor Author

goanpeca commented Oct 4, 2020

  • Added some info on the docs (API section)
    image

  • Update changelog
    image

  • Updated the README.
    image

Copy link
Contributor

@MSeal MSeal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One minor comment, at this point I'm happy to merge though when that gets touched up.

@@ -90,6 +90,7 @@
]

extras_require = setuptools_args['extras_require'] = {
'fast': ['fastjsonschema'],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should also add this to test so tests can use it without issue,

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed!

@goanpeca
Copy link
Contributor Author

goanpeca commented Oct 6, 2020

One minor comment, at this point I'm happy to merge though when that gets touched up.

Made the suggested fixes @MSeal ! Thanks :-)

Copy link
Contributor

@MSeal MSeal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for following up on the PR reviews quickly!

@MSeal MSeal merged commit 23272c1 into jupyter:master Oct 6, 2020
@goanpeca
Copy link
Contributor Author

goanpeca commented Oct 6, 2020

Wohooo! Thanks @bollwyvl, @echarles and @MSeal :-) !

Now the follow up question. Could a release be planed soon :-) ?

Thanks!

@goanpeca goanpeca deleted the enh/fastjsonschema branch October 6, 2020 20:28
@MSeal
Copy link
Contributor

MSeal commented Oct 6, 2020

We also have #189 open and ready for review -- I'd like to get it released too if we can. If that takes longer to merge we can ship a release. Very soon I won't be able to take much action (newborn coming) on these projects so someone else will likely need to do the release itself.

@goanpeca
Copy link
Contributor Author

goanpeca commented Oct 6, 2020

Very soon I won't be able to take much action (newborn coming) on these projects so

Congratulations!

someone else will likely need to do the release itself.

I would be happy to help anyway I can :-), I can set the release process to be tied to tags and run the pypinpm publication on CI, that would simplify the process for anyone mainting this.

Cheers!

@echarles
Copy link
Member

echarles commented Oct 7, 2020

Congrats @goanpeca for this great work and improvements. Really loved how things have moved forward. Thanks also to @MSeal and @bollwyvl for the reviews.

I can set the release process to be tied to tags and run the pypinpm publication on CI, that would simplify the process for anyone mainting this.

That would be another great contribution @goanpeca

@MSeal
Copy link
Contributor

MSeal commented Oct 10, 2020

I went ahead and released 5.0.8 with this change since the other PR is likely going to have to sit until JupyterCon is done to get proper eyes on it

@mlucool
Copy link

mlucool commented Oct 14, 2020

Thanks for releasing @MSeal. It looks like pypi was missed when you published. Could you please update that too?

@MSeal
Copy link
Contributor

MSeal commented Oct 15, 2020

Huh that's strange, I guess when I did the push originally there was an error message I missed. Looks fixed now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Long validations times, explore using fastjsonschema
5 participants