Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Specify that malformed Unicode code points result in an error #420

Closed
hudlow opened this issue Dec 9, 2024 · 3 comments · Fixed by #423
Closed

Specify that malformed Unicode code points result in an error #420

hudlow opened this issue Dec 9, 2024 · 3 comments · Fixed by #423

Comments

@hudlow
Copy link
Contributor

hudlow commented Dec 9, 2024

Currently, cel-go suppresses errors by silently coercing bad code points to \ufffd. As a result, this expression evaluates to true:

'\udead' == '\ufffd'

In contrast, cel-java throws an explicit exception:

dev.cel.common.CelValidationException: 
ERROR: <input>:1:1: Invalid unicode code point
 | '\udead' == '\ufffd'
 | ^

I believe the Java behavior here is far more desirable, at least as a default, and should be the specified behavior.

@hudlow
Copy link
Contributor Author

hudlow commented Dec 9, 2024

Perhaps a worse facet of this in cel-go is that because Golang itself allows for strings to contain malformed UTF-8, string variables break the type safety of CEL's strings.

As a consequence, this returns true:

env, _ := cel.NewEnv(cel.Variable("name", cel.StringType))
ast, _ := env.Compile("bytes(name) == b'\\x80'")
prg, _ := env.Program(ast)
out, _, err := prg.Eval(map[string]any{
  "name": "\x80",
})
if err != nil {
  log.Fatalln(err)
}
  
str, _ := out.ConvertToNative(reflect.TypeOf(true));
fmt.Println(str.(bool))

And this results in a runtime error1:

env, _ := cel.NewEnv(cel.Variable("name", cel.StringType))
ast, _ := env.Compile("string(bytes(name))")
prg, _ := env.Program(ast)
out, _, err := prg.Eval(map[string]any{
  "name": "\x80",
})
if err != nil {
  log.Fatalln(err)
}

str, _ := out.ConvertToNative(reflect.TypeOf(""));
fmt.Println(str.(string))

Footnotes

  1. invalid UTF-8 in bytes, cannot convert to string

@TristonianJones
Copy link
Collaborator

Hi @hudlow,

There's an expectation that strings are valid Unicode code points (I could spell out UTF-8 specifically, perhaps). From the langdef.md file:

While strings must be sequences of valid Unicode code points, no Unicode
normalization is attempted on strings, as there are several normal forms, they
can be expensive to convert, and we don't know which is desired. If Unicode
normalization is desired, it should be performed outside of CEL, or done as a
custom extension function.

Regarding, where invalid utf-8 strings are provided to the CEL runtime, CEL's position would be that this is undefined behavior. The cel-go behavior of evaluating '\udead' == '\ufffd' is definitely an error and not by design.

I've filed google/cel-go#1093 to track the Golang issue. Is there a specific update to the language regarding UTF-8 string expectations?

@hudlow
Copy link
Contributor Author

hudlow commented Dec 10, 2024

@TristonianJones:

The cel-go behavior of evaluating '\udead' == '\ufffd' is definitely an error and not by design. I've filed google/cel-go#1093 to track the Golang issue. Is there a specific update to the language regarding UTF-8 string expectations?

Actually, the language is probably pretty good since string literals accept escaped code points, and not escaped UTF-8 bytes. I think I would like the spec to be more explicit that a CEL parser ought to reject a string literal with invalid code points (versus it somehow only getting rejected during validation or evaluation). If you're amenable to this, I can try to draft a clarification.

Thank you for the quick fix!

Regarding, where invalid utf-8 strings are provided to the CEL runtime, CEL's position would be that this is undefined behavior.

Okay, I think it makes sense to me that this is out of scope for the language definition because by providing invalid UTF-8 strings as string data, you're essentially providing corrupt data. Still, in terms of the cel-go implementation, isn't it really uncomfortable that it is so easy to overlook a type safety issue?

I guess what I'm getting at is what the best practice for an implementation is with respect to ensuring that you don't inadvertently pass corrupt data from the implementing environment to the CEL environment.

Thinking through it more though, since Golang doesn't have a native type for "only well-formed Unicode strings," I'm struggling to think of a way that you could catch the issue at compile time (for either the Go code or the CEL expression), and Unicode validation could be an enormous performance tax at runtime, especially considering that it is not particularly unlikely that your runtime values will be coming from a source which itself guarantees correct Unicode and only transiently be handled as Go strings that don't.

Still, the lack of guardrails makes me twitch a little. Is there nothing more that can be done?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants