Specify that malformed Unicode code points result in an error #420

hudlow · 2024-12-09T21:40:07Z

Currently, cel-go suppresses errors by silently coercing bad code points to \ufffd. As a result, this expression evaluates to true:

'\udead' == '\ufffd'

In contrast, cel-java throws an explicit exception:

dev.cel.common.CelValidationException: 
ERROR: <input>:1:1: Invalid unicode code point
 | '\udead' == '\ufffd'
 | ^

I believe the Java behavior here is far more desirable, at least as a default, and should be the specified behavior.

hudlow · 2024-12-09T22:32:31Z

Perhaps a worse facet of this in cel-go is that because Golang itself allows for strings to contain malformed UTF-8, string variables break the type safety of CEL's strings.

As a consequence, this returns true:

// Declare a CEL string variable; Go does not check that its value is valid UTF-8.
env, _ := cel.NewEnv(cel.Variable("name", cel.StringType))
ast, _ := env.Compile("bytes(name) == b'\\x80'")
prg, _ := env.Program(ast)
// "\x80" is a lone continuation byte, which is not valid UTF-8.
out, _, err := prg.Eval(map[string]any{
  "name": "\x80",
})
if err != nil {
  log.Fatalln(err)
}

val, _ := out.ConvertToNative(reflect.TypeOf(true))
fmt.Println(val.(bool))

And this results in a runtime error¹:

env, _ := cel.NewEnv(cel.Variable("name", cel.StringType))
// The bytes-to-string conversion is where the invalid UTF-8 is detected.
ast, _ := env.Compile("string(bytes(name))")
prg, _ := env.Program(ast)
out, _, err := prg.Eval(map[string]any{
  "name": "\x80",
})
if err != nil {
  // The runtime error is reported here.
  log.Fatalln(err)
}

val, _ := out.ConvertToNative(reflect.TypeOf(""))
fmt.Println(val.(string))

¹ invalid UTF-8 in bytes, cannot convert to string

TristonianJones · 2024-12-09T23:17:57Z

Hi @hudlow,

There's an expectation that strings are sequences of valid Unicode code points (I could spell out UTF-8 specifically, perhaps). From the langdef.md file:

While strings must be sequences of valid Unicode code points, no Unicode
normalization is attempted on strings, as there are several normal forms, they
can be expensive to convert, and we don't know which is desired. If Unicode
normalization is desired, it should be performed outside of CEL, or done as a
custom extension function.

Regarding cases where invalid UTF-8 strings are provided to the CEL runtime, CEL's position would be that this is undefined behavior. The cel-go behavior of evaluating '\udead' == '\ufffd' is definitely an error and not by design.

I've filed google/cel-go#1093 to track the Golang issue. Is there a specific update to the language regarding UTF-8 string expectations?

hudlow · 2024-12-10T04:57:34Z

@TristonianJones:

The cel-go behavior of evaluating '\udead' == '\ufffd' is definitely an error and not by design. I've filed google/cel-go#1093 to track the Golang issue. Is there a specific update to the language regarding UTF-8 string expectations?

Actually, the language is probably pretty good since string literals accept escaped code points, and not escaped UTF-8 bytes. I think I would like the spec to be more explicit that a CEL parser ought to reject a string literal with invalid code points (versus it somehow only getting rejected during validation or evaluation). If you're amenable to this, I can try to draft a clarification.
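To make the parse-time rejection suggested above concrete, here is a minimal sketch of how an escape decoder could refuse invalid code points. It uses only the Go standard library; `decodeEscape` and its error messages are hypothetical and are not taken from cel-go or cel-java.

```go
package main

import (
	"fmt"
	"strconv"
	"unicode/utf8"
)

// decodeEscape is a hypothetical helper: it takes the hex digits of a \u or \U
// escape and returns the code point, or an error if the digits do not name a
// valid Unicode scalar value (for example a surrogate, or a value above U+10FFFF).
func decodeEscape(hexDigits string) (rune, error) {
	v, err := strconv.ParseUint(hexDigits, 16, 32)
	if err != nil {
		return 0, fmt.Errorf("malformed escape %q: %w", hexDigits, err)
	}
	// Check the range before converting so large values cannot wrap around.
	if v > utf8.MaxRune || !utf8.ValidRune(rune(v)) {
		return 0, fmt.Errorf("invalid Unicode code point U+%04X", v)
	}
	return rune(v), nil
}

func main() {
	for _, digits := range []string{"dead", "fffd"} {
		if _, err := decodeEscape(digits); err != nil {
			fmt.Println("rejected:", err)
			continue
		}
		fmt.Println("accepted:", digits)
	}
}
```

Returning an error from the decoder means the literal is rejected while parsing, rather than surfacing later during validation or evaluation.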

Thank you for the quick fix!

Regarding cases where invalid UTF-8 strings are provided to the CEL runtime, CEL's position would be that this is undefined behavior.

Okay, I think it makes sense to me that this is out of scope for the language definition because by providing invalid UTF-8 strings as string data, you're essentially providing corrupt data. Still, in terms of the cel-go implementation, isn't it really uncomfortable that it is so easy to overlook a type safety issue?

I guess what I'm getting at is what the best practice for an implementation is with respect to ensuring that you don't inadvertently pass corrupt data from the implementing environment to the CEL environment.

Thinking through it more, though: since Golang doesn't have a native type for "only well-formed Unicode strings," I'm struggling to think of a way to catch the issue at compile time (for either the Go code or the CEL expression). Unicode validation at runtime could be an enormous performance tax, especially since runtime values will often come from a source that itself guarantees correct Unicode and are only transiently handled as Go strings, which don't.

Still, the lack of guardrails makes me twitch a little. Is there nothing more that can be done?
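One possible guardrail, sketched here under the assumption that the caller is willing to pay for validation at the boundary, is to check string inputs before handing them to the CEL runtime. The function `checkStringVars` is hypothetical and is not part of cel-go; it relies only on `utf8.ValidString` from the standard library.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// checkStringVars is a hypothetical guard: it returns an error naming a
// top-level string value in vars that is not well-formed UTF-8. Nested values
// (lists, maps, structs) are not inspected in this sketch.
func checkStringVars(vars map[string]any) error {
	for name, v := range vars {
		if s, ok := v.(string); ok && !utf8.ValidString(s) {
			return fmt.Errorf("variable %q is not valid UTF-8", name)
		}
	}
	return nil
}

func main() {
	fmt.Println(checkStringVars(map[string]any{"name": "\x80"}))
	fmt.Println(checkStringVars(map[string]any{"name": "ok"}))
}
```

The check is linear in the total length of the strings, which is the runtime cost raised above, so it may only make sense for inputs that come from untrusted sources.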

TristonianJones mentioned this issue Dec 9, 2024: String literals permit invalid UTF-8 google/cel-go#1093 (Closed)

hudlow mentioned this issue Dec 14, 2024: clarify Unicode handling #423 (Merged)

TristonianJones closed this as completed in #423 Jan 17, 2025
