badarg exception when reading multi-line Unicode input #7591

Closed
chrzaszcz opened this issue Aug 25, 2023 · 4 comments · Fixed by #7714

Labels
bug Issue is reported as a bug team:VM Assigned to OTP team VM

Comments

@chrzaszcz

Describe the bug
When multiline UTF-8 data is provided on the standard input, and file:read_line/1 is used to read it, it fails with badarg.

To Reproduce
The simplest way is to use the escript:

#!/usr/bin/env escript

main(_) ->
    {ok, Line} = file:read_line(standard_io),
    io:format("~ts", [Line]),
    ok.

Then, you can easily reproduce the bug:

$ echo -e "ę\ną" | ./test.escript
=SUPERVISOR REPORT==== 25-Aug-2023::15:36:46.996583 ===
    supervisor: {<0.64.0>,user_sup}
    errorContext: child_terminated
    reason: {badarg,[{erlang,'++',
                             [{error,[],[261,10]},[]],
                             [{error_info,#{module => erl_erts_errors}}]},
                     {group,append,3,[{file,"group.erl"},{line,1076}]},
                     {group,get_chars_loop,10,[{file,"group.erl"},{line,516}]},
                     {group,io_request,6,[{file,"group.erl"},{line,200}]},
                     {group,server_loop,3,[{file,"group.erl"},{line,126}]}]}
    offender: [{pid,<0.69.0>},{mod,user_sup}]

escript: exception error: no match of right hand side value {error,terminated}
  in function  erl_eval:expr/6 (erl_eval.erl, line 498)
  in call from escript:eval_exprs/5 (escript.erl, line 869)
  in call from erl_eval:local_func/8 (erl_eval.erl, line 646)
  in call from escript:interpret/4 (escript.erl, line 780)
  in call from escript:start/1 (escript.erl, line 277)
  in call from init:start_em/1
  in call from init:do_boot/3

The bug is not present when there is only one line of Unicode input:

$ echo -e "ę\n" | ./test.escript
ę

The bug can be triggered from an Erlang module as well, but the data needs to be provided immediately on startup.

Expected behavior
The expected output of the example escript is just "ę".

Affected versions
The bug is present in 26.0.2
The bug is not present in 25.3.2.5

Additional context
Printable characters and encoding are 'unicode' in the shell - checked.
The bug disappears when LC_ALL is set to 'ISO-8859-1', but then the output is incorrect (no unicode, of course).
OS: macOS Ventura 13.4.1
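
For reference, the encoding check mentioned above can be done from code with io:getopts/1. The following is a minimal sketch, not part of the original report; the module name io_enc_demo and the fallback atom unknown are assumptions made for illustration.

```erlang
-module(io_enc_demo).
-export([current/0, encoding_of/1]).

%% Encoding currently configured for standard_io, e.g. unicode or latin1.
current() ->
    encoding_of(io:getopts(standard_io)).

%% Pick the encoding out of an io:getopts/1 result.
%% 'unknown' is a hypothetical fallback, not something OTP returns.
encoding_of(Opts) when is_list(Opts) ->
    proplists:get_value(encoding, Opts, unknown);
encoding_of({error, _Reason}) ->
    unknown.
```

Calling io_enc_demo:current() from the shell, or from the main/1 of an escript, shows which encoding the I/O server is using before any input is read.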

There is another issue that might be related, and I am not sure if it should be reported as a separate bug. It is present in both tested versions. The issue is that the binary unicode input does not work at all:

$ erl
Erlang/OTP 26 [erts-14.0.2] [source] [64-bit] [smp:10:10] [ds:10:10:10] [async-threads:1] [jit] [dtrace]

Eshell V14.0.2 (press Ctrl+G to abort, type help(). for help)
1> io:setopts([binary]).
ok
2> file:read_line(standard_io).

ą
3> v(2).
{error,collect_line}

It looks like the broken code is somewhere in group:cast, where the encoding seems to be latin1 even though it is set to unicode.

@chrzaszcz chrzaszcz added the bug Issue is reported as a bug label Aug 25, 2023
@IngelaAndin IngelaAndin added the team:VM Assigned to OTP team VM label Aug 28, 2023
@garazdawi garazdawi assigned garazdawi and unassigned frazze-jobb Oct 2, 2023
@garazdawi
Contributor

Hello Paweł!

This works as intended (if you ignore the crash...). What has changed between 25 and 26 is that escripts now by default run using the same encoding as the environment instead of always running in bytewise encoding (aka latin1).

This means that in order to get the same behaviour in 25 and 26 you need to set the encoding you want to work with (without changing it in the receiving shell):

> echo -e "ę\ną" | LC_CTYPE=ISO-8859-1 ./test.escript
ę

or you can do it as a kernel application parameter that was added in OTP-26.1:

#!/usr/bin/env escript
%%! -kernel standard_io_encoding latin1

main(_) ->
    {ok, Line} = file:read_line(standard_io),
    io:format("~~w : ~w~n", [Line]),
    io:format("~~ts: ~ts~n", [Line]),
    ok.

> echo -e "ę\ną" | ./test.escript
~w : [196,153,10]
~ts: ę

The crash happens because the code does not correctly figure out that it should return {error, {no_translation, unicode, latin1}} when you call file:read_line/1.
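
Until the error return is fixed, a caller can avoid the secondary "no match of right hand side" failure by not assuming {ok, Line}. The following is a minimal sketch under that assumption; the module name read_line_demo and the helper classify/1 are hypothetical, and the exact error term depends on the OTP version ({error,terminated} in the report above, {error, {no_translation, unicode, latin1}} once corrected).

```erlang
-module(read_line_demo).
-export([run/0, classify/1]).

%% Read one line from standard_io and print it, handling every
%% documented result of file:read_line/1 instead of matching on {ok, _}.
run() ->
    case classify(file:read_line(standard_io)) of
        {line, Line} ->
            io:format("~ts", [Line]);
        done ->
            ok;
        {failed, Reason} ->
            io:format(standard_error, "read_line failed: ~p~n", [Reason])
    end.

%% Map the result of file:read_line/1 to a tagged value (hypothetical helper).
classify({ok, Line}) -> {line, Line};
classify(eof) -> done;
classify({error, Reason}) -> {failed, Reason}.
```

This only changes how the failure is reported; it does not make file:read_line/1 return Unicode code points.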

Your example with setting mode to binary actually works better, though the error code is again wrong. You have the same problem there: you try to read latin1 characters from a unicode stream and there is no mapping. Starting with erl -kernel standard_io_encoding latin1 corrects the issue.

I will fix the crash and the incorrect error return when set in binary mode.

Anyway, I don't think you actually want to use file:read_line/1 in this case, but io:get_line/2 instead. The file module is used to read bytes, so when you use it on a unicode stream (and encoding is set to latin1) you will get [196,153] (which is "Ä™") and not [281] as you expect, whereas io:get_line/2 gives you the correct result. The reason why using file kind of always works is that what you read is valid latin1, although it is not the same thing as what you input. You will get into all kinds of problems if you try to work with that string.

#!/usr/bin/env escript

main(_) ->
    Line = io:get_line(standard_io, ""),
    io:format("~~w : ~w~n", [Line]),
    io:format("~~ts: ~ts~n", [Line]),
    ok.

> echo -e "ę\ną" | ./test.escript
~w : [281,10]
~ts: ę
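
The relationship between the two outputs ([196,153,10] from file and [281,10] from io) is just UTF-8 encoding and decoding. The following is a minimal sketch that shows the conversion with the unicode module; the module name utf8_demo and its function names are assumptions made for illustration.

```erlang
-module(utf8_demo).
-export([bytes/1, code_points/1]).

%% UTF-8 bytes for a list of code points: what a byte-oriented read
%% (the file module with latin1 encoding) sees on the stream.
bytes(CodePoints) ->
    binary_to_list(unicode:characters_to_binary(CodePoints, unicode, utf8)).

%% Decode a list of UTF-8 bytes back into code points: what
%% io:get_line/2 returns on a unicode stream.
code_points(Bytes) ->
    unicode:characters_to_list(list_to_binary(Bytes), utf8).
```

Here utf8_demo:bytes([281,10]) gives [196,153,10], and utf8_demo:code_points([196,153,10]) gives [281,10], so a line read as bytes can be converted explicitly when the input is known to be UTF-8.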

This is all very non-obvious and subtle; I attempted to make the documentation clearer regarding this in #7384. Who knew plain text could be this hard...

@chrzaszcz
Author

Hi Lukas, thanks for the detailed answer.
I checked it and this indeed works as expected:

#!/usr/bin/env escript

main(_) ->
    Line = io:get_line(standard_io, ""),
    io:format("~ts", [Line]),
    ok.

I must have read the previous version of the documentation, because the current version clearly recommends io:get_line/2 for my case.

Should I close the task or are you going to close it when the badarg itself gets addressed?

@garazdawi
Contributor

> I must have read the previous version of the documentation, because the current version clearly recommends io:get_line/2 for my case.

The docs update was released after you opened this ticket, so that is very likely :)

> Should I close the task or are you going to close it when the badarg itself gets addressed?

I'll close it when the badarg is fixed.

garazdawi added a commit to garazdawi/otp that referenced this issue Oct 4, 2023
When file is used to read from a group, we should return a translation
error if the input stream contains unicode characters (as file expects
latin1).

Closes erlang#7591
@garazdawi
Contributor

#7714 fixes the error handling to be correct

garazdawi added a commit to garazdawi/otp that referenced this issue Oct 10, 2023
4 participants