badarg exception when reading multi-line Unicode input #7591

chrzaszcz · 2023-08-25T14:22:32Z

Describe the bug
When multiline UTF-8 data is provided on the standard input, and file:read_line/1 is used to read it, it fails with badarg.

To Reproduce
The simplest way is to use the escript:

#!/usr/bin/env escript

main(_) ->
    {ok, Line} = file:read_line(standard_io),
    io:format("~ts", [Line]),
    ok.

Then, you can easily reproduce the bug:

$ echo -e "ę\ną" | ./test.escript
=SUPERVISOR REPORT==== 25-Aug-2023::15:36:46.996583 ===
    supervisor: {<0.64.0>,user_sup}
    errorContext: child_terminated
    reason: {badarg,[{erlang,'++',
                             [{error,[],[261,10]},[]],
                             [{error_info,#{module => erl_erts_errors}}]},
                     {group,append,3,[{file,"group.erl"},{line,1076}]},
                     {group,get_chars_loop,10,[{file,"group.erl"},{line,516}]},
                     {group,io_request,6,[{file,"group.erl"},{line,200}]},
                     {group,server_loop,3,[{file,"group.erl"},{line,126}]}]}
    offender: [{pid,<0.69.0>},{mod,user_sup}]

escript: exception error: no match of right hand side value {error,terminated}
  in function  erl_eval:expr/6 (erl_eval.erl, line 498)
  in call from escript:eval_exprs/5 (escript.erl, line 869)
  in call from erl_eval:local_func/8 (erl_eval.erl, line 646)
  in call from escript:interpret/4 (escript.erl, line 780)
  in call from escript:start/1 (escript.erl, line 277)
  in call from init:start_em/1
  in call from init:do_boot/3

The bug is not present when there is only one line of Unicode input:

$ echo -e "ę\n" | ./test.escript
ę

The bug can be caused with an Erlang module as well, but the data needs to be provided immediately on startup.

Expected behavior
The expected output of the example escript is just "ę".

Affected versions
The bug is present in 26.0.2
The bug is not present in 25.3.2.5

Additional context
Printable characters and encoding are 'unicode' in the shell - checked.
The bug disappears when LC_ALL is set to 'ISO-8859-1', but then the output is incorrect (no unicode, of course).
OS: macOS Ventura 13.4.1

There is another issue that might be related, and I am not sure if it should be reported as a separate bug. It is present in both tested versions. The issue is that the binary unicode input does not work at all:

$ erl
Erlang/OTP 26 [erts-14.0.2] [source] [64-bit] [smp:10:10] [ds:10:10:10] [async-threads:1] [jit] [dtrace]

Eshell V14.0.2 (press Ctrl+G to abort, type help(). for help)
1> io:setopts([binary]).
ok
2> file:read_line(standard_io).

ą
3> v(2).
{error,collect_line}

It looks like the broken code is somewhere in group:cast, where the encoding seems to be latin1 even though it is set to unicode.

garazdawi · 2023-10-04T08:33:55Z

Hello Paweł!

This works as intended (if you ignore the crash...). What changed between 25 and 26 is that escripts now run by default using the same encoding as the environment, instead of always running in bytewise encoding (aka latin1).

This means that in order to get the same behaviour in 25 and 26 you need to set the encoding you want to work with (without changing it in the receiving shell):

> echo -e "ę\ną" | LC_CTYPE=ISO-8859-1 ./test.escript
ę

or you can set it with a kernel application parameter that was added in OTP-26.1:

#!/usr/bin/env escript
%%! -kernel standard_io_encoding latin1

main(_) ->
    {ok, Line} = file:read_line(standard_io),
    io:format("~~w : ~w~n", [Line]),
    io:format("~~ts: ~ts~n", [Line]),
    ok.

> echo -e "ę\ną" | ./test.escript
~w : [196,153,10]
~ts: ę
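
To check which encoding an escript actually sees on standard_io, something like the following sketch could be used. It relies on io:getopts/1; the helper name stdio_encoding/0 is made up for this illustration, and the value printed depends on the locale and on the OTP release.

```erlang
#!/usr/bin/env escript

main(_) ->
    io:format("standard_io encoding: ~p~n", [stdio_encoding()]),
    ok.

%% Hypothetical helper: returns the encoding option of standard_io,
%% or 'unknown' if the options cannot be read.
stdio_encoding() ->
    case io:getopts(standard_io) of
        Opts when is_list(Opts) ->
            proplists:get_value(encoding, Opts, unknown);
        {error, _Reason} ->
            unknown
    end.
```

Running it once with and once without LC_CTYPE=ISO-8859-1 should show whether the environment is what selects the encoding.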

The crash happens because the code does not correctly figure out that it should return {error, {no_translation, unicode, latin1}} when you call file:read_line/1.
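
Until the error return is fixed, a script can at least avoid the badmatch by handling every documented result of file:read_line/1 ({ok, Data}, eof and {error, Reason}). This is a minimal sketch; handle/1 is a name invented here, and the exact Reason term depends on the OTP release.

```erlang
#!/usr/bin/env escript

main(_) ->
    case handle(file:read_line(standard_io)) of
        ok -> ok;
        error -> halt(1)
    end.

%% Hypothetical helper: print the line on success, report failures
%% on standard_error instead of crashing with a badmatch.
handle({ok, Line}) ->
    io:format("~ts", [Line]),
    ok;
handle(eof) ->
    ok;
handle({error, Reason}) ->
    %% Reason may be 'terminated', a translation error, or something else.
    io:format(standard_error, "read failed: ~p~n", [Reason]),
    error.
```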

Your example with setting mode to binary actually works better, though the error code is again wrong. You have the same problem there: you try to read latin1 characters from a unicode stream and there is no mapping. Starting with erl -kernel standard_io_encoding latin1 corrects the issue.

I will fix the crash and the incorrect error return when set in binary mode.

Anyway, I don't think you actually want to use file:read_line/1 in this case, but instead io:get_line/2. The file module is used to read bytes, so when you use it on a unicode stream (and encoding is set to latin1) you will get [196,153] (which is "Ä™") and not [281] as you expect. If you use io:get_line/2 instead, you will get the correct result. The reason why using file kind of always works is that what you read is valid latin1, although it is not the same thing as your input. You will get into all kinds of problems if you try to work with that string.

#!/usr/bin/env escript

main(_) ->
    Line = io:get_line(standard_io, ""),
    io:format("~~w : ~w~n", [Line]),
    io:format("~~ts: ~ts~n", [Line]),
    ok.

> echo -e "ę\ną" | ./test.escript
~w : [281,10]
~ts: ę
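
The relation between the two results can be checked by hand: [196,153] is the UTF-8 encoding of code point 281. The sketch below decodes the bytes with unicode:characters_to_list/2 and encodes them back with unicode:characters_to_binary/3. The helper decode_utf8/1 is only an illustration, not something the thread recommends over io:get_line/2.

```erlang
#!/usr/bin/env escript

main(_) ->
    Bytes = [196,153,10],                 % what file:read_line/1 gave with latin1
    Chars = decode_utf8(Bytes),
    io:format("~w~n", [Chars]),           % prints [281,10]
    Encoded = unicode:characters_to_binary(Chars, unicode, utf8),
    io:format("~w~n", [binary_to_list(Encoded)]),  % prints [196,153,10]
    ok.

%% Hypothetical helper: interpret a list of bytes as UTF-8 and
%% return the list of Unicode code points.
decode_utf8(Bytes) ->
    unicode:characters_to_list(list_to_binary(Bytes), utf8).
```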

This is all very non-obvious and subtle. I attempted to make the documentation clearer regarding this in #7384. Who knew plain text could be this hard...

chrzaszcz · 2023-10-04T10:14:28Z

Hi Lukas, thanks for the detailed answer.
I checked it and this indeed works as expected:

#!/usr/bin/env escript

main(_) ->
    Line = io:get_line(standard_io, ""),
    io:format("~ts", [Line]),
    ok.

I must have read the previous version of the documentation, because the current version clearly recommends io:get_line/2 for my case.

Should I close the task or are you going to close it when the badarg itself gets addressed?

garazdawi · 2023-10-04T10:17:30Z

I must have read the previous version of the documentation, because the current version clearly recommends io:get_line/2 for my case.

The docs update was released after you opened this ticket, so that is very likely :)

Should I close the task or are you going to close it when the badarg itself gets addressed?

I'll close it when the badarg is fixed.

When file is used to read from a group, we should return a translation error if the input stream contains unicode characters (as file expects latin1). Closes erlang#7591

garazdawi · 2023-10-04T10:20:08Z

#7714 fixes the error handling to be correct

chrzaszcz added the bug label Aug 25, 2023

IngelaAndin added the team:VM label Aug 28, 2023

bjorng assigned frazze-jobb Aug 28, 2023

garazdawi assigned garazdawi and unassigned frazze-jobb Oct 2, 2023

garazdawi mentioned this issue Oct 4, 2023

kernel: Fix group to return translation errors correctly #7714

Merged

garazdawi closed this as completed in 3c3f39d Oct 11, 2023
