badarg exception when reading multi-line Unicode input #7591
Comments
Hello Paweł! This works as intended (if you ignore the crash...). What has changed between 25 and 26 is that escripts now run, by default, using the same encoding as the environment instead of always running in bytewise encoding (aka latin1). This means that in order to get the same behaviour in 25 and 26 you need to set the encoding you want to work with (without changing it in the receiving shell):
or you can do it as a kernel application parameter that was added in OTP-26.1:
#!/usr/bin/env escript
%%! -kernel standard_io_encoding latin1
main(_) ->
    {ok, Line} = file:read_line(standard_io),
    io:format("~~w : ~w~n", [Line]),
    io:format("~~ts: ~ts~n", [Line]),
    ok.
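For the first option (setting the encoding from inside the script), a minimal sketch could look like the following. It assumes that io:setopts/2 with {encoding, latin1} on standard_io is the intended mechanism; the snippet that originally accompanied that sentence is not shown in this thread, so treat this as an illustration rather than the maintainer's exact code.

```erlang
#!/usr/bin/env escript
%% Sketch only: assumes io:setopts/2 is the intended way to switch
%% standard_io back to bytewise (latin1) encoding from inside the script.
main(_) ->
    %% Ask the standard_io device to treat input and output as latin1 bytes.
    io:setopts(standard_io, [{encoding, latin1}]),
    %% file:read_line/1 now sees one character per input byte.
    {ok, Line} = file:read_line(standard_io),
    io:format("~~w : ~w~n", [Line]),
    ok.
```

Unlike the kernel parameter, this changes the encoding only after the script has started running.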
The crash happens because the code does not correctly figure out that it should return […]. Your example with setting mode to binary actually works better, though the error code is again wrong. You have the same problem there: you try to read latin1 characters from a unicode stream and there is no mapping. Starting with […] I will fix the crash and the incorrect error return when set in binary mode.
Anyway, I don't think you actually want to use file:read_line/1 in this case, but instead io:get_line/2. The file module is used to read bytes, so when you use it on a unicode stream (and encoding is set to latin1) you will get […]
#!/usr/bin/env escript
main(_) ->
    Line = io:get_line(standard_io, ""),
    io:format("~~w : ~w~n", [Line]),
    io:format("~~ts: ~ts~n", [Line]),
    ok.
This is all very non-obvious and subtle; I attempted to make the documentation clearer regarding this in #7384. Who knew plain text could be this hard...
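The "no mapping" point in the comment above can be illustrated without any I/O, using only the unicode module from stdlib. This is a small sketch, not code from OTP itself; the module name enc_demo and its function names are hypothetical and chosen only for this example.

```erlang
-module(enc_demo).
-export([utf8_bytes/1, bytes_as_latin1/1, bytes_as_utf8/1, to_latin1/1]).

%% Encode a list of code points as UTF-8 bytes.
%% utf8_bytes([281]) gives <<196,153>>, the two bytes of "ę" (U+0119).
utf8_bytes(CodePoints) ->
    unicode:characters_to_binary(CodePoints, unicode, utf8).

%% Interpret the bytes one per character, as a bytewise (latin1) reader does.
%% The two bytes of "ę" come back as two separate characters.
bytes_as_latin1(Bytes) ->
    unicode:characters_to_list(Bytes, latin1).

%% Interpret the same bytes as UTF-8, as a character-oriented reader does.
bytes_as_utf8(Bytes) ->
    unicode:characters_to_list(Bytes, utf8).

%% Try to express code points in latin1. Code points above 255 have no
%% latin1 representation, so this returns an {error, _, _} tuple for "ę".
to_latin1(CodePoints) ->
    unicode:characters_to_binary(CodePoints, unicode, latin1).
```

Calling enc_demo:to_latin1([281]) shows the same kind of failure the file module runs into: the stream delivers a character that the requested latin1 encoding cannot represent.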
Hi Lukas, thanks for the detailed answer.
#!/usr/bin/env escript
main(_) ->
    Line = io:get_line(standard_io, ""),
    io:format("~ts", [Line]),
    ok.
I must have read the previous version of the documentation, because the current version clearly recommends […]. Should I close the task or are you going to close it when the […]?
The docs update was released after you opened this ticket, so that is very likely :)
I'll close it when the badarg is fixed.
When file is used to read from a group, we should return a translation error if the input stream contains unicode characters (as file expects latin1). Closes erlang#7591
#7714 fixes the error handling to be correct.
Describe the bug
When multiline UTF-8 data is provided on the standard input, and file:read_line/1 is used to read it, it fails with badarg.
To Reproduce
The simplest way is to use the escript:
Then, you can easily reproduce the bug:
The bug is not present when there is only one line of Unicode input:
The bug can be caused with an Erlang module as well, but the data needs to be provided immediately on startup.
Expected behavior
The expected output of the example escript is just "ę".
Affected versions
The bug is present in 26.0.2.
The bug is not present in 25.3.2.5.
Additional context
Printable characters and encoding are 'unicode' in the shell - checked.
The bug disappears when LC_ALL is set to 'ISO-8859-1', but then the output is incorrect (no unicode, of course).
OS: macOS Ventura 13.4.1
There is another issue that might be related, and I am not sure if it should be reported as a separate bug. It is present in both tested versions. The issue is that the binary unicode input does not work at all:
It looks like the broken code is somewhere in group:cast, where the encoding seems to be latin1 even though it is set to unicode.