-
Notifications
You must be signed in to change notification settings - Fork 141
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Making ByteString.hGetLine behave like System.IO.hGetLine #327
base: master
Are you sure you want to change the base?
Conversation
This closes issue haskell#13. The changes can be summarized as updating `findEOL` to look for "\r\n" in CRLF mode and updating the logic of `haveBuf` to resize the buffer according to the size of the newline. Additionally, tests were added to verify that both `hGetLine`s produce the same behavior. Some of the edge-cases to worry about here include * '\n' still counts as a line end. Thus line endings' length vary between 1 and 2 in CRLF mode. * "\r\r\n" can give a false-start. This means you can't always skip 2 characters when `c' /= '\n'`. * '\r' when not followed by '\n' isn't part of a newline. * Not reading out of the buffer when '\r' is the last character.
The old test had wrote a special file filled with strange line endings. Now that there is a reliable, and consistent property test available for hGetLine, this code can be removed at little cost.
Additional work to do includes
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks very decent and well-thought! Here are some suggestions about tests.
tests/Properties.hs
Outdated
@@ -1614,6 +1614,36 @@ prop_read_write_file_D x = unsafePerformIO $ do | |||
(const $ do y <- D.readFile f | |||
return (x==y)) | |||
|
|||
prop_hgetline_like_s8_hgetline (LinedASCII filetext) (lineEndIn, lineEndOut) = idempotentIOProperty $ do | |||
tid <- myThreadId | |||
let f = "qc-test-"++show tid |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use openTempFile
, e. g.,
bytestring/tests/LazyHClose.hs
Line 18 in 54133b3
(fn, h) <- openTempFile "." "lazy-hclose-test.tmp" |
tests/Properties.hs
Outdated
prop_hgetline_like_s8_hgetline (LinedASCII filetext) (lineEndIn, lineEndOut) = idempotentIOProperty $ do | ||
tid <- myThreadId | ||
let f = "qc-test-"++show tid | ||
let newlineMode = NewlineMode (if lineEndIn then LF else CRLF) (if lineEndOut then LF else CRLF) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a pity that QuickCheck
does not provide Arbitrary
instances for Newline
and NewlineMode
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I raised an issue (nick8325/quickcheck#322) for this, so hopefully QuickCheck
will provide those instances soon.
But I don't know what the timeline would look like to get rid of the Bool
s here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, it is not worth for bytestring
to depend on the very latest QuickCheck
only for the sake of these instances, so we should probably keep if lineEndIn then LF else CRLF
. But it would be nice to submit a PR to QuickCheck
adding them, for generations to come.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request made, in the process I realized I should flip the conditional because QuickCheck attempts to shrink True
into False
and I would prefer shrinking CRLF
into LF
given that CRLF
is the more complicated case.
tests/Properties.hs
Outdated
tid <- myThreadId | ||
let f = "qc-test-"++show tid | ||
let newlineMode = NewlineMode (if lineEndIn then LF else CRLF) (if lineEndOut then LF else CRLF) | ||
bracket_ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not much point to use bracket_
in tests. In this particular case it even makes things worse, because if the test fails with an exception, I would rather not removeFile f
to facilitate investigation.
tests/Properties.hs
Outdated
sLines <- withFile f ReadMode (\h -> do | ||
hSetNewlineMode h newlineMode | ||
readByLines System.IO.hGetLine h | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be possible to reduce code duplication here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One way would be to write a readFileByLines
function.
readFileByLines filename getLine = withFile filename ReadMode (\h -> do
hSetNewlineMode h newlineMode
readByLines getLine h
)
Alternatively, because it is expensive to open files on Windows it would be possible to use just the Handle
from when the file was written and just seek the beginning after the initial write and read.
Something like
hPutStr h_ filetext
hFlush h_
let readByLinesFromStart getLine = hSeek h_ AbsoluteSeek 0 *> readByLines getLine h_
hSetNewlineMode h_ newlineMode
bsLines <- readByLinesFromStart C.hGetLine
sLines <- readByLinesFromStart System.IO.hGetLine
Then instead of open 3 files per test-case, we open 1 file.
I can try running that on Windows later to see if it is a large improvement (I would expect it is close to 3x for this short test).
Downside being that it isn't as obvious as opening files from scratch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not particularly care about performance of the test suite, I'd rather keep tests as straightforward and focused as possible.
tests/Properties.hs
Outdated
return $ map C.unpack bsLines === sLines | ||
|
||
where | ||
readByLines getLine h_ = go [] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it better than a more naive implementation?
readByLines getLine h_ = do
isEnd <- hIsEOF h_
if isEnd then return [] else (:) <$> getLine h_ <*> readByLines getLine h_
… bug. On Windows, the test data would be written using the platform newlines. This means that any lone \n would become a \r\n. The consequence is that the property would fail to test the implementation on linux line endings when developing on windows. The fix is to set the newlineMode to noNewlineTranslation before writing the test data.
The variables were renamed to make boolean correspondance clearer. Also, True was changed to CRLF in order to get QuickCheck to try shrinking from CRLF to LF if possible.
This PR is incorrect. I did not see any code to handle the usual corner case that CRLF is split across two input buffers with CR in one and LF in the next. A quick test shows the case is not handled. Given a text file where the first character is "x" which is then followed by 40,000 CRLF pairs: λ> import qualified System.IO as IO
λ> import qualified Data.ByteString.Char8 as BC
λ> import Control.Monad (forM)
λ> x <- IO.openFile "/tmp/test.txt" IO.ReadMode
λ> IO.hSetNewlineMode x $ IO.NewlineMode IO.CRLF IO.CRLF
λ> l <- forM [0..39000::Int] $ \i -> do { line <- BC.length <$> hGetLine x; pure (i, line) }
λ> Prelude.filter ((/= 0) . snd) l
[(0,1),(4095,1),(8191,1),(12287,1),(16383,1),(20479,1),(24575,1),(28671,1),(32767,1),(36863,1)] |
I also wonder whether it really makes sense to fix the underlying issue. When the open file has a non-trivial encoding, the bytestring
I am not entirely sure this is actually a flaw. Once the encoding is ignored, all bets are off. If the file is UCS-16, does it really help to try to handle CRLF, when it'll actually be (little-endian) |
If we do not make any assumptions about encoding at all, we should remove many functions, starting from bytestring/Data/ByteString/Char8.hs Lines 21 to 23 in 101566e
This is a pragmatic choice. We can honour CRLF choice without getting into too much trouble, and there is users' demand for it. Covering encodings is more challenging. |
I understand that this is a pragmatic choice, but we're in a state of sin, and perhaps papering over just some of the friction can be considered to be setting up false expectations... Indeed it can be appropriate to check whether the input is in some non-trivial encoding where ISO-8859-1 or "binary" are trivial, as might be various other ISO-8859-X variants. On the other hand it is not entirely obvious to me how to determine which encodings are safe to ignore, and which should raise errors. |
Anyway, if there's general consensus this should move forward, despite the asymmetry with encoding, I won't stand in the way. Of course the implementation needs to be correct, and should not impose an undue performance cost, especially when the
|
Yes, I think we should move this forward. @dbramucci are you still interested to carry on with this PR? |
This PR has stalled. The original author has not stepped forward to fix the split CRLF issue. Should we close this, or is someone else willing to take this over and fix it? |
Also, having just fixed strictness in lazy bytestring Also the |
Sounds like a good idea to me. |
This changes
ByteString.hGetLine
such that it respects the newline mode in the provided handle. (SpecificallyHandle__{haInputNL}
).Before, whether newlines were
LF
(Linux convention) orCRLF
(Windows convention),ByteString.hGetLine
would behave as though the only line ending wasLF
.Now, when
hGetLine
is passed a handle, where the input newline isCRLF
, it will treat both\r\n
and\n
as line feeds, just likeSystem.IO.hGetLine
does.Note that this means that even in
CRLF
mode,"foo\nbar"
will be treated as 2 separate lines.The technical details are straightforward.
haveBuf
now informsfindEOL
whathaInputNL
is.haveBuf
no longer assumes thatfindEOL
only reads 1 char at a time (it can be 2 chars for\r\n
now)findEOL
checks for\r\n
inCRLF
mode and jumps 2 forward, otherwise inching forward like it did previously.Furthermore, a property test and lined ascii text generator have been added to test that
ByteString.hGetLine
behaves likeSystem.IO.hGetLine
does.This is intended to close issue #13