Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Making ByteString.hGetLine behave like System.IO.hGetLine #327

Draft
wants to merge 6 commits into
base: master
Choose a base branch
from
Draft
Show file tree
Hide file tree
Changes from 4 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
37 changes: 27 additions & 10 deletions Data/ByteString.hs
Original file line number Diff line number Diff line change
Expand Up @@ -1816,31 +1816,48 @@ hGetLine h =
else ioe_EOF
else haveBuf h_ buf' len xss

haveBuf h_@Handle__{haByteBuffer}
haveBuf h_@Handle__{haByteBuffer, haInputNL}
buf@Buffer{ bufRaw=raw, bufR=w, bufL=r }
len xss =
do
off <- findEOL r w raw
(off, sizeNewline) <- findEOL haInputNL r w raw
let new_len = len + off - r
xs <- mkPS raw r off

-- if eol == True, then off is the offset of the '\n'
-- otherwise off == w and the buffer is now empty.
if off /= w
then do if w == off + 1
then writeIORef haByteBuffer buf{ bufL=0, bufR=0 }
else writeIORef haByteBuffer buf{ bufL = off + 1 }
mkBigPS new_len (xs:xss)
then do
-- If off + sizeNewline == w then the remaining buffer is empty
if (off + sizeNewline) == w
then writeIORef haByteBuffer buf{ bufL=0, bufR=0 }
else writeIORef haByteBuffer buf{ bufL = off + sizeNewline }
mkBigPS new_len (xs:xss)
else fill h_ buf{ bufL=0, bufR=0 } new_len (xs:xss)

-- find the end-of-line character, if there is one
findEOL r w raw
| r == w = return w
findEOL haInputNL r w raw
| r == w = return (w, 0)
| otherwise = do
c <- readWord8Buf raw r
if c == fromIntegral (ord '\n')
then return r -- NB. not r+1: don't include the '\n'
else findEOL (r+1) w raw
then do
-- NB. not r+1: don't include the '\n'
-- Also, it is important that it ends the line in both modes
-- To match System.IO.hGetLine's behavior
return (r, 1)
else if haInputNL == CRLF && c == fromIntegral (ord '\r') && r+1 < w
then do
c' <- readWord8Buf raw (r+1)
if c' == fromIntegral (ord '\n')
then return (r, 2) -- NB. not r+1 or r+2: don't include the '\r\n'
else do
-- We cannot jump 2 characters ahead
-- because if we encountered '\r\r\n'
-- We would miss the pattern starting on the second '\r'
findEOL haInputNL (r+1) w raw
else findEOL haInputNL (r+1) w raw


mkPS :: RawBuffer Word8 -> Int -> Int -> IO ByteString
mkPS buf start end =
Expand Down
33 changes: 32 additions & 1 deletion tests/Properties.hs
Original file line number Diff line number Diff line change
Expand Up @@ -1614,6 +1614,36 @@ prop_read_write_file_D x = unsafePerformIO $ do
(const $ do y <- D.readFile f
return (x==y))

prop_hgetline_like_s8_hgetline (LinedASCII filetext) (lineEndIn, lineEndOut) = idempotentIOProperty $ do
tid <- myThreadId
let f = "qc-test-"++show tid
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use openTempFile, e. g.,

(fn, h) <- openTempFile "." "lazy-hclose-test.tmp"

let newlineMode = NewlineMode (if lineEndIn then LF else CRLF) (if lineEndOut then LF else CRLF)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a pity that QuickCheck does not provide Arbitrary instances for Newline and NewlineMode.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I raised an issue (nick8325/quickcheck#322) for this, so hopefully QuickCheck will provide those instances soon.
But I don't know what the timeline would look like to get rid of the Bools here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, it is not worth for bytestring to depend on the very latest QuickCheck only for the sake of these instances, so we should probably keep if lineEndIn then LF else CRLF. But it would be nice to submit a PR to QuickCheck adding them, for generations to come.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request made, in the process I realized I should flip the conditional because QuickCheck attempts to shrink True into False and I would prefer shrinking CRLF into LF given that CRLF is the more complicated case.

bracket_
Copy link
Contributor

@Bodigrim Bodigrim Nov 23, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not much point to use bracket_ in tests. In this particular case it even makes things worse, because if the test fails with an exception, I would rather not removeFile f to facilitate investigation.

(writeFile f filetext)
dbramucci marked this conversation as resolved.
Show resolved Hide resolved
(removeFile f)
$ do
bsLines <- withFile f ReadMode (\h -> do
hSetNewlineMode h newlineMode
readByLines C.hGetLine h
)
sLines <- withFile f ReadMode (\h -> do
hSetNewlineMode h newlineMode
readByLines System.IO.hGetLine h
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be possible to reduce code duplication here?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One way would be to write a readFileByLines function.

readFileByLines filename getLine = withFile filename ReadMode (\h -> do
    hSetNewlineMode h newlineMode
    readByLines getLine h
  )

Alternatively, because it is expensive to open files on Windows it would be possible to use just the Handle from when the file was written and just seek the beginning after the initial write and read.
Something like

hPutStr h_ filetext
hFlush h_

let readByLinesFromStart getLine = hSeek h_ AbsoluteSeek 0 *> readByLines getLine h_
   
hSetNewlineMode h_ newlineMode
bsLines <- readByLinesFromStart C.hGetLine
sLines <- readByLinesFromStart System.IO.hGetLine

Then instead of open 3 files per test-case, we open 1 file.
I can try running that on Windows later to see if it is a large improvement (I would expect it is close to 3x for this short test).
Downside being that it isn't as obvious as opening files from scratch.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not particularly care about performance of the test suite, I'd rather keep tests as straightforward and focused as possible.

return $ map C.unpack bsLines === sLines

where
readByLines getLine h_ = go []
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it better than a more naive implementation?

readByLines getLine h_ = do 
  isEnd <- hIsEOF h_ 
  if isEnd then return [] else (:) <$> getLine h_ <*> readByLines getLine h_

where
go lines = do
isEnd <- hIsEOF h_
if isEnd
then return lines
else do
!nextLine <- getLine h_
go (nextLine : lines)


------------------------------------------------------------------------

prop_append_file_P x y = unsafePerformIO $ do
Expand Down Expand Up @@ -1791,7 +1821,8 @@ io_tests =
, testProperty "appendFile " prop_append_file_D

, testProperty "packAddress " prop_packAddress


, testProperty "pack.hGetLine=hGetLine" prop_hgetline_like_s8_hgetline
]

misc_tests =
Expand Down
16 changes: 16 additions & 0 deletions tests/QuickCheckUtils.hs
Original file line number Diff line number Diff line change
Expand Up @@ -93,6 +93,22 @@ instance Arbitrary String8 where
toChar :: Word8 -> Char
toChar = toEnum . fromIntegral

-- | Strings, but each char is ASCII and there are a lot of newlines generated
--
newtype LinedASCII = LinedASCII String
deriving (Eq, Ord, Show)

instance Arbitrary LinedASCII where
arbitrary = fmap LinedASCII . listOf . oneof $
[ arbitraryASCIIChar
, elements ['\n', '\r']
]

shrink (LinedASCII s) = fmap LinedASCII (shrink s)

instance CoArbitrary LinedASCII where
coarbitrary (LinedASCII s) = coarbitrary s

------------------------------------------------------------------------
--
-- We're doing two forms of testing here. Firstly, model based testing.
Expand Down