Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix parsing of link destinations that look like code or <html> #137

Merged
merged 2 commits into from
Sep 11, 2024

Conversation

notriddle
Copy link
Contributor

stack bench output:

Before
commonmark-cli       > build (exe)
commonmark-cli       > Preprocessing executable 'commonmark' for commonmark-cli-0.2.1..
commonmark-cli       > Building executable 'commonmark' for commonmark-cli-0.2.1..
commonmark-cli       > copy/register
commonmark-cli       > Installing executable commonmark in /home/michael/Development/commonmark-hs/.stack-work/install/x86_64-linux/d6a7ffd91072a14c336d55f594f5fe3da20e2523e0123ffc206169986025661e/8.10.7/bin
commonmark           > benchmarks
Running 1 benchmarks... 
Benchmark benchmark-commonmark: RUNNING...
All                     
  tokenize              
    tokenize sample.md: OK (0.24s)
      3.6 ms ± 250 μs   
  parse sample.md       
    commonmark default: OK (0.34s)
       48 ms ± 2.3 ms   
  pathological          
    nested strong emph  
      commonmark        
        1000:           OK (0.29s)
          1.1 ms ±  56 μs
        2000:           OK (0.28s)
          2.2 ms ± 103 μs
        3000:           OK (0.43s)
          3.3 ms ± 109 μs
        4000:           OK (0.28s)
          4.5 ms ± 323 μs
    many emph closers with no openers
      commonmark        
        1000:           OK (0.54s)
          1.0 ms ±  44 μs
        2000:           OK (0.14s)
          2.2 ms ± 200 μs
        3000:           OK (0.42s)
          3.3 ms ±  91 μs
        4000:           OK (0.27s)
          4.3 ms ± 322 μs
    many emph openers with no closers
      commonmark        
        1000:           OK (0.26s)
          1.0 ms ±  54 μs
        2000:           OK (0.27s)
          2.0 ms ± 127 μs
        3000:           OK (0.20s)
          3.2 ms ± 214 μs
        4000:           OK (4.48s)
          4.3 ms ± 355 μs
    many link closers with no openers
      commonmark        
        1000:           OK (0.32s)
          1.2 ms ±  91 μs
        2000:           OK (0.39s)
          2.9 ms ± 279 μs
        3000:           OK (1.05s)
          4.0 ms ±  57 μs
        4000:           OK (0.37s)
          5.5 ms ± 342 μs
    many link openers with no closers
      commonmark        
        1000:           OK (0.33s)
          1.2 ms ±  63 μs
        2000:           OK (0.17s)
          2.6 ms ± 175 μs
        3000:           OK (0.54s)
          4.3 ms ± 419 μs
        4000:           OK (0.36s)
          5.5 ms ± 274 μs
    mismatched openers and closers
      commonmark        
        1000:           OK (0.35s)
           11 ms ± 698 μs
        2000:           OK (0.30s)
           43 ms ± 3.3 ms
        3000:           OK (0.29s)
           97 ms ± 3.3 ms
        4000:           OK (0.53s)
          175 ms ± 5.0 ms
    openers and closers multiple of 3
      commonmark        
        1000:           OK (0.37s)
          3.0 ms ± 144 μs
        2000:           OK (0.32s)
           10 ms ± 824 μs
        3000:           OK (0.35s)
           23 ms ± 1.2 ms
        4000:           OK (0.12s)
           40 ms ± 2.7 ms
    link openers and emph closers
      commonmark        
        1000:           OK (0.30s)
          1.1 ms ± 113 μs
        2000:           OK (0.30s)
          2.3 ms ± 213 μs
        3000:           OK (0.23s)
          3.6 ms ± 190 μs
        4000:           OK (0.31s)
          4.9 ms ± 295 μs
    nested brackets     
      commonmark        
        1000:           OK (0.44s)
          6.8 ms ± 251 μs
        2000:           OK (0.38s)
           25 ms ± 925 μs
        3000:           OK (0.17s)
           56 ms ± 3.0 ms
        4000:           OK (0.29s)
           96 ms ± 5.9 ms
    inline link openers without closers
      commonmark        
        1000:           OK (0.55s)
          2.1 ms ±  71 μs
        2000:           OK (0.28s)
          4.4 ms ± 291 μs
        3000:           OK (0.44s)
          6.8 ms ± 256 μs
        4000:           OK (0.15s)
           10 ms ± 729 μs
    repeated pattern '[ (]('
      commonmark        
        1000:           OK (0.36s)
          1.4 ms ±  48 μs
        2000:           OK (0.37s)
          2.8 ms ± 187 μs
        3000:           OK (0.26s)
          4.3 ms ± 170 μs
        4000:           OK (0.39s)
          5.9 ms ± 239 μs
    nested block quotes 
      commonmark        
        1000:           OK (0.49s)
          937 μs ±  28 μs
        2000:           OK (0.27s)
          2.0 ms ± 162 μs
        3000:           OK (0.85s)
          3.3 ms ± 183 μs
        4000:           OK (0.15s)
          4.7 ms ± 387 μs
    nested list         
      commonmark        
        1000:           OK (0.14s)
          529 μs ±  50 μs
        2000:           OK (0.40s)
          769 μs ±  25 μs
        3000:           OK (0.27s)
          1.0 ms ±  50 μs
        4000:           OK (0.33s)
          1.3 ms ±  57 μs
    nested list 2       
      commonmark        
        1000:           OK (0.34s)
          2.8 ms ± 118 μs
        2000:           OK (0.42s)
          6.4 ms ± 395 μs
        3000:           OK (0.32s)
           10 ms ± 620 μs
        4000:           OK (0.48s)
           15 ms ± 518 μs
    backticks           
      commonmark        
        1000:           OK (0.20s)
          400 μs ±  29 μs
        2000:           OK (0.23s)
          857 μs ±  52 μs
        3000:           OK (0.32s)
          1.2 ms ±  43 μs
        4000:           OK (0.22s)
          1.6 ms ± 147 μs
    CDATA               
      commonmark        
        1000:           OK (0.23s)
          898 μs ±  74 μs
        2000:           OK (0.23s)
          1.8 ms ±  89 μs
        3000:           OK (0.37s)
          2.9 ms ± 127 μs
        4000:           OK (0.49s)
          3.8 ms ± 173 μs
    <?                  
      commonmark        
        1000:           OK (24.57s)
          1.5 ms ±  42 μs
        2000:           OK (0.81s)
          3.1 ms ± 187 μs
        3000:           OK (0.17s)
          5.5 ms ± 398 μs
        4000:           OK (0.84s)
          6.3 ms ± 305 μs
    <!A                 
      commonmark        
        1000:           OK (0.31s)
          1.2 ms ± 105 μs
        2000:           OK (0.32s)
          2.4 ms ± 132 μs
        3000:           OK (0.24s)
          3.7 ms ± 262 μs
        4000:           OK (0.33s)
          5.0 ms ± 423 μs
                        
All 74 tests passed (54.37s)
Benchmark benchmark-commonmark: FINISH
commonmark-extensions> benchmarks
Running 1 benchmarks...            
Benchmark benchmark-commonmark-extensions: RUNNING...
All                                
  commonmark +smart:      OK (0.37s)
     50 ms ± 4.2 ms                
  commonmark +autolink:   OK (0.36s)
     52 ms ± 2.5 ms                
  commonmark +attributes: OK (1.47s)
     47 ms ± 1.9 ms                
  commonmark +pipe_table: OK (0.32s)
     43 ms ± 4.1 ms                
                                   
All 4 tests passed (2.55s)         
Benchmark benchmark-commonmark-extensions: FINISH
Completed 3 action(s).
After
$ stack bench
commonmark           > benchmarks
Running 1 benchmarks...
Benchmark benchmark-commonmark: RUNNING...
All         
  tokenize  
    tokenize sample.md: OK (0.24s)
      3.7 ms ± 221 μs
  parse sample.md
    commonmark default: OK (0.14s)
       47 ms ± 4.6 ms
  pathological
    nested strong emph
      commonmark
        1000:           OK (0.28s)
          1.1 ms ±  43 μs
        2000:           OK (0.62s)
          2.5 ms ± 181 μs
        3000:           OK (0.44s)
          3.4 ms ± 198 μs
        4000:           OK (0.30s)
          4.6 ms ± 213 μs
    many emph closers with no openers
      commonmark
        1000:           OK (0.55s)
          1.1 ms ±  50 μs
        2000:           OK (0.28s)
          2.1 ms ± 112 μs
        3000:           OK (0.21s)
          3.2 ms ± 300 μs
        4000:           OK (0.28s)
          4.4 ms ± 240 μs
    many emph openers with no closers
      commonmark
        1000:           OK (0.26s)
          1.0 ms ±  46 μs
        2000:           OK (0.27s)
          2.1 ms ± 144 μs
        3000:           OK (0.43s)
          3.3 ms ± 209 μs
        4000:           OK (0.29s)
          4.3 ms ± 199 μs
    many link closers with no openers
      commonmark
        1000:           OK (0.33s)
          1.2 ms ±  74 μs
        2000:           OK (0.35s)
          2.6 ms ± 193 μs
        3000:           OK (0.31s)
          4.7 ms ± 275 μs
        4000:           OK (0.75s)
          5.4 ms ± 151 μs
    many link openers with no closers
      commonmark
        1000:           OK (0.34s)
          1.3 ms ±  48 μs
        2000:           OK (0.35s)
          2.6 ms ± 123 μs
        3000:           OK (0.16s)
          5.1 ms ± 499 μs
        4000:           OK (0.38s)
          5.6 ms ± 261 μs
    mismatched openers and closers
      commonmark
        1000:           OK (0.37s)
           11 ms ± 624 μs
        2000:           OK (0.14s)
           45 ms ± 4.5 ms
        3000:           OK (0.30s)
           98 ms ± 3.6 ms
        4000:           OK (0.54s)
          179 ms ± 8.1 ms
    openers and closers multiple of 3
      commonmark
        1000:           OK (0.19s)
          3.1 ms ± 229 μs
        2000:           OK (0.36s)
           12 ms ± 1.0 ms
        3000:           OK (0.34s)
           22 ms ± 1.5 ms
        4000:           OK (0.61s)
           40 ms ± 1.6 ms
    link openers and emph closers
      commonmark
        1000:           OK (0.63s)
          1.1 ms ± 103 μs
        2000:           OK (0.32s)
          2.4 ms ± 104 μs
        3000:           OK (0.94s)
          3.6 ms ± 196 μs
        4000:           OK (0.65s)
          5.0 ms ± 249 μs
    nested brackets
      commonmark
        1000:           OK (0.45s)
          6.8 ms ± 464 μs
        2000:           OK (0.18s)
           25 ms ± 1.8 ms
        3000:           OK (0.17s)
           55 ms ± 4.8 ms
        4000:           OK (0.29s)
           96 ms ± 4.5 ms
    inline link openers without closers
      commonmark
        1000:           OK (0.29s)
          2.1 ms ± 130 μs
        2000:           OK (0.30s)
          4.5 ms ± 251 μs
        3000:           OK (0.46s)
          6.9 ms ± 558 μs
        4000:           OK (0.63s)
          9.8 ms ± 402 μs
    repeated pattern '[ (]('
      commonmark
        1000:           OK (0.38s)
          1.5 ms ±  55 μs
        2000:           OK (0.39s)
          2.8 ms ± 238 μs
        3000:           OK (0.29s)
          4.3 ms ± 279 μs
        4000:           OK (0.40s)
          6.0 ms ± 554 μs
    nested block quotes
      commonmark
        1000:           OK (0.50s)
          941 μs ±  23 μs
        2000:           OK (0.54s)
          2.0 ms ±  43 μs
        3000:           OK (0.21s)
          3.2 ms ± 316 μs
        4000:           OK (0.29s)
          4.2 ms ± 223 μs
    nested list
      commonmark
        1000:           OK (0.28s)
          529 μs ±  24 μs
        2000:           OK (0.42s)
          813 μs ±  47 μs
        3000:           OK (0.30s)
          1.1 ms ±  66 μs
        4000:           OK (0.36s)
          1.3 ms ± 128 μs
    nested list 2
      commonmark
        1000:           OK (0.36s)
          2.8 ms ± 108 μs
        2000:           OK (0.42s)
          6.3 ms ± 387 μs
        3000:           OK (1.33s)
           10 ms ± 167 μs
        4000:           OK (0.49s)
           15 ms ± 796 μs
    backticks
      commonmark
        1000:           OK (0.21s)
          406 μs ±  23 μs
        2000:           OK (0.22s)
          862 μs ±  57 μs
        3000:           OK (0.33s)
          1.3 ms ±  57 μs
        4000:           OK (0.23s)
          1.7 ms ±  97 μs
    CDATA   
      commonmark
        1000:           OK (0.24s)
          916 μs ±  43 μs
        2000:           OK (0.49s)
          1.8 ms ±  51 μs
        3000:           OK (0.38s)
          2.9 ms ±  93 μs
        4000:           OK (0.50s)
          3.8 ms ±  93 μs
    <?      
      commonmark
        1000:           OK (0.40s)
          1.5 ms ±  53 μs
        2000:           OK (0.40s)
          3.0 ms ±  95 μs
        3000:           OK (0.31s)
          4.6 ms ± 390 μs
        4000:           OK (0.22s)
          6.2 ms ± 570 μs
    <!A     
      commonmark
        1000:           OK (0.32s)
          1.2 ms ±  55 μs
        2000:           OK (0.33s)
          2.4 ms ± 144 μs
        3000:           OK (0.48s)
          3.7 ms ± 157 μs
        4000:           OK (0.33s)
          4.9 ms ± 202 μs
            
All 74 tests passed (28.22s)
Benchmark benchmark-commonmark: FINISH
commonmark-extensions> benchmarks
Running 1 benchmarks...            
Benchmark benchmark-commonmark-extensions: RUNNING...
All                                
  commonmark +smart:      OK (0.78s)
     50 ms ± 2.0 ms                
  commonmark +autolink:   OK (0.39s)
     53 ms ± 4.1 ms                
  commonmark +attributes: OK (0.74s)
     48 ms ± 1.0 ms                
  commonmark +pipe_table: OK (0.35s)
     47 ms ± 3.0 ms                
                                   
All 4 tests passed (2.25s)         
Benchmark benchmark-commonmark-extensions: FINISH
Completed 2 action(s).

Fixes #136

This works by re-parsing the tokens that come after the link, but only when the end delimiter isn't on a chunk boundary (since that's the only way this problem can happen).

Re-parsing a specific chunk won't work, because the part that needs re-interpreted can span more than one chunk. For example, we can draw the bounds of the erroneous code chunk in this example:

[x](`) <a href="`">
    ^-----------^

If we re-parse the underlined part in isolation, we'll fix the first link, but won't find the HTML (since the closing angle bracket is in the next chunk).

On the other hand, parsing links, code, and HTML in a single pass would make writing extensions more complicated. For example, LaTeX math is supposed to have the same binding strength as code spans:

$first[$](about)
^------^ this is a math span, not a link

[first]($)$5/8$
        ^-^ this is an analogue of the original bug
            it shouldn't be a math span, but looks like one

@notriddle
Copy link
Contributor Author

@jgm other than the merge conflict in regression.md (a simple rebase), is there anything blocking this fix?

@jgm
Copy link
Owner

jgm commented Jul 4, 2024

This probably just arrived at a busy time; I will look at. It would help if you could rebase.

Fixes jgm#136

This works by re-parsing the tokens that come after the link,
but only when the end delimiter isn't on a chunk boundary
(since that's the only way this problem can happen).

Re-parsing a specific chunk won't work, because the part that
needs re-interpreted can span more than one chunk. For example,
we can draw the bounds of the erroneous code chunk in this example:

    [x](`) <a href="`">
        ^-----------^

If we re-parse the underlined part in isolation, we'll fix the
first link, but won't find the HTML (since the closing angle
bracket is in the next chunk).

On the other hand, parsing links, code, and HTML in a single pass
would make writing extensions more complicated. For example,
LaTeX math is supposed to have the same binding strength as
code spans:

    $first[$](about)
    ^------^ this is a math span, not a link

    [first]($)$5/8$
            ^-^ this is an analogue of the original bug
                it shouldn't be a math span, but looks like one
@notriddle
Copy link
Contributor Author

Okay, it’s rebased.

@notriddle
Copy link
Contributor Author

Any feedback on this PR?

@jgm
Copy link
Owner

jgm commented Sep 11, 2024

Sorry, I was on vacation when this posted and it got ignored. I'll try to have a look in the near future!

Copy link
Owner

@jgm jgm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems to do the job, but the code is pretty complex and it's hard to survey. I suggest adding some comments to explain what is going on, and maybe using intermediate definitions to make the code clearer. That will make it easier to maintain this code in the future.

Comment on lines 77 to 92
let go chunks toks' bottoms = do
chunks' <- {-# SCC parseChunks #-} evalStateT
(parseChunks bracketedSpecs formattingSpecs ilParsers
attrParser rm toks') defaultEnders
case chunks' of
Left err -> return $ Left err
Right chunks'' ->
case (processBrackets bracketedSpecs rm (chunks ++ chunks'') bottoms) of
Left st -> go ((reverse . befores . rightCursor) st) (mconcat (map chunkToks $ (maybeToList . center $ rightCursor st) ++ (afters $ rightCursor st))) (stackBottoms st)
Right chunks''' -> return $ Right chunks'''
res <- go [] ((dropWhile iswhite . reverse . dropWhile iswhite . reverse) toks) mempty
return $!
case res of
Left err -> Left err
Right chunks ->
(Right .
unChunks .
processEmphasis .
processBrackets bracketedSpecs rm) chunks
case res of
Left err -> Left err
Right chunks ->
(Right . unChunks . processEmphasis) chunks
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This has gotten pretty complex, and I have a hard time following what is going on. I'm afraid I'll be at a loss if I ever need to modify this part of the code again. Can it be made a bit more user-friendly, with some comments and simplifying definitions?

Comment on lines 702 to +703
processBrackets :: IsInline a
=> [BracketedSpec a] -> ReferenceMap -> [Chunk a] -> [Chunk a]
processBrackets bracketedSpecs rm xs =
=> [BracketedSpec a] -> ReferenceMap -> [Chunk a] -> M.Map Text SourcePos -> Either (DState a) [Chunk a]
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The type has gotten more complex; it would be worth explaining in a comment what the significance of Left and Right results is, and maybe why the bottoms map is needed.

Comment on lines -737 to +741
=> [BracketedSpec a] -> DState a -> [Chunk a]
=> [BracketedSpec a] -> DState a -> Either (DState a) [Chunk a]
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Worth a comment at the beginning explaining what Left and Right results mean here.

@notriddle
Copy link
Contributor Author

Do the new comments and variable names help?

@jgm
Copy link
Owner

jgm commented Sep 11, 2024

Yes, this looks great, thanks!

@jgm jgm merged commit ff9fe57 into jgm:master Sep 11, 2024
7 checks passed
@notriddle notriddle deleted the notriddle/backquote-link branch September 11, 2024 19:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[fuzz result] code span vanishes when link destination is `
2 participants