Fix parsing of link destinations that look like `code` or <html> #137

notriddle · 2024-01-18T23:08:13Z

stack bench output:

Before

commonmark-cli       > build (exe)
commonmark-cli       > Preprocessing executable 'commonmark' for commonmark-cli-0.2.1..
commonmark-cli       > Building executable 'commonmark' for commonmark-cli-0.2.1..
commonmark-cli       > copy/register
commonmark-cli       > Installing executable commonmark in /home/michael/Development/commonmark-hs/.stack-work/install/x86_64-linux/d6a7ffd91072a14c336d55f594f5fe3da20e2523e0123ffc206169986025661e/8.10.7/bin
commonmark           > benchmarks
Running 1 benchmarks... 
Benchmark benchmark-commonmark: RUNNING...
All                     
  tokenize              
    tokenize sample.md: OK (0.24s)
      3.6 ms ± 250 μs   
  parse sample.md       
    commonmark default: OK (0.34s)
       48 ms ± 2.3 ms   
  pathological          
    nested strong emph  
      commonmark        
        1000:           OK (0.29s)
          1.1 ms ±  56 μs
        2000:           OK (0.28s)
          2.2 ms ± 103 μs
        3000:           OK (0.43s)
          3.3 ms ± 109 μs
        4000:           OK (0.28s)
          4.5 ms ± 323 μs
    many emph closers with no openers
      commonmark        
        1000:           OK (0.54s)
          1.0 ms ±  44 μs
        2000:           OK (0.14s)
          2.2 ms ± 200 μs
        3000:           OK (0.42s)
          3.3 ms ±  91 μs
        4000:           OK (0.27s)
          4.3 ms ± 322 μs
    many emph openers with no closers
      commonmark        
        1000:           OK (0.26s)
          1.0 ms ±  54 μs
        2000:           OK (0.27s)
          2.0 ms ± 127 μs
        3000:           OK (0.20s)
          3.2 ms ± 214 μs
        4000:           OK (4.48s)
          4.3 ms ± 355 μs
    many link closers with no openers
      commonmark        
        1000:           OK (0.32s)
          1.2 ms ±  91 μs
        2000:           OK (0.39s)
          2.9 ms ± 279 μs
        3000:           OK (1.05s)
          4.0 ms ±  57 μs
        4000:           OK (0.37s)
          5.5 ms ± 342 μs
    many link openers with no closers
      commonmark        
        1000:           OK (0.33s)
          1.2 ms ±  63 μs
        2000:           OK (0.17s)
          2.6 ms ± 175 μs
        3000:           OK (0.54s)
          4.3 ms ± 419 μs
        4000:           OK (0.36s)
          5.5 ms ± 274 μs
    mismatched openers and closers
      commonmark        
        1000:           OK (0.35s)
           11 ms ± 698 μs
        2000:           OK (0.30s)
           43 ms ± 3.3 ms
        3000:           OK (0.29s)
           97 ms ± 3.3 ms
        4000:           OK (0.53s)
          175 ms ± 5.0 ms
    openers and closers multiple of 3
      commonmark        
        1000:           OK (0.37s)
          3.0 ms ± 144 μs
        2000:           OK (0.32s)
           10 ms ± 824 μs
        3000:           OK (0.35s)
           23 ms ± 1.2 ms
        4000:           OK (0.12s)
           40 ms ± 2.7 ms
    link openers and emph closers
      commonmark        
        1000:           OK (0.30s)
          1.1 ms ± 113 μs
        2000:           OK (0.30s)
          2.3 ms ± 213 μs
        3000:           OK (0.23s)
          3.6 ms ± 190 μs
        4000:           OK (0.31s)
          4.9 ms ± 295 μs
    nested brackets     
      commonmark        
        1000:           OK (0.44s)
          6.8 ms ± 251 μs
        2000:           OK (0.38s)
           25 ms ± 925 μs
        3000:           OK (0.17s)
           56 ms ± 3.0 ms
        4000:           OK (0.29s)
           96 ms ± 5.9 ms
    inline link openers without closers
      commonmark        
        1000:           OK (0.55s)
          2.1 ms ±  71 μs
        2000:           OK (0.28s)
          4.4 ms ± 291 μs
        3000:           OK (0.44s)
          6.8 ms ± 256 μs
        4000:           OK (0.15s)
           10 ms ± 729 μs
    repeated pattern '[ (]('
      commonmark        
        1000:           OK (0.36s)
          1.4 ms ±  48 μs
        2000:           OK (0.37s)
          2.8 ms ± 187 μs
        3000:           OK (0.26s)
          4.3 ms ± 170 μs
        4000:           OK (0.39s)
          5.9 ms ± 239 μs
    nested block quotes 
      commonmark        
        1000:           OK (0.49s)
          937 μs ±  28 μs
        2000:           OK (0.27s)
          2.0 ms ± 162 μs
        3000:           OK (0.85s)
          3.3 ms ± 183 μs
        4000:           OK (0.15s)
          4.7 ms ± 387 μs
    nested list         
      commonmark        
        1000:           OK (0.14s)
          529 μs ±  50 μs
        2000:           OK (0.40s)
          769 μs ±  25 μs
        3000:           OK (0.27s)
          1.0 ms ±  50 μs
        4000:           OK (0.33s)
          1.3 ms ±  57 μs
    nested list 2       
      commonmark        
        1000:           OK (0.34s)
          2.8 ms ± 118 μs
        2000:           OK (0.42s)
          6.4 ms ± 395 μs
        3000:           OK (0.32s)
           10 ms ± 620 μs
        4000:           OK (0.48s)
           15 ms ± 518 μs
    backticks           
      commonmark        
        1000:           OK (0.20s)
          400 μs ±  29 μs
        2000:           OK (0.23s)
          857 μs ±  52 μs
        3000:           OK (0.32s)
          1.2 ms ±  43 μs
        4000:           OK (0.22s)
          1.6 ms ± 147 μs
    CDATA               
      commonmark        
        1000:           OK (0.23s)
          898 μs ±  74 μs
        2000:           OK (0.23s)
          1.8 ms ±  89 μs
        3000:           OK (0.37s)
          2.9 ms ± 127 μs
        4000:           OK (0.49s)
          3.8 ms ± 173 μs
    <?                  
      commonmark        
        1000:           OK (24.57s)
          1.5 ms ±  42 μs
        2000:           OK (0.81s)
          3.1 ms ± 187 μs
        3000:           OK (0.17s)
          5.5 ms ± 398 μs
        4000:           OK (0.84s)
          6.3 ms ± 305 μs
    <!A                 
      commonmark        
        1000:           OK (0.31s)
          1.2 ms ± 105 μs
        2000:           OK (0.32s)
          2.4 ms ± 132 μs
        3000:           OK (0.24s)
          3.7 ms ± 262 μs
        4000:           OK (0.33s)
          5.0 ms ± 423 μs
                        
All 74 tests passed (54.37s)
Benchmark benchmark-commonmark: FINISH
commonmark-extensions> benchmarks
Running 1 benchmarks...            
Benchmark benchmark-commonmark-extensions: RUNNING...
All                                
  commonmark +smart:      OK (0.37s)
     50 ms ± 4.2 ms                
  commonmark +autolink:   OK (0.36s)
     52 ms ± 2.5 ms                
  commonmark +attributes: OK (1.47s)
     47 ms ± 1.9 ms                
  commonmark +pipe_table: OK (0.32s)
     43 ms ± 4.1 ms                
                                   
All 4 tests passed (2.55s)         
Benchmark benchmark-commonmark-extensions: FINISH
Completed 3 action(s).

After

$ stack bench
commonmark           > benchmarks
Running 1 benchmarks...
Benchmark benchmark-commonmark: RUNNING...
All         
  tokenize  
    tokenize sample.md: OK (0.24s)
      3.7 ms ± 221 μs
  parse sample.md
    commonmark default: OK (0.14s)
       47 ms ± 4.6 ms
  pathological
    nested strong emph
      commonmark
        1000:           OK (0.28s)
          1.1 ms ±  43 μs
        2000:           OK (0.62s)
          2.5 ms ± 181 μs
        3000:           OK (0.44s)
          3.4 ms ± 198 μs
        4000:           OK (0.30s)
          4.6 ms ± 213 μs
    many emph closers with no openers
      commonmark
        1000:           OK (0.55s)
          1.1 ms ±  50 μs
        2000:           OK (0.28s)
          2.1 ms ± 112 μs
        3000:           OK (0.21s)
          3.2 ms ± 300 μs
        4000:           OK (0.28s)
          4.4 ms ± 240 μs
    many emph openers with no closers
      commonmark
        1000:           OK (0.26s)
          1.0 ms ±  46 μs
        2000:           OK (0.27s)
          2.1 ms ± 144 μs
        3000:           OK (0.43s)
          3.3 ms ± 209 μs
        4000:           OK (0.29s)
          4.3 ms ± 199 μs
    many link closers with no openers
      commonmark
        1000:           OK (0.33s)
          1.2 ms ±  74 μs
        2000:           OK (0.35s)
          2.6 ms ± 193 μs
        3000:           OK (0.31s)
          4.7 ms ± 275 μs
        4000:           OK (0.75s)
          5.4 ms ± 151 μs
    many link openers with no closers
      commonmark
        1000:           OK (0.34s)
          1.3 ms ±  48 μs
        2000:           OK (0.35s)
          2.6 ms ± 123 μs
        3000:           OK (0.16s)
          5.1 ms ± 499 μs
        4000:           OK (0.38s)
          5.6 ms ± 261 μs
    mismatched openers and closers
      commonmark
        1000:           OK (0.37s)
           11 ms ± 624 μs
        2000:           OK (0.14s)
           45 ms ± 4.5 ms
        3000:           OK (0.30s)
           98 ms ± 3.6 ms
        4000:           OK (0.54s)
          179 ms ± 8.1 ms
    openers and closers multiple of 3
      commonmark
        1000:           OK (0.19s)
          3.1 ms ± 229 μs
        2000:           OK (0.36s)
           12 ms ± 1.0 ms
        3000:           OK (0.34s)
           22 ms ± 1.5 ms
        4000:           OK (0.61s)
           40 ms ± 1.6 ms
    link openers and emph closers
      commonmark
        1000:           OK (0.63s)
          1.1 ms ± 103 μs
        2000:           OK (0.32s)
          2.4 ms ± 104 μs
        3000:           OK (0.94s)
          3.6 ms ± 196 μs
        4000:           OK (0.65s)
          5.0 ms ± 249 μs
    nested brackets
      commonmark
        1000:           OK (0.45s)
          6.8 ms ± 464 μs
        2000:           OK (0.18s)
           25 ms ± 1.8 ms
        3000:           OK (0.17s)
           55 ms ± 4.8 ms
        4000:           OK (0.29s)
           96 ms ± 4.5 ms
    inline link openers without closers
      commonmark
        1000:           OK (0.29s)
          2.1 ms ± 130 μs
        2000:           OK (0.30s)
          4.5 ms ± 251 μs
        3000:           OK (0.46s)
          6.9 ms ± 558 μs
        4000:           OK (0.63s)
          9.8 ms ± 402 μs
    repeated pattern '[ (]('
      commonmark
        1000:           OK (0.38s)
          1.5 ms ±  55 μs
        2000:           OK (0.39s)
          2.8 ms ± 238 μs
        3000:           OK (0.29s)
          4.3 ms ± 279 μs
        4000:           OK (0.40s)
          6.0 ms ± 554 μs
    nested block quotes
      commonmark
        1000:           OK (0.50s)
          941 μs ±  23 μs
        2000:           OK (0.54s)
          2.0 ms ±  43 μs
        3000:           OK (0.21s)
          3.2 ms ± 316 μs
        4000:           OK (0.29s)
          4.2 ms ± 223 μs
    nested list
      commonmark
        1000:           OK (0.28s)
          529 μs ±  24 μs
        2000:           OK (0.42s)
          813 μs ±  47 μs
        3000:           OK (0.30s)
          1.1 ms ±  66 μs
        4000:           OK (0.36s)
          1.3 ms ± 128 μs
    nested list 2
      commonmark
        1000:           OK (0.36s)
          2.8 ms ± 108 μs
        2000:           OK (0.42s)
          6.3 ms ± 387 μs
        3000:           OK (1.33s)
           10 ms ± 167 μs
        4000:           OK (0.49s)
           15 ms ± 796 μs
    backticks
      commonmark
        1000:           OK (0.21s)
          406 μs ±  23 μs
        2000:           OK (0.22s)
          862 μs ±  57 μs
        3000:           OK (0.33s)
          1.3 ms ±  57 μs
        4000:           OK (0.23s)
          1.7 ms ±  97 μs
    CDATA   
      commonmark
        1000:           OK (0.24s)
          916 μs ±  43 μs
        2000:           OK (0.49s)
          1.8 ms ±  51 μs
        3000:           OK (0.38s)
          2.9 ms ±  93 μs
        4000:           OK (0.50s)
          3.8 ms ±  93 μs
    <?      
      commonmark
        1000:           OK (0.40s)
          1.5 ms ±  53 μs
        2000:           OK (0.40s)
          3.0 ms ±  95 μs
        3000:           OK (0.31s)
          4.6 ms ± 390 μs
        4000:           OK (0.22s)
          6.2 ms ± 570 μs
    <!A     
      commonmark
        1000:           OK (0.32s)
          1.2 ms ±  55 μs
        2000:           OK (0.33s)
          2.4 ms ± 144 μs
        3000:           OK (0.48s)
          3.7 ms ± 157 μs
        4000:           OK (0.33s)
          4.9 ms ± 202 μs
            
All 74 tests passed (28.22s)
Benchmark benchmark-commonmark: FINISH
commonmark-extensions> benchmarks
Running 1 benchmarks...            
Benchmark benchmark-commonmark-extensions: RUNNING...
All                                
  commonmark +smart:      OK (0.78s)
     50 ms ± 2.0 ms                
  commonmark +autolink:   OK (0.39s)
     53 ms ± 4.1 ms                
  commonmark +attributes: OK (0.74s)
     48 ms ± 1.0 ms                
  commonmark +pipe_table: OK (0.35s)
     47 ms ± 3.0 ms                
                                   
All 4 tests passed (2.25s)         
Benchmark benchmark-commonmark-extensions: FINISH
Completed 2 action(s).

Fixes #136

This works by re-parsing the tokens that come after the link, but only when the end delimiter isn't on a chunk boundary (since that's the only way this problem can happen).

Re-parsing a specific chunk won't work, because the part that needs re-interpreted can span more than one chunk. For example, we can draw the bounds of the erroneous code chunk in this example:

[x](`) <a href="`">
    ^-----------^

If we re-parse the underlined part in isolation, we'll fix the first link, but won't find the HTML (since the closing angle bracket is in the next chunk).

On the other hand, parsing links, code, and HTML in a single pass would make writing extensions more complicated. For example, LaTeX math is supposed to have the same binding strength as code spans:

$first[$](about)
^------^ this is a math span, not a link

[first]($)$5/8$
        ^-^ this is an analogue of the original bug
            it shouldn't be a math span, but looks like one

notriddle · 2024-07-04T05:58:46Z

@jgm other than the merge conflict in regression.md (a simple rebase), is there anything blocking this fix?

jgm · 2024-07-04T06:11:48Z

This probably just arrived at a busy time; I will look at. It would help if you could rebase.

Fixes jgm#136 This works by re-parsing the tokens that come after the link, but only when the end delimiter isn't on a chunk boundary (since that's the only way this problem can happen). Re-parsing a specific chunk won't work, because the part that needs re-interpreted can span more than one chunk. For example, we can draw the bounds of the erroneous code chunk in this example: [x](`) <a href="`"> ^-----------^ If we re-parse the underlined part in isolation, we'll fix the first link, but won't find the HTML (since the closing angle bracket is in the next chunk). On the other hand, parsing links, code, and HTML in a single pass would make writing extensions more complicated. For example, LaTeX math is supposed to have the same binding strength as code spans: $first[$](about) ^------^ this is a math span, not a link [first]($)$5/8$ ^-^ this is an analogue of the original bug it shouldn't be a math span, but looks like one

notriddle · 2024-07-16T02:47:32Z

Okay, it’s rebased.

notriddle · 2024-09-11T15:09:40Z

Any feedback on this PR?

jgm · 2024-09-11T15:58:46Z

Sorry, I was on vacation when this posted and it got ignored. I'll try to have a look in the near future!

jgm

This seems to do the job, but the code is pretty complex and it's hard to survey. I suggest adding some comments to explain what is going on, and maybe using intermediate definitions to make the code clearer. That will make it easier to maintain this code in the future.

jgm · 2024-09-11T16:09:46Z

commonmark/src/Commonmark/Inlines.hs

+  let go chunks toks' bottoms = do
+        chunks' <- {-# SCC parseChunks #-} evalStateT
+                (parseChunks bracketedSpecs formattingSpecs ilParsers
+                 attrParser rm toks') defaultEnders
+        case chunks' of
+           Left err -> return $ Left err
+           Right chunks'' ->
+              case (processBrackets bracketedSpecs rm (chunks ++ chunks'') bottoms) of
+                  Left st -> go ((reverse . befores . rightCursor) st) (mconcat (map chunkToks $ (maybeToList . center $ rightCursor st) ++ (afters $ rightCursor st))) (stackBottoms st)
+                  Right chunks''' -> return $ Right chunks'''
+  res <- go [] ((dropWhile iswhite . reverse . dropWhile iswhite . reverse) toks) mempty
  return $!
-    case res of
-       Left err     -> Left err
-       Right chunks ->
-         (Right .
-          unChunks .
-          processEmphasis .
-          processBrackets bracketedSpecs rm) chunks
+      case res of
+         Left err     -> Left err
+         Right chunks ->
+           (Right . unChunks . processEmphasis) chunks


This has gotten pretty complex, and I have a hard time following what is going on. I'm afraid I'll be at a loss if I ever need to modify this part of the code again. Can it be made a bit more user-friendly, with some comments and simplifying definitions?

jgm · 2024-09-11T16:10:47Z

commonmark/src/Commonmark/Inlines.hs

 processBrackets :: IsInline a
-                => [BracketedSpec a] -> ReferenceMap -> [Chunk a] -> [Chunk a]
-processBrackets bracketedSpecs rm xs =
+                => [BracketedSpec a] -> ReferenceMap -> [Chunk a] -> M.Map Text SourcePos -> Either (DState a) [Chunk a]


The type has gotten more complex; it would be worth explaining in a comment what the significance of Left and Right results is, and maybe why the bottoms map is needed.

jgm · 2024-09-11T16:11:18Z

commonmark/src/Commonmark/Inlines.hs

-          => [BracketedSpec a] -> DState a -> [Chunk a]
+          => [BracketedSpec a] -> DState a -> Either (DState a) [Chunk a]


Worth a comment at the beginning explaining what Left and Right results mean here.

notriddle · 2024-09-11T17:47:12Z

Do the new comments and variable names help?

jgm · 2024-09-11T18:58:34Z

Yes, this looks great, thanks!

notriddle force-pushed the notriddle/backquote-link branch from 6384ae9 to 96ef383 Compare January 18, 2024 23:14

notriddle force-pushed the notriddle/backquote-link branch 2 times, most recently from 7ccb68c to f8b163e Compare January 27, 2024 00:10

notriddle mentioned this pull request Feb 5, 2024

[fuzz result] code span vanishes when link destination is ` #136

Closed

notriddle force-pushed the notriddle/backquote-link branch from f8b163e to abfc152 Compare July 4, 2024 14:34

notriddle force-pushed the notriddle/backquote-link branch from abfc152 to fe3bee4 Compare July 4, 2024 14:35

jgm reviewed Sep 11, 2024

View reviewed changes

Add more descriptive variables and comments

c983274

jgm merged commit ff9fe57 into jgm:master Sep 11, 2024
7 checks passed

notriddle deleted the notriddle/backquote-link branch September 11, 2024 19:00

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix parsing of link destinations that look like `code` or <html> #137

Fix parsing of link destinations that look like `code` or <html> #137

notriddle commented Jan 18, 2024

notriddle commented Jul 4, 2024

jgm commented Jul 4, 2024

notriddle commented Jul 16, 2024

notriddle commented Sep 11, 2024

jgm commented Sep 11, 2024

jgm left a comment

jgm Sep 11, 2024

jgm Sep 11, 2024

jgm Sep 11, 2024

notriddle commented Sep 11, 2024

jgm commented Sep 11, 2024

		=> [BracketedSpec a] -> DState a -> [Chunk a]
		=> [BracketedSpec a] -> DState a -> Either (DState a) [Chunk a]

Fix parsing of link destinations that look like code or <html> #137

Fix parsing of link destinations that look like code or <html> #137

Conversation

notriddle commented Jan 18, 2024

notriddle commented Jul 4, 2024

jgm commented Jul 4, 2024

notriddle commented Jul 16, 2024

notriddle commented Sep 11, 2024

jgm commented Sep 11, 2024

jgm left a comment

Choose a reason for hiding this comment

jgm Sep 11, 2024

Choose a reason for hiding this comment

jgm Sep 11, 2024

Choose a reason for hiding this comment

jgm Sep 11, 2024

Choose a reason for hiding this comment

notriddle commented Sep 11, 2024

jgm commented Sep 11, 2024

Fix parsing of link destinations that look like `code` or <html> #137

Fix parsing of link destinations that look like `code` or <html> #137