[Bug] Conversion of HTML manual pages to markdown fails for HTML figure code #4864

neteler · 2024-12-20T13:02:08Z

Describe the bug

I am working on the mass conversion of all HTML manual pages to markdown. To convert all HTML files to markdown I have written a pandoc based converter script (see #4620) which already does most of the job.

A showstopper in the conversion of HTML manual pages to markdown is the figures: the related HTML snippets vary from manual page to manual page, even though there is a style recommendation.

For an easier discussion, I have moved the figure issue here to separate it out from #4748.

Many figures look ugly after MD conversion (the resulting MD code is partially garbage):

- v.fill.holes.html figures, see grass/vector/v.fill.holes/v.fill.holes.html, line 13 in fc94e29:
  `<div align="center" style="margin: 10px">`
- v.to.rast3.html figure
- ... many more
- often the figure captions are not properly detected: mkdocs/site/raster3dintro.html

I have written a Lua filter for pandoc (not yet submitted), but it can only convert that specific HTML code. With so many HTML variants, I have no idea how to handle them all.
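To make the figure problem concrete, here is a minimal Python sketch (standard library only) of normalizing one figure variant before or instead of a pandoc filter. It is not the unsubmitted Lua filter. It assumes a figure is a `<div>` that wraps exactly one `<img>` followed by caption text; only the opening `<div align="center" style="margin: 10px">` comes from the manual pages quoted above, while the `<br>`/`<i>` caption layout, the class name `FigureExtractor` and the function name `figures_to_markdown` are hypothetical.

```python
from html.parser import HTMLParser


class FigureExtractor(HTMLParser):
    """Collect (src, caption) pairs from <div> blocks that wrap an <img>.

    Assumption: one image per outermost <div>, caption text after the image.
    """

    def __init__(self):
        super().__init__()
        self.figures = []   # finished (src, caption) tuples
        self._depth = 0     # nesting depth of open <div> elements
        self._src = None    # image source seen inside the current <div>
        self._text = []     # text seen after the image (the caption)

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self._depth += 1
        elif tag == "img" and self._depth:
            self._src = dict(attrs).get("src")

    def handle_data(self, data):
        # Only text that follows the image counts as caption.
        if self._depth and self._src:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag != "div" or not self._depth:
            return
        self._depth -= 1
        if self._depth == 0:
            if self._src:
                caption = " ".join("".join(self._text).split())
                self.figures.append((self._src, caption))
            self._src, self._text = None, []


def figures_to_markdown(html):
    """Return one markdown image (plus italic caption) per detected figure."""
    parser = FigureExtractor()
    parser.feed(html)
    parser.close()
    blocks = []
    for src, caption in parser.figures:
        block = f"![{caption}]({src})"
        if caption:
            # Two trailing spaces force a line break before the caption.
            block += f"  \n*{caption}*"
        blocks.append(block)
    return "\n\n".join(blocks)
```

Every further HTML variant would need its own rule in the parser, which is exactly why streamlining the HTML first may be cheaper than growing the converter.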

To reproduce

1. Run the utils/grass_html2md.sh converter script (see docs: script to convert HTML manual pages to markdown #4620)
2. Run markdownlint on the MD files

I tried to submit the converted MD files for community review but I get stuck in the pre-commit stage:

From my terminal:

markdownlint-fix.........................................................Failed
- hook id: markdownlint-fix
- exit code: 1
- files were modified by this hook
display/d.rast/d.rast.md:14:1 MD033/no-inline-html Inline HTML [Element: div]
display/d.rast/d.rast.md:16:1 MD033/no-inline-html Inline HTML [Element: img]
display/d.rast/d.rast.md:29:1 MD033/no-inline-html Inline HTML [Element: div]
display/d.rast/d.rast.md:31:1 MD033/no-inline-html Inline HTML [Element: img]
display/d.rast/d.rast.md:43:1 MD033/no-inline-html Inline HTML [Element: div]
display/d.rast/d.rast.md:45:1 MD033/no-inline-html Inline HTML [Element: img]
gui/wxpython/docs/wxGUI.toolboxes.md:180:1 MD033/no-inline-html Inline HTML [Element: img]
gui/wxpython/timeline/g.gui.timeline.md:14:1 MD033/no-inline-html Inline HTML [Element: img]
raster/r.li/r.li.cwed/r.li.cwed.md:12:6 MD033/no-inline-html Inline HTML [Element: span]
raster/r.li/r.li.cwed/r.li.cwed.md:12:26 MD033/no-inline-html Inline HTML [Element: span]
raster/r.li/r.li.cwed/r.li.cwed.md:14:6 MD033/no-inline-html Inline HTML [Element: span]
raster/r.li/r.li.cwed/r.li.cwed.md:14:26 MD033/no-inline-html Inline HTML [Element: span]
raster/r.li/r.li.cwed/r.li.cwed.md:21:1 MD033/no-inline-html Inline HTML [Element: span]
raster/r.li/r.li.mpa/r.li.mpa.md:10:6 MD033/no-inline-html Inline HTML [Element: span]
raster/r.li/r.li.mpa/r.li.mpa.md:10:26 MD033/no-inline-html Inline HTML [Element: span]
raster/r.path/r.path.md:122:1 MD033/no-inline-html Inline HTML [Element: div]
raster/r.path/r.path.md:124:2 MD033/no-inline-html Inline HTML [Element: img]
raster/r.path/r.path.md:176:1 MD033/no-inline-html Inline HTML [Element: div]
raster/r.path/r.path.md:178:2 MD033/no-inline-html Inline HTML [Element: img]
raster/r.resamp.filter/r.resamp.filter.md:98:1 MD033/no-inline-html Inline HTML [Element: div]
raster/r.resamp.filter/r.resamp.filter.md:100:1 MD033/no-inline-html Inline HTML [Element: img]
raster/r.sim/r.sim.water/r.sim.water.md:30:1 MD033/no-inline-html Inline HTML [Element: div]
raster/r.sim/r.sim.water/r.sim.water.md:32:1 MD033/no-inline-html Inline HTML [Element: img]
raster/r.sim/r.sim.water/r.sim.water.md:154:81 MD013/line-length Line length [Expected: 80; Actual: 147]
raster/r.sim/r.sim.water/r.sim.water.md:168:81 MD013/line-length Line length [Expected: 80; Actual: 95]
raster/r.sim/r.sim.water/r.sim.water.md:175:1 MD033/no-inline-html Inline HTML [Element: div]
raster/r.sim/r.sim.water/r.sim.water.md:177:1 MD033/no-inline-html Inline HTML [Element: img]
raster/r.sunmask/r.sunmask.md:89:81 MD013/line-length Line length [Expected: 80; Actual: 96]
raster/r.univar/r.univar.md:59:1 MD033/no-inline-html Inline HTML [Element: div]
raster/r.univar/r.univar.md:61:1 MD033/no-inline-html Inline HTML [Element: img]
raster/r.univar/r.univar.md:187:1 MD033/no-inline-html Inline HTML [Element: div]
...

Expected behavior

I wonder if we have to touch the ~170 HTML files manually to streamline the HTML figure code therein in order to eventually develop a single pandoc Lua filter.

Support welcome!

echoix · 2024-12-20T13:20:14Z

Or, if we want to keep things moving, add an exclusion for now. Is there a pattern that could be used, or would that be impossible?

It's ok to not have them perfect on the first try.

Test submission of conversion of all HTML manual pages to markdown using the `pandoc` based converter script (see OSGeo#4620). For figure code conversion issues, see OSGeo#4864

neteler · 2024-12-20T13:41:05Z

For easier inspection, converted MD files submitted in #4865.

ninsbl · 2024-12-23T09:24:29Z

Maybe this python library by Microsoft could be worth a try: https://github.com/microsoft/markitdown ?

echoix · 2024-12-23T12:52:48Z

> Maybe this python library by Microsoft could be worth a try: https://github.com/microsoft/markitdown ?

I didn't know about this one :)

ninsbl · 2024-12-23T23:36:59Z

I just tried the markitdown tool on v.fill.holes.html, and the result looks quite OK. Images are bigger compared to the pandoc conversion. However, pymarkdownlnt and markdownlint-cli, for example, complain about line length and missing blank lines (amongst others)... Also, code blocks are not automatically marked as shell... So some post-processing would be needed there too...

neteler · 2024-12-27T21:27:32Z

I tried it as well, but no success with e.g. this file:

cd raster3d/r3.to.rast/
cat r3.to.rast.html | markitdown  
Traceback (most recent call last):
  File "/home/mneteler/.local/bin/markitdown", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/mneteler/.local/lib/python3.12/site-packages/markitdown/__main__.py", line 38, in main
    result = markitdown.convert_stream(sys.stdin.buffer)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mneteler/.local/lib/python3.12/site-packages/markitdown/_markitdown.py", line 1142, in convert_stream
    result = self._convert(temp_path, extensions, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mneteler/.local/lib/python3.12/site-packages/markitdown/_markitdown.py", line 1260, in _convert
    raise UnsupportedFormatException(
markitdown._markitdown.UnsupportedFormatException: Could not convert '/home/mneteler/tmp/tmplrsg96v_' to Markdown. The formats [] are not supported.

What's the trick, @ninsbl ?

neteler · 2025-01-02T00:11:27Z

> I just tried the markitdown tool on v.fill.holes.html, and the result looks quite OK.

@ninsbl would you mind sharing the command you used?

ninsbl · 2025-01-06T23:28:28Z

OK, now I tested and managed to convert all HTML manual pages into markdown with markitdown.

Here is what I did from the root of the GRASS GIS source tree, writing the markdown files into a directory named md:

mkdir md
find . -type f -iname '*.html' | grep -v "x86_64-pc-linux-gnu" | awk -v FS='/' -v OFS='/' 'BEGIN{FS="/"}; {PATH=$0; MD=$NF} {gsub("html", "md", MD)};{print "if [ -s \"" PATH "\" ] ; then markitdown " PATH " | sed '"':a;/^[ \\\\n]*$/{\$d;N;ba}'"' | sed -E '"'s/([[:space:]]*)\`\`\`/\\\\n\`\`\`\\\\n/g'"' | sed '"'s/[[:blank:]]*$//g'"' | sed '"'/^$/{:a;N;s/\\\\n$//;ta}'"' > ./md/" MD "; else echo \"" MD " is empty\"; touch " MD "; fi\0"}' | xargs -0 -n 1 -P 12 bash -c

grep is used to exclude the compilation directory: dist.x86_64-pc-linux-gnu. Some HTML files (e.g.: test.rtree.lib.) are empty and cause markitdown to fail, so I had to handle them separately (if then fi).
Also, some files had multiple blank lines at the end, trailing whitespace, and other issues that caused fixing with pymarkdownlint to fail. Thus I inserted the sed sequences to make pymarkdownlint work. I also had to add newlines around fenced code using sed. This could be extended for issues like missing blank lines around headings (MD022) or missing language info for fenced code (MD040)...
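The sed clean-up steps described above could also be expressed as one small Python function, which may be easier to extend than the quoted one-liner. This is a rough sketch, not a port of the exact sed expressions: the name `tidy_markdown` is hypothetical, and it assumes that code fences always use three backticks. Unlike the sed version, it puts blank lines only outside a fenced block, not inside it.

```python
def tidy_markdown(text: str) -> str:
    """Strip trailing whitespace, collapse blank lines, isolate code fences."""
    out: list[str] = []
    in_code = False
    after_fence = False  # True right after a closing fence
    for raw in text.splitlines():
        line = raw.rstrip()  # also removes trailing tabs and spaces
        is_fence = line.lstrip().startswith("```")
        if in_code:
            # Inside a code block: keep lines (and blank lines) as they are.
            out.append(line)
            if is_fence:
                in_code, after_fence = False, True
            continue
        if line == "":
            # Keep at most one blank line, and none at the start of the file.
            if out and out[-1] != "":
                out.append("")
            after_fence = False
            continue
        # Blank line before an opening fence and after a closing fence.
        if (is_fence or after_fence) and out and out[-1] != "":
            out.append("")
        after_fence = False
        out.append(line)
        if is_fence:
            in_code = True
    # Drop blank lines at the end and finish with a single newline.
    while out and out[-1] == "":
        out.pop()
    return "\n".join(out) + "\n"
```

Note that stripping all trailing whitespace also removes intentional hard line breaks, since two or more trailing spaces mark a line break in markdown.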

And here is a summary of the warnings/errors that pymarkdownlint reports on the generated .md files:

pymarkdown scan md/ | cut -f2 -d " " | tr -d ':' | sort | uniq -c
      5 MD001
     29 MD004
      9 MD007
    736 MD010
      1 MD011
   1197 MD013
      1 MD018
   2308 MD022
     10 MD024
      3 MD025
     38 MD026
     11 MD029
      6 MD032
     50 MD033
      2 MD034
    198 MD036
    128 MD037
   2121 MD040
    597 MD041
    147 MD045
     11 MD046

Consecutive runs of `pymarkdown fix md` can reduce the number of markdown formatting issues to this:

pymarkdown scan md/ | cut -f2 -d " " | tr -d ':' | sort | uniq -c
      1 MD011 # Reversed link syntax
      1 MD012 # Multiple consecutive blank lines
   1192 MD013 # Line length
      1 MD018 # No space after hash on atx style heading
   2306 MD022 # Headings should be surrounded by blank lines
     10 MD024 # Multiple headings with the same content
      3 MD025 # Multiple top-level headings in the same document
     38 MD026 # Trailing punctuation in heading
      1 MD031 # Fenced code blocks should be surrounded by blank lines
      7 MD032 # Lists should be surrounded by blank lines
     50 MD033 # Inline HTML
      2 MD034 # Bare URL used
    198 MD036 # Emphasis used instead of a heading
   2134 MD040 # Fenced code blocks should have a language specified
    597 MD041 # First line in a file should be a top-level heading
    147 MD045 #  Images should have alternate text (alt text)
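For comparing such summaries between runs or between linters, the `cut | tr | sort | uniq -c` pipeline can be mirrored in a few lines of Python. This is only a sketch; the function name `tally_rules` is made up, and it assumes that each report line contains at most one rule ID of the form `MDnnn`, which holds for the markdownlint output quoted earlier in this issue.

```python
import re
from collections import Counter

# A rule ID such as MD013 or MD033, as printed by the markdown linters.
RULE_ID = re.compile(r"\bMD\d{3}\b")


def tally_rules(report: str) -> dict[str, int]:
    """Count how often each rule ID occurs in a linter report."""
    counts = Counter()
    for line in report.splitlines():
        match = RULE_ID.search(line)
        if match:
            counts[match.group()] += 1
    # Sort by rule ID, like `sort | uniq -c` does.
    return dict(sorted(counts.items()))
```

Feeding it the saved output of a scan before and after a fix run gives two dictionaries that can be diffed per rule.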

However, fixing with pymarkdown can introduce new issues and sometimes rules cause conflicts in the tool...

I did not check how those files look...

This is my setup:

Ubuntu 22.04.5 LTS
Python 3.10.12
markitdown 0.0.1a3
pymarkdown 0.9.26

echoix · 2025-01-06T23:42:04Z

Just to note that two trailing spaces (or more, but markdownlint has rules for that) have a meaning in markdown: they add a line break without starting a new paragraph.

ninsbl · 2025-01-07T07:33:10Z

Thanks for the note @echoix .

Now I noticed, pymarkdownlint has a --continue-on-error flag that can be used to apply automatic fixes as far as possible... So the sed command that removes double line breaks can be safely removed...

Do you have a suggestion for another markdown linter? pymarkdownlint (which I used here) changes files back and forth, which seems a little unstable...

neteler · 2025-01-07T09:03:40Z

Thanks @ninsbl!

I have taken r3.to.rast as an example to compare the result of your script above with the output of the proposed utils/grass_html2md.sh:

diff  --side-by-side --width=150 r3.to.rast.md ~/software/grass_main/raster3d/r3.to.rast/r3.to.rast.md
									  <
## DESCRIPTION									## DESCRIPTION

Converts one 3D raster map into several 2D raster maps (depends on dep	  |	Converts one 3D raster map into several 2D raster maps (depends on
If the 2D and 3D region settings are different, the 3D resolution will	  |	depths). If the 2D and 3D region settings are different, the 3D
adjusted to the 2D resolution (the depths are not touched).		  |	resolution will be adjusted to the 2D resolution (the depths are not
The user can force *r3.to.rast* to use the 2D resolution of the input	  |	touched). The user can force *r3.to.rast* to use the 2D resolution of
3D raster map for the output maps, independently from the current regi	  |	the input 3D raster map for the output maps, independently from the
![](r3.to.rast.png)							  |	current region settings.
									  >
									  >	<img src="r3.to.rast.png" data-border="0" />  

									  >	|                        |
									  >	|------------------------|
| *How r3.to.rast works* |							| *How r3.to.rast works* |
| --- |									  <

### Map type conversions							### Map type conversions

Type of resulting 2D raster maps is determined by the type of the	  |	Type of resulting 2D raster maps is determined by the type of the inpu
input 3D raster, i.e. 3D raster of type DCELL (double) will result in	  |	3D raster, i.e. 3D raster of type DCELL (double) will result in DCELL 
DCELL 2D rasters. A specific type for 2D rasters can be requested usin	  |	rasters. A specific type for 2D rasters can be requested using the
the **type** option.							  |	**type** option.
									  |
The **type** option is especially advantageous when the 3D raster	  |	The **type** option is especially advantageous when the 3D raster map
map stores categories (which need to be stored as floating point numbe	  |	stores categories (which need to be stored as floating point numbers)
and the 2D raster map should be also categorical, i.e. use integers.	  |	and the 2D raster map should be also categorical, i.e. use integers. T
The type is set to `CELL` in this case.					  |	type is set to `CELL` in this case.
### Modifying the values						  <

The values in the 3D raster map can be modified prior to storing in	  |	### Modifying the values
the 2D raster map. The values can be scaled using the option **multipl	  <
and a constant value can be added using the option **add**.		  <
The new value is computed using the following equation:			  <

```									  |	The values in the 3D raster map can be modified prior to storing in th
									  >	2D raster map. The values can be scaled using the option **multiply**
									  >	and a constant value can be added using the option **add**. The new
									  >	value is computed using the following equation:

									  >	```bash
y = ax + b									y = ax + b
									  <
```										```

where *x* is the original value, *a* is the value of			  |	where *x* is the original value, *a* is the value of **multiply**
**multiply** option, *b* is the value of **add** option,		  |	option, *b* is the value of **add** option, and *y* is the new value.
and *y* is the new value. When **multiply** is not provided,		  |	When **multiply** is not provided, the value of *a* is 1. When **add**
the value of *a* is 1. When **add** is not provided, the value		  |	is not provided, the value of *b* is 0.
of *b* is 0.								  |
## NOTES									## NOTES

Every slice of the 3D raster map is copied to one 2D raster map. The m	  |	Every slice of the 3D raster map is copied to one 2D raster map. The
are named like **output***\_slicenumber*. Slices are counted from bott	  |	maps are named like **output***\_slicenumber*. Slices are counted from
to the top, so the bottom slice has number 1.				  |	bottom to the top, so the bottom slice has number 1.

The number of slices is equal to the number of depths.				The number of slices is equal to the number of depths.

To round floating point values to integers when using `type=CELL`,	  |	To round floating point values to integers when using `type=CELL`, the
the **add** option should be set to 0.5.				  |	**add** option should be set to 0.5.
									  >
## SEE ALSO									## SEE ALSO

*[r3.cross.rast](r3.cross.rast.html),					  |	*[r3.cross.rast](r3.cross.rast.md), [r3.out.vtk](r3.out.vtk.md),
[r3.out.vtk](r3.out.vtk.html),						  |	[r3.out.ascii](r3.out.ascii.md), [g.region](g.region.md)*
[r3.out.ascii](r3.out.ascii.html),					  <
[g.region](g.region.html)*						  <
## AUTHORS								  <

Sören Gebbert								  |	## AUTHORS

Vaclav Petras, [NCSU GeoForAll Lab](https://geospatial.ncsu.edu/geofor	  |	Sören Gebbert  
									  >	Vaclav Petras, [NCSU GeoForAll
									  >	Lab](https://geospatial.ncsu.edu/geoforall/)

Differences:

- the grass_html2md.sh (pandoc based) conversion of figures is suboptimal (see the top of the diff above)
- the markitdown based result lacks an empty line before the section titles
- the markitdown based conversion of figures is suboptimal (the cause is the HTML code quality)
- the markitdown based conversion of code block fencing is still suboptimal
- the markitdown based conversion of local URLs isn't right (.html should be .md, see the "SEE ALSO" section)
- the markitdown based conversion treats <br> better in the "AUTHORS" section
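The local URL difference from the list above looks scriptable. Below is a minimal sketch, assuming that a link is "local" when its target has no URL scheme and does not start with `//`; the function name `html_links_to_md` is hypothetical. It rewrites `.html` to `.md` in such targets and leaves external links untouched.

```python
import re

# Matches "](target.html)" or "](target.html#anchor)" for targets without
# a URL scheme (http:, https:, mailto:, ...) and without a leading "//".
LOCAL_HTML_LINK = re.compile(
    r"\]\((?![a-zA-Z][a-zA-Z0-9+.-]*:|//)([^)\s#]+)\.html(#[^)\s]*)?\)"
)


def html_links_to_md(text: str) -> str:
    """Point local markdown links at .md files instead of .html files."""
    return LOCAL_HTML_LINK.sub(
        lambda m: f"]({m.group(1)}.md{m.group(2) or ''})", text
    )
```

Applied to the "SEE ALSO" section of r3.to.rast, this would turn `[g.region](g.region.html)` into `[g.region](g.region.md)`, matching the pandoc based output.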

Here the MD files for easier local comparison with e.g. meld:

ninsbl · 2025-01-07T09:41:59Z

You can visually inspect the results of the conversion here:

https://github.com/ninsbl/grass/blob/md_test/md

For selected manuals I added the image files:
https://github.com/ninsbl/grass/blob/md_test/md/v.fill.holes.md
https://github.com/ninsbl/grass/blob/md_test/md/r3.to.rast.md
https://github.com/ninsbl/grass/blob/md_test/md/v.to.rast3.md

Language for fenced code (assuming all is shell / sh) and linebreaks above headings could be addressed with a little script I guess.
Mending local URLs could be doable that way as well.
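Such a little script could look roughly like the following Python sketch for MD040 and MD022. Both function names are hypothetical, fences are assumed to use three backticks, and headings are assumed to be ATX style (`#` to `######`). Treating every unlabeled fence as `sh` is the simplification mentioned above and will mislabel blocks that are not shell, such as the `y = ax + b` formula in r3.to.rast.

```python
import re

# ATX heading: one to six "#" characters followed by whitespace.
HEADING = re.compile(r"^#{1,6}\s")


def add_fence_language(text: str, lang: str = "sh") -> str:
    """Add a language to opening code fences that have none (MD040)."""
    out, in_code = [], False
    for line in text.splitlines():
        if line.strip().startswith("```"):
            # Only opening fences without an info string get the language.
            if not in_code and line.strip() == "```":
                line = line.rstrip() + lang
            in_code = not in_code
        out.append(line)
    return "\n".join(out) + "\n"


def blank_lines_around_headings(text: str) -> str:
    """Surround headings outside code blocks with blank lines (MD022)."""
    lines = text.splitlines()
    out, in_code = [], False
    for i, line in enumerate(lines):
        if line.lstrip().startswith("```"):
            in_code = not in_code
        # "# comment" lines inside code blocks are not headings.
        is_heading = not in_code and bool(HEADING.match(line))
        if is_heading and out and out[-1] != "":
            out.append("")
        out.append(line)
        if is_heading and i + 1 < len(lines) and lines[i + 1] != "":
            out.append("")
    return "\n".join(out) + "\n"
```

Running both over each generated file, for example `add_fence_language(blank_lines_around_headings(text))`, should remove most MD022 and MD040 reports without touching the content of code blocks.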

Yet, some errors I guess require manual adjustment, like some of the inline-html (not sure if this should be done in the HTML files then)?...

@neteler what issues did you observe with the figure conversion?

And last but not least how should we proceed?

If we manage to fix

MD022 # Headings should be surrounded by blank lines
MD040 # Fenced code blocks should have a language specified
MD026 # Trailing punctuation in heading

with a script and if we ignore

MD013 # Line length
MD041 # First line in a file should be a top-level heading

and maybe

MD036 # Emphasis used instead of a heading

The remaining issues are not overwhelmingly many:
1 MD018 # No space after hash on atx style heading
10 MD024 # Multiple headings with the same content
3 MD025 # Multiple top-level headings in the same document
1 MD031 # Fenced code blocks should be surrounded by blank lines
7 MD032 # Lists should be surrounded by blank lines
50 MD033 # Inline HTML
2 MD034 # Bare URL used

ninsbl · 2025-01-07T10:06:26Z

Also markitdown does not convert <dl> elements very well (e.g. in grass.html / grass.md)

ninsbl · 2025-01-07T10:30:06Z

BTW: markitdown uses markdownify under the hood, with some adjustments:

class _CustomMarkdownify(markdownify.MarkdownConverter):

Maybe something worth considering?

neteler · 2025-01-07T11:12:50Z

> You can visually inspect the results of the conversion here:
>
> https://github.com/ninsbl/grass/blob/md_test/md
>
> For selected manuals I added the image files:
>
> https://github.com/ninsbl/grass/blob/md_test/md/v.fill.holes.md
>
> https://github.com/ninsbl/grass/blob/md_test/md/r3.to.rast.md
>
> https://github.com/ninsbl/grass/blob/md_test/md/v.to.rast3.md
>
> [...]
>
> @neteler what issues did you observe with the figure conversion?

Indeed, the markitdown based figure conversion looks much better than that of pandoc.

echoix · 2025-01-07T12:54:32Z

> Thanks for the note @echoix .
>
> Now I noticed, pymarkdownlint has a --continue-on-error flag that can be used to apply automatic fixes as far as possible... So the sed command that removes double line breaks can be safely removed...
>
> Do you have a suggestion for another markdown linter? pymarkdownlint (which I used here) changes files back and forth, which seems a little unstable...

In Megalinter, we have:

https://megalinter.io/latest/descriptors/markdown/

markdownlint
remark-lint
markdown-link-check
markdown-table-formatter

Try the first two?

Did you also try pymarkdown?

Also, you talked about links somewhere below. What I've mostly observed is that links are kept correct in the repo, with relative links referring to other markdown files as they exist in the repo, and the build tool that creates the HTML website adjusts the links accordingly when it processes them. But some places, I think the Microsoft docs, don't necessarily do that.

echoix · 2025-01-07T12:55:59Z

And what's the big deal of having the figures look right by using html inside markdown as a first iteration?

ninsbl · 2025-01-07T13:30:54Z

> And what's the big deal of having the figures look right by using html inside markdown as a first iteration?

My understanding was that figures did not look too good with inline HTML...

Also, conversion of dt elements seems to be an issue for files like grass.html; compare that to: https://github.com/ninsbl/grass/blob/md_test/md/grass.md where I had to use a custom version of markdownify to achieve something similar.

In general, I would suggest using markdownify directly instead of the markitdown wrapper, which gives us fewer options...

neteler added bug Something isn't working manual Documentation related issues HTML Related code is in HTML docs markdown labels Dec 20, 2024

neteler added this to the 8.5.0 milestone Dec 20, 2024

neteler self-assigned this Dec 20, 2024

neteler added this to GRASS Markdown Documentation Dec 20, 2024

neteler mentioned this issue Dec 20, 2024

[Bug] docs: remaining issues with markdown manual using mkdocs #4748

Open

15 tasks

neteler mentioned this issue Dec 20, 2024

manual: conversion of all HTML manual pages to markdown #4865

Draft

neteler mentioned this issue Dec 22, 2024

docs: script to convert HTML manual pages to markdown #4620

Open
