[Bug] Conversion of HTML manual pages to markdown fails for HTML figure code #4864

Open
4 tasks
neteler opened this issue Dec 20, 2024 · 18 comments

@neteler
Member

neteler commented Dec 20, 2024

Describe the bug

I am working on the mass conversion of all HTML manual pages to markdown. To convert all HTML files to markdown I have written a pandoc based converter script (see #4620) which already does most of the job.

A showstopper in the conversion of HTML manual pages to markdown is the figures: the related HTML snippets vary from manual page to manual page, even though there is a style recommendation.

For an easier discussion, I have moved the figure issue here to separate it out from #4748.

Many figures look ugly after MD conversion (the resulting MD code is partially garbage):

  • v.fill.holes.html figures
  • v.to.rast3.html figure
  • ... many more
  • often the figure captions are not properly detected: mkdocs/site/raster3dintro.html

I have written a Lua filter for pandoc (not yet submitted), but it can only convert one specific HTML pattern. With so many HTML variants I have no idea how to handle them all.
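One way to cope with the many variants could be to ignore the wrapper markup and pull out only the image source and the nearest caption text. The sketch below uses Python's standard html.parser. It assumes, hypothetically, figures written as an img tag followed by an italic caption; the names FigureExtractor and figures_to_markdown are invented for illustration and are not part of the converter script.

```python
from html.parser import HTMLParser


class FigureExtractor(HTMLParser):
    """Collect [src, caption] pairs from <img> tags and the italic text after them."""

    def __init__(self):
        super().__init__()
        self.figures = []          # list of [src, caption]
        self._in_caption = False

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            self.figures.append([dict(attrs).get("src", ""), ""])
        elif tag in ("i", "em") and self.figures and not self.figures[-1][1]:
            # Assumption: the first italic run after an image is its caption.
            self._in_caption = True

    def handle_endtag(self, tag):
        if tag in ("i", "em"):
            self._in_caption = False

    def handle_data(self, data):
        if self._in_caption:
            self.figures[-1][1] += data


def figures_to_markdown(html):
    """Return one Markdown image (plus italic caption line) per <img> found."""
    parser = FigureExtractor()
    parser.feed(html)
    parser.close()
    lines = []
    for src, caption in parser.figures:
        caption = " ".join(caption.split())   # normalize whitespace
        lines.append(f"![{caption}]({src})")
        if caption:
            lines.append("")
            lines.append(f"*{caption}*")
    return "\n".join(lines)
```

The caption heuristic is deliberately naive: an image without a caption would pick up the next unrelated italic text, so pages would still need a visual check.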

To reproduce

I tried to submit the converted MD files for community review but I get stuck in the pre-commit stage:

From my terminal:

markdownlint-fix.........................................................Failed
- hook id: markdownlint-fix
- exit code: 1
- files were modified by this hook
display/d.rast/d.rast.md:14:1 MD033/no-inline-html Inline HTML [Element: div]
display/d.rast/d.rast.md:16:1 MD033/no-inline-html Inline HTML [Element: img]
display/d.rast/d.rast.md:29:1 MD033/no-inline-html Inline HTML [Element: div]
display/d.rast/d.rast.md:31:1 MD033/no-inline-html Inline HTML [Element: img]
display/d.rast/d.rast.md:43:1 MD033/no-inline-html Inline HTML [Element: div]
display/d.rast/d.rast.md:45:1 MD033/no-inline-html Inline HTML [Element: img]
gui/wxpython/docs/wxGUI.toolboxes.md:180:1 MD033/no-inline-html Inline HTML [Element: img]
gui/wxpython/timeline/g.gui.timeline.md:14:1 MD033/no-inline-html Inline HTML [Element: img]
raster/r.li/r.li.cwed/r.li.cwed.md:12:6 MD033/no-inline-html Inline HTML [Element: span]
raster/r.li/r.li.cwed/r.li.cwed.md:12:26 MD033/no-inline-html Inline HTML [Element: span]
raster/r.li/r.li.cwed/r.li.cwed.md:14:6 MD033/no-inline-html Inline HTML [Element: span]
raster/r.li/r.li.cwed/r.li.cwed.md:14:26 MD033/no-inline-html Inline HTML [Element: span]
raster/r.li/r.li.cwed/r.li.cwed.md:21:1 MD033/no-inline-html Inline HTML [Element: span]
raster/r.li/r.li.mpa/r.li.mpa.md:10:6 MD033/no-inline-html Inline HTML [Element: span]
raster/r.li/r.li.mpa/r.li.mpa.md:10:26 MD033/no-inline-html Inline HTML [Element: span]
raster/r.path/r.path.md:122:1 MD033/no-inline-html Inline HTML [Element: div]
raster/r.path/r.path.md:124:2 MD033/no-inline-html Inline HTML [Element: img]
raster/r.path/r.path.md:176:1 MD033/no-inline-html Inline HTML [Element: div]
raster/r.path/r.path.md:178:2 MD033/no-inline-html Inline HTML [Element: img]
raster/r.resamp.filter/r.resamp.filter.md:98:1 MD033/no-inline-html Inline HTML [Element: div]
raster/r.resamp.filter/r.resamp.filter.md:100:1 MD033/no-inline-html Inline HTML [Element: img]
raster/r.sim/r.sim.water/r.sim.water.md:30:1 MD033/no-inline-html Inline HTML [Element: div]
raster/r.sim/r.sim.water/r.sim.water.md:32:1 MD033/no-inline-html Inline HTML [Element: img]
raster/r.sim/r.sim.water/r.sim.water.md:154:81 MD013/line-length Line length [Expected: 80; Actual: 147]
raster/r.sim/r.sim.water/r.sim.water.md:168:81 MD013/line-length Line length [Expected: 80; Actual: 95]
raster/r.sim/r.sim.water/r.sim.water.md:175:1 MD033/no-inline-html Inline HTML [Element: div]
raster/r.sim/r.sim.water/r.sim.water.md:177:1 MD033/no-inline-html Inline HTML [Element: img]
raster/r.sunmask/r.sunmask.md:89:81 MD013/line-length Line length [Expected: 80; Actual: 96]
raster/r.univar/r.univar.md:59:1 MD033/no-inline-html Inline HTML [Element: div]
raster/r.univar/r.univar.md:61:1 MD033/no-inline-html Inline HTML [Element: img]
raster/r.univar/r.univar.md:187:1 MD033/no-inline-html Inline HTML [Element: div]
...
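To see which elements dominate the MD033 findings, the hook output above can be tallied per rule and element. This is only a sketch: the regular expression is derived from the lines shown here, and the function name summarize is made up for this example.

```python
import re
from collections import Counter

# Matches lines such as:
# display/d.rast/d.rast.md:14:1 MD033/no-inline-html Inline HTML [Element: div]
LINE_RE = re.compile(
    r"^(?P<path>\S+?):(?P<line>\d+)(?::(?P<col>\d+))? "
    r"(?P<rule>MD\d{3})/\S+ .*?(?:\[Element: (?P<element>\w+)\])?$"
)


def summarize(log_text):
    """Count findings per (rule, element); element is None when no element is reported."""
    counts = Counter()
    for line in log_text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            counts[(match["rule"], match["element"])] += 1
    return counts
```

Feeding it the full hook output would show how many findings are div/img figure wrappers versus span markup, which may help decide on an exclusion pattern.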

Expected behavior

I wonder if we have to touch the ~170 HTML files manually to streamline the HTML figure code therein in order to eventually develop a single pandoc Lua filter.

Support welcome!

@neteler neteler added bug Something isn't working manual Documentation related issues HTML Related code is in HTML docs markdown labels Dec 20, 2024
@neteler neteler added this to the 8.5.0 milestone Dec 20, 2024
@neteler neteler self-assigned this Dec 20, 2024
@echoix
Member

echoix commented Dec 20, 2024

Or, if we want to keep things moving, add an exclusion for now. Is there a pattern that could be used, or would that be impossible?

It's ok to not have them perfect on the first try.

neteler added a commit to neteler/grass that referenced this issue Dec 20, 2024
Test submission of conversion of all HTML manual pages to markdown using the `pandoc` based converter script (see OSGeo#4620).

For figure code conversion issues, see OSGeo#4864
@neteler
Member Author

neteler commented Dec 20, 2024

For easier inspection, converted MD files submitted in #4865.

@ninsbl
Member

ninsbl commented Dec 23, 2024

Maybe this python library by Microsoft could be worth a try: https://github.com/microsoft/markitdown ?

@echoix
Member

echoix commented Dec 23, 2024

Maybe this python library by Microsoft could be worth a try: https://github.com/microsoft/markitdown ?

I didn't know about this one :)

@ninsbl
Member

ninsbl commented Dec 23, 2024

I just tried the markitdown tool on v.fill.holes.html, and the result looks quite OK. Images are bigger compared to the pandoc conversion. However, pymarkdownlnt and markdownlint-cli, for example, complain about line length and missing blank lines (amongst others)... Also, code blocks are not automatically marked as shell... So some post-processing would be needed there too...

@neteler
Member Author

neteler commented Dec 27, 2024

I tried it as well, but no success with e.g. this file:

cd raster3d/r3.to.rast/
cat r3.to.rast.html | markitdown  
Traceback (most recent call last):
  File "/home/mneteler/.local/bin/markitdown", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/mneteler/.local/lib/python3.12/site-packages/markitdown/__main__.py", line 38, in main
    result = markitdown.convert_stream(sys.stdin.buffer)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mneteler/.local/lib/python3.12/site-packages/markitdown/_markitdown.py", line 1142, in convert_stream
    result = self._convert(temp_path, extensions, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mneteler/.local/lib/python3.12/site-packages/markitdown/_markitdown.py", line 1260, in _convert
    raise UnsupportedFormatException(
markitdown._markitdown.UnsupportedFormatException: Could not convert '/home/mneteler/tmp/tmplrsg96v_' to Markdown. The formats [] are not supported.

What's the trick, @ninsbl ?

@neteler
Member Author

neteler commented Jan 2, 2025

I just tried the markitdown tool on v.fill.holes.html, and the result looks quite OK.

@ninsbl would you mind sharing the command you used?

@ninsbl
Member

ninsbl commented Jan 6, 2025

OK, now I tested and managed to convert all HTML manual pages into markdown with markitdown.

Here is what I did from the root of the GRASS GIS source tree, writing the markdown files into a directory named md:

mkdir md
find . -type f -iname "*.html" | grep -v "x86_64-pc-linux-gnu" | awk -v FS='/' -v OFS='/' 'BEGIN{FS="/"}; {PATH=$0; MD=$NF} {gsub("html", "md", MD)};{print "if [ -s \"" PATH "\" ] ; then markitdown " PATH " | sed '"':a;/^[ \\\\n]*$/{\$d;N;ba}'"' | sed -E '"'s/([[:space:]]*)\`\`\`/\\\\n\`\`\`\\\\n/g'"' | sed '"'s/[[:blank:]]*$//g'"' | sed '"'/^$/{:a;N;s/\\\\n$//;ta}'"' > ./md/" MD "; else echo \"" MD " is empty\"; touch " MD "; fi\0"}' | xargs -0 -n 1 -P 12 bash -c

grep is used to exclude the compilation directory: dist.x86_64-pc-linux-gnu. Some HTML files (e.g.: test.rtree.lib.) are empty and cause markitdown to fail, so I had to handle them separately (if then fi).
Some files also had multiple blank lines at the end, trailing whitespace and other issues that caused fixing with pymarkdownlnt to fail, so I inserted the sed sequences to make pymarkdownlnt work. I also had to add newlines around fenced code blocks using sed. This could be extended to issues such as missing blank lines around headings (MD022) or missing language info for fenced code (MD040)...
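The extension mentioned here (MD022 and MD040, plus blank lines around fences) could also be done in one small Python pass instead of more sed stages. The following is a rough sketch, not the script used above: the function name postprocess and the default language sh are assumptions, and it only understands ATX headings and triple-backtick fences.

```python
import re

FENCE_RE = re.compile(r"^\s*```(.*)$")
HEADING_RE = re.compile(r"^#{1,6} ")


def postprocess(markdown, default_lang="sh"):
    """Blank lines around headings (MD022) and fences (MD031), label bare fences (MD040)."""
    out = []
    in_code = False
    need_blank = False  # previous element requires a blank line after it

    def ensure_blank_before():
        if out and out[-1] != "":
            out.append("")

    for line in markdown.splitlines():
        fence = FENCE_RE.match(line)
        if fence:
            if not in_code:
                ensure_blank_before()
                # Assumption: unlabeled blocks contain shell commands.
                out.append("```" + (fence.group(1).strip() or default_lang))
            else:
                out.append("```")
                need_blank = True
            in_code = not in_code
            continue
        if in_code:
            out.append(line)  # never touch code content
            continue
        if not line.strip():
            # Collapse runs of blank lines (MD012).
            if out and out[-1] != "":
                out.append("")
            need_blank = False
            continue
        if need_blank:
            out.append("")
            need_blank = False
        if HEADING_RE.match(line):
            ensure_blank_before()
            need_blank = True
        # Trailing spaces are kept: two of them encode a hard line break.
        out.append(line)
    return "\n".join(out).strip("\n") + "\n"
```

Running the function on its own output leaves the text unchanged, which avoids the back-and-forth edits that automatic fixers sometimes produce.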

And here is a summary of the warnings/errors that pymarkdownlnt reports on the generated .md files:

pymarkdown scan md/ | cut -f2 -d " " | tr -d ':' | sort | uniq -c
      5 MD001
     29 MD004
      9 MD007
    736 MD010
      1 MD011
   1197 MD013
      1 MD018
   2308 MD022
     10 MD024
      3 MD025
     38 MD026
     11 MD029
      6 MD032
     50 MD033
      2 MD034
    198 MD036
    128 MD037
   2121 MD040
    597 MD041
    147 MD045
     11 MD046

Consecutive runs of pymarkdown fix md can reduce the number of markdown formatting issues to this:

pymarkdown scan md/ | cut -f2 -d " " | tr -d ':' | sort | uniq -c
      1 MD011 # Reversed link syntax
      1 MD012 # Multiple consecutive blank lines
   1192 MD013 # Line length
      1 MD018 # No space after hash on atx style heading
   2306 MD022 # Headings should be surrounded by blank lines
     10 MD024 # Multiple headings with the same content
      3 MD025 # Multiple top-level headings in the same document
     38 MD026 # Trailing punctuation in heading
      1 MD031 # Fenced code blocks should be surrounded by blank lines
      7 MD032 # Lists should be surrounded by blank lines
     50 MD033 # Inline HTML
      2 MD034 # Bare URL used
    198 MD036 # Emphasis used instead of a heading
   2134 MD040 # Fenced code blocks should have a language specified
    597 MD041 # First line in a file should be a top-level heading
    147 MD045 #  Images should have alternate text (alt text)

However, fixing with pymarkdown can introduce new issues and sometimes rules cause conflicts in the tool...

I did not check how those files look...

This is my setup:

Ubuntu 22.04.5 LTS
Python 3.10.12
markitdown 0.0.1a3
pymarkdown 0.9.26

@echoix
Member

echoix commented Jan 6, 2025

Just to note that two (or more, but markdownlint has rules for that) trailing spaces have a meaning in markdown: they add a line break, so that the text does not wrap into the same paragraph.

@ninsbl
Member

ninsbl commented Jan 7, 2025

Thanks for the note @echoix .

Now I noticed that pymarkdownlnt has a --continue-on-error flag that can be used to apply automatic fixes as far as possible... So the sed command that removes double line breaks can be safely removed...

Do you have a suggestion for another markdown linter? pymarkdownlnt (which I used here) changes files back and forth, which seems a little unstable...

@neteler
Member Author

neteler commented Jan 7, 2025

Thanks @ninsbl!

I have taken r3.to.rast as an example to compare the result of your script above with the output of the proposed utils/grass_html2md.sh:

diff  --side-by-side --width=150 r3.to.rast.md ~/software/grass_main/raster3d/r3.to.rast/r3.to.rast.md
									  <
## DESCRIPTION									## DESCRIPTION

Converts one 3D raster map into several 2D raster maps (depends on dep	  |	Converts one 3D raster map into several 2D raster maps (depends on
If the 2D and 3D region settings are different, the 3D resolution will	  |	depths). If the 2D and 3D region settings are different, the 3D
adjusted to the 2D resolution (the depths are not touched).		  |	resolution will be adjusted to the 2D resolution (the depths are not
The user can force *r3.to.rast* to use the 2D resolution of the input	  |	touched). The user can force *r3.to.rast* to use the 2D resolution of
3D raster map for the output maps, independently from the current regi	  |	the input 3D raster map for the output maps, independently from the
![](r3.to.rast.png)							  |	current region settings.
									  >
									  >	<img src="r3.to.rast.png" data-border="0" />  

									  >	|                        |
									  >	|------------------------|
| *How r3.to.rast works* |							| *How r3.to.rast works* |
| --- |									  <

### Map type conversions							### Map type conversions

Type of resulting 2D raster maps is determined by the type of the	  |	Type of resulting 2D raster maps is determined by the type of the inpu
input 3D raster, i.e. 3D raster of type DCELL (double) will result in	  |	3D raster, i.e. 3D raster of type DCELL (double) will result in DCELL 
DCELL 2D rasters. A specific type for 2D rasters can be requested usin	  |	rasters. A specific type for 2D rasters can be requested using the
the **type** option.							  |	**type** option.
									  |
The **type** option is especially advantageous when the 3D raster	  |	The **type** option is especially advantageous when the 3D raster map
map stores categories (which need to be stored as floating point numbe	  |	stores categories (which need to be stored as floating point numbers)
and the 2D raster map should be also categorical, i.e. use integers.	  |	and the 2D raster map should be also categorical, i.e. use integers. T
The type is set to `CELL` in this case.					  |	type is set to `CELL` in this case.
### Modifying the values						  <

The values in the 3D raster map can be modified prior to storing in	  |	### Modifying the values
the 2D raster map. The values can be scaled using the option **multipl	  <
and a constant value can be added using the option **add**.		  <
The new value is computed using the following equation:			  <

```									  |	The values in the 3D raster map can be modified prior to storing in th
									  >	2D raster map. The values can be scaled using the option **multiply**
									  >	and a constant value can be added using the option **add**. The new
									  >	value is computed using the following equation:

									  >	```bash
y = ax + b									y = ax + b
									  <
```										```

where *x* is the original value, *a* is the value of			  |	where *x* is the original value, *a* is the value of **multiply**
**multiply** option, *b* is the value of **add** option,		  |	option, *b* is the value of **add** option, and *y* is the new value.
and *y* is the new value. When **multiply** is not provided,		  |	When **multiply** is not provided, the value of *a* is 1. When **add**
the value of *a* is 1. When **add** is not provided, the value		  |	is not provided, the value of *b* is 0.
of *b* is 0.								  |
## NOTES									## NOTES

Every slice of the 3D raster map is copied to one 2D raster map. The m	  |	Every slice of the 3D raster map is copied to one 2D raster map. The
are named like **output***\_slicenumber*. Slices are counted from bott	  |	maps are named like **output***\_slicenumber*. Slices are counted from
to the top, so the bottom slice has number 1.				  |	bottom to the top, so the bottom slice has number 1.

The number of slices is equal to the number of depths.				The number of slices is equal to the number of depths.

To round floating point values to integers when using `type=CELL`,	  |	To round floating point values to integers when using `type=CELL`, the
the **add** option should be set to 0.5.				  |	**add** option should be set to 0.5.
									  >
## SEE ALSO									## SEE ALSO

*[r3.cross.rast](r3.cross.rast.html),					  |	*[r3.cross.rast](r3.cross.rast.md), [r3.out.vtk](r3.out.vtk.md),
[r3.out.vtk](r3.out.vtk.html),						  |	[r3.out.ascii](r3.out.ascii.md), [g.region](g.region.md)*
[r3.out.ascii](r3.out.ascii.html),					  <
[g.region](g.region.html)*						  <
## AUTHORS								  <

Sören Gebbert								  |	## AUTHORS

Vaclav Petras, [NCSU GeoForAll Lab](https://geospatial.ncsu.edu/geofor	  |	Sören Gebbert  
									  >	Vaclav Petras, [NCSU GeoForAll
									  >	Lab](https://geospatial.ncsu.edu/geoforall/)

Differences:

  • the grass_html2md.sh (pandoc based) conversion of figures is suboptimal (see the top of the diff above)
  • the markitdown based result lacks an empty line before the section titles
  • the markitdown based conversion of figures is suboptimal (the cause is the HTML code quality)
  • the markitdown based code block fencing is still suboptimal
  • the markitdown based conversion of local URLs isn't right (.html should be .md, see "SEE ALSO" section)
  • the markitdown based conversion treats <br> better in the "AUTHORS" section

Here are the MD files for easier local comparison with e.g. meld:

@ninsbl
Member

ninsbl commented Jan 7, 2025

You can visually inspect the results of the conversion here:

https://github.com/ninsbl/grass/blob/md_test/md

For selected manuals I added the image files:
https://github.com/ninsbl/grass/blob/md_test/md/v.fill.holes.md
https://github.com/ninsbl/grass/blob/md_test/md/r3.to.rast.md
https://github.com/ninsbl/grass/blob/md_test/md/v.to.rast3.md

The language for fenced code (assuming all of it is shell / sh) and line breaks above headings could be addressed with a little script, I guess.
Mending local URLs could be doable that way as well.
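For the local URLs, a script along these lines might be enough. It is a sketch under assumptions: only inline links of the form [text](target) are handled, links inside code blocks are not protected, and the name fix_local_links is made up for this example.

```python
import re

# Inline Markdown link: [text](target); the target holds no spaces or ")".
LINK_RE = re.compile(r"(\[[^\]]*\]\()([^)\s]+)(\))")
SCHEME_RE = re.compile(r"^[a-zA-Z][a-zA-Z0-9+.-]*:")


def fix_local_links(markdown):
    """Rewrite local *.html link targets to *.md, leaving absolute URLs untouched."""

    def repl(match):
        target = match.group(2)
        if SCHEME_RE.match(target) or target.startswith("//"):
            return match.group(0)  # external link such as https://...
        path, sep, fragment = target.partition("#")
        if path.endswith(".html"):
            path = path[: -len(".html")] + ".md"
        return match.group(1) + path + sep + fragment + match.group(3)

    return LINK_RE.sub(repl, markdown)
```

Applied to the "SEE ALSO" section of r3.to.rast, this would turn r3.cross.rast.html into r3.cross.rast.md while keeping external links as they are.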

Yet some errors, I guess, require manual adjustment, like some of the inline HTML (not sure whether this should then be done in the HTML files)...

@neteler what issues did you observe with the figure conversion?

And last but not least how should we proceed?

If we manage to fix

  • MD022 # Headings should be surrounded by blank lines
  • MD040 # Fenced code blocks should have a language specified
  • MD026 # Trailing punctuation in heading

with a script and if we ignore

  • MD013 # Line length
  • MD041 # First line in a file should be a top-level heading

and maybe

  • MD036 # Emphasis used instead of a heading

The remaining issues are not overwhelmingly many:
1 MD018 # No space after hash on atx style heading
10 MD024 # Multiple headings with the same content
3 MD025 # Multiple top-level headings in the same document
1 MD031 # Fenced code blocks should be surrounded by blank lines
7 MD032 # Lists should be surrounded by blank lines
50 MD033 # Inline HTML
2 MD034 # Bare URL used
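As a cross-check of this proposal, the numbers from the second pymarkdown summary above can be tallied in a few lines. The variable names are only illustrative; the counts are copied from that summary.

```python
# Counts copied from the pymarkdown summary after running "pymarkdown fix".
counts = {
    "MD011": 1, "MD012": 1, "MD013": 1192, "MD018": 1, "MD022": 2306,
    "MD024": 10, "MD025": 3, "MD026": 38, "MD031": 1, "MD032": 7,
    "MD033": 50, "MD034": 2, "MD036": 198, "MD040": 2134, "MD041": 597,
    "MD045": 147,
}
fixed_by_script = {"MD022", "MD040", "MD026"}
ignored = {"MD013", "MD041", "MD036"}

remaining = {
    rule: n for rule, n in counts.items()
    if rule not in fixed_by_script | ignored
}
listed = {"MD018", "MD024", "MD025", "MD031", "MD032", "MD033", "MD034"}

print(sum(remaining[rule] for rule in listed))  # 74
print(sorted(set(remaining) - listed))          # ['MD011', 'MD012', 'MD045']
```

The seven listed rules add up to 74 findings. The tally also shows MD011, MD012 and MD045 (147 images without alt text) still present in the summary, so those would need a decision as well.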

@ninsbl
Member

ninsbl commented Jan 7, 2025

Also markitdown does not convert <dl> elements very well (e.g. in grass.html / grass.md)
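For illustration, definition lists could be handled separately with Python's standard html.parser. This sketch is not what markitdown or markdownify do: the class name DefinitionListConverter is invented, inline markup inside terms is flattened to plain text, and the output uses the term / ":   definition" syntax of the Python-Markdown def_list extension, which is an assumption about the target site setup.

```python
from html.parser import HTMLParser


class DefinitionListConverter(HTMLParser):
    """Turn <dt>term</dt><dd>text</dd> pairs into 'term' and ':   text' lines."""

    def __init__(self):
        super().__init__()
        self.lines = []
        self._tag = None    # "dt" or "dd" while inside one of them
        self._buf = []
        self._last = None   # kind of the last emitted line

    def _flush(self):
        text = " ".join("".join(self._buf).split())
        if text and self._tag == "dt":
            if self._last == "dd":
                self.lines.append("")   # blank line between entries
            self.lines.append(text)
            self._last = "dt"
        elif text and self._tag == "dd":
            self.lines.append(":   " + text)
            self._last = "dd"
        self._tag, self._buf = None, []

    def handle_starttag(self, tag, attrs):
        if tag in ("dt", "dd"):
            self._flush()   # HTML allows omitting </dt> and </dd>
            self._tag = tag

    def handle_endtag(self, tag):
        if tag in ("dt", "dd", "dl"):
            self._flush()

    def handle_data(self, data):
        if self._tag:
            self._buf.append(data)


def dl_to_markdown(html):
    """Convert the definition lists found in an HTML fragment."""
    parser = DefinitionListConverter()
    parser.feed(html)
    parser.close()
    parser._flush()
    return "\n".join(parser.lines)
```

Nested lists and block content inside dd elements are not covered, so pages like grass.html would still need review after such a pass.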

@ninsbl
Member

ninsbl commented Jan 7, 2025

BTW: markitdown uses markdownify under the hood, with some adjustments:

class _CustomMarkdownify(markdownify.MarkdownConverter):

Maybe something worth considering?

@neteler
Member Author

neteler commented Jan 7, 2025

You can visually inspect the results of the conversion here:

https://github.com/ninsbl/grass/blob/md_test/md

For selected manuals I added the image files:

[...]

@neteler what issues did you observe with the figure conversion?

Indeed, the markitdown based figure conversion looks much better than that of pandoc.

@echoix
Member

echoix commented Jan 7, 2025

Thanks for the note @echoix .

Now I noticed that pymarkdownlnt has a --continue-on-error flag that can be used to apply automatic fixes as far as possible... So the sed command that removes double line breaks can be safely removed...

Do you have a suggestion for another markdown linter? pymarkdownlnt (which I used here) changes files back and forth, which seems a little unstable...

In Megalinter, we have:

https://megalinter.io/latest/descriptors/markdown/

  • markdownlint
  • remark-lint
  • markdown-link-check
  • markdown-table-formatter

Try the first two?

Did you also try pymarkdown?

Also, you talked about links somewhere below. What I've observed is that links are mostly kept correct in the repo, with relative links referring to other markdown files as they exist in the repo, and the build tool that creates the HTML website adjusts the links accordingly when it processes them. But some places, I think the Microsoft docs, don't necessarily do that.

@echoix
Member

echoix commented Jan 7, 2025

And what's the big deal of having the figures look right by using html inside markdown as a first iteration?

@ninsbl
Member

ninsbl commented Jan 7, 2025

And what's the big deal of having the figures look right by using html inside markdown as a first iteration?

My understanding was that figures did not look too well with inline HTML...

Also, conversion of dt elements seems to be an issue for files like grass.html; compare that to: https://github.com/ninsbl/grass/blob/md_test/md/grass.md where I had to use a custom version of markdownify to achieve something similar.

In general, I would suggest using markdownify directly instead of the markitdown wrapper, which gives us fewer options...

3 participants