Problems with Equations when converting HTML to DOCX using Pandoc #10517

Open
arthmbarbosa opened this issue Jan 9, 2025 · 1 comment

Problem Description

When converting HTML files to DOCX using Pandoc, mathematical equations inserted in the content undergo unexpected changes:

  1. Equations in the Middle of Paragraphs: The equations are moved to the end of the paragraph, even when they should be integrated into the text.

  2. Equations with <mo> at the Root: When the <mo> tag is at the root of the MathML structure, the elements of the equation are rearranged, resulting in changes in the order of the components or displacements.

The command used for the conversion is:

pandoc -i ${htmlFilePath} -o ${outputFilePath} --mathml

These behaviors compromise the semantic and visual integrity of the generated document, especially in content with mathematical formulas that need to be in the context of the text.


Steps to Reproduce

  1. Create an HTML file containing paragraphs with embedded MathML equations.

Example:

<p>This is an equation: <math><msup><mi>x</mi><mn>2</mn></msup></math> in the middle of the text.</p>
<p>Another equation: <math><mo>=</mo><mn>5</mn></math>.</p>

  2. Convert the HTML file to DOCX using the command:

pandoc -i input.html -o output.docx --mathml

  3. Open the generated DOCX file and notice:
  • The equations in the middle of the paragraphs have been moved to the end.
  • The equations with <mo> at the root have been rearranged.

Expected Behavior

  • Equations should remain in the position they were inserted in the HTML.

  • The structure of MathML equations should be preserved in the DOCX document, without unexpected shifts or rearrangements.


Current Behavior

  • Equations in the middle of paragraphs are shifted to the end of the paragraph in the DOCX.

  • Equations containing the <mo> tag in the root are rearranged incorrectly, compromising the order of the elements.


Example HTML File

<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>Example</title>
</head>
<body>
<p>This is an equation: <math><msup><mi>x</mi><mn>2</mn></msup></math> in the middle of the text.</p>
<p>Another equation: <math><mo>=</mo><mn>5</mn></math>.</p>
</body>
</html>

Environment

  • Pandoc Version: 3.6.1

Additional Information

  • The issue seems to be related to the processing of MathML tags during conversion to DOCX.
  • The --mathml flag was used to maintain support for equations in MathML format.

jgm (Owner) commented Jan 9, 2025

Note that --mathml only affects HTML output, so you can leave that out.
pandoc -f html -o test.docx with input

<p>This is an equation: <math><msup><mi>x</mi><mn>2</mn></msup></math> in the middle of the text.</p>
<p>Another equation: <math><mo>=</mo><mn>5</mn></math>.</p>

yields a Word file that looks like this:

[screenshot of the generated Word file]

That looks correct to me. Are you seeing something different?
What are you using to view the docx file?
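
Since the rendering may depend on the viewer, one way to check the file itself is to read word/document.xml from the DOCX archive and list, for each paragraph, the order of its text runs and math elements. The following is a minimal sketch using only the Python standard library, not part of pandoc; the function name paragraph_sequences is hypothetical, and it assumes the math is stored as OMML (m:oMath or m:oMathPara) directly inside the paragraph element.

```python
import zipfile
import xml.etree.ElementTree as ET

# Standard OOXML namespaces for WordprocessingML and Office Math (OMML).
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
M = "http://schemas.openxmlformats.org/officeDocument/2006/math"


def paragraph_sequences(docx_path):
    """Return one list per paragraph with its direct children in document order.

    Each entry is ("text", run_text) or ("math", concatenated_math_text).
    Hypothetical helper: only direct children of w:p are inspected, so runs
    nested in hyperlinks or other wrappers are skipped.
    """
    with zipfile.ZipFile(docx_path) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))

    paragraphs = []
    for p in root.iter(f"{{{W}}}p"):
        seq = []
        for child in p:
            if child.tag == f"{{{W}}}r":
                # Ordinary text run: join its w:t pieces.
                text = "".join(t.text or "" for t in child.iter(f"{{{W}}}t"))
                if text:
                    seq.append(("text", text))
            elif child.tag in (f"{{{M}}}oMath", f"{{{M}}}oMathPara"):
                # Math element: join its m:t pieces to show component order.
                math = "".join(t.text or "" for t in child.iter(f"{{{M}}}t"))
                seq.append(("math", math))
        paragraphs.append(seq)
    return paragraphs
```

Calling paragraph_sequences("output.docx") on the converted file and printing the result shows where each equation sits. If a "math" entry appears between two "text" entries, the equation is inline in the file, and any displacement seen on screen would come from the viewer rather than from the conversion.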
