xslt-util/calstable/xpl and com.xmlcalabash conversion errors #41
Comments
This must be the infamous Open Source Entitlement hitting us finally.
Ha, sorry for being rude. But my beard length going down the hall entitles me to Level 4 open source entitlements when the wind blows from the east on Tuesdays. Workaround 1 helps isolate the input bug:
A LibreOffice select-all, copy, and paste performs some kind of normalization on the faulty nested-table object in the .docx, without destroying the variation in the rows and columns.
OpenOffice or LibreOffice might create OOXML (docx) structures in a legal yet unexpected way. The tool should (in the sense of: “we should make it so”, not in the sense of: “it should already be Ok”) convert tables saved by recent versions of LibreOffice correctly provided they are valid OOXML, so I think we will fix this soon.
I've reproduced the error closer to the source. This screenshot tells the story: the conversion of "CALS tables" to LaTeX tables fails because it doesn't handle variation in the number of columns or rows. The conversion error is asserted here: https://github.com/transpect/xslt-util/blob/74bb4f7d3c15b8649a71dfc55dae085ab6dfd38e/calstable/xsl/normalize.xsl. So now I can create an SSCCE using Microsoft Word, LibreOffice on Linux, and docx2tex:
Workaround 2: docx2tex can't handle Microsoft Word tables with an inconsistent number of columns and rows. If you must use them, a cleansing operation is to copy and paste those tables with LibreOffice Writer into a fresh LibreOffice document saved in .docx format. Then all is well.

This .docx is the smallest document I could make that shows the problem. It's just an empty Word document with a table containing an inconsistent number of rows: http://www.filedropper.com/ssccefordocx2tex

Microsoft Word has an option to join cells of a table horizontally on a row-by-row basis, whereas LibreOffice doesn't seem to allow me to do so. However, I can copy and paste such tables and the distinctions aren't destroyed; the copy/paste cleanses them. So maybe you can program in an auto-cleanse XSL.
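One way to check whether a table in a .docx really has rows of differing width is to sum the cell spans per row. The following is a minimal Python sketch, assuming the usual WordprocessingML layout (w:tbl, w:tr, w:tc, w:tcPr/w:gridSpan). The function names row_widths and uneven_tables are made up for illustration; this is not part of docx2tex, and it ignores w:gridBefore/w:gridAfter and rows wrapped in content controls.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def row_widths(tbl):
    """Grid width of each direct row of one w:tbl element.

    A cell counts as w:gridSpan columns, or 1 if no span is given.
    """
    widths = []
    for row in tbl.findall("w:tr", NS):
        width = 0
        for cell in row.findall("w:tc", NS):
            span = cell.find("w:tcPr/w:gridSpan", NS)
            width += int(span.get(f"{{{W}}}val")) if span is not None else 1
        widths.append(width)
    return widths


def uneven_tables(document_xml):
    """Return the row-width list of every table whose rows differ in width."""
    root = ET.fromstring(document_xml)
    result = []
    for tbl in root.iter(f"{{{W}}}tbl"):  # includes nested tables
        widths = row_widths(tbl)
        if len(set(widths)) > 1:
            result.append(widths)
    return result
```

To run it on a real file, the XML can be read with zipfile.ZipFile("ExampleFail.docx").read("word/document.xml") and passed to uneven_tables. A row whose merged cell carries a matching w:gridSpan has the same width as its neighbours, so it is not reported.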
Thanks for the repro. I don’t think it’s related to merged cells per se. It occurs when there are merged cells within nested tables. Investigating…
The error doesn’t occur if I revert to transpect/xslt-util@271dd78. So there seems to be a regression. It is caused by another fix that improved other aspects of CALS table normalization and that is not covered by any test yet, apparently.
I was able to resolve the first error (I have not pushed the commit yet). However, there are more fundamental reasons why both your sample files don’t compile.

The default mode of operation for docx2tex is to resolve embedded tables, that is, to add more columns and rows to the containing table so that the embedded table becomes part of the containing table. The outer table’s rows and columns will turn into merged cells. But this only works if the embedded table occupies a full cell of the containing table, with no paragraphs and/or other embedded tables in the same cell.

The alternative to resolving embedded tables is to keep them nested (there’s an option for this that is currently not exposed in The other sample file,

Since these errors don’t affect our daily production lines (that produce hundreds of thousands of pages per year), we are unlikely to look at them with high priority. However, we are constantly trying to improve the tool, and your examples are certainly helpful since they are demanding in terms of table nesting requirements, despite their small size.

I also tried the workaround, pasting the document into an empty LibreOffice document. But the embedded tables came out the same way. I’m using LO 6.0.0.3; maybe LO 5 flattened the tables while LO 6 keeps them nested. So this workaround is not working for me.

Let me stress again that colspans and rowspans are not the issue. They are supported in principle. The main problem is nested tables, but other issues have also shown up that are related to special characters and definition lists.
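The condition described above (an embedded table must fill its containing cell, with no paragraphs or further tables beside it) can be checked on an input file before conversion. The following is a minimal Python sketch, assuming the usual WordprocessingML element names (w:tc, w:tbl, w:p, w:t). The function names are hypothetical and this is not how docx2tex itself decides. Because a cell that holds a nested table normally still ends with an empty paragraph, only paragraphs that carry text are counted.

```python
import zipfile
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def find_problem_cells(document_xml):
    """Return (nested_table_count, problem_cell_count).

    A problem cell holds a nested table together with either another
    nested table or a paragraph that contains text.
    """
    root = ET.fromstring(document_xml)
    nested = 0
    problems = 0
    for cell in root.iter(f"{{{W}}}tc"):
        tables = cell.findall("w:tbl", NS)  # direct children only
        if not tables:
            continue
        nested += len(tables)
        has_text = any(
            "".join(t.text or "" for t in p.iter(f"{{{W}}}t")).strip()
            for p in cell.findall("w:p", NS)
        )
        if len(tables) > 1 or has_text:
            problems += 1
    return nested, problems


def check_docx(path):
    """Run the check on the main document part of a .docx file."""
    with zipfile.ZipFile(path) as archive:
        return find_problem_cells(archive.read("word/document.xml"))
```

Calling check_docx("ExampleFail.docx") would then give a quick hint whether a file contains nested tables that share a cell with other content, which is the case reported above as not resolvable.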
Thanks for the quick turnaround, that sounds right. The above workarounds handled my cases and I can tweak the input .docx to remove the bad table. Maybe a better error message would help future users realize the limitation more quickly, without having to trial-and-error input files. Maybe even a flag to aggressively flatten the offending table by string-joining its cells horizontally and vertically. I'd prefer a best-attempt result. Looks like the escape from Microsoft Island is not as easy as it sounds. No big surprise. 👍
Bug Report:
My OS: Linux Gentoo Base System release 2.24.1.12, 64-bit PC desktop
Java: 1.8.0_66
Shell: bash 4.3.42 (x86_64-pc-linux-gnu)
Install:
cd /home/el/bin; git clone https://github.com/transpect/docx2tex --recursive
The input docx has a few unicode shenanigans, but nothing too out of band: http://www.filedropper.com/examplefail
Run your code:
cd /home/el/bin/docx2tex; ./d2t ExampleFail.docx
Failure .log File: http://www.filedropper.com/examplefaild2t
What I expected: some kind of output file ExampleFail.tex containing LaTeX code.

Quarantining the bug, proving the bug isn't on my side:
Use LibreOffice Writer version 5.2.3.3 to create a new empty .docx document containing the ASCII text asdf.
Save the above file as Untitled.docx using the format Microsoft Word 2007-2013 XML (.docx).
LibreOffice Writer produces this Untitled.docx: http://www.filedropper.com/untitled_22
Run the code:
cd /home/el/bin/docx2tex; ./d2t Untitled.docx
docx2tex works as expected: the contents of Untitled.tex, rendered by pdflatex, give a similar-looking PDF.

The problem is in the table layouts.