Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Differing output when providing output_file argument to render() #2264

Open
5 tasks done
grimbough opened this issue Dec 21, 2021 · 6 comments
Open
5 tasks done

Differing output when providing output_file argument to render() #2264

grimbough opened this issue Dec 21, 2021 · 6 comments
Labels
bug an unexpected problem or unintended behavior next to consider for next release

Comments

@grimbough
Copy link

grimbough commented Dec 21, 2021

Checklist

When filing a bug report, please check the boxes below to confirm that you have provided us with the information we need. Have you:

  • formatted your issue so it is easier for us to read?

  • included a minimal, self-contained, and reproducible example?

  • pasted the output from xfun::session_info('rmarkdown') in your issue?

  • upgraded all your packages to their latest versions (including your versions of R, the RStudio IDE, and relevant R packages)?

  • installed and tested your bug with the development version of the rmarkdown package using remotes::install_github("rstudio/rmarkdown")?


This issue concerns the BiocStyle package and producing a PDF vignette from R Markdown. This issue has initially been tracked in Bioconductor/BiocStyle#93 I'm happy to incorporate necessary changes into BiocStyle, but wanted to report the unexpected behaviour here.

Consider the following minimal Rmd file (referred to as test.Rmd below) using the BiocStyle::pdf_document output format:

---
title: "Vignette Title"
output:
  BiocStyle::pdf_document:
    keep_tex: true
---

Using the `images = "lowres"` is fine.
rmarkdown::render("test.Rmd")
#> ...
#> Output created: test.pdf

However it fails when supply an output_file argument. (In the example below I've used test1.pdf to distinguish between files, but it fails with the same error even if you use is the same as the default i.e. test.pdf).

rmarkdown::render("test.Rmd", output_file = "test1.pdf")
#> ....
#> ! Package soul Error: Reconstruction failed.

Examining the two TeX files produced, the line containing the inline code differs, and seems to have some escaped spaces in the version that fails to compile. The two lines are show below for comparison.

Using the \texttt{images = "lowres"} is fine.   %Compiles
Using the \texttt{images\ =\ "lowres"} is fine. %Does not compile

Normally BiocStyle is aware of this issue and attempts to remove the escaped spaces before Latex is called. However this patch is not applied when we specify the name of an output file with ".pdf" extension.

I believe the difference in output starts with the following code block:

rmarkdown/R/render.R

Lines 487 to 491 in 0af6b35

output_auto <- pandoc_output_file(input, output_format$pandoc)
if (is.null(output_file) || is.na(output_file)) output_file <- output_auto else {
if (!inherits(output_file, "AsIs") && xfun::file_ext(output_file) == "")
output_file <- paste(output_file, xfun::file_ext(output_auto), sep = ".")
}

In our example where output_file is unspecified then it will be assigned test.tex (the value of output_auto) with the file extension reflecting the definition in output_format$pandoc$ext. When we provide output_file = "test1.pdf" as an argument it obviously has ".pdf" as the extension.

This then affects the following code block, which is executed only in the case where we've specified output_file = "test1.pdf":

rmarkdown/R/render.R

Lines 974 to 982 in 0af6b35

if (!grepl('[.]tex$', output_file)) {
latexmk(texfile, output_format$pandoc$latex_engine, '--biblatex' %in% output_format$pandoc$args)
file.rename(file_with_ext(texfile, "pdf"), output_file)
# clean up the tex file if necessary
if (!output_format$pandoc$keep_tex) {
texfile <- normalize_path(texfile)
on.exit(unlink(texfile), add = TRUE)
}
}

In the case of output_file = "test1.pdf" the latex compilation is carried out immediately at this step via latexmk() and fails due to something in the BiocStyle template. However when output_file has a ".tex" extension compilation is deferred and carried out as part of output_format$post_processor, which also patches the template issues. This is based on the technique used in bookdown.


Hopefully the above makes sense! I'm not sure if we're doing something unsupported in BiocStyle but it seems the crux of this issue is that specifying the output file name will override the setting defined by our output_format$pandoc, and cause latex compilation to happen earlier than it otherwise would. I wanted to check whether this is the expected behaviour. I believe the issue is also present (but silently works) if you use output_format = bookdown::pdf_document2. There latexmk() will be called twice - once in render() and once in the output_format$post_processor. with the second overwriting the first.


> xfun::session_info(c('rmarkdown', 'BiocStyle'))
R version 4.1.1 (2021-08-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Linux Mint 20.2, RStudio 2021.9.0.351

Locale:
  LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8       
  LC_COLLATE=en_US.UTF-8     LC_MONETARY=de_DE.UTF-8    LC_MESSAGES=en_US.UTF-8   
  LC_PAPER=de_DE.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
  LC_TELEPHONE=C             LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C       

Package version:
  base64enc_0.1.3     BiocManager_1.30.16 BiocStyle_2.23.2    bookdown_0.24      
  bslib_0.3.1         digest_0.6.29       evaluate_0.14       fastmap_1.1.0      
  fs_1.5.2            glue_1.6.0          graphics_4.1.1      grDevices_4.1.1    
  highr_0.9           htmltools_0.5.2     jquerylib_0.1.4     jsonlite_1.7.2     
  knitr_1.37          magrittr_2.0.1      methods_4.1.1       R6_2.5.1           
  rappdirs_0.3.3      rlang_0.4.12        rmarkdown_2.11.3    sass_0.4.0         
  stats_4.1.1         stringi_1.7.6       stringr_1.4.0       tinytex_0.36       
  tools_4.1.1         utils_4.1.1         xfun_0.29           yaml_2.2.1         

Pandoc version: 2.14.0.3
@cderv
Copy link
Collaborator

cderv commented Dec 21, 2021

@grimbough thanks for the report.

Can you explain the hack done for the soul package ?

having backslash seems the way Pandoc transforms it

❯ pandoc -t latex
Using the `images = "lowres"` is fine.
^Z
Using the \texttt{images\ =\ "lowres"} is fine.

Also using rmarkdown::pdf_document will have the escaped spaces in the intermediates .tex output.

I would like to understand what exactly is needed for this soul package.

That being said, I don't yet understand why a post processor would not apply if an extension is set to output_file (Bioconductor/BiocStyle#93 (comment))

@grimbough
Copy link
Author

grimbough commented Dec 22, 2021

The addition of the soul package to the BiocStyle .sty predates my time contributing to the package, so I can't comment on why we choose to use it, but the initial reports of problems with it can be found in Bioconductor/BiocStyle#14 That seems to acknowledge that the pandoc output is incompatible with the soul package, hence the need for some sort of patch.

The comment on the soul patch reads "# substitute all control spaces "\ " in \texttt by regular spaces" and the actual patch code can be found in the post_processor function at:
https://github.com/Bioconductor/BiocStyle/blob/96ad6f7cc4d87f9705f51bacb5c62fc4c7e4f6de/R/pdf_document.R#L146-L163

Also using rmarkdown::pdf_document will have the escaped spaces in the intermediates .tex output.

I think this isn't a problem because rmarkdown::pdf_document doesn't use the soul package in it's style file.

That being said, I don't yet understand why a post processor would not apply if an extension is set to output_file (Bioconductor/BiocStyle#93 (comment))

The problem in this case is not that the post processor doesn't get applied when we set a file extension, it's more that unless the file extension is .tex the function latexmk() gets called before the post processor is applied. For BiocStyle::pdf_document this results in an error because that post processor patch is necessary for the latex file to be compiled, and for something like bookdown::pdf_document2 it means the output PDF file is generated twice (although you'll only notice that if stepping through the code).

I the case where we don't set a file extension to the output_file argument the code will use whatever is set in output_format$pandoc$ext which for BiocStyle::pdf_document or bookdown::pdf_document2 is .tex rather than .pdf. This is behaviour is performed by:

rmarkdown/R/render.R

Lines 487 to 491 in 0af6b35

output_auto <- pandoc_output_file(input, output_format$pandoc)
if (is.null(output_file) || is.na(output_file)) output_file <- output_auto else {
if (!inherits(output_file, "AsIs") && xfun::file_ext(output_file) == "")
output_file <- paste(output_file, xfun::file_ext(output_auto), sep = ".")
}

@cderv
Copy link
Collaborator

cderv commented Dec 22, 2021

Thanks for your feedback. I got some hint yesterday in my bug hunt but did not share because it was not ended.
I think you helped us unravel a odd behavior and potentially important hidden bug.

I found the part of the code that make the post processing not run correctly

rmarkdown/R/render.R

Lines 974 to 981 in 0af6b35

if (!grepl('[.]tex$', output_file)) {
latexmk(texfile, output_format$pandoc$latex_engine, '--biblatex' %in% output_format$pandoc$args)
file.rename(file_with_ext(texfile, "pdf"), output_file)
# clean up the tex file if necessary
if (!output_format$pandoc$keep_tex) {
texfile <- normalize_path(texfile)
on.exit(unlink(texfile), add = TRUE)
}

The code you shared above is trying to determine what should be the output file extension when none is asked (most of the time output_file is auto determined), and you're right with the difference between bookdown::pdf_document2() and pdf_document().

bookdown::pdf_document2() seems to enforce .tex extension because it will handle the PDF conversion in its post processor after tweaking the tex file. When extension is not .tex, output PDF is assumed and the conversion to PDF happens directly inside rmarkdown::render() before any post processing of the .tex file.

The post processor will be run after on the output file being a pdf. Until recently, pdf_document() had no post processor and it seemed to work. But there is one today and I don't think it is working correctly. Oddly, it throws a warning only in debug mode.

bookdpwn::pdf_document2() doesn't have an issue because

.tex to .pdf conversion is run twice !
First in

rmarkdown/R/render.R

Lines 974 to 981 in 0af6b35

if (!grepl('[.]tex$', output_file)) {
latexmk(texfile, output_format$pandoc$latex_engine, '--biblatex' %in% output_format$pandoc$args)
file.rename(file_with_ext(texfile, "pdf"), output_file)
# clean up the tex file if necessary
if (!output_format$pandoc$keep_tex) {
texfile <- normalize_path(texfile)
on.exit(unlink(texfile), add = TRUE)
}

and then in

rmarkdown/R/render.R

Lines 1001 to 1006 in 0af6b35

if (!is.null(output_format$post_processor))
output_file <- output_format$post_processor(front_matter,
input,
output_file,
clean,
!quiet)

From what I understand, BiocStyle::pdf_document() requires to have a processing of the .tex file so that the PDF can be built. With current behavior, the file extension needs to be .tex, otherwise a first conversion will happen before any post processing. When output_file is set with a .tex extension, rmarkdown::render() will only convert .md to .tex, and the post processor bookdown::pdf_document2() will run. It will run your custom post processor, then its post processor, including a .tex to .pdf conversion.

With the current way rmarkdown is working, BiocStyle::pdf_document() won't work when output_file is set using an extension different than .tex.

In a way, output_format$post processor is supposed to run on the output file from the conversion. So when output is supposed to be PDF, it runs after pdf conversion, but I don't think it makes sense - usually you want to post process the .tex file from after pandoc conversion.

I don't think all of this is expected and we will look to improve that. I need to discuss with @yihui to understand the historical behavior and see how we can change that.

Personally, I did not have any idea of this issue - thanks for reporting ! That is an odd bug - it took me time to understand.

@cderv cderv added bug an unexpected problem or unintended behavior next to consider for next release labels Dec 22, 2021
@cderv
Copy link
Collaborator

cderv commented Dec 22, 2021

I think this isn't a problem because rmarkdown::pdf_document doesn't use the soul package in it's style file.

About that, I tried to reproduce the initial BiocStyle is trying to solve with it patch by using soul CTAN package in a regular PDF document. At first, I did not manage. By looking at the .sty file, I understand that the issue comes from

https://github.com/Bioconductor/BiocStyle/blob/932ae2eacb8766dc1bb552be0ecfd1a71e6a2723/inst/resources/tex/Bioconductor.sty#L274-L279

% local redefinitions in \texttt to allow use of \`{}  and \^{} in \hl
\renewcommand{\texttt}[2][codecolor]{{%
 \def\`{\textasciigrave}%
 \def\^{\textasciicircum}%
 \textcolor{#1}{\hl{\ttfamily#2}}%
}}

This will redefine the the \texttt command, and pass its content to \hl which is a soul command it seems. That is where control space \ is not supported and you need the tweak.

I was just curious to understand. I don't know if there is anything else that can be done to be compatible with control spaces.

@grimbough
Copy link
Author

grimbough commented Dec 22, 2021

Thanks for the feedback. I agree it's an odd bug, and also took me quite some time to pin down the behaviour. I was reluctant to report anything until I was sure it wasn't BiocStyle doing something unsupported. Reading through I think we've come to exactly the same conclusions about what's happening, so that's encouraging!

From my point of view there no problem if it takes a while to fix - that soul patch has been in BiocStyle for 5 years and it's only been reported now, so I guess providing a non-default output file name doesn't occur very often for our users, and we have the "no extension" work around anyway.


I guess there's something philosophical about the choice of output file if someone does something weird like output_format = "BiocStyle::pdf_document()" and output_file = "test.html" - should they get a PDF, an HTML, or a PDF called "test.html"? At the moment the render() code takes slightly more cues from the output file name than the output format options. I'm not sure what's "right" but it definitely seems weird that output_format = BiocStyle::pdf_document(), output_file = "test.tex" generates a PDF and output_format = BiocStyle::pdf_document(), output_file = "test.pdf" throws an error!

@cderv
Copy link
Collaborator

cderv commented Dec 22, 2021

but it definitely seems weird that output_format = BiocStyle::pdf_document(), output_file = "test.tex" generates a PDF and output_format = BiocStyle::pdf_document(), output_file = "test.pdf" throws an error!

I can understand - it is not like that with all formats. Somehow in all relies on what BiocStyle::pdf_document() is using internally.

output_format = BiocStyle::pdf_document(), output_file = "test.pdf" should work but BiocStyle requires a tex post processing before rendering the PDF, with does not happen unless render() is tricked to produce a tex file only and PDF conversion happens in the post processor. bookdown takes that into account but the double latexmk() rendering does not seem right. I'll look into that.

This output_format() mechanism is not really documented and a lot of different things happens in render() depending on arguments. One day we'll document and it may unravel more than this odd behavior 😅

Thanks a lot for this discussion ! We'll work on rmarkdown early next year so we'll probably modify this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug an unexpected problem or unintended behavior next to consider for next release
Projects
Status: Backlog
Development

No branches or pull requests

2 participants