vignettes: avoid using _ or . in header IDs (#6784)

* vignettes: avoid using _ or . in header IDs

  data.table-intro on CRAN:

  <h3 id="h-great-but-how-can-i-refer-to-columns-by-names-in-j-like-in-a-data-frame" #refer_j>h) Great! But how can I refer to columns by names in <code>j</code> (like in a <code>data.frame</code>)?</h3>
  <h4 id="how-can-we-calculate-the-number-of-trips-for-each-origin-airport-for-carrier-code-quot-aa-quot" #origin-.N>– How can we calculate the number of trips for each origin airport for carrier code <code>"AA"</code>?</h4>
  <h4 id="how-can-we-get-the-total-number-of-trips-for-each-origin-dest-pair-for-carrier-code-quot-aa-quot" #origin-dest-.N>– How can we get the total number of trips for each <code>origin, dest</code> pair for carrier code <code>"AA"</code>?</h4>

  This is not valid HTML and the links to these headers don't work. "Intro" seems to be the only vignette affected.

* Add a linter to validate vignette heading ids

  This doesn't parse the full range of Pandoc's header_attributes extension (one shouldn't be using regular expressions for that), but at least it captures the existing mistakes without obvious false positives.

* Fix the heading ids in 'Introduction à data.table'

Rdatatable · Feb 2, 2025 · 50e056f
1 parent b089b74
commit 50e056f

Showing 3 changed files with 36 additions and 10 deletions.
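
The id rule described in the commit message can be tried on a few sample lines before reading the diff. The sketch below reuses the two regular expressions from the linter; the sample headings and the variable names (`md`, `captures`, `has_id`, `ids`, `valid`) are made up for illustration and are not part of the commit.

```r
# Illustrative input only: one heading without an id, one with an
# underscore in its id, and one with a dash-only id.
md = c(
  "### c) Chaining",
  "### h) Great! {#refer_j}",
  "#### -- How many trips? {#origin-N}"
)

# Same heading pattern as the linter: capture the id of headings ending in {#id}.
captures = regmatches(md, regexec("^#+ \\S.*[{]#([^}]*)[}]$", md))
has_id = lengths(captures) > 0
ids = vapply(captures[has_id], `[`, '', 2)

# Same id pattern as the linter: a letter, then alphanumerics or dashes.
valid = grepl('^[A-Za-z][A-Za-z0-9-]*$', ids)

print(data.frame(line = which(has_id), id = ids, valid = valid))
```

Under these assumptions, only the heading with `refer_j` is flagged; the heading without an `{#id}` suffix is skipped entirely, which matches the linter's stated scope.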
diff --git a/.ci/linters/md/heading_id_linter.R b/.ci/linters/md/heading_id_linter.R
@@ -0,0 +1,26 @@
+any_mismatch = FALSE
+
+# ensure that ids are limited to alphanumerics and dashes
+# (in particular, dots and underscores break the links)
+check_header_ids = function(md) {
+  # A bit surprisingly, some headings don't start with a letter.
+  # We're interested in those that set an id to link to, i.e., end with {#id}.
+  heading_captures = regmatches(md, regexec("^#+ \\S.*[{]#([^}]*)[}]$", md))
+  lines_with_id = which(lengths(heading_captures) > 0)
+  ids = vapply(heading_captures[lines_with_id], `[`, '', 2)
+  # ids must start with a letter and consist of alphanumerics or dashes.
+  good_ids = grepl('^[A-Za-z][A-Za-z0-9-]*$', ids)
+  for (line in lines_with_id[!good_ids]) cat(sprintf(
+    "On line %d, bad heading id '%s':\n%s\n",
+    line, heading_captures[[line]][2], heading_captures[[line]][1]
+  ))
+  !all(good_ids)
+}
+
+any_error = FALSE
+for (vignette in list.files('vignettes', pattern = "[.]Rmd$", recursive = TRUE, full.name = TRUE)) {
+  cat(sprintf("Checking vignette file %s...\n", vignette))
+  rmd_lines = readLines(vignette)
+  any_error = check_header_ids(rmd_lines) || any_error
+}
+if (any_error) stop("Please fix the vignette issues above.")
diff --git a/vignettes/datatable-intro.Rmd b/vignettes/datatable-intro.Rmd
@@ -316,7 +316,7 @@ ans
 
 We could have accomplished the same operation by doing `nrow(flights[origin == "JFK" & month == 6L])`. However, it would have to subset the entire `data.table` first corresponding to the *row indices* in `i` *and then* return the rows using `nrow()`, which is unnecessary and inefficient. We will cover this and other optimisation aspects in detail under the *`data.table` design* vignette.
 
-### h) Great! But how can I refer to columns by names in `j` (like in a `data.frame`)? {#refer_j}
+### h) Great! But how can I refer to columns by names in `j` (like in a `data.frame`)? {#refer-j}
 
 If you're writing out the column names explicitly, there's no difference compared to a `data.frame` (since v1.9.8).
 
@@ -422,7 +422,7 @@ ans
 
     We'll use this convenient form wherever applicable hereafter.
 
-#### -- How can we calculate the number of trips for each origin airport for carrier code `"AA"`? {#origin-.N}
+#### -- How can we calculate the number of trips for each origin airport for carrier code `"AA"`? {#origin-N}
 
 The unique carrier code `"AA"` corresponds to *American Airlines Inc.*
 
@@ -435,7 +435,7 @@ ans
 
 * Using those *row indices*, we obtain the number of rows while grouped by `origin`. Once again no columns are actually materialised here, because the `j-expression` does not require any columns to be actually subsetted and is therefore fast and memory efficient.
 
-#### -- How can we get the total number of trips for each `origin, dest` pair for carrier code `"AA"`? {#origin-dest-.N}
+#### -- How can we get the total number of trips for each `origin, dest` pair for carrier code `"AA"`? {#origin-dest-N}
 
 ```{r}
 ans <- flights[carrier == "AA", .N, by = .(origin, dest)]
@@ -483,7 +483,7 @@ We'll learn more about `keys` in the [`vignette("datatable-keys-fast-subset", pa
 
 ### c) Chaining
 
-Let's reconsider the task of [getting the total number of trips for each `origin, dest` pair for carrier *"AA"*](#origin-dest-.N).
+Let's reconsider the task of [getting the total number of trips for each `origin, dest` pair for carrier *"AA"*](#origin-dest-N).
 
 ```{r}
 ans <- flights[carrier == "AA", .N, by = .(origin, dest)]
@@ -583,7 +583,7 @@ We are almost there. There is one little thing left to address. In our `flights`
 
 Using the argument `.SDcols`. It accepts either column names or column indices. For example, `.SDcols = c("arr_delay", "dep_delay")` ensures that `.SD` contains only these two columns for each group.
 
-Similar to [part g)](#refer_j), you can also specify the columns to remove instead of columns to keep using `-` or `!`. Additionally, you can select consecutive columns as `colA:colB` and deselect them as `!(colA:colB)` or `-(colA:colB)`.
+Similar to [part g)](#refer-j), you can also specify the columns to remove instead of columns to keep using `-` or `!`. Additionally, you can select consecutive columns as `colA:colB` and deselect them as `!(colA:colB)` or `-(colA:colB)`.
 
 Now let us try to use `.SD` along with `.SDcols` to get the `mean()` of `arr_delay` and `dep_delay` columns grouped by `origin`, `dest` and `month`.
 

diff --git a/vignettes/fr/datatable-intro.Rmd b/vignettes/fr/datatable-intro.Rmd
@@ -312,7 +312,7 @@ ans
 
 On aurait pu faire la même opération en écrivant `nrow(flights[origin == "JFK" & month == 6L])`. Néanmoins il aurait fallu d'abord dissocier la `data.table` entière  en fonction des *indices de lignes* dans `i` *puis* renvoyer les lignes en utilisant `nrow()`, ce qui est inutile et pas efficace. Nous aborderons en détails ce sujet et d'autres aspects de l'optimisation dans la vignette *architecture de `data.table`*.
 
-### h) Super !  Mais comment référencer les colonnes par nom dans `j` (comme avec un `data.frame`) ? {#refer_j}
+### h) Super !  Mais comment référencer les colonnes par nom dans `j` (comme avec un `data.frame`) ? {#refer-j}
 
 Si vous imprimez le nom des colonnes explicitement, il n'y a pas de différence avec un `data.frame` (depuis v1.9.8).
 
@@ -418,7 +418,7 @@ ans
     
     Nous utiliserons cette forme pratique chaque fois que cela sera possible.
 
-#### -- Comment calculer le nombre de voyages au départ de chaque aéroport pour le transporteur ayant le code `"AA"`? {#origin-.N}
+#### -- Comment calculer le nombre de voyages au départ de chaque aéroport pour le transporteur ayant le code `"AA"`? {#origin-N}
 
 Le code unique de transporteur `"AA"` correspond à *American Airlines Inc.*
 
@@ -431,7 +431,7 @@ ans
 
 * En utilisant ces *index de ligne*, nous obtenons le nombre de lignes groupées par `origine`. Une fois de plus, aucune colonne n'est matérialisée ici, car l'expression `j' ne nécessite aucune colonne pour définir le sous-ensemble et le calcul est donc rapide et peu gourmand en mémoire.
 
-#### -- Comment obtenir le nombre total de voyages pour chaque paire `origin, dest` du transporteur ayant pour code `"AA"`? {#origin-dest-.N}
+#### -- Comment obtenir le nombre total de voyages pour chaque paire `origin, dest` du transporteur ayant pour code `"AA"`? {#origin-dest-N}
 
 ```{r}
 ans <- flights[carrier == "AA", .N, by = .(origin, dest)]
@@ -479,7 +479,7 @@ Nous en apprendrons plus au sujet des `clés` dans la vignette *Clés et sous-en
 
 ### c) Chaînage
 
-Considérons la tâche consistant à [récupérer le nombre total de voyages pour chaque couple `origin, dest` du transporteur *"AA"*](#origin-dest-.N).
+Considérons la tâche consistant à [récupérer le nombre total de voyages pour chaque couple `origin, dest` du transporteur *"AA"*](#origin-dest-N).
 
 ```{r}
 ans <- flights[carrier == "AA", .N, by = .(origin, dest)]
@@ -579,7 +579,7 @@ Nous y sommes presque. Il reste encore une petite chose à régler. Dans notre `
 
 En utilisant l'argument `.SDcols`. Il accepte soit des noms soit des indices de colonnes. Par exemple, `.SDcols = c("arr_delay", "dep_delay")` permet que `.SD` ne comporte que ces deux colonnes pour chaque groupe.
 
-De la même manière que [part g)](#refer_j), vous pouvez également spécifier les colonnes à supprimer au lieu des colonnes à garder en utilisant le `-` ou `!`. De plus, vous pouvez sélectionner des colonnes consécutives avec `colA:colB` et les désélectionner avec `!(colA:colB)` ou `-(colA:colB)`.
+De la même manière que [part g)](#refer-j), vous pouvez également spécifier les colonnes à supprimer au lieu des colonnes à garder en utilisant le `-` ou `!`. De plus, vous pouvez sélectionner des colonnes consécutives avec `colA:colB` et les désélectionner avec `!(colA:colB)` ou `-(colA:colB)`.
 
 Maintenant essayons d'utiliser `.SD` avec `.SDcols` pour obtenir la moyenne `mean()` des colonnes `arr_delay` et `dep_delay` groupées par `origin`, `dest` et `month`.