This describes the overall steps that content goes through. Several of these steps can be grouped together into a larger step. Combining steps would reduce the amount of validation needed after each step, shrink the step-dependency graph people have to keep in their heads, and reduce the number of files that need to be documented (as the output of one step and the input to another).

Listed here are the pipeline steps, grouped into sets that could be combined.
- git-fetch: Clone URL & checkout commit
- git-fetch-metadata, git-assemble, git-assemble-meta
  - Replace `<md:metadata>` and move images to `../resources/{sha}`, convert CNXML to HTML and assemble all the files together, extract the abstract & revised date for each Page
- git-bake, git-bake-meta, git-link
  - Create a book metadata JSON file with slugs and abstracts, add attributes to links for REX so it knows the canonical book
- Output-specific steps
git-fetch: This just runs an authenticated `git clone` and checks out the correct branch/commit.
Validation:
Use POET CLI to validate the results.
Three things happen in git-fetch-metadata:

- Replace `<md:metadata>` in CNXML and COLLXML files
  - The metadata in every CNXML file is replaced with 2 fields: `revised` and `canonical-book-uuid`
  - The metadata in every Collection file is replaced with 2 fields: `revised` and `version`
- Move images/resources into `../resources/`
  - Move resources into a `/resources/{sha}` format
  - Update the CNXML references to these resource files
  - Generate neighboring JSON files for AWS to help set the content type when browsers fetch the resource
- Copy the web style into the resources directory (for use by REX)
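The resource move and the neighboring content-type JSON can be sketched roughly as follows (the hash choice, file layout, and metadata field names are assumptions for illustration, not the pipeline's actual code):

```python
import hashlib
import json
import mimetypes
import shutil
from pathlib import Path

def move_resource(src: Path, resources_dir: Path) -> str:
    """Copy a media file to resources/{sha} and write a neighboring JSON
    file so AWS can set the Content-Type header. Returns the sha."""
    # Hypothetical sketch: sha1-of-contents as the filename is an assumption.
    sha = hashlib.sha1(src.read_bytes()).hexdigest()
    resources_dir.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(src, resources_dir / sha)
    mime, _ = mimetypes.guess_type(src.name)
    meta = {"original_name": src.name,
            "mime_type": mime or "application/octet-stream"}
    (resources_dir / f"{sha}.json").write_text(json.dumps(meta))
    return sha
```

The CNXML `<image src="...">` references would then be rewritten to point at `../resources/{sha}`.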
Validation: The results can be validated like so:

- Should still validate using POET CLI (maybe a minor tweak is necessary?)
- Every `<link resource="...">` and `<image src="...">` should begin with `../resources/`
- Every `<md:metadata>` in the CNXML files and the COLLXML file should contain 2 entries
- A style file exists in `../resources/`, with a corresponding sourcemap if one exists
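The `../resources/` check could be automated with a few lines of stdlib Python. This is a sketch: the namespace URI and attribute names follow CNXML conventions, but the real validator may differ.

```python
import xml.etree.ElementTree as ET

CNXML = "http://cnx.rice.edu/cnxml"

def bad_resource_refs(cnxml_text: str) -> list:
    """Return image/link references that do not start with ../resources/."""
    root = ET.fromstring(cnxml_text)
    bad = []
    for el in root.iter(f"{{{CNXML}}}image"):
        src = el.get("src", "")
        if not src.startswith("../resources/"):
            bad.append(src)
    for el in root.iter(f"{{{CNXML}}}link"):
        res = el.get("resource")
        if res is not None and not res.startswith("../resources/"):
            bad.append(res)
    return bad
```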
git-assemble: This step performs several things to convert every collection.xml file into a gigantic `{slug}.collection.xhtml`:

- Convert every CNXML file to an XHTML file
- Prefix `id` attributes and links to those attributes with the module ID so they are unique once they are combined into one XML document
- Depending on the `<c:link>` type, convert it to `<a href="/contents/{MODULE_ID}">` (other-book link), `<a href="#page_{PAGE_ID}">` (same-book link), `<a href="#page_{PAGE_ID}_{TARGET_ID}">` (element on a page), or `<a href="https://cnx.org/content/{PAGE_ID}">` if the page does not exist in the repo
  - Also add `class="autogenerated-content"` if the CNXML does not have any link text
  - See https://openstax.atlassian.net/wiki/spaces/CE/pages/1759707137/Pipeline+Pipeline+Task+Definitions#Git-Links for examples
- Fetch exercise JSON and TeX math from exercises.openstax.org and convert it to HTML and MathML
  - Check whether the TeX-to-MathML code is dead, because the code supposedly calls the MMLCloud API: https://github.com/openstax/cnx-epub/blob/master/cnxepub/formatters.py#L328
- When injected exercises have a cnx-context tag, resolve whether the exercise context should link to an element on this page, another page in this book, or a page in another book: https://github.com/openstax/cnx-epub/blob/master/cnxepub/formatters.py#L382
- Write the book out using this template (do we need most of this?): https://github.com/openstax/cnx-epub/blob/master/cnxepub/formatters.py#L602
- A ToC is added to the top of the gigantic XHTML file: https://github.com/openstax/cnx-epub/blob/master/cnxepub/formatters.py#L932
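The `<c:link>` mapping above can be sketched as a pure function. The `book_pages`/`other_book_pages` lookup sets are hypothetical helpers; the real logic lives in cnx-epub.

```python
from typing import Optional

def link_href(document: Optional[str], target_id: Optional[str],
              this_page_id: str, book_pages: set,
              other_book_pages: set) -> str:
    """Map a <c:link> to the <a href> forms listed above (sketch)."""
    page = document or this_page_id
    if document is None or document in book_pages:
        # Same-book link, optionally to a specific element on the page
        return f"#page_{page}_{target_id}" if target_id else f"#page_{page}"
    if document in other_book_pages:
        return f"/contents/{document}"            # other-book link
    return f"https://cnx.org/content/{document}"  # page not in the repo
```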
Validation:
- XHTML validator should pass for every assembled XHTML file
- An RNG schema to validate the root elements (unit, chapter, page)
git-assemble-meta: Generates a `{slug}.assembled-metadata.json` file which contains the abstract and revised date for each Page:
```json
{
  "{page_uuid}": { "abstract": "...", "revised": "2022-..." },
  "{page_uuid}": { "abstract": "...", "revised": "2022-..." },
  "{page_uuid}": { "abstract": "...", "revised": "2022-..." }
}
```
Validation:
A JSONSchema for each JSON file.
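The shape check for this file is small enough to sketch directly; the real pipeline would presumably use a JSON Schema validator, and the exact schema here is an assumption:

```python
def validate_assembled_metadata(data: dict) -> list:
    """Minimal shape check for {slug}.assembled-metadata.json (a sketch,
    not the pipeline's actual schema). Returns a list of error strings."""
    errors = []
    for page_uuid, entry in data.items():
        if not isinstance(entry, dict):
            errors.append(f"{page_uuid}: entry is not an object")
            continue
        for field in ("abstract", "revised"):
            if field not in entry:
                errors.append(f"{page_uuid}: missing '{field}'")
    return errors
```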
git-bake: CS-Styles takes over from here and bakes the big XHTML file using a Ruby recipe.
Validation:
- XHTML validator
- The top elements that the disassembler looks for should be defined in an RNG
git-bake-meta: Creates a `{slug}.baked-metadata.json` which contains everything in `{slug}.assembled-metadata.json` plus a book entry:
```json
{
  "{page_uuid}": { "abstract": "...", "revised": "2022-..." },
  "{book_uuid}@{ver}": {
    "id": "{book_uuid}",
    "title": "Algebra",
    "revised": "2022-...",
    "slug": "algebra-trig",
    "version": "359e7eb",
    "language": "en",
    "license": {
      "url": "http://creativecommons.org/licenses/by/1.0",
      "name": "Creative Commons Attribution License"
    },
    "tree": {
      "id": "{uuid}",
      "title": "Title of the chapter",
      "contents": [
        {
          "id": "",
          "title": "<span>Title with</span> Markup",
          "slug": "1-1-addition"
        }
      ]
    }
  }
}
```
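The merge itself is simple; a sketch under the assumption that the book entry dict follows the example above (the function name and file layout are hypothetical):

```python
import json
from pathlib import Path

def write_baked_metadata(slug: str, book: dict, out_dir: Path) -> Path:
    """Merge {slug}.assembled-metadata.json with a book entry keyed by
    {book_uuid}@{ver}. A sketch, not the real step's code."""
    assembled = json.loads(
        (out_dir / f"{slug}.assembled-metadata.json").read_text())
    baked = dict(assembled)
    baked[f"{book['id']}@{book['version']}"] = book
    out = out_dir / f"{slug}.baked-metadata.json"
    out.write_text(json.dumps(baked, indent=2))
    return out
```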
Validation:
- JSONSchema on each generated book's JSON file.
- Maybe XHTML validation on each Baked XHTML file.
git-link: For links to other books, this step adds attributes on the link so REX will be able to choose the right book to link to:

- `data-book-uuid="..."`
- `data-book-slug="..."`
- `data-page-slug="..."`
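A sketch of that annotation pass with stdlib ElementTree; `lookup(page_uuid)` returning `(book_uuid, book_slug, page_slug)` is a hypothetical helper, not the pipeline's real API:

```python
import xml.etree.ElementTree as ET

XHTML = "http://www.w3.org/1999/xhtml"

def annotate_cross_book_links(xhtml_text: str, lookup) -> str:
    """Add data-book-* attributes to every /contents/ anchor (sketch)."""
    root = ET.fromstring(xhtml_text)
    for a in root.iter(f"{{{XHTML}}}a"):
        href = a.get("href", "")
        if href.startswith("/contents/"):
            page_uuid = href[len("/contents/"):]
            book_uuid, book_slug, page_slug = lookup(page_uuid)
            a.set("data-book-uuid", book_uuid)
            a.set("data-book-slug", book_slug)
            a.set("data-page-slug", page_slug)
    return ET.tostring(root, encoding="unicode")
```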
Validation:
Whatever REX expects these files to have.
This is the end of the common parts of the pipeline. Here things diverge for each output.
- Ruby for baking
- Something that supports parsing XML/JSON with source line/column numbers (Sourcemaps) to run all the other steps
- TypeScript for POET CLI
- Java for XHTML and RNG validation
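As an example of the "source line/column numbers" requirement, Python's stdlib expat parser already exposes parse positions (lxml's `sourceline` attribute is another option); a minimal sketch:

```python
from xml.parsers import expat

def element_positions(xml_text: str) -> dict:
    """Map each element name to the (line, column) where it first appears.
    Lines are 1-based, columns 0-based, per expat's conventions."""
    positions = {}
    parser = expat.ParserCreate()

    def start(name, attrs):
        positions.setdefault(name, (parser.CurrentLineNumber,
                                    parser.CurrentColumnNumber))

    parser.StartElementHandler = start
    parser.Parse(xml_text, True)
    return positions
```

This kind of position tracking is what makes it possible to point validation errors back at a line in the source file.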