
How content changes in enki

This document describes the overall steps that content goes through. Several of these steps could be grouped together into a larger step. Combining them would reduce the amount of validation needed after each step, the size of the step-dependency graph people have to keep in their heads, and the number of files that need to be documented (as the output of one step and the input to another).

Overall Process

The pipeline steps are listed here, grouped into steps that could be combined.

  1. git-fetch: Clone URL & checkout commit
  2. git-fetch-metadata, git-assemble, git-assemble-meta
    • Replace <md:metadata> and move images to ../resources/{sha}; convert CNXML to HTML and assemble all the files together; extract the abstract and revised date for each Page
  3. git-bake
  4. git-bake-meta, git-link
    • Create a book metadata JSON file with slugs and abstracts; add attributes to links so REX knows the canonical book
  5. Output-specific steps

What happens in each step

git-fetch

This just runs an authenticated git clone and checks out the correct branch/commit.

Validation:

Use POET CLI to validate the results.

git-fetch-metadata

Three things happen.

  1. Replace <md:metadata> in CNXML and collxml files
  2. Move images/resources into ../resources/
  3. Copy the web style into the resources directory (for use by REX)

fetch-update-metadata

  • The metadata in every CNXML file is replaced with 2 fields: revised and canonical-book-uuid
  • The metadata in every Collection file is replaced with 2 fields: revised and version

fetch-map-resources

  1. Move resources into a /resources/{sha} format
  2. Update the CNXML references to these resource files
  3. Generate neighboring JSON files for AWS to help set the content type when browsers fetch the resource
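
As a rough illustration of steps 1 and 3, here is a minimal sketch for a single resource file. Several details are assumptions and not taken from the pipeline itself: the function name `map_resource` is hypothetical, `{sha}` is taken to be the SHA-1 of the file bytes, and the neighboring JSON file is named `{sha}.json` with made-up `original_name` and `mime_type` fields.

```python
import hashlib
import json
import mimetypes
import shutil
from pathlib import Path


def map_resource(src: Path, resources_dir: Path) -> str:
    """Move one resource to resources_dir/{sha} and write a neighboring JSON file.

    Returns the sha so a caller could rewrite the CNXML references (step 2).
    """
    # Assumption: {sha} is the SHA-1 of the file contents.
    sha = hashlib.sha1(src.read_bytes()).hexdigest()
    resources_dir.mkdir(parents=True, exist_ok=True)

    # Hypothetical sidecar fields; the real file layout may differ.
    mime_type, _ = mimetypes.guess_type(src.name)
    sidecar = {
        "original_name": src.name,
        "mime_type": mime_type or "application/octet-stream",
    }

    shutil.move(str(src), str(resources_dir / sha))
    (resources_dir / f"{sha}.json").write_text(json.dumps(sidecar))
    return sha
```

Naming the file after a hash of its contents means two identical images end up as one file, which is one plausible reason for the `/resources/{sha}` layout.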

Validation:

The results can be validated like so:

  1. Should still validate using POET CLI (maybe a minor tweak is necessary?)
  2. Every <link resource="..."> and <image src="..."> value should begin with ../resources/
  3. Every <md:metadata> in the CNXML file and the COLLXML file should contain 2 entries
  4. A style file exists in ../resources/, along with its corresponding sourcemap if there is one
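
Check 2 in the list above could be automated along these lines. This is only a sketch: the function name `find_unmapped_resources` is hypothetical, and it matches elements by local name so that it does not depend on the exact CNXML namespace URI.

```python
import xml.etree.ElementTree as ET
from typing import List


def find_unmapped_resources(cnxml_text: str) -> List[str]:
    """Return every link/@resource and image/@src value not under ../resources/."""
    offenders = []
    root = ET.fromstring(cnxml_text)
    for el in root.iter():
        # Strip the "{namespace}" prefix ElementTree puts on tag names.
        local_name = el.tag.rsplit("}", 1)[-1]
        if local_name == "link":
            value = el.get("resource")
        elif local_name == "image":
            value = el.get("src")
        else:
            continue
        if value is not None and not value.startswith("../resources/"):
            offenders.append(value)
    return offenders
```

An empty list means the file passes the check. Links that have no `resource` attribute (for example plain URL links) are ignored.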

git-assemble

This step performs several things to convert every collection.xml file into a gigantic {slug}.collection.xhtml:

  1. Convert every CNXML file to an XHTML file
  2. Prefix id attributes and links to those attributes with the module ID so they are unique once they are combined into one XML document
  3. Depending on the <c:link> type, convert it to <a href="/contents/{MODULE_ID}"> (other-book link), <a href="#page_{PAGE_ID}"> (same-book link), <a href="#page_{PAGE_ID}_{TARGET_ID}"> (element on a page), or <a href="https://cnx.org/content/{PAGE_ID}"> if the page does not exist in the repo
  4. Fetch exercise JSON and TeX Math from exercises.openstax.org and convert it to HTML and MathML
  5. When injected exercises have a cnx-context tag, resolve whether the exercise context should link to an element on this page, another page in this book, or a page in another book: https://github.com/openstax/cnx-epub/blob/master/cnxepub/formatters.py#L382
  6. Write the book out using this template (Do we need most of this?): https://github.com/openstax/cnx-epub/blob/master/cnxepub/formatters.py#L602
  7. A ToC is added to the top of the gigantic XHTML file: https://github.com/openstax/cnx-epub/blob/master/cnxepub/formatters.py#L932
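
The link rewriting in step 3 can be pictured as a small decision function. The sketch below is an illustration, not the actual implementation: the function name and parameters are hypothetical, module IDs and page IDs are treated as the same identifier, and the target ID is dropped for other-book links because the description above does not say how that case is handled.

```python
from typing import Optional, Set


def resolve_link_href(
    page_id: str,
    target_id: Optional[str],
    book_pages: Set[str],
    repo_pages: Set[str],
) -> str:
    """Pick the href for a <c:link> that points at page_id (and maybe target_id).

    book_pages: IDs of pages in the book being assembled.
    repo_pages: IDs of every page in the repo, across all books.
    """
    if page_id not in repo_pages:
        # The page does not exist in the repo: fall back to cnx.org.
        return f"https://cnx.org/content/{page_id}"
    if page_id not in book_pages:
        # Other-book link. Assumption: any target_id is dropped here.
        return f"/contents/{page_id}"
    if target_id:
        # Element on a page in the same book.
        return f"#page_{page_id}_{target_id}"
    # Same-book link to a whole page.
    return f"#page_{page_id}"
```

For example, `resolve_link_href("m2", "fig1", {"m1", "m2"}, {"m1", "m2", "m9"})` yields `#page_m2_fig1`, while the same call with page `"m9"` yields `/contents/m9`.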

Validation:

  1. XHTML validator should pass for every assembled XHTML file
  2. Some RNG to validate the root elements (unit, chapter, page).

git-assemble-meta

Generates a {slug}.assembled-metadata.json file which contains the abstract and revised date for each Page:

{
    "{page_uuid}": { "abstract": "...", "revised": "2022-..." },
    "{page_uuid}": { "abstract": "...", "revised": "2022-..." },
    "{page_uuid}": { "abstract": "...", "revised": "2022-..." }
}

Validation:

A JSONSchema for each JSON file.
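
Until such a schema exists, the shape shown above can be checked with plain Python. This is a stand-in sketch, not a JSONSchema: the function name is hypothetical, and it assumes both `abstract` and `revised` are always strings, which the real data may not guarantee.

```python
import json
from typing import List


def check_assembled_metadata(text: str) -> List[str]:
    """Return a list of problems found in a {slug}.assembled-metadata.json string."""
    data = json.loads(text)
    if not isinstance(data, dict):
        return ["top level must be an object keyed by page UUID"]

    problems = []
    for page_uuid, entry in data.items():
        if not isinstance(entry, dict):
            problems.append(f"{page_uuid}: entry must be an object")
            continue
        # Assumption: both fields are required and must be strings.
        for field in ("abstract", "revised"):
            if not isinstance(entry.get(field), str):
                problems.append(f"{page_uuid}: '{field}' must be a string")
    return problems
```

An empty list means the file has the expected shape.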

git-bake

CS-Styles takes over from here and bakes the big XHTML file using a Ruby recipe.

Validation:

  • XHTML validator
  • The top elements that the disassembler looks for should be defined in an RNG

git-bake-meta

Create a {slug}.baked-metadata.json which contains everything in {slug}.assembled-metadata.json plus a book entry:

{
    "{page_uuid}": { "abstract": "...", "revised": "2022-..." },
    "{book_uuid}@{ver}": {
        "id": "{book_uuid}",
        "title": "Algebra",
        "revised": "2022-...",
        "slug": "algebra-trig",
        "version": "359e7eb",
        "language": "en",
        "license": {
            "url": "http://creativecommons.org/licenses/by/1.0",
            "name": "Creative Commons Attribution License"
        },
        "tree": {
            "id": "{uuid}",
            "title": "Title of the chapter",
            "contents": [
                {
                    "id": "",
                    "title": "<span>Title with</span> Markup",
                    "slug": "1-1-addition"
                }
            ]
        }
    }
}

Validation:

  • JSONSchema on each generated book's JSON file.
  • Maybe XHTML validation on each Baked XHTML file.
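
The relationship between the two metadata files can be sketched as a simple merge. The function name `build_baked_metadata` is hypothetical, and the sketch assumes that the `{ver}` part of the book key is the book entry's `version` field, which the description above does not confirm.

```python
from typing import Any, Dict


def build_baked_metadata(
    assembled: Dict[str, Any], book: Dict[str, Any]
) -> Dict[str, Any]:
    """Return the assembled page metadata plus one "{book_uuid}@{ver}" book entry."""
    baked = dict(assembled)  # shallow copy so the input is not modified
    # Assumption: {ver} in the key is the book's "version" field.
    baked[f"{book['id']}@{book['version']}"] = book
    return baked
```

The result could then be written out with `json.dump` as `{slug}.baked-metadata.json`.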

git-link

For links to other books, this step adds attributes on the link so REX will be able to choose the right book to link to:

  • data-book-uuid="..."
  • data-book-slug="..."
  • data-page-slug="..."

Validation:

Whatever REX expects these files to have.
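
A minimal sketch of the attribute-adding step is shown below. It assumes other-book links are the `/contents/{MODULE_ID}` hrefs produced by git-assemble. The function name `annotate_other_book_links` and the lookup table `page_index`, with its `book_uuid`, `book_slug`, and `page_slug` keys, are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Dict

XHTML_NS = "http://www.w3.org/1999/xhtml"
OTHER_BOOK_PREFIX = "/contents/"


def annotate_other_book_links(
    root: ET.Element, page_index: Dict[str, Dict[str, str]]
) -> int:
    """Add data-book-uuid, data-book-slug and data-page-slug to other-book links.

    page_index maps a module ID to the book and page it should resolve to.
    Returns the number of links that were annotated.
    """
    annotated = 0
    for a in root.iter(f"{{{XHTML_NS}}}a"):
        href = a.get("href", "")
        if not href.startswith(OTHER_BOOK_PREFIX):
            continue  # same-book and external links are left alone
        info = page_index.get(href[len(OTHER_BOOK_PREFIX):])
        if info is None:
            continue  # unknown target; a real step would likely report this
        a.set("data-book-uuid", info["book_uuid"])
        a.set("data-book-slug", info["book_slug"])
        a.set("data-page-slug", info["page_slug"])
        annotated += 1
    return annotated
```

How the lookup table is built, and which book wins when a page appears in several books, is outside this sketch.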

Output-specific

This is the end of the common parts of the pipeline. Here things diverge for each output.

Required languages

  • Ruby for baking
  • Something that supports parsing XML/JSON with source line/column numbers (Sourcemaps) to run all the other steps

Validation

  • TypeScript for POET CLI
  • Java for XHTML and RNG validation