feat(dialtone-docs): DLT-3109 add markdown-to-JSON generator for AI docs #1178

Open
belumontoya wants to merge 2 commits into staging from feature/DLT-3109-ai-docs-generator

@belumontoya (Collaborator) commented Apr 7, 2026

feat(dialtone-docs): DLT-3109 add markdown-to-JSON generator for AI docs

Obligatory GIF (super important!)

🛠️ Type Of Change

  • Feature

📖 Jira Ticket

https://dialpad.atlassian.net/browse/DLT-3109

📖 Description

Adds the markdown-to-JSON build pipeline for the dialtone-docs package:

  • src/generators/build-ai-docs.mjs — Reads all markdown files under src/content/, parses YAML frontmatter (type, category, keywords, ai_summary), strips markdown syntax, and compiles everything into dist/ai-docs.json — a flat JSON array of document entries for AI consumption.
  • src/utils/strip-markdown.mjs — Utility that strips frontmatter, code blocks, HTML, links, headings, emphasis, and other markdown syntax to produce searchable plain text.
  • package.json / project.json — Added build script and NX target so pnpm nx run dialtone-docs:build triggers the generator.
  • tests/tests/build-output.test.js — 11 tests validating the JSON output schema (required fields, types, no markdown artifacts in content, file path integrity, no duplicate IDs).
  • tests/tests/strip-markdown.test.js — Unit tests for the strip-markdown utility (headings, code blocks, links, emphasis, frontmatter removal).
  • tests/helpers/markdownParser.js — Refactored to import stripMarkdown/stripFrontmatter from the new utility instead of bundling its own copy.
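
The stripping step described above can be sketched roughly as follows. This is a minimal illustration, not the contents of src/utils/strip-markdown.mjs: the horizontalRule and emphasis patterns are copied from the diff excerpt quoted later in the review, while the frontmatter and heading patterns, the whitespace cleanup, and the order of the replacements are assumptions.

```javascript
// Illustrative sketch only; the real utility may differ.
const PATTERNS = {
  // Assumed: a leading YAML frontmatter block delimited by --- lines.
  frontmatter: /^---\r?\n[\s\S]*?\r?\n---\r?\n?/,
  // Assumed: ATX heading markers at the start of a line.
  heading: /^#{1,6}\s+/gm,
  // Copied from the reviewed diff excerpt.
  horizontalRule: /^(?:[-*_]){3,}\s*$/gm,
  emphasis: /[*_]{1,2}([^*_]+)[*_]{1,2}/g,
};

// Remove the frontmatter block, if present, and return the remaining body.
function stripFrontmatter(source) {
  return source.replace(PATTERNS.frontmatter, '');
}

// Reduce markdown to plain text using the patterns above.
function stripMarkdown(source) {
  return stripFrontmatter(source)
    .replace(PATTERNS.heading, '')         // drop heading markers, keep the text
    .replace(PATTERNS.horizontalRule, '')  // drop rules before emphasis runs
    .replace(PATTERNS.emphasis, '$1')      // keep only the emphasized text
    .replace(/\n{3,}/g, '\n\n')            // collapse leftover blank lines
    .trim();
}
```

Horizontal rules are removed before emphasis here because the emphasis pattern can otherwise treat a pair of `***` rules as the delimiters of one long emphasized span.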

💡 Context

The dialtone-docs package provides AI-discoverable documentation for the Dialtone monorepo. This PR adds Milestone 3: the build step that compiles markdown content into a structured JSON file (ai-docs.json). This JSON output will serve as the data source for MCP server and CLI search tools, enabling AI agents to search the entire documentation site programmatically.

Each JSON entry includes: id, title, type, category, keywords, summary, content (plain text), filePath, lastUpdated, and relatedPackages.
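
For illustration, an entry with these fields might look like the object below. The field names come from the list above; every value, the assumption that relatedPackages is an array, and the validateEntry helper are hypothetical and only meant to show the shape a consumer could check.

```javascript
// Field names taken from the PR description.
const REQUIRED_FIELDS = [
  'id', 'title', 'type', 'category', 'keywords',
  'summary', 'content', 'filePath', 'lastUpdated', 'relatedPackages',
];

// Made-up example entry; none of these values come from the real output.
const exampleEntry = {
  id: 'components-button',
  title: 'Button',
  type: 'component',
  category: 'components',
  keywords: ['button', 'action'],
  summary: null,                      // null when ai_summary is absent
  content: 'Plain text with markdown syntax removed.',
  filePath: 'src/content/components/button.md',
  lastUpdated: '2026-03-04',          // or null when last_updated is absent
  relatedPackages: [],                // assumed to be an array
};

// Hypothetical helper: returns a list of problems, empty when the entry is well formed.
function validateEntry(entry) {
  const problems = [];
  for (const field of REQUIRED_FIELDS) {
    if (!(field in entry)) problems.push(`missing field "${field}"`);
  }
  if ('keywords' in entry && !Array.isArray(entry.keywords)) {
    problems.push('keywords must be an array');
  }
  return problems;
}
```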

📝 Checklist

  • I have ensured no private Dialpad links or info are in the code or pull request description (Dialtone is a public repo!).
  • I have reviewed my changes.
  • I have added all relevant documentation.
  • I have considered the performance impact of my change.
  • I have added / updated unit tests.

🔮 Next Steps

  • Refactor existing test suite (consolidate 6 test files down to 3, remove hardcoded content assertions that do not scale)
  • Integrate ai-docs.json into the MCP server and CLI as a search data source

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 208c3b30bd

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "Codex (@codex) review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "Codex (@codex) address that feedback".

keywords: Array.isArray(frontmatter.keywords) ? frontmatter.keywords : [],
summary: frontmatter.ai_summary ?? null,
content,
lastUpdated: frontmatter.last_updated ? String(frontmatter.last_updated) : null,

P1: Serialize last_updated deterministically

Convert last_updated without String(...): gray-matter parses unquoted YAML dates (e.g. 2026-03-04) as Date, and String(date) emits a locale/timezone-dependent value. That makes ai-docs.json nondeterministic across environments and can even shift the calendar day (e.g. UTC date appears as previous day in US timezones), which breaks stable indexing and date-based consumers.
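
A minimal sketch of a deterministic conversion, assuming the frontmatter value arrives as either a Date or a string. serializeLastUpdated is a hypothetical helper name, and returning null for an invalid Date is an assumption rather than behavior confirmed by the PR.

```javascript
// Hypothetical helper: converts a frontmatter last_updated value to a stable string.
function serializeLastUpdated(value) {
  // Mirror the original fallback: missing values become null.
  if (value === undefined || value === null || value === '') return null;

  if (value instanceof Date) {
    // Assumption: an invalid Date is treated as missing instead of throwing.
    if (Number.isNaN(value.getTime())) return null;
    // toISOString() always reports UTC, so the result does not depend on the
    // machine's timezone; keep only the YYYY-MM-DD part.
    return value.toISOString().split('T')[0];
  }

  // Quoted YAML dates arrive as strings and are passed through unchanged.
  return String(value);
}
```

A date parsed as UTC midnight therefore keeps its calendar day on every machine, whereas String(date) renders it in local time.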

@braddialpad Brad Paugh (Contributor) left a comment

Looks really good. My only concern is the large amount of hard-to-read regex used to parse the markdown; however, it is necessary for this change and is also easier to understand in the age of AI.

Couple of small comments, nothing major.

Comment on lines +45 to +48
for (const doc of docs) {
  expect(doc.type, `"${doc.id}" type is null`).not.toBeNull();
  expect(ALLOWED_TYPES, `"${doc.id}" invalid type "${doc.type}"`).toContain(doc.type);
}

@braddialpad (Contributor):

For all tests where we are looping through arrays like this, we should be using test.each instead of a for loop.

A single failure will mask all subsequent ones when doing it the current way.

@belumontoya (Collaborator, Author):

Agreed, but one thing is worth noting: docs is built in beforeAll, so test.each can't access it at test-definition time. I could build it synchronously at module scope, or keep the loop but use soft assertions. Any preference? Or a better idea?
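
One possible shape for the "keep the loop" option is to collect every failure and assert once at the end, so a single bad document does not hide the others. This is only a sketch: collectTypeFailures is a hypothetical helper, and the ALLOWED_TYPES values shown are placeholders, not the real list.

```javascript
// Placeholder values; the real ALLOWED_TYPES list is not shown in this thread.
const ALLOWED_TYPES = ['component', 'guide'];

// Hypothetical helper: returns one message per document with an invalid type.
function collectTypeFailures(docs, allowedTypes) {
  const failures = [];
  for (const doc of docs) {
    if (!allowedTypes.includes(doc.type)) {
      failures.push(`"${doc.id}" invalid type "${doc.type}"`);
    }
  }
  return failures;
}

// Inside the existing test, after docs is built in beforeAll:
//   expect(collectTypeFailures(docs, ALLOWED_TYPES)).toEqual([]);
```

With this approach the failure output lists every offending document at once, and docs can still be built in beforeAll.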

  blockquote: /^>\s?.*/gm,
  horizontalRule: /^(?:[-*_]){3,}\s*$/gm,
  emphasis: /[*_]{1,2}([^*_]+)[*_]{1,2}/g,
};

@braddialpad (Contributor):

This PATTERNS object duplicates the one in markdownParser.js. Is that intentional? If both are in use, they could drift out of sync in future changes.

@belumontoya (Collaborator, Author):

Nope, nicely spotted. I'm now thinking of replacing this with the remove-markdown package instead of handling it manually.


test('type field uses allowed values', () => {
  for (const doc of docs) {
    expect(doc.type, `"${doc.id}" type is null`).not.toBeNull();

@braddialpad (Contributor):

nit (optional): this first assertion is probably not necessary

- Fix non-deterministic date serialization: String(Date) is
  timezone-dependent, use toISOString().split('T')[0] for stable
  YYYY-MM-DD output
- Remove redundant null assertion in type field test