Use duckdb node api to compile malloy models and extract schema information to build context files for LLMs.
Pull request overview
This PR implements build-time generation of an llms.txt file to help LLMs understand the Malloy Data Explorer's data models and schema. The implementation uses the DuckDB Node.js API to compile Malloy models during the build process and extract schema information (sources, fields, queries) that is formatted into a standardized llms.txt file.
Changes:
- Added schema extraction module that compiles Malloy models and extracts metadata using DuckDB
- Created content generator that formats extracted schema into llms.txt format with site documentation
- Implemented Vite plugin that generates llms.txt at build time and serves it dynamically in dev mode
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 9 comments.
| File | Description |
|---|---|
| proposals/llm_context.md | Problem statement for LLM context generation feature |
| proposals/adr/001-llms-txt-build-time-generation.md | Architecture decision record documenting build-time vs runtime generation approach |
| src/llms-txt/types.ts | TypeScript interfaces for extracted model schema data structures |
| src/llms-txt/schema-extractor.ts | Core logic for compiling Malloy models with DuckDB and extracting schema information |
| src/llms-txt/generator.ts | Generates formatted llms.txt content from extracted schema |
| src/llms-txt/index.ts | Module exports for the llms-txt package |
| plugins/vite-plugin-llms-txt.ts | Vite plugin integrating llms.txt generation into build process |
| vite.config.ts | Adds llms-txt plugin to Vite configuration |
| tsconfig.node.json | Includes llms-txt directory in Node.js TypeScript compilation |
Context (quoted from the proposal document, verbatim):

> LLMs cannot directly analyze the site contents to understand the data models used and their schema, thus not able generate malloy queries.
>
> ## Propasal

Spelling error: "Propasal" should be "Proposal"

Suggested change:

```diff
-## Propasal
+## Proposal
```
Context (quoted from the proposal document, verbatim):

> ## Propasal
>
> Create `llms.txt` build at build time that uses the internal vite import.meta to gather context of malloy models and schema. Structure this for llm consumption in a single file so that llms can fully comprehend the data models support by the site, the site sturcture, how to issue arbitary queries and download data.

Spelling error: "sturcture" should be "structure"

Suggested change:

```diff
-Create `llms.txt` build at build time that uses the internal vite import.meta to gather context of malloy models and schema. Structure this for llm consumption in a single file so that llms can fully comprehend the data models support by the site, the site sturcture, how to issue arbitary queries and download data.
+Create `llms.txt` build at build time that uses the internal vite import.meta to gather context of malloy models and schema. Structure this for llm consumption in a single file so that llms can fully comprehend the data models support by the site, the site structure, how to issue arbitary queries and download data.
```
Context:

```ts
function generateHeader(siteTitle: string, basePath: string): string {
  const base = basePath.endsWith("/") ? basePath.slice(0, -1) : basePath;
  return `# ${siteTitle}

> Malloy Data Explorer - Static web app for exploring semantic data models
> All queries run in-browser using DuckDB WASM

**Site URL:** \`${base}/\``;
}

function generateOverview(
  _siteTitle: string,
  basePath: string,
  models: ExtractedModel[],
  dataFiles: string[],
  notebooks: string[],
): string {
  const base = basePath.endsWith("/") ? basePath.slice(0, -1) : basePath;

  // Content summary
  const contentItems = [
    `${String(models.length)} Malloy model${models.length !== 1 ? "s" : ""}`,
    `${String(dataFiles.length)} data file${dataFiles.length !== 1 ? "s" : ""}`,
    ...(notebooks.length > 0
      ? [
          `${String(notebooks.length)} notebook${notebooks.length !== 1 ? "s" : ""}`,
        ]
      : []),
  ];

  // Data files list (compact)
  const dataFilesList =
    dataFiles.length > 0 ? `\n\n**Data Files:** ${dataFiles.join(", ")}` : "";

  // Notebooks list (compact - just names)
  const notebooksList =
    notebooks.length > 0 ? `\n\n**Notebooks:** ${notebooks.join(", ")}` : "";

  return `## Overview

**Content:** ${contentItems.join(" • ")}
**Capabilities:** Browse schemas • Preview data • Build queries • Download results (CSV/JSON)${dataFilesList}${notebooksList}

## URL Patterns

All URLs with \`/#/\` prefix return HTML pages. \`/downloads/\` URLs return raw files.

| Pattern | Returns | Description |
|---------|---------|-------------|
| \`${base}/#/\` | HTML | Home - list all models |
| \`${base}/#/model/{model}\` | HTML | Model schema browser |
| \`${base}/#/model/{model}/preview/{source}\` | HTML | Preview source data (50 rows) |
| \`${base}/#/model/{model}/explorer/{source}\` | HTML | Interactive query builder |
| \`${base}/#/model/{model}/explorer/{source}?query={malloy}&run=true\` | HTML | Execute query, show results |
| \`${base}/#/model/{model}/query/{queryName}\` | HTML | Run named query, show results |
| \`${base}/#/notebook/{notebook}\` | HTML | View notebook with queries/visualizations |
| \`${base}/downloads/models/{model}.malloy\` | Text | Download model source file |
| \`${base}/downloads/notebooks/{notebook}.malloynb\` | Text | Download notebook file |
| \`${base}/downloads/data/{file}\` | File | Download data file (CSV/Parquet/JSON/Excel) |`;
}

function generateModelsSection(
  models: ExtractedModel[],
  basePath: string,
): string {
  if (models.length === 0) {
    return "## Models\n\nNo models available.";
  }

  const base = basePath.endsWith("/") ? basePath.slice(0, -1) : basePath;
```
The `basePath` trimming logic is duplicated in multiple functions (`generateHeader`, `generateOverview`, `generateModelsSection`). Consider extracting this into a helper function or normalizing the `basePath` once at the start of `generateLlmsTxtContent` to improve maintainability and reduce duplication.
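One way to realize this (a sketch only; `normalizeBasePath` is an illustrative name, not code from the PR):

```typescript
// Normalize once, then hand the trimmed base to each generator.
function normalizeBasePath(basePath: string): string {
  return basePath.endsWith("/") ? basePath.slice(0, -1) : basePath;
}

console.log(normalizeBasePath("/explorer/")); // "/explorer"
console.log(normalizeBasePath("/explorer")); // "/explorer"
```

`generateLlmsTxtContent` could call this once and pass the result down, letting each generator drop its local trimming line.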
Context:

```ts
      readURL: async (url: URL) => {
        // Handle file:// URLs pointing to our models directory
        const fileName = url.pathname.split("/").pop() ?? "";
        const filePath = path.join(modelsDir, fileName);

        try {
          await fs.access(filePath);
        } catch {
          throw new Error(`Model file not found: ${filePath}`);
        }

        const contents = await fs.readFile(filePath, "utf-8");
        return { contents };
      },
```
The URL reader implementation only extracts the filename from the URL path without validating the full URL structure. This could potentially allow path traversal if a malicious URL is constructed (e.g., `file:///../../etc/passwd`). Consider adding validation to ensure the URL is properly formed and the filename doesn't contain path traversal sequences like `..` or absolute paths. Additionally, verify that the resolved `filePath` stays within the `modelsDir` boundary.
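A minimal sketch of such a boundary check (the function name and error message are assumptions, not code from the PR):

```typescript
import * as path from "node:path";

// Resolve the candidate file against the models directory and reject any
// result that escapes it (covers "..", absolute paths, and mixed cases).
function resolveWithinModelsDir(modelsDir: string, fileName: string): string {
  const root = path.resolve(modelsDir);
  const resolved = path.resolve(root, fileName);
  // A relative path starting with ".." (or an absolute one) means we left root.
  const rel = path.relative(root, resolved);
  if (rel.startsWith("..") || path.isAbsolute(rel)) {
    throw new Error(`Refusing to read outside models directory: ${fileName}`);
  }
  return resolved;
}
```

The `readURL` callback could call this instead of `path.join` before `fs.access`, so a crafted `file://` URL cannot reach files outside `modelsDir`.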
Context:

```ts
  // Create runtime with a URL reader that loads from the filesystem
  const runtime = new malloy.SingleConnectionRuntime({
    connection,
    urlReader: {
      readURL: async (url: URL) => {
        // Handle file:// URLs pointing to our models directory
        const fileName = url.pathname.split("/").pop() ?? "";
        const filePath = path.join(modelsDir, fileName);

        try {
          await fs.access(filePath);
        } catch {
          throw new Error(`Model file not found: ${filePath}`);
        }

        const contents = await fs.readFile(filePath, "utf-8");
        return { contents };
      },
    },
  });

  // Process all models in parallel - they're read-only operations
  const results = await Promise.all(
    malloyFiles.map(async (filePath) => {
      const modelName = path.basename(filePath, ".malloy");
      const modelCode = await fs.readFile(filePath, "utf-8");

      try {
        const modelUrl = new URL(`file:///${modelName}.malloy`);
        const modelMaterializer = runtime.loadModel(modelUrl);
        const model = await modelMaterializer.getModel();

        return extractFromModel(modelName, model, modelCode);
      } catch (error) {
        // Log but continue with other models
        console.warn(
          `[llms.txt] Warning: Could not compile model ${modelName}:`,
          error instanceof Error ? error.message : error,
        );
        return null;
      }
    }),
  );

  // Clean up connection
  await connection.close();
```
If an error occurs during model processing (lines 71-91), the DuckDB connection might not be properly closed, leading to a resource leak. Consider wrapping the model processing logic in a try-finally block to ensure the connection is always closed, even if an error occurs during Promise.all execution.
Suggested change:

```diff
-  // Create runtime with a URL reader that loads from the filesystem
-  const runtime = new malloy.SingleConnectionRuntime({
-    connection,
-    urlReader: {
-      readURL: async (url: URL) => {
-        // Handle file:// URLs pointing to our models directory
-        const fileName = url.pathname.split("/").pop() ?? "";
-        const filePath = path.join(modelsDir, fileName);
-        try {
-          await fs.access(filePath);
-        } catch {
-          throw new Error(`Model file not found: ${filePath}`);
-        }
-        const contents = await fs.readFile(filePath, "utf-8");
-        return { contents };
-      },
-    },
-  });
-  // Process all models in parallel - they're read-only operations
-  const results = await Promise.all(
-    malloyFiles.map(async (filePath) => {
-      const modelName = path.basename(filePath, ".malloy");
-      const modelCode = await fs.readFile(filePath, "utf-8");
-      try {
-        const modelUrl = new URL(`file:///${modelName}.malloy`);
-        const modelMaterializer = runtime.loadModel(modelUrl);
-        const model = await modelMaterializer.getModel();
-        return extractFromModel(modelName, model, modelCode);
-      } catch (error) {
-        // Log but continue with other models
-        console.warn(
-          `[llms.txt] Warning: Could not compile model ${modelName}:`,
-          error instanceof Error ? error.message : error,
-        );
-        return null;
-      }
-    }),
-  );
-  // Clean up connection
-  await connection.close();
+  let results: (ExtractedModel | null)[] = [];
+  try {
+    // Create runtime with a URL reader that loads from the filesystem
+    const runtime = new malloy.SingleConnectionRuntime({
+      connection,
+      urlReader: {
+        readURL: async (url: URL) => {
+          // Handle file:// URLs pointing to our models directory
+          const fileName = url.pathname.split("/").pop() ?? "";
+          const filePath = path.join(modelsDir, fileName);
+          try {
+            await fs.access(filePath);
+          } catch {
+            throw new Error(`Model file not found: ${filePath}`);
+          }
+          const contents = await fs.readFile(filePath, "utf-8");
+          return { contents };
+        },
+      },
+    });
+    // Process all models in parallel - they're read-only operations
+    results = await Promise.all(
+      malloyFiles.map(async (filePath) => {
+        const modelName = path.basename(filePath, ".malloy");
+        const modelCode = await fs.readFile(filePath, "utf-8");
+        try {
+          const modelUrl = new URL(`file:///${modelName}.malloy`);
+          const modelMaterializer = runtime.loadModel(modelUrl);
+          const model = await modelMaterializer.getModel();
+          return extractFromModel(modelName, model, modelCode);
+        } catch (error) {
+          // Log but continue with other models
+          console.warn(
+            `[llms.txt] Warning: Could not compile model ${modelName}:`,
+            error instanceof Error ? error.message : error,
+          );
+          return null;
+        }
+      }),
+    );
+  } finally {
+    // Clean up connection even if an error occurs during processing
+    await connection.close();
+  }
```
Context:

```ts
    server.middlewares.use((req, res, next) => {
      if (req.url === "/llms.txt") {
        void (async () => {
          try {
            // Regenerate on each request in dev mode for hot reloading
            const content = await generateContent();
            res.setHeader("Content-Type", "text/plain; charset=utf-8");
            res.end(content);
          } catch (error) {
            console.error("[llms.txt] Error generating content:", error);
            res.statusCode = 500;
            res.end(
              `Error generating llms.txt: ${error instanceof Error ? error.message : String(error)}`,
            );
          }
        })();
```
The void IIFE pattern `void (async () => { ... })()` is used to handle the async operation in the middleware. While this works, there's a subtle issue: if an error is thrown after the response headers are sent but before `res.end()` is called, the response might be left hanging. Consider adding error handling around the entire async block to ensure the response is always properly closed, or use a safer pattern like awaiting the promise and catching errors at the top level.
Suggested change:

```diff
-    server.middlewares.use((req, res, next) => {
-      if (req.url === "/llms.txt") {
-        void (async () => {
-          try {
-            // Regenerate on each request in dev mode for hot reloading
-            const content = await generateContent();
-            res.setHeader("Content-Type", "text/plain; charset=utf-8");
-            res.end(content);
-          } catch (error) {
-            console.error("[llms.txt] Error generating content:", error);
-            res.statusCode = 500;
-            res.end(
-              `Error generating llms.txt: ${error instanceof Error ? error.message : String(error)}`,
-            );
-          }
-        })();
+    server.middlewares.use(async (req, res, next) => {
+      if (req.url === "/llms.txt") {
+        try {
+          // Regenerate on each request in dev mode for hot reloading
+          const content = await generateContent();
+          res.setHeader("Content-Type", "text/plain; charset=utf-8");
+          res.end(content);
+        } catch (error) {
+          console.error("[llms.txt] Error generating content:", error);
+          res.statusCode = 500;
+          res.end(
+            `Error generating llms.txt: ${error instanceof Error ? error.message : String(error)}`,
+          );
+        }
```
Context:

```ts
export async function extractModelsSchema(
  modelsDir: string,
): Promise<ExtractedModel[]> {
  try {
    await fs.access(modelsDir);
  } catch {
    return [];
  }

  const files = await fs.readdir(modelsDir);
  const malloyFiles = files
    .filter((f) => f.endsWith(".malloy"))
    .map((f) => path.join(modelsDir, f));

  if (malloyFiles.length === 0) {
    return [];
  }

  // Create a DuckDB connection for model compilation
  // Set workingDirectory so DuckDB can find data files referenced in models
  const connection = new DuckDBConnection({
    name: "llms-txt-build",
    workingDirectory: modelsDir,
  });

  // Create runtime with a URL reader that loads from the filesystem
  const runtime = new malloy.SingleConnectionRuntime({
    connection,
    urlReader: {
      readURL: async (url: URL) => {
        // Handle file:// URLs pointing to our models directory
        const fileName = url.pathname.split("/").pop() ?? "";
        const filePath = path.join(modelsDir, fileName);

        try {
          await fs.access(filePath);
        } catch {
          throw new Error(`Model file not found: ${filePath}`);
        }

        const contents = await fs.readFile(filePath, "utf-8");
        return { contents };
      },
    },
  });

  // Process all models in parallel - they're read-only operations
  const results = await Promise.all(
    malloyFiles.map(async (filePath) => {
      const modelName = path.basename(filePath, ".malloy");
      const modelCode = await fs.readFile(filePath, "utf-8");

      try {
        const modelUrl = new URL(`file:///${modelName}.malloy`);
        const modelMaterializer = runtime.loadModel(modelUrl);
        const model = await modelMaterializer.getModel();

        return extractFromModel(modelName, model, modelCode);
      } catch (error) {
        // Log but continue with other models
        console.warn(
          `[llms.txt] Warning: Could not compile model ${modelName}:`,
          error instanceof Error ? error.message : error,
        );
        return null;
      }
    }),
  );

  // Clean up connection
  await connection.close();

  // Filter out failed models (null values) and return
  return results.filter((result): result is ExtractedModel => result !== null);
}
```
The new llms-txt module lacks test coverage. Based on the existing test patterns in this repository (`tests/download-utils.test.ts`, `tests/notebook-parser.test.ts`, `tests/schema-utils.test.ts`), consider adding unit tests for the schema extraction and content generation functions. Key areas to test include: `extractModelsSchema` handling of various model structures, error handling for malformed models, `getDataFiles` and `getNotebooks` file filtering logic, and `generateLlmsTxtContent` output format.
Context:

```ts
function generateOverview(
  _siteTitle: string,
  basePath: string,
  models: ExtractedModel[],
  dataFiles: string[],
  notebooks: string[],
): string {
```
The parameter `_siteTitle` is prefixed with an underscore to indicate it's intentionally unused, which is a good practice. However, this parameter is not actually used anywhere in the function. Consider removing it from the function signature entirely, since the site title is not needed for the overview section's content generation.
Context:

```ts
      const modelCode = await fs.readFile(filePath, "utf-8");

      try {
        const modelUrl = new URL(`file:///${modelName}.malloy`);
```
The URL construction `` `file:///${modelName}.malloy` `` creates a root-level file URL that works only because the custom `urlReader` extracts just the filename. While functional, this is somewhat unconventional. Consider using a custom scheme similar to the runtime code (e.g., `` `file://models/${modelName}.malloy` ``) or adding a comment explaining why this URL format is used, to improve code clarity.
Suggested change:

```diff
-        const modelUrl = new URL(`file:///${modelName}.malloy`);
+        const modelUrl = new URL(`file://models/${modelName}.malloy`);
```