feat: LLM context file builder#137

Merged
aszenz merged 1 commit into master from feat/llm-txt-context
Feb 3, 2026

aszenz (Owner) commented Feb 3, 2026

Use duckdb node api to compile malloy models and
extract schema information to build context files for LLMs.

Copilot AI review requested due to automatic review settings February 3, 2026 22:36
aszenz force-pushed the feat/llm-txt-context branch from f273484 to b36297d February 3, 2026 22:38
aszenz merged commit 3f92d6e into master Feb 3, 2026
3 checks passed

Copilot AI left a comment

Pull request overview

This PR implements build-time generation of an llms.txt file to help LLMs understand the Malloy Data Explorer's data models and schema. The implementation uses the DuckDB Node.js API to compile Malloy models during the build process and extract schema information (sources, fields, queries) that is formatted into a standardized llms.txt file.

Changes:

  • Added schema extraction module that compiles Malloy models and extracts metadata using DuckDB
  • Created content generator that formats extracted schema into llms.txt format with site documentation
  • Implemented Vite plugin that generates llms.txt at build time and serves it dynamically in dev mode

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 9 comments.

| File | Description |
|------|-------------|
| proposals/llm_context.md | Problem statement for LLM context generation feature |
| proposals/adr/001-llms-txt-build-time-generation.md | Architecture decision record documenting build-time vs runtime generation approach |
| src/llms-txt/types.ts | TypeScript interfaces for extracted model schema data structures |
| src/llms-txt/schema-extractor.ts | Core logic for compiling Malloy models with DuckDB and extracting schema information |
| src/llms-txt/generator.ts | Generates formatted llms.txt content from extracted schema |
| src/llms-txt/index.ts | Module exports for the llms-txt package |
| plugins/vite-plugin-llms-txt.ts | Vite plugin integrating llms.txt generation into build process |
| vite.config.ts | Adds llms-txt plugin to Vite configuration |
| tsconfig.node.json | Includes llms-txt directory in Node.js TypeScript compilation |

LLMs cannot directly analyze the site contents to understand the data models used and their schema, thus not able generate malloy queries.

## Propasal

Copilot AI Feb 3, 2026

Spelling error: "Propasal" should be "Proposal"

Suggested change
- ## Propasal
+ ## Proposal

## Propasal

Create `llms.txt` build at build time that uses the internal vite import.meta to gather context of malloy models and schema. Structure this for llm consumption in a single file so that llms can fully comprehend the data models support by the site, the site sturcture, how to issue arbitary queries and download data.

Copilot AI Feb 3, 2026

Spelling error: "sturcture" should be "structure"

Suggested change
- Create `llms.txt` build at build time that uses the internal vite import.meta to gather context of malloy models and schema. Structure this for llm consumption in a single file so that llms can fully comprehend the data models support by the site, the site sturcture, how to issue arbitary queries and download data.
+ Create `llms.txt` build at build time that uses the internal vite import.meta to gather context of malloy models and schema. Structure this for llm consumption in a single file so that llms can fully comprehend the data models support by the site, the site structure, how to issue arbitary queries and download data.

Comment on lines +30 to +99
function generateHeader(siteTitle: string, basePath: string): string {
  const base = basePath.endsWith("/") ? basePath.slice(0, -1) : basePath;
  return `# ${siteTitle}

> Malloy Data Explorer - Static web app for exploring semantic data models
> All queries run in-browser using DuckDB WASM

**Site URL:** \`${base}/\``;
}

function generateOverview(
  _siteTitle: string,
  basePath: string,
  models: ExtractedModel[],
  dataFiles: string[],
  notebooks: string[],
): string {
  const base = basePath.endsWith("/") ? basePath.slice(0, -1) : basePath;

  // Content summary
  const contentItems = [
    `${String(models.length)} Malloy model${models.length !== 1 ? "s" : ""}`,
    `${String(dataFiles.length)} data file${dataFiles.length !== 1 ? "s" : ""}`,
    ...(notebooks.length > 0
      ? [
          `${String(notebooks.length)} notebook${notebooks.length !== 1 ? "s" : ""}`,
        ]
      : []),
  ];

  // Data files list (compact)
  const dataFilesList =
    dataFiles.length > 0 ? `\n\n**Data Files:** ${dataFiles.join(", ")}` : "";

  // Notebooks list (compact - just names)
  const notebooksList =
    notebooks.length > 0 ? `\n\n**Notebooks:** ${notebooks.join(", ")}` : "";

  return `## Overview

**Content:** ${contentItems.join(" • ")}
**Capabilities:** Browse schemas • Preview data • Build queries • Download results (CSV/JSON)${dataFilesList}${notebooksList}

## URL Patterns

All URLs with \`/#/\` prefix return HTML pages. \`/downloads/\` URLs return raw files.

| Pattern | Returns | Description |
|---------|---------|-------------|
| \`${base}/#/\` | HTML | Home - list all models |
| \`${base}/#/model/{model}\` | HTML | Model schema browser |
| \`${base}/#/model/{model}/preview/{source}\` | HTML | Preview source data (50 rows) |
| \`${base}/#/model/{model}/explorer/{source}\` | HTML | Interactive query builder |
| \`${base}/#/model/{model}/explorer/{source}?query={malloy}&run=true\` | HTML | Execute query, show results |
| \`${base}/#/model/{model}/query/{queryName}\` | HTML | Run named query, show results |
| \`${base}/#/notebook/{notebook}\` | HTML | View notebook with queries/visualizations |
| \`${base}/downloads/models/{model}.malloy\` | Text | Download model source file |
| \`${base}/downloads/notebooks/{notebook}.malloynb\` | Text | Download notebook file |
| \`${base}/downloads/data/{file}\` | File | Download data file (CSV/Parquet/JSON/Excel) |`;
}

function generateModelsSection(
  models: ExtractedModel[],
  basePath: string,
): string {
  if (models.length === 0) {
    return "## Models\n\nNo models available.";
  }

  const base = basePath.endsWith("/") ? basePath.slice(0, -1) : basePath;

Copilot AI Feb 3, 2026

The basePath trimming logic is duplicated in multiple functions (generateHeader, generateOverview, generateModelsSection). Consider extracting this into a helper function or normalizing the basePath once at the start of generateLlmsTxtContent to improve maintainability and reduce duplication.
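
A minimal sketch of the extraction suggested here. The names `normalizeBasePath` and `siteUrlLine` are hypothetical (neither appears in the PR); the helper body is the same ternary that the three functions repeat, so behavior should stay unchanged if it were called once and the result passed down:

```typescript
/**
 * Hypothetical helper (not part of the PR): strip one trailing slash so that
 * `${base}/...` never produces a double slash.
 */
function normalizeBasePath(basePath: string): string {
  return basePath.endsWith("/") ? basePath.slice(0, -1) : basePath;
}

// Illustrative use only, mirroring the "Site URL" line built in generateHeader.
function siteUrlLine(basePath: string): string {
  const base = normalizeBasePath(basePath);
  return `**Site URL:** \`${base}/\``;
}
```

Note that a root base path of `/` normalizes to the empty string, which is what the existing ternary already does, so `${base}/` still renders as `/`.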

Comment on lines +53 to +66
readURL: async (url: URL) => {
  // Handle file:// URLs pointing to our models directory
  const fileName = url.pathname.split("/").pop() ?? "";
  const filePath = path.join(modelsDir, fileName);

  try {
    await fs.access(filePath);
  } catch {
    throw new Error(`Model file not found: ${filePath}`);
  }

  const contents = await fs.readFile(filePath, "utf-8");
  return { contents };
},

Copilot AI Feb 3, 2026

The URL reader implementation only extracts the filename from the URL path without validating the full URL structure. This could potentially allow path traversal if a malicious URL is constructed (e.g., file:///../../etc/passwd). Consider adding validation to ensure the URL is properly formed and the filename doesn't contain path traversal sequences like .. or absolute paths. Additionally, verify that the resolved filePath stays within the modelsDir boundary.
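
One way the boundary check described here could look, as a sketch rather than the PR's actual fix. `resolveModelPath` is a hypothetical name; it assumes the reader should only ever serve a single file name directly inside `modelsDir`, and it is deliberately conservative (any name containing `..` is rejected):

```typescript
import * as path from "node:path";

/**
 * Hypothetical helper (not part of the PR): map a model URL to a file inside
 * modelsDir, throwing if the name is malformed or escapes that directory.
 */
function resolveModelPath(modelsDir: string, url: URL): string {
  // Keep only the last path segment, then decode it so that an encoded
  // separator such as %2F becomes visible to the checks below.
  // decodeURIComponent itself throws on malformed percent sequences.
  const rawName = url.pathname.split("/").pop() ?? "";
  const fileName = decodeURIComponent(rawName);

  if (
    fileName === "" ||
    fileName.includes("/") ||
    fileName.includes("\\") ||
    fileName.includes("..")
  ) {
    throw new Error(`Invalid model file name: ${rawName}`);
  }

  const root = path.resolve(modelsDir);
  const resolved = path.resolve(root, fileName);

  // path.relative returns a ".."-prefixed (or absolute) path when
  // `resolved` does not live under `root`.
  const relative = path.relative(root, resolved);
  if (relative.startsWith("..") || path.isAbsolute(relative)) {
    throw new Error(`Model path escapes models directory: ${rawName}`);
  }
  return resolved;
}
```

The name check and the `path.relative` check overlap on purpose: the first gives a clear error for obviously bad input, the second guards the resolved path itself.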

Comment on lines +49 to +95
// Create runtime with a URL reader that loads from the filesystem
const runtime = new malloy.SingleConnectionRuntime({
  connection,
  urlReader: {
    readURL: async (url: URL) => {
      // Handle file:// URLs pointing to our models directory
      const fileName = url.pathname.split("/").pop() ?? "";
      const filePath = path.join(modelsDir, fileName);

      try {
        await fs.access(filePath);
      } catch {
        throw new Error(`Model file not found: ${filePath}`);
      }

      const contents = await fs.readFile(filePath, "utf-8");
      return { contents };
    },
  },
});

// Process all models in parallel - they're read-only operations
const results = await Promise.all(
  malloyFiles.map(async (filePath) => {
    const modelName = path.basename(filePath, ".malloy");
    const modelCode = await fs.readFile(filePath, "utf-8");

    try {
      const modelUrl = new URL(`file:///${modelName}.malloy`);
      const modelMaterializer = runtime.loadModel(modelUrl);
      const model = await modelMaterializer.getModel();

      return extractFromModel(modelName, model, modelCode);
    } catch (error) {
      // Log but continue with other models
      console.warn(
        `[llms.txt] Warning: Could not compile model ${modelName}:`,
        error instanceof Error ? error.message : error,
      );
      return null;
    }
  }),
);

// Clean up connection
await connection.close();

Copilot AI Feb 3, 2026

If an error occurs during model processing (lines 71-91), the DuckDB connection might not be properly closed, leading to a resource leak. Consider wrapping the model processing logic in a try-finally block to ensure the connection is always closed, even if an error occurs during Promise.all execution.

Suggested change
- // Create runtime with a URL reader that loads from the filesystem
- const runtime = new malloy.SingleConnectionRuntime({
-   connection,
-   urlReader: {
-     readURL: async (url: URL) => {
-       // Handle file:// URLs pointing to our models directory
-       const fileName = url.pathname.split("/").pop() ?? "";
-       const filePath = path.join(modelsDir, fileName);
-       try {
-         await fs.access(filePath);
-       } catch {
-         throw new Error(`Model file not found: ${filePath}`);
-       }
-       const contents = await fs.readFile(filePath, "utf-8");
-       return { contents };
-     },
-   },
- });
- // Process all models in parallel - they're read-only operations
- const results = await Promise.all(
-   malloyFiles.map(async (filePath) => {
-     const modelName = path.basename(filePath, ".malloy");
-     const modelCode = await fs.readFile(filePath, "utf-8");
-     try {
-       const modelUrl = new URL(`file:///${modelName}.malloy`);
-       const modelMaterializer = runtime.loadModel(modelUrl);
-       const model = await modelMaterializer.getModel();
-       return extractFromModel(modelName, model, modelCode);
-     } catch (error) {
-       // Log but continue with other models
-       console.warn(
-         `[llms.txt] Warning: Could not compile model ${modelName}:`,
-         error instanceof Error ? error.message : error,
-       );
-       return null;
-     }
-   }),
- );
- // Clean up connection
- await connection.close();
+ let results: (ExtractedModel | null)[] = [];
+ try {
+   // Create runtime with a URL reader that loads from the filesystem
+   const runtime = new malloy.SingleConnectionRuntime({
+     connection,
+     urlReader: {
+       readURL: async (url: URL) => {
+         // Handle file:// URLs pointing to our models directory
+         const fileName = url.pathname.split("/").pop() ?? "";
+         const filePath = path.join(modelsDir, fileName);
+         try {
+           await fs.access(filePath);
+         } catch {
+           throw new Error(`Model file not found: ${filePath}`);
+         }
+         const contents = await fs.readFile(filePath, "utf-8");
+         return { contents };
+       },
+     },
+   });
+   // Process all models in parallel - they're read-only operations
+   results = await Promise.all(
+     malloyFiles.map(async (filePath) => {
+       const modelName = path.basename(filePath, ".malloy");
+       const modelCode = await fs.readFile(filePath, "utf-8");
+       try {
+         const modelUrl = new URL(`file:///${modelName}.malloy`);
+         const modelMaterializer = runtime.loadModel(modelUrl);
+         const model = await modelMaterializer.getModel();
+         return extractFromModel(modelName, model, modelCode);
+       } catch (error) {
+         // Log but continue with other models
+         console.warn(
+           `[llms.txt] Warning: Could not compile model ${modelName}:`,
+           error instanceof Error ? error.message : error,
+         );
+         return null;
+       }
+     }),
+   );
+ } finally {
+   // Clean up connection even if an error occurs during processing
+   await connection.close();
+ }

Comment on lines +59 to +74
server.middlewares.use((req, res, next) => {
  if (req.url === "/llms.txt") {
    void (async () => {
      try {
        // Regenerate on each request in dev mode for hot reloading
        const content = await generateContent();
        res.setHeader("Content-Type", "text/plain; charset=utf-8");
        res.end(content);
      } catch (error) {
        console.error("[llms.txt] Error generating content:", error);
        res.statusCode = 500;
        res.end(
          `Error generating llms.txt: ${error instanceof Error ? error.message : String(error)}`,
        );
      }
    })();

Copilot AI Feb 3, 2026

The void IIFE pattern void (async () => { ... })() is used to handle the async operation in the middleware. While this works, there's a subtle issue: if an error is thrown after the response headers are sent but before res.end() is called, the response might be left hanging. Consider adding error handling around the entire async block to ensure the response is always properly closed, or use a safer pattern like awaiting the promise and catching errors at the top level.

Suggested change
- server.middlewares.use((req, res, next) => {
-   if (req.url === "/llms.txt") {
-     void (async () => {
-       try {
-         // Regenerate on each request in dev mode for hot reloading
-         const content = await generateContent();
-         res.setHeader("Content-Type", "text/plain; charset=utf-8");
-         res.end(content);
-       } catch (error) {
-         console.error("[llms.txt] Error generating content:", error);
-         res.statusCode = 500;
-         res.end(
-           `Error generating llms.txt: ${error instanceof Error ? error.message : String(error)}`,
-         );
-       }
-     })();
+ server.middlewares.use(async (req, res, next) => {
+   if (req.url === "/llms.txt") {
+     try {
+       // Regenerate on each request in dev mode for hot reloading
+       const content = await generateContent();
+       res.setHeader("Content-Type", "text/plain; charset=utf-8");
+       res.end(content);
+     } catch (error) {
+       console.error("[llms.txt] Error generating content:", error);
+       res.statusCode = 500;
+       res.end(
+         `Error generating llms.txt: ${error instanceof Error ? error.message : String(error)}`,
+       );
+     }

Comment on lines +24 to +98
export async function extractModelsSchema(
  modelsDir: string,
): Promise<ExtractedModel[]> {
  try {
    await fs.access(modelsDir);
  } catch {
    return [];
  }

  const files = await fs.readdir(modelsDir);
  const malloyFiles = files
    .filter((f) => f.endsWith(".malloy"))
    .map((f) => path.join(modelsDir, f));

  if (malloyFiles.length === 0) {
    return [];
  }

  // Create a DuckDB connection for model compilation
  // Set workingDirectory so DuckDB can find data files referenced in models
  const connection = new DuckDBConnection({
    name: "llms-txt-build",
    workingDirectory: modelsDir,
  });

  // Create runtime with a URL reader that loads from the filesystem
  const runtime = new malloy.SingleConnectionRuntime({
    connection,
    urlReader: {
      readURL: async (url: URL) => {
        // Handle file:// URLs pointing to our models directory
        const fileName = url.pathname.split("/").pop() ?? "";
        const filePath = path.join(modelsDir, fileName);

        try {
          await fs.access(filePath);
        } catch {
          throw new Error(`Model file not found: ${filePath}`);
        }

        const contents = await fs.readFile(filePath, "utf-8");
        return { contents };
      },
    },
  });

  // Process all models in parallel - they're read-only operations
  const results = await Promise.all(
    malloyFiles.map(async (filePath) => {
      const modelName = path.basename(filePath, ".malloy");
      const modelCode = await fs.readFile(filePath, "utf-8");

      try {
        const modelUrl = new URL(`file:///${modelName}.malloy`);
        const modelMaterializer = runtime.loadModel(modelUrl);
        const model = await modelMaterializer.getModel();

        return extractFromModel(modelName, model, modelCode);
      } catch (error) {
        // Log but continue with other models
        console.warn(
          `[llms.txt] Warning: Could not compile model ${modelName}:`,
          error instanceof Error ? error.message : error,
        );
        return null;
      }
    }),
  );

  // Clean up connection
  await connection.close();

  // Filter out failed models (null values) and return
  return results.filter((result): result is ExtractedModel => result !== null);
}

Copilot AI Feb 3, 2026

The new llms-txt module lacks test coverage. Based on the existing test patterns in this repository (tests/download-utils.test.ts, tests/notebook-parser.test.ts, tests/schema-utils.test.ts), consider adding unit tests for the schema extraction and content generation functions. Key areas to test include: extractModelsSchema handling of various model structures, error handling for malformed models, getDataFiles and getNotebooks file filtering logic, and generateLlmsTxtContent output format.
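
As a rough illustration of the kind of test this asks for, the sketch below pulls the `.malloy` filter from `extractModelsSchema` into a pure function and checks it. `selectMalloyFiles` is a hypothetical name, and because the repository's test runner is not shown in this thread, the checks use plain `node:assert` instead of a specific framework:

```typescript
import { deepStrictEqual } from "node:assert";
import * as path from "node:path";

/**
 * Hypothetical pure helper (not part of the PR) mirroring the filter and map
 * steps at the top of extractModelsSchema.
 */
function selectMalloyFiles(files: string[], modelsDir: string): string[] {
  return files
    .filter((f) => f.endsWith(".malloy"))
    .map((f) => path.join(modelsDir, f));
}

// Notebooks (.malloynb) and data files are not treated as models.
deepStrictEqual(
  selectMalloyFiles(["a.malloy", "b.malloynb", "c.csv"], "models"),
  [path.join("models", "a.malloy")],
);

// An empty directory listing yields no models.
deepStrictEqual(selectMalloyFiles([], "models"), []);
```

Keeping the file selection pure means it can be tested without a DuckDB connection; compilation and error handling for malformed models would still need tests that exercise the real runtime.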

Comment on lines +40 to +46
function generateOverview(
  _siteTitle: string,
  basePath: string,
  models: ExtractedModel[],
  dataFiles: string[],
  notebooks: string[],
): string {

Copilot AI Feb 3, 2026

The parameter _siteTitle is prefixed with an underscore to indicate it's intentionally unused, which is a good practice. However, this parameter is not actually used anywhere in the function. Consider removing it from the function signature entirely since the site title is not needed for the overview section's content generation.

const modelCode = await fs.readFile(filePath, "utf-8");

try {
  const modelUrl = new URL(`file:///${modelName}.malloy`);

Copilot AI Feb 3, 2026

The URL construction file:///${modelName}.malloy creates a root-level file URL that works only because the custom urlReader extracts just the filename. While functional, this is somewhat unconventional. Consider using a custom scheme similar to the runtime code (e.g., file://models/${modelName}.malloy) or adding a comment explaining why this URL format is used, to improve code clarity.

Suggested change
- const modelUrl = new URL(`file:///${modelName}.malloy`);
+ const modelUrl = new URL(`file://models/${modelName}.malloy`);
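
For anyone weighing the two forms, it may help to see how the standard `URL` parser reads them. In the sketch below the model name `flights` is a made-up example and `lastSegment` is a hypothetical stand-in for the file-name extraction in the `urlReader`. Under the WHATWG URL rules, the text right after `file://` is parsed as a host, so `models` ends up in `host` and not in `pathname`:

```typescript
// "flights" is an illustrative model name, not one taken from the PR.
const rootLevel = new URL("file:///flights.malloy");
const hostForm = new URL("file://models/flights.malloy");

// Hypothetical stand-in for the urlReader's file-name extraction.
const lastSegment = (url: URL): string => url.pathname.split("/").pop() ?? "";

console.log(rootLevel.host, rootLevel.pathname); // "" and "/flights.malloy"
console.log(hostForm.host, hostForm.pathname); // "models" and "/flights.malloy"
console.log(lastSegment(rootLevel) === lastSegment(hostForm)); // true
```

Because the reader only looks at the last path segment, both forms resolve to the same file; the difference is whether `models` is carried as a host name or omitted.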
