Use duckdb node api to compile malloy models and extract schema information to build context files for LLMs.
Pull request overview
This PR implements build-time generation of an llms.txt file to help LLMs understand the Malloy Data Explorer's data models and schema. The implementation uses the DuckDB Node.js API to compile Malloy models during the build process and extract schema information (sources, fields, queries) that is formatted into a standardized llms.txt file.
Changes:
- Added schema extraction module that compiles Malloy models and extracts metadata using DuckDB
- Created content generator that formats extracted schema into llms.txt format with site documentation
- Implemented Vite plugin that generates llms.txt at build time and serves it dynamically in dev mode
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 9 comments.
| File | Description |
|---|---|
| proposals/llm_context.md | Problem statement for LLM context generation feature |
| proposals/adr/001-llms-txt-build-time-generation.md | Architecture decision record documenting build-time vs runtime generation approach |
| src/llms-txt/types.ts | TypeScript interfaces for extracted model schema data structures |
| src/llms-txt/schema-extractor.ts | Core logic for compiling Malloy models with DuckDB and extracting schema information |
| src/llms-txt/generator.ts | Generates formatted llms.txt content from extracted schema |
| src/llms-txt/index.ts | Module exports for the llms-txt package |
| plugins/vite-plugin-llms-txt.ts | Vite plugin integrating llms.txt generation into build process |
| vite.config.ts | Adds llms-txt plugin to Vite configuration |
| tsconfig.node.json | Includes llms-txt directory in Node.js TypeScript compilation |
Context (quoted from the proposal document, verbatim):

> LLMs cannot directly analyze the site contents to understand the data models used and their schema, thus not able generate malloy queries.
>
> ## Propasal

Spelling error: "Propasal" should be "Proposal"

Suggested change:

```diff
-## Propasal
+## Proposal
```
Context (quoted from the proposal document, verbatim):

> ## Propasal
>
> Create `llms.txt` build at build time that uses the internal vite import.meta to gather context of malloy models and schema. Structure this for llm consumption in a single file so that llms can fully comprehend the data models support by the site, the site sturcture, how to issue arbitary queries and download data.

Spelling error: "sturcture" should be "structure"

Suggested change:

```diff
-Create `llms.txt` build at build time that uses the internal vite import.meta to gather context of malloy models and schema. Structure this for llm consumption in a single file so that llms can fully comprehend the data models support by the site, the site sturcture, how to issue arbitary queries and download data.
+Create `llms.txt` build at build time that uses the internal vite import.meta to gather context of malloy models and schema. Structure this for llm consumption in a single file so that llms can fully comprehend the data models support by the site, the site structure, how to issue arbitary queries and download data.
```
Context:

```ts
function generateHeader(siteTitle: string, basePath: string): string {
  const base = basePath.endsWith("/") ? basePath.slice(0, -1) : basePath;
  return `# ${siteTitle}

> Malloy Data Explorer - Static web app for exploring semantic data models
> All queries run in-browser using DuckDB WASM

**Site URL:** \`${base}/\``;
}

function generateOverview(
  _siteTitle: string,
  basePath: string,
  models: ExtractedModel[],
  dataFiles: string[],
  notebooks: string[],
): string {
  const base = basePath.endsWith("/") ? basePath.slice(0, -1) : basePath;

  // Content summary
  const contentItems = [
    `${String(models.length)} Malloy model${models.length !== 1 ? "s" : ""}`,
    `${String(dataFiles.length)} data file${dataFiles.length !== 1 ? "s" : ""}`,
    ...(notebooks.length > 0
      ? [
          `${String(notebooks.length)} notebook${notebooks.length !== 1 ? "s" : ""}`,
        ]
      : []),
  ];

  // Data files list (compact)
  const dataFilesList =
    dataFiles.length > 0 ? `\n\n**Data Files:** ${dataFiles.join(", ")}` : "";

  // Notebooks list (compact - just names)
  const notebooksList =
    notebooks.length > 0 ? `\n\n**Notebooks:** ${notebooks.join(", ")}` : "";

  return `## Overview

**Content:** ${contentItems.join(" • ")}
**Capabilities:** Browse schemas • Preview data • Build queries • Download results (CSV/JSON)${dataFilesList}${notebooksList}

## URL Patterns

All URLs with \`/#/\` prefix return HTML pages. \`/downloads/\` URLs return raw files.

| Pattern | Returns | Description |
|---------|---------|-------------|
| \`${base}/#/\` | HTML | Home - list all models |
| \`${base}/#/model/{model}\` | HTML | Model schema browser |
| \`${base}/#/model/{model}/preview/{source}\` | HTML | Preview source data (50 rows) |
| \`${base}/#/model/{model}/explorer/{source}\` | HTML | Interactive query builder |
| \`${base}/#/model/{model}/explorer/{source}?query={malloy}&run=true\` | HTML | Execute query, show results |
| \`${base}/#/model/{model}/query/{queryName}\` | HTML | Run named query, show results |
| \`${base}/#/notebook/{notebook}\` | HTML | View notebook with queries/visualizations |
| \`${base}/downloads/models/{model}.malloy\` | Text | Download model source file |
| \`${base}/downloads/notebooks/{notebook}.malloynb\` | Text | Download notebook file |
| \`${base}/downloads/data/{file}\` | File | Download data file (CSV/Parquet/JSON/Excel) |`;
}

function generateModelsSection(
  models: ExtractedModel[],
  basePath: string,
): string {
  if (models.length === 0) {
    return "## Models\n\nNo models available.";
  }

  const base = basePath.endsWith("/") ? basePath.slice(0, -1) : basePath;
```
The `basePath` trimming logic is duplicated in multiple functions (`generateHeader`, `generateOverview`, `generateModelsSection`). Consider extracting this into a helper function or normalizing the `basePath` once at the start of `generateLlmsTxtContent` to improve maintainability and reduce duplication.
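One way to realize this (a sketch only; `normalizeBasePath` is an illustrative name, not code from the PR):

```typescript
// Normalize once, then hand the trimmed base to each generator.
function normalizeBasePath(basePath: string): string {
  return basePath.endsWith("/") ? basePath.slice(0, -1) : basePath;
}

console.log(normalizeBasePath("/explorer/")); // "/explorer"
console.log(normalizeBasePath("/explorer")); // "/explorer"
```

`generateLlmsTxtContent` could call this once and pass the result down, letting each generator drop its local trimming line.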
Context:

```ts
      readURL: async (url: URL) => {
        // Handle file:// URLs pointing to our models directory
        const fileName = url.pathname.split("/").pop() ?? "";
        const filePath = path.join(modelsDir, fileName);

        try {
          await fs.access(filePath);
        } catch {
          throw new Error(`Model file not found: ${filePath}`);
        }

        const contents = await fs.readFile(filePath, "utf-8");
        return { contents };
      },
```
The URL reader implementation only extracts the filename from the URL path without validating the full URL structure. This could potentially allow path traversal if a malicious URL is constructed (e.g., `file:///../../etc/passwd`). Consider adding validation to ensure the URL is properly formed and the filename doesn't contain path traversal sequences like `..` or absolute paths. Additionally, verify that the resolved `filePath` stays within the `modelsDir` boundary.
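A minimal sketch of such a boundary check (the function name and error message are assumptions, not code from the PR):

```typescript
import * as path from "node:path";

// Resolve the candidate file against the models directory and reject any
// result that escapes it (covers "..", absolute paths, and mixed cases).
function resolveWithinModelsDir(modelsDir: string, fileName: string): string {
  const root = path.resolve(modelsDir);
  const resolved = path.resolve(root, fileName);
  // A relative path starting with ".." (or an absolute one) means we left root.
  const rel = path.relative(root, resolved);
  if (rel.startsWith("..") || path.isAbsolute(rel)) {
    throw new Error(`Refusing to read outside models directory: ${fileName}`);
  }
  return resolved;
}
```

The `readURL` callback could call this instead of `path.join` before `fs.access`, so a crafted `file://` URL cannot reach files outside `modelsDir`.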
Context:

```ts
  // Create runtime with a URL reader that loads from the filesystem
  const runtime = new malloy.SingleConnectionRuntime({
    connection,
    urlReader: {
      readURL: async (url: URL) => {
        // Handle file:// URLs pointing to our models directory
        const fileName = url.pathname.split("/").pop() ?? "";
        const filePath = path.join(modelsDir, fileName);

        try {
          await fs.access(filePath);
        } catch {
          throw new Error(`Model file not found: ${filePath}`);
        }

        const contents = await fs.readFile(filePath, "utf-8");
        return { contents };
      },
    },
  });

  // Process all models in parallel - they're read-only operations
  const results = await Promise.all(
    malloyFiles.map(async (filePath) => {
      const modelName = path.basename(filePath, ".malloy");
      const modelCode = await fs.readFile(filePath, "utf-8");

      try {
        const modelUrl = new URL(`file:///${modelName}.malloy`);
        const modelMaterializer = runtime.loadModel(modelUrl);
        const model = await modelMaterializer.getModel();

        return extractFromModel(modelName, model, modelCode);
      } catch (error) {
        // Log but continue with other models
        console.warn(
          `[llms.txt] Warning: Could not compile model ${modelName}:`,
          error instanceof Error ? error.message : error,
        );
        return null;
      }
    }),
  );

  // Clean up connection
  await connection.close();
```
If an error occurs during model processing (lines 71-91), the DuckDB connection might not be properly closed, leading to a resource leak. Consider wrapping the model processing logic in a try-finally block to ensure the connection is always closed, even if an error occurs during Promise.all execution.
Suggested change:

```diff
-  // Create runtime with a URL reader that loads from the filesystem
-  const runtime = new malloy.SingleConnectionRuntime({
-    connection,
-    urlReader: {
-      readURL: async (url: URL) => {
-        // Handle file:// URLs pointing to our models directory
-        const fileName = url.pathname.split("/").pop() ?? "";
-        const filePath = path.join(modelsDir, fileName);
-        try {
-          await fs.access(filePath);
-        } catch {
-          throw new Error(`Model file not found: ${filePath}`);
-        }
-        const contents = await fs.readFile(filePath, "utf-8");
-        return { contents };
-      },
-    },
-  });
-  // Process all models in parallel - they're read-only operations
-  const results = await Promise.all(
-    malloyFiles.map(async (filePath) => {
-      const modelName = path.basename(filePath, ".malloy");
-      const modelCode = await fs.readFile(filePath, "utf-8");
-      try {
-        const modelUrl = new URL(`file:///${modelName}.malloy`);
-        const modelMaterializer = runtime.loadModel(modelUrl);
-        const model = await modelMaterializer.getModel();
-        return extractFromModel(modelName, model, modelCode);
-      } catch (error) {
-        // Log but continue with other models
-        console.warn(
-          `[llms.txt] Warning: Could not compile model ${modelName}:`,
-          error instanceof Error ? error.message : error,
-        );
-        return null;
-      }
-    }),
-  );
-  // Clean up connection
-  await connection.close();
+  let results: (ExtractedModel | null)[] = [];
+  try {
+    // Create runtime with a URL reader that loads from the filesystem
+    const runtime = new malloy.SingleConnectionRuntime({
+      connection,
+      urlReader: {
+        readURL: async (url: URL) => {
+          // Handle file:// URLs pointing to our models directory
+          const fileName = url.pathname.split("/").pop() ?? "";
+          const filePath = path.join(modelsDir, fileName);
+          try {
+            await fs.access(filePath);
+          } catch {
+            throw new Error(`Model file not found: ${filePath}`);
+          }
+          const contents = await fs.readFile(filePath, "utf-8");
+          return { contents };
+        },
+      },
+    });
+    // Process all models in parallel - they're read-only operations
+    results = await Promise.all(
+      malloyFiles.map(async (filePath) => {
+        const modelName = path.basename(filePath, ".malloy");
+        const modelCode = await fs.readFile(filePath, "utf-8");
+        try {
+          const modelUrl = new URL(`file:///${modelName}.malloy`);
+          const modelMaterializer = runtime.loadModel(modelUrl);
+          const model = await modelMaterializer.getModel();
+          return extractFromModel(modelName, model, modelCode);
+        } catch (error) {
+          // Log but continue with other models
+          console.warn(
+            `[llms.txt] Warning: Could not compile model ${modelName}:`,
+            error instanceof Error ? error.message : error,
+          );
+          return null;
+        }
+      }),
+    );
+  } finally {
+    // Clean up connection even if an error occurs during processing
+    await connection.close();
+  }
```
Context:

```ts
    server.middlewares.use((req, res, next) => {
      if (req.url === "/llms.txt") {
        void (async () => {
          try {
            // Regenerate on each request in dev mode for hot reloading
            const content = await generateContent();
            res.setHeader("Content-Type", "text/plain; charset=utf-8");
            res.end(content);
          } catch (error) {
            console.error("[llms.txt] Error generating content:", error);
            res.statusCode = 500;
            res.end(
              `Error generating llms.txt: ${error instanceof Error ? error.message : String(error)}`,
            );
          }
        })();
```
The void IIFE pattern `void (async () => { ... })()` is used to handle the async operation in the middleware. While this works, there's a subtle issue: if an error is thrown after the response headers are sent but before `res.end()` is called, the response might be left hanging. Consider adding error handling around the entire async block to ensure the response is always properly closed, or use a safer pattern like awaiting the promise and catching errors at the top level.
Suggested change:

```diff
-    server.middlewares.use((req, res, next) => {
-      if (req.url === "/llms.txt") {
-        void (async () => {
-          try {
-            // Regenerate on each request in dev mode for hot reloading
-            const content = await generateContent();
-            res.setHeader("Content-Type", "text/plain; charset=utf-8");
-            res.end(content);
-          } catch (error) {
-            console.error("[llms.txt] Error generating content:", error);
-            res.statusCode = 500;
-            res.end(
-              `Error generating llms.txt: ${error instanceof Error ? error.message : String(error)}`,
-            );
-          }
-        })();
+    server.middlewares.use(async (req, res, next) => {
+      if (req.url === "/llms.txt") {
+        try {
+          // Regenerate on each request in dev mode for hot reloading
+          const content = await generateContent();
+          res.setHeader("Content-Type", "text/plain; charset=utf-8");
+          res.end(content);
+        } catch (error) {
+          console.error("[llms.txt] Error generating content:", error);
+          res.statusCode = 500;
+          res.end(
+            `Error generating llms.txt: ${error instanceof Error ? error.message : String(error)}`,
+          );
+        }
```
Context:

```ts
export async function extractModelsSchema(
  modelsDir: string,
): Promise<ExtractedModel[]> {
  try {
    await fs.access(modelsDir);
  } catch {
    return [];
  }

  const files = await fs.readdir(modelsDir);
  const malloyFiles = files
    .filter((f) => f.endsWith(".malloy"))
    .map((f) => path.join(modelsDir, f));

  if (malloyFiles.length === 0) {
    return [];
  }

  // Create a DuckDB connection for model compilation
  // Set workingDirectory so DuckDB can find data files referenced in models
  const connection = new DuckDBConnection({
    name: "llms-txt-build",
    workingDirectory: modelsDir,
  });

  // Create runtime with a URL reader that loads from the filesystem
  const runtime = new malloy.SingleConnectionRuntime({
    connection,
    urlReader: {
      readURL: async (url: URL) => {
        // Handle file:// URLs pointing to our models directory
        const fileName = url.pathname.split("/").pop() ?? "";
        const filePath = path.join(modelsDir, fileName);

        try {
          await fs.access(filePath);
        } catch {
          throw new Error(`Model file not found: ${filePath}`);
        }

        const contents = await fs.readFile(filePath, "utf-8");
        return { contents };
      },
    },
  });

  // Process all models in parallel - they're read-only operations
  const results = await Promise.all(
    malloyFiles.map(async (filePath) => {
      const modelName = path.basename(filePath, ".malloy");
      const modelCode = await fs.readFile(filePath, "utf-8");

      try {
        const modelUrl = new URL(`file:///${modelName}.malloy`);
        const modelMaterializer = runtime.loadModel(modelUrl);
        const model = await modelMaterializer.getModel();

        return extractFromModel(modelName, model, modelCode);
      } catch (error) {
        // Log but continue with other models
        console.warn(
          `[llms.txt] Warning: Could not compile model ${modelName}:`,
          error instanceof Error ? error.message : error,
        );
        return null;
      }
    }),
  );

  // Clean up connection
  await connection.close();

  // Filter out failed models (null values) and return
  return results.filter((result): result is ExtractedModel => result !== null);
}
```
The new llms-txt module lacks test coverage. Based on the existing test patterns in this repository (`tests/download-utils.test.ts`, `tests/notebook-parser.test.ts`, `tests/schema-utils.test.ts`), consider adding unit tests for the schema extraction and content generation functions. Key areas to test include: `extractModelsSchema` handling of various model structures, error handling for malformed models, `getDataFiles` and `getNotebooks` file filtering logic, and `generateLlmsTxtContent` output format.
Context:

```ts
function generateOverview(
  _siteTitle: string,
  basePath: string,
  models: ExtractedModel[],
  dataFiles: string[],
  notebooks: string[],
): string {
```
The parameter `_siteTitle` is prefixed with an underscore to indicate it's intentionally unused, which is a good practice. However, this parameter is not actually used anywhere in the function. Consider removing it from the function signature entirely, since the site title is not needed for the overview section's content generation.
Context:

```ts
      const modelCode = await fs.readFile(filePath, "utf-8");

      try {
        const modelUrl = new URL(`file:///${modelName}.malloy`);
```
The URL construction `` `file:///${modelName}.malloy` `` creates a root-level file URL that works only because the custom `urlReader` extracts just the filename. While functional, this is somewhat unconventional. Consider using a custom scheme similar to the runtime code (e.g., `` `file://models/${modelName}.malloy` ``) or adding a comment explaining why this URL format is used, to improve code clarity.
Suggested change:

```diff
-        const modelUrl = new URL(`file:///${modelName}.malloy`);
+        const modelUrl = new URL(`file://models/${modelName}.malloy`);
```