
Grid Incident RAG

A SQL-native RAG system that transforms California wildfire incident reports into searchable intelligence using BigQuery AI.


The Problem

California utilities have 153 PDF incident reports whose patterns could help prevent catastrophic wildfires. But that critical data is trapped in unstructured documents that take months to analyze manually while disasters unfold.

The Solution

This project is a complete RAG system, built with BigQuery AI, that processes regulatory PDFs into an intelligent knowledge base. Ask a question like "What caused the Dixie Fire?" and get a cited answer in under five seconds.

Architecture

📄 153 PDFs → 🤖 Document AI → 📝 Text Chunks → 🧠 Embeddings → 🔍 Vector Search → 💬 Gemini → ✨ Answers
     ↓              ↓              ↓              ↓               ↓              ↓          ↓
Cloud Storage → BigQuery → Structured Data → Vector Store → RAG Function → Dashboard → Insights

Key Innovation: Everything runs in SQL. No Python orchestration, no external services, no complex infrastructure.

BigQuery AI Implementation

🧠 AI Architect (Generative AI)

  • ML.GENERATE_TEXT: Powers the RAG Q&A system
  • AI.GENERATE_TABLE: Extracts structured data from PDFs
  • AI.GENERATE_BOOL/INT: Categorizes incidents and metrics (see the sketch after this list)
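
For illustration, a minimal sketch of the categorization pattern, assuming the parsed_pdf_chunks table from step 5 and the connection from step 2; the question text and the equipment_failure label are illustrative, not from the project:

SELECT
  uri,
  -- .result pulls the BOOL field out of the struct AI.GENERATE_BOOL returns
  AI.GENERATE_BOOL(
    ('Does this excerpt describe an equipment-failure ignition? ', content),
    connection_id => 'us.your-connection-name',
    endpoint => 'gemini-2.5-flash'
  ).result AS equipment_failure
FROM `your-dataset.parsed_pdf_chunks`
LIMIT 10;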

πŸ•΅οΈβ€β™€οΈ Semantic Detective (Vector Search)

  • ML.GENERATE_EMBEDDING: Creates 768-dimensional vectors
  • VECTOR_SEARCH: Semantic similarity for contextual retrieval
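
The retrieval step used by the RAG function in step 6 can also be run directly against the embeddings table; a sketch, with an illustrative query string:

SELECT base.uri, base.content, distance
FROM VECTOR_SEARCH(
  TABLE `your-dataset.pdf_embeddings`,
  'ml_generate_embedding_result',
  (SELECT ml_generate_embedding_result
   FROM ML.GENERATE_EMBEDDING(
     MODEL `your-dataset.embedding_model`,
     (SELECT 'vegetation contact with distribution lines' AS content))),
  top_k => 5
);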

πŸ–ΌοΈ Multimodal Pioneer (Object Tables)

  • Object Tables: SQL interface to PDFs in Cloud Storage
  • Document AI: Handles complex regulatory document layouts
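
Because the object table from step 5 behaves like any other table, the underlying PDFs can be inspected with plain SQL using the standard object-table metadata columns:

SELECT uri, size, content_type, updated
FROM `your-dataset.pdf_object_table`
ORDER BY size DESC;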

Results

  • 153 PDFs → 126 validated incidents (82% success rate)
  • $39.57 total processing cost
  • ~4.7 seconds average query response time
  • 3,229 searchable text chunks with embeddings
  • 3.76/5.0 correctness score, 100% faithfulness rate

Quick Start

Prerequisites

  • Google Cloud Project with billing enabled
  • BigQuery, Vertex AI, Document AI APIs enabled

1. Environment Setup

export PROJECT_ID="your-project-id"
export REGION="us"
export DATASET_ID="grid_incidents_rag_ds"
export BUCKET_NAME="your-bucket-name"

2. Create BigQuery Resources

# Create dataset
bq mk --location=$REGION --dataset $PROJECT_ID:$DATASET_ID

# Create connection
bq mk --connection --location=$REGION --connection_type=CLOUD_RESOURCE your-connection-name
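
One step the commands above don't show: the connection's service account generally needs Vertex AI access before the remote models in step 4 will work. A sketch, assuming the connection name from above (substitute the service account that bq prints):

# Look up the connection's service account
bq show --connection $PROJECT_ID.$REGION.your-connection-name

# Grant it Vertex AI access
gcloud projects add-iam-policy-binding $PROJECT_ID \
  --member="serviceAccount:CONNECTION_SERVICE_ACCOUNT" \
  --role="roles/aiplatform.user"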

3. Set Up Document AI

  1. Go to Google Cloud Console β†’ Document AI
  2. Create "Layout Parser" processor
  3. Copy the processor ID

4. Create Remote Models

-- Document AI Model
CREATE OR REPLACE MODEL `your-dataset.doc_parser_model`
REMOTE WITH CONNECTION `your-connection`
OPTIONS(
  remote_service_type = 'CLOUD_AI_DOCUMENT_V1',
  document_processor = 'projects/your-project/locations/us/processors/your-processor-id'
);

-- Embedding Model  
CREATE OR REPLACE MODEL `your-dataset.embedding_model`
REMOTE WITH CONNECTION `your-connection`
OPTIONS (ENDPOINT = 'textembedding-gecko@003');

-- LLM Model
CREATE OR REPLACE MODEL `your-dataset.llm_model`
REMOTE WITH CONNECTION `your-connection`
OPTIONS (ENDPOINT = 'gemini-2.5-flash');

5. Process Documents

-- Create object table
CREATE OR REPLACE EXTERNAL TABLE `your-dataset.pdf_object_table`
WITH CONNECTION `your-connection`
OPTIONS (
  object_metadata = "SIMPLE",
  uris = ["gs://your-bucket/*.pdf"]
);

-- Parse PDFs
CREATE OR REPLACE TABLE `your-dataset.parsed_pdf_chunks` AS
SELECT 
  uri,
  JSON_VALUE(chunk, '$.content') as content,
  ROW_NUMBER() OVER() as chunk_id
FROM 
  ML.PROCESS_DOCUMENT(
    MODEL `your-dataset.doc_parser_model`,
    TABLE `your-dataset.pdf_object_table`
  ),
  UNNEST(JSON_EXTRACT_ARRAY(ml_process_document_result, '$.chunked_document.chunks')) as chunk
WHERE LENGTH(JSON_VALUE(chunk, '$.content')) > 100;

-- Generate embeddings
CREATE OR REPLACE TABLE `your-dataset.pdf_embeddings` AS
SELECT * FROM ML.GENERATE_EMBEDDING(
  MODEL `your-dataset.embedding_model`,
  (SELECT content, chunk_id, uri FROM `your-dataset.parsed_pdf_chunks`)
);
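
A quick sanity check on the pipeline output; if everything worked, the counts should line up with the Results section (about 3,229 chunks across up to 153 documents, 768 dimensions each):

SELECT
  COUNT(*) AS chunks,
  COUNT(DISTINCT uri) AS documents,
  MIN(ARRAY_LENGTH(ml_generate_embedding_result)) AS embedding_dims
FROM `your-dataset.pdf_embeddings`;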

6. Create RAG Function

CREATE OR REPLACE FUNCTION `your-dataset.ask_llm`(question STRING)
RETURNS STRING
AS (
  (
    SELECT ml_generate_text_llm_result
    FROM ML.GENERATE_TEXT(
      MODEL `your-dataset.llm_model`,
      (
        SELECT CONCAT(
          'Based on the following context about California wildfire incidents, answer the question.\n\nContext:\n',
          STRING_AGG(base.content, '\n\n'),
          '\n\nQuestion: ', question,
          '\n\nAnswer based only on the provided context:'
        ) AS prompt
        -- retrieve the 8 chunks nearest the question, embedded with the same model as the corpus
        FROM VECTOR_SEARCH(
          TABLE `your-dataset.pdf_embeddings`,
          'ml_generate_embedding_result',
          (SELECT ml_generate_embedding_result FROM ML.GENERATE_EMBEDDING(
            MODEL `your-dataset.embedding_model`,
            (SELECT question AS content)
          )),
          top_k => 8
        )
      ),
      -- flatten_json_output surfaces the plain-text ml_generate_text_llm_result column
      STRUCT(TRUE AS flatten_json_output)
    )
  )
);

7. Test the System

SELECT `your-dataset.ask_llm`('What were the main causes of wildfire incidents?');

Key Files

  • walkthrough.ipynb - Complete demonstration with live BigQuery integration
  • streamlit_app/ - Interactive dashboard with RAG chat interface
  • kaggle_writeup.md - Formal project submission
  • survey_response.txt - BigQuery AI experience feedback

Production Patterns

Error Handling

-- Always use SAFE_CAST for AI-generated data
SAFE_CAST(ai_generated_date AS DATE) as incident_date
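
A fuller version of the same guard, counting how many AI-extracted dates fail to parse instead of letting one bad value kill the query; the table name is hypothetical, standing in for the structured-extraction output:

SELECT
  COUNTIF(SAFE_CAST(ai_generated_date AS DATE) IS NULL) AS unparseable_dates,
  COUNT(*) AS total_rows
FROM `your-dataset.extracted_incidents`;  -- hypothetical table name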

Embedding Consistency

Use a single embedding model across the entire pipeline; mixing models breaks vector search even when the dimensions match. The ask_llm function above embeds each question with the same embedding_model that produced the stored vectors for exactly this reason.

Parallel Processing

BigQuery processes all documents simultaneously:

-- Process all documents at once (fast)
INSERT INTO results
SELECT uri, ML.GENERATE_TEXT(...) FROM documents;

-- Not sequential processing (slow)
FOR doc IN (SELECT uri FROM documents) DO
  INSERT INTO results SELECT ML.GENERATE_TEXT(...);
END FOR;

Cost Analysis

Total processing cost for 153 documents: $39.57

  • Document AI parsing: ~$15
  • LLM generation calls: ~$20
  • Embedding generation: ~$5

That works out to roughly $0.26 per document, cheap enough to be viable in production at scale.

Contributing

Built by Reza Madani

Repository: https://github.com/srmadani/Grid-Incident-Rag
Medium Blog: Detailed walkthrough and insights

License

CC BY 4.0 - Freely available for commercial and non-commercial use.
