
Grid Incident RAG

A SQL-native RAG system that transforms California wildfire incident reports into searchable intelligence using BigQuery AI.


The Problem

California utilities have 153 PDF incident reports whose patterns could help prevent catastrophic wildfires. But that critical data is trapped in unstructured documents that take months to analyze manually while disasters unfold.

The Solution

This project is a complete RAG system, built with BigQuery AI, that processes regulatory PDFs into an intelligent knowledge base. Ask a question like "What caused the Dixie Fire?" and get a cited answer in under five seconds.

Architecture

📄 153 PDFs → 🤖 Document AI → 📝 Text Chunks → 🧠 Embeddings → 🔍 Vector Search → 💬 Gemini → ✨ Answers
     ↓              ↓              ↓              ↓               ↓              ↓          ↓
Cloud Storage → BigQuery → Structured Data → Vector Store → RAG Function → Dashboard → Insights

Key Innovation: Everything runs in SQL. No Python orchestration, no external services, no complex infrastructure.

BigQuery AI Implementation

🧠 AI Architect (Generative AI)

  • ML.GENERATE_TEXT: Powers the RAG Q&A system
  • AI.GENERATE_TABLE: Extracts structured data from PDFs
  • AI.GENERATE_BOOL/INT: Categorizes incidents and metrics (see the sketch after this list)
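
For illustration, a minimal sketch of the categorization pattern, assuming the parsed_pdf_chunks table from step 5 and the connection from step 2; the question text and the equipment_failure label are illustrative, not from the project:

SELECT
  uri,
  -- .result pulls the BOOL field out of the struct AI.GENERATE_BOOL returns
  AI.GENERATE_BOOL(
    ('Does this excerpt describe an equipment-failure ignition? ', content),
    connection_id => 'us.your-connection-name',
    endpoint => 'gemini-2.5-flash'
  ).result AS equipment_failure
FROM `your-dataset.parsed_pdf_chunks`
LIMIT 10;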

πŸ•΅οΈβ€β™€οΈ Semantic Detective (Vector Search)

  • ML.GENERATE_EMBEDDING: Creates 768-dimensional vectors
  • VECTOR_SEARCH: Semantic similarity for contextual retrieval
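
The retrieval step used by the RAG function in step 6 can also be run directly against the embeddings table; a sketch, with an illustrative query string:

SELECT base.uri, base.content, distance
FROM VECTOR_SEARCH(
  TABLE `your-dataset.pdf_embeddings`,
  'ml_generate_embedding_result',
  (SELECT ml_generate_embedding_result
   FROM ML.GENERATE_EMBEDDING(
     MODEL `your-dataset.embedding_model`,
     (SELECT 'vegetation contact with distribution lines' AS content))),
  top_k => 5
);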

πŸ–ΌοΈ Multimodal Pioneer (Object Tables)

  • Object Tables: SQL interface to PDFs in Cloud Storage
  • Document AI: Handles complex regulatory document layouts
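
Because the object table from step 5 behaves like any other table, the underlying PDFs can be inspected with plain SQL using the standard object-table metadata columns:

SELECT uri, size, content_type, updated
FROM `your-dataset.pdf_object_table`
ORDER BY size DESC;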

Results

  • 153 PDFs → 126 validated incidents (82% success rate)
  • $39.57 total processing cost
  • ~4.7 seconds average query response time
  • 3,229 searchable text chunks with embeddings
  • 3.76/5.0 correctness score, 100% faithfulness rate

Quick Start

Prerequisites

  • Google Cloud Project with billing enabled
  • BigQuery, Vertex AI, Document AI APIs enabled

1. Environment Setup

export PROJECT_ID="your-project-id"
export REGION="us"
export DATASET_ID="grid_incidents_rag_ds"
export BUCKET_NAME="your-bucket-name"

2. Create BigQuery Resources

# Create dataset
bq mk --location=$REGION --dataset $PROJECT_ID:$DATASET_ID

# Create connection
bq mk --connection --location=$REGION --connection_type=CLOUD_RESOURCE your-connection-name
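
One step the commands above don't show: the connection's service account generally needs Vertex AI access before the remote models in step 4 will work. A sketch, assuming the connection name from above (substitute the service account that bq prints):

# Look up the connection's service account
bq show --connection $PROJECT_ID.$REGION.your-connection-name

# Grant it Vertex AI access
gcloud projects add-iam-policy-binding $PROJECT_ID \
  --member="serviceAccount:CONNECTION_SERVICE_ACCOUNT" \
  --role="roles/aiplatform.user"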

3. Set Up Document AI

  1. Go to Google Cloud Console β†’ Document AI
  2. Create "Layout Parser" processor
  3. Copy the processor ID

4. Create Remote Models

-- Document AI Model
CREATE OR REPLACE MODEL `your-dataset.doc_parser_model`
REMOTE WITH CONNECTION `your-connection`
OPTIONS(
  remote_service_type = 'CLOUD_AI_DOCUMENT_V1',
  document_processor = 'projects/your-project/locations/us/processors/your-processor-id'
);

-- Embedding Model  
CREATE OR REPLACE MODEL `your-dataset.embedding_model`
REMOTE WITH CONNECTION `your-connection`
OPTIONS (ENDPOINT = 'textembedding-gecko@003');

-- LLM Model
CREATE OR REPLACE MODEL `your-dataset.llm_model`
REMOTE WITH CONNECTION `your-connection`
OPTIONS (ENDPOINT = 'gemini-2.5-flash');

5. Process Documents

-- Create object table
CREATE OR REPLACE EXTERNAL TABLE `your-dataset.pdf_object_table`
WITH CONNECTION `your-connection`
OPTIONS (
  object_metadata = "SIMPLE",
  uris = ["gs://your-bucket/*.pdf"]
);

-- Parse PDFs
CREATE OR REPLACE TABLE `your-dataset.parsed_pdf_chunks` AS
SELECT 
  uri,
  JSON_VALUE(chunk, '$.content') as content,
  ROW_NUMBER() OVER() as chunk_id
FROM 
  ML.PROCESS_DOCUMENT(
    MODEL `your-dataset.doc_parser_model`,
    TABLE `your-dataset.pdf_object_table`
  ),
  UNNEST(JSON_EXTRACT_ARRAY(ml_process_document_result, '$.chunked_document.chunks')) as chunk
WHERE LENGTH(JSON_VALUE(chunk, '$.content')) > 100;

-- Generate embeddings
CREATE OR REPLACE TABLE `your-dataset.pdf_embeddings` AS
SELECT * FROM ML.GENERATE_EMBEDDING(
  MODEL `your-dataset.embedding_model`,
  (SELECT content, chunk_id, uri FROM `your-dataset.parsed_pdf_chunks`)
);
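
A quick sanity check on the pipeline output; if everything worked, the counts should line up with the Results section (about 3,229 chunks across up to 153 documents, 768 dimensions each):

SELECT
  COUNT(*) AS chunks,
  COUNT(DISTINCT uri) AS documents,
  MIN(ARRAY_LENGTH(ml_generate_embedding_result)) AS embedding_dims
FROM `your-dataset.pdf_embeddings`;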

6. Create RAG Function

CREATE OR REPLACE FUNCTION `your-dataset.ask_llm`(question STRING)
RETURNS STRING
AS (
  (
    SELECT ml_generate_text_llm_result
    FROM ML.GENERATE_TEXT(
      MODEL `your-dataset.llm_model`,
      (
        SELECT CONCAT(
          'Based on the following context about California wildfire incidents, answer the question.\n\nContext:\n',
          STRING_AGG(base.content, '\n\n'),
          '\n\nQuestion: ', question,
          '\n\nAnswer based only on the provided context:'
        ) AS prompt
        -- retrieve the 8 chunks nearest the question, embedded with the same model as the corpus
        FROM VECTOR_SEARCH(
          TABLE `your-dataset.pdf_embeddings`,
          'ml_generate_embedding_result',
          (SELECT ml_generate_embedding_result FROM ML.GENERATE_EMBEDDING(
            MODEL `your-dataset.embedding_model`,
            (SELECT question AS content)
          )),
          top_k => 8
        )
      ),
      -- flatten_json_output surfaces the plain-text ml_generate_text_llm_result column
      STRUCT(TRUE AS flatten_json_output)
    )
  )
);

7. Test the System

SELECT `your-dataset.ask_llm`('What were the main causes of wildfire incidents?');

Key Files

  • walkthrough.ipynb - Complete demonstration with live BigQuery integration
  • streamlit_app/ - Interactive dashboard with RAG chat interface
  • kaggle_writeup.md - Formal project submission
  • survey_response.txt - BigQuery AI experience feedback

Production Patterns

Error Handling

-- Always use SAFE_CAST for AI-generated data
SAFE_CAST(ai_generated_date AS DATE) as incident_date
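
A fuller version of the same guard, counting how many AI-extracted dates fail to parse instead of letting one bad value kill the query; the table name is hypothetical, standing in for the structured-extraction output:

SELECT
  COUNTIF(SAFE_CAST(ai_generated_date AS DATE) IS NULL) AS unparseable_dates,
  COUNT(*) AS total_rows
FROM `your-dataset.extracted_incidents`;  -- hypothetical table name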

Embedding Consistency

Use a single embedding model across the entire pipeline; mixing models breaks vector search even when the dimensions match. The ask_llm function above embeds each question with the same embedding_model that produced the stored vectors for exactly this reason.

Parallel Processing

BigQuery processes all documents simultaneously:

-- Process all documents at once (fast)
INSERT INTO results
SELECT uri, ML.GENERATE_TEXT(...) FROM documents;

-- Not sequential processing (slow)
FOR doc IN (SELECT uri FROM documents) DO
  INSERT INTO results SELECT ML.GENERATE_TEXT(...);
END FOR;

Cost Analysis

Total processing cost for 153 documents: $39.57

  • Document AI parsing: ~$15
  • LLM generation calls: ~$20
  • Embedding generation: ~$5

That works out to roughly $0.26 per document, cheap enough to be viable in production at scale.

Contributing

Built by Reza Madani

Repository: https://github.com/srmadani/Grid-Incident-Rag
Medium Blog: Detailed walkthrough and insights

License

CC BY 4.0 - Freely available for commercial and non-commercial use.
