A SQL-native RAG system that transforms California wildfire incident reports into searchable intelligence using BigQuery AI.
California utilities have 153 PDF incident reports containing patterns that could prevent catastrophic wildfires. But this critical data is trapped in unstructured documents, taking months to analyze manually while disasters unfold.
We built a complete RAG system using BigQuery AI that turns regulatory PDFs into an intelligent knowledge base. Ask a question like "What caused the Dixie Fire?" and get a cited answer in under five seconds.
153 PDFs → Document AI → Text Chunks → Embeddings → Vector Search → Gemini → Answers
    ↓           ↓             ↓             ↓              ↓            ↓         ↓
Cloud Storage → BigQuery → Structured Data → Vector Store → RAG Function → Dashboard → Insights
Key Innovation: Everything runs in SQL. No Python orchestration, no external services, no complex infrastructure.
- ML.GENERATE_TEXT: Powers RAG Q&A system
- AI.GENERATE_TABLE: Extracts structured data from PDFs
- AI.GENERATE_BOOL/INT: Categorizes incidents and metrics
- ML.GENERATE_EMBEDDING: Creates 768-dimensional vectors
- VECTOR_SEARCH: Semantic similarity for contextual retrieval
- Object Tables: SQL interface to PDFs in Cloud Storage
- Document AI: Handles complex regulatory document layouts
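As an illustration of the categorization step, a scalar call to AI.GENERATE_BOOL might look like the following sketch; the table, column, and connection names here are hypothetical, and the exact argument syntax should be checked against current BigQuery documentation:

```sql
-- Flag incidents caused by equipment failure (hypothetical table/columns)
SELECT
  incident_id,
  AI.GENERATE_BOOL(
    ('Was this incident caused by equipment failure? ', description),
    connection_id => 'us.your-connection',
    endpoint => 'gemini-2.5-flash'
  ).result AS equipment_failure
FROM `your-dataset.incident_summaries`;
```

AI.GENERATE_BOOL returns a struct, so `.result` extracts the boolean itself.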
- 153 PDFs → 126 validated incidents (82% success rate)
- $39.57 total processing cost
- ~4.7 seconds average query response time
- 3,229 searchable text chunks with embeddings
- 3.76/5.0 correctness score, 100% faithfulness rate
- Google Cloud Project with billing enabled
- BigQuery, Vertex AI, Document AI APIs enabled
export PROJECT_ID="your-project-id"
export REGION="us"
export DATASET_ID="grid_incidents_rag_ds"
export BUCKET_NAME="your-bucket-name"

# Create dataset
bq mk --location=$REGION --dataset $PROJECT_ID:$DATASET_ID
# Create connection
bq mk --connection --location=$REGION --connection_type=CLOUD_RESOURCE your-connection-name

- Go to Google Cloud Console → Document AI
- Create "Layout Parser" processor
- Copy the processor ID
-- Document AI Model
CREATE OR REPLACE MODEL `your-dataset.doc_parser_model`
REMOTE WITH CONNECTION `your-connection`
OPTIONS(
remote_service_type = 'CLOUD_AI_DOCUMENT_V1',
document_processor = 'projects/your-project/locations/us/processors/your-processor-id'
);
-- Embedding Model
CREATE OR REPLACE MODEL `your-dataset.embedding_model`
REMOTE WITH CONNECTION `your-connection`
OPTIONS (ENDPOINT = 'textembedding-gecko@003');
-- LLM Model
CREATE OR REPLACE MODEL `your-dataset.llm_model`
REMOTE WITH CONNECTION `your-connection`
OPTIONS (ENDPOINT = 'gemini-2.5-flash');

-- Create object table
CREATE OR REPLACE EXTERNAL TABLE `your-dataset.pdf_object_table`
WITH CONNECTION `your-connection`
OPTIONS (
object_metadata = "SIMPLE",
uris = ["gs://your-bucket/*.pdf"]
);
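Before parsing, it is worth confirming that the object table actually sees the uploaded PDFs. A minimal sketch using BigQuery's standard object-table metadata columns:

```sql
-- List a few objects to verify the table is wired to the bucket
SELECT uri, size, content_type
FROM `your-dataset.pdf_object_table`
ORDER BY size DESC
LIMIT 5;
```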
-- Parse PDFs
CREATE OR REPLACE TABLE `your-dataset.parsed_pdf_chunks` AS
SELECT
uri,
JSON_VALUE(chunk, '$.content') as content,
ROW_NUMBER() OVER() as chunk_id
FROM
ML.PROCESS_DOCUMENT(
MODEL `your-dataset.doc_parser_model`,
TABLE `your-dataset.pdf_object_table`
),
UNNEST(JSON_EXTRACT_ARRAY(ml_process_document_result, '$.chunked_document.chunks')) as chunk
WHERE LENGTH(JSON_VALUE(chunk, '$.content')) > 100;
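A quick per-document chunk count over the table created above helps surface PDFs that failed to parse, which the 82% validation rate suggests some will:

```sql
-- Documents with very few chunks likely failed or were mostly scanned images
SELECT uri, COUNT(*) AS chunk_count
FROM `your-dataset.parsed_pdf_chunks`
GROUP BY uri
ORDER BY chunk_count ASC;
```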
-- Generate embeddings
CREATE OR REPLACE TABLE `your-dataset.pdf_embeddings` AS
SELECT * FROM ML.GENERATE_EMBEDDING(
MODEL `your-dataset.embedding_model`,
(SELECT content, chunk_id, uri FROM `your-dataset.parsed_pdf_chunks`)
);

CREATE OR REPLACE FUNCTION `your-dataset.ask_llm`(question STRING)
RETURNS STRING
AS (
(
SELECT ML.GENERATE_TEXT(
MODEL `your-dataset.llm_model`,
CONCAT(
'Based on the following context about California wildfire incidents, answer the question.\n\nContext:\n',
STRING_AGG(base.content, '\n\n'),
'\n\nQuestion: ', question,
'\n\nAnswer based only on the provided context:'
)
).ml_generate_text_result
FROM VECTOR_SEARCH(
TABLE `your-dataset.pdf_embeddings`,
'ml_generate_embedding_result',
(SELECT ml_generate_embedding_result FROM ML.GENERATE_EMBEDDING(
MODEL `your-dataset.embedding_model`,
(SELECT question as content)
)),
top_k => 8
)
)
);

SELECT `your-dataset.ask_llm`('What were the main causes of wildfire incidents?');

- walkthrough.ipynb - Complete demonstration with live BigQuery integration
- streamlit_app/ - Interactive dashboard with RAG chat interface
- kaggle_writeup.md - Formal project submission
- survey_response.txt - BigQuery AI experience feedback
-- Always use SAFE_CAST for AI-generated data
SAFE_CAST(ai_generated_date AS DATE) AS incident_date

Use a single embedding model across the entire pipeline - mixing embedding models breaks vector search even when the output dimensions are identical.
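Building on the SAFE_CAST tip, rows whose AI-generated fields fail to parse can be filtered out rather than aborting the whole query; the table and column names in this sketch are hypothetical:

```sql
-- Keep only incidents whose AI-generated date parses to a valid DATE
SELECT
  incident_id,
  SAFE_CAST(ai_generated_date AS DATE) AS incident_date
FROM `your-dataset.extracted_incidents`
WHERE SAFE_CAST(ai_generated_date AS DATE) IS NOT NULL;
```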
BigQuery processes all documents simultaneously:
-- Process all documents at once (fast)
INSERT INTO results
SELECT uri, ML.GENERATE_TEXT(...) FROM documents;
-- Not sequential processing (slow)
FOR doc IN (SELECT uri FROM documents) DO
INSERT INTO results SELECT ML.GENERATE_TEXT(...);
END FOR;

Total processing cost for 153 documents: $39.57
- Document AI parsing: ~$15
- LLM generation calls: ~$20
- Embedding generation: ~$5
Demonstrates production viability at scale.
Built by Reza Madani
Repository: https://github.com/srmadani/Grid-Incident-Rag
Medium Blog: Detailed walkthrough and insights
CC BY 4.0 - Freely available for commercial and non-commercial use.
