dataaispark-spec · Kumarvels · Sep 17, 2025
diff --git a/haloguard-pro/.env.example b/haloguard-pro/.env.example
@@ -0,0 +1,19 @@
+# Environment variables for HaloGuard Pro
+
+# Server
+HOST=0.0.0.0
+PORT=8000
+WORKERS=4
+
+# Logging
+LOG_LEVEL=INFO
+LOG_FILE=/app/logs/haloguard.log
+
+# Knowledge Base
+DEFAULT_DOMAIN=general
+
+# Health Check
+HEALTH_CHECK_PATH=/health
+
+# Optional: Enable auto-correction only for high-risk domains
+AUTO_CORRECT_DOMAINS=medical,legal,finance
diff --git a/haloguard-pro/.gitignore b/haloguard-pro/.gitignore
@@ -0,0 +1,10 @@
+__pycache__/
+*.pyc
+*.log
+*.db
+.env
+data/
+.DS_Store
+venv/
+.idea/
+.vscode/
diff --git a/haloguard-pro/Dockerfile b/haloguard-pro/Dockerfile
@@ -0,0 +1,24 @@
+FROM python:3.11-slim
+
+WORKDIR /app
+
+COPY requirements.txt .
+RUN pip install --no-cache-dir -r requirements.txt
+
+# Create directories
+RUN mkdir -p /app/facts /app/logs
+
+# Copy everything
+COPY facts/ /app/facts/
+COPY main.py /app/main.py
+COPY config/ /app/config/
+COPY .env.example /app/.env.example
+
+# Expose port
+EXPOSE 8000
+
+# Set environment variables
+ENV PYTHONUNBUFFERED=1
+
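+# Run app (host, port, and worker count are fixed here; HOST, PORT, and WORKERS from .env do not change this command)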
+CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000", "--workers", "4"]
diff --git a/haloguard-pro/README.md b/haloguard-pro/README.md
@@ -0,0 +1,112 @@
+# HaloGuard Pro: Technical Deep Dive
+
+This document provides a detailed explanation of the logical design, architecture, and execution of the HaloGuard Pro system.
+
+## 1. Core Philosophy: Determinism Over AI
+
+The fundamental design choice of HaloGuard Pro is to **reject AI-based solutions for fact-checking** in favor of a deterministic, rule-based system. This choice was made to prioritize reliability, speed, and cost-effectiveness for production environments.
+
+| Why Not AI/ML? | The HaloGuard Pro Approach |
+| :--- | :--- |
+| **Non-Deterministic**: LLM-as-judge or semantic search can produce different results for the same input and are prone to the very hallucinations they are supposed to prevent. | **100% Deterministic**: Given the same input and the same knowledge base, the output will *always* be the same. A rule fires only when one of its patterns literally appears in the output, although a substring match cannot distinguish an assertion from a negation or a quotation. |
+| **High Cost**: Requires expensive GPU hardware and has a high cost per query. | **Extremely Low Cost**: Runs on a cheap CPU ($5/month VPS) with a per-query cost near zero. |
+| **Complex**: Requires expertise in embeddings, vector databases, RAG pipelines, and model fine-tuning. | **Simple**: The logic is based on string matching. The knowledge base is a set of human-readable JSON files. No ML expertise is needed. |
+| **Slow**: Verifying an output can take several seconds, making it unsuitable for real-time chat. | **Blazing Fast**: Verification takes only a few milliseconds, making it perfect for real-time applications. |
+
+This approach is ideal for domains where factual accuracy is non-negotiable and "good enough" is not an option, such as in medical, legal, and financial applications.
+
+## 2. System Architecture
+
+The system is a monolithic FastAPI application containerized with Docker. The architecture is designed for simplicity, scalability, and ease of maintenance.
+
+**Key Components:**
+
+1.  **FastAPI Web Server (`main.py`)**: A lightweight Python web server that exposes the core verification logic via a REST API. It serves as the single entry point for all requests.
+2.  **Knowledge Base (KB)**: A set of JSON files located in the `facts/` directory. Each file represents a specific domain (e.g., `medical.json`, `legal.json`). This separation allows for modular and domain-specific fact management without touching the application code.
+3.  **Docker Environment (`Dockerfile`, `docker-compose.yml`)**: Packages the application and its dependencies into a portable container image. This ensures consistent behavior across development, testing, and production environments and simplifies deployment immensely.
+4.  **Configuration (`config/settings.py`, `.env`)**: Centralized configuration management for server settings, logging, and application behavior. It loads settings from environment variables, following the 12-factor app methodology.
+5.  **Testing Suite (`tests/`)**: A set of unit and integration tests using `pytest` and FastAPI's `TestClient` (`fastapi.testclient.TestClient`) to ensure code quality, prevent regressions, and validate the behavior of the API endpoints.
+
+## 3. Execution Flow: A Step-by-Step Breakdown
+
+When a request hits the `/verify` endpoint, the following sequence of operations occurs:
+
+### Step 1: Request Handling
+The server receives a POST request containing the LLM `output` to be verified and an optional `prompt`.
+
+```json
+{
+  "output": "Einstein died in 1950",
+  "prompt": "When did Einstein die?"
+}
+```
+
+### Step 2: Domain Detection (`detect_domain`)
+The system first attempts to determine the conversational domain to use the correct knowledge base. This is achieved through a simple and fast keyword-matching algorithm on the user's `prompt`.
+
+```python
+# In main.py
+def detect_domain(prompt: str) -> str:
+    prompt_lower = prompt.lower()
+    domain_map = {
+        "doctor": "medical",
+        "lawyer": "legal",
+        # ... more keywords
+    }
+    for keyword, domain in domain_map.items():
+        if keyword in prompt_lower:
+            return domain
+    return settings.DEFAULT_DOMAIN # Falls back to 'general'
+```
+
+### Step 3: Knowledge Base Loading & Caching
+The corresponding JSON file for the detected domain (e.g., `facts/general.json`) is loaded from disk. To optimize performance, loaded KBs are stored in an in-memory dictionary (`KB_CACHE`) to avoid repeated file I/O for subsequent requests in the same domain.
+
+### Step 4: Issue Detection
+The core logic runs two independent checks to identify potential problems:
+
+**A. Factual Error Detection (`detect_factual_errors`)**
+
+This function iterates through each entry in the loaded knowledge base. For each entry, it checks if any of the `incorrect_patterns` are present in the LLM `output`.
+
+-   **Logic**: A case-insensitive substring check (`pattern.lower() in output.lower()`).
+-   **Output**: A list of "issue" dictionaries, each containing the `found` pattern and the `expected` correct fact from the KB.
+
+**B. Linguistic Red Flag Detection (`detect_linguistic_red_flags`)**
+
+This function scans the output for common markers of overconfident or non-committal LLM responses, such as:
+-   **Absolute Claims**: "always", "never", "completely safe".
+-   **Vague Language**: "some say", "it is believed", "possibly".
+
+### Step 5: Auto-Correction (`generate_corrected_output`)
+If any factual errors are found and marked `auto_correct: true` in the KB, this function performs the correction.
+
+-   **Logic**: It uses a **case-insensitive regular expression substitution** (`re.sub` with `re.IGNORECASE`). This is a crucial bug fix over a plain `str.replace()`, which is case-sensitive and would miss matches that the case-insensitive detection logic had already flagged.
+-   **Example**: It reliably replaces `"Einstein died in 1950"` with the correct statement, even if the input was `"einstein died in 1950"`.
+
+```python
+# In main.py
+def generate_corrected_output(...):
+    # ...
+    for issue in fact_issues:
+        # ...
+        corrected = re.sub(
+            re.escape(issue["found"]),
+            lambda _m: issue["expected"],  # callable: insert the KB text literally, with no backslash or group expansion
+            corrected,
+            count=1,
+            flags=re.IGNORECASE
+        )
+    return corrected
+```
+
+### Step 6: Response Generation
+Finally, the server returns a detailed JSON response containing the original output, the verified/corrected output, a list of all detected issues, and metadata about the verification process.
+
+## 4. How to Extend and Maintain
+
+-   **Adding New Facts**: Edit the relevant `facts/*.json` file and restart the container so that the in-memory `KB_CACHE` is rebuilt. No code changes are needed, so the knowledge base can be managed by content experts, not just developers.
+-   **Adding a New Domain**:
+    1.  Create a new `facts/your_domain.json` file following the existing structure.
+    2.  (Optional) Add keywords to the `domain_map` in `main.py` to enable auto-detection for the new domain.
+-   **Running Tests**: The project uses `pytest` for testing. From the `haloguard-pro` directory, simply run `pytest` to execute the full test suite.
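Review note on the README above (illustrative only, not part of the patch): `main.py` is not included in the hunks shown here, so the following is a minimal sketch of how Steps 3 and 4 could fit together. The names `KB_CACHE`, `detect_factual_errors`, and `detect_linguistic_red_flags` come from the README; the `load_kb` helper, the function signatures, the issue fields, and the one-issue-per-entity rule are assumptions.

```python
import json
from pathlib import Path

KB_CACHE: dict[str, dict] = {}  # name from the README; the layout is assumed


def load_kb(domain: str, facts_dir: str = "facts") -> dict:
    """Hypothetical loader: read facts/<domain>.json once, then reuse it."""
    if domain not in KB_CACHE:
        path = Path(facts_dir) / f"{domain}.json"
        with path.open(encoding="utf-8") as fh:
            KB_CACHE[domain] = json.load(fh)
    return KB_CACHE[domain]


def detect_factual_errors(output: str, kb: dict) -> list[dict]:
    """Case-insensitive substring check of every incorrect pattern."""
    output_lower = output.lower()
    issues = []
    for entity, entry in kb.items():
        for pattern in entry.get("incorrect_patterns", []):
            if pattern.lower() in output_lower:
                issues.append({
                    "type": "fact_error",
                    "entity": entity,
                    "expected": entry["correct"],
                    "found": pattern,
                    "auto_correct": entry.get("auto_correct", False),
                })
                break  # assumption: report each entity at most once
    return issues


# Markers copied from the README examples; the real lists may be longer.
RED_FLAGS = {
    "absolute_claim": ["always", "never", "completely safe"],
    "vague_language": ["some say", "it is believed", "possibly"],
}


def detect_linguistic_red_flags(output: str) -> list[dict]:
    """Return one flag per marker phrase found in the output."""
    output_lower = output.lower()
    return [
        {"type": flag_type, "found": marker}
        for flag_type, markers in RED_FLAGS.items()
        for marker in markers
        if marker in output_lower
    ]
```

With a KB entry whose `incorrect_patterns` contains `"The earth is flat"`, the output `"Some say THE EARTH IS FLAT."` would yield one factual issue and one `vague_language` flag. Under the same sketch, `"It is a myth that the earth is flat"` would also be flagged, because a substring match does not see the negation.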
diff --git a/haloguard-pro/config/settings.py b/haloguard-pro/config/settings.py
@@ -0,0 +1,15 @@
+import os
+from typing import List
+
+class Settings:
+    HOST: str = os.getenv("HOST", "0.0.0.0")
+    PORT: int = int(os.getenv("PORT", "8000"))
+    WORKERS: int = int(os.getenv("WORKERS", "4"))
+    LOG_LEVEL: str = os.getenv("LOG_LEVEL", "INFO")
+    LOG_FILE: str = os.getenv("LOG_FILE", "/app/logs/haloguard.log")
+    DEFAULT_DOMAIN: str = os.getenv("DEFAULT_DOMAIN", "general")
+    AUTO_CORRECT_DOMAINS: List[str] = [
+        d.strip() for d in os.getenv("AUTO_CORRECT_DOMAINS", "medical,legal,finance").split(",") if d.strip()
+    ]
+
+settings = Settings()
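Review note on `config/settings.py` (illustrative only, not part of the patch): the list comprehension for `AUTO_CORRECT_DOMAINS` can be exercised in isolation to see how it treats stray whitespace and empty values. The helper name `parse_csv_env` is hypothetical; the parsing expression mirrors the one in `Settings`.

```python
import os


def parse_csv_env(name: str, default: str = "") -> list[str]:
    """Split a comma-separated environment variable into trimmed items."""
    raw = os.getenv(name, default)
    return [item.strip() for item in raw.split(",") if item.strip()]


# Whitespace and empty items are dropped.
os.environ["AUTO_CORRECT_DOMAINS"] = " medical, legal ,,finance "
print(parse_csv_env("AUTO_CORRECT_DOMAINS"))  # ['medical', 'legal', 'finance']

# A variable that is set but empty does NOT fall back to the default.
os.environ["AUTO_CORRECT_DOMAINS"] = ""
print(parse_csv_env("AUTO_CORRECT_DOMAINS", "medical,legal,finance"))  # []
```

The second case matters for deployment: `AUTO_CORRECT_DOMAINS=` in `.env` yields an empty list rather than the three default domains. Note also that `.env.example` defines `HEALTH_CHECK_PATH`, which the `Settings` class in this diff does not read.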
diff --git a/haloguard-pro/docker-compose.yml b/haloguard-pro/docker-compose.yml
@@ -0,0 +1,19 @@
+version: '3.8'
+
+services:
+  haloguard:
+    build: .
+    ports:
+      - "8000:8000"
+    volumes:
+      - ./facts:/app/facts
+      - ./logs:/app/logs
+    env_file:
+      - .env
+    restart: unless-stopped
+    networks:
+      - haloguard-net
+
+networks:
+  haloguard-net:
+    driver: bridge
diff --git a/haloguard-pro/docs/api_spec.md b/haloguard-pro/docs/api_spec.md
@@ -0,0 +1,76 @@
+# HaloGuard Pro API Specification
+
+## Base URL
+`http://localhost:8000`
+
+## `/verify` — POST
+
+### Request Body
+```json
+{
+  "output": "Thomas Edison invented the light bulb in 1879",
+  "prompt": "Who invented the light bulb?",
+  "prev_messages": [
+    "User: I'm writing a school report.",
+    "Bot: Okay, let me help!"
+  ]
+}
+```
+
+### Response Body
+```json
+{
+  "original": "Thomas Edison invented the light bulb in 1879",
+  "verified": "Joseph Swan and Thomas Edison independently developed practical incandescent lamps in 1878–1879",
+  "issues": [
+    {
+      "type": "fact_error",
+      "entity": "light_bulb",
+      "expected": "Joseph Swan and Thomas Edison independently developed practical incandescent lamps in 1878–1879",
+      "found": "Thomas Edison invented the light bulb in 1879",
+      "context": "Thomas Edison invented the light bulb in 1879",
+      "confidence": "high",
+      "auto_correct": true
+    }
+  ],
+  "auto_corrected": true,
+  "confidence": "high",
+  "processing_time_ms": 2,
+  "timestamp": "2024-06-15T12:34:56Z",
+  "domain_used": "general"
+}
+```
+
+## `/health` — GET
+
+Returns:
+```json
+{
+  "status": "healthy",
+  "version": "1.0.0",
+  "timestamp": "2024-06-15T12:34:56Z",
+  "domains_loaded": 5,
+  "default_domain": "general",
+  "auto_correct_domains": ["medical", "legal", "finance"]
+}
+```
+
+## `/facts/{domain}` — GET
+
+Returns full KB for domain:
+```json
+{
+  "einstein": { ... },
+  "light_bulb": { ... }
+}
+```
+
+## Status Codes
+- `200 OK`: Success
+- `422 Unprocessable Entity`: Request body is malformed JSON or fails schema validation
+- `404 Not Found`: Domain not found
+- `500 Internal Server Error`: A `facts/*.json` file is malformed
+
+## Security
+- No authentication required (add JWT/API key if needed)
+- All data processed locally — no external calls
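Review note on `docs/api_spec.md` (illustrative only, not part of the patch): a small standard-library client shows how the `/verify` request and response shapes above could be used. The base URL and field names come from the spec; the helper names `build_verify_request`, `summarize`, and `verify` are hypothetical, and treating `prompt` and `prev_messages` as optional is an assumption based on the README.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # base URL from the spec


def build_verify_request(output, prompt=None, prev_messages=None):
    """Build (but do not send) a POST /verify request."""
    payload = {"output": output}
    if prompt is not None:
        payload["prompt"] = prompt
    if prev_messages:
        payload["prev_messages"] = prev_messages
    return urllib.request.Request(
        f"{BASE_URL}/verify",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def summarize(response: dict) -> str:
    """Condense a /verify response body into a single log line."""
    entities = [issue.get("entity", issue["type"]) for issue in response["issues"]]
    return (
        f"{len(entities)} issue(s) {entities}; "
        f"auto_corrected={response['auto_corrected']}; "
        f"domain={response['domain_used']}"
    )


def verify(output, prompt=None, timeout=5.0) -> dict:
    """Send the request; needs a running HaloGuard Pro server."""
    request = build_verify_request(output, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.load(resp)
```

Against a running instance, `summarize(verify("Thomas Edison invented the light bulb in 1879", "Who invented the light bulb?"))` would be expected to report one `light_bulb` issue, assuming the server responds as in the example above.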
diff --git a/haloguard-pro/facts/education.json b/haloguard-pro/facts/education.json
@@ -0,0 +1,46 @@
+{
+  "earth_shape": {
+    "correct": "The Earth is an oblate spheroid — slightly flattened at the poles and bulging at the equator — not flat.",
+    "incorrect_patterns": [
+      "The earth is flat",
+      "Scientists are lying about the earth being round",
+      "Flat earth is a valid theory",
+      "Gravity doesn't exist, the earth is flat"
+    ],
+    "auto_correct": true,
+    "alert_to_admin": false
+  },
+  "dinosaur_extinction": {
+    "correct": "The non-avian dinosaurs went extinct approximately 66 million years ago due to a mass extinction event, likely triggered by an asteroid impact in present-day Mexico and massive volcanic activity.",
+    "incorrect_patterns": [
+      "Dinosaurs went extinct 10,000 years ago",
+      "Humans and dinosaurs lived together",
+      "Dinosaurs were killed by climate change alone",
+      "Birds are not descended from dinosaurs"
+    ],
+    "auto_correct": true,
+    "alert_to_admin": false
+  },
+  "gravity_definition": {
+    "correct": "Gravity is a fundamental force of nature that attracts two bodies with mass toward each other. On Earth, it gives weight to physical objects and causes them to fall toward the ground.",
+    "incorrect_patterns": [
+      "Gravity is just a theory",
+      "Gravity doesn't exist, things fall because of density",
+      "Gravity is caused by the Earth spinning",
+      "There's no gravity in space"
+    ],
+    "auto_correct": true,
+    "alert_to_admin": false
+  },
+  "photosynthesis_equation": {
+    "correct": "The chemical equation for photosynthesis is: 6CO₂ + 6H₂O + light energy → C₆H₁₂O₆ + 6O₂",
+    "incorrect_patterns": [
+      "Photosynthesis: CO2 + H2O → O2 + glucose",
+      "Plants breathe in oxygen and breathe out carbon dioxide",
+      "Photosynthesis happens at night",
+      "Trees make food from soil"
+    ],
+    "auto_correct": true,
+    "alert_to_admin": false
+  }
+}
diff --git a/haloguard-pro/facts/finance.json b/haloguard-pro/facts/finance.json
@@ -0,0 +1,46 @@
+{
+  "bitcoin_5_year_return": {
+    "correct": "Bitcoin is highly volatile and does not reliably outperform stocks over 5-year periods. Historical returns are not indicative of future performance. Financial advisors generally do not recommend allocating core retirement savings to cryptocurrency.",
+    "incorrect_patterns": [
+      "Bitcoin outperforms stocks over 5 years",
+      "Crypto has higher returns than stocks",
+      "Investing in Bitcoin is safer than stocks",
+      "Financial advisors recommend crypto for retirees"
+    ],
+    "auto_correct": true,
+    "alert_to_admin": true
+  },
+  "roth_ira_contribution_limit": {
+    "correct": "For 2024, the maximum contribution limit for a Roth IRA is $7,000 ($8,000 if age 50 or older). Income limits apply: single filers phase out between $146,000 and $161,000.",
+    "incorrect_patterns": [
+      "You can contribute $10,000 to a Roth IRA",
+      "Roth IRA limit is $5,000",
+      "Anyone can contribute regardless of income",
+      "Roth IRA has no income limits"
+    ],
+    "auto_correct": true,
+    "alert_to_admin": true
+  },
+  "crypto_tax_reporting": {
+    "correct": "In the U.S., cryptocurrency is treated as property by the IRS. You must report capital gains/losses on Form 8949 and Schedule D. Simply buying crypto with U.S. dollars and holding it is not taxable; selling, trading, spending, or earning rewards triggers tax events.",
+    "incorrect_patterns": [
+      "Crypto is tax-free if you don't cash out",
+      "You don't need to report crypto holdings",
+      "Only selling crypto for fiat triggers tax",
+      "Crypto earnings are not taxable"
+    ],
+    "auto_correct": true,
+    "alert_to_admin": true
+  },
+  "stock_dividend_tax": {
+    "correct": "Qualified dividends are taxed at long-term capital gains rates (0%, 15%, or 20%). Non-qualified dividends are taxed as ordinary income. To qualify, the stock must be held for more than 60 days during the 121-day period that begins 60 days before the ex-dividend date.",
+    "incorrect_patterns": [
+      "All dividends are taxed at 15%",
+      "Dividends are tax-free",
+      "You pay 37% tax on all dividends",
+      "Dividend tax doesn't depend on holding period"
+    ],
+    "auto_correct": true,
+    "alert_to_admin": true
+  }
+}
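Review note on the `facts/*.json` files (illustrative only, not part of the patch): since the knowledge base is meant to be edited by hand, a small validation pass could catch malformed entries before the container restarts. The sketch below assumes that every entry needs the four fields seen in `education.json` and `finance.json`; `validate_kb` and `REQUIRED_FIELDS` are hypothetical names, and the check that no incorrect pattern occurs inside its own `correct` text is an added sanity rule, not something the source states.

```python
REQUIRED_FIELDS = {
    "correct": str,
    "incorrect_patterns": list,
    "auto_correct": bool,
    "alert_to_admin": bool,
}


def validate_kb(kb: dict) -> list[str]:
    """Return human-readable problems; an empty list means the KB looks fine."""
    problems = []
    for entity, entry in kb.items():
        if not isinstance(entry, dict):
            problems.append(f"{entity}: entry is not an object")
            continue
        for field, expected_type in REQUIRED_FIELDS.items():
            if not isinstance(entry.get(field), expected_type):
                problems.append(
                    f"{entity}: '{field}' missing or not {expected_type.__name__}"
                )
        patterns = entry.get("incorrect_patterns")
        if not isinstance(patterns, list):
            continue
        correct_lower = str(entry.get("correct", "")).lower()
        for pattern in patterns:
            if not isinstance(pattern, str) or not pattern.strip():
                problems.append(f"{entity}: empty or non-string pattern")
            elif pattern.lower() in correct_lower:
                # A corrected output would immediately match this rule again.
                problems.append(
                    f"{entity}: pattern {pattern!r} also occurs in 'correct'"
                )
    return problems
```

A check like this could run in the test suite over every file in `facts/`, so that a typo in a hand-edited entry surfaces as a failing test instead of a `500` response at request time.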