Learning-Grade AI Web Vulnerability Scanner

A portfolio-ready, learning-grade web vulnerability scanner and lightweight AI-assisted report viewer. This project demonstrates a polite, non-destructive approach to crawling and finding common web security issues (security headers, insecure cookie flags, reflected XSS heuristics, and basic error-based SQL injection indicators). It ships with a small Flask-based AI proxy intended to power the in-report AI assistance (optional).

⚠️ Important — Authorization & Ethics

This tool is meant only for authorized security testing, education, and labs. Always obtain explicit written permission before scanning any domain you do not own or are not authorized to test. The scanner requires a --confirm flag to run as an extra safety step.


Pictures

  • After running python app.py
  • After running python learning_grade_web_scanner.py https://sagarbiswas-multihat.github.io --confirm --ai-enabled --ai-server http://127.0.0.1:5000/api/ai-chat
  • After running cd Reports and python -m http.server 8080
  • JSON report example
  • HTML report example
  • AI Help Center
  • AI Help Center without "Anonymize before send"
  • AI Help Center with "Anonymize before send"




Key features

  • Queue-based polite crawler (no recursive thread spawning).
Quick explanation — Queue-based crawler, Polite crawler & No recursive thread spawning (beginner-friendly)

1. Queue-based crawler

A queue-based crawler uses a single shared work queue to manage all URLs that need to be visited.

How it works (conceptually):

  1. Start with the base URL → put it into the queue
  2. Worker threads repeatedly:
    • Take one URL from the queue
    • Fetch the page
    • Extract links
    • Add new, allowed URLs back into the same queue
  3. Repeat until the queue is empty or the depth limit is reached

What this means:

  • Each worker processes one URL at a time
  • There is central control over what gets scanned
  • Crawl order, depth, and limits remain predictable

Key point:
All crawling tasks go through one controlled pipeline (the queue).
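A minimal, single-threaded sketch of this loop (illustrative names, not the scanner's exact code), using one shared queue.Queue of (url, depth) pairs and a visited set:

import queue
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

def crawl(base_url, max_depth=2):
    work = queue.Queue()              # single shared work queue
    visited = set()
    work.put((base_url, 0))           # seed the base URL at depth 0

    while not work.empty():
        url, depth = work.get()
        if url in visited or depth > max_depth:
            continue
        visited.add(url)

        response = requests.get(url, timeout=10)

        # Discovered links are treated as data and re-enqueued, not handed to new threads.
        soup = BeautifulSoup(response.text, "html.parser")
        for anchor in soup.find_all("a", href=True):
            link = urljoin(url, anchor["href"])
            if urlparse(link).netloc == urlparse(base_url).netloc:
                work.put((link, depth + 1))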

2. Polite crawler

A polite crawler is designed not to stress or harm servers.

In this scanner, politeness includes:

  • ⏱ Per-host rate limiting (--delay)
  • 🤖 Respecting robots.txt
  • 🌱 GET-only requests (non-destructive)
  • 🚫 No brute-force or payload floods
  • 🧵 Controlled number of threads

Instead of hammering a site, the scanner behaves more like a careful human using a browser.
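As an illustration of per-host rate limiting, a small helper like the hedged sketch below could enforce the --delay setting. The last_access dictionary mirrors the idea described later under Internals, but the names here are illustrative:

import time
from urllib.parse import urlparse

last_access = {}   # host -> timestamp of the most recent request to that host

def polite_wait(url, delay):
    """Sleep just long enough so requests to the same host are at least `delay` seconds apart."""
    host = urlparse(url).netloc
    elapsed = time.time() - last_access.get(host, 0.0)
    if elapsed < delay:
        time.sleep(delay - elapsed)
    last_access[host] = time.time()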

3. No recursive thread spawning (critical design choice)

This explains what the crawler deliberately does NOT do.

✘ Bad design (recursive thread spawning):

Thread A visits URL A
└── spawns Thread B for link B
    └── spawns Thread C for link C
        └── spawns Thread D for link D

(A thread is a lightweight unit of execution that lets a program do work in parallel, i.e., run multiple tasks at the same time instead of one after another.)

Problems with this approach:

  • Unbounded thread growth
  • Loss of concurrency control
  • Servers get overwhelmed
  • Scanner runs out of memory or sockets
  • Hard to enforce delays and crawl depth

This pattern is called recursive thread spawning — every discovered link creates new threads.

✓ What this scanner does instead:

[ Queue ]
    ↓
Worker Thread Pool (fixed size)
    ↓
Fetch → Extract → Enqueue (back to Queue)

A fixed pool equal to the --threads setting (default: 10). Threads are created once at startup and reused — no new thread per discovered link.

  • Threads are created once
  • Thread count is fixed (--threads)
  • No thread creates another thread
  • Discovered links are treated as data, not new execution contexts

Core idea:
No recursive thread spawning = threads do not create threads
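A rough sketch of the fixed-pool pattern, assuming an illustrative fetch_and_analyze helper (the scanner's real crawl_worker may be organized differently):

import queue
import threading

work = queue.Queue()

def crawl_worker():
    # Each worker loops, pulling URLs from the shared queue; it never spawns new threads.
    while True:
        url, depth = work.get()
        try:
            fetch_and_analyze(url, depth)   # hypothetical helper: fetch, analyze, enqueue links
        finally:
            work.task_done()

def start_pool(num_threads=10):
    # Threads are created once at startup and reused for the whole scan.
    for _ in range(num_threads):
        threading.Thread(target=crawl_worker, daemon=True).start()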

Why is this design considered best practice?

Aspect                   Queue-based crawler       Recursive spawning
Thread control           ✅ Fixed & predictable     ❌ Unbounded
Rate limiting            ✅ Enforceable             ❌ Difficult
Server safety            ✅ Polite                  ❌ Aggressive
Memory safety            ✅ Stable                  ❌ Risky
Debugging                ✅ Easier                  ❌ Chaotic
Legal / ethical safety   ✅ Much safer              ❌ Risky

This is why professional tools and search engine crawlers use queue-based designs.

Result: safer scans, predictable behavior, ethical crawling, and easier extensibility.


  • Per-host rate limiting (configurable --delay) to avoid hammering a server
  • robots.txt awareness (the scanner checks and respects rules where available).
robots.txt: A website rule file that tells scanners which URLs they should avoid.

robots.txt status → Scanner behavior
robots.txt exists → Scanner follows the rules (Disallow / Allow)
robots.txt is missing → Scanner treats the whole site as accessible
robots.txt is empty → Same as missing; no restrictions are applied

📍 There is no robots.txt file in this repository; the scanner only fetches and respects robots.txt from target sites at runtime.
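A minimal sketch of runtime robots.txt handling with the standard library's urllib.robotparser (the scanner's own implementation may differ):

from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser

def build_robots_checker(base_url, user_agent="*"):
    parser = RobotFileParser()
    parser.set_url(urljoin(base_url, "/robots.txt"))
    try:
        parser.read()                     # fetch robots.txt from the target at runtime
    except Exception:
        return lambda url: True           # robots.txt unreachable: proceed, but log the condition
    return lambda url: parser.can_fetch(user_agent, url)

allowed = build_robots_checker("https://example.com")
print(allowed("https://example.com/private/page"))

With this parser, a missing robots.txt (HTTP 404) results in everything being treated as allowed, which matches the table above.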


  • ThreadPool-style workers for scalable crawling and analysis
Quick explanation — ThreadPool & scalable crawling and analysis (beginner-friendly)

ThreadPool-style workers

ThreadPool-style workers means the scanner creates a fixed number of worker threads once and reuses them to process many URLs, instead of creating new threads again and again.

What is a ThreadPool?

A ThreadPool is simply a group of pre-created worker threads that wait for work.

How it works in this scanner:

  • The scanner starts with a fixed number of workers (--threads)

A fixed pool equal to the --threads setting (default: 10). Threads are created once at startup and reused — no new thread per discovered link.

  • Workers stay idle until a URL appears in the shared queue
  • Each worker:
    • Takes one URL from the queue
    • Fetches the page (GET only)
    • Runs all analyses (headers, cookies, XSS, SQL error checks)
    • Reports findings
    • Goes back to the pool to process the next URL

No new threads are created during scanning

Why is this called scalable?

Scalability here does not mean infinite threads.

It means:

  • You can safely increase or decrease workers using --threads
  • The scanner handles more URLs without losing control
  • Performance improves predictably, not randomly

Examples:

  • --threads 4 → slower, very gentle on the server
  • --threads 8 → balanced and recommended
  • --threads 16 → faster, still controlled and polite

Why this matters for crawling and analysis

Each worker does both jobs:

  • Crawling (fetching pages)
  • Analysis:
    • Checking security headers
    • Inspecting cookies
    • Detecting reflected XSS markers
    • Scanning responses for SQL error patterns

So the same worker repeats this cycle:

fetch → analyze → report → repeat

This keeps scanning controlled, consistent, and safe.
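One way to picture this cycle with a fixed-size pool is Python's concurrent.futures.ThreadPoolExecutor, where max_workers plays the role of --threads. This is an illustrative sketch with a trimmed-down analysis step, not the scanner's actual code:

from concurrent.futures import ThreadPoolExecutor

import requests

def scan_url(url):
    # One worker cycle: fetch (GET only) -> analyze -> report.
    response = requests.get(url, timeout=10)
    findings = []
    for header in ("Content-Security-Policy", "X-Frame-Options"):
        if header not in response.headers:
            findings.append({"url": url, "issue": f"Missing header: {header}"})
    return findings

urls = ["https://example.com/", "https://example.com/about"]

# A fixed pool of reusable workers; threads are created once and reused for every URL.
with ThreadPoolExecutor(max_workers=10) as pool:
    for findings in pool.map(scan_url, urls):
        for finding in findings:
            print(finding)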

Bad vs good design (quick contrast)

✘ Without ThreadPool (bad design):

  • New thread created for every URL
  • Hard to limit concurrency
  • Easy to overload the target
  • Unstable performance and high memory usage

✓ With ThreadPool (this scanner):

  • Fixed number of reusable workers
  • Predictable concurrency
  • Easy rate limiting
  • Stable memory and network usage

Simple mental model

ThreadPool = a fixed team of workers sharing a task list

They don’t hire new workers every time work appears —
they just keep working until the task list is empty.

Result: safer scans, polite crawling, and predictable performance.


  • Session reuse and connection pooling (via requests.Session)
Quick explanation — Session reuse & connection pooling (beginner-friendly)

Session reuse and connection pooling

Session reuse and connection pooling mean the scanner keeps a small, persistent HTTP client state and reuses network connections, instead of opening a brand-new connection for every request.

In simple terms:

The scanner behaves like a browser that stays open, not like one that restarts for every page.

1. What is a Session (in simple terms)?

A session is a persistent context that remembers information across multiple HTTP requests.

Using requests.Session() allows the scanner to reuse:

  • Cookies (Set-Cookie)
  • Headers (User-Agent, custom headers)
  • TCP connections
  • TLS (HTTPS) handshakes

Without a session, every request starts from scratch and forgets previous responses.

Beginner analogy:
A session is like keeping one browser tab open, instead of opening a brand-new browser for every page you visit.

2. What is connection pooling?

Connection pooling means:

  • Open a small number of network connections
  • Keep them alive
  • Reuse them for multiple requests to the same host

✘ Without pooling:

Connect → Request → Close
Connect → Request → Close
Connect → Request → Close

✓ With pooling (used by this scanner):

Connect once → Request → Request → Request → Close later

This behavior is handled automatically by requests.Session().

3. Why this matters for a crawler / scanner

✘ Without session & pooling (bad practice):

  • New TCP connection per request
  • Repeated TLS handshakes
  • Slower scans
  • Higher load on the target server
  • Wasted CPU, memory, and sockets

✓ With session reuse & pooling (this scanner):

  • Faster requests
  • Lower server load
  • More realistic, browser-like behavior
  • Efficient and predictable resource usage
  • Better handling of cookies and redirects

4. Why this is important for security analysis

Session reuse improves accuracy, not just performance:

  • Cookie flags (Secure, HttpOnly, SameSite) persist across requests
  • Some security headers only appear after earlier responses
  • Behavior looks closer to a real user, not a naive bot
  • Reduces false negatives during analysis

Professional tools such as:

  • Burp Suite
  • OWASP ZAP
  • Web browsers
  • Search engine crawlers

all rely on session reuse and connection pooling for correctness and performance.

Ultra-short version:
A session keeps HTTP state; connection pooling reuses network connections.
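A minimal sketch of session reuse with requests (the header value and URLs are illustrative):

import requests

session = requests.Session()
session.headers.update({"User-Agent": "learning-grade-scanner (authorized testing only)"})

# Both requests can reuse the same pooled TCP/TLS connection to the host,
# and cookies set by the first response are sent automatically with the second.
first = session.get("https://example.com/", timeout=10)
second = session.get("https://example.com/about", timeout=10)

print(session.cookies.get_dict())   # cookies accumulated across the session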


  • Non-destructive, GET-only probing (a minimal check sketch appears after this feature list) for:

    • Missing security headers (Content-Security-Policy, X-Content-Type-Options, X-Frame-Options, Referrer-Policy, Strict-Transport-Security)
    • Insecure Set-Cookie flags (Secure, HttpOnly, SameSite)
    • Dangerous HTTP methods reported in Allow header (PUT, DELETE, TRACE, CONNECT)
    • Reflected XSS detection using contextual reflection checks
    • Error-based SQL detection by searching for common database error messages
  • Depth range support for analysis (e.g., --depth 1-2)

  • Optional custom headers and cookies for authenticated testing (no automated login flows)

  • JSON report + a simple HTML summary generated in the output directory

  • HTML report generator with embedded client-side UI for filtering, previewing, and an "AI Help Center"

  • Optional AI backend integration (server proxy) to provide remediation guidance from an LLM

  • --ai-test-ui to generate a seeded test HTML for UI validation without running a scan

  • Safety gate: requires --confirm to run
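As promised above, here is a hedged sketch of one of the GET-only checks, the Set-Cookie flag inspection. The finding fields follow the JSON format described later, but the helper name and the keys inside details are illustrative, and the scanner's real helpers may differ:

def check_cookie_flags(url, response):
    """Flag Set-Cookie headers that are missing Secure, HttpOnly, or SameSite."""
    findings = []
    # urllib3 keeps repeated Set-Cookie headers separate; requests would merge them otherwise.
    for header_value in response.raw.headers.getlist("Set-Cookie"):
        missing = [flag for flag in ("Secure", "HttpOnly", "SameSite")
                   if flag.lower() not in header_value.lower()]
        if missing:
            findings.append({
                "url": url,
                "issue": "Insecure Set-Cookie flags",
                "details": {"cookie": header_value.split("=", 1)[0], "missing": missing},
            })
    return findings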


What this scanner does (and doesn't)

Does:

  • Crawl a single host (the host of the provided base URL) up to a configurable depth
  • Crawl to reach deeper pages while only analyzing within a configurable depth range
  • Respect robots.txt entries where available
  • Perform non-destructive GET-only probes designed to identify possible issues
  • Produce human-readable reports suitable for learning and triage
  • Support authenticated scanning via custom headers/cookies (no login automation)
  • Optionally embed an AI-assisted remediation helper (via a server proxy)

Does not:

  • Perform automated login flows — you can extend it to do so
  • Perform aggressive or destructive tests (no POST/PUT/DELETE payloads by default)
  • Brute-force authentication or credentials, or automatically log into applications
  • Perform POST-based fuzzing or state-changing requests
  • Guarantee the existence of a vulnerability — the scanner flags possible issues for manual verification

Requirements

  • Python 3.9+ recommended (3.10+ tested)
  • Virtual environment recommended

Python dependencies (install via pip):

pip install -r requirements.txt

requirements.txt should include at minimum:

  - `requests`
  - `beautifulsoup4`
  - `flask`
  - `flask-cors`
  - `groq`

(Optionally add python-dotenv or similar if you want environment-based configuration.)


Files

  • learning_grade_web_scanner.py — main scanner implementation and CLI
  • app.py — minimal Flask AI proxy that forwards context to a Groq client
  • Reports/ — default output directory for JSON and HTML results

Installation

  1. Clone the repository:

git clone https://github.com/SagarBiswas-MultiHAT/Web_Vulnerability_Scanner.git
cd Web_Vulnerability_Scanner

  2. Create and activate a virtual environment (recommended):

python -m venv .venv
# Windows PowerShell
.\.venv\Scripts\Activate.ps1
# Windows CMD
.\.venv\Scripts\activate.bat
# macOS / Linux
source .venv/bin/activate

  3. Install dependencies:

pip install -r requirements.txt

Instructions (AI-enabled local setup)

This workflow uses three terminals: one for the AI backend, one to serve reports over HTTP, and one to run the scanner.

DM me for a free API key (used by the example Groq-based AI backend).

1. Terminal 1 — Start the AI backend (Flask)

The included app.py is a minimal Flask endpoint that demonstrates how the embedded report can call an AI service. It expects an environment variable GROQ_API_KEY for the Groq client used in the example. The scanner embeds the AI server URL into the generated HTML when --ai-enabled --ai-server are passed.

Set the API key and run the server:

$env:GROQ_API_KEY="your api key"
echo $env:GROQ_API_KEY
python app.py

The AI server will listen on:

http://127.0.0.1:5000/api/ai-chat
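For orientation, a minimal proxy of this shape might look like the sketch below. This is not the repository's app.py: the endpoint path matches the URL above, but the model name and the request/response fields are placeholders.

import os

from flask import Flask, jsonify, request
from flask_cors import CORS
from groq import Groq

app = Flask(__name__)
CORS(app)  # the report is served from a different origin (port 8080)

client = Groq(api_key=os.environ["GROQ_API_KEY"])

@app.route("/api/ai-chat", methods=["POST"])
def ai_chat():
    # The report UI posts finding context; the API key never leaves the server.
    payload = request.get_json(force=True)
    completion = client.chat.completions.create(
        model="llama-3.1-8b-instant",   # placeholder model name
        messages=[{"role": "user", "content": payload.get("prompt", "")}],
    )
    return jsonify({"reply": completion.choices[0].message.content})

if __name__ == "__main__":
    app.run(host="127.0.0.1", port=5000)

Keeping the LLM call behind this endpoint is what keeps the API key off the client; the generated HTML only ever sees the proxy URL.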

2. Terminal 2 — Serve the HTML reports

Expose the Reports/ directory via a local HTTP server (required for browser JS features):

cd Reports
python -m http.server 8080

This makes reports available at:

http://127.0.0.1:8080/

3. Terminal 3 — Run the scanner with AI enabled

python learning_grade_web_scanner.py https://sagarbiswas-multihat.github.io/ \
  --confirm \
  --ai-enabled \
  --ai-server http://127.0.0.1:5000/api/ai-chat

After the scan completes, note the generated HTML file name:

Reports/scan_summary_XXX.html

🌐 Open in Edge / Chrome

Open the report in your browser:

http://127.0.0.1:8080/scan_summary_XXX.html

You can now:

  • Expand findings
  • Open the AI Help Center per issue
  • Ask about severity, impact, remediation, and verification steps

Usage & Command-line arguments

The scanner exposes a small CLI. All arguments are documented below.

Usage: learning_grade_web_scanner.py [OPTIONS] base_url

Positional arguments:
  base_url              Base URL to scan (must include scheme, e.g., https://example.com)

Optional arguments:
  -> `--threads` / `-t` — number of worker threads
  -> `--delay` / `-d` — minimum delay (seconds) between requests to the same host (politeness)
  -> `--timeout` — request timeout in seconds
  -> `--depth` — either a single integer (e.g. `3`) or a range (`0-2`), controls crawl depth and min-depth for analysis
  -> `--output` / `-o` — output directory
  -> `--header` — repeatable. Allows passing `Header: Value` or `Header=Value` pairs
  -> `--cookie` — repeatable. Pass `name=value`
  -> `--ai-enabled` — embed AI help UI into the HTML report
  -> `--ai-server` — set the AI server endpoint used by the client UI (e.g. `http://127.0.0.1:5000/api/ai-chat`)
  -> `--ai-test-ui` — generate a local test page with seeded issues (no scanning)
  -> `--confirm` — acknowledge that you are authorized to scan the target; the scanner refuses to run without it

Notes on arguments:

  • base_url must include the URL scheme (http:// or https://). The scanner will only crawl the same host as the base URL.
  • --threads controls parallelism. Higher values speed up scans but may increase load on the target.
  • --delay is important: it enforces a minimum delay per host between requests. Use higher values when scanning production systems.
  • --depth limits how deep the crawler will follow links and can also be a range to control analysis depth.

More explanation on --depth

Depth is measured in link-hops from the base URL:

`0`   — scan the base URL only (seeded at depth 0)
`1`   — include the base URL and any pages linked directly from it
`2`   — include pages linked from the base page and pages linked from those pages
`3`   — include pages linked from the previous level (i.e., pages whose shortest link-distance from the base is 3)
1-2   — crawl links as needed, but only run analysis on pages whose depth is between 1 and 2 (inclusive)
`N`   — include any page whose shortest link-distance from the base URL is <= `N`

Important clarification (for 1-2): The crawler may still visit shallower pages (e.g., depth 0) to discover links, but security checks are only performed on pages at depth 1 and 2.

Implementation note: the scanner enqueues the base URL at depth 0 and enqueues discovered links with depth + 1. URLs with depth > max_depth are skipped. When a range is used (e.g., --depth 1-2), crawling still reaches deeper pages to discover links, but analysis only runs when depth >= min_depth and depth <= max_depth.
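A hedged sketch of that logic; parse_depth, extract_links, and enqueue are illustrative helpers, and analyze_page stands in for the scanner's analysis entry point (a method in the real code):

def parse_depth(value):
    """Accept either '3' or a range like '1-2' and return (min_depth, max_depth)."""
    if "-" in value:
        low, high = value.split("-", 1)
        return int(low), int(high)
    return 0, int(value)

min_depth, max_depth = parse_depth("1-2")

def handle(url, depth):
    if depth > max_depth:
        return                               # too deep: skip entirely
    if min_depth <= depth <= max_depth:
        analyze_page(url)                    # security checks only inside the depth range
    for link in extract_links(url):          # hypothetical link extraction
        enqueue(link, depth + 1)             # keep crawling to discover deeper pages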

Example crawl levels (illustrative):

Depth 0
└── https://example.com (base URL)

Depth 1
├── https://example.com/about
├── https://example.com/login
└── https://example.com/blog

Depth 2
├── https://example.com/blog/post-1
├── https://example.com/blog/post-2
└── https://example.com/about/team

Depth 3
└── https://example.com/blog/post-1/comments

|---------------------------------------------------------------------|
|                              Depth 1-2                              |
|---------------------------------------------------------------------|
|    -> Depth 0 -> Depth 1 -> Depth 2                                 |
|                                                                     |
|        --------------------------------------------------           |
|        |      Depth 0 (crawled only, not analyzed)      |           |
|        |      └── https://example.com                   |           |
|        |                                                |           |
|        |      Depth 1 (crawled + analyzed)              |           |
|        |      ├── https://example.com/about             |           |
|        |      ├── https://example.com/login             |           |
|        |      └── https://example.com/blog              |           |
|        |                                                |           |
|        |      Depth 2 (crawled + analyzed)              |           |
|        |      ├── https://example.com/blog/post-1       |           |
|        |      ├── https://example.com/blog/post-2       |           |
|        |      └── https://example.com/about/team        |           |
|        |----------------------------------------------- |           |
|                                                                     |
|---------------------------------------------------------------------|

Note: Most real-world vulnerabilities live at depth 0–2: landing pages, forms, dashboards, API endpoints, blog posts, and admin panels (if exposed).

Very deep pages are often pagination, archives, comment pages, user-generated content, and repetitive templates.

Scanning them adds noise, not value.


  • --header and --cookie let you pass authentication context for authorized, authenticated scans (no login automation).
  • --confirm is intentionally required to remind you of legal/ethical constraints. The script will refuse to run without it.

Examples

Scan a single site with default settings (safe defaults):

python learning_grade_web_scanner.py https://example.com --confirm

Scan with more threads and a longer delay (polite):

python learning_grade_web_scanner.py https://example.com --threads 8 --delay 2.0 --confirm

Scan shallowly (only the base URL):

python learning_grade_web_scanner.py https://example.com --depth 0 --confirm

Scan a depth range (only analyze pages at depth 1–2, but still crawl to reach them):

python learning_grade_web_scanner.py https://example.com --depth 1-2 --confirm

Provide custom headers or cookies (helpful for authenticated pages; no login automation is included):

python learning_grade_web_scanner.py https://example.local --confirm --header "Authorization: Bearer TOKEN" --cookie "session=abcd1234"

Generate the AI test UI HTML (no scanning):

python learning_grade_web_scanner.py https://example.local --ai-enabled --ai-test-ui

Enable the AI UI in the generated report (you must also provide --ai-server pointing to your AI proxy endpoint):

python learning_grade_web_scanner.py https://example.com --confirm --ai-enabled --ai-server http://127.0.0.1:5000/api/ai-chat

Scan with custom headers and cookies for authenticated testing (no login automation):

python learning_grade_web_scanner.py https://example.com \
  --header "Authorization: Bearer <token>" \
  --cookie "sessionid=abcd1234" \
  --confirm

Specify a custom output folder:

python learning_grade_web_scanner.py https://example.com -o ./reports --confirm

Output files & report format

After a successful run, the scanner writes two primary files to the output directory:

  • scan_report_<timestamp>.json — The raw findings array in JSON format. Each item contains:

    • url — The URL where the issue was observed
    • issue — A short title for the issue (e.g., "Missing Security Headers")
    • details — A small JSON object with more context (e.g., which headers were missing)
  • scan_summary_<timestamp>.html — A compact, shareable HTML summary table suitable for quick review.

These files are intended as starting points for manual verification. See the project code to extend or reformat outputs (CSV, Markdown, or integration with bug trackers).
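For orientation, a single (made-up) entry in the findings array might look like this; the exact keys inside details vary by check:

[
  {
    "url": "https://example.com/login",
    "issue": "Missing Security Headers",
    "details": {
      "missing": ["Content-Security-Policy", "X-Frame-Options"]
    }
  }
]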


Internals & design decisions

  • Queue-based crawling: avoids creating unbounded thread objects and keeps crawling predictable. Worker threads call crawl_worker and read from a single Queue.
  • Per-host rate limiting: last_access timestamps are tracked for each host to apply --delay politely.
  • robots.txt: the scanner fetches and respects robots.txt where available. If it cannot fetch robots.txt, it proceeds but logs the condition.
  • Non-destructive testing: tests are GET-only and intentionally conservative. XSS probes insert unique markers and search for contextual reflection, and SQLi checks look for common DB error messages.
  • Session reuse: requests.Session is used to speed up scanning and correctly handle cookies/headers.
  • Sanitization layer before embedding data into HTML/JS
  • AI integration isolated behind a server proxy (no client secrets)

Limitations to be aware of:

  • The scanner does not replace manual testing or professional tools like Burp Suite or ZAP. It is a learning project and should be used as such.
  • Detection is heuristic-based and will produce false positives and possibly false negatives. Every finding must be manually validated.

Extending the scanner

Examples of useful extensions:

  • Add authenticated scanning (session handling, login form automation)
  • Add payload plugins and a plugin API for community rules
  • Integrate more advanced XSS and SQLi tests (time-based, boolean-based, context-aware DOM analysis)
  • Add severity scoring and fingerprinting to deduplicate findings
  • Add unit tests and GitHub Actions for CI (linting + lightweight tests)
  • Add new checks inside analyze_page() by creating helper methods similar to check_reflected_xss() or check_sqli_errors()
  • Extend the HTML report UI to show additional metadata
  • Replace the AI backend with another provider (OpenAI, local LLM, etc.)
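For example, a new check in the spirit of the existing helpers might look like the hedged sketch below; the helper name, the Server-header rule, and the way findings are collected are illustrative, and the real method signatures may differ:

def check_server_header_disclosure(url, response):
    """Example extra check: flag verbose Server headers that reveal software versions."""
    server = response.headers.get("Server", "")
    if any(ch.isdigit() for ch in server):   # crude heuristic: a version number is present
        return [{
            "url": url,
            "issue": "Server version disclosure",
            "details": {"server": server},
        }]
    return []

# Inside analyze_page(), the new helper would be called alongside the existing checks
# and its results appended to the report, e.g.:
#     findings.extend(check_server_header_disclosure(url, response))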

Any of the above would make a good follow-up contribution.


Safety & legal notes (please read)

  • Do not run this scanner against third-party systems without explicit written permission.
  • Use --delay to reduce the load on target servers and avoid interference with production.
  • This project is provided for educational purposes. The author accepts no responsibility for misuse.
How a scan proceeds:

  • The scanner seeds a queue with the base_url and uses multiple worker threads to fetch pages politely (respecting robots.txt and per-host crawl delays).
  • The crawler collects same-site links and enqueues them for deeper crawling up to max_depth.
  • Pages at or deeper than min_depth are analyzed for headers, cookies, HTTP methods, reflected XSS evidence, and error messages pointing to potential SQL injection.
  • Findings are appended to an in-memory report and written as JSON + an interactive HTML summary (sanitized) at the end.
  • If crawling seems slow, increase --threads or adjust --delay, but be careful to remain polite to the target site.

Limitations:

  • This is not a replacement for a professional scanner. It is a learning tool and demonstrates heuristic-based checks that can produce false positives and false negatives.
  • All probes are intentionally non-destructive, but scanning web applications can still trigger logging, rate limiting, or other side effects.
  • The XSS and SQLi checks are simple heuristics; deeper interactive testing (POST, authenticated flows, parameterized analysis) is out of scope.

Contributing

Contributions are welcome. If you add features, please:

  • Keep the default behavior safe and non-destructive
  • Add tests for new features
  • Update README usage examples and argument descriptions

Contact / Acknowledgements

This README accompanies a learning-grade scanner implementation. If you want help integrating an LLM-based remediation assistant safely (production-ready proxy patterns, rate limiting, or example Nginx configurations), open an issue and ask for a focused guide.

