A portfolio-ready, learning-grade web vulnerability scanner and lightweight AI-assisted report viewer. This project demonstrates a polite, non-destructive approach to crawling and finding common web security issues (security headers, insecure cookie flags, reflected XSS heuristics, and basic error-based SQL injection indicators). It ships with a small Flask-based AI proxy intended to power the in-report AI assistance (optional).
⚠️ Important — Authorization & Ethics: This tool is meant only for authorized security testing, education, and labs. Always obtain explicit written permission before scanning any domain you do not own or are not authorized to test. The scanner requires a `--confirm` flag to run as an extra safety step.
Pictures
After running `python app.py`
After running `learning_grade_web_scanner.py`
After running `cd Reports` and `python -m http.server 8080`
JSON Report Example
HTML Report Example
AI Help Center
Without "Anonymize before send"
With "Anonymize before send"
- Key features
- What this scanner does (and doesn't)
- Requirements
- Files
- Installation
- Usage & Command-line arguments
- Examples
- Output files & report format
- Internals & design decisions
- Extending the scanner
- Safety & legal notes
- Contributing
- Contact / Acknowledgements
- Queue-based polite crawler (no recursive thread spawning).
Quick explanation — Queue-based crawler, Polite crawler & No recursive thread spawning (beginner-friendly)
A queue-based crawler uses a single shared work queue to manage all URLs that need to be visited.
How it works (conceptually):
- Start with the base URL → put it into the queue
- Worker threads repeatedly:
- Take one URL from the queue
- Fetch the page
- Extract links
- Add new, allowed URLs back into the same queue
- Repeat until the queue is empty or the depth limit is reached
What this means:
- Each worker processes one URL at a time
- There is central control over what gets scanned
- Crawl order, depth, and limits remain predictable
Key point:
All crawling tasks go through one controlled pipeline (the queue).
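A minimal sketch of this pipeline in Python (illustrative only; names and details differ from the project's actual implementation, and politeness/robots.txt checks are omitted for brevity):

```python
import threading
from queue import Empty, Queue
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

def crawl_worker(q, session, base_host, seen, lock, max_depth):
    """One worker: take a URL from the shared queue, fetch it, enqueue new same-host links."""
    while True:
        try:
            url, depth = q.get(timeout=3)        # take one URL from the queue
        except Empty:
            return                               # queue drained -> worker exits
        try:
            resp = session.get(url, timeout=10)  # GET-only, non-destructive
            soup = BeautifulSoup(resp.text, "html.parser")
            for a in soup.find_all("a", href=True):
                link = urljoin(url, a["href"])
                # same host only, respect the depth limit, avoid revisits
                if urlparse(link).netloc == base_host and depth + 1 <= max_depth:
                    with lock:
                        if link not in seen:
                            seen.add(link)
                            q.put((link, depth + 1))
        finally:
            q.task_done()

if __name__ == "__main__":
    base_url = "https://example.com"             # illustrative target
    q, seen, lock = Queue(), {base_url}, threading.Lock()
    session = requests.Session()
    q.put((base_url, 0))
    # Fixed pool, like --threads 4: threads are created once and reused.
    pool = [threading.Thread(target=crawl_worker,
                             args=(q, session, urlparse(base_url).netloc, seen, lock, 2))
            for _ in range(4)]
    for t in pool:
        t.start()
    for t in pool:
        t.join()
```

Discovered links only ever become queue items, never new threads, which is exactly the property described above.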
A polite crawler is designed not to stress or harm servers.
In this scanner, politeness includes:
- ⏱ Per-host rate limiting (`--delay`)
- 🤖 Respecting `robots.txt`
- 🌱 GET-only requests (non-destructive)
- 🚫 No brute-force or payload floods
- 🧵 Controlled number of threads
Instead of hammering a site, the scanner behaves more like a careful human using a browser.
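Per-host rate limiting, for example, boils down to remembering when each host was last contacted and sleeping before the next request. A simplified sketch (the scanner tracks this internally; the names here are illustrative):

```python
import threading
import time
from urllib.parse import urlparse

last_access = {}                # host -> timestamp of the next allowed request
access_lock = threading.Lock()

def polite_wait(url, delay):
    """Ensure requests to the same host are at least `delay` seconds apart."""
    host = urlparse(url).netloc
    with access_lock:
        now = time.time()
        wait_for = max(0.0, last_access.get(host, 0.0) - now)
        # reserve the next slot before sleeping so other threads back off too
        last_access[host] = now + wait_for + delay
    if wait_for > 0:
        time.sleep(wait_for)
```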
This explains what the crawler deliberately does NOT do.
✘ Bad design (recursive thread spawning):
```
Thread A visits URL A
 └── spawns Thread B for link B
      └── spawns Thread C for link C
           └── spawns Thread D for link D
```
A thread is a lightweight unit of execution that lets a program do work in parallel ("in parallel" means multiple tasks run at the same time instead of one after another).
Problems with this approach:
- Unbounded thread growth
- Loss of concurrency control
- Servers get overwhelmed
- Scanner runs out of memory or sockets
- Hard to enforce delays and crawl depth
This pattern is called recursive thread spawning — every discovered link creates new threads.
```
[ Queue ]
    ↓
Worker Thread Pool (fixed size)
    ↓
Fetch → Extract → Enqueue (back to Queue)
```
A fixed pool equal to the `--threads` setting (default: 10). Threads are created once at startup and reused — no new thread per discovered link.
- Threads are created once
- Thread count is fixed (`--threads`)
- No thread creates another thread
- Discovered links are treated as data, not new execution contexts
Core idea:
No recursive thread spawning = threads do not create threads
| Aspect | Queue-based crawler | Recursive spawning |
|---|---|---|
| Thread control | ✅ Fixed & predictable | ❌ Unbounded |
| Rate limiting | ✅ Enforceable | ❌ Difficult |
| Server safety | ✅ Polite | ❌ Aggressive |
| Memory safety | ✅ Stable | ❌ Risky |
| Debugging | ✅ Easier | ❌ Chaotic |
| Legal / ethical safety | ✅ Much safer | ❌ Risky |
This is why professional tools and search engine crawlers use queue-based designs.
Result: safer scans, predictable behavior, ethical crawling, and easier extensibility.
- Per-host rate limiting (configurable `--delay`) to avoid hammering a server
- `robots.txt` awareness (the scanner checks and respects rules where available)
`robots.txt`: A website rule file that tells scanners which URLs they should avoid.
| `robots.txt` status | Scanner behavior |
|---|---|
| `robots.txt` exists | Scanner follows the rules (Disallow / Allow) |
| `robots.txt` is missing | Scanner treats the whole site as accessible |
| `robots.txt` is empty | Same as missing; no restrictions are applied |
📍 There is no `robots.txt` in this repository; the scanner only fetches and respects `robots.txt` from target sites at runtime.
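One way to get exactly this behavior at runtime is Python's built-in robots.txt parser; this is a sketch of the idea, and the scanner's own handling may differ in detail:

```python
from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser

def build_robots_checker(base_url, user_agent="LearningGradeScanner"):
    """Fetch the target's robots.txt once and return a can_fetch(url) helper."""
    rp = RobotFileParser()
    rp.set_url(urljoin(base_url, "/robots.txt"))
    try:
        rp.read()                  # missing or empty robots.txt -> everything allowed
    except Exception:
        return lambda url: True    # could not fetch at all -> proceed (log this in real code)
    return lambda url: rp.can_fetch(user_agent, url)

allowed = build_robots_checker("https://example.com")
print(allowed("https://example.com/some/page"))   # True unless a Disallow rule matches
```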
- ThreadPool-style workers for scalable crawling and analysis
Quick explanation — ThreadPool & scalable crawling and analysis (beginner-friendly)
ThreadPool-style workers means the scanner creates a fixed number of worker threads once and reuses them to process many URLs, instead of creating new threads again and again.
A ThreadPool is simply a group of pre-created worker threads that wait for work.
How it works in this scanner:
- The scanner starts with a fixed number of workers (`--threads`, default 10); they are created once at startup and reused, so no new thread is spawned per discovered link
- Workers stay idle until a URL appears in the shared queue
- Each worker:
- Takes one URL from the queue
- Fetches the page (GET only)
- Runs all analyses (headers, cookies, XSS, SQL error checks)
- Reports findings
- Goes back to the pool to process the next URL
No new threads are created during scanning
Scalability here does not mean infinite threads.
It means:
- You can safely increase or decrease workers using `--threads`
- The scanner handles more URLs without losing control
- Performance improves predictably, not randomly
Examples:
- `--threads 4` → slower, very gentle on the server
- `--threads 8` → balanced and recommended
- `--threads 16` → faster, still controlled and polite
Each worker does both jobs:
- Crawling (fetching pages)
- Analysis:
- Checking security headers
- Inspecting cookies
- Detecting reflected XSS markers
- Scanning responses for SQL error patterns
So the same worker repeats this cycle:
fetch → analyze → report → repeat
This keeps scanning controlled, consistent, and safe.
✘ Without ThreadPool (bad design):
- New thread created for every URL
- Hard to limit concurrency
- Easy to overload the target
- Unstable performance and high memory usage
✓ With ThreadPool (this scanner):
- Fixed number of reusable workers
- Predictable concurrency
- Easy rate limiting
- Stable memory and network usage
ThreadPool = a fixed team of workers sharing a task list
They don’t hire new workers every time work appears —
they just keep working until the task list is empty.
Result: safer scans, polite crawling, and predictable performance.
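The project wires its own worker loop around a shared queue, but the same fixed-pool idea can be shown with the standard library's `ThreadPoolExecutor` (illustrative only; `fetch_and_analyze` is a placeholder for the real fetch-plus-checks routine):

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_and_analyze(url):
    """Placeholder: fetch the page (GET only), then run header/cookie/XSS/SQL-error checks."""
    ...

urls = ["https://example.com/", "https://example.com/about", "https://example.com/blog"]

# A fixed team of workers (like --threads); no new thread is created per URL.
with ThreadPoolExecutor(max_workers=8) as pool:
    pool.map(fetch_and_analyze, urls)
```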
- Session reuse and connection pooling (via `requests.Session`)
Quick explanation — Session reuse & connection pooling (beginner-friendly)
Session reuse and connection pooling mean the scanner keeps a small, persistent HTTP client state and reuses network connections, instead of opening a brand-new connection for every request.
In simple terms:
The scanner behaves like a browser that stays open, not like one that restarts for every page.
A session is a persistent context that remembers information across multiple HTTP requests.
Using `requests.Session()` allows the scanner to reuse:
- Cookies (`Set-Cookie`)
- Headers (User-Agent, custom headers)
- TCP connections
- TLS (HTTPS) handshakes
Without a session, every request starts from scratch and forgets previous responses.
Beginner analogy:
A session is like keeping one browser tab open, instead of opening a brand-new browser for every page you visit.
Connection pooling means:
- Open a small number of network connections
- Keep them alive
- Reuse them for multiple requests to the same host
✘ Without pooling:

```
Connect → Request → Close
Connect → Request → Close
Connect → Request → Close
```

✓ With pooling (used by this scanner):

```
Connect once → Request → Request → Request → Close later
```
This behavior is handled automatically by requests.Session().
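In code this is simply a shared `requests.Session`; the pool sizes can optionally be tuned through an `HTTPAdapter` (a sketch, with illustrative values):

```python
import requests
from requests.adapters import HTTPAdapter

session = requests.Session()
session.headers.update({"User-Agent": "LearningGradeScanner/1.0"})  # sent on every request

# Optional: tune how many pooled connections are kept alive per host.
adapter = HTTPAdapter(pool_connections=10, pool_maxsize=10)
session.mount("http://", adapter)
session.mount("https://", adapter)

# These requests reuse the same TCP/TLS connection and share cookies set by earlier responses.
session.get("https://example.com/", timeout=10)
session.get("https://example.com/about", timeout=10)
session.get("https://example.com/blog", timeout=10)
```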
✘ Without session & pooling (bad practice):
- New TCP connection per request
- Repeated TLS handshakes
- Slower scans
- Higher load on the target server
- Wasted CPU, memory, and sockets
✓ With session reuse & pooling (this scanner):
- Faster requests
- Lower server load
- More realistic, browser-like behavior
- Efficient and predictable resource usage
- Better handling of cookies and redirects
Session reuse improves accuracy, not just performance:
- Cookie flags (`Secure`, `HttpOnly`, `SameSite`) persist across requests
- Some security headers only appear after earlier responses
- Behavior looks closer to a real user, not a naive bot
- Reduces false negatives during analysis
Professional tools such as:
- Burp Suite
- OWASP ZAP
- Web browsers
- Search engine crawlers
all rely on session reuse and connection pooling for correctness and performance.
Ultra-short version:
A session keeps HTTP state; connection pooling reuses network connections.
- Non-destructive, GET-only probing (see the sketch after this feature list) for:
  - Missing security headers (Content-Security-Policy, X-Content-Type-Options, X-Frame-Options, Referrer-Policy, Strict-Transport-Security)
  - Insecure `Set-Cookie` flags (Secure, HttpOnly, SameSite)
  - Dangerous HTTP methods reported in the `Allow` header (PUT, DELETE, TRACE, CONNECT)
  - Reflected XSS detection using contextual reflection checks
  - Error-based SQL injection detection by searching for common database error messages
- Depth range support for analysis (e.g., `--depth 1-2`)
- Optional custom headers and cookies for authenticated testing (no automated login flows)
- JSON report + a simple HTML summary generated in the output directory
- HTML report generator with embedded client-side UI for filtering, previewing, and an "AI Help Center"
- Optional AI backend integration (server proxy) to provide remediation guidance from an LLM
- `--ai-test-ui` to generate a seeded test HTML for UI validation without running a scan
- Safety gate: requires `--confirm` to run
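To make the probing concrete, here is a simplified header and cookie check in the spirit of the scanner's heuristics (helper and field names are illustrative, not necessarily the project's exact methods):

```python
import requests

REQUIRED_HEADERS = [
    "Content-Security-Policy",
    "X-Content-Type-Options",
    "X-Frame-Options",
    "Referrer-Policy",
    "Strict-Transport-Security",
]

def check_security_headers(url: str, session: requests.Session) -> list[dict]:
    """GET the page once, then flag missing security headers and weak cookie flags."""
    findings = []
    resp = session.get(url, timeout=10)

    missing = [h for h in REQUIRED_HEADERS if h not in resp.headers]
    if missing:
        findings.append({"url": url, "issue": "Missing Security Headers",
                         "details": {"missing": missing}})

    # requests folds multiple Set-Cookie headers into one string; good enough for a heuristic.
    set_cookie = resp.headers.get("Set-Cookie", "").lower()
    if set_cookie:
        weak_flags = [flag for flag in ("secure", "httponly", "samesite")
                      if flag not in set_cookie]
        if weak_flags:
            findings.append({"url": url, "issue": "Insecure Cookie Flags",
                             "details": {"missing_flags": weak_flags}})
    return findings
```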
Does:
- Crawl a single host (the host of the provided base URL) up to a configurable depth
- Crawl to reach deeper pages while only analyzing within a configurable depth range
- Respect `robots.txt` entries where available
- Perform non-destructive GET-only probes designed to identify possible issues
- Produce human-readable reports suitable for learning and triage
- Support authenticated scanning via custom headers and cookies (no login automation)
- Optionally embed an AI-assisted remediation helper (via server proxy)
Does not:
- Perform automated login flows — you can extend it to do so
- Perform aggressive or destructive tests (no POST/PUT/DELETE payloads by default)
- Brute-force authentication or credentials, or automatically log into applications
- Perform POST-based fuzzing or state-changing requests
- Guarantee the existence of a vulnerability — the scanner flags possible issues for manual verification
- Python 3.9+ recommended (3.10+ tested)
- Virtual environment recommended
Python dependencies (install via pip):
pip install -r requirements.txt
requirements.txt should include at minimum:
- `requests`
- `beautifulsoup4`
- `flask`
- `flask-cors`
- `groq`
(Optionally add python-dotenv or similar if you want environment-based configuration.)
- `learning_grade_web_scanner.py` — main scanner implementation and CLI
- `app.py` — minimal Flask AI proxy that forwards context to a Groq client
- `Reports/` — default output directory for JSON and HTML results
- Clone the repository:

```
git clone https://github.com/SagarBiswas-MultiHAT/Web_Vulnerability_Scanner.git
cd Web_Vulnerability_Scanner
```

- Create and activate a virtual environment (recommended):

```
python -m venv .venv
# Windows PowerShell
.\.venv\Scripts\Activate.ps1
# Windows CMD
.\.venv\Scripts\activate.bat
# macOS / Linux
source .venv/bin/activate
```

- Install dependencies:

```
pip install -r requirements.txt
```

This workflow uses three terminals: one for the AI backend, one to serve reports over HTTP, and one to run the scanner.
DM me for a free API key (used by the example Groq-based AI backend).
The included `app.py` is a minimal Flask endpoint that demonstrates how the embedded report can call an AI service. It expects an environment variable `GROQ_API_KEY` for the Groq client used in the example. The scanner embeds the AI server URL into the generated HTML when `--ai-enabled --ai-server` are passed.
Set the API key and run the server:

```
$env:GROQ_API_KEY="your api key"
echo $env:GROQ_API_KEY
python app.py
```

The AI server will listen on:

```
http://127.0.0.1:5000/api/ai-chat
```
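For orientation, a minimal proxy along these lines could look like the sketch below. This is not the repository's exact `app.py`; the endpoint path matches the URL above, while the model name and request fields are assumptions:

```python
# Minimal AI-proxy sketch (illustrative; adjust fields/model to your setup).
import os

from flask import Flask, jsonify, request
from flask_cors import CORS
from groq import Groq

app = Flask(__name__)
CORS(app)  # allow the report page (served from another port) to call this API
client = Groq(api_key=os.environ["GROQ_API_KEY"])

@app.route("/api/ai-chat", methods=["POST"])
def ai_chat():
    payload = request.get_json(force=True) or {}
    question = payload.get("question", "")
    context = payload.get("context", "")  # finding details sent by the report UI
    completion = client.chat.completions.create(
        model="llama-3.1-8b-instant",  # placeholder model name
        messages=[
            {"role": "system", "content": "You are a web security remediation assistant."},
            {"role": "user", "content": f"Finding context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return jsonify({"answer": completion.choices[0].message.content})

if __name__ == "__main__":
    app.run(host="127.0.0.1", port=5000)
```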
Expose the `Reports/` directory via a local HTTP server (required for browser JS features):

```
$env:GROQ_API_KEY="your api key"
echo $env:GROQ_API_KEY
cd Reports
python -m http.server 8080
```

This makes reports available at:

```
http://127.0.0.1:8080/
```
In the third terminal, run the scanner:

```
python learning_grade_web_scanner.py https://sagarbiswas-multihat.github.io/ \
  --confirm \
  --ai-enabled \
  --ai-server http://127.0.0.1:5000/api/ai-chat
```

After the scan completes, note the generated HTML file name:
Reports/scan_summary_XXX.html
Open the report in your browser:
http://127.0.0.1:8080/scan_summary_XXX.html
You can now:
- Expand findings
- Open the AI Help Center per issue
- Ask about severity, impact, remediation, and verification steps
The scanner exposes a small CLI. All arguments are documented below.
Usage: learning_grade_web_scanner.py [OPTIONS] base_url
Positional arguments:
base_url Base URL to scan (must include scheme, e.g., https://example.com)
Optional arguments:
-> `--threads` / `-t` — number of worker threads
-> `--delay` / `-d` — minimum delay (seconds) between requests to the same host (politeness)
-> `--timeout` — request timeout in seconds
-> `--depth` — either a single integer (e.g. `3`) or a range (`0-2`), controls crawl depth and min-depth for analysis
-> `--output` / `-o` — output directory
-> `--header` — repeatable. Allows passing `Header: Value` or `Header=Value` pairs
-> `--cookie` — repeatable. Pass `name=value`
-> `--ai-enabled` — embed AI help UI into the HTML report
-> `--ai-server` — set the AI server endpoint used by the client UI (e.g. `http://127.0.0.1:5000/api/ai-chat`)
-> `--ai-test-ui` — generate a local test page with seeded issues (no scanning)
Notes on arguments:
- `base_url` must include the URL scheme (`http://` or `https://`). The scanner will only crawl the same host as the base URL.
- `--threads` controls parallelism. Higher values speed up scans but may increase load on the target.
- `--delay` is important: it enforces a minimum delay per host between requests. Use higher values when scanning production systems.
- `--depth` limits how deep the crawler will follow links and can also be a range to control analysis depth.
- `0` — scan the base URL only (seeded at depth 0)
- `1` — include the base URL and any pages linked directly from it
- `2` — include pages linked from the base page and pages linked from those pages
- `3` — include pages linked from the previous level (i.e., pages whose shortest link-distance from the base is 3)
- `1-2` — crawl links as needed, but only run analysis on pages whose depth is between 1 and 2 (inclusive)
- `N` — include any page whose shortest link-distance from the base URL is <= `N`
Important clarification (for 1-2): The crawler may still visit shallower pages (e.g., depth 0) to discover links, but security checks are only performed on pages at depth 1 and 2.
Implementation note: the scanner enqueues the base URL at depth 0 and enqueues discovered links with `depth + 1`. URLs with `depth > max_depth` are skipped. When a range is used (e.g., `--depth 1-2`), crawling still reaches deeper pages to discover links, but analysis only runs when `min_depth <= depth <= max_depth`.
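In sketch form, with illustrative helper names rather than the project's exact functions:

```python
def parse_depth(value: str) -> tuple:
    """Turn a --depth argument into (min_depth, max_depth): '3' -> (0, 3); '1-2' -> (1, 2)."""
    if "-" in value:
        low, high = value.split("-", 1)
        return int(low), int(high)
    return 0, int(value)

def should_analyze(depth: int, min_depth: int, max_depth: int) -> bool:
    """Analysis only runs inside the requested range; shallower pages are crawled for links only."""
    return min_depth <= depth <= max_depth

def should_enqueue(depth: int, max_depth: int) -> bool:
    """Links deeper than max_depth are never enqueued."""
    return depth <= max_depth

print(parse_depth("1-2"))        # (1, 2)
print(should_analyze(0, 1, 2))   # False: crawled only, not analyzed
print(should_analyze(2, 1, 2))   # True
```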
```
Depth 0
└── https://example.com (base URL)

Depth 1
├── https://example.com/about
├── https://example.com/login
└── https://example.com/blog

Depth 2
├── https://example.com/blog/post-1
├── https://example.com/blog/post-2
└── https://example.com/about/team

Depth 3
└── https://example.com/blog/post-1/comments
```
```
|---------------------------------------------------------------------|
| Depth 1-2                                                           |
|---------------------------------------------------------------------|
| -> Depth 0 -> Depth 1 -> Depth 2                                    |
|                                                                     |
|   --------------------------------------------------               |
|   | Depth 0 (crawled only, not analyzed)           |               |
|   |   └── https://example.com                      |               |
|   |                                                |               |
|   | Depth 1 (crawled + analyzed)                   |               |
|   |   ├── https://example.com/about                |               |
|   |   ├── https://example.com/login                |               |
|   |   └── https://example.com/blog                 |               |
|   |                                                |               |
|   | Depth 2 (crawled + analyzed)                   |               |
|   |   ├── https://example.com/blog/post-1          |               |
|   |   ├── https://example.com/blog/post-2          |               |
|   |   └── https://example.com/about/team           |               |
|   --------------------------------------------------               |
|                                                                     |
|---------------------------------------------------------------------|
```
Note: Most real-world vulnerabilities live at depth 0–2, e.g., landing pages, forms, dashboards, API endpoints, blog posts, and admin panels (if exposed).
Very deep pages are often pagination, archives, comment pages, user-generated content, or repetitive templates.
Scanning them adds noise, not value.
- `--header` and `--cookie` let you pass authentication context for authorized, authenticated scans (no login automation).
- `--confirm` is intentionally required to remind you of legal/ethical constraints. The script will refuse to run without it.
Scan a single site with default settings (safe defaults):

```
python learning_grade_web_scanner.py https://example.com --confirm
```

Scan with more threads and a longer delay (polite):

```
python learning_grade_web_scanner.py https://example.com --threads 8 --delay 2.0 --confirm
```

Scan shallowly (only the base URL):

```
python learning_grade_web_scanner.py https://example.com --depth 0 --confirm
```

Scan a depth range (only analyze pages at depth 1–2, but still crawl to reach them):

```
python learning_grade_web_scanner.py https://example.com --depth 1-2 --confirm
```

Run with a depth range (only analyze pages at or deeper than `min_depth`):

```
python learning_grade_web_scanner.py https://example.com --confirm --depth 1-2
```

Provide custom headers or cookies (helpful for authenticated pages; no login automation is included):

```
python learning_grade_web_scanner.py https://example.local --confirm --header "Authorization: Bearer TOKEN" --cookie "session=abcd1234"
```

Generate the AI test UI HTML (no scanning):

```
python learning_grade_web_scanner.py https://example.local --ai-enabled --ai-test-ui
```

Enable the AI UI in the generated report (you must also provide `--ai-server` pointing to your AI proxy endpoint):

```
python learning_grade_web_scanner.py https://example.com --confirm --ai-enabled --ai-server http://127.0.0.1:5000/api/ai-chat
```

Scan with custom headers and cookies for authenticated testing (no login automation):

```
python learning_grade_web_scanner.py https://example.com \
  --header "Authorization: Bearer <token>" \
  --cookie "sessionid=abcd1234" \
  --confirm
```

Specify a custom output folder:

```
python learning_grade_web_scanner.py https://example.com -o ./reports --confirm
```

After a successful run, the scanner writes two primary files to the output directory:
- `scan_report_<timestamp>.json` — the raw findings array in JSON format. Each item contains the following fields (see the example below):
  - `url` — the URL where the issue was observed
  - `issue` — a short title for the issue (e.g., "Missing Security Headers")
  - `details` — a small JSON object with more context (e.g., which headers were missing)
- `scan_summary_<timestamp>.html` — a compact, shareable HTML summary table suitable for quick review.
These files are intended as starting points for manual verification. See the project code to extend or reformat outputs (CSV, Markdown, or integration with bug trackers).
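For orientation, a single entry in the JSON report looks roughly like this (values are illustrative, not real scan output):

```json
[
  {
    "url": "https://example.com/login",
    "issue": "Missing Security Headers",
    "details": {
      "missing": ["Content-Security-Policy", "Strict-Transport-Security"]
    }
  }
]
```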
- Queue-based crawling: avoids creating unbounded thread objects and keeps crawling predictable. Worker threads call `crawl_worker` and read from a single `Queue`.
- Per-host rate limiting: `last_access` timestamps are tracked for each host to apply `--delay` politely.
- robots.txt: the scanner fetches and respects `robots.txt` where available. If it cannot fetch robots.txt, it proceeds but logs the condition.
- Non-destructive testing: tests are GET-only and intentionally conservative. XSS probes insert unique markers and search for contextual reflection, and SQLi checks look for common DB error messages (a simplified sketch follows this list).
- Session reuse: `requests.Session` is used to speed up scanning and correctly handle cookies/headers.
- Sanitization layer before embedding data into HTML/JS
- AI integration isolated behind a server proxy (no client secrets)
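A simplified version of those two heuristics is sketched below (the marker format and error-pattern list are illustrative, not the scanner's exact ones):

```python
import re
import uuid
from typing import Optional

import requests

SQL_ERROR_PATTERNS = [
    r"you have an error in your sql syntax",                 # MySQL
    r"unclosed quotation mark after the character string",   # SQL Server
    r"pg_query\(\)",                                         # PostgreSQL via PHP
    r"ora-\d{5}",                                            # Oracle
]

def probe_reflection(url: str, param: str, session: requests.Session) -> Optional[dict]:
    """Inject a unique, harmless marker and report if it is reflected verbatim in the response."""
    marker = "xssprobe" + uuid.uuid4().hex[:8]
    resp = session.get(url, params={param: marker}, timeout=10)
    if marker in resp.text:
        return {"url": resp.url, "issue": "Possible Reflected XSS",
                "details": {"parameter": param, "marker": marker}}
    return None

def check_sql_errors(url: str, body: str) -> Optional[dict]:
    """Flag responses that contain well-known database error strings."""
    for pattern in SQL_ERROR_PATTERNS:
        if re.search(pattern, body, re.IGNORECASE):
            return {"url": url, "issue": "Possible SQL Error Disclosure",
                    "details": {"pattern": pattern}}
    return None
```

Both helpers only return evidence for manual verification; neither proves a vulnerability on its own.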
Limitations to be aware of:
- The scanner does not replace manual testing or professional tools like Burp Suite or ZAP. It is a learning project and should be used as such.
- Detection is heuristic-based and will produce false positives and possibly false negatives. Every finding must be manually validated.
Examples of useful extensions:
- Add authenticated scanning (session handling, login form automation)
- Add payload plugins and a plugin API for community rules
- Integrate more advanced XSS and SQLi tests (time-based, boolean-based, context-aware DOM analysis)
- Add severity scoring and fingerprinting to deduplicate findings
- Add unit tests and GitHub Actions for CI (linting + lightweight tests)
- Adding new checks inside `analyze_page()`
- Creating new helper methods similar to `check_reflected_xss()`
- Extending the HTML report UI for additional metadata
- Replacing the AI backend with another provider (OpenAI, local LLM, etc.)
- To add checks, extend the `analyze_page` method and add helper functions similar to `check_reflected_xss` or `check_sqli_errors` (a hypothetical example is sketched below).
Any of the above would make a good follow-up contribution.
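For instance, a hypothetical extra check for a missing `Permissions-Policy` header could follow the same pattern. This helper does not exist in the project and assumes findings are collected in a list such as `self.report` (adjust to the real attribute name):

```python
def check_permissions_policy(self, url, response):
    """Hypothetical new check: flag responses without a Permissions-Policy header."""
    if "Permissions-Policy" not in response.headers:
        self.report.append({
            "url": url,
            "issue": "Missing Permissions-Policy Header",
            "details": {"hint": "Restrict powerful browser features (camera, geolocation, ...)"},
        })

# ...then call it from analyze_page() alongside the existing checks:
#     self.check_permissions_policy(url, response)
```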
- Do not run this scanner against third-party systems without explicit written permission.
- Use `--delay` to reduce the load on target servers and avoid interference with production.
- This project is provided for educational purposes. The author accepts no responsibility for misuse.
- The scanner seeds a queue with the `base_url` and uses multiple worker threads to fetch pages politely (respecting `robots.txt` and per-host crawl delays).
- The crawler collects same-site links and enqueues them for deeper crawling up to `max_depth`.
- Pages at or deeper than `min_depth` are analyzed for headers, cookies, HTTP methods, reflected XSS evidence, and error messages pointing to potential SQL injection.
- Findings are appended to an in-memory report and written as JSON + an interactive HTML summary (sanitized) at the end.
- If crawling seems slow, increase `--threads` or adjust `--delay`, but be careful to remain polite to the target site.
- This is not a replacement for a professional scanner. It is a learning tool and demonstrates heuristic-based checks that can produce false positives and false negatives.
- All probes are intentionally non-destructive, but scanning web applications can still trigger logging, rate limiting, or other side effects.
- The XSS and SQLi checks are simple heuristics; deeper interactive testing (POST, authenticated flows, parameterized analysis) is out of scope.
Contributions are welcome. If you add features, please:
- Keep the default behavior safe and non-destructive
- Add tests for new features
- Update README usage examples and argument descriptions
This README was generated to accompany a learning-grade scanner implementation. If you want more help integrating an LLM-based remediation assistant safely (production-ready proxy patterns, rate-limiting, or example Nginx configurations), ask for a focused guide and I will provide one.