A high-performance scanner for detecting Australian Personal Information (PI) in GitHub repositories, designed for enterprise compliance with Australian privacy regulations.
- Australian PI Detection: Specialized detection for TFN, ABN, Medicare numbers, BSB codes, ACN, driver licenses, passports, and credit cards
- Banking Domain Intelligence: AST-based analysis for Java, Scala, and Python with banking-specific risk assessment
- Two-Phase Architecture: Pattern detection followed by optional AI-powered validation for higher accuracy
- Local LLM Integration: Code-aware validation using LM Studio for superior false positive reduction
- Repository Structure Analysis: Intelligent risk zone mapping based on file paths and code patterns
- Smart Progress Tracking: Real-time progress indicators with accurate time estimates
- Secure Output: Configurable masking levels to protect sensitive data in reports
- Enterprise Ready: Non-interactive mode for CI/CD integration with comprehensive reporting
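To make the masking levels under Secure Output concrete, here is a minimal Python sketch of the three modes. It is not the scanner's implementation: the function name `mask_value` is hypothetical, the rule "keep the first 3 and last 2 characters" is an assumption inferred from the `123****82` example in this README, and rendering full masking as asterisks is also an assumption.

```python
def mask_value(value: str, level: str = "partial") -> str:
    """Mask a detected value according to a masking level.

    Assumption: "partial" keeps the first 3 and last 2 characters,
    which reproduces the 123****82 example; the real rule may differ.
    """
    if level == "none":
        # No masking: the raw value is returned unchanged.
        return value
    if level == "full":
        # Full masking: every character is replaced.
        return "*" * len(value)
    if level == "partial":
        # Values of 5 characters or fewer would be fully revealed by
        # keeping 3 + 2 characters, so redact them completely.
        if len(value) <= 5:
            return "*" * len(value)
        return value[:3] + "*" * (len(value) - 5) + value[-2:]
    raise ValueError(f"unknown masking level: {level!r}")
```

For a nine-digit value such as `123456782`, partial masking under these assumptions yields `123****82`.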
- Go 1.21+ (for building from source)
- GitHub token with repository read access
- (Optional) LM Studio for AI-powered validation
# Pull the latest image
docker pull ghcr.io/macattak/pi-scanner:latest
# Run with GitHub token
docker run --rm -e GITHUB_TOKEN=$GITHUB_TOKEN \
ghcr.io/macattak/pi-scanner:latest https://github.com/example/repo
# Run with local output directory
docker run --rm -e GITHUB_TOKEN=$GITHUB_TOKEN \
-v $(pwd)/output:/home/scanner/output \
ghcr.io/macattak/pi-scanner:latest https://github.com/example/repo

Download the latest release from the releases page.
# macOS/Linux
curl -LO https://github.com/MacAttak/pi-scanner/releases/download/v1.2.0/pi-scanner-$(uname -s | tr '[:upper:]' '[:lower:]')-$(uname -m).tar.gz
tar -xzf pi-scanner-*.tar.gz
chmod +x pi-scanner
sudo mv pi-scanner /usr/local/bin/

# Clone the repository
git clone https://github.com/MacAttak/pi-scanner.git
cd pi-scanner
# Build the binary
go build -o bin/pi-scanner ./cmd/pi-scanner
# Or use Make
make build

The scanner provides a guided experience through two phases:
- Pattern-based scanning - Fast detection using regex patterns
- AI validation (optional) - Reduce false positives using LLM
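The idea behind the pattern phase can be illustrated with a short Python sketch: a regex finds candidate values, and a checksum filters out obvious false positives. This is an illustration only, not the scanner's actual rules. The regex and the function names are hypothetical; the weights are those of the commonly documented TFN check-digit algorithm for nine-digit TFNs.

```python
import re

# Candidate pattern (assumed): nine digits, optionally grouped 3-3-3
# with a single space or hyphen between groups.
TFN_CANDIDATE = re.compile(r"\b\d{3}[ -]?\d{3}[ -]?\d{3}\b")

# Weights of the commonly documented nine-digit TFN checksum.
TFN_WEIGHTS = (1, 4, 3, 7, 5, 8, 6, 9, 10)


def tfn_checksum_ok(digits: str) -> bool:
    """Return True if the weighted digit sum is divisible by 11."""
    if len(digits) != 9 or not digits.isdigit():
        return False
    total = sum(int(d) * w for d, w in zip(digits, TFN_WEIGHTS))
    return total % 11 == 0


def find_tfn_candidates(text: str) -> list[tuple[str, bool]]:
    """Return (matched text, checksum result) for every candidate."""
    results = []
    for match in TFN_CANDIDATE.finditer(text):
        digits = re.sub(r"[ -]", "", match.group())
        results.append((match.group(), tfn_checksum_ok(digits)))
    return results
```

A nine-digit order number that fails the checksum is still reported as a candidate here but flagged as unlikely, which is the kind of finding a later validation step would be expected to discard.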
# Interactive guided scan
pi-scanner https://github.com/example/repo
# The scanner will:
# 1. Clone and scan the repository for PI patterns
# 2. Save a masked report to ./reports/
# 3. Show you a summary of findings
# 4. Ask if you want to validate findings with AI

For automation and CI/CD pipelines:
# Pattern scan only (no AI validation)
pi-scanner https://github.com/example/repo --no-input
# Automatic high-risk validation
pi-scanner https://github.com/example/repo --no-input --validate=high
# Validate all findings
pi-scanner https://github.com/example/repo --no-input --validate=all

Control how PI data appears in reports:
# Partial masking (default) - Shows partial values like 123****82
pi-scanner https://github.com/example/repo --masking=partial
# Full masking - Complete redaction
pi-scanner https://github.com/example/repo --masking=full
# No masking - Shows full values (use with caution!)
pi-scanner https://github.com/example/repo --masking=none

The scanner can use a local LLM to validate findings and reduce false positives:
- Download and install LM Studio
- Download a recommended model (e.g., qwen2.5-coder-7b-instruct)
- Start the local server (usually on port 1234)
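Once the server is running, it can also be probed independently of the scanner. The sketch below assumes LM Studio's OpenAI-compatible API is served under `http://localhost:1234/v1` and that `GET /models` returns a JSON object with a `data` list of entries carrying an `id`. The function names are hypothetical and this is not how the scanner itself performs its check.

```python
import json
import urllib.request


def models_url(endpoint: str = "http://localhost:1234/v1") -> str:
    """Build the model-listing URL for an OpenAI-compatible endpoint."""
    return endpoint.rstrip("/") + "/models"


def parse_model_ids(payload: str) -> list[str]:
    """Extract model identifiers from a /models JSON response body."""
    body = json.loads(payload)
    return [item["id"] for item in body.get("data", [])]


def list_local_models(endpoint: str = "http://localhost:1234/v1",
                      timeout: float = 3.0) -> list[str]:
    """Query the local server and return the model ids it reports."""
    with urllib.request.urlopen(models_url(endpoint), timeout=timeout) as resp:
        return parse_model_ids(resp.read().decode("utf-8"))
```

Calling `list_local_models()` raises `urllib.error.URLError` when nothing is listening on the port, which is a quick sign that the local server has not been started.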
# Test if LLM service is available
pi-scanner llm-check

During interactive scanning, you'll be presented with validation options:
Would you like to validate these findings with AI?
This can significantly reduce false positives.
1) Validate all findings (329 items) - Est. 10-15 minutes
2) Validate HIGH + MEDIUM only (28 items) - Est. 1-2 minutes
3) Validate HIGH + CRITICAL only (5 items) - Est. < 1 minute
4) Skip validation
All scan results are saved to the ./reports/ directory with the following structure:
reports/
└── 20250628_140000_owner_repo/
    ├── phase1_pattern_scan.json     # Pattern scan results
    ├── phase2_llm_validated.json    # AI validation results (if performed)
    └── summary.txt                  # Human-readable summary
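For scripts that post-process results, the most recent report directory can be located from this layout. The sketch below assumes directory names follow `<YYYYMMDD>_<HHMMSS>_<rest>`, which is inferred from the single example above; the helper names are hypothetical, and the remainder of the name is returned unsplit because owner and repository names may themselves contain underscores.

```python
from datetime import datetime
from pathlib import Path
from typing import Optional


def parse_report_dir(name: str) -> tuple[datetime, str]:
    """Split '<YYYYMMDD>_<HHMMSS>_<rest>' into a timestamp and the rest.

    Raises ValueError if the name does not match the assumed format.
    """
    date_part, time_part, slug = name.split("_", 2)
    stamp = datetime.strptime(date_part + time_part, "%Y%m%d%H%M%S")
    return stamp, slug


def latest_report_dir(root: str = "./reports") -> Optional[Path]:
    """Return the newest report directory under root, or None."""
    root_path = Path(root)
    if not root_path.is_dir():
        return None
    candidates = []
    for entry in root_path.iterdir():
        if not entry.is_dir():
            continue
        try:
            stamp, _ = parse_report_dir(entry.name)
        except ValueError:
            # Skip directories that do not follow the assumed naming.
            continue
        candidates.append((stamp, entry))
    if not candidates:
        return None
    return max(candidates, key=lambda item: item[0])[1]
```

The JSON files inside the returned directory can then be loaded with `json.load`; their schema is not described here, so the sketch stops at locating them.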
The PI Scanner is available as a Docker image from GitHub Container Registry.
# Pull specific version
docker pull ghcr.io/macattak/pi-scanner:1.2.0
# Run scan with GitHub token
docker run --rm -e GITHUB_TOKEN=$GITHUB_TOKEN \
ghcr.io/macattak/pi-scanner:latest https://github.com/example/repo
# Save reports to local directory
docker run --rm -e GITHUB_TOKEN=$GITHUB_TOKEN \
-v $(pwd)/reports:/home/scanner/output \
ghcr.io/macattak/pi-scanner:latest https://github.com/example/repo
# Run with custom config
docker run --rm -e GITHUB_TOKEN=$GITHUB_TOKEN \
-v $(pwd)/config.yaml:/etc/pi-scanner/config/config.yaml:ro \
ghcr.io/macattak/pi-scanner:latest https://github.com/example/repo

version: '3.8'
services:
  pi-scanner:
    image: ghcr.io/macattak/pi-scanner:latest
    environment:
      - GITHUB_TOKEN=${GITHUB_TOKEN}
    volumes:
      - ./reports:/home/scanner/output
      - ./config.yaml:/etc/pi-scanner/config/config.yaml:ro
    command: https://github.com/example/repo --no-input --validate=high

- name: PI Security Scan
  run: |
    pi-scanner ${{ github.event.repository.html_url }} \
      --no-input \
      --validate=high \
      --masking=full

- name: PI Security Scan (Docker)
  run: |
    docker run --rm \
      -e GITHUB_TOKEN=${{ secrets.GITHUB_TOKEN }} \
      -v ${{ github.workspace }}/reports:/home/scanner/output \
      ghcr.io/macattak/pi-scanner:latest \
      ${{ github.event.repository.html_url }} \
      --no-input --validate=high --masking=full

- GITHUB_TOKEN - Required for accessing private repositories
- NO_COLOR - Disable colored output
- CI - Automatically enables non-interactive mode
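For pipelines driven from Python rather than a workflow file, the same flags and environment variables can be assembled programmatically. This is a minimal sketch: the helper names are hypothetical, it uses only the flags shown in this README, and it assumes `pi-scanner` is on `PATH`. The meaning of the exit code is not documented here, so the sketch returns it without interpreting it.

```python
import os
import subprocess
from typing import Optional


def build_scan_command(repo_url: str,
                       validate: Optional[str] = None,
                       masking: str = "full",
                       binary: str = "pi-scanner") -> list[str]:
    """Build a non-interactive pi-scanner command line as an argv list."""
    cmd = [binary, repo_url, "--no-input", f"--masking={masking}"]
    if validate is not None:
        # The README shows "high" and "all"; other values are not assumed.
        cmd.append(f"--validate={validate}")
    return cmd


def run_scan(repo_url: str, **options) -> int:
    """Run the scan with colored output disabled and return the exit code."""
    env = dict(os.environ)
    env["NO_COLOR"] = "1"  # disable colored output in captured logs
    # GITHUB_TOKEN is passed through from the caller's environment if set.
    completed = subprocess.run(build_scan_command(repo_url, **options),
                               env=env, check=False)
    return completed.returncode
```

Passing the arguments as a list avoids shell quoting problems with repository URLs.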
# Show detailed progress and debugging information
pi-scanner https://github.com/example/repo --verbose

# Use a different LLM endpoint
pi-scanner llm-check --endpoint http://localhost:8080/v1 --model codellama-7b

See CONTRIBUTING.md for development setup and guidelines.
MIT License - see LICENSE for details.