obiwan04kanobi
Contributor

Description

Implements SQLite-based result caching as discussed in #2219 to improve performance and reduce rate limiting when performing multiple username lookups.

Changes

Core Caching Implementation

  • New module: sherlock_project/cache.py - SQLite cache manager with TTL support
  • Modified: sherlock_project/sherlock.py - Integrated cache into main sherlock function
  • Cache location: ~/.sherlock/cache.db (automatic directory creation)
  • Default TTL: 24 hours (86400 seconds, configurable)

CLI Arguments Added

  • --no-cache - Disable caching completely (don't read or write to cache)
  • --force-check - Ignore cached results and force fresh checks for all sites
  • --cache-duration SECONDS - Customize cache expiration time (default: 86400)
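A minimal sketch of how these three flags could map onto cache behavior. The helper names build_parser and cache_mode are hypothetical, and the assumption that --force-check still writes fresh results back to the cache is inferred from the flag descriptions, not confirmed by the PR.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser exposing only the cache flags described above."""
    parser = argparse.ArgumentParser(prog="sherlock")
    parser.add_argument("username")
    parser.add_argument("--no-cache", action="store_true",
                        help="Do not read from or write to the cache")
    parser.add_argument("--force-check", action="store_true",
                        help="Ignore cached results and run fresh checks")
    parser.add_argument("--cache-duration", type=int, default=86400,
                        metavar="SECONDS",
                        help="Cache expiration time in seconds")
    return parser


def cache_mode(args: argparse.Namespace) -> tuple:
    """Return (read_cache, write_cache) for the parsed flags."""
    if args.no_cache:
        return (False, False)  # cache is bypassed entirely
    if args.force_check:
        return (False, True)   # assumption: fresh results are still stored
    return (True, True)        # default: use the cache in both directions
```

For example, cache_mode(build_parser().parse_args(["username", "--force-check"])) yields a mode that skips reads but keeps writes.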

Cache Management Utility

  • New CLI tool: sherlock-cache with subcommands:
    • stats - Show cache statistics (total/valid/expired entries, cache path)
    • clear - Clear cache entries (all, by username, or by site)
    • cleanup - Remove only expired entries

Testing & Documentation

  • New tests: tests/test_cache.py - Comprehensive cache functionality tests
  • Updated docs: docs/README.md - Added cache usage and management section
  • All tests passing: 5/5 cache tests pass, existing tests unaffected

Usage Examples

# Default behavior (uses cache)
sherlock username

# Ignore cache and force fresh checks
sherlock username --force-check

# Disable caching completely
sherlock username --no-cache

# Custom cache duration (1 hour)
sherlock username --cache-duration 3600

# Cache management
sherlock-cache stats              # Show statistics
sherlock-cache clear              # Clear all cache
sherlock-cache clear --username user123
sherlock-cache cleanup            # Remove expired only

Performance Impact

  • First run: Same speed (queries all sites, writes to cache)
  • Subsequent runs: Near-instant for cached results (~100x faster)
  • Storage: Minimal (~1-5KB per username depending on results)
  • No breaking changes: Cache is opt-out via --no-cache

Implementation Details

Cache Strategy

  • Stores CLAIMED and AVAILABLE status results
  • Does not cache UNKNOWN, ILLEGAL, or WAF statuses
  • Automatic cleanup of expired entries on each run
  • Cache key: (username, site) tuple (prevents conflicts)
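The caching rule above can be sketched as a small predicate. The QueryStatus enum below is a stand-in written for this example; its member values are illustrative and may not match Sherlock's own enum.

```python
from enum import Enum


class QueryStatus(Enum):
    """Stand-in for the project's status enum (values are illustrative)."""
    CLAIMED = "Claimed"
    AVAILABLE = "Available"
    UNKNOWN = "Unknown"
    ILLEGAL = "Illegal"
    WAF = "WAF"


# Only definitive answers are worth remembering; transient or
# inconclusive outcomes should be re-checked on the next run.
CACHEABLE = frozenset({QueryStatus.CLAIMED, QueryStatus.AVAILABLE})


def should_cache(status: QueryStatus) -> bool:
    """Return True when a result with this status may be stored."""
    return status in CACHEABLE


def cache_key(username: str, site: str) -> tuple:
    """Build the (username, site) key so the same name on two sites never collides."""
    return (username, site)
```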

Database Schema

CREATE TABLE results (
    username TEXT NOT NULL,
    site TEXT NOT NULL,
    status TEXT NOT NULL,
    url TEXT,
    timestamp INTEGER NOT NULL,
    PRIMARY KEY (username, site)
);
CREATE INDEX idx_timestamp ON results(timestamp);
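A minimal sketch of reading and writing against this schema with a TTL check. The function names and the optional now parameter are assumptions made for the example (the parameter keeps expiry testable without sleeping), and IF NOT EXISTS is added here only so the script can be re-run safely.

```python
import sqlite3
import time

SCHEMA = """
CREATE TABLE IF NOT EXISTS results (
    username TEXT NOT NULL,
    site TEXT NOT NULL,
    status TEXT NOT NULL,
    url TEXT,
    timestamp INTEGER NOT NULL,
    PRIMARY KEY (username, site)
);
CREATE INDEX IF NOT EXISTS idx_timestamp ON results(timestamp);
"""


def open_cache(path=":memory:"):
    """Open (or create) the cache database and make sure the schema exists."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def cache_set(conn, username, site, status, url, now=None):
    """Insert or overwrite the row for (username, site)."""
    now = int(time.time()) if now is None else now
    with conn:  # one transaction per write
        conn.execute(
            "INSERT OR REPLACE INTO results "
            "(username, site, status, url, timestamp) VALUES (?, ?, ?, ?, ?)",
            (username, site, status, url, now),
        )


def cache_get(conn, username, site, ttl=86400, now=None):
    """Return the cached result, or None if it is missing or older than ttl."""
    now = int(time.time()) if now is None else now
    row = conn.execute(
        "SELECT status, url, timestamp FROM results "
        "WHERE username = ? AND site = ?",
        (username, site),
    ).fetchone()
    if row is None or now - row[2] >= ttl:
        return None
    return {"status": row[0], "url": row[1]}
```

Because (username, site) is the primary key, INSERT OR REPLACE keeps at most one row per pair, so a fresh check simply overwrites the older result.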

Testing

# Run cache tests
poetry run pytest tests/test_cache.py -v

# Run all tests
poetry run pytest -v

Test Results: ✅ All 5 cache tests passing

  • test_cache_set_and_get - Basic cache operations
  • test_cache_expiration - TTL functionality
  • test_cache_clear_all - Clear entire cache
  • test_cache_clear_username - Clear by username
  • test_cache_stats - Statistics generation

Related Issue

Closes #2219

Checklist

  • Informative PR title
  • Added SQLite caching with configurable TTL
  • Added --no-cache flag
  • Added --force-check flag
  • Added sherlock-cache management utility
  • Comprehensive test suite
  • Updated documentation
  • All tests passing
  • No breaking changes
  • Follows project conventions

Code of Conduct

  • I agree to follow this project's Code of Conduct

Implements SQLite-based result caching to improve performance and reduce rate limiting. Results are cached for 24 hours by default and stored in ~/.sherlock/cache.db.

Features:
- Automatic caching of username lookup results with configurable TTL
- --no-cache flag to disable caching completely
- --force-check flag to ignore cached results and force fresh lookups
- --cache-duration flag to customize cache expiration (default: 86400s)
- sherlock-cache CLI utility for cache management (stats, clear, cleanup)
- Comprehensive test suite for cache functionality

Technical Details:
- Cache stored in SQLite database at ~/.sherlock/cache.db
- Automatic cleanup of expired entries on each run
- Caches both CLAIMED and AVAILABLE status results
- Thread-safe database operations
- Zero dependencies (uses built-in sqlite3 module)

Resolves sherlock-project#2219
@obiwan04kanobi
Contributor Author

This PR is part of Hacktoberfest 2025 contribution efforts.

Member

@ppfeister ppfeister left a comment

Additionally, have any checks been performed for the possibility of SQLi?

This was not a complete review, as that'll take some time -- just some comments at first glance.

[PEP 8] (second bullet point, primarily, with a blank line separator between sections and two lines after the final import)

@ppfeister ppfeister self-assigned this Oct 5, 2025
@ppfeister ppfeister added the enhancement New feature or request label Oct 5, 2025
obiwan04kanobi and others added 2 commits October 6, 2025 19:11
Addresses all feedback from PR sherlock-project#2608 review by @ppfeister

Security Hardening:
- Implement SQL injection protection via parameterized queries in all
  database operations (get, set, clear, cleanup_expired, get_stats)
- Add comprehensive input validation (null bytes, control characters,
  length limits) to prevent injection attacks
- Implement path traversal protection restricting cache to ~/.sherlock
- Add URL validation (max 2048 chars, no null bytes)
- Store cache_duration per entry to prevent TTL drift across runs
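To illustrate the first two points, here is a sketch of input validation combined with a parameterized query. The 255-character limit and the function names are assumptions for this example; only the 2048-character URL limit comes from the commit message.

```python
import sqlite3

MAX_NAME_LENGTH = 255   # assumed limit, not taken from the PR
MAX_URL_LENGTH = 2048   # limit stated in the commit message


def validate_text(value: str, max_length: int = MAX_NAME_LENGTH) -> str:
    """Reject empty, oversized, or control-character-laden input."""
    if not value or len(value) > max_length:
        raise ValueError("value is empty or too long")
    # ord(ch) < 32 covers the null byte and the other ASCII control characters
    if any(ord(ch) < 32 or ord(ch) == 127 for ch in value):
        raise ValueError("value contains control characters")
    return value


def clear_username(conn: sqlite3.Connection, username: str) -> int:
    """Delete all rows for one username using a ? placeholder."""
    validate_text(username)
    with conn:
        # The driver binds the value, so quotes inside it stay plain data.
        cur = conn.execute(
            "DELETE FROM results WHERE username = ?", (username,)
        )
    return cur.rowcount
```

With the placeholder in place, a value such as x' OR '1'='1 is compared as a literal string and matches no rows, instead of rewriting the WHERE clause.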

Code Quality (PEP 8 Compliance):
- Fix import ordering: stdlib → third-party → local with blank line
  separators in cache.py, cache_cli.py, and sherlock.py
- Replace Any type hints with specific unions (str|int, QueryStatus)
- Remove shebang and __main__ block from cache_cli.py to prevent
  unsupported direct script execution

Testing Improvements:
- Replace file-based tests with unittest.mock (no disk I/O)
- Remove time.sleep() calls, use mocked timestamps instead
- Add security-specific tests (SQL injection, path traversal, null bytes)
- Verify parameterized query usage in all database operations
- Follow maintainer's testing patterns from feat/better_waf branch
- Fix unused variable linting warnings (F841)
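The switch from time.sleep() to mocked timestamps might look like the sketch below. The is_expired helper is hypothetical and exists only to give the mock something to act on; the actual tests in the PR are not shown in this conversation.

```python
import time
from unittest import mock


def is_expired(stored_at: int, ttl: int) -> bool:
    """Hypothetical helper: an entry expires once it is at least ttl seconds old."""
    return time.time() - stored_at >= ttl


def check_expiration_without_sleep():
    """Exercise both sides of the TTL boundary by patching the clock."""
    with mock.patch("time.time", return_value=1000.0):
        fresh = is_expired(stored_at=950, ttl=100)   # 50 s old
    with mock.patch("time.time", return_value=1100.0):
        stale = is_expired(stored_at=950, ttl=100)   # 150 s old
    return fresh, stale
```

Patching the clock keeps the test deterministic and fast, since no real time has to pass for an entry to age.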

Database Migration:
- Add automatic schema migration for existing cache databases
- Detect and handle old schema missing cache_duration column
- Gracefully drop and recreate incompatible cache tables

Platform Compatibility:
- Verify Windows compatibility (Path.home() behavior documented)
- Test Docker container build and execution
- Confirm cross-platform path separator handling

Test Results:
- Linting: ✓ All checks passed
- Cache tests: ✓ 14/14 passed
- Docker build: ✓ Verified with act
- Integration tests: 38/39 passed (1 flaky external site WAF)
@obiwan04kanobi
Contributor Author

Additionally, have any checks been performed for the possibility of SQLi?

This was not a complete review, as that'll take some time -- just some comments at first glance.

[PEP 8] (second bullet point, primarily, with a blank line separator between sections and two lines after the final import)

✅ Fixed - All database operations now use parameterized queries with ? placeholders. Added comprehensive input validation (null bytes, control characters, length limits). Tests verify parameterized query usage.

@ppfeister ppfeister added this to the v0.17.0 milestone Oct 7, 2025
Member

@ppfeister ppfeister left a comment

Attached a couple additional thoughts

Appreciate your patience and cooperation on this one. As this is a larger architectural change, I'd rather make sure it's right the first time than have things unexpectedly break.

@ppfeister
Member

Additional nice-to-have ----

If we could configure cache settings by environment variable, that would be extremely useful in terms of automations and containerized environments

i.e.
SHERLOCK_CACHE_DISABLE=True (setting --skip-cache or whatever we call it)
SHERLOCK_CACHE_PATH=/xyz/something.sqlite3 (accepting either a full path with db name or a directory)
SHERLOCK_CACHE_TTL=3600

(can't imagine a need for the others)

Easy enough for this PR or should we break that out into a separate RFE?

Addresses all review comments from @ppfeister on PR sherlock-project#2608

CLI Changes:
- Rename --no-cache → --skip-cache (clearer semantics)
- Rename --force-check → --ignore-cache (removes ambiguity)
- Fix argument names: args.skip_cache, args.ignore_cache

Cache Path Improvements:
- Use platformdirs for OS-specific cache locations
  - Linux: ~/.cache/sherlock/cache.sqlite3 (XDG spec)
  - macOS: ~/Library/Caches/sherlock/cache.sqlite3 (platformdirs default)
  - Windows: %LOCALAPPDATA%\sherlock\cache.sqlite3
- Change extension .db → .sqlite3
- Support SHERLOCK_CACHE_PATH environment variable

Database Migration:
- Implement PRAGMA user_version for schema versioning
- Extract migration logic to _migrate_schema() function
- Support incremental migrations from version 0 → 1
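A sketch of versioned migration with PRAGMA user_version. The function is named after the _migrate_schema() mentioned above, but its body is an assumption: the exact version-1 table layout is guessed from the earlier schema plus the cache_duration column, and the drop-and-recreate step follows the earlier commit message.

```python
import sqlite3

# Assumed version-1 layout: the original schema plus cache_duration.
CREATE_RESULTS_V1 = """
CREATE TABLE results (
    username TEXT NOT NULL,
    site TEXT NOT NULL,
    status TEXT NOT NULL,
    url TEXT,
    timestamp INTEGER NOT NULL,
    cache_duration INTEGER NOT NULL,
    PRIMARY KEY (username, site)
)
"""


def migrate_schema(conn: sqlite3.Connection) -> int:
    """Bring the database up to version 1 and return the resulting version."""
    version = conn.execute("PRAGMA user_version").fetchone()[0]
    if version < 1:
        # Version 0 -> 1: cached rows are disposable, so drop and recreate.
        conn.execute("DROP TABLE IF EXISTS results")
        conn.execute(CREATE_RESULTS_V1)
        # PRAGMA values cannot be bound with ?, so a literal is used.
        conn.execute("PRAGMA user_version = 1")
        conn.commit()
        version = 1
    # Later versions would add further "if version < N" steps here.
    return version
```

SQLite reports user_version as 0 for a new or never-versioned database, which is why version 0 serves as the starting point for both fresh installs and pre-versioning caches.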

Concurrency Fix:
- Move cache writes from per-check to post-run bulk insert
- Add set_batch() method for efficient bulk caching
- Prevents race conditions
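One way a post-run bulk insert could work is sketched below. Only the name set_batch comes from the commit message; the signature and row shape are assumptions, and the table uses the original five-column schema for brevity.

```python
import sqlite3
import time


def set_batch(conn: sqlite3.Connection, rows, now=None) -> int:
    """Write many (username, site, status, url) results in one transaction."""
    now = int(time.time()) if now is None else now
    params = [(user, site, status, url, now) for (user, site, status, url) in rows]
    with conn:  # a single commit for the whole batch
        conn.executemany(
            "INSERT OR REPLACE INTO results "
            "(username, site, status, url, timestamp) VALUES (?, ?, ?, ?, ?)",
            params,
        )
    return len(params)
```

Collecting results in memory while the checks run and writing them once afterwards means only one caller touches the database, so worker threads never contend for the SQLite write lock.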

Environment Variable Support:
- SHERLOCK_CACHE_DISABLE: Disable caching entirely
- SHERLOCK_CACHE_PATH: Custom cache location
- SHERLOCK_CACHE_TTL: Custom TTL in seconds

Dependencies:
- Add platformdirs for cross-platform cache directory detection

Tests:
- All cache tests passing (14/14)
- Update mocks to use user_cache_dir
- Fix test_stats_calculation mock values
- Remove unused pathlib.Path import

Known Issues:
- test_probes.py::AllMyLinks test is flaky (site returns WAF)
- This is an external dependency issue, not related to cache

Test Results:
- Cache tests: 14/14 passed ✓
- Integration tests: 38/39 passed (97.4%)
- Linting: passed ✓
@obiwan04kanobi
Contributor Author

Additional nice-to-have ----

If we could configure cache settings by environment variable, that would be extremely useful in terms of automations and containerized environments

i.e.
SHERLOCK_CACHE_DISABLE=True (setting --skip-cache or whatever we call it)
SHERLOCK_CACHE_PATH=/xyz/something.sqlite3 (accepting either a full path with db name or a directory)
SHERLOCK_CACHE_TTL=3600

(can't imagine a need for the others)

Easy enough for this PR or should we break that out into a separate RFE?

✅ Implemented all three environment variables!

  • SHERLOCK_CACHE_DISABLE: Disable caching (accepts: true/1/yes)
  • SHERLOCK_CACHE_PATH: Custom cache location (full path or directory)
  • SHERLOCK_CACHE_TTL: Custom TTL in seconds

All tested and working. Perfect for Docker/CI environments!

Example:

SHERLOCK_CACHE_DISABLE=true sherlock testuser
SHERLOCK_CACHE_PATH=/tmp/cache SHERLOCK_CACHE_TTL=3600 sherlock testuser
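A sketch of how the three variables could be resolved into settings. The cache_settings function, its return shape, and the rule that a path without a file extension is treated as a directory are all assumptions for illustration; the accepted truthy values and the cache.sqlite3 file name come from this thread.

```python
import os
from pathlib import Path

TRUTHY = {"true", "1", "yes"}      # values listed in the comment above
DEFAULT_FILENAME = "cache.sqlite3"


def cache_settings(env=None, default_dir=".", default_ttl=86400) -> dict:
    """Resolve cache settings from environment variables (hypothetical helper)."""
    env = os.environ if env is None else env

    disabled = env.get("SHERLOCK_CACHE_DISABLE", "").strip().lower() in TRUTHY

    raw_path = env.get("SHERLOCK_CACHE_PATH")
    if raw_path:
        path = Path(raw_path)
        # Assumption: an existing directory, or a path with no extension,
        # is a directory that should receive the default file name.
        if path.is_dir() or path.suffix == "":
            path = path / DEFAULT_FILENAME
    else:
        path = Path(default_dir) / DEFAULT_FILENAME

    try:
        ttl = int(env.get("SHERLOCK_CACHE_TTL", default_ttl))
    except ValueError:
        ttl = default_ttl  # fall back instead of crashing on bad input
    if ttl <= 0:
        ttl = default_ttl

    return {"disabled": disabled, "path": path, "ttl": ttl}
```

Passing a plain dict as env, for example cache_settings({"SHERLOCK_CACHE_TTL": "3600"}), makes the resolution easy to test without touching the real environment.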

@obiwan04kanobi
Contributor Author

Easy enough - all done!

@ppfeister
Member

ppfeister commented Oct 7, 2025

I don't have the time to validate this tonight, but the hacktoberfest-accepted label has been added so we don't have to rush testing & merging.

This is looking good so far though. Great addition to the project imo.

@obiwan04kanobi
Contributor Author

No rush, just wanted to let you know it was done.

@matheusfelipeog
Collaborator

~ Leaving this comment just to let you know that I also plan to review this. I’ll take a closer look at it soon.

@matheusfelipeog matheusfelipeog self-assigned this Oct 9, 2025
Development

Successfully merging this pull request may close these issues.

Adopt sqlite with proper caching