Welcome! Elvis is designed for everyone. You don’t need to know how to code. Just follow the step-by-step guides and diagrams below to get started quickly.
Procedure RunElvis()
Begin
  Read seed URLs from srv/urls.txt
  For each URL:
    Fetch job listings
    Extract company and location using SED/AWK
  End For
  Deduplicate and validate results
  Write output to home/calllist.txt
  If --append-history is set:
    Append new companies to history
  End If
End
End Procedure
flowchart TD
A[Start] --> B[Read seed URLs]
B --> C[Fetch job listings]
C --> D[Extract company/location]
D --> E[Deduplicate & validate]
E --> F[Write calllist.txt]
F --> G{Append history?}
G -- Yes --> H[Update company_history.txt]
G -- No --> I[Done]
Procedure ValidateCallList()
Begin
  If home/calllist.txt does not exist or is empty then
    Log error and exit
  End If
  For each row in calllist.txt:
    Check format and required fields
    If invalid, log error
  End For
  If all rows valid then
    Print "Validation successful"
  Else
    Print "Validation failed"
  End If
End
End Procedure
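The validation procedure above can be sketched in POSIX shell. This is a minimal illustration under stated assumptions, not the contents of lib/validate_calllist.sh: the function name validate_calllist and the pipe-delimited row format (Company | Location) are hypothetical and chosen only for the example.

```sh
#!/bin/sh
# Sketch only. Assumes (hypothetically) one "Company | Location" row per line.

validate_calllist() {
    file=$1

    # Missing or empty file: log the error and stop early.
    if [ ! -s "$file" ]; then
        echo "ERROR: $file does not exist or is empty" >&2
        return 1
    fi

    # Check every row; awk's report of bad rows is redirected to stderr.
    if awk -F'|' '
        # Strip leading and trailing blanks from a field.
        function trim(s) { gsub(/^[ \t]+|[ \t]+$/, "", s); return s }
        NF != 2 || trim($1) == "" || trim($2) == "" {
            printf "ERROR: line %d is malformed: %s\n", NR, $0
            bad = 1
        }
        END { exit (bad ? 1 : 0) }
    ' "$file" >&2; then
        echo "Validation successful"
    else
        echo "Validation failed"
        return 1
    fi
}
```

A call such as validate_calllist home/calllist.txt returns a non-zero status on failure, so it can gate later steps in a script.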
C4Context
Person(user, "User", "Runs Elvis and reviews call lists")
System(elvis, "Elvis", "POSIX shell web scraper")
Container(bin, "bin/elvis.sh", "Shell Script", "Entrypoint orchestrator")
Container(dataInput, "lib/data_input.sh", "Shell Script", "Fetches and extracts job data")
Container(processor, "lib/processor.sh", "Shell Script", "Normalizes and deduplicates")
Container(validator, "lib/validate_calllist.sh", "Shell Script", "Validates output")
ContainerDb(output, "home/calllist.txt", "Text File", "Final call list output")
Rel(user, elvis, "Runs")
Rel(elvis, bin, "Orchestrates")
Rel(bin, dataInput, "Invokes")
Rel(dataInput, processor, "Sends extracted data")
Rel(processor, validator, "Sends processed data")
Rel(validator, output, "Writes validated call list")
Elvis is a POSIX shell-based web scraper that generates daily call lists of Australian companies from job boards (e.g., Seek). It is built for reliability, transparency, and easy customization using POSIX utilities only.
Start here! Use the flowchart below to find the best onboarding for your needs.
flowchart TD
A[Start Here] --> B{What do you want to do?}
B --> C[Just use Elvis to get call lists]
B --> D[Understand how Elvis works]
B --> E[Contribute code or docs]
C --> F[Non-Technical Onboarding]
D --> G[Technical Onboarding]
E --> H[Contributor Onboarding]
- Non-Technical Onboarding: Quick start for using Elvis.
- Technical Onboarding: Learn the architecture and internals.
- Contributor Onboarding: Start contributing code or docs.
See the Onboarding Guide for step-by-step help.
mindmap
  root((Elvis))
    Usage
      "Call List"
      "Seed URL"
      "User Agent"
    Architecture
      "POSIX Shell"
      "Modular Scripts"
      "Config in etc/elvisrc"
    Compliance
      "robots.txt"
      "Ethical scraping"
    Processing
      "Deduplication"
      "Validation"
      "Parser"
- Call List: The output file with extracted job leads.
- Seed URL: A starting web address for scraping.
- Parser: A script that extracts information from web pages.
- Deduplication: Removing duplicate entries from results.
- POSIX Shell: A standard command-line environment for Unix systems.
- User Agent: A string that identifies the tool to websites.
- robots.txt: A file that tells scrapers what’s allowed.
- Compliance: Following legal and ethical scraping rules.
See the full Glossary in the Wiki.
- Overview
- Features
- Getting Started
- Configuration
- Usage & Validation
- Project Directory Tree
- Wiki
- Additional Documentation
- Roadmap
- Contributing
- Support & Community
- License
- Acknowledgements
The Elvis Wiki is your beginner-friendly guide to using, configuring, and understanding Elvis. It is organized for non-technical users and covers:
- Tutorials: Step-by-step guides for newcomers
- How-to Guides: Practical instructions for common tasks
- Reference: Technical details, configuration, and file structure
- Explanation: Background, design, and rationale
- Project Overview, Directory Structure, Workflow, FAQ, and Glossary
Start here: Elvis Wiki Home
Tip: regenerate an up-to-date TOC with:
grep '^#' README.md | sed 's/^#* */- /'
Elvis fetches job listings from configured seed URLs, extracts company names and
locations using modular AWK/SED parsers, deduplicates results (history-aware),
validates output format, and writes a daily home/calllist.txt for sales
outreach.
- POSIX-only tooling (sh, awk, sed, grep, find, cksum) plus curl for fetching; runs on Linux, BSD, macOS (with POSIX tools), WSL2, and Cygwin.
- Config-driven (etc/elvisrc) for reproducible runs and deployment. All configuration, paths, toggles, and limits are sourced only from etc/elvisrc.
- Robust validation: Seed and UA files are checked for presence, non-emptiness, and well-formed entries before any scraping begins. Malformed or missing input is logged with actionable, standardized error messages.
- Respects robots.txt when enabled (VERIFY_ROBOTS=true).
- User-Agent rotation and UA-based retry logic for robustness.
- Backoff and retry strategies (configurable) with CAPTCHA detection.
- Pagination support and fallback parsing: Modular SED-first extraction, AWK fallback, and pattern-matching fallback maximise coverage. All extraction failures and fallbacks are logged with context.
- Case-insensitive deduplication with optional history append and audit patches in var/spool/.
- Validation and default handler: Output is validated for format, uniqueness, and cleanliness. All validation failures are logged to both stderr and the main log file.
- Comprehensive test suite: Tests cover malformed input, error/fallback paths, and all validation logic for reliability.
- Test hooks (TEST_UA_FILE, TEST_URLS_FILE, TEST_SIMULATE_403) for CI.
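The case-insensitive, history-aware deduplication listed above could look roughly like the following awk filter. It is a sketch under assumptions rather than the code in lib/processor.sh: the function name dedupe_rows, the Company | Location row format, and the one-name-per-line history format are all hypothetical.

```sh
#!/bin/sh
# Sketch only. Assumes "Company | Location" rows on stdin and a history file
# with one company name per line (both formats are assumptions).

dedupe_rows() {
    awk -F'|' -v hist="$1" '
        # Normalise a name: trim surrounding blanks, then lowercase.
        function key(s) { gsub(/^[ \t]+|[ \t]+$/, "", s); return tolower(s) }
        BEGIN {
            # Load known companies; a missing history file simply loads nothing.
            while ((getline line < hist) > 0) seen[key(line)] = 1
            close(hist)
        }
        # Print a row only the first time its company key is seen.
        !seen[key($1)]++
    '
}
```

Loading the history in BEGIN, rather than comparing NR with FNR across two input files, keeps the filter correct when the history file is empty.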
- POSIX shell and standard utilities (see PORTABILITY.md).
- curl (required for fetching web pages).
git clone https://github.com/yourusername/elvis.git
cd elvis
chmod +x bin/elvis.sh lib/*.sh
bin/elvis.sh

Run with --append-history to append newly discovered companies to
srv/company_history.txt (the default is not to append; change via
APPEND_HISTORY_DEFAULT in etc/elvisrc). When history is updated, Elvis
writes a company_history-YYYYMMDDTHHMMSS.patch to var/spool/ for auditing.
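To make the history update and audit patch concrete, here is a small sketch of how such a step might work. It is illustrative only: the function name append_history, the one-name-per-line file format, and the use of diff -u to produce the patch are assumptions, not a description of Elvis's actual implementation. It also assumes the history file already exists.

```sh
#!/bin/sh
# Sketch only. append_history NEW_FILE HISTORY_FILE SPOOL_DIR (names hypothetical).

append_history() {
    new_file=$1
    history_file=$2
    spool_dir=$3
    stamp=$(date +%Y%m%dT%H%M%S)
    merged="$history_file.tmp.$$"
    patch="$spool_dir/company_history-$stamp.patch"

    # Keep existing entries first, then add names not yet seen (case-insensitive).
    cat "$history_file" "$new_file" | awk '!seen[tolower($0)]++' > "$merged"

    if cmp -s "$history_file" "$merged"; then
        rm -f "$merged"    # nothing new: leave history untouched, write no patch
    else
        # diff exits 1 when the files differ, which is the expected case here.
        diff -u "$history_file" "$merged" > "$patch" || [ $? -eq 1 ] || return 1
        mv "$merged" "$history_file"
    fi
}
```

Writing the patch before replacing the history file means the audit trail exists even if the final move is interrupted.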
All runtime configuration is in etc/elvisrc. Notable keys (see USAGE.md):
- BACKOFF_SEQUENCE: space-separated backoff seconds (e.g., 1 2 4).
- EXTRA_403_RETRIES: extra UA-rotation retries for HTTP 403 responses.
- CAPTCHA_PATTERNS: regex to detect CAPTCHA pages (e.g., captcha|recaptcha).
- PAGE_NEXT_MARKER: marker used to find "Next" page controls.
- OUTPUT_LIMIT: optional integer to restrict the number of output rows.
- LOG_ROTATE_DAYS: days before rotating logs.
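As an illustration of how a space-separated BACKOFF_SEQUENCE might drive retries, consider the sketch below. The function name fetch_with_backoff is hypothetical, and the real retry logic in Elvis (including UA rotation and CAPTCHA detection) may differ. The sketch only shows the basic pattern of sleeping for each listed delay between failed attempts.

```sh
#!/bin/sh
# Sketch only. Falls back to "1 2 4" when BACKOFF_SEQUENCE is unset.
: "${BACKOFF_SEQUENCE:=1 2 4}"

# fetch_with_backoff CMD [ARGS...]
# Runs the command; after each failure, sleeps for the next delay and retries.
fetch_with_backoff() {
    for delay in $BACKOFF_SEQUENCE; do
        "$@" && return 0
        echo "WARN: attempt failed; retrying in ${delay}s" >&2
        sleep "$delay"
    done
    "$@"    # final attempt after the last delay
}

# Example (not executed here):
#   fetch_with_backoff curl -fsS -A "$ua" "$url"
```

With the sequence 1 2 4 the command runs at most four times: once up front and once after each delay.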
Testing / CI hooks:
- TEST_UA_FILE, TEST_URLS_FILE: override UA and seed URLs for deterministic tests.
- TEST_SIMULATE_403=true: simulate 403 responses to exercise UA-rotation logic.
- Run bin/elvis.sh to generate home/calllist.txt.
- Run bin/elvis.sh --append-history to append new companies to history.
- Validate output manually: lib/validate_calllist.sh.
- Logs: var/log/elvis.log (rotated per LOG_ROTATE_DAYS).
Generate a tree with:
find . -type d | sed 's|[^/]*/| |g'

Key folders:
- bin/: entrypoint (elvis.sh).
- lib/: modular scripts (AWK/SED and helper sh scripts).
- etc/: configuration (elvisrc).
- srv/: seeds and UA files (urls.txt, ua.txt, company_history.txt).
- var/: logs, spool files, and cached sources.
- docs/: additional documentation and demo images.
- tests/: test harness and fixtures.
- USAGE.md: detailed usage, configuration keys, and notes for testing.
- CHANGELOG.md: recent changes and documentation updates.
- PORTABILITY.md: rationale and implementation notes for POSIX portability.
- docs/man/elvis.1: comprehensive man page (see below).
You can view the manual with:
man ./docs/man/elvis.1

To install for your user:
sh scripts/build_manpage.sh install --user
man elvis

Or system-wide (may require sudo):
sh scripts/build_manpage.sh install
man elvis

To uninstall:
sh scripts/build_manpage.sh uninstall [--user]

- docs/tutorials/: step-by-step tutorials (Quick Start, Add a parser).
- docs/how-to-guides/: short actionable guides for common tasks.
- docs/reference/: configuration and internal script references.
- docs/explanation/: design rationale and conceptual documents.
- docs/documentation-guide/feature-documentation-template.md: template to document new features.
See the docs/ folder for more content and examples.
- Pseudocode (PDL): Include a short PDL pseudocode fragment after any explanatory text that defines algorithms or procedures. Follow the Cal Poly PDL Standard (https://users.csc.calpoly.edu/~jdalbey/SWE/pdl_std.html).
- Diagrams: Use PlantUML for UML-style diagrams and Mermaid for flowcharts. Embed diagrams in fenced code blocks using plantuml/mermaid.
- Tone: Keep documentation mobile-first, simple, and accessible to non-technical readers.
- Add more site-specific parsers for additional job boards.
- Improve test coverage and CI workflows (automated linting/format checks).
- Add optional packaging/release automation for pre-built artifacts.
- Collect example screenshots and usage GIFs for the docs (see docs/demo.png).
Please see CONTRIBUTING.md for guidelines on reporting
issues, proposing changes, and submitting pull requests. Also review
CODE_OF_CONDUCT.md, SECURITY.md, and SUPPORT.md for community and security
policies.
Basic expectations:
- Prefer small, well-scoped changes with tests where applicable.
- Keep changes POSIX-compatible and update documentation when behaviour changes.
- Report bugs and request features via GitHub Issues: https://github.com/yourusername/elvis/issues.
- For quick questions, open a discussion or PR and link relevant tests/fixtures.
This project is licensed under the GNU Affero General Public License v3.0.
- Unix Filesystem Layout
- Awk
- Sed
- Contributors and testers who keep portability and simplicity in focus.