A POSIX shell-based web scraper that generates daily call lists of Australian companies from job boards like Seek. Designed for reliability, transparency, and easy customization, following Unix philosophy and best practices for open source projects.

2MuchC0ff33/elvis

Elvis: Australian Sales Lead Call List Scraper

🟦 For Non-Technical Users: No coding required! 🟦

Welcome! Elvis is designed for everyone. You don’t need to know how to code. Just follow the step-by-step guides and diagrams below to get started quickly.


How Elvis Works (At a Glance)

Procedure RunElvis()
  Begin
    Read seed URLs from srv/urls.txt
    For each URL:
      Fetch job listings
      Extract company and location using SED/AWK
    End For
    Deduplicate and validate results
    Write output to home/calllist.txt
    If --append-history is set then
      Append new companies to history
    End If
  End
End Procedure

flowchart TD
  A[Start] --> B[Read seed URLs]
  B --> C[Fetch job listings]
  C --> D[Extract company/location]
  D --> E[Deduplicate & validate]
  E --> F[Write calllist.txt]
  F --> G{Append history?}
  G -- Yes --> H[Update company_history.txt]
  G -- No --> I[Done]

Pseudocode: Validating Output

Procedure ValidateCallList()
  Begin
    If home/calllist.txt does not exist or is empty then
      Log error and exit
    End If
    For each row in calllist.txt:
      Check format and required fields
      If invalid, log error
    End For
    If all rows valid then
      Print "Validation successful"
    Else
      Print "Validation failed"
    End If
  End
End Procedure
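
The validation procedure above could be sketched in POSIX shell roughly as follows. This is a minimal illustration, not the contents of lib/validate_calllist.sh: the row format ("Company | Location" with a pipe separator) and the function name validate_calllist are assumptions made for the example.

```shell
#!/bin/sh
# Sketch of the ValidateCallList procedure.
# ASSUMPTION: each row looks like "Company | Location". The real format is
# defined by lib/validate_calllist.sh, which is not shown in this README.

validate_calllist() {
    file=$1

    # A missing or empty file is a hard failure.
    if [ ! -s "$file" ]; then
        echo "ERROR: $file is missing or empty" >&2
        return 1
    fi

    # awk checks every row, reports bad ones on stderr, and exits
    # non-zero if at least one row is invalid.
    if awk -F'|' '
        function trim(s) { gsub(/^[ \t]+|[ \t]+$/, "", s); return s }
        {
            if (NF != 2 || trim($1) == "" || trim($2) == "") {
                printf "ERROR: line %d is invalid: %s\n", NR, $0 | "cat 1>&2"
                bad = 1
            }
        }
        END { exit (bad ? 1 : 0) }
    ' "$file"; then
        echo "Validation successful"
    else
        echo "Validation failed"
        return 1
    fi
}
```

Calling `validate_calllist home/calllist.txt` prints one of the two status messages and returns a matching exit status, so it can be chained with `&&` in a wrapper script.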

Mermaid: Elvis Main Pipeline

Mermaid: Elvis System Architecture (C4 Container Diagram)

C4Container
  Person(user, "User", "Runs Elvis and reviews call lists")
  System(elvis, "Elvis", "POSIX shell web scraper")
  Container(bin, "bin/elvis.sh", "Shell Script", "Entrypoint orchestrator")
  Container(dataInput, "lib/data_input.sh", "Shell Script", "Fetches and extracts job data")
  Container(processor, "lib/processor.sh", "Shell Script", "Normalizes and deduplicates")
  Container(validator, "lib/validate_calllist.sh", "Shell Script", "Validates output")
  ContainerDb(output, "home/calllist.txt", "Text File", "Final call list output")

  Rel(user, elvis, "Runs")
  Rel(elvis, bin, "Orchestrates")
  Rel(bin, dataInput, "Invokes")
  Rel(dataInput, processor, "Sends extracted data")
  Rel(processor, validator, "Sends processed data")
  Rel(validator, output, "Writes validated call list")



Elvis is a POSIX shell-based web scraper that generates daily call lists of Australian companies from job boards (e.g., Seek). It is built for reliability, transparency, and easy customization using POSIX utilities only.


Onboarding: Choose Your Path

Start here! Use the flowchart below to find the best onboarding for your needs.

flowchart TD
  A[Start Here] --> B{What do you want to do?}
  B --> C[Just use Elvis to get call lists]
  B --> D[Understand how Elvis works]
  B --> E[Contribute code or docs]
  C --> F[Non-Technical Onboarding]
  D --> G[Technical Onboarding]
  E --> H[Contributor Onboarding]

See the Onboarding Guide for step-by-step help.

Glossary (Quick Reference)

Elvis Project Concepts (Mindmap)

mindmap
  root((Elvis))
    Usage
      "Call List"
      "Seed URL"
      "User Agent"
    Architecture
      "POSIX Shell"
      "Modular Scripts"
      "Config in etc/elvisrc"
    Compliance
      "robots.txt"
      "Ethical scraping"
    Processing
      "Deduplication"
      "Validation"
      "Parser"
  • Call List: The output file with extracted job leads.
  • Seed URL: A starting web address for scraping.
  • Parser: A script that extracts information from web pages.
  • Deduplication: Removing duplicate entries from results.
  • POSIX Shell: A standard command-line environment for Unix systems.
  • User Agent: A string that identifies the tool to websites.
  • robots.txt: A file that tells scrapers what’s allowed.
  • Compliance: Following legal and ethical scraping rules.

See the full Glossary in the Wiki.


Table of Contents

Wiki

The Elvis Wiki is your beginner-friendly guide to using, configuring, and understanding Elvis. It is organized for non-technical users and covers:

  • Tutorials: Step-by-step guides for newcomers
  • How-to Guides: Practical instructions for common tasks
  • Reference: Technical details, configuration, and file structure
  • Explanation: Background, design, and rationale
  • Project Overview, Directory Structure, Workflow, FAQ, and Glossary

Start here: Elvis Wiki Home

Tip: regenerate an up-to-date TOC with:

grep '^#' README.md | sed 's/^#* */- /'

Overview

Elvis fetches job listings from configured seed URLs, extracts company names and locations using modular AWK/SED parsers, deduplicates results (history-aware), validates output format, and writes a daily home/calllist.txt for sales outreach.
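
To make "history-aware" deduplication concrete, here is a small sketch in POSIX shell and awk. It assumes one company name per line in both files. The function name dedupe_against_history is hypothetical and is not taken from lib/processor.sh.

```shell
#!/bin/sh
# Sketch of case-insensitive, history-aware deduplication.
# ASSUMPTION: one company name per line in both input files.

dedupe_against_history() {
    history_file=$1   # companies seen on earlier runs
    candidates=$2     # companies extracted on this run

    # BEGIN: load lowercased, trimmed history names into seen[].
    # Main rule: print a candidate only the first time its key appears
    # and only if the key is not already in history.
    awk -v hist="$history_file" '
        function key(s) { gsub(/^[ \t]+|[ \t]+$/, "", s); return tolower(s) }
        BEGIN { while ((getline line < hist) > 0) seen[key(line)] = 1 }
        { k = key($0) }
        k != "" && !(k in seen) { seen[k] = 1; print }
    ' "$candidates"
}
```

Reading the history file inside BEGIN, rather than using the common `NR == FNR` idiom, keeps the result correct when the history file is empty or missing.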


Features

  • POSIX utilities only (sh, awk, sed, grep, find, cksum), plus curl for fetching; runs on Linux, BSD, macOS (with POSIX tools), WSL2, and Cygwin.
  • Config-driven (etc/elvisrc) for reproducible runs and deployment. All configuration, paths, toggles, and limits are sourced only from etc/elvisrc.
  • Robust validation: Seed and UA files are checked for presence, non-emptiness, and well-formed entries before any scraping begins. Malformed or missing input is logged with actionable, standardized error messages.
  • Respects robots.txt when enabled (VERIFY_ROBOTS=true).
  • User-Agent rotation and UA-based retry logic for robustness.
  • Backoff and retry strategies (configurable) with CAPTCHA detection.
  • Pagination support and fallback parsing: Modular SED-first extraction, AWK fallback, and pattern-matching fallback maximise coverage. All extraction failures and fallbacks are logged with context.
  • Case-insensitive deduplication with optional history append and audit patches in var/spool/.
  • Validation and default handler: Output is validated for format, uniqueness, and cleanliness. All validation failures are logged to both stderr and the main log file.
  • Comprehensive test suite: Tests cover malformed input, error/fallback paths, and all validation logic for reliability.
  • Test hooks (TEST_UA_FILE, TEST_URLS_FILE, TEST_SIMULATE_403) for CI.
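
As an illustration of the pre-flight input checks listed above, the following sketch validates a seed file before any fetching starts. The rules it applies are assumptions for the example: one URL per line, blank lines and lines starting with "#" ignored, and only http:// or https:// URLs accepted. The function name validate_seed_file is hypothetical.

```shell
#!/bin/sh
# Sketch of seed-file validation (presence, non-emptiness, well-formed rows).
# ASSUMPTIONS: one URL per line; blank lines and "#" comments are skipped;
# only http:// and https:// URLs without spaces count as well-formed.

validate_seed_file() {
    seeds=$1

    if [ ! -f "$seeds" ]; then
        echo "ERROR: seed file not found: $seeds" >&2
        return 1
    fi
    if [ ! -s "$seeds" ]; then
        echo "ERROR: seed file is empty: $seeds" >&2
        return 1
    fi

    status=0
    lineno=0
    # The "|| [ -n ... ]" part also handles a last line without a newline.
    while IFS= read -r line || [ -n "$line" ]; do
        lineno=$((lineno + 1))
        case $line in
            ''|'#'*) continue ;;
            *' '*) bad=1 ;;
            http://?*|https://?*) bad=0 ;;
            *) bad=1 ;;
        esac
        if [ "$bad" -eq 1 ]; then
            echo "ERROR: $seeds:$lineno: malformed URL: $line" >&2
            status=1
        fi
    done < "$seeds"
    return "$status"
}
```

Every malformed line is reported with its file name and line number before the function returns, so a single run surfaces all problems in the file.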

Getting Started

Prerequisites

  • POSIX shell and standard utilities (see PORTABILITY.md).
  • curl (required for fetching web pages).

Install & Quick Start

git clone https://github.com/2MuchC0ff33/elvis.git
cd elvis
chmod +x bin/elvis.sh lib/*.sh
bin/elvis.sh

Run with --append-history to append newly discovered companies to srv/company_history.txt (the default is not to append; change via APPEND_HISTORY_DEFAULT in etc/elvisrc). When history is updated, Elvis writes a company_history-YYYYMMDDTHHMMSS.patch to var/spool/ for auditing.
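
The history update and audit patch described above might look something like the sketch below. It assumes the history file holds one company per line and that the patch is a unified diff of the file before and after the append. The function name append_history and its argument order are hypothetical, and the real logic in Elvis may differ.

```shell
#!/bin/sh
# Sketch of the --append-history step with an audit patch.
# ASSUMPTIONS: one company per line; the patch is `diff -u` output of the
# history file before vs. after the append.

append_history() {
    history_file=$1   # e.g. srv/company_history.txt
    new_companies=$2  # companies found in this run, one per line
    spool_dir=$3      # e.g. var/spool

    stamp=$(date +%Y%m%dT%H%M%S)
    patch_file=$spool_dir/company_history-$stamp.patch

    mkdir -p "$spool_dir" || return 1
    [ -f "$history_file" ] || : > "$history_file"

    # Keep a copy so the change can be diffed afterwards.
    before=$history_file.before.$$
    cp "$history_file" "$before" || return 1
    cat "$new_companies" >> "$history_file"

    # diff exits 0 if nothing changed, 1 if the files differ, >1 on error.
    if diff -u "$before" "$history_file" > "$patch_file"; then
        rc=0
    else
        rc=$?
    fi
    rm -f "$before"

    if [ "$rc" -eq 0 ]; then
        rm -f "$patch_file"      # nothing was appended, so no patch is kept
        return 0
    fi
    [ "$rc" -eq 1 ] || return 1
    echo "$patch_file"           # report where the audit patch was written
}
```

The patch file name follows the company_history-YYYYMMDDTHHMMSS.patch pattern mentioned above, and the function prints its path only when the history actually changed.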


Configuration

All runtime configuration is in etc/elvisrc. Notable keys (see USAGE.md):

  • BACKOFF_SEQUENCE — space-separated backoff seconds (e.g., 1 2 4).
  • EXTRA_403_RETRIES — extra UA-rotation retries for HTTP 403 responses.
  • CAPTCHA_PATTERNS — regex to detect CAPTCHA pages (e.g., captcha|recaptcha).
  • PAGE_NEXT_MARKER — marker used to find "Next" page controls.
  • OUTPUT_LIMIT — optional integer to restrict the number of output rows.
  • LOG_ROTATE_DAYS — days before rotating logs.
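
To show how two of these keys could be consumed, here is a short sketch of a retry loop driven by BACKOFF_SEQUENCE and a CAPTCHA check driven by CAPTCHA_PATTERNS. The default values mirror the examples given above. The helper names retry_with_backoff and looks_like_captcha are hypothetical and are not taken from the Elvis source.

```shell
#!/bin/sh
# Sketch: retry with backoff and CAPTCHA detection.
# ASSUMPTIONS: defaults below are the example values from the list above;
# both function names are illustrative only.

BACKOFF_SEQUENCE=${BACKOFF_SEQUENCE:-"1 2 4"}
CAPTCHA_PATTERNS=${CAPTCHA_PATTERNS:-"captcha|recaptcha"}

# Run a command once immediately, then once more after each backoff delay.
# Returns 0 on the first success, 1 if every attempt fails.
retry_with_backoff() {
    for delay in 0 $BACKOFF_SEQUENCE; do
        if [ "$delay" -gt 0 ]; then
            sleep "$delay"
        fi
        if "$@"; then
            return 0
        fi
    done
    return 1
}

# Succeeds if the saved page matches any CAPTCHA pattern (case-insensitive).
looks_like_captcha() {
    grep -E -i -q "$CAPTCHA_PATTERNS" "$1"
}
```

A call such as `retry_with_backoff curl -fsS -o page.html "$url"` followed by `looks_like_captcha page.html` would then decide whether a fetched page is usable; `$url` and `page.html` are placeholders here.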

Testing / CI hooks:

  • TEST_UA_FILE, TEST_URLS_FILE — override UA and seed URLs for deterministic tests.
  • TEST_SIMULATE_403=true — simulate 403 responses to exercise UA-rotation logic.

Usage & Validation

  • Run bin/elvis.sh to generate home/calllist.txt.
  • Run bin/elvis.sh --append-history to append new companies to history.
  • Validate output manually: lib/validate_calllist.sh.
  • Logs: var/log/elvis.log (rotated per LOG_ROTATE_DAYS).

Project Directory Tree

Generate a tree with:

find . -type d | sed 's|[^/]*/|  |g'

Key folders:

  • bin/ — entrypoint (elvis.sh).
  • lib/ — modular scripts (AWK/SED and helper sh scripts).
  • etc/ — configuration (elvisrc).
  • srv/ — seeds and UA files (urls.txt, ua.txt, company_history.txt).
  • var/ — logs, spool files, and cached sources.
  • docs/ — additional documentation and demo images.
  • tests/ — test harness and fixtures.

Additional Documentation

  • USAGE.md — detailed usage, configuration keys, and notes for testing.
  • CHANGELOG.md — recent changes and documentation updates.
  • PORTABILITY.md — rationale and implementation notes for POSIX portability.
  • docs/man/elvis.1 — comprehensive man page (see below)

Man Page

You can view the manual with:

man ./docs/man/elvis.1

To install for your user:

sh scripts/build_manpage.sh install --user
man elvis

Or system-wide (may require sudo):

sh scripts/build_manpage.sh install
man elvis

To uninstall:

sh scripts/build_manpage.sh uninstall [--user]

Diátaxis docs (organized)

  • docs/tutorials/ — step-by-step tutorials (Quick Start, Add a parser).
  • docs/how-to-guides/ — short actionable guides for common tasks.
  • docs/reference/ — configuration and internal script references.
  • docs/explanation/ — design rationale and conceptual documents.
  • docs/documentation-guide/feature-documentation-template.md — template to document new features.

See the docs/ folder for more content and examples.

Documentation Standards (short)

  • Pseudocode (PDL): Include a short PDL pseudocode fragment after any explanatory text that defines algorithms or procedures. Follow the Cal Poly PDL Standard (https://users.csc.calpoly.edu/~jdalbey/SWE/pdl_std.html).
  • Diagrams: Use PlantUML for UML-style diagrams and Mermaid for flowcharts. Embed diagrams in fenced code blocks using plantuml / mermaid.
  • Tone: Keep documentation mobile-first, simple, and accessible to non-technical readers.

Roadmap

  • Add more site-specific parsers for additional job boards.
  • Improve test coverage and CI workflows (automated linting/format checks).
  • Add optional packaging/release automation for pre-built artifacts.
  • Collect example screenshots and usage GIFs for the docs (e.g., docs/demo.png).

Contributing

Please see CONTRIBUTING.md for guidelines on reporting issues, proposing changes, and submitting pull requests. Also review CODE_OF_CONDUCT.md, SECURITY.md, and SUPPORT.md for community and security policies.

Basic expectations:

  • Prefer small, well-scoped changes with tests where applicable.
  • Keep changes POSIX-compatible and update documentation when behaviour changes.

Support & Community

  • Report bugs and request features via GitHub Issues: https://github.com/2MuchC0ff33/elvis/issues.
  • For quick questions, open a discussion or PR and link relevant tests/fixtures.

License

This project is licensed under the GNU Affero General Public License v3.0.


Acknowledgements
