From 04c9dadec11c7811d2779dd4ad0055243bf8e75c Mon Sep 17 00:00:00 2001
From: 2MuchCoff33 <65397884+2MuchC0ff33@users.noreply.github.com>
Date: Mon, 5 Jan 2026 15:12:20 +0800
Subject: [PATCH 01/48] Remove obsolete test scripts to streamline the testing suite

- Deleted test_calllist_output.sh: Integration test for calllist CSV generation.
- Deleted test_config_defaults.sh: Verification of default configuration values.
- Deleted test_end_sequence_dry_run.sh: Smoke test for end_sequence.sh dry-run functionality.
- Deleted test_fetch_behaviour.sh: Tests for fetch.sh behaviour, including robots.txt and CAPTCHA detection.
- Deleted test_geography_seed_check.sh: Validation of .com.au domains in seeds.csv.
- Deleted test_load_fetch_config.sh: Verification of fetch configuration loading from INI files.
- Deleted test_log_rotate.sh: Smoke test for log_rotate.sh dry-run.
- Deleted test_on_err_writes_status.sh: Verification of error handling in scripts.
- Deleted test_prereqs.sh: Check that essential tools are available.
- Deleted test_update_readme.sh: Smoke test for the README update script.
- Deleted unit_archive_cleanup_summarise.sh: Tests for archiving and cleanup scripts.
- Deleted unit_fetch_ua_403.sh: Tests for User-Agent rotation and 403 retry behaviour.
- Deleted unit_load_config.sh: Tests for loading environment and configuration.
- Deleted unit_normalize_split_extract.sh: Tests for normalisation and extraction scripts.
- Deleted unit_paginate_sleep_marker.sh: Tests for pagination with sleep commands.
- Deleted unit_parse_html_json.sh: Tests for parsing HTML and embedded JSON.
- Deleted unit_pick_pagination_extract.sh: Tests for pagination and seed extraction.
- Deleted unit_prepare_log.sh: Tests for the log preparation script.
- Deleted unit_retry_heal.sh: Tests for retry and healing mechanisms.
- Deleted unit_validate_env.sh: Tests for environment variable validation.
--- archive/.env.example | 101 - archive/.github/copilot-instructions.md | 238 --- .../instructions/markdown.instructions.md | 87 - .../instructions/shell.instructions.md | 166 -- .../.github/prompts/boost-prompt.prompt.md | 42 - archive/.github/workflows/.gitkeep | 0 archive/CHANGELOG.md | 39 - archive/README.md | 1669 ----------------- archive/TODO.md | 348 ---- archive/audit.txt | 17 - archive/bin/elvis-run | 103 - archive/companies_history.txt | 1 - archive/configs/fetch.ini | 30 - archive/configs/seek-pagination.ini | 103 - archive/cron/elvis.cron | 0 .../data/calllists/calllist_2025-12-27.csv | 17 - archive/data/seeds/seeds.csv | 2 - archive/data/ua.txt | 142 -- archive/docs/man/elvis.1 | 222 --- archive/docs/runbook.md | 679 ------- archive/examples/sample_calllist.csv | 0 archive/failer.count | 1 - archive/project.conf | 91 - archive/results.csv | 17 - archive/scripts/archive.sh | 24 - archive/scripts/build-man.sh | 36 - archive/scripts/choose_dork.sh | 55 - archive/scripts/cleanup.sh | 9 - archive/scripts/dedupe.sh | 25 - archive/scripts/dedupe_status.sh | 39 - archive/scripts/deduper.sh | 77 - archive/scripts/end_sequence.sh | 147 -- archive/scripts/enrich.sh | 9 - archive/scripts/enrich_status.sh | 45 - archive/scripts/fetch.sh | 226 --- archive/scripts/get_transaction_data.sh | 51 - archive/scripts/init-help.sh | 28 - archive/scripts/lib/archive.sh | 96 - archive/scripts/lib/cleanup.sh | 72 - archive/scripts/lib/deduper.awk | 31 - archive/scripts/lib/error.sh | 93 - archive/scripts/lib/extract_seeds.awk | 26 - archive/scripts/lib/heal.sh | 105 -- archive/scripts/lib/http_utils.sh | 213 --- archive/scripts/lib/is_dup_company.sh | 28 - archive/scripts/lib/load_config.sh | 35 - archive/scripts/lib/load_env.sh | 28 - archive/scripts/lib/load_fetch_config.sh | 31 - archive/scripts/lib/load_seeds.sh | 24 - archive/scripts/lib/load_seek_pagination.sh | 26 - archive/scripts/lib/normalize.awk | 44 - archive/scripts/lib/paginate.sh | 162 -- archive/scripts/lib/parse_seek_json3.awk | 96 - archive/scripts/lib/parser.awk | 48 - archive/scripts/lib/pick_pagination.sh | 17 - archive/scripts/lib/pick_random.awk | 7 - archive/scripts/lib/prepare_log.sh | 15 - archive/scripts/lib/rand_fraction.awk | 3 - archive/scripts/lib/rand_int.awk | 3 - archive/scripts/lib/split_records.sh | 18 - archive/scripts/lib/summarise.sh | 64 - archive/scripts/lib/ua_utils.sh | 42 - archive/scripts/lib/validate_env.sh | 22 - archive/scripts/lib/validator.awk | 55 - archive/scripts/log_rotate.sh | 61 - archive/scripts/log_status.sh | 41 - archive/scripts/parse.sh | 61 - archive/scripts/run.sh | 25 - archive/scripts/set_status.sh | 111 -- archive/scripts/summarise.sh | 9 - archive/scripts/update_config_examples.sh | 61 - archive/scripts/update_readme.sh | 168 -- archive/scripts/validate.sh | 59 - archive/summary.txt | 74 - .../tests/integration_get_transaction_data.sh | 49 - archive/tests/integration_set_status.sh | 29 - archive/tests/run-tests.sh | 1008 ---------- archive/tests/test_archive_smoke.sh | 33 - archive/tests/test_calllist_output.sh | 43 - archive/tests/test_config_defaults.sh | 35 - archive/tests/test_end_sequence_dry_run.sh | 23 - archive/tests/test_fetch_behaviour.sh | 163 -- archive/tests/test_geography_seed_check.sh | 31 - archive/tests/test_load_fetch_config.sh | 46 - archive/tests/test_log_rotate.sh | 20 - archive/tests/test_on_err_writes_status.sh | 38 - archive/tests/test_prereqs.sh | 20 - archive/tests/test_update_readme.sh | 16 - .../tests/unit_archive_cleanup_summarise.sh | 51 - 
archive/tests/unit_fetch_ua_403.sh | 91 - archive/tests/unit_load_config.sh | 21 - archive/tests/unit_normalize_split_extract.sh | 24 - archive/tests/unit_paginate_sleep_marker.sh | 73 - archive/tests/unit_parse_html_json.sh | 38 - archive/tests/unit_pick_pagination_extract.sh | 28 - archive/tests/unit_prepare_log.sh | 15 - archive/tests/unit_retry_heal.sh | 48 - archive/tests/unit_validate_env.sh | 30 - 98 files changed, 8933 deletions(-) delete mode 100644 archive/.env.example delete mode 100644 archive/.github/copilot-instructions.md delete mode 100644 archive/.github/instructions/markdown.instructions.md delete mode 100644 archive/.github/instructions/shell.instructions.md delete mode 100644 archive/.github/prompts/boost-prompt.prompt.md delete mode 100644 archive/.github/workflows/.gitkeep delete mode 100644 archive/CHANGELOG.md delete mode 100644 archive/README.md delete mode 100644 archive/TODO.md delete mode 100644 archive/audit.txt delete mode 100644 archive/bin/elvis-run delete mode 100644 archive/companies_history.txt delete mode 100644 archive/configs/fetch.ini delete mode 100644 archive/configs/seek-pagination.ini delete mode 100644 archive/cron/elvis.cron delete mode 100644 archive/data/calllists/calllist_2025-12-27.csv delete mode 100644 archive/data/seeds/seeds.csv delete mode 100644 archive/data/ua.txt delete mode 100644 archive/docs/man/elvis.1 delete mode 100644 archive/docs/runbook.md delete mode 100644 archive/examples/sample_calllist.csv delete mode 100644 archive/failer.count delete mode 100644 archive/project.conf delete mode 100644 archive/results.csv delete mode 100644 archive/scripts/archive.sh delete mode 100644 archive/scripts/build-man.sh delete mode 100644 archive/scripts/choose_dork.sh delete mode 100644 archive/scripts/cleanup.sh delete mode 100644 archive/scripts/dedupe.sh delete mode 100644 archive/scripts/dedupe_status.sh delete mode 100644 archive/scripts/deduper.sh delete mode 100644 archive/scripts/end_sequence.sh delete mode 100644 archive/scripts/enrich.sh delete mode 100644 archive/scripts/enrich_status.sh delete mode 100644 archive/scripts/fetch.sh delete mode 100644 archive/scripts/get_transaction_data.sh delete mode 100644 archive/scripts/init-help.sh delete mode 100644 archive/scripts/lib/archive.sh delete mode 100644 archive/scripts/lib/cleanup.sh delete mode 100644 archive/scripts/lib/deduper.awk delete mode 100644 archive/scripts/lib/error.sh delete mode 100644 archive/scripts/lib/extract_seeds.awk delete mode 100644 archive/scripts/lib/heal.sh delete mode 100644 archive/scripts/lib/http_utils.sh delete mode 100644 archive/scripts/lib/is_dup_company.sh delete mode 100644 archive/scripts/lib/load_config.sh delete mode 100644 archive/scripts/lib/load_env.sh delete mode 100644 archive/scripts/lib/load_fetch_config.sh delete mode 100644 archive/scripts/lib/load_seeds.sh delete mode 100644 archive/scripts/lib/load_seek_pagination.sh delete mode 100644 archive/scripts/lib/normalize.awk delete mode 100644 archive/scripts/lib/paginate.sh delete mode 100644 archive/scripts/lib/parse_seek_json3.awk delete mode 100644 archive/scripts/lib/parser.awk delete mode 100644 archive/scripts/lib/pick_pagination.sh delete mode 100644 archive/scripts/lib/pick_random.awk delete mode 100644 archive/scripts/lib/prepare_log.sh delete mode 100644 archive/scripts/lib/rand_fraction.awk delete mode 100644 archive/scripts/lib/rand_int.awk delete mode 100644 archive/scripts/lib/split_records.sh delete mode 100644 archive/scripts/lib/summarise.sh delete mode 100644 
archive/scripts/lib/ua_utils.sh delete mode 100644 archive/scripts/lib/validate_env.sh delete mode 100644 archive/scripts/lib/validator.awk delete mode 100644 archive/scripts/log_rotate.sh delete mode 100644 archive/scripts/log_status.sh delete mode 100644 archive/scripts/parse.sh delete mode 100644 archive/scripts/run.sh delete mode 100644 archive/scripts/set_status.sh delete mode 100644 archive/scripts/summarise.sh delete mode 100644 archive/scripts/update_config_examples.sh delete mode 100644 archive/scripts/update_readme.sh delete mode 100644 archive/scripts/validate.sh delete mode 100644 archive/summary.txt delete mode 100644 archive/tests/integration_get_transaction_data.sh delete mode 100644 archive/tests/integration_set_status.sh delete mode 100644 archive/tests/run-tests.sh delete mode 100644 archive/tests/test_archive_smoke.sh delete mode 100644 archive/tests/test_calllist_output.sh delete mode 100644 archive/tests/test_config_defaults.sh delete mode 100644 archive/tests/test_end_sequence_dry_run.sh delete mode 100644 archive/tests/test_fetch_behaviour.sh delete mode 100644 archive/tests/test_geography_seed_check.sh delete mode 100644 archive/tests/test_load_fetch_config.sh delete mode 100644 archive/tests/test_log_rotate.sh delete mode 100644 archive/tests/test_on_err_writes_status.sh delete mode 100644 archive/tests/test_prereqs.sh delete mode 100644 archive/tests/test_update_readme.sh delete mode 100644 archive/tests/unit_archive_cleanup_summarise.sh delete mode 100644 archive/tests/unit_fetch_ua_403.sh delete mode 100644 archive/tests/unit_load_config.sh delete mode 100644 archive/tests/unit_normalize_split_extract.sh delete mode 100644 archive/tests/unit_paginate_sleep_marker.sh delete mode 100644 archive/tests/unit_parse_html_json.sh delete mode 100644 archive/tests/unit_pick_pagination_extract.sh delete mode 100644 archive/tests/unit_prepare_log.sh delete mode 100644 archive/tests/unit_retry_heal.sh delete mode 100644 archive/tests/unit_validate_env.sh diff --git a/archive/.env.example b/archive/.env.example deleted file mode 100644 index e3bd5c5..0000000 --- a/archive/.env.example +++ /dev/null @@ -1,101 +0,0 @@ -# .env.example — Project Elvis -# Public example of environment variables used by the project. -# Do NOT store secrets here; values should be placeholders only. 
- -# --- Paths & files --- -SEEDS_FILE=data/seeds/seeds.csv # Path to seed list (CSV) -OUTPUT_DIR=data/calllists # Directory where daily CSVs are written -HISTORY_FILE=companies_history.txt # Persistent company history file (one name per line) -LOG_FILE=logs/log.txt # Main log file for each run -CSV_PREFIX=calllist # Output CSV filename prefix (calllist_YYYY-MM-DD.csv) -CSV_DATE_FORMAT=%F # Date format used in CSV filename (strftime-compatible) - -# --- Run & behaviour --- -RUN_MODE=production # 'production' or 'dry-run' -DRY_RUN=false # if true, do not write outputs or history -MIN_LEADS=25 # Target minimum leads per run -LOG_LEVEL=info # log verbosity: debug, info, warn, error -# Network log for curl responses/retries -NETWORK_LOG=logs/network.log - -# --- Fetching & reliability --- -FETCH_TIMEOUT=15 # per-request timeout in seconds -FETCH_RETRIES=3 # number of retries per URL -BACKOFF_BASE=5 # base backoff seconds (used with multiplier) -BACKOFF_MULTIPLIER=2.0 # exponential backoff multiplier -RANDOM_DELAY_MIN=1.2 # min per-request random delay in seconds -RANDOM_DELAY_MAX=4.8 # max per-request random delay in seconds -VERIFY_ROBOTS=true # whether to check robots.txt before fetching - -# --- User-Agent / anti-bot settings --- -UA_ROTATE=true # whether to rotate User-Agent strings -USER_AGENT= # default User-Agent string (leave empty if using UA list) -UA_LIST_PATH=data/ua.txt # path to file with UA strings (one per line) - -# --- Networking / proxies --- -HTTP_PROXY= # optional HTTP proxy (leave blank to disable) -HTTPS_PROXY= # optional HTTPS proxy (leave blank to disable) -# Curl command override (useful for tests and environments without curl) -CURL_CMD=curl -# Optional test overrides used by the test suite (do not set in production) -# FETCH_SCRIPT= # path to a mock fetch script used by pagination tests -# SLEEP_CMD= # command to invoke for sleeps (defaults to system sleep) -# CAPTCHA & 403 handling -CAPTCHA_PATTERNS=captcha|recaptcha|g-recaptcha -RETRY_ON_403=true -EXTRA_403_RETRIES=2 -# HTTP header defaults -ACCEPT_HEADER=text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8 -ACCEPT_LANGUAGE=en-AU,en;q=0.9 - -# --- Notifications (optional) --- -NOTIFY_EMAIL= # email address for optional notifications -NOTIFY_API_KEY= # placeholder for notification service API key (keep secret) - -# --- Data quality & formatting --- -PHONE_NORMALISE=true # if true, normalise Australian mobile numbers (+614...) to 04... -EMAIL_REGEX="[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}" # validation pattern - -# --- Optional integrations (fill only if used) --- -# SENTRY_DSN= # example for an error tracking DSN (keep secret) -# GITHUB_TOKEN= # optional token for CI integration (keep secret) - -# ShellCheck helper (Cygwin/Windows users) -# If you use ShellCheck installed under Windows (e.g., via Scoop/Chocolatey), -# set SHELLCHECK to the POSIX path of the Windows executable so the project's -# wrapper and tests can call it correctly from Cygwin. Example: -# -# export SHELLCHECK="/cygdrive/c/Users//scoop/apps/shellcheck/0.11.0/shellcheck.exe" -# -# Alternatively, install a native ShellCheck in your environment and enable -# it with the workspace setting in VS Code: "shellcheck.extraArgs": ["-x"]. 
- -# End of file -# Added by update_config_examples.sh -SEEK_PAGINATION_CONFIG=configs/seek-pagination.ini -# Added by update_config_examples.sh -SNAPSHOT_DIR=.snapshots -# Added by update_config_examples.sh -BACKOFF_SEQUENCE=5,20,60 -# Added by update_config_examples.sh -DEFAULT_PAGINATION_MODEL=PAG_START # PAG_START or PAG_PAGE -# Added by update_config_examples.sh -PAGE_NEXT_MARKER=data-automation="page-next" -# Added by update_config_examples.sh -OFFSET_STEP=22 -# Added by update_config_examples.sh -OFFSET_PARAM=start -# Added by update_config_examples.sh -PAGE_PARAM=page -# Added by update_config_examples.sh -DELAY_MIN=1.2 -# Added by update_config_examples.sh -DELAY_MAX=4.8 -# Added by update_config_examples.sh -MAX_PAGES=200 -# Added by update_config_examples.sh -MAX_OFFSET=10000 -# Added by update_config_examples.sh -ROTATE_WEEKLY=true -# Optional focused fetch config -FETCH_CONFIG=configs/fetch.ini diff --git a/archive/.github/copilot-instructions.md b/archive/.github/copilot-instructions.md deleted file mode 100644 index d8cceeb..0000000 --- a/archive/.github/copilot-instructions.md +++ /dev/null @@ -1,238 +0,0 @@ -# Copilot / AI Agent Instructions — elvis - -These instructions help an AI coding agent be immediately productive in this -repository. Reference files: [`README.md`](../README.md) (primary -specification), [`docs/runbook.md`](../docs/runbook.md), and -[`companies_history.txt`](../companies_history.txt). - ---- - -## Quick project summary - -- Purpose: Produce a daily CSV call list of Australian companies with at least - one contact (phone or email) by scraping public job listing pages (primary - source: Seek Australia). -- Key files and outputs: - - `seeds.csv` — seed listing URLs and dork templates (see `data/seeds/`) - - `companies_history.txt` — one company name per line; used for - case-insensitive historical dedupe (see [`is_dup_company`](../README.md)) - - `calllist_YYYY-MM-DD.csv` — daily output (overwritten each run) - - `log.txt` — per-run logs (timestamp, seeds, pages, listings, - warnings/errors) - - `.snapshots/` — local snapshot and patch storage used by the mini VCS (see - README examples) - ---- - -## What to know up front (high-value conventions) - -- Company deduplication: **case-insensitive on `company_name` only**; do NOT - normalise punctuation, suffixes, or whitespace. Same name across different - locations is still a duplicate. -- Required output row fields: `company_name` (required), `prospect_name`, - `title`, `phone`, `email`, `location`. Skip any listing missing - `company_name`. -- Contact requirement: Final call list rows must have **at least one valid - contact** (phone or email) after manual enrichment. -- Phone normalisation: digits-only. Convert `+61` mobile prefixes to `0` (e.g. - `+61412...` => `0412...`). -- Follow the project's PDL and helper modules described in - [`README.md`](../README.md), such as [`fetch_with_backoff`](../README.md) and - pagination helpers (`pick_pagination`) when implementing fetchers and - paginators. - ---- - -## Updated additions (from the revised README) - -1. Mini VCS integration (POSIX utilities) - - - The project uses a lightweight, POSIX-friendly mini VCS for data artefacts - and generated outputs. - - Tools and workflows to use: - - Create snapshots: `tar -czf .snapshots/snap-.tar.gz ` and - record checksums (e.g. `sha1sum`). - - Generate patches: - `diff -uNr base/ new/ > .snapshots/patches/.patch`. - - Apply patches: `patch -p0 < .snapshots/patches/.patch`. 
- - Verify with `sha1sum -c` and `cmp` as needed. - - See the `Mini VCS Integration` and Snapshot examples in - [`README.md`](../README.md). - - When adding automation for snapshots, ensure `.snapshots/` is in - `.gitignore` and that checksums and an index are maintained. - -2. Manuals and roff typesetting - - There is now guidance to author manuals with `roff`/`man` macros and to - render with `nroff`/`groff`. - - Recommended files live under `docs/man/` (example: - [`docs/man/elvis.1`](../docs/man/elvis.1)). - - Helpful commands: - - View locally: `nroff -man docs/man/elvis.1 | less -R` - - Render UTF‑8: `groff -Tutf8 -man docs/man/elvis.1 | less -R` - - Produce PDF (if groff present): - `groff -Tpdf -man docs/man/elvis.1 > docs/man/elvis.pdf` - - When generating manpages, include standard sections (`NAME`, `SYNOPSIS`, - `DESCRIPTION`, `OPTIONS`, `EXAMPLES`) and keep them concise. - ---- - -## New or clarified workspace items to reference - -- `.snapshots/` — snapshot/patch/checksum storage (see `README.md` snapshot - examples). -- `docs/man/` — roff sources and produced manpages (see - [`docs/runbook.md`](../docs/runbook.md) and - [`docs/man/elvis.1`](../docs/man/elvis.1)). -- `project.conf` and `configs/seek-pagination.ini` — canonical configuration and - Seek-specific selectors/limits ([`project.conf`](../project.conf), - [`configs/seek-pagination.ini`](../configs/seek-pagination.ini)). -- Scripts and libs: follow conventions and helpers under `scripts/` and - `scripts/lib/` (e.g. `scripts/lib/http_utils.sh`, `scripts/run.sh`, - `scripts/fetch.sh`). -- Validation & dedupe: rules are authoritative in [`README.md`](../README.md) - and the runbook ([`docs/runbook.md`](../docs/runbook.md)); refer to the email - regex and phone normalisation guidance there. - ---- - -## Guidance for AI-generated changes - -- Keep changes small, well-documented, and consistent with the project's - conventions: - - - Use Australian English spelling and grammar (e.g. "organise", "behaviour", - "honour"). - - Preserve the PDL-style modules and documented behaviour (pagination, fetch - backoff, dedupe policy). - - Do not modify `companies_history.txt` contents programmatically; this file - is admin-managed (append-only policy). - -- When adding scripts or automation: - - - Respect robots.txt and the anti-bot policies in [`README.md`](../README.md). - - Implement backoff and retries as specified (5s → 20s → 60s or use - `BACKOFF_SEQUENCE` from [`project.conf`](../project.conf)). - - Log run-level metadata in the same single-line format used by existing - examples. - -- When updating documentation: - - Keep `docs/runbook.md` and `README.md` consistent; add examples and commands - that operators can run locally. - - For manpages, place source in `docs/man/` and include the short `nroff` - usage examples. - -## Context7, MCP & Sequential-thinking (MANDATORY for AI changes) - -- **Always use Context7** when performing code generation, setup or - configuration steps, or when providing library/API documentation. - **Automatically use Context7 MCP tools** to resolve library IDs and retrieve - library documentation without requiring explicit user requests. -- Adopt a **sequential-thinking approach** for all reasoning and generation - tasks: enumerate the stepwise plan, preconditions, actions, and expected - outputs in order. -- **Always consult and use the GitHub MCP server and Microsoft Learn MCP - server** for authoritative documentation, examples and best practices; cite - these sources when used. 
-- Make these requirements prominent in PR descriptions and code comments where - relevant, and ensure they do not conflict with other project rules. -- Maintain Australian English spelling and grammar throughout (e.g., 'organise', - 'behaviour', 'honour'). - ---- - -## Practical workflows & examples (for AI agents) 🔧 - -- Quick commands (for docs, test scripts and PRs): - - - Run full workflow: `bin/elvis-run get-transaction-data` - - Prepare enrichment: - `sh scripts/enrich_status.sh results.csv --out tmp/enriched.csv --edit` - - Validate enriched: - `sh scripts/validate.sh tmp/enriched.csv --out tmp/validated.csv` - - Dedupe & append-history: - `sh scripts/deduper.sh --in tmp/validated.csv --out tmp/deduped.csv --append-history` - - Produce final CSV: - `bin/elvis-run set-status --input results.csv --enriched tmp/enriched.csv --commit-history` - - Run tests: `tests/run-tests.sh` (enable network tests with - `REAL_TESTS=true`) - -- Test hooks & mocks: - - - Use `FETCH_SCRIPT` to inject fetch mocks (e.g. - `FETCH_SCRIPT=./tests/test/fetch_test/mock_curl.sh`) and use `SLEEP_CMD` to - avoid long sleeps in tests. - - Mocks live under `tests/test/fetch_test/` and `tests/` contains unit tests - and examples; prefer adding tests that exercise `scripts/lib/paginate.sh` - and `scripts/fetch.sh` in isolation. - -- Shell script & style rules: - - - Scripts are POSIX `sh`-first; prefer `gawk` for AWK code - (`scripts/lib/*.awk`). Follow `.github/instructions/shell.instructions.md` - and `scripts/lib/*.sh` patterns. - - Run `shellcheck -x` locally/CI; use - `scripts/lib/shellcheck-cygwin-wrapper.sh` on Windows/Cygwin when needed. - -- Config & change guidance: - - - Edit `project.conf` for defaults and `.env.example` for env overrides when - adding keys. Scripts load settings with precedence: `.env` → `project.conf` - → built-in defaults. - - Add tests for config changes (see `tests/test_load_fetch_config.sh`). - -- Data & policy rules (do not break): - - - `companies_history.txt` is append-only and _must not_ be modified - programmatically without operator consent; prefer using `--append-history` - flows. - - Honour `robots.txt`, never attempt automated CAPTCHA solving, and do not - scrape search engine results automatically. - -- Mini VCS & snapshots: - - - Snapshot data before large changes: - `ts=$(date -u +%Y%m%dT%H%M%SZ); tar -czf .snapshots/snap-$ts.tar.gz companies_history.txt data/seeds configs && sha1sum .snapshots/snap-$ts.tar.gz > .snapshots/checksums/snap-$ts.sha1` - - Keep `.snapshots/` in `.gitignore` and add a short index entry to - `.snapshots/index` for auditability. - -- Tests & PR expectations: - - Add deterministic tests using mocks; avoid long sleeps by overriding - `SLEEP_CMD`. - - Keep changes small, add tests, update `README.md`/`docs/runbook.md` and - include brief commands or examples in PR descriptions. - ---- - -## Tone & merging instructions - -- If a `.github/copilot-instructions.md` already exists, merge carefully: - preserve project-specific guidance and update validation rules or examples. -- Maintain a clear, structured, and developer-friendly tone in any additions. -- Keep entries short and actionable; include one-liners for commands and links - to relevant files. 
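
For the configuration precedence described above (`.env` → `project.conf` → built-in defaults), the following is a minimal POSIX `sh` sketch of how such a loader could work. It is illustrative only: the `load_kv` helper, the simple KEY=value parsing and the example keys are assumptions for this sketch, not the behaviour of the project's actual `scripts/lib/load_config.sh`.

```sh
#!/bin/sh
# Illustrative sketch of the documented precedence: .env overrides
# project.conf, which overrides built-in defaults. Not the project's
# real loader; load_kv and the keys below are assumptions.
set -eu

# Read KEY=VALUE pairs, skipping blank lines and comments; a key is
# only taken from this file if a higher-precedence source has not
# already set it.
load_kv() {
  file=$1
  [ -f "$file" ] || return 0
  while IFS='=' read -r key value; do
    case $key in ''|\#*) continue ;; esac
    value=${value%%#*}                                       # drop inline comment
    value=$(printf '%s' "$value" | sed 's/[[:space:]]*$//')  # trim trailing space
    eval "current=\${$key:-}"
    [ -n "$current" ] || eval "$key=\$value"
  done < "$file"
}

load_kv .env          # highest precedence
load_kv project.conf  # only fills keys .env did not set

# Built-in defaults for anything still unset
: "${FETCH_TIMEOUT:=15}"
: "${MIN_LEADS:=25}"
: "${LOG_FILE:=logs/log.txt}"

printf 'FETCH_TIMEOUT=%s MIN_LEADS=%s LOG_FILE=%s\n' \
  "$FETCH_TIMEOUT" "$MIN_LEADS" "$LOG_FILE"
```

Loading the highest-precedence file first and only filling keys that are still unset keeps the built-in defaults as a final fallback, which matches the precedence rule stated above.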
- ---- - -## Quick links (workspace references) - -- [README.md](../README.md) -- [docs/runbook.md](../docs/runbook.md) -- [configs/seek-pagination.ini](../configs/seek-pagination.ini) -- [project.conf](../project.conf) -- [.snapshots/](../.snapshots/) -- [docs/man/elvis.1](../docs/man/elvis.1) -- [companies_history.txt](../companies_history.txt) -- [scripts/run.sh](../scripts/run.sh) -- [scripts/fetch.sh](../scripts/fetch.sh) -- [scripts/lib/http_utils.sh](../scripts/lib/http_utils.sh) - ---- - -If you'd like, I can: - -- Add a short `scripts/build-man.sh` example to `scripts/` to validate/generate - manpages, or -- Draft a small `scripts/snapshot.sh` that implements the mini VCS snapshot + - checksum steps. - ---- diff --git a/archive/.github/instructions/markdown.instructions.md b/archive/.github/instructions/markdown.instructions.md deleted file mode 100644 index 51dff2e..0000000 --- a/archive/.github/instructions/markdown.instructions.md +++ /dev/null @@ -1,87 +0,0 @@ ---- -description: "Documentation and content creation standards" -applyTo: "**/*.md" ---- - -## Markdown Content Rules - -The following markdown content rules are enforced in the validators: - -1. **Headings**: Use appropriate heading levels (H2, H3, etc.) to structure your - content. Do not use an H1 heading, as this will be generated based on the - title. -2. **Lists**: Use bullet points or numbered lists for lists. Ensure proper - indentation and spacing. -3. **Code Blocks**: Use fenced code blocks for code snippets. Specify the - language for syntax highlighting. -4. **Links**: Use proper markdown syntax for links. Ensure that links are valid - and accessible. -5. **Images**: Use proper markdown syntax for images. Include alt text for - accessibility. -6. **Tables**: Use markdown tables for tabular data. Ensure proper formatting - and alignment. -7. **Line Length**: Limit line length to 400 characters for readability. -8. **Whitespace**: Use appropriate whitespace to separate sections and improve - readability. -9. **Front Matter**: Include YAML front matter at the beginning of the file with - required metadata fields. - -## Formatting and Structure - -Follow these guidelines for formatting and structuring your markdown content: - -- **Headings**: Use `##` for H2 and `###` for H3. Ensure that headings are used - in a hierarchical manner. Recommend restructuring if content includes H4, and - more strongly recommend for H5. -- **Lists**: Use `-` for bullet points and `1.` for numbered lists. Indent - nested lists with two spaces. -- **Code Blocks**: Use triple backticks to create fenced code blocks. Specify the language after the opening backticks for syntax highlighting (e.g., `csharp`). -- **Links**: Use `[link text](https://example.com)` for links. Ensure that the - link text is descriptive and the URL is valid. -- **Images**: Use `![alt text](https://example.com/image.jpg)` for images. - Include a brief description of the image in the alt text. -- **Tables**: Use `|` to create tables. Ensure that columns are properly aligned - and headers are included. -- **Line Length**: Break lines at 80 characters to improve readability. Use soft - line breaks for long paragraphs. -- **Whitespace**: Use blank lines to separate sections and improve readability. - Avoid excessive whitespace. - -## Validation Requirements - -Ensure compliance with the following validation requirements: - -- **Front Matter**: Include the following fields in the YAML front matter: - - - `post_title`: The title of the post. 
- - `author1`: The primary author of the post. - - `post_slug`: The URL slug for the post. - - `microsoft_alias`: The Microsoft alias of the author. - - `featured_image`: The URL of the featured image. - - `categories`: The categories for the post. These categories must be from the - list in /categories.txt. - - `tags`: The tags for the post. - - `ai_note`: Indicate if AI was used in the creation of the post. - - `summary`: A brief summary of the post. Recommend a summary based on the - content when possible. - - `post_date`: The publication date of the post. - -- **Content Rules**: Ensure that the content follows the markdown content rules - specified above. -- **Formatting**: Ensure that the content is properly formatted and structured - according to the guidelines. -- **Validation**: Run the validation tools to check for compliance with the - rules and guidelines. - -## Context7, MCP & Sequential-thinking (MANDATORY for documentation) - -- When creating or updating documentation, **always use Context7** for any code - snippets, setup instructions or library/API references. **Automatically use - Context7 MCP tools** to resolve library IDs and obtain authoritative - docs/examples without needing an explicit user request. -- Apply a **sequential-thinking** style in documentation: outline prerequisites, - stepwise procedures, expected outcomes and verification steps in order. -- **Consult the GitHub MCP server and Microsoft Learn MCP server** for canonical - examples and guidance and reference them when used. -- Write all documentation using Australian English spelling and grammar and - ensure there are no conflicting instructions in front matter or content. diff --git a/archive/.github/instructions/shell.instructions.md b/archive/.github/instructions/shell.instructions.md deleted file mode 100644 index 97d72d9..0000000 --- a/archive/.github/instructions/shell.instructions.md +++ /dev/null @@ -1,166 +0,0 @@ ---- -description: - "Shell scripting best practices and conventions for bash, sh, and other shells" -applyTo: "**/*.sh" ---- - -# Shell Scripting Guidelines - -Instructions for writing clean, safe, and maintainable shell scripts for bash, -sh, zsh, and other shells. 
- -## General Principles - -- Generate code that is clean, simple, and concise -- Ensure scripts are easily readable and understandable -- Add comments where helpful for understanding how the script works -- Generate concise and simple echo outputs to provide execution status -- Avoid unnecessary echo output and excessive logging -- Use shellcheck for static analysis when available -- Assume scripts are for automation and testing rather than production systems - unless specified otherwise -- Prefer safe expansions: double-quote variable references (`"$var"`), use - `${var}` for clarity, and avoid `eval` -- Use modern Bash features (`[[ ]]`, `local`, arrays) when portability - requirements allow; fall back to POSIX constructs only when needed -- Choose reliable parsers for structured data instead of ad-hoc text processing - -## Error Handling & Safety - -- Always enable `set -euo pipefail` to fail fast on errors, catch unset - variables, and surface pipeline failures -- Validate all required parameters before execution -- Provide clear error messages with context -- Use `trap` to clean up temporary resources or handle unexpected exits when the - script terminates -- Declare immutable values with `readonly` (or `declare -r`) to prevent - accidental reassignment -- Use `mktemp` to create temporary files or directories safely and ensure they - are removed in your cleanup handler - -## Script Structure - -- Start with a clear shebang: `#!/bin/bash` unless specified otherwise -- Include a header comment explaining the script's purpose -- Define default values for all variables at the top -- Use functions for reusable code blocks -- Create reusable functions instead of repeating similar blocks of code -- Keep the main execution flow clean and readable - -## Working with JSON and YAML - -- Prefer dedicated parsers (`jq` for JSON, `yq` for YAML—or `jq` on JSON - converted via `yq`) over ad-hoc text processing with `grep`, `awk`, or shell - string splitting -- When `jq`/`yq` are unavailable or not appropriate, choose the next most - reliable parser available in your environment, and be explicit about how it - should be used safely -- Validate that required fields exist and handle missing/invalid data paths - explicitly (e.g., by checking `jq` exit status or using `// empty`) -- Quote jq/yq filters to prevent shell expansion and prefer `--raw-output` when - you need plain strings -- Treat parser errors as fatal: combine with `set -euo pipefail` or test command - success before using results -- Document parser dependencies at the top of the script and fail fast with a - helpful message if `jq`/`yq` (or alternative tools) are required but not - installed - -## Context7, MCP & Sequential-thinking (MANDATORY for generated shell code) - -- When authoring or generating shell scripts, **always use Context7** for code - generation, setup or configuration tasks. **Automatically use Context7 MCP - tools** to resolve library IDs and fetch authoritative documentation and - examples without requiring explicit user requests. -- Adopt a **sequential-thinking** approach: explicitly list preconditions, - stepwise actions, expected outputs, and postconditions before producing script - code. -- **Always consult the GitHub MCP server and the Microsoft Learn MCP server** - for authoritative CLI examples, safe usage patterns, and best practices; - prefer these sources and cite them inline. 
-- Ensure these requirements are integrated with existing safety guidance (e.g., - `set -euo pipefail`, traps, `mktemp`) and be explicit about any assumptions or - prerequisites in the script header. - -```bash -#!/bin/bash - -# ============================================================================ -# Script Description Here -# ============================================================================ - -set -euo pipefail - -cleanup() { - # Remove temporary resources or perform other teardown steps as needed - if [[ -n "${TEMP_DIR:-}" && -d "$TEMP_DIR" ]]; then - rm -rf "$TEMP_DIR" - fi -} - -trap cleanup EXIT - -# Default values -RESOURCE_GROUP="" -REQUIRED_PARAM="" -OPTIONAL_PARAM="default-value" -readonly SCRIPT_NAME="$(basename "$0")" - -TEMP_DIR="" - -# Functions -usage() { - echo "Usage: $SCRIPT_NAME [OPTIONS]" - echo "Options:" - echo " -g, --resource-group Resource group (required)" - echo " -h, --help Show this help" - exit 0 -} - -validate_requirements() { - if [[ -z "$RESOURCE_GROUP" ]]; then - echo "Error: Resource group is required" - exit 1 - fi -} - -main() { - validate_requirements - - TEMP_DIR="$(mktemp -d)" - if [[ ! -d "$TEMP_DIR" ]]; then - echo "Error: failed to create temporary directory" >&2 - exit 1 - fi - - echo "============================================================================" - echo "Script Execution Started" - echo "============================================================================" - - # Main logic here - - echo "============================================================================" - echo "Script Execution Completed" - echo "============================================================================" -} - -# Parse arguments -while [[ $# -gt 0 ]]; do - case $1 in - -g|--resource-group) - RESOURCE_GROUP="$2" - shift 2 - ;; - -h|--help) - usage - ;; - *) - echo "Unknown option: $1" - exit 1 - ;; - esac -done - -# Execute main function -main "$@" - -``` diff --git a/archive/.github/prompts/boost-prompt.prompt.md b/archive/.github/prompts/boost-prompt.prompt.md deleted file mode 100644 index cb8fb3a..0000000 --- a/archive/.github/prompts/boost-prompt.prompt.md +++ /dev/null @@ -1,42 +0,0 @@ ---- -agent: agent -description: "Interactive prompt refinement workflow: interrogates scope, deliverables constraints; copies final markdown to clipboard; never writes code. Requires the Joyride extension." ---- - -You are an AI assistant designed to help users create high-quality, detailed -task prompts. DO NOT WRITE ANY CODE. - -Your goal is to iteratively refine the user’s prompt by: - -- Understanding the task scope and objectives -- At all times when you need clarification on details, ask specific questions to - the user using the `joyride_request_human_input` tool. -- Defining expected deliverables and success criteria -- Perform project explorations, using available tools, to further your - understanding of the task -- Clarifying technical and procedural requirements -- Organizing the prompt into clear sections or steps -- Ensuring the prompt is easy to understand and follow -- **Always use Context7** when performing code generation, setup or - configuration steps, or when providing library/API documentation. - **Automatically use Context7 MCP tools** to resolve library IDs and retrieve - library documentation without requiring explicit user requests. -- Use a **sequential-thinking approach** for all reasoning and generation tasks: - enumerate steps, preconditions, actions and expected outputs in order. 
-- **Always consult and use the GitHub MCP server and Microsoft Learn MCP - server** for authoritative documentation, examples, and best practices; cite - sources used. -- Use Australian English spelling and grammar for all prompt text. - -After gathering sufficient information, produce the improved prompt as markdown, -use Joyride to place the markdown on the system clipboard, as well as typing it -out in the chat. Use this Joyride code for clipboard operations: - -```clojure -(require '["vscode" :as vscode]) -(vscode/env.clipboard.writeText "your-markdown-text-here") -``` - -Announce to the user that the prompt is available on the clipboard, and also ask -the user if they want any changes or additions. Repeat the copy + chat + ask -after any revisions of the prompt. diff --git a/archive/.github/workflows/.gitkeep b/archive/.github/workflows/.gitkeep deleted file mode 100644 index e69de29..0000000 diff --git a/archive/CHANGELOG.md b/archive/CHANGELOG.md deleted file mode 100644 index 56e2f73..0000000 --- a/archive/CHANGELOG.md +++ /dev/null @@ -1,39 +0,0 @@ -# CHANGELOG - -All notable changes to this project will be documented in this file. - -## Unreleased - -- feat(config): add focused `configs/fetch.ini` and - `scripts/lib/load_fetch_config.sh` to centralise fetch, CAPTCHA and 403 - handling; scripts now load fetch config if present and will use `project.conf` - / `.env` values when available -- chore: update `.env.example`, `scripts/fetch.sh`, `scripts/lib/http_utils.sh`, - `scripts/lib/paginate.sh` and docs to reflect configuration centralisation - -## 23 December 2025 - -- docs: consolidated README into a single commit and added comprehensive project - plan (history rewritten and squashed for clarity) - -## 9 December 2025 - -- docs: Added new "Orchestration Flow" section detailing the full stepwise - scraping, validation, enrichment, and output process from seeds to CSV, based - on improved analysis of Seek.com.au behaviour. - -## 8 December 2025 - -- docs: All sections rewritten for selector stability and modern Seek.com.au - markup, plus attention to Australian spelling, idiom and norms. - -## 6 December 2025 - -- Initial commit (project scaffold) - ---- - -Notes: - -- Keep the `CHANGELOG.md` up to date with each meaningful change. Use brief, - actionable entries and standard prefixes (docs:, feat:, fix:, chore:). diff --git a/archive/README.md b/archive/README.md deleted file mode 100644 index 8d9ab44..0000000 --- a/archive/README.md +++ /dev/null @@ -1,1669 +0,0 @@ -# Comprehensive Project Plan: Australian Sales Lead Call List Scraper - -## Table of Contents - -- [Runbook](docs/runbook.md) -- [CHANGELOG](CHANGELOG.md) - -## 1. 
Project Objective - -```mermaid -flowchart TD - N[Normalise Seeds] --> S[Split into per-record files] - S --> A[Load Seeds] - A --> B[Detect Route & Pagination] - B --> C[Fetch Pages with Backoff - robots.txt, UA rotation, 403/CAPTCHA handling] - C --> D[Parse Job Listings - JSON/HTML extractors] - D --> E[Aggregate Raw Records] - E --> F[Dedupe by Company Name - checks companies_history.txt] - F --> O[Manual Enrichment - Operator] - O --> G[Validate Records - phone normalise, email regex, require contact] - G --> H[Produce Daily CSV Output - min leads check] - H --> I[Append Names to History - manual/optional] - H --> J[Archive & Snapshot - .snapshots] - J --> L[Cleanup temp & Summarise] - H --> K[Log Run Details] - style O fill:#ffe5b4,stroke:#b19400,stroke-width:2px - -``` - -Produce a daily call list of at least 25 unique Australian companies—each record -to include the prospect’s name, position, contact details (mobile and/or email), -and business location. This data is for sales lead generation and business -development. **Company names must always be unique** across days, using company -history for deduplication. - ---- - -## 2. Data Requirements - -**Required data fields:** - -- Company Name (must be unique, case-insensitive) -- Lead/Prospect Name -- Position/Title -- Location (state/region preferred) -- Mobile phone (normalised, digits only, e.g. 0412345678) -- Email (any domain) -- _Note_: Skip records if all contact details are missing. - -### Data Model & Validation (rules to guarantee consistency) - -#### Fields to extract from listing pages - -- `company_name` (string) -- `title` (string) -- `location` (string) -- `summary` (string; optional) -- `job_id` (string; internal use) - -> Note: Contact info (phone/email) is not expected on listing cards. Contacts -> are added **later** via manual enrichment from public sources. - -#### Validation rules - -```mermaid -flowchart TD - A[Record found in listing] --> B{Company Name Present?} - B -- No --> X[Skip Record] - B -- Yes --> C{Already in Today’s List?} - C -- Yes --> X - C -- No --> D{Exists in companies_history.txt?} - D -- Yes --> X - D -- No --> E[Manual Contact Enrichment] - E --> F{At least one contact present?} - F -- No --> X - F -- Yes --> G[Save to CSV, Append to History] -``` - -- **Company required:** Skip any row missing `company_name`. -- **Company dedupe:** Case-insensitive deduplication of `company_name` only (no - normalisation of whitespace/punctuation/suffixes). -- **Location does not break dedupe:** Same `company_name` with different - locations is considered a duplicate for exclusion. -- **Contact presence (final call list):** Each final CSV row must include at - least one valid contact (phone or email) after enrichment. - -### Manual enrichment (operator) - -Operators should prepare an editable enrichment file, complete contact fields, -then run validation and finalise the daily calllist. 
Example commands: - -- Prepare editable file: - `sh scripts/enrich_status.sh results.csv --out tmp/enriched.csv --edit` -- Validate edited file: - `sh scripts/validate.sh tmp/enriched.csv --out tmp/validated.csv` -- Finalise and optionally append to history: - `sh scripts/set_status.sh --input results.csv --enriched tmp/enriched.csv --commit-history` - -#### Regex validation - -- **Email:** `[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}` -- **Phone:** digits only; convert `+61` mobiles to `0`-prefixed local (e.g., - `+61412…` → `0412…`) - -#### Historical exclusion - -- Maintain `companies_history.txt` (one name per line). -- Before adding a row to today’s CSV, check case-insensitive membership against - history; if present → skip. -- On acceptance, append new company names to history (manual or scripted). -- **Note:** Appending to history is **optional** — to append automatically use: - - `sh scripts/deduper.sh --in --out --append-history` - - or - `sh scripts/set_status.sh --input results.csv --enriched tmp/enriched.csv --commit-history` - Otherwise update `companies_history.txt` manually after operator review. - -```pdl -MODULE is_dup_company(company) -- returns TRUE or FALSE -PURPOSE: - Determine whether a company name (case-insensitive) exists in history. -INPUTS: - company : string -OUTPUTS: - TRUE if company is present in history (case-insensitive) - FALSE otherwise -ASSUMPTIONS: - Historical values should not be modified; lowercasing is used only for comparison. -ALGORITHM: - 1. let lc := lowercase(company) -- for comparison only - 2. if companies_history.txt contains company (case-insensitive match) then - return TRUE - 3. else if companies_history_lower.txt contains lc (exact match) then - return TRUE - 4. else - return FALSE -NOTES: - - Keep stored history values unchanged. Use lowercased copies only for comparisons. -``` - ---- - -## 3. Data Sources - -- **Primary (automatic):** [Seek Australia](https://www.seek.com.au/) — - server-rendered job listing pages (search/listing pages only). Automatic - scraping is limited to Seek listing pages; do not fetch job detail pages or - non-listing endpoints automatically. -- **Supplementary (manual only):** - [DuckDuckGo Lite](https://lite.duckduckgo.com/lite) and - [Google](https://www.google.com/) — used for manual Google/DuckDuckGo dork - queries during operator enrichment; **do not** automatically scrape search - engine result pages. -- Only scrape public web pages; **never** scrape private profiles (LinkedIn, - Facebook etc.) or any site that disallows scraping under robots.txt or site - terms of service. -- **Fetcher & politeness:** By default the fetcher honours `robots.txt` - (`VERIFY_ROBOTS=true`). See `scripts/fetch.sh` and `scripts/lib/http_utils.sh` - for implementation and configuration (overrides via `.env` or `project.conf`). -- **Fetcher behaviour (implementation notes):** exponential backoff - (`BACKOFF_SEQUENCE`, default `5,20,60`), User-Agent rotation (`UA_ROTATE` / - `UA_LIST_PATH` or `data/ua.txt`), special-case HTTP 403 handling - (`RETRY_ON_403`, `EXTRA_403_RETRIES`), compressed transfer and browser-like - headers to reduce 403s, and CAPTCHA detection which is logged and causes the - route to be skipped. Do not attempt automated CAPTCHA solving. - ---- - -## 4. Geographic, Language & Domain Limitation - -- Australian businesses only (.com.au websites/domains) -- All content in English (preferably en_AU.UTF-8) -- Seed job searches to cover all major Australian capitals and regions (see - Appendix) - ---- - -## 5. 
Success Criteria, KPIs & Acceptance - -- **Daily target:** At least 25 unique companies (company names - case-insensitive, no repeats checked against company history) -- Each row must have at least one valid contact detail (phone or email) -- Missing/incomplete company names: skip -- No duplicate companies across different days (per historical exclusion) -- If fewer than 25 leads are found, save the CSV regardless and record a warning - in the logs -- Project “passes” if daily lists have valid contacts and no duplicate companies - from the past - ---- - -## 6. Volume, Frequency & Retention - -- Minimum 25 leads per run -- Data refreshed daily -- Each new call list overwrites the previous day’s file - (‘calllist_YYYY-MM-DD.csv’), history file is permanent - (`companies_history.txt`) - ---- - -## 7. Storage, Output Format & Encoding - -- Output: UTF-8, CSV — one line per company/lead -- Filename: `calllist_YYYY-MM-DD.csv` (overwrites daily) -- History file: `companies_history.txt` (one company per line, maintained - manually) -- Do not include source URLs, timestamps, or data lineage in the CSV -- **CSV Example:** - - ```csv - company_name,prospect_name,title,phone,email,location - XYZ Pty Ltd,John Smith,Managing Director,0412345678,email@xyz.com.au,Perth, WA - ABC Ltd,Mary Jane,Owner,0498765432,test@abc.com.au,Darwin, NT - ``` - ---- - -## 8. Tools & Tech Stack - -```mermaid -graph LR - Shell[POSIX Shell Scripts] -- controls --> CurlCoreutils["Curl + Coreutils"] - Shell -- uses --> DiffPatchTarCmpEdCp["Diff + Patch, Tar + Cmp + Ed + Cp"] - Shell -- can trigger --> Cron - Shell -- for docs/review --> Roff -``` - -### Essential - -- Bourne Shell (`sh`) for scripting -- `curl` for transferring data using URLS -- `coreutils` for command line utilities (e.g., `cp`, `mv`, `find`, `touch`, - `ln`) -- `diff`, `patch`, `tar`, `cmp`, and `ed` for manual version control -- `tar` for efficient snapshots and restores -- **`gawk` / `awk`** — used by `scripts/lib/*.awk` for parsing and extraction - (prefer `gawk` where available) -- **`sed`, `grep`** — text processing utilities used widely across the pipeline - -### Developer tooling (recommended) - -- `shellcheck` — recommended for local linting and CI (`shellcheck -x` is used - by tests when available) - -### Optional / Data files - -- `data/ua.txt` or `configs/user_agents.txt` — optional User-Agent lists used - when `UA_ROTATE=true` - -### Notes - -- Prefer `gawk` for the AWK scripts; some AWK dialect differences may affect - parsing on very old systems. For Windows, run on Cygwin/WSL or a - POSIX-compatible environment to ensure full tool compatibility. - -### Non-Essential - -- `roff` or `nroff` (UNIX docs/manpages) -- `cron` for automation and task scheduling - -**Cross-platform**: Linux, BSD, macOS, and Windows. - ---- - -## Creating Manuals with roff and nroff 📖 - -### Overview - -`roff` is the original Unix typesetting system used to write and format manual -pages. The `man` macro package (roff macros) provides a concise way to structure -sections like NAME, SYNOPSIS, DESCRIPTION, OPTIONS and EXAMPLES. Use `nroff` to -format roff sources for plain terminal viewing; use `groff` (GNU troff) when you -need richer output (UTF‑8, PostScript, PDF, HTML). - -### Basic workflow & commands - -- Create source pages under `docs/man/` (e.g., `docs/man/elvis.1`). 
-- View locally with `nroff` (terminal): - -```sh -nroff -man docs/man/elvis.1 | less -R -``` - -- View a local file using `man` (some systems support `-l` for local files): - -```sh -man -l docs/man/elvis.1 -``` - -- Render UTF‑8 output with `groff` (if installed): - -```sh -groff -Tutf8 -man docs/man/elvis.1 | less -R -``` - -- Produce a PDF with `groff` (if available): - -```sh -groff -Tpdf -man docs/man/elvis.1 > docs/man/elvis.pdf -``` - -- Install manpages system‑wide (example for `man1` section): - -```sh -mkdir -p /usr/local/share/man/man1 -cp docs/man/elvis.1 /usr/local/share/man/man1/ -compress -f /usr/local/share/man/man1/elvis.1 # or gzip elvis.1 -mandb || true # update mancache (may require root) -``` - -### Best practices - -- Keep roff sources in `docs/man/` and name files with the proper section suffix - (e.g., `.1` for user commands, `.8` for admin/system tools). -- Use standard macro sections: `.TH`, `.SH NAME`, `.SH SYNOPSIS`, - `.SH DESCRIPTION`, `.SH OPTIONS`, `.SH EXAMPLES`, `.SH FILES`, `.SH AUTHOR`, - `.SH BUGS`. -- Keep the NAME and SYNOPSIS concise and accurate — these are used by `man` and - search tools. -- Add a simple `scripts/build-man.sh` that runs `nroff`/`groff` checks and - optionally produces PDF/UTF‑8 text for review. -- When packaging or installing, place generated pages in the appropriate `manN` - directory and update the man database with `mandb` where available. - -### Minimal roff example (docs/man/elvis.1) - -```roff -.TH ELVIS 1 "2025-12-24" "elvis 0.1" "User Commands" -.SH NAME -elvis \- produce daily Australian sales lead call lists -.SH SYNOPSIS -.B elvis -\fIOPTIONS\fR -.SH DESCRIPTION -.PP -elvis fetches listings, extracts companies and writes `calllist_YYYY-MM-DD.csv`. -.SH EXAMPLES -.TP -.B elvis -r -Run the full scraping run in dry-run mode. -``` - ---- - ---- - -## 9. Scraping Method & Strategy - -- Use `grep`, `sed`, `awk`, `curl`, `tr`, `sort`, `uniq`, `date`, and `printf` -- Shell scripts to control fetch/parse/validate/deduplicate/report -- Helper binaries are allowed - -When building your scraping run, start with a diverse collection of filtered -listing URLs (see Filtered Seeds below) to cover job types, regions, work -styles, and more—with no headless browser or form simulation required. - -Parsing & extraction (implementation) - -- The pipeline is **AWK-first**: `scripts/lib/parse_seek_json3.awk` and - `scripts/lib/parser.awk` extract job fields from listing HTML and embedded - JSON. Prefer stable `data-automation` attributes (for example - `data-automation="jobTitle"`, `jobCompany`, `jobLocation`) over brittle CSS - class names when authoring selectors. -- The AWK extractor is intentionally robust and fast; the codebase includes - secondary fallbacks which are for contingency only. - -Pagination & routing - -- The project implements **route-aware pagination** (see - `scripts/lib/pick_pagination.sh` and `scripts/lib/paginate.sh`). Use - `PAG_START` (offset) or `PAG_PAGE` (page-number) models as detected by - `pick_pagination.sh`, and stop iterating when the `PAGE_NEXT_MARKER` is - absent. Override `PAGE_NEXT_MARKER` at runtime or in the Seek INI if the - site's pagination markup changes. 
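
As a rough illustration of that loop, the sketch below walks a seed URL with either model and stops when the marker disappears. It is not the project's `scripts/lib/paginate.sh`: the fetcher interface (a command named by `FETCH_SCRIPT` that prints the page HTML to stdout) and the assumption that the seed URL already carries a query string are simplifications for this example.

```sh
#!/bin/sh
# Illustrative route-aware pagination loop. Assumptions (not the real
# paginate.sh interface): FETCH_SCRIPT prints page HTML to stdout, and
# the seed URL already has a query string so "&param=value" can be appended.
set -eu

[ $# -ge 1 ] || { echo "usage: $0 SEED_URL [PAG_START|PAG_PAGE]" >&2; exit 1; }

SEED_URL=$1
MODEL=${2:-PAG_START}                    # PAG_START (offset) or PAG_PAGE (page number)
FETCH=${FETCH_SCRIPT:?set FETCH_SCRIPT to a command that prints page HTML}
default_marker='data-automation="page-next"'
MARKER=${PAGE_NEXT_MARKER:-$default_marker}
OFFSET_STEP=${OFFSET_STEP:-22}
MAX_PAGES=${MAX_PAGES:-200}

page=1
offset=0
while [ "$page" -le "$MAX_PAGES" ]; do
  if [ "$MODEL" = "PAG_PAGE" ]; then
    url="$SEED_URL&${PAGE_PARAM:-page}=$page"
  else
    url="$SEED_URL&${OFFSET_PARAM:-start}=$offset"
  fi

  html=$("$FETCH" "$url") || { echo "WARN: fetch failed for $url" >&2; break; }
  printf '%s\n' "$html"                  # hand the page on to the parsing stage

  # Stop when the next-page marker is absent from the fetched HTML.
  printf '%s' "$html" | grep -qF -- "$MARKER" || break

  page=$((page + 1))
  offset=$((offset + OFFSET_STEP))
done
```

In the real pipeline the model would come from `pick_pagination.sh` and each fetched page would typically be passed to the AWK extractors rather than printed.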
- -Fetcher behaviour & politeness (implementation) - -- The fetcher honours `robots.txt` by default (`VERIFY_ROBOTS=true`), supports - UA rotation (`UA_ROTATE` + `data/ua.txt` / `configs/user_agents.txt`), detects - CAPTCHAs and skips the route, and implements special-case HTTP 403 handling - (`RETRY_ON_403` + `EXTRA_403_RETRIES`) with an exponential backoff - (`BACKOFF_SEQUENCE`). See `scripts/fetch.sh` and `scripts/lib/http_utils.sh` - for tuning parameters and behaviour. - -Testing & debugging hooks - -- Use `FETCH_SCRIPT` (env) to provide a mock fetcher and override - `PAGE_NEXT_MARKER` to exercise pagination offline; tests in `tests/` contain - examples and mocks for `fetch.sh` and `paginate.sh`. - -- **Google-dorking (manual):** CLI scripts generate Google or DuckDuckGo - queries, which are opened in lynx), never automatically scraped - - Limit domains to .com.au - - Use flexible dorks (e.g. name/company/job/location/contact) for best results - - Example dork: `"Jane Smith" "email" OR "phone" OR "mobile" site:.com.au` -- Appendix includes dork and seed templates - ---- - -## 10. Data Validation, Deduplication & Cleaning - -- Company name deduplication: case-insensitive matching only (no normalisation) -- Company + different location = considered duplicate for exclusion -- Do not normalise suffixes/whitespace/punctuation -- Skip rows missing company name -- Require at least one valid contact (phone or email) -- Email validation: `[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}` -- Phone validation: digits only, convert +61 to 0-prefix - -Validation enforcement (how) - -- Validation is performed by `scripts/validate.sh` which runs - `scripts/lib/validator.awk`. It: - - Requires header columns: - `company_name,prospect_name,title,phone,email,location`. - - Requires `company_name` and at least one contact (phone or email). - - Normalises phone numbers (convert `+61` → `0`, remove non-digits). - - Validates emails against `[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}`. - - Invalid rows are skipped and printed to stderr in the form: - `INVALID `. - - Valid rows are written to the validated output (for example - `tmp/validated.csv` when run via `set_status.sh`). - -Deduplication & history (how) - -- Deduplication is done by `scripts/deduper.sh` / `scripts/lib/deduper.awk`. It: - - Performs case-insensitive dedupe on `company_name` only (location does NOT - break dedupe). - - Compares against `companies_history.txt` (a lowercased copy is used for - comparisons). - - Has an `--append-history` option to add newly accepted companies to history - (used by `set_status.sh` via `--commit-history`). - - Use `scripts/lib/is_dup_company.sh "Company Name"` for single-name checks. - -Operator workflow pointers - -- Example commands: - - Validate: `sh scripts/validate.sh tmp/enriched.csv --out tmp/validated.csv` - - Dedup: - `sh scripts/deduper.sh --in tmp/validated.csv --out tmp/deduped.csv --append-history` -- `set_status.sh` orchestrates these steps and writes the final file to - `data/calllists/calllist_YYYY-MM-DD.csv`; it will still write the CSV even if - fewer than `MIN_LEADS` and will log a "low leads" warning. - -Testing note - -- Unit tests and examples exist under `tests/` (mock files demonstrate - validation/dedupe/append behaviours). - ---- - -## 11. Pacing, Anti-Bot & Reliability Policy - -To minimise disruptions and respect rate-limit expectations: - -- **Randomised delays:** Sleep a random amount between requests (e.g., 1.2–4.8 - seconds) to avoid a machine-like cadence. 
-- **Exponential backoff & retries:** - - Up to 3 retries per URL - - Backoff schedule: 5s → 20s → 60s - - Stop after the 3rd failure; log the error and move on. - - Special-case HTTP 403: by default `RETRY_ON_403=true` and the fetcher will - add `EXTRA_403_RETRIES` (default `2`), rotate User-Agent, and retry with - backoff. The fetcher also sends browser-like headers (`Accept`, - `Accept-Language`, `Referer`) and enables compressed transfer to reduce the - chance of 403 responses. Set `RETRY_ON_403=false` to disable this behaviour. -- **User-Agent rotation:** Cycle a vetted pool of UA strings; avoid suspicious - or outdated UAs. By default the project will use `data/ua.txt` (if present) as - the UA list; set `UA_LIST_PATH` to override. Lines in the UA list are cleaned - (surrounding quotes removed, whitespace trimmed). Use `ALLOW_BOTS=true` to - allow known crawler UAs (not recommended). -- Do not use proxies or offshore scraping APIs -- **CAPTCHA detection:** If CAPTCHA text or known markers appear, log the event, - skip this route, and **do not** attempt automated solving. -- **Timeouts:** Set connection and read timeouts (e.g., 10–15 seconds) to avoid - hanging. -- **Respect robots.txt and ToS:** Only operate on listing pages and public - endpoints suitable for automated access. - -**Implementation notes & config** - -The fetcher uses the following env/config variables you can tune: -`VERIFY_ROBOTS`, `BACKOFF_SEQUENCE`, `FETCH_TIMEOUT`, `RETRY_ON_403`, -`EXTRA_403_RETRIES`, `UA_ROTATE`, `UA_LIST_PATH`, `PAGE_NEXT_MARKER`, -`DELAY_MIN`, `DELAY_MAX`, `MAX_PAGES`, `MAX_OFFSET`, `SLEEP_CMD`, and -`NETWORK_LOG`. When `VERIFY_ROBOTS=true` the fetcher checks `/robots.txt` -with a simple Disallow-prefix test; if blocked the fetch exits with code 2 and -logs `ERROR: blocked by robots.txt`. CAPTCHA/recaptcha markers (e.g., -`captcha|recaptcha|g-recaptcha`) are detected, logged as `WARN` and treated as a -fetch failure (route skipped). HTTP `403` responses trigger UA rotation and -`EXTRA_403_RETRIES` when `RETRY_ON_403=true` and are logged to `NETWORK_LOG` for -analysis. - -**Test hooks & debugging** - -- Use `FETCH_SCRIPT` to provide a mock fetcher (useful for pagination/fetch - tests) and override `SLEEP_CMD` to avoid long sleeps in tests. -- `NETWORK_LOG` entries are tab-delimited records: - `TIMESTAMP\tURL\tATTEMPT\tHTTP_CODE\tBYTES` (see `logs/network.log`); example: - `2025-12-09T09:31:07Z\thttps://example/jobs\t1\t403\t12345`. -- Optional: consider adding a configurable `CAPTCHA_PATTERNS` variable for - fine-tuning detection (future enhancement). - -> **Outcome:** A conservative, respectful scraper that avoids throttling and -> reduces maintenance due to anti-bot defences. - -**Shell backoff snippet (example):** - -```pdl -MODULE fetch_with_backoff(url) -- returns html_text or FAILURE -PURPOSE: - Try to fetch the given URL up to three times, using exponential backoff on failure. -INPUTS: - url : string -OUTPUTS: - html_text on success - FAILURE after three unsuccessful attempts -ALGORITHM: - 1. For attempt from 1 to 3 do - a. Try to fetch url with a 15 second timeout - b. If fetch succeeds then - return retrieved html_text - c. Else - If attempt == 1 then wait 5 seconds - If attempt == 2 then wait 20 seconds - If attempt == 3 then wait 60 seconds - 2. If all attempts fail then return FAILURE -NOTES: - - Use timeouts and record/log failed attempts for audit. -``` - ---- - -## 12. 
Error Handling, Logging & Monitoring - -- Script logs all runs to `log.txt` - - Include: timestamp, queried URLs, search terms - - Number of unique records found - - Errors/warnings (CAPTCHA, timeout etc.) - - Warn if fallback (textual) “Next” detection was triggered or if duplicate - pages were detected during pagination. - - Add record-level debugging if ‘verbose’ enabled - - Retain/rotate logs weekly (policy TBC) -- No external monitoring or alerting required - -### Logging & Change Resilience - -```mermaid -flowchart TD - A[Run Starts] --> B[Write Log: Start Details] - B --> C[Log Seed Processing] - C --> D[Log Valid/Skipped Records] - D --> E{Weekly Rotation?} - E -- Yes --> F[Rotate Logs] - E -- No --> G[Continue Logging] - F --> G -``` - -Record enough context to investigate issues and site changes: - -### Per run - -- Timestamp (start/end) -- Seed URL (and derived pagination scheme) -- Total pages fetched for the seed -- Total listings parsed for the seed - -- Number of valid output rows emitted -- Warnings and errors (timeouts, retries, fallback “Next” detection) - -**Network & Failure Artefacts** - -- **NETWORK_LOG** (default: `logs/network.log`) records fetch attempts as - tab-delimited rows: `TIMESTAMP\tURL\tATTEMPT\tHTTP_CODE\tBYTES`. Example: - `2025-12-09T09:31:07Z\thttps://example/jobs\t1\t403\t12345`. -- Special entries for quick triage: - - `403-retry` — when HTTP 403 triggers additional retries (useful to track UA - rotation effects). - - `ROBOTSBLOCK` — recorded when `robots.txt` disallows the route; includes the - first matching Disallow rule for auditability. -- Failure marker and preserved artifacts: - - `tmp/last_failed.status` is written when the `on_err` handler runs; use this - as a first check for recent failures. - - `.snapshots/failed/` contains preserved artifacts when auto-heal/preserve is - used (see `scripts/lib/heal.sh`). - -**Troubleshooting a failed fetch** - -- Inspect `logs/network.log` for `403` or `ROBOTSBLOCK` entries and check the - bytes/status recorded. -- Check `logs/log.txt` for `WARN`/`ERROR` lines and `tmp/last_failed.status` for - a failure marker. -- To reproduce safely and quickly, use the test fetch stub: - `sh tests/test_fetch_behaviour.sh` or set `FETCH_SCRIPT` to a mock and - override `SLEEP_CMD` to avoid long sleeps. -- Use `LOG_LEVEL=DEBUG` for verbose logs and try rotating UA (`UA_LIST_PATH`) or - tuning `BACKOFF_SEQUENCE`/`EXTRA_403_RETRIES`. - -#### Weekly rotation - -- Rotate logs weekly (policy TBD). -- Keep a summary index mapping date → seed → (pages, listings, status). - -#### Change detection - -- If automation attributes change or “Next” detection falls back to text: - - Emit a `WARN` entry including the exact snippet around pagination. - - Tag the seed with `ATTR_CHANGE=true` so audits can find it later. - -> **Goal:** Fast root‑cause analysis when Seek adjusts markup or pagination -> behavior. - -**Log line example:** - -```log -2025-12-09T09:31:07Z seed=/jobs?keywords=admin&where=Perth%2C+WA model=offset pages=6 listings=132 ok=true warn=fallback_next=false errors=0 -``` - ---- - -## 13. 
Security, Privacy & Compliance - -```mermaid -mindmap - root((Risk & Compliance)) - Rate Limiting - Respect Delay - No proxies - Backoff on Error - Privacy - Only public info - Honour removal requests - Robots.txt - Only allowed routes - Never profiles/details - CAPTCHA - Log, skip, never bypass - Audit - Structured logs - Weekly rotation -``` - -- Only collect public information — no restricted/private data -- Do not scrape any site or page excluded by robots.txt or ToS -- Strictly observe Australian privacy law/ethical norms -- Admin can manually remove any person/company details from history if requested - -### Compliance & Ethics - -- **Robots.txt & ToS:** Always review site policies. Operate only on listing - pages and public endpoints intended for automated access. -- **CAPTCHA & anti-bot:** If encountered, log and skip; do not bypass. -- **Privacy:** Collect only public information. Respect removal requests for - persons or companies in history or outputs. -- **Minimal footprint:** Avoid concurrent flood; prefer serialised or lightly - parallelised requests with conservative pacing. -- **Auditability:** Keep logs structured and retained for accountability. - -### Implementation notes (fetcher & audit) 🔧 - -- **Robots checks & audit (implementation note)** - - - The fetcher runs a conservative `robots.txt` check when - `VERIFY_ROBOTS=true`. - - If the route is disallowed the fetcher exits with status code `2` and writes - a `ROBOTSBLOCK` entry to `NETWORK_LOG` in the form: - `TIMESTAMP\tURL\tATTEMPT\tROBOTSBLOCK\t`. - - Operators: inspect `logs/network.log` for `ROBOTSBLOCK`, review the site's - `robots.txt`, and **do not** set `VERIFY_ROBOTS=false` without - legal/operator approval. - -- **CAPTCHA detection** 🛑 - - - `CAPTCHA_PATTERNS` controls detection (default: - `captcha|recaptcha|g-recaptcha`). On detection the fetcher logs a - `WARN: CAPTCHA or human check detected` message and treats the page as a - failure (no automated solving). - - The fetcher now writes a `CAPTCHA` diagnostic entry to `NETWORK_LOG` to aid - auditing (recommended enhancement implemented). - -- **403 handling & UA rotation** 🔁 - - - On HTTP `403` and when `RETRY_ON_403=true`, the fetcher adds - `EXTRA_403_RETRIES`, rotates `User-Agent` (from `UA_LIST_PATH`) and logs a - `403-retry` line in `NETWORK_LOG`. Tune `RETRY_ON_403` and - `EXTRA_403_RETRIES` in `project.conf` as needed. - -- **Network log format** 📑 - - - `NETWORK_LOG` (default: `logs/network.log`) is tab-delimited: - `TIMESTAMP\tURL\tATTEMPT\tHTTP_CODE\tBYTES`. - - Special values for the HTTP_CODE field include `403-retry`, `ROBOTSBLOCK`, - and `CAPTCHA` (implemented). - -- **Operator checklist** ✅ - - If robots or CAPTCHA events occur frequently: 1) inspect `logs/network.log` - and `logs/log.txt` (grep for `WARN`, `ERROR`); 2) check UA list and site - rules; 3) pause route and escalate to legal/ops when necessary. - ---- - -## 14. Retention & Admin Control - -- Daily call list is always overwritten -- Company history file (`companies_history.txt`) always retained and added via - admin/manual only - -- **Snapshot & verification** 🔐 - - - Create a snapshot before making administrative changes (for example, before - appending to `companies_history.txt`): - `ts=$(date -u +%Y%m%dT%H%M%SZ); tar -czf .snapshots/snap-$ts.tar.gz companies_history.txt data/seeds configs && sha1sum .snapshots/snap-$ts.tar.gz > .snapshots/checksums/snap-$ts.sha1` - - Verify a snapshot: `sha1sum -c .snapshots/checksums/snap-.sha1` (exit - code 0 = OK). 
- -- **History append policy** 📝 - - - `companies_history.txt` is administrative and append‑only by policy. Prefer - manual review and snapshot before appending; to append via tools use: - `sh scripts/deduper.sh --in tmp/validated.csv --out tmp/deduped.csv --append-history` - or run: - `sh scripts/set_status.sh --input results.csv --enriched tmp/enriched.csv --commit-history` - -- **Preserved failure artefacts** 🧭 - - - On failure `on_err` writes `tmp/last_failed.status` and `heal.sh` preserves - debugging tarballs under `.snapshots/failed/`. Inspect the latest tarball - for preserved logs and status files. - -- **Log rotation & retention** 🔁 - - - Use `scripts/log_rotate.sh --dry-run` to preview, or schedule weekly with - `--keep-weeks N`. Keep checksums and the `.snapshots/index` for - auditability. - -- **Risk / policy note** ⚠️ - - Keep `.snapshots/` in `.gitignore`, and never automate appending to - `companies_history.txt` without review — always snapshot and review diffs. - -### Mini VCS Integration 🔧 - -To keep a simple, auditable history of important project files (for example -`companies_history.txt`, `data/seeds/`, and configuration files) we use a -lightweight, POSIX-friendly "mini VCS" based on standard utilities already -available in POSIX environments. - -**Goals:** keep snapshots, generate small patches, verify integrity, and make -restores straightforward without requiring a full Git install. - -What it uses: - -- Snapshot archives: `tar` (+ `gzip` / `xz` if available) -- Diffs and patches: `diff -u` and `patch -p0` -- File comparison: `cmp`, `md5sum`/`sha1sum` -- Small edits & scripted automation: `ed`, `sed`, `awk` (when needed) -- Filesystem utilities: `cp`, `mv`, `find`, `touch`, `ln`, `mkdir` - -The `.snapshots/` directory - -- Location: `.snapshots/` (at project root) — included in `.gitignore` if you - use Git for code but want lightweight, local snapshots kept separately. -- Contents: - - `snap-YYYY-MM-DDTHHMMSS.tar.gz` — full snapshots of selected paths - - `patches/` — `snapname.patch` (unified diffs generated between snapshots) - - `checksums/` — `snap-YYYY-MM-DDTHHMMSS.sha1` for quick integrity checks - - `index` — a simple text index mapping snapshot names to descriptions - -Basic workflow (conceptual): - -1. Create a snapshot: `tar -czf .snapshots/snap-<ts>.tar.gz ` and - write a checksum. -2. When changes are made, create a patch: - `diff -u old/ new/ > .snapshots/patches/<name>.patch`. -3. Apply a patch: `patch -p0 < .snapshots/patches/<name>.patch` to a - working copy. -4. Restore from snapshot: - `tar -xzf .snapshots/snap-<ts>.tar.gz -C <target>`. 
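-
-The conceptual workflow can be wrapped in one small helper. The sketch below is
-illustrative only (the function name `make_snapshot` and the index line format
-are assumptions, not part of the project's scripts); it complements the
-practical commands further down by also updating `.snapshots/index` and the
-`latest` symlink:
-
-```sh
-# make_snapshot [path ...]: create a snapshot plus checksum, index entry and
-# "latest" symlink. Illustrative helper; adjust paths and format as needed.
-make_snapshot() {
-  # Default to the admin-managed paths when no arguments are given.
-  [ "$#" -gt 0 ] || set -- companies_history.txt data/seeds configs
-  ts=$(date -u +%Y%m%dT%H%M%SZ)
-  snap="snap-$ts.tar.gz"
-  mkdir -p .snapshots/checksums .snapshots/patches
-  # 1. Archive the selected paths.
-  tar -czf ".snapshots/$snap" "$@"
-  # 2. Record a checksum for later `sha1sum -c` verification.
-  sha1sum ".snapshots/$snap" > ".snapshots/checksums/snap-$ts.sha1"
-  # 3. Append a line to the plain-text index: name | timestamp | description.
-  printf '%s | %s | %s\n' "$snap" "$ts" "manual snapshot" >> .snapshots/index
-  # 4. Point the "latest" symlink at the newest snapshot.
-  ln -sf "$snap" .snapshots/latest
-  printf '%s\n' ".snapshots/$snap"
-}
-
-# Example: snapshot the history file and seeds before an admin change.
-make_snapshot companies_history.txt data/seeds
-```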
- -Mermaid diagram — Mini VCS workflow - -```mermaid -flowchart LR - A[Create Snapshot\n(.snapshots/snap-.tar.gz)] --> B[Store checksum\n(.snapshots/checksums/*.sha1)] - B --> C[Detect Changes\n(compare with previous snapshot)] - C --> D[Generate Patch\n(.snapshots/patches/.patch)] - D --> E[Apply Patch\n(patch -p0 < patchfile)] - A --> F[Restore Snapshot\n(tar -xzf .snapshots/snap-<ts>.tar.gz -C target)] - E --> G[Record in index/log] -``` - -Practical commands & examples - -- Create snapshot (full): - -```sh -# create snapshot of important paths -ts=$(date -u +%Y%m%dT%H%M%SZ) -tar -czf .snapshots/snap-$ts.tar.gz companies_history.txt data/seeds configs && sha1sum .snapshots/snap-$ts.tar.gz > .snapshots/checksums/snap-$ts.sha1 -``` - -- Generate a patch between two extracted snapshots (or working tree): - -```sh -diff -uNr old/ new/ > .snapshots/patches/changes-$ts.patch -``` - -- Apply a patch to a working copy: - -```sh -patch -p0 < .snapshots/patches/changes-$ts.patch -``` - -- Verify snapshot integrity: - -```sh -sha1sum -c .snapshots/checksums/snap-$ts.sha1 -``` - -Additional helper utilities (recommended): - -- `find` — select paths to snapshot by pattern -- `xargs` — batch operations -- `gzip`/`xz` — compress snapshots -- `md5sum`/`sha1sum` — checksums -- `ln` — maintain latest snapshot symlink: `.snapshots/latest` → `snap-...` - -Polyglot pseudocode (POSIX-friendly & portable) - -```pdl -MODULE create_snapshot(paths[], description) -PURPOSE: - Create a timestamped tarball snapshot of 'paths' and record a checksum and index entry. -INPUTS: - paths[] : array of file/directory paths - description : short text -OUTPUTS: - snapshot_name : string (e.g., snap-YYYYMMDDTHHMMSS.tar.gz) -ALGORITHM: - 1. ts := utc_timestamp() - 2. snapshot_name := 'snap-' + ts + '.tar.gz' - 3. tar -czf .snapshots/ + snapshot_name paths[] - 4. checksum := sha1sum .snapshots/ + snapshot_name - 5. write checksum to .snapshots/checksums/snap- + ts + '.sha1' - 6. append "snapshot_name | ts | description" to .snapshots/index - 7. create or update symlink .snapshots/latest → snapshot_name - 8. return snapshot_name - -MODULE generate_patch(base_dir, new_dir, patch_name) -PURPOSE: - Produce a unified diff between two trees and store it in .snapshots/patches. -INPUTS: - base_dir : directory for base - new_dir : directory for new - patch_name : output patch filename -OUTPUTS: - path to generated patch -ALGORITHM: - 1. diff -uNr base_dir new_dir > .snapshots/patches/ + patch_name - 2. return .snapshots/patches/ + patch_name - -MODULE apply_patch(patch_file, target_dir) -PURPOSE: - Apply a stored patch to a working copy -INPUTS: - patch_file : path to patch - target_dir : directory to apply patch in -ALGORITHM: - 1. cd target_dir - 2. patch -p0 < patch_file - 3. verify with 'git status' or 'cmp' / 'sha1sum' as suitable - -MODULE restore_snapshot(snapshot_name, target_dir) -PURPOSE: - Restore a named snapshot into target_dir -ALGORITHM: - 1. tar -xzf .snapshots/ + snapshot_name -C target_dir - 2. verify checksum with sha1sum -c .snapshots/checksums/snap-.sha1 -``` - -Notes & policy - -- This mini VCS is **not** a replacement for a distributed VCS like Git for - source code, but it is a practical, auditable tool to track snapshots and - patches for generated data (call lists, seeds, and history files) in - environments where installing Git may be impractical. -- Keep `.snapshots/` in `.gitignore` if you use Git for source code to avoid - storing large archives in the repository. 
-- Use checksums and an index file for basic auditability. - ---- - -## 15. Scheduling & Automation - -- Scraper script is triggered manually for now -- Cron scheduling (Unix/BSD/macOS/Windows) after MVP is accepted - ---- - -## 16. Project Acceptance Criteria - -- At least 25 unique companies per CSV file per day (case-insensitive, not in - history) -- Each row contains at least one valid contact (phone/email) -- No duplicates across daily runs -- Less than 25 allowed as partial, write a warning to logs -- Output format, scripts, logs match this project scope and description - ---- - -## 17. MVP / First Steps - -- Write initial Shell scripts and helpers -- Create `data/seeds/seeds.csv` (Seek listing URLs + dork templates). Add a - `seed_id` column to enable per-seed overrides in - `configs/seek-pagination.ini`. -- Create and manage `companies_history.txt` (admin initiates) -- Document everything, structure logs for future audit - -## Project Structure - - - -A generated project scaffold (updated by `scripts/update_readme.sh`) — do not -edit manually. - -```mermaid -flowchart TB - %% Top-level project layout (folders & key files) - subgraph ROOT["."] - direction TB - editorconfig[".editorconfig"] - gitattributes[".gitattributes"] - gitignore[".gitignore"] - envfile[".env"] - configs_root["project.conf (primary) / seek-pagination.ini"] - license["LICENSE"] - readme["README.md"] - seeds["seeds.csv"] - history["companies_history.txt"] - - subgraph BIN["bin/"] - bin_run["elvis-run"] - end - - subgraph SCRIPTS["scripts/"] - run_sh["run.sh"] - fetch_sh["fetch.sh"] - parse_sh["parse.sh"] - dedupe_sh["dedupe.sh"] - validate_sh["validate.sh"] - enrich_sh["enrich.sh"] - subgraph LIB["scripts/lib/"] - http_utils["http_utils.sh"] - end - end - - subgraph CONFIGS["configs/"] - seek_ini["seek-pagination.ini"] - end - - subgraph DOCS["docs/"] - runbook["runbook.md"] - subgraph MAN["docs/man/"] - manpage["elvis.1"] - end - end - - subgraph DATA["data/"] - calllists["calllists/"] - seeds_data["seeds/"] - end - - logs["logs/"] - tmp["tmp/"] - examples["examples/"] - github[".github/"] - cron["cron/"] - tests["tests/"] - end -``` - -```text - -. 
-├── audit.txt -├── bin -│ ├── elvis-run -├── CHANGELOG.md -├── companies_history.txt -├── configs -│ ├── seek-pagination.ini -│ ├── user_agents.txt -├── cron -│ ├── elvis.cron -├── data -│ ├── calllists -│ ├── seeds -│ ├── ua.txt -├── docs -│ ├── man -│ ├── runbook.md -├── examples -│ ├── sample_calllist.csv -│ ├── sample_seeds.csv -├── failer.count -├── LICENSE -├── logs -│ ├── log.txt -│ ├── network.log -├── project.conf -├── README.md -├── results.csv -├── scripts -│ ├── archive.sh -│ ├── choose_dork.sh -│ ├── cleanup.sh -│ ├── dedupe.sh -│ ├── dedupe_status.sh -│ ├── deduper.sh -├── summary.txt -├── tests -│ ├── run-tests.sh -│ ├── test_update_readme.sh -├── tmp -│ ├── cleanup.status -├── TODO.md - -``` - - - -### Commands - -- `bin/elvis-run` — master orchestrator (see `bin/elvis-run help`) -- `scripts/archive.sh` — scripts/archive.sh -- `scripts/choose_dork.sh` — scripts/choose_dork.sh -- `scripts/cleanup.sh` — scripts/cleanup.sh -- `scripts/dedupe.sh` — scripts/dedupe.sh -- `scripts/dedupe_status.sh` — scripts/dedupe_status.sh -- `scripts/deduper.sh` — scripts/deduper.sh -- `scripts/end_sequence.sh` — scripts/end_sequence.sh -- `scripts/enrich.sh` — scripts/enrich.sh -- `scripts/enrich_status.sh` — scripts/enrich_status.sh -- `scripts/fetch.sh` — scripts/fetch.sh -- `scripts/get_transaction_data.sh` — scripts/get_transaction_data.sh -- `scripts/init-help.sh` — scripts/init-help.sh -- `scripts/log_status.sh` — scripts/log_status.sh -- `scripts/parse.sh` — scripts/parse.sh -- `scripts/run.sh` — scripts/run.sh -- `scripts/set_status.sh` — scripts/set_status.sh -- `scripts/summarise.sh` — scripts/summarise.sh -- `scripts/update_config_examples.sh` — scripts/update_config_examples.sh -- `scripts/update_readme.sh` — scripts/update_readme.sh -- `scripts/validate.sh` — scripts/validate.sh - -## Configuration and Precedence - -- **Canonical config file:** `project.conf` (key=value) — used for _non-secret_ - operational defaults. -- **Secrets & runtime overrides:** environment variables / `.env` (highest - precedence). -- **Site-specific behaviour:** `configs/seek-pagination.ini` — pagination model, - selectors, and per-seed overrides. -- **Seed manifest:** `data/seeds/seeds.csv` with header - `seed_id,location,base_url`. Use the `seed_id` to reference per-seed overrides - in `seek-pagination.ini`. - -Precedence rule (applies to scripts): - -1. Environment variables (`.env` / runtime) — highest priority -2. `project.conf` — operator/deployment defaults -3. Built-in script defaults — fallback - -Notes: - -- Prefer `project.conf` for operational tuning (timeouts, retries, limits). Keep - secrets in `.env` or a secret manager. -- `config.ini` is deprecated in favour of `project.conf`; old content is - preserved in `config.ini` for reference. -- Scripts should log the source (env/project.conf/default) for each key used to - aid auditing. - -Example (per-seed override): - -- In `data/seeds/seeds.csv`: `seed_id=seek_fifo_perth` -- In `configs/seek-pagination.ini` `[overrides]` add: - -```ini - # seek_fifo_perth - - # model = PAG_PAGE - - # page_param = page - -``` - -This design keeps site logic and selectors separated (`seek-pagination.ini`), -while operational defaults are easy for operators to manage (`project.conf`). - -> Notes: -> -> - Keep secrets out of Git (`.env` should be listed in `.gitignore`). -> - Use `scripts/lib/*.sh` for shared utilities; keep scripts small and -> testable. -> - Place generated outputs under `data/` or `data/calllists/` and add ignore -> patterns. 
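-
-The precedence rule above can be expressed as one small lookup helper. This is
-a minimal sketch only (the function name `cfg_get` and the exact log format are
-illustrative, not the project's actual implementation); it also logs the source
-of each key, as recommended above:
-
-```sh
-# cfg_get KEY DEFAULT: resolve a key using env > project.conf > built-in default,
-# and log which source supplied the value.
-cfg_get() {
-  key=$1
-  default=$2
-  # 1. Environment / .env (highest priority).
-  env_val=$(printenv "$key" || true)
-  if [ -n "$env_val" ]; then
-    printf 'CONFIG %s=%s (source=env)\n' "$key" "$env_val" >&2
-    printf '%s\n' "$env_val"
-    return 0
-  fi
-  # 2. project.conf (key=value lines); the last definition wins.
-  conf_val=$(sed -n "s/^${key}=//p" project.conf 2>/dev/null | tail -n 1)
-  if [ -n "$conf_val" ]; then
-    printf 'CONFIG %s=%s (source=project.conf)\n' "$key" "$conf_val" >&2
-    printf '%s\n' "$conf_val"
-    return 0
-  fi
-  # 3. Built-in script default.
-  printf 'CONFIG %s=%s (source=default)\n' "$key" "$default" >&2
-  printf '%s\n' "$default"
-}
-
-# Example: timeout from env, else project.conf, else a built-in 15 seconds.
-FETCH_TIMEOUT=$(cfg_get FETCH_TIMEOUT 15)
-```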
- -## Orchestration Flow (from Seeds to Final CSV) - -```mermaid -sequenceDiagram - participant Script - participant Operator as Manual Operator - Operator->>Script: Initiate Run - Script->>Script: Load seeds.csv - Script->>Script: For each seed, detect pagination model - Script->>Script: Fetch & parse each page/listing - Script->>Script: Aggregate and dedupe by company - Script->>Script: Validate rows (company, contact) - Script->>Operator: Await manual enrichment (add contact info) - Operator->>Script: Add contacts, approve rows - Script->>Script: Append company to history - Script->>Script: Emit calllist_YYYY-MM-DD.csv - Script->>Script: Log summary, rotate logs -``` - -1. **Load seeds:** Read `seeds.csv` (one URL per line). -2. **Route detection:** For each seed, pick pagination model (`start` vs - `page`). -3. **Paginate:** - - Fetch each page with backoff/timeouts. - - Parse listings using stable selectors. - - Stop when "Next" is absent (primary) or text fallback says so. -4. **Aggregate:** Append parsed rows to an in-memory or temporary store. -5. **Validate & dedupe:** - - Drop rows missing `company_name`. - - Case-insensitive dedupe `company_name` against today’s set and - `companies_history.txt`. -6. **Enrich contacts (manual):** - - Add `phone` and/or `email` from public sources. - - Validate with regex; skip if both missing. -7. **Emit CSV:** - - `calllist_YYYY-MM-DD.csv` (UTF-8). - - Overwrite daily; keep the history file permanent. -8. **Log & rotate:** - - Write run summaries; note any fallback detection. - - Rotate logs weekly (policy TBD). - ---- - -## Seek.com.au — Route-aware pagination (concise) - -Overview - -- Seek uses two distinct pagination models depending on the URL route. Detect - the model for each seed URL and apply the corresponding pagination logic. -- Always stop when the page’s “Next” control disappears from the returned HTML; - never assume a fixed page count. - -### Pagination models - -```mermaid -flowchart TD - A[Seed URL] --> B{Does URL match /jobs? or /jobs&?} - B -- Yes --> C["PAG_START (offset)"] - B -- No --> D{Does URL contain -jobs/in-?} - D -- Yes --> E["PAG_PAGE (page number)"] - D -- No --> C -``` - -### Model A — Generic search (URLs containing `/jobs?` or `/jobs&`) - -- Mechanism: `start=OFFSET` query parameter, OFFSET increases by 22: - - Page 1 → `start=0` - - Page 2 → `start=22` - - Page k → `start=22*(k-1)` -- Stop condition: the Next control (e.g., - `Next`) is absent from the returned - HTML. -- Rationale: server-side offset pagination for generic searches. - -### Model B — Category / region routes (paths containing `-jobs/in-`) - -- Mechanism: `?page=N` (1-based). Page 1 usually has no `?page` parameter: - - Page 1 → (no `?page`) - - Page 2 → `?page=2` - - Page k → `?page=k` -- Stop condition: the Next link is absent from the pagination component. -- Rationale: page-numbered UX and bookmarkable segments. - -### Minimal Route Detector (PDL-style) - -Use this compact, centralised module to determine the appropriate pagination -model for each Seek listing seed URL. - -```pdl -MODULE pick_pagination(url) -- returns 'PAG_START' or 'PAG_PAGE' -PURPOSE: - Choose which pagination model to use for a seed URL. -INPUTS: - url : string -OUTPUTS: - 'PAG_START' for offset-based pagination - 'PAG_PAGE' for page-number pagination -ALGORITHM: - 1. If url is empty then return 'PAG_START' -- conservative default - 2. Else if url contains '/jobs?' or '/jobs&' then return 'PAG_START' - 3. Else if url contains '-jobs/in-' then return 'PAG_PAGE' - 4. 
Else return 'PAG_START' -NOTES: - - Keep logic simple and conservative to avoid misrouting. -``` - -#### Usage pattern - -1. Derive the starting URL from your seed -2. Call pick_pagination "$url" to decide whether to loop start or page -3. Use HTML "Next" checks to stop (e.g., grep for data-automation="page-next") - -##### Combined pagination flow (PDL-style) - -- Fetch pages and stop when the pagination control is absent. -- Parsing is delegated to a separate `parse_listings` module. - -```pdl -MODULE run_pagination(initial_url) -PURPOSE: - Detect pagination model and iterate pages, parsing listings until "Next" disappears. -ALGORITHM: - 1. model := pick_pagination(initial_url) - 2. IF model == 'PAG_START' THEN - offset := 0 - LOOP - url := initial_url + "&start=" + offset - html := fetch_with_backoff(url) - IF fetch failed then stop loop and log error - parse_listings(html) -- separate module handles extraction - IF page_has_next(html) is FALSE then stop loop - offset := offset + 22 - wait a short, randomised delay - END LOOP - ELSE -- model == 'PAG_PAGE' - page := 1 - base := initial_url - LOOP - url := base if page == 1 otherwise base + "?page=" + page - html := fetch_with_backoff(url) - IF fetch failed then stop loop and log error - parse_listings(html) - IF page_has_next(html) is FALSE then stop loop - page := page + 1 - wait a short, randomised delay - END LOOP - END IF -NOTES: - - Keep parsing and pagination detection separate for clarity and testability. - - Respect timeouts and backoff on failures. -``` - -### Route-aware Examples (End-to-end crawl flow) - -#### Generic search (`/jobs`) — offset loop (PDL) - -```pdl -MODULE paginate_offset(base_url) -PURPOSE: - Iterate search results using an offset parameter until there is no "Next" control. -ALGORITHM: - 1. offset := 0 - 2. LOOP - url := base_url + "&start=" + offset - html := fetch_with_backoff(url) - IF fetch failed then - log error and STOP - END IF - parse_listings(html) - IF no listings found then - log warning and STOP - END IF - IF page_has_next(html) is FALSE then - log info and STOP - END IF - offset := offset + 22 - wait a short randomised delay - END LOOP -``` - -#### Category/region (`/-jobs/in-`) — page loop (PDL) - -```pdl -MODULE paginate_page_number(base_url) -PURPOSE: - Iterate search results using page numbers (1-based) until there is no "Next" control. -ALGORITHM: - 1. page := 1 - 2. LOOP - IF page == 1 THEN - url := base_url - ELSE - url := base_url + "?page=" + page - END IF - html := fetch_with_backoff(url) - IF fetch failed then - log error and STOP - END IF - parse_listings(html) - IF no listings found then - log warning and STOP - END IF - IF page_has_next(html) is FALSE then - log info and STOP - END IF - page := page + 1 - wait a short randomised delay - END LOOP -``` - -```mermaid -flowchart TD - A[Attempt Fetch] --> B{Success?} - B -- Yes --> C[Continue] - B -- No --> D[Retry: Sleep 5s] - D --> E{Attempt 2 Success?} - E -- Yes --> C - E -- No --> F[Retry: Sleep 20s] - F --> G{Attempt 3 Success?} - G -- Yes --> C - G -- No --> H[Log Error, Skip] -``` - -## Notes & best practices - -- Detect the model per seed URL — misdetection can skip pages or cause infinite - loops. -- Use the presence/absence of the “Next” control in the returned HTML as the - authoritative stop condition. -- Prefer stable selectors and automation attributes when parsing listing content - (`
article` roots, `data-automation` attributes, `data-*` ids, and anchor
-  text). Avoid brittle CSS class names.
-- Throttle requests and randomise small sleeps to reduce load and avoid
-  triggering rate limits.
-
-- **Job listing/card structure:** see the selector notes and field list below.
-
-### Selector Discipline (stable attributes vs brittle CSS)
-
-Seek’s listing markup provides automation-friendly signals. Prefer these over
-CSS class names:
-
-- **Job card root**: the `article` element marked
-  `data-automation="normalJob"`, representing a “normal” job result.
-- **Job title**: the anchor text for the title.
-- **Company name**: the anchor text for employer.
-- **Location**: the anchor text for location.
-- **Short description**: the inline summary text.
-- **Job identifier**: a `data-*` attribute unique to the listing.
-
-#### Why avoid CSS class names?
-
-Class names on modern sites change frequently in A/B tests and refactors.
-Automation-oriented attributes and structural tags are more stable and
-intentionally readable by scripts.
-
-#### Parsing guidelines
-
-- Anchor your extraction to automation markers first; if absent, fall back to
-  surrounding semantic tags and textual anchors.
-- Never rely on inner CSS names like `.style__Card__1a2b` (those are brittle).
-- Handle minor whitespace/HTML entity variations safely (normalise text).
-
-**Outcome:** More resilient scrapers that survive minor refactors without
-constant maintenance.
-
-- Each job card is an `article` element marked `data-automation="normalJob"`;
-  the fields that can be gathered from it are:
-
-  - **Title:** text of the element marked `data-automation="jobTitle"`
-  - **Company:** text of the element marked `data-automation="jobCompany"`
-  - **Location:** text of the element marked `data-automation="jobLocation"`
-  - **Short description:** text of the element marked
-    `data-automation="jobShortDescription"` (if present)
-  - **Job ID:** `data-job-id` attribute
-  - Only fields visible here can be automatically gathered.
-
-- **Contact info (phone/email):**
-
-  - **Not present** in Seek job cards — must be found by operator using dorks,
-    company sites and public resources.
-
-- **Search fields:** the search form’s keyword and location inputs; not needed
-  for scraping, since seed listing URLs are pre-composed (see “Search Bar &
-  Automation Mapping” below).
-
-**Shell extraction outline (PDL):**
-
-```pdl
-MODULE parse_listings(html_text)
-PURPOSE:
-  Extract structured fields from raw listing HTML using stable markers.
-INPUTS:
-  html_text : string containing page HTML
-OUTPUTS:
-  A list of extracted records with fields: title, company, location, summary, job_id
-ALGORITHM:
-  1. Split html_text into article chunks at '<article
' - 2. For each chunk that contains 'data-automation="normalJob"' do - a. title := extract text from marker 'data-automation="jobTitle"' - b. company := extract text from marker 'data-automation="jobCompany"' - c. location := extract text from marker 'data-automation="jobLocation"' - d. summary := extract text from marker 'data-automation="jobShortDescription"' (if present) - e. job_id := extract value of attribute 'data-job-id' (if present) - f. If title is not empty then emit a record with the above fields - 3. Return the collection of records -NOTES: - - Prefer automation attributes where available; fall back to surrounding semantic tags only if necessary. -``` - -### Seek.com.au JavaScript Behaviour & Scraping Approach (Update as of Dec. 2025) - -Although Seek.com.au’s search UI uses dynamic JavaScript features (type-ahead -suggestions, toggle controls, etc.), **the actual job listing pages are -server-rendered and respond to standard URL query parameters** such as -`keywords`, `where`, and `start`. This makes scraping feasible using static -tools. - -**Key points:** - -- **No headless browser required:** - Listing pages can be fetched by constructing query URLs and using static HTTP - requests (e.g. `curl`). All job data and pagination elements appear in the - HTML and can be parsed with shell tools (`grep`, `awk`, `sed`). -- Dynamic UI features (like suggestion dropdowns) are cosmetic and do not affect - the underlying listing pages or endpoints. -- **Stable HTML selectors:** - Listing markup and pagination controls use stable `data-automation` attributes - suitable for parsing and extraction. -- No official API or browser automation is necessary, as long as Seek continues - to render results on the server-side. -- **If Seek ever transitions to client-only rendering (e.g. React hydration - without SSR),** switch to a headless browser or suitable alternative for - interactive/manual extraction. -- **Best practice:** Construct breadth-first collections of filtered seed - listing URLs to avoid simulating the JavaScript search form. - -**Bottom line:** -For this project, **headless browser automation is not required** and static -shell scripting is fully sufficient for daily scraping—future browser automation -is optional and only needed if Seek changes its technical approach. - ---- - -## Appendix: Seed URLs & Google-Dork Examples - -### Seek.com.au Regions/Categories - -| Location | Base URL | -| -------------------------- | --------------------------------------------------------------------------- | -| Perth, WA | | -| Perth, WA (Fly-In Fly-Out) | | -| Perth, WA (Mobilisation) | | -| Perth, WA (Travel) | | -| Darwin, NT | | -| ... | ... (See seeds.csv for full list) | - -See 'Filtered Seeds' below for a breadth-first coverage strategy using -server-rendered URLs with pre-set filters. - -### Seeds & Coverage Checklist - -Use this checklist to ensure breadth and correctness: - -- [ ] Add generic `/jobs` seeds for core keyword+location pairs. -- [ ] Add work type seeds (full-time, part-time, contract, casual). -- [ ] Add remote option seeds (on-site, hybrid, remote). -- [ ] Add salary type and range seeds (annual/monthly/hourly + min/max). -- [ ] Add date listed seeds (1, 3, 7, 14, 31). -- [ ] Add major city/region seeds (capitals + key regions). -- [ ] Add category+region seeds (e.g., FIFO, Engineering, ICT, Healthcare). -- [ ] Ensure each seed is routed to the correct paginator (`start` vs `page`). -- [ ] Verify “Next” detection on the first and last pages; log any changes. 
-- [ ] Record run totals (seeds visited, pages fetched, listings parsed). - -### Filtered Seeds (breadth-first coverage without JS simulation) - -The search bar UX (type-ahead suggestions, toggles) is JavaScript-driven, but -**listing pages themselves** are addressable with **pre-composed URLs**. -Originating your crawl from filtered listing URLs avoids headless-browser -automation for the search form while still covering the same search space. - -#### Recommended seed types - -- **Work type:** `/jobs/full-time`, `/jobs/part-time`, `/jobs/contract-temp`, - `/jobs/casual-vacation` -- **Remote options:** `/jobs/on-site`, `/jobs/hybrid`, `/jobs/remote` -- **Salary filters (type and range):** - - `salarytype=annual|monthly|hourly` - - `salaryrange=min-max` (e.g., `salaryrange=30000-100000`) -- **Date listed:** `daterange=1|3|7|14|31` (today → monthly) -- **Cities/regions:** `/jobs/in-All-Perth-WA`, `/jobs/in-All-Sydney-NSW`, etc. -- **Category+region:** e.g., `/fifo-jobs/in-Western-Australia-WA`, - `/engineering-jobs/in-All-Melbourne-VIC` - -#### Workflow for seeds - -1. Maintain `seeds.csv` with 1 URL per line, each representing a filtered slice. -2. For each seed: - - Detect route (Batch 1) → choose pagination strategy. - - Crawl until "Next" vanishes (Batch 4). -3. Merge parsed listings; dedupe by company (see Batch 9, Validation). -4. Log coverage (seed → pages visited → number of listings). - -> **Why this works:** These links are server-rendered listing views that present -> enough HTML markers to parse without simulating client-side JS (type-ahead, -> form submissions). - -```pdl -MODULE process_seeds(seed_file) -PURPOSE: - Read seeds from a file and run the pagination process for each seed. -ALGORITHM: - 1. For each line 'seed' in seed_file do - a. call run_pagination(seed) - b. record the seed processing results in logs - 2. End -``` - -### Example Google/DuckDuckGo dorks - -```text -"{Name}" "{Company}" (email OR "mobile number" OR contact OR phone OR mobile OR "email address" OR "contact information") site:.com.au -"{Name}" "{Company}" "contact us" site:.com.au -filetype:pdf "{Company}" "contact" site:.com.au -"{Company}" "contact details" site:.com.au -``` - -### Example Output Row - -```text -company_name,prospect_name,title,phone,email,location -XYZ Pty Ltd,John Smith,Managing Director,0412345678,email@xyz.com.au,Perth, WA -ABC Ltd,Mary Jane,Owner,0498765432,test@abc.com.au,Darwin, NT -Business Name,Henry Smith,CFO,0411111111,henry@business.com.au,Adelaide, SA -``` - ---- - -## Risk Management Summary - -- _Rate limiting & CAPTCHA_: Always pace requests conservatively, rotate UAs, - and manually skip/record if CAPTCHA is hit -- _Data quality_: Strict rules and validation, with manual spot checks - ---- - -## Deliverables - -1. Full requirements document (this file) -2. Seed URLs and dork template file -3. Companies history file (admin-managed) -4. Scripts for CSV extraction, validation and error logging -5. Documentation/manuals for auditing and admin steps - ---- - -## Search Bar & Automation Mapping - -### Seek.com.au - -- **Keywords Field**: - `` -- **Location Field**: - `` -- **Search Button**: - `SEEK` - - JS automation required to trigger searches - -#### Shell example (PDL) - -```pdl -MODULE fetch_url_once(url) -PURPOSE: - Perform a single HTTP GET and return the page content. -INPUTS: - url : string -OUTPUTS: - page content on success - error indicator on failure -ALGORITHM: - 1. Perform an HTTP GET with a reasonable timeout - 2. 
If successful return the response body - 3. Otherwise return an error -``` - ---- - -### DuckDuckGo Lite Field Mapping - -- **Query Field:** `` -- **Search Button:** `` -- Example: - `http GET 'https://lite.duckduckgo.com/lite/?q=company+email+site:.com.au'` -- Interactive/manual only—never scraped or parsed automatically - ---- - -### Google.com.au Field Mapping - -- **Query Field:** - `