Inconsistent Lead Generation from Seed URLs and User Agents #8

@2MuchC0ff33

GitHub Issue: Inconsistent Lead Generation in home/calllist.txt


Description:

Summary

The Elvis POSIX shell-based web scraper intermittently fails to generate a consistent and sufficient number of leads in home/calllist.txt when using seed URLs from srv/urls.txt and user agents from srv/ua.txt. This results in daily call lists with zero or very few entries, as evidenced by audit logs and output files.


Findings

1. Configuration Centralization

  • All configuration is intended to reside in etc/elvisrc (paths, toggles, limits).
  • Some scripts may not source etc/elvisrc, or may ignore its values, and fall back to hard-coded paths or defaults.
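
To make the configuration finding concrete, here is a minimal sketch of a shared loader that every script could call before doing any work. It assumes etc/elvisrc is a plain shell fragment of KEY=value lines. The helper names load_config and require_vars, and the keys URLS_FILE and UA_FILE, are hypothetical; only UA_ROTATE and etc/elvisrc appear in this issue.

```shell
# Sketch only: load_config and require_vars are hypothetical helpers.
# URLS_FILE and UA_FILE are assumed key names; UA_ROTATE comes from the issue.

# require_vars NAME...: return non-zero if any named variable is empty or unset.
require_vars() {
    missing=0
    for name in "$@"; do
        # Indirect lookup of the variable whose name is stored in $name.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "config: required setting $name is empty or unset" >&2
            missing=1
        fi
    done
    return "$missing"
}

# load_config [FILE]: source the config file, then check the required keys.
load_config() {
    rc=${1:-etc/elvisrc}
    if [ ! -r "$rc" ]; then
        echo "config: cannot read $rc" >&2
        return 1
    fi
    . "$rc" || return 1
    require_vars URLS_FILE UA_FILE UA_ROTATE
}
```

A script would call `load_config || exit 1` at the top and then refer only to the sourced variables, so a missing key fails loudly at startup instead of silently falling back to a hard-coded path.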

2. Seed URL Handling

  • If srv/urls.txt is empty or missing, the run exits early (bin/elvis.sh).
  • No fallback or alert mechanism for insufficient or malformed seeds.
  • No validation of seed coverage (e.g., missing job types, regions, or pagination models).
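
A pre-run seed check could look roughly like the sketch below. It assumes one URL per line in srv/urls.txt, with blank lines and `#` comments allowed; that layout is an assumption, as is the helper name validate_seeds. It covers only the "empty or malformed" cases, not seed coverage by job type or region.

```shell
# Sketch only: validate_seeds is a hypothetical helper, not an existing Elvis function.
# Prints the number of usable seed URLs on stdout.
# Returns non-zero, with a reason on stderr, if the file is unusable.
validate_seeds() {
    file=$1
    if [ ! -s "$file" ]; then
        echo "seeds: $file is missing or empty" >&2
        return 1
    fi
    # Count well-formed and malformed lines in one pass.
    counts=$(awk '
        /^[ \t]*$/ { next }                          # skip blank lines
        /^[ \t]*#/ { next }                          # skip comments
        $0 ~ "^https?://[^ \t/]+" { good++; next }   # scheme plus a host
        { bad++ }
        END { printf "%d %d\n", good, bad }
    ' "$file")
    good=${counts% *}
    bad=${counts#* }
    if [ "$bad" -gt 0 ]; then
        echo "seeds: $bad malformed line(s) in $file" >&2
        return 1
    fi
    if [ "$good" -eq 0 ]; then
        echo "seeds: no usable URLs in $file" >&2
        return 1
    fi
    echo "$good"
}
```

Called as `validate_seeds srv/urls.txt || exit 1`, this turns a silent early exit into a logged, actionable reason.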

3. User Agent Rotation

  • UA rotation is enabled via UA_ROTATE=true in elvisrc.
  • If srv/ua.txt is empty or contains invalid entries, requests may fail or be blocked.
  • No robust fallback when the UA list is exhausted, and no detection of blocked requests.
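
One possible shape for a UA fallback is sketched below. It assumes one user agent per line in srv/ua.txt and a simple round-robin by index; how Elvis actually rotates UAs is not shown in this issue. The helper pick_ua and the DEFAULT_UA string are hypothetical.

```shell
# Sketch only: pick_ua and DEFAULT_UA are hypothetical, not part of Elvis.
DEFAULT_UA='Mozilla/5.0 (compatible; elvis-fallback)'

# pick_ua FILE [INDEX]: print the user agent at INDEX (modulo the number of
# usable lines). Falls back to DEFAULT_UA when the file has no usable entry.
pick_ua() {
    file=$1
    index=${2:-0}
    ua=''
    if [ -r "$file" ]; then
        ua=$(awk -v n="$index" '
            /^[ \t]*$/ { next }          # skip blank lines
            /^[ \t]*#/ { next }          # skip comments
            { list[count++] = $0 }
            END { if (count > 0) print list[n % count] }
        ' "$file")
    fi
    if [ -z "$ua" ]; then
        echo "ua: no usable entry in $file, using fallback" >&2
        ua=$DEFAULT_UA
    fi
    printf '%s\n' "$ua"
}
```

The warning on stderr matters as much as the fallback itself: a run that quietly used the default UA for every request should be visible in the logs.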

4. Lead Extraction & Validation

  • Extraction relies on modular AWK/SED scripts; parsing failures (due to site changes or selector drift) result in empty candidate rows.
  • Validation (lib/validate_calllist.sh) enforces strict format and minimum unique companies.
  • If validation fails, a placeholder is written by lib/default_handler.sh, masking root causes.
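
As an illustration of reporting why validation failed instead of writing a placeholder, the sketch below distinguishes "no rows at all" from "too few unique companies". The real call list format is not shown in this issue, so the sketch assumes one lead per line with the company name in the first `|`-separated field. check_calllist and its return codes are hypothetical and do not describe lib/validate_calllist.sh.

```shell
# Sketch only: check_calllist is hypothetical. It assumes one lead per line
# with the company name in the first '|'-separated field.
# Returns 0 if valid, 1 if there are no rows, 2 if too few unique companies.
check_calllist() {
    file=$1
    min=$2
    if [ ! -s "$file" ]; then
        echo "validate: reason=no_rows file=$file" >&2
        return 1
    fi
    # Count distinct company names, ignoring case.
    unique=$(awk -F'|' '
        $1 != "" { seen[tolower($1)] = 1 }
        END { n = 0; for (k in seen) n++; print n }
    ' "$file")
    if [ "$unique" -lt "$min" ]; then
        echo "validate: reason=too_few_companies unique=$unique min=$min" >&2
        return 2
    fi
    return 0
}
```

Distinct return codes let the caller decide whether to retry extraction (no rows) or relax the threshold (too few companies), rather than treating every failure the same way.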

5. Deduplication & History

  • Deduplication is case-insensitive and history-aware.
  • If history grows stale or is not updated, legitimate leads may be skipped.
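
For reference, case-insensitive, history-aware deduplication can be expressed in a few lines of awk. The sketch assumes one company name per line in both the history file and the candidate stream; the helper name dedupe_against_history is hypothetical and the actual Elvis implementation may differ.

```shell
# Sketch only: dedupe_against_history is hypothetical.
# Usage: dedupe_against_history HISTORY_FILE < candidates
# Prints each candidate whose lower-cased form is neither in the history
# file nor already printed earlier in the same run.
dedupe_against_history() {
    hist_file=$1
    awk -v hist="$hist_file" '
        BEGIN {
            # Load history; a missing file simply yields an empty set.
            while ((getline line < hist) > 0) seen[tolower(line)] = 1
            close(hist)
        }
        {
            key = tolower($0)
            if (key != "" && !(key in seen)) {
                seen[key] = 1
                print
            }
        }
    '
}
```

Any name present in the history file is dropped unconditionally, so a stale entry suppresses a legitimate lead on every run until the history is corrected.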

6. Logging & Error Handling

  • Logs are written and rotated, but error messages may be too generic (e.g., "no_matches").
  • Audit logs show repeated runs with zero leads, but do not pinpoint the failure stage.
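
A stage-tagged log line is one way to make the failure point visible. In the sketch below, the helper log_event, the LOG_FILE variable, and the stage=/reason=/detail= field names are all assumptions for illustration, not the existing Elvis log format.

```shell
# Sketch only: log_event, LOG_FILE, and the field names are hypothetical.
# Usage: log_event STAGE REASON [DETAIL...]
# Appends one timestamped line to $LOG_FILE (stderr if LOG_FILE is unset).
log_event() {
    stage=$1
    reason=$2
    shift 2
    printf '%s stage=%s reason=%s detail="%s"\n' \
        "$(date -u '+%Y-%m-%dT%H:%M:%SZ')" "$stage" "$reason" "$*" \
        >> "${LOG_FILE:-/dev/stderr}"
}
```

For example, `log_event parse no_matches "seed=3 selector=company"` records which stage produced the "no_matches" result and for which seed, which the generic message alone does not.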

Root Causes

  • Empty or malformed seed URLs/user agents.
  • Parsing failures due to site changes or selector drift.
  • Overly strict validation or deduplication logic.
  • Configuration drift (values not sourced or overridden).
  • Lack of granular error reporting and fallback logic.

Recommendations

  1. Seed & UA Validation

    • Add pre-run checks for non-empty, well-formed srv/urls.txt and srv/ua.txt.
    • Log and abort with actionable errors if seeds or UAs are missing/invalid.
  2. Modular Fallbacks

    • Implement fallback logic for empty candidate rows (e.g., try alternate selectors, log parsing errors).
    • Avoid writing generic placeholders; instead, log the specific failure reason.
  3. Configuration Hygiene

    • Ensure all scripts source etc/elvisrc and avoid hard-coded values.
    • Document all config options in docs/USAGE.md and validate at startup.
  4. Lead Extraction Robustness

    • Regularly review and update AWK/SED extraction scripts to match site changes.
    • Add test fixtures for new job board layouts.
  5. Deduplication & History Management

    • Periodically audit srv/company_history.txt for accuracy.
    • Allow manual override or review of deduped entries.
  6. Logging & Error Reporting

    • Enhance logging to capture the exact stage and reason for lead generation failures.
    • Summarize run statistics (seeds processed, UAs used, parsing errors, validation failures).
  7. POSIX Best Practices

    • Use parameter expansion and config sourcing for all paths and toggles.
    • Avoid hard-coded values; rely on config and environment variables.
    • Maintain modularity by keeping extraction, validation, and logging logic in separate scripts.
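
The run statistics in recommendation 6 could be derived from the log itself. The sketch below assumes each log line carries a `stage=NAME` token; that convention and the helper name summarize_run are hypothetical and would need to match whatever log format is adopted.

```shell
# Sketch only: summarize_run is hypothetical and assumes each log line
# contains a whitespace-separated "stage=NAME" token.
# Usage: summarize_run LOG_FILE
# Prints "NAME COUNT" per stage, sorted by stage name.
summarize_run() {
    awk '
        {
            for (i = 1; i <= NF; i++)
                if ($i ~ /^stage=/) count[substr($i, 7)]++
        }
        END { for (s in count) printf "%s %d\n", s, count[s] }
    ' "$1" | sort
}
```

Appending this summary at the end of each run would show at a glance whether zero-lead days come from fetch, parse, or validation events.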

Labels: automation, config, data-quality, parsing, reliability, qa, utility, ops


Action Items:

  • Implement seed and UA file validation.
  • Improve error reporting and fallback logic in extraction and validation scripts.
  • Audit and refactor scripts to ensure all configuration is sourced from etc/elvisrc.
  • Update documentation to reflect new validation and error handling procedures.
  • Add tests for edge cases (empty/malformed seeds, UA blocks, parsing failures).

Please review, discuss, and assign for resolution. This issue aims to ensure consistent, reliable lead generation and maintain best practices in POSIX shell scripting and modular design.

Metadata

Labels

  • automation: Implementing automated scripts, workflows, and orchestration (e.g., run.sh, cron).
  • config: Creating/updating config files, templates, and examples (e.g., .ini, .conf).
  • data-quality: Validation, deduplication, and ensuring output accuracy (e.g., phone/email rules).
  • ops: Operational aspects like scheduling, logging, monitoring, and production readiness.
  • parsing: Data extraction and parsing logic (e.g., HTML to structured records).
  • qa: Quality assurance, manual testing, and final verification runs.
  • reliability: Improving fetch reliability, retries, timeouts, and error handling.
  • utility: Shared libraries, helpers, and reusable code components (e.g., logging libs).
