GitHub Issue: Inconsistent Lead Generation in home/calllist.txt
Description:
Summary
The Elvis POSIX shell-based web scraper intermittently fails to generate a consistent and sufficient number of leads in home/calllist.txt when using seed URLs from srv/urls.txt and user agents from srv/ua.txt. This results in daily call lists with zero or very few entries, as evidenced by audit logs and output files.
Findings
1. Configuration Centralization
- All configuration is intended to reside in etc/elvisrc (paths, toggles, limits).
- Some scripts may not fully source or respect these values, risking hard-coded paths or defaults.
2. Seed URL Handling
- If srv/urls.txt is empty or missing, the run exits early (bin/elvis.sh).
- No fallback or alert mechanism for insufficient or malformed seeds.
- No validation of seed coverage (e.g., missing job types, regions, or pagination models).
3. User Agent Rotation
- UA rotation is enabled via UA_ROTATE=true in elvisrc.
- If srv/ua.txt is empty or contains invalid entries, requests may fail or be blocked.
- No robust fallback for UA exhaustion or block detection.
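To make the missing fallback concrete, here is a minimal sketch of UA selection that degrades gracefully. It is not the current implementation: pick_ua, DEFAULT_UA, and the default UA string are hypothetical, and it assumes srv/ua.txt holds one user agent per line.

```shell
#!/bin/sh
# Hypothetical default used only when no usable entry exists.
DEFAULT_UA=${DEFAULT_UA:-'Mozilla/5.0 (compatible; elvis)'}

# pick_ua FILE
# Prints one user agent from FILE (assumed: one per line, blanks ignored).
# With UA_ROTATE=true a random entry is chosen, otherwise the first one.
# Falls back to DEFAULT_UA, with a warning, when FILE yields nothing.
pick_ua() {
    ua=''
    if [ -s "$1" ]; then
        if [ "${UA_ROTATE:-false}" = true ]; then
            ua=$(awk 'NF { l[++n] = $0 }
                      END { if (n) { srand(); print l[int(rand() * n) + 1] } }' "$1")
        else
            ua=$(awk 'NF { print; exit }' "$1")
        fi
    fi
    if [ -z "$ua" ]; then
        echo "ua: no usable entry in '$1', using default" >&2
        ua=$DEFAULT_UA
    fi
    printf '%s\n' "$ua"
}
```

Usage would look like `ua=$(pick_ua srv/ua.txt)`. Note that awk's srand() seeds from the clock, so two calls within the same second may return the same entry; block detection is out of scope for this sketch.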
4. Lead Extraction & Validation
- Extraction relies on modular AWK/SED scripts; parsing failures (due to site changes or selector drift) result in empty candidate rows.
- Validation (lib/validate_calllist.sh) enforces strict format and minimum unique companies.
- If validation fails, a placeholder is written by lib/default_handler.sh, masking root causes.
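For discussion, a validation step that reports why it failed could look roughly like the sketch below. This is not the contents of lib/validate_calllist.sh: the function name, the pipe-delimited row layout with the company in field 1, and the return codes are all assumptions.

```shell
#!/bin/sh
# validate_calllist FILE [MIN_UNIQUE]
# Assumed row layout: company|phone (pipe-delimited, company in field 1).
# Returns 1 for an empty file, 2 for malformed rows, 3 for too few companies,
# and logs the specific reason instead of a generic failure.
validate_calllist() {
    file=$1
    min=${2:-1}
    if [ ! -s "$file" ]; then
        echo "validate: reason=empty_calllist file=$file" >&2
        return 1
    fi
    # Count rows with fewer than two fields or an empty company name.
    bad=$(awk -F'|' 'NF < 2 || $1 == "" { n++ } END { print n + 0 }' "$file")
    if [ "$bad" -gt 0 ]; then
        echo "validate: reason=malformed_rows count=$bad file=$file" >&2
        return 2
    fi
    # Count distinct company names, ignoring case.
    unique=$(awk -F'|' '!seen[tolower($1)]++ { n++ } END { print n + 0 }' "$file")
    if [ "$unique" -lt "$min" ]; then
        echo "validate: reason=too_few_companies unique=$unique min=$min" >&2
        return 3
    fi
}
```

Distinct return codes would let the caller record the exact failure in the audit log rather than writing a placeholder.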
5. Deduplication & History
- Deduplication is case-insensitive and history-aware.
- If history grows stale or is not updated, legitimate leads may be skipped.
6. Logging & Error Handling
- Logs are written and rotated, but error messages may be too generic (e.g., "no_matches").
- Audit logs show repeated runs with zero leads, but do not pinpoint the failure stage.
Root Causes
- Empty or malformed seed URLs/user agents.
- Parsing failures due to site changes or selector drift.
- Overly strict validation or deduplication logic.
- Configuration drift (values not sourced or overridden).
- Lack of granular error reporting and fallback logic.
Recommendations
Seed & UA Validation
- Add pre-run checks for non-empty, well-formed srv/urls.txt and srv/ua.txt.
- Log and abort with actionable errors if seeds or UAs are missing/invalid.
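A pre-run check along these lines might be enough as a starting point. The function name, the messages, and the convention that lines starting with # are comments are assumptions for illustration, not existing behaviour.

```shell
#!/bin/sh
# check_list_file FILE LABEL [PATTERN]
# Succeeds if FILE has at least one usable line (non-blank, not starting
# with '#', which is an assumed comment convention) and, when PATTERN is
# given, every usable line matches it as an extended regular expression.
check_list_file() {
    file=$1
    label=$2
    pattern=${3:-}
    if [ ! -s "$file" ]; then
        echo "ERROR: $label file '$file' is missing or empty" >&2
        return 1
    fi
    entries=$(grep -v -E '^[[:space:]]*(#|$)' "$file" || :)
    if [ -z "$entries" ]; then
        echo "ERROR: $label file '$file' has no usable entries" >&2
        return 1
    fi
    if [ -n "$pattern" ]; then
        bad=$(printf '%s\n' "$entries" | grep -v -E -e "$pattern" || :)
        if [ -n "$bad" ]; then
            echo "ERROR: malformed $label entries in '$file':" >&2
            printf '%s\n' "$bad" | sed 's/^/  /' >&2
            return 1
        fi
    fi
    return 0
}

# Possible use near the top of the run script:
#   check_list_file srv/urls.txt seed '^https?://[^[:space:]]+$' || exit 2
#   check_list_file srv/ua.txt user-agent || exit 2
```

The URL pattern is deliberately loose; it only rejects lines that are obviously not HTTP(S) URLs.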
Modular Fallbacks
- Implement fallback logic for empty candidate rows (e.g., try alternate selectors, log parsing errors).
- Avoid writing generic placeholders; instead, log the specific failure reason.
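One possible shape for the fallback is to try a list of extractors in order and log a specific reason for each miss. The sketch assumes each extractor is a standalone AWK program file; the function name, log format, and return codes are hypothetical.

```shell
#!/bin/sh
# extract_with_fallback PAGE EXTRACTOR...
# Runs each AWK extractor against PAGE in order and prints the rows from
# the first one that produces output. Every miss is logged with a reason.
# Returns 2 for an empty page and 3 when no extractor yields rows.
extract_with_fallback() {
    page=$1
    shift
    if [ ! -s "$page" ]; then
        echo "extract: stage=fetch reason=empty_page page=$page" >&2
        return 2
    fi
    for extractor in "$@"; do
        if [ ! -r "$extractor" ]; then
            echo "extract: stage=parse reason=missing_extractor extractor=$extractor" >&2
            continue
        fi
        rows=$(awk -f "$extractor" "$page" 2>/dev/null) || {
            echo "extract: stage=parse reason=awk_error extractor=$extractor" >&2
            continue
        }
        if [ -n "$rows" ]; then
            printf '%s\n' "$rows"
            return 0
        fi
        echo "extract: stage=parse reason=no_rows extractor=$extractor page=$page" >&2
    done
    echo "extract: stage=parse reason=all_extractors_empty page=$page" >&2
    return 3
}
```

A caller could then pass the primary extractor first and any alternates after it, and treat return code 3 as a signal of selector drift rather than writing a placeholder.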
Configuration Hygiene
- Ensure all scripts source etc/elvisrc and avoid hard-coded values.
- Document all config options in docs/USAGE.md and validate at startup.
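As a rough illustration, every script could call a shared loader that sources the config, fills defaults through parameter expansion, and rejects invalid values at startup. Apart from UA_ROTATE and the file paths named in this issue, the variable names (ELVISRC, URLS_FILE, UA_FILE, MIN_LEADS) are hypothetical.

```shell
#!/bin/sh
# load_config [FILE]
# Sources the config file (default: $ELVISRC, else etc/elvisrc), applies
# defaults for anything it leaves unset, and validates the result.
load_config() {
    conf=${1:-${ELVISRC:-etc/elvisrc}}
    # '.' searches PATH for names without a slash, so force a relative path.
    case $conf in
        */*) ;;
        *) conf=./$conf ;;
    esac
    if [ ! -r "$conf" ]; then
        echo "config: cannot read '$conf'" >&2
        return 1
    fi
    . "$conf"

    # Defaults apply only when the config did not set a value.
    : "${URLS_FILE:=srv/urls.txt}"
    : "${UA_FILE:=srv/ua.txt}"
    : "${UA_ROTATE:=false}"
    : "${MIN_LEADS:=1}"

    case $UA_ROTATE in
        true|false) ;;
        *) echo "config: UA_ROTATE must be true or false, got '$UA_ROTATE'" >&2
           return 1 ;;
    esac
    case $MIN_LEADS in
        ''|*[!0-9]*) echo "config: MIN_LEADS must be a non-negative integer, got '$MIN_LEADS'" >&2
           return 1 ;;
    esac
}
```

Scripts would then start with `load_config || exit 1` and refer only to the variables, never to literal paths.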
Lead Extraction Robustness
- Regularly review and update AWK/SED extraction scripts to match site changes.
- Add test fixtures for new job board layouts.
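A fixture test can be as small as a saved page, the expected rows, and a comparison. The helper below is a sketch under that assumption; run_fixture and the file layout are hypothetical.

```shell
#!/bin/sh
# run_fixture EXTRACTOR INPUT EXPECTED
# Runs an AWK extractor against a saved page (INPUT) and compares its rows
# with the contents of EXPECTED. Prints PASS or FAIL; on failure a diff of
# expected versus actual goes to stderr and the function returns 1.
run_fixture() {
    extractor=$1
    input=$2
    expected=$3
    actual=$(awk -f "$extractor" "$input") || return 2
    if [ "$actual" = "$(cat "$expected")" ]; then
        echo "PASS $input"
    else
        echo "FAIL $input"
        printf '%s\n' "$actual" | diff "$expected" - >&2 || :
        return 1
    fi
}
```

When a job board changes layout, saving one page from the new layout plus its expected rows would turn selector drift into a failing test instead of an empty call list.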
Deduplication & History Management
- Periodically audit srv/company_history.txt for accuracy.
- Allow manual override or review of deduped entries.
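To support review, deduplication could write every dropped row to a side file together with the reason it was dropped. The sketch assumes pipe-delimited candidate rows with the company in field 1 and a history file with one company name per line; neither layout is confirmed, and dedupe_leads is a hypothetical name.

```shell
#!/bin/sh
# dedupe_leads CANDIDATES HISTORY SKIPPED
# Prints candidate rows whose company is neither in HISTORY nor already
# seen in this run (case-insensitive). Dropped rows are appended to SKIPPED
# as "reason|original row", where reason is "history" or "duplicate".
dedupe_leads() {
    candidates=$1
    history=$2
    skipped=$3
    : > "$skipped"
    awk -F'|' -v hist="$history" -v skipped="$skipped" '
        BEGIN {
            # A missing history file simply yields no entries.
            while ((getline line < hist) > 0)
                if (line != "") seen[tolower(line)] = "history"
            close(hist)
        }
        {
            key = tolower($1)
            if (key in seen) {
                print (seen[key] "|" $0) >> skipped
                next
            }
            seen[key] = "duplicate"
            print
        }
    ' "$candidates"
}
```

Someone auditing a thin call list could then read the skipped file to see whether stale history entries are suppressing legitimate leads.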
Logging & Error Reporting
- Enhance logging to capture the exact stage and reason for lead generation failures.
- Summarize run statistics (seeds processed, UAs used, parsing errors, validation failures).
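Stage-aware logging and a run summary need very little machinery. In the sketch below, counters live in a plain file so that helper scripts running in subshells can contribute; STATS_FILE, the function names, and the log format are hypothetical.

```shell
#!/bin/sh
# Hypothetical location for this run's counters.
STATS_FILE=${STATS_FILE:-${TMPDIR:-/tmp}/elvis_stats.$$}

# stat_incr KEY: record one occurrence of KEY (no spaces in KEY).
stat_incr() {
    printf '%s\n' "$1" >> "$STATS_FILE"
}

# log_failure STAGE REASON [DETAIL]
# Writes one timestamped line naming the failing stage, and counts it.
log_failure() {
    printf '%s stage=%s reason=%s detail=%s\n' \
        "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$1" "$2" "${3:-}" >&2
    stat_incr "fail_$1"
}

# stat_summary: print "key=count" for every key recorded so far.
stat_summary() {
    [ -f "$STATS_FILE" ] || return 0
    sort "$STATS_FILE" | uniq -c | awk '{ print $2 "=" $1 }'
}
```

For example, the run loop could call `stat_incr seeds_processed` per seed and `log_failure parse no_rows "$url"` on an empty extraction, then append the output of `stat_summary` to the audit log at the end of the run.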
POSIX Best Practices
- Use parameter expansion and config sourcing for all paths and toggles.
- Avoid hard-coded values; rely on config and environment variables.
- Maintain modularity by keeping extraction, validation, and logging logic in separate scripts.
Labels: automation, config, data-quality, parsing, reliability, qa, utility, ops
Action Items:
- Implement seed and UA file validation.
- Improve error reporting and fallback logic in extraction and validation scripts.
- Audit and refactor scripts to ensure all configuration is sourced from etc/elvisrc.
- Update documentation to reflect new validation and error handling procedures.
- Add tests for edge cases (empty/malformed seeds, UA blocks, parsing failures).
Please review, discuss, and assign for resolution. This issue aims to ensure consistent, reliable lead generation and maintain best practices in POSIX shell scripting and modular design.