Inconsistent Lead Generation from Seed URLs and User Agents #8

## GitHub Issue: Inconsistent Lead Generation in `home/calllist.txt`

---

**Description:**

### Summary

The Elvis POSIX shell-based web scraper intermittently produces too few leads in `home/calllist.txt` when run with seed URLs from `srv/urls.txt` and user agents from `srv/ua.txt`. Daily call lists end up with zero or very few entries, as shown by the audit logs and output files.

---

### Findings

#### 1. **Configuration Centralization**
- All configuration is intended to reside in `etc/elvisrc` (paths, toggles, limits).
- Some scripts may not source this file, or may ignore values it sets, and fall back to hard-coded paths or defaults instead.
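
One way each script could pick up the shared configuration is sketched below. It sources `etc/elvisrc` when the file is readable and then fills any unset value through parameter expansion, so the only defaults live in one function. `ELVIS_ROOT`, `URLS_FILE`, `UA_FILE`, and `CALLLIST_FILE` are hypothetical variable names chosen for illustration; only `UA_ROTATE` and the file paths appear in this issue.

```sh
#!/bin/sh
# Sketch only: ELVIS_ROOT, URLS_FILE, UA_FILE and CALLLIST_FILE are
# hypothetical names, not confirmed settings from etc/elvisrc.

load_config() {
    rc="${ELVIS_ROOT:-.}/etc/elvisrc"
    if [ -r "$rc" ]; then
        # Values set in the rc file take precedence over the defaults below.
        . "$rc"
    else
        printf 'WARN stage=config %s not readable, using defaults\n' "$rc" >&2
    fi
    # ":=" assigns the default only when the variable is unset or empty.
    : "${URLS_FILE:=srv/urls.txt}"
    : "${UA_FILE:=srv/ua.txt}"
    : "${CALLLIST_FILE:=home/calllist.txt}"
    : "${UA_ROTATE:=true}"
}
```

Calling `load_config` at the top of every script would make a missing or unreadable config file visible in the logs instead of silently changing behavior.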

#### 2. **Seed URL Handling**
- `bin/elvis.sh` exits early if `srv/urls.txt` is empty or missing.
- No fallback or alert mechanism for insufficient or malformed seeds.
- No validation of seed coverage (e.g., missing job types, regions, or pagination models).
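
A pre-run seed check along these lines could address the first two gaps. This is a minimal sketch, not the current behavior of `bin/elvis.sh`: the function name `validate_seeds` is hypothetical, and it assumes one URL per line, with blank lines and `#` comments allowed.

```sh
#!/bin/sh
# validate_seeds FILE
# Prints usable http(s) seed URLs on stdout and diagnostics on stderr.
# Returns non-zero when the file is missing, empty, or has no usable URL.
# Assumption: one URL per line; blank lines and "#" comments are ignored.

validate_seeds() {
    seeds_file=$1
    if [ ! -s "$seeds_file" ]; then
        printf 'ERROR stage=seeds %s is missing or empty\n' "$seeds_file" >&2
        return 1
    fi
    awk '
        /^[ \t]*(#|$)/ { next }                      # comment or blank line
        /^https?:\/\/[^ \t]+$/ { print; ok++; next } # well-formed seed
        {
            bad++
            printf "WARN stage=seeds line %d is malformed: %s\n", NR, $0 | "cat 1>&2"
        }
        END {
            printf "INFO stage=seeds usable=%d malformed=%d\n", ok, bad | "cat 1>&2"
            exit (ok > 0 ? 0 : 1)
        }
    ' "$seeds_file"
}
```

A caller could write the cleaned list to a temporary file and abort with the logged reason when the function returns non-zero.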

#### 3. **User Agent Rotation**
- UA rotation is enabled via `UA_ROTATE=true` in `elvisrc`.
- If `srv/ua.txt` is empty or contains invalid entries, requests may fail or be blocked.
- No robust fallback for UA exhaustion or block detection.
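
The sketch below shows one possible shape for rotation with a fallback. `pick_ua`, `is_blocked`, and `DEFAULT_UA` are hypothetical names, the fallback string is a placeholder, and treating HTTP 403 and 429 as block signals is an assumption rather than something the audit logs confirm.

```sh
#!/bin/sh
# Sketch only: names and the fallback UA string are hypothetical.
DEFAULT_UA=${DEFAULT_UA:-elvis-fallback/0.1}

# pick_ua FILE INDEX
# Prints the usable entry at position INDEX modulo the number of entries
# (round-robin). Blank lines and "#" comments are skipped. Falls back to
# DEFAULT_UA when the file is missing, empty, or has no usable entry.
pick_ua() {
    ua_file=$1
    n=$2
    ua=""
    if [ -s "$ua_file" ]; then
        ua=$(awk -v n="$n" '
            /^[ \t]*(#|$)/ { next }
            { list[count++] = $0 }
            END { if (count > 0) print list[n % count] }
        ' "$ua_file")
    fi
    if [ -z "$ua" ]; then
        printf 'WARN stage=ua no usable entry in %s, using fallback\n' "$ua_file" >&2
        ua=$DEFAULT_UA
    fi
    printf '%s\n' "$ua"
}

# is_blocked HTTP_STATUS
# Assumption: 403 and 429 indicate that the request was blocked or throttled.
is_blocked() {
    case $1 in
        403|429) return 0 ;;
        *)       return 1 ;;
    esac
}
```

A fetch loop could call `pick_ua "$UA_FILE" "$attempt"` and, when `is_blocked` reports a block, retry with the next index and log which user agent was rejected.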

#### 4. **Lead Extraction & Validation**
- Extraction relies on modular AWK/SED scripts; parsing failures (due to site changes or selector drift) result in empty candidate rows.
- Validation (`lib/validate_calllist.sh`) enforces strict format and minimum unique companies.
- If validation fails, a placeholder is written by `lib/default_handler.sh`, masking root causes.
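
To illustrate how validation could report a specific reason instead of handing off to a placeholder, here is a hedged sketch. It does not reproduce `lib/validate_calllist.sh`: the row format (`Company|Phone`, pipe-delimited) and the function name `check_calllist` are assumptions made only for this example.

```sh
#!/bin/sh
# check_calllist FILE MIN_COMPANIES
# Prints one status line with a machine-readable reason and returns
# non-zero on failure.
# Assumption: rows look like "Company|Phone"; the real format may differ.

check_calllist() {
    file=$1
    min=$2
    if [ ! -s "$file" ]; then
        printf 'FAIL stage=validate reason=empty_output file=%s\n' "$file"
        return 1
    fi
    awk -F'|' -v min="$min" '
        NF < 2 || $1 ~ /^[ \t]*$/ { bad++; next }   # missing field or company
        {
            name = tolower($1)                      # case-insensitive key
            gsub(/^[ \t]+|[ \t]+$/, "", name)       # trim surrounding blanks
            seen[name] = 1
        }
        END {
            uniq = 0
            for (c in seen) uniq++
            if (bad > 0) {
                printf "FAIL stage=validate reason=bad_format rows=%d\n", bad
                exit 1
            }
            if (uniq < min + 0) {
                printf "FAIL stage=validate reason=too_few_companies have=%d need=%d\n", uniq, min
                exit 1
            }
            printf "OK stage=validate companies=%d\n", uniq
        }
    ' "$file"
}
```

With a reason such as `too_few_companies have=2 need=3` in the log, an empty call list can be traced to validation rather than to fetching or parsing.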

#### 5. **Deduplication & History**
- Deduplication is case-insensitive and history-aware.
- If history grows stale or is not updated, legitimate leads may be skipped.
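
The deduplication step described above might look roughly like the following. It is a sketch under assumptions: both files hold one company name per line, and the function name `dedupe_against_history` is hypothetical. It also logs how many candidates were skipped, which makes an over-aggressive or stale history easier to spot.

```sh
#!/bin/sh
# dedupe_against_history CANDIDATES HISTORY
# Prints candidates whose name (case-insensitive, trimmed) is neither in
# HISTORY nor already printed. The skipped count goes to stderr.
# Assumption: one company name per line in both files.

dedupe_against_history() {
    cand=$1
    hist=$2
    # A missing history file is treated as empty rather than as an error.
    [ -r "$hist" ] || hist=/dev/null
    # "mode=..." operands are awk variable assignments applied just before
    # the file that follows them is read, so an empty history is handled.
    awk '
        {
            key = tolower($0)
            gsub(/^[ \t]+|[ \t]+$/, "", key)
        }
        key == ""       { next }
        mode == "hist"  { seen[key] = 1; next }
        !(key in seen)  { seen[key] = 1; print; next }
                        { skipped++ }
        END { printf "INFO stage=dedupe skipped=%d\n", skipped | "cat 1>&2" }
    ' mode=hist "$hist" mode=cand "$cand"
}
```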

#### 6. **Logging & Error Handling**
- Logs are written and rotated, but error messages may be too generic (e.g., "no_matches").
- Audit logs show repeated runs with zero leads, but do not pinpoint the failure stage.
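
A stage-tagged log line would let the audit log show where a run failed. The helper below is a sketch only: `log_stage`, `first_failed_stage`, `LOG_FILE`, and the `stage=`/`level=` key names are hypothetical and do not describe the existing log format.

```sh
#!/bin/sh
# log_stage STAGE LEVEL MESSAGE...
# Appends "<UTC timestamp> stage=<stage> level=<level> <message>" to
# LOG_FILE, or writes it to stderr when LOG_FILE is unset or empty.
log_stage() {
    stage=$1
    level=$2
    shift 2
    line=$(printf '%s stage=%s level=%s %s' \
        "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$stage" "$level" "$*")
    if [ -n "${LOG_FILE:-}" ]; then
        printf '%s\n' "$line" >> "$LOG_FILE"
    else
        printf '%s\n' "$line" >&2
    fi
}

# first_failed_stage LOGFILE
# Prints the stage named on the first level=ERROR line, if there is one.
first_failed_stage() {
    awk '
        {
            s = ""; err = 0
            for (i = 1; i <= NF; i++) {
                if ($i ~ /^stage=/) s = substr($i, 7)
                if ($i == "level=ERROR") err = 1
            }
            if (err) { print s; exit }
        }
    ' "$1"
}
```

For example, `log_stage parse ERROR reason=no_rows url="$url"` records both the stage and the cause, where a bare `no_matches` message records neither.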

---

### Root Causes

- **Empty or malformed seed URLs/user agents.**
- **Parsing failures due to site changes or selector drift.**
- **Overly strict validation or deduplication logic.**
- **Configuration drift (values not sourced or overridden).**
- **Lack of granular error reporting and fallback logic.**

---

### Recommendations

1. **Seed & UA Validation**
   - Add pre-run checks for non-empty, well-formed `srv/urls.txt` and `srv/ua.txt`.
   - Log and abort with actionable errors if seeds or UAs are missing/invalid.

2. **Modular Fallbacks**
   - Implement fallback logic for empty candidate rows (e.g., try alternate selectors, log parsing errors).
   - Avoid writing generic placeholders; instead, log the specific failure reason.

3. **Configuration Hygiene**
   - Ensure all scripts source `etc/elvisrc` and avoid hard-coded values.
   - Document all config options in `docs/USAGE.md` and validate at startup.

4. **Lead Extraction Robustness**
   - Regularly review and update AWK/SED extraction scripts to match site changes.
   - Add test fixtures for new job board layouts.

5. **Deduplication & History Management**
   - Periodically audit `srv/company_history.txt` for accuracy.
   - Allow manual override or review of deduped entries.

6. **Logging & Error Reporting**
   - Enhance logging to capture the exact stage and reason for lead generation failures.
   - Summarize run statistics (seeds processed, UAs used, parsing errors, validation failures).

7. **POSIX Best Practices**
   - Source `etc/elvisrc` for all paths and toggles, and use parameter expansion (e.g., `${VAR:-default}`) to supply defaults.
   - Avoid hard-coded values; rely on config and environment variables.
   - Maintain modularity by keeping extraction, validation, and logging logic in separate scripts.
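
As a rough illustration of the run statistics suggested in recommendation 6, separate scripts could append counters to a shared file and print totals at the end of the run. `STATS_FILE`, `stat_incr`, `stat_summary`, and the counter names in the usage note are hypothetical.

```sh
#!/bin/sh
# Sketch only: the stats file location and all names are hypothetical.
# Counters are appended as "name count" lines so that separate scripts
# (fetch, extract, validate) can record events without shared state.
STATS_FILE=${STATS_FILE:-/tmp/elvis_stats.$$}

# stat_incr NAME [N]: add N (default 1) to counter NAME.
stat_incr() {
    printf '%s %s\n' "$1" "${2:-1}" >> "$STATS_FILE"
}

# stat_summary: print "name=total" for every counter, sorted by name.
stat_summary() {
    [ -f "$STATS_FILE" ] || return 0
    awk '
        { total[$1] += $2 }
        END { for (k in total) printf "%s=%d\n", k, total[k] }
    ' "$STATS_FILE" | sort
}
```

A run that calls `stat_incr seeds_processed` per seed and `stat_incr parse_errors` per failed page could end with `stat_summary >> "$LOG_FILE"`, giving each audit entry a one-glance breakdown.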

---

**Labels:** `automation`, `config`, `data-quality`, `parsing`, `reliability`, `qa`, `utility`, `ops`

---

**Action Items:**
- [ ] Implement seed and UA file validation.
- [ ] Improve error reporting and fallback logic in extraction and validation scripts.
- [ ] Audit and refactor scripts to ensure all configuration is sourced from `etc/elvisrc`.
- [ ] Update documentation to reflect new validation and error handling procedures.
- [ ] Add tests for edge cases (empty/malformed seeds, UA blocks, parsing failures).

---

*Please review, discuss, and assign for resolution. This issue aims to ensure consistent, reliable lead generation and maintain best practices in POSIX shell scripting and modular design.*
