Description
TL;DR:
To implement "Head Office Location Detection," complete the sub-issues in this order:
- DuckDuckGo robots.txt compliance (Implement DuckDuckGo robots.txt Compliance Check #16)
- Google-dorking/contact page scraping (Add Google-Dorking and Contact Page Scraping Logic #15)
- Output format/validation (Update Output Format and Validation Scripts #13)
- Manual lookup fallback (Implement “Manual Lookup Needed” Fallback #14)
- Config/docs update (Update Configuration and Documentation #12)
- Testing/QA (Testing and QA for Head Office Location Detection #11)
Compliance comes first, then extraction, output, manual fallback, docs/config, and finally QA.
Feature Name
Head Office Location Detection
Is your feature request related to a problem? Please describe.
Currently, Elvis does not identify or record the Head Office location of companies in the call list. This information would be valuable for users who need to know the main office location for outreach or reporting.
Describe the solution you'd like
Elvis should extract and include the Head Office location for each company, if available, during the scraping process. This could be an additional field in the output or a new column in the call list.
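As a rough illustration of the "additional field" idea, the sketch below appends a head-office column to one call list record and falls back to the "Manual lookup needed" tag when no address was found. The pipe-delimited record layout, the function names, and the HEAD_OFFICE_ENABLED variable are assumptions made for this example only; the actual call list format and the option names in etc/elvisrc are not specified in this issue.

```shell
# Hypothetical toggle; the real key name in etc/elvisrc is not defined in this issue.
HEAD_OFFICE_ENABLED="${HEAD_OFFICE_ENABLED:-true}"

# Print the value for the head-office column given a candidate address.
# Leading/trailing whitespace is trimmed; an empty result becomes the fallback tag.
head_office_field() {
    candidate=$(printf '%s' "$1" | sed 's/^[[:space:]]*//; s/[[:space:]]*$//')
    if [ -z "$candidate" ]; then
        printf '%s\n' "Manual lookup needed"
    else
        printf '%s\n' "$candidate"
    fi
}

# Append the head-office column to one record.
# Assumption: records are pipe-delimited; adjust the separator to the real format.
append_head_office() {
    record=$1
    address=$2
    if [ "$HEAD_OFFICE_ENABLED" != "true" ]; then
        # Feature switched off: pass the record through unchanged.
        printf '%s\n' "$record"
        return 0
    fi
    printf '%s|%s\n' "$record" "$(head_office_field "$address")"
}
```

For example, `append_head_office 'Acme Pty Ltd|0400 000 000' ''` would emit the record with "Manual lookup needed" in the new last column, so downstream tooling can filter on that tag.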
Implementation details (optional)
- Add logic to the extraction scripts (e.g., data_input.sh, lib/*.awk, lib/*.sed) to capture the Head Office location: run multiple Google-dorking queries via https://lite.duckduckgo.com/lite to find the company's head office address in the search results, then refine the result by visiting the company's web pages (contact page) and business listings.
- If no result is found via automated methods, flag the entry with a tag such as "Manual lookup needed" to indicate that human intervention is required.
- Update the output format and validation scripts to support the new field and flag.
- Consider configuration options in etc/elvisrc for toggling this feature.
- Ensure all queries to DuckDuckGo Lite comply with robots.txt.
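One possible shape for the robots.txt check is sketched below. It reads a robots.txt file on standard input and reports whether a given path is allowed for the generic `User-agent: *` group. This is a deliberately simplified reading of the rules: only plain prefix matching is handled (no `*` or `$` wildcards), the longest matching rule wins, and Allow wins a tie. The function name robots_allows is hypothetical and not part of Elvis.

```shell
# robots_allows PATH < robots.txt
# Exit status 0 if PATH is allowed for "User-agent: *", 1 if it is disallowed.
robots_allows() {
    awk -v path="$1" '
        # Strip a trailing carriage return and any trailing comment.
        { sub(/\r$/, ""); sub(/[ \t]*#.*$/, "") }
        # Skip lines that are empty after stripping.
        /^[ \t]*$/ { next }
        {
            colon = index($0, ":")
            if (colon == 0) next
            key = tolower(substr($0, 1, colon - 1))
            val = substr($0, colon + 1)
            gsub(/^[ \t]+|[ \t]+$/, "", key)
            gsub(/^[ \t]+|[ \t]+$/, "", val)
        }
        key == "user-agent" {
            # The first User-agent line after a rule line starts a new group.
            if (!in_agents) applies = 0
            in_agents = 1
            if (val == "*") applies = 1
            next
        }
        { in_agents = 0 }
        applies && (key == "allow" || key == "disallow") && val != "" {
            # Prefix match; keep the longest rule, preferring Allow on equal length.
            if (index(path, val) == 1 && length(val) >= best_len) {
                if (length(val) > best_len || key == "allow") {
                    best_len = length(val)
                    verdict = key
                }
            }
        }
        END { exit (verdict == "disallow" ? 1 : 0) }
    '
}
```

In use, the robots.txt served at the root of the target host would be fetched once (for instance with `curl -fsS`) and piped into `robots_allows /lite/` before any search query is sent; a non-zero status should stop the lookup for that host.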
Alternatives considered
- Manual lookup of Head Office locations after scraping as a fallback when automated methods do not yield results (flagged as "Manual lookup needed").
Additional context
This request was received as feedback from a user via the GitHub MCP server.
Compliance Check
- I have reviewed the compliance settings in etc/elvisrc.
- I have read the Security policy.
- I have read the Contribution guidelines.
- This feature does not bypass robots.txt or violate any security guidelines.
Project Board:
https://github.com/users/2MuchC0ff33/projects/1
Recommended Sub-Issue Completion Order
To implement the "Head Office Location Detection" feature efficiently, please address the sub-issues in the following order:
- Implement DuckDuckGo robots.txt Compliance Check #16: DuckDuckGo robots.txt compliance
  - Ensure all scraping and data extraction methods are compliant with robots.txt and project compliance policies.
- Add Google-Dorking and Contact Page Scraping Logic #15: Google-dorking/contact page scraping
  - Implement logic to extract head office/contact info from company websites using compliant search and scraping methods.
- Update Output Format and Validation Scripts #13: Output format/validation
  - Define and validate the output format for head office location data in the calllist.
- Implement “Manual Lookup Needed” Fallback #14: Manual lookup fallback
  - Add a manual lookup or override mechanism for cases where automated extraction fails.
- Update Configuration and Documentation #12: Config/docs update
  - Update configuration files and documentation to reflect new options, toggles, and compliance notes.
- Testing and QA for Head Office Location Detection #11: Testing/QA
  - Implement and run tests to ensure all new logic is robust, compliant, and meets quality standards.
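For the output validation and QA steps, a check along the following lines could be a starting point. It verifies that every record has the expected number of fields and that the last field (assumed here to hold the head office location or the "Manual lookup needed" tag) is not blank. The pipe delimiter, the position of the field, and the name validate_calllist are assumptions for illustration; the real format is to be defined in sub-issue #13.

```shell
# validate_calllist EXPECTED_FIELDS < records
# Prints one line per problem and exits non-zero if any record is invalid.
validate_calllist() {
    awk -F'|' -v want="$1" '
        # Wrong column count: report and skip the remaining checks for this line.
        NF != want {
            printf "line %d: expected %d fields, got %d\n", NR, want, NF
            bad++
            next
        }
        # Last column must contain an address or the manual-lookup tag.
        $NF ~ /^[ \t]*$/ {
            printf "line %d: head office field is empty\n", NR
            bad++
        }
        END { exit (bad > 0 ? 1 : 0) }
    '
}
```

A test script could feed known-good and known-bad fixture files through `validate_calllist 3` and assert on the exit status, which keeps the QA step independent of live network access.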
Rationale:
- Compliance must be established first to avoid rework and ensure all subsequent work is policy-aligned.
- Extraction logic should be implemented before output and manual fallback mechanisms.
- Output format and validation should be defined before integrating manual overrides.
- Documentation and configuration should be updated after core logic is in place.
- Testing and QA should be the final step to validate the complete feature.
Please follow this order to maximize efficiency and minimize rework. If dependencies or blockers arise, update this list accordingly.