Contact Scraper

A simple pair of scripts for scraping a large amount of contact information from a website in relatively little time:

  1. contact_scraper.py: scrape all the email addresses and phone numbers from a list of URLs (such as one collected with the script below)

  2. get_all_links_console_function.js: collect a list of URLs from a webpage

Email addresses and phone numbers are scraped by searching each webpage for mailto and tel links inside anchor tags. Such tags could look like the two examples below:

<a href="tel:555-1234">Call us</a>
<a href="mailto:ollie@example.com">Email us</a>

contact_scraper.py

Features

  • Extracts emails and phone numbers from web pages.
  • Handles multiple URLs concurrently (see the sketch after this list).
  • Supports output as JSON or CSV.
  • Validates extracted email addresses.
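
The concurrent fetching is what grequests (installed in the usage step below) is typically used for. The following is a minimal sketch of that pattern with placeholder URLs; the script's own implementation may differ:

    # Sketch of concurrent fetching with grequests (illustrative; the script may differ).
    # grequests monkey-patches via gevent, so import it before modules that use sockets.
    import grequests

    urls = ["https://example.com", "https://another.com"]  # placeholder URLs

    # Build unsent requests, then send them all concurrently.
    pending = (grequests.get(u, timeout=10) for u in urls)
    responses = grequests.map(pending)  # results come back in the same order as urls

    for url, resp in zip(urls, responses):
        if resp is not None and resp.ok:
            print(url, "->", len(resp.text), "characters of HTML")
        else:
            print(url, "-> request failed")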

Usage

  1. Install required libraries:

    pip install beautifulsoup4 grequests urllib3
  2. Run the script:

    python contact_scraper.py [-o <OUTPUT_FILE>] <URL1> <URL2> ...
    • Replace <URL1>, <URL2>, ... with the URLs you want to scrape.
    • Use -o <OUTPUT_FILE> to specify the output file (the extension-based dispatch is sketched after this list).
      • If the file extension is .json, the output will be JSON.
      • If the file extension is .csv, the output will be CSV. Please note that only the first email address and phone number scraped from each page will be saved to CSV; to store all scraped data, use a JSON file.
      • If no output file is specified, the results will be pretty-printed to the console as JSON.
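
The extension-based dispatch described above could be implemented roughly as follows. This is an illustrative sketch only: it assumes results are grouped per URL as a mapping of url -> {"emails": [...], "phones": [...]}; the script's actual argument handling and schema may differ.

    # Sketch of choosing the output format from the -o file extension (illustrative only).
    import csv
    import json

    def write_results(results, output_path=None):
        """results: assumed shape {url: {"emails": [...], "phones": [...]}}."""
        if output_path is None:
            # No -o given: pretty-print JSON to the console.
            print(json.dumps(results, indent=2))
        elif output_path.lower().endswith(".json"):
            with open(output_path, "w", encoding="utf-8") as f:
                json.dump(results, f, indent=2)
        elif output_path.lower().endswith(".csv"):
            # CSV keeps only the first email/phone found per page, as noted above.
            with open(output_path, "w", newline="", encoding="utf-8") as f:
                writer = csv.writer(f)
                writer.writerow(["url", "email", "phone"])
                for url, info in results.items():
                    email = info["emails"][0] if info["emails"] else ""
                    phone = info["phones"][0] if info["phones"] else ""
                    writer.writerow([url, email, phone])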

Example

python contact_scraper.py https://example.com https://another.com -o contacts.json

This will extract contact information from both URLs and save it to the file contacts.json.

Limitations

  • The script currently does not validate phone numbers. This functionality is marked as TODO and has not been implemented yet (one possible approach is sketched after this list).
  • The script does not perform cross-page deduplication (i.e. it does not detect contact info that appears on multiple pages).
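
For the phone-number validation TODO, one option is the third-party phonenumbers package. This is only a suggestion, not part of the project, and the default region used below ("US") is an assumption:

    # Possible phone validation using the phonenumbers package (pip install phonenumbers).
    # Suggestion only; not part of this project. The default region "US" is an assumption.
    import phonenumbers

    def is_valid_phone(raw: str, region: str = "US") -> bool:
        try:
            parsed = phonenumbers.parse(raw, region)
        except phonenumbers.NumberParseException:
            return False
        return phonenumbers.is_valid_number(parsed)

    print(is_valid_phone("+44 20 7946 0958"))  # True: a well-formed UK number
    print(is_valid_phone("555-1234"))          # False: too short to be a valid US number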

get_all_links_console_function.js

Features

  • Collect a list of URLs from a webpage.
  • Deduplicate the links collected.
  • Output the links as a single string with a custom separator.

Usage

Copy and paste this file's contents into your browser's console and call the scapeLinks() function with appropriate parameters to scrape all the links from the current webpage.
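
The helper itself is JavaScript and runs in the browser console. Purely to illustrate the same collect, deduplicate, and join steps outside the browser, a rough Python analogue (not part of this project) could look like this:

    # Rough Python analogue of collecting, deduplicating, and joining a page's links.
    # Illustration only; the actual helper is JavaScript run in the browser console.
    from urllib.parse import urljoin

    import requests
    from bs4 import BeautifulSoup

    def collect_links(page_url: str, separator: str = " ") -> str:
        html = requests.get(page_url, timeout=10).text
        soup = BeautifulSoup(html, "html.parser")
        # Resolve relative hrefs against the page URL; dict.fromkeys dedupes but keeps order.
        links = dict.fromkeys(
            urljoin(page_url, a["href"]) for a in soup.find_all("a", href=True)
        )
        return separator.join(links)

    print(collect_links("https://example.com", separator="\n"))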
