A simple tool to scrape emails from local business websites while respecting their `robots.txt` rules. This scraper crawls pages within the same domain and extracts emails, saving them neatly into a CSV file for later use.
When you need to gather emails from a website, this tool makes it simple by automating the process:

- Respects `robots.txt`: Avoids crawling sites that explicitly forbid it (a minimal sketch of this check follows the list).
- Domain-specific crawling: Keeps things tidy by only crawling pages within the specified domain.
- CSV output: Saves found emails (along with the page they were found on) into a CSV file for easy reference.
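
The `robots.txt` check can be done entirely with Python's standard library. Below is a minimal sketch; the helper name `can_crawl` and the `"*"` user agent are illustrative assumptions, not necessarily what `main.py` does:

```python
# Minimal sketch of a robots.txt permission check; can_crawl and the
# "*" user agent are illustrative assumptions, not the repo's exact code.
from urllib import robotparser


def can_crawl(base_url: str, url: str, user_agent: str = "*") -> bool:
    """Return True if base_url's robots.txt allows user_agent to fetch url."""
    parser = robotparser.RobotFileParser()
    # Assumes base_url has no trailing slash, e.g. "https://example.com".
    parser.set_url(f"{base_url}/robots.txt")
    parser.read()  # download and parse robots.txt
    return parser.can_fetch(user_agent, url)
```

Note that when `robots.txt` is missing, `RobotFileParser` allows everything by default, which matches the behavior described above: only sites that explicitly forbid crawling are skipped.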
- Spider crawl the domain: The scraper checks the domain’s `robots.txt` for permissions and then crawls all pages linked within the domain.
- Extract emails: It uses a regex pattern to find emails in the HTML content of each page (see the sketch after this list).
- Save results: Outputs an `emails.csv` file with two columns:
  - `email`: The email address found.
  - `page_found`: The page where the email was located.
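
Roughly, those steps could look like the sketch below. The regex, the helper names, and the same-domain filter are assumptions based on the description above, not the exact code in `main.py`:

```python
# Illustrative sketch of the crawl-filter, extract, and save steps;
# EMAIL_RE, extract_emails, same_domain, and save_results are assumed names.
import csv
import re
from urllib.parse import urlparse

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def extract_emails(html: str) -> set[str]:
    """Return every email-like string found in a page's HTML."""
    return set(EMAIL_RE.findall(html))


def same_domain(base_url: str, link: str) -> bool:
    """Keep the crawl domain-specific: only follow links on the same host."""
    return urlparse(link).netloc == urlparse(base_url).netloc


def save_results(rows: list[tuple[str, str]], path: str = "emails.csv") -> None:
    """Write (email, page_found) rows to a two-column CSV file."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["email", "page_found"])  # header row
        writer.writerows(rows)
```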
- Clone this repo to your machine:
  ```bash
  git clone https://github.com/williamgregorio/local-email-collector.git
  cd local-email-collector
  python -m venv .venv
  source .venv/bin/activate
  ```
- Install the required Python packages:
  ```bash
  pip install -r requirements.txt
  ```
- Run the tool with a domain of your choice:
  ```bash
  python main.py https://example.com
  ```

  (Use `python3` instead of `python` if that is how Python 3 is invoked on your system.)
- The argument is the site to scrape, e.g. `https://example.com` (make sure to include `http://` or `https://`!)
- Output directory: All results are saved in the `emails` directory.
- E.g., if you scrape `https://example.com`, you’ll get `emails/example.com/emails.csv`.
- Results: A CSV file containing two columns (a sample appears after this list):
  - Email: The email address found.
  - Page Found: The page where the email was located.
- At the end of the run, the tool reports how many emails it found.
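
For example, a run against `https://example.com` might produce an `emails.csv` like the following (the addresses here are made up for illustration):

```csv
email,page_found
info@example.com,https://example.com/contact
sales@example.com,https://example.com/about
```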