Commit 7223af0

committed: Web Scraper for Articles
1 parent f99c6ee commit 7223af0
File tree

2 files changed: +236 -0 lines changed


Python/web_scraper/README.md

Lines changed: 121 additions & 0 deletions
@@ -0,0 +1,121 @@
# Web Scraper

A Python command-line tool for scraping news articles from websites using the `newspaper3k` library. The tool can extract individual articles or all articles from a news website and export them to JSON or CSV format.

## Features

- **Single Article Scraping**: Extract content from a specific article URL
- **Bulk Article Scraping**: Scrape all articles linked from a news website homepage
- **Multiple Export Formats**: Export data as JSON or CSV
- **Custom File Names**: Specify custom output file names
- **Article Metadata**: Extract title, authors, publication date, content, and URL

## Installation

1. Ensure you have Python 3.6+ installed
2. Install the required dependencies:

```bash
pip install newspaper3k
```

## Usage

### Basic Single Article Scraping

```bash
python web_scraper.py "https://example.com/news-article"
```

This creates a `news.json` file with the scraped article data.

### Scrape All Articles from a News Site

```bash
python web_scraper.py "https://example-news.com" --all-articles
```

### Export to CSV Format

```bash
python web_scraper.py "https://example.com/article" --csv-format
```

### Custom Output File Name

```bash
python web_scraper.py "https://example.com/article" --file my_articles
```
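The scraper appends the extension for the chosen export format whenever the supplied name lacks it, so `--file my_articles` with JSON output produces `my_articles.json`. A minimal sketch of that naming logic (`resolve_file_name` is a hypothetical helper written for illustration, not part of the tool):

```python
def resolve_file_name(file_name: str, export_format: str) -> str:
    """Append .json or .csv unless the name already ends with that extension."""
    extension = '.' + export_format
    if not file_name.endswith(extension):
        return file_name + extension
    return file_name

print(resolve_file_name('my_articles', 'json'))  # my_articles.json
print(resolve_file_name('data.csv', 'csv'))      # data.csv
```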
### Combine Options

```bash
# Scrape all articles and export as CSV with a custom filename
python web_scraper.py "https://example-news.com" -a -csv -f my_data
```

## Command Line Arguments

| Argument | Short | Description | Default |
|----------|-------|-------------|---------|
| `url` | - | URL of the webpage to scrape (required) | - |
| `--file` | `-f` | Custom output filename | `news` |
| `--csv-format` | `-csv` | Export to CSV instead of JSON | `False` |
| `--all-articles` | `-a` | Scrape all articles from the site | `False` |

## Output Format

### JSON Output

```json
[
  {
    "title": "Article Title",
    "authors": ["Author One", "Author Two"],
    "publish_date": "2023-10-15 14:30:00",
    "text": "Full article content...",
    "url": "https://example.com/article"
  }
]
```
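The exported JSON can be read back with the standard library alone. A small round-trip sketch using sample data in the format shown above (the record here is illustrative, not real scraped output):

```python
import json

# One record mirroring the export format above (sample data).
articles = [{
    'title': 'Article Title',
    'authors': ['Author One', 'Author Two'],
    'publish_date': '2023-10-15 14:30:00',
    'text': 'Full article content...',
    'url': 'https://example.com/article',
}]

# json.dump writes this structure to the output file; round-tripping
# through a string shows the data survives serialization unchanged.
loaded = json.loads(json.dumps(articles, indent=2))
print(loaded[0]['authors'])  # ['Author One', 'Author Two']
```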
### CSV Output

The CSV file contains columns for:

- `title`
- `authors` (as a string representation of the list)
- `publish_date`
- `text`
- `url`
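Because `csv.DictWriter` stores the authors list as its string representation (e.g. `"['Author One', 'Author Two']"`), reading the CSV back requires parsing that cell into a real list. A sketch of the round trip on sample data, using `ast.literal_eval` for the parse:

```python
import ast
import csv
import io

# Write one sample row the same way the tool does: DictWriter stringifies
# the authors list into the cell.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=['title', 'authors', 'publish_date', 'text', 'url'])
writer.writeheader()
writer.writerow({
    'title': 'Article Title',
    'authors': ['Author One', 'Author Two'],
    'publish_date': '2023-10-15 14:30:00',
    'text': 'Full article content...',
    'url': 'https://example.com/article',
})

# Read it back and recover the list from its string representation.
buffer.seek(0)
row = next(csv.DictReader(buffer))
authors = ast.literal_eval(row['authors'])  # "['Author One', 'Author Two']" -> list
print(authors)  # ['Author One', 'Author Two']
```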
## Examples

1. **Scrape a single article to JSON:**
   ```bash
   python web_scraper.py "https://www.bbc.com/news/world-us-canada-12345678"
   ```

2. **Scrape all articles from CNN and export as CSV:**
   ```bash
   python web_scraper.py "https://www.cnn.com" -a -csv -f cnn_articles
   ```

3. **Scrape with custom JSON filename:**
   ```bash
   python web_scraper.py "https://example.com/article" -f my_article_data
   ```

## Notes

- The tool uses the `newspaper3k` library, which may not work with all websites, especially those with heavy JavaScript rendering or anti-scraping measures
- Some news sites may block automated scraping attempts
- The quality of extracted content depends on the website's structure and the `newspaper3k` library's parsing capabilities
- For sites with many articles, `--all-articles` may take considerable time

## Error Handling

- If scraping fails, the tool displays an error message
- Empty results are indicated with appropriate messages
- Network issues and parsing errors are caught and reported

## License

This tool is provided for educational and personal use. Please respect website terms of service and robots.txt files when scraping.

Python/web_scraper/web_scraper.py

Lines changed: 115 additions & 0 deletions
@@ -0,0 +1,115 @@
#!/usr/bin/python3

import argparse
import csv
import json

import newspaper


class WebScraper:
    def __init__(self, url, file_name='news', export_format='json'):
        self.url = url

        if export_format not in ('json', 'csv'):
            raise ValueError('Export format must be either json or csv.')

        self.export_format = export_format

        # Append the matching extension unless the caller already supplied it.
        if export_format == 'json' and not file_name.endswith('.json'):
            self.FILE_NAME = file_name + '.json'
        elif export_format == 'csv' and not file_name.endswith('.csv'):
            self.FILE_NAME = file_name + '.csv'
        else:
            self.FILE_NAME = file_name

    def export_to_JSON(self, articles):
        with open(self.FILE_NAME, 'w', encoding='utf-8') as f:
            json.dump(articles, f, indent=2)

    def export_to_CSV(self, articles):
        with open(self.FILE_NAME, 'w', newline='', encoding='utf-8') as f:
            writer = csv.DictWriter(f, fieldnames=['title', 'authors', 'publish_date', 'text', 'url'])
            writer.writeheader()
            for article in articles:
                writer.writerow(article)

    def get_one_article(self, url=None):
        """Download and parse a single article, returning its metadata as a dict."""
        target_url = url or self.url
        try:
            article = newspaper.Article(target_url)
            article.download()
            article.parse()
            return {
                'title': article.title or 'No title found',
                'authors': article.authors or ['Unknown author'],
                'publish_date': article.publish_date.strftime('%Y-%m-%d %H:%M:%S') if article.publish_date else None,
                'text': article.text or 'No content found',
                'url': target_url,
            }
        except Exception as e:
            print(f'Error scraping {target_url}: {e}')
            return None

    def get_all_articles(self):
        """Scrape every article linked from the site, skipping any that fail."""
        try:
            summaries = []
            # memoize_articles=False forces a fresh crawl instead of reusing cached results.
            paper = newspaper.build(self.url, memoize_articles=False)
            for art in paper.articles:
                summary = self.get_one_article(art.url)
                if summary:
                    summaries.append(summary)
            return summaries
        except Exception as e:
            print(f'Error building newspaper from {self.url}: {e}')
            return []


def main():
    parser = argparse.ArgumentParser(description='Web Scraper for News')
    parser.add_argument('url', help='URL of the webpage to scrape')
    parser.add_argument('--file', '-f', default='news',
                        help='Custom output file (default: news.json or news.csv)')
    parser.add_argument('--csv-format', '-csv', action='store_true',
                        help='Export to CSV format instead of JSON format')
    parser.add_argument('--all-articles', '-a', action='store_true',
                        help='Scrape all articles linked from the URL instead of only the article at the URL itself')

    args = parser.parse_args()

    export_format = 'csv' if args.csv_format else 'json'

    try:
        web_scraper = WebScraper(
            url=args.url,
            file_name=args.file,
            export_format=export_format,
        )

        if args.all_articles:
            articles = web_scraper.get_all_articles()
        else:
            single_article = web_scraper.get_one_article()
            articles = [single_article] if single_article else []

        if articles:
            if export_format == 'json':
                web_scraper.export_to_JSON(articles)
            else:
                web_scraper.export_to_CSV(articles)

            print(f'Successfully exported {len(articles)} articles to {web_scraper.FILE_NAME}')
        else:
            print('No articles found to export.')

    except Exception as e:
        print(f'Error: {e}')


if __name__ == '__main__':
    main()

0 commit comments