🕷️ Web Scraper App - Spring Boot

A Spring Boot application with a Swing GUI that scrapes news websites and analyzes individual articles, including sentiment analysis.

🚀 Features

Tab 1: Website Link Scraper

Smart Link Extraction: Scrapes and lists the latest news articles from a website
Content Preview: Click any link to view full article content
Image Display: Shows article images with proper loading and scaling
News Focus: Filters out navigation and footer links so that only articles are shown
Performance Optimized: Background loading prevents UI freezing
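
The README does not say how navigation and footer links are told apart from articles. The following is a minimal sketch of one possible heuristic, not the application's actual code: the class name ArticleLinkFilter, the Link record, and both rules (at least four words of link text, at least two URL path segments) are assumptions made for illustration. Only the cap of 25 links comes from the Configuration section.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper: keeps links that look like articles and drops short
// navigation/footer labels. The thresholds are illustrative assumptions.
final class ArticleLinkFilter {
    static final int MAX_LINKS = 25; // "Max article links" from the Configuration section

    record Link(String url, String text) {}

    static boolean looksLikeArticle(Link link) {
        // Navigation labels ("Home", "Contact us") tend to be very short.
        if (link.text().trim().split("\\s+").length < 4) {
            return false;
        }
        String path;
        try {
            path = URI.create(link.url()).getPath();
        } catch (IllegalArgumentException e) {
            return false; // malformed URL
        }
        if (path == null) {
            return false; // e.g. mailto: links have no path
        }
        // Count non-empty path segments; section fronts are usually shallow.
        int segments = 0;
        for (String part : path.split("/")) {
            if (!part.isEmpty()) {
                segments++;
            }
        }
        return segments >= 2;
    }

    static List<Link> filter(List<Link> candidates) {
        Set<String> seen = new HashSet<>();
        List<Link> result = new ArrayList<>();
        for (Link link : candidates) {
            if (result.size() >= MAX_LINKS) {
                break;
            }
            // Keep the first occurrence of each article URL, in page order.
            if (looksLikeArticle(link) && seen.add(link.url())) {
                result.add(link);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Link> links = List.of(
                new Link("https://example.com/", "Home"),
                new Link("https://example.com/news/policy-changes",
                        "Major economic policy changes announced"));
        filter(links).forEach(l -> System.out.println(l.text() + " -> " + l.url()));
    }
}
```

In a real scraper the candidate list would come from the anchors JSoup finds on the page; the filter itself needs nothing beyond the JDK.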

Tab 2: Article Analyzer

Detailed Article Parsing: Extract headline, author, publish date, and content
Sentiment Analysis: Classifies each article's tone as Positive, Negative, or Neutral
Word Count: Automatic article statistics
Image Extraction: Finds and displays article images
Keyword Analysis: Shows positive/negative sentiment keywords

🛠️ Technology Stack

Spring Boot 3.2.0 - Application framework
Java Swing - Desktop GUI
JSoup 1.16.2 - HTML parsing and web scraping
Apache HttpClient 5 - HTTP connections
Java 17 - Runtime environment

📋 Prerequisites

Java 17 or higher
Maven 3.6+
Internet connection for web scraping

🏃‍♂️ Running the Application

Option 1: Maven

mvn spring-boot:run

Option 2: JAR

mvn clean package
java -jar target/web-scraper-app-1.0.0.jar

Option 3: IDE

Run the WebScraperApplication.java main class

🌐 Supported Websites

✅ Confirmed Working:

BBC News (https://www.bbc.com/)
CNN (https://www.cnn.com/)
Reuters (https://www.reuters.com/)
NBC News (https://www.nbcnews.com/)
The Guardian (https://www.theguardian.com/)

⚠️ May Block Automated Requests:

Telegraph India
Many paywalled news sites
Sites with heavy JavaScript content loading

📖 How to Use

Website Link Scraper (Tab 1)

Enter a news website URL (e.g., https://www.bbc.com/)
Click "Get Links" to scrape latest articles
Select any article from the list to view content and images
Images load automatically in the background

Article Analyzer (Tab 2)

Paste a specific article URL
Click "Analyze Article"
View extracted details:
- Headline and Author
- Publication Date
- Sentiment Analysis with color coding
- Full Content with word count
- Article Images
- Sentiment Keywords (positive/negative words found)

🧠 Sentiment Analysis

The built-in sentiment analyzer:

Analyzes emotional tone of articles
Scores from -1.0 to +1.0 (negative to positive)
Color coding: 🟢 Positive, 🔴 Negative, 🔵 Neutral
Keyword detection shows sentiment-bearing words
Statistical analysis with word count metrics
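
To make the scoring range concrete, here is a minimal sketch of a keyword-based scorer that maps text to a value between -1.0 and +1.0. It is an illustration under stated assumptions, not the analyzer shipped with the application: the class name SentimentSketch, the formula (positive minus negative hits, divided by total hits), and the 0.05 neutral band are all hypothetical. The word lists are simply the keywords shown in the Example Output section.

```java
import java.util.Locale;
import java.util.Set;

// Hypothetical keyword-based scorer; formula and thresholds are assumptions.
final class SentimentSketch {
    static final Set<String> POSITIVE = Set.of("progress", "improve", "success");
    static final Set<String> NEGATIVE = Set.of("crisis", "problem", "decline", "concern");
    static final double NEUTRAL_BAND = 0.05; // scores within +/-0.05 count as Neutral

    // Returns (positive hits - negative hits) / (all hits), or 0.0 when nothing matches.
    static double score(String text) {
        int positive = 0;
        int negative = 0;
        for (String word : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (POSITIVE.contains(word)) {
                positive++;
            } else if (NEGATIVE.contains(word)) {
                negative++;
            }
        }
        int hits = positive + negative;
        return hits == 0 ? 0.0 : (double) (positive - negative) / hits;
    }

    static String label(double score) {
        if (score > NEUTRAL_BAND) {
            return "Positive";
        }
        if (score < -NEUTRAL_BAND) {
            return "Negative";
        }
        return "Neutral";
    }

    // Whitespace-separated word count, as used for article statistics.
    static int wordCount(String text) {
        String trimmed = text.trim();
        return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }

    public static void main(String[] args) {
        String text = "Progress and success despite the crisis";
        double s = score(text);
        System.out.printf("%s (%.2f), %d words%n", label(s), s, wordCount(text));
    }
}
```

Dividing by the number of keyword hits keeps the result inside [-1.0, +1.0] regardless of article length. The real analyzer may normalize differently, so scores from this sketch will not necessarily match the application's output.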

⚙️ Configuration

Timeouts and Limits

Connection timeout: 5-8 seconds
Read timeout: 8-10 seconds
Max article links: 25 (for performance)
Max images per article: 3-5
Max image size: 300x200px (scaled automatically)
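
As an illustration of the 300x200px limit, the sketch below computes target dimensions that fit inside that box while preserving the aspect ratio. The class and method names are hypothetical, and never enlarging small images is an assumption; the application may scale differently.

```java
// Hypothetical helper: fits an image into the 300x200 box without distortion.
final class ImageScaling {
    static final int MAX_WIDTH = 300;
    static final int MAX_HEIGHT = 200;

    // Returns {width, height} scaled down to fit the box; small images are left as-is.
    static int[] fit(int width, int height) {
        if (width <= 0 || height <= 0) {
            throw new IllegalArgumentException("dimensions must be positive");
        }
        double scale = Math.min(1.0,
                Math.min((double) MAX_WIDTH / width, (double) MAX_HEIGHT / height));
        int scaledWidth = Math.max(1, (int) Math.round(width * scale));
        int scaledHeight = Math.max(1, (int) Math.round(height * scale));
        return new int[] {scaledWidth, scaledHeight};
    }

    public static void main(String[] args) {
        int[] size = fit(1200, 600);
        System.out.println(size[0] + "x" + size[1]);
    }
}
```

The resulting width and height could then be passed to java.awt.Image.getScaledInstance before the image is shown in a Swing label.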

Request Headers

The application sends browser-like request headers to reduce the chance of being blocked:

Modern Chrome User-Agent
Accept headers for HTML/images
Referer headers for legitimacy
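
The sketch below shows what such a request could look like. It uses the JDK's built-in java.net.http client (Java 17) as a stand-in so that it runs without extra dependencies; the application itself is described as using JSoup and Apache HttpClient, whose APIs differ. The class name, the exact User-Agent string, and the chosen timeout values (the upper ends of the ranges listed above) are assumptions.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.time.Duration;

// Hypothetical request builder using the JDK HTTP client as a stand-in.
final class BrowserLikeRequest {
    // Example Chrome-style User-Agent; the string the application sends may differ.
    static final String USER_AGENT =
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
            + "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36";

    static HttpClient newClient() {
        return HttpClient.newBuilder()
                .connectTimeout(Duration.ofSeconds(8)) // connection timeout
                .followRedirects(HttpClient.Redirect.NORMAL)
                .build();
    }

    static HttpRequest pageRequest(String url, String referer) {
        return HttpRequest.newBuilder(URI.create(url))
                // Time allowed for the response; stands in for the read timeout.
                .timeout(Duration.ofSeconds(10))
                .header("User-Agent", USER_AGENT)
                .header("Accept",
                        "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8")
                .header("Referer", referer)
                .GET()
                .build();
    }

    public static void main(String[] args) {
        HttpRequest request = pageRequest("https://www.bbc.com/", "https://www.bbc.com/");
        // Only prints the prepared headers; no network call is made here.
        request.headers().map().forEach((name, values) ->
                System.out.println(name + ": " + String.join(", ", values)));
    }
}
```

Sending the request would be a matter of calling newClient().send(request, HttpResponse.BodyHandlers.ofString()) and handling the checked exceptions it declares.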

🔧 Dependencies

Add to your pom.xml:

<dependencies>
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter</artifactId>
    </dependency>
    <dependency>
        <groupId>org.jsoup</groupId>
        <artifactId>jsoup</artifactId>
        <version>1.16.2</version>
    </dependency>
    <dependency>
        <groupId>org.apache.httpcomponents.client5</groupId>
        <artifactId>httpclient5</artifactId>
    </dependency>
</dependencies>

🚨 Error Handling

The application handles common issues:

403 Forbidden: Website blocks automated requests
Connection timeouts: Network or server issues
SSL errors: Certificate problems with HTTPS sites
Image loading failures: Graceful fallbacks with error messages
Content extraction failures: Clear user feedback
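
One way to turn these failures into clear user feedback is sketched below. This is an assumption about structure, not the application's code: the class name ScrapeErrors and the message wording are hypothetical. The exception types are standard JDK classes.

```java
import java.io.IOException;
import java.net.SocketTimeoutException;
import javax.net.ssl.SSLException;

// Hypothetical mapping from low-level failures to messages shown in the GUI.
final class ScrapeErrors {
    static String describeStatus(int status) {
        if (status == 403) {
            return "403 Forbidden: the website blocks automated requests";
        }
        if (status >= 400) {
            return "HTTP " + status + ": the page could not be retrieved";
        }
        return "OK";
    }

    static String describe(Exception error) {
        // Check the specific subclasses first; both extend IOException.
        if (error instanceof SocketTimeoutException) {
            return "Connection timed out: network or server issue";
        }
        if (error instanceof SSLException) {
            return "SSL error: certificate problem with the HTTPS site";
        }
        if (error instanceof IOException) {
            return "Network error: " + error.getMessage();
        }
        return "Unexpected error: " + error.getMessage();
    }

    public static void main(String[] args) {
        System.out.println(describeStatus(403));
        System.out.println(describe(new SocketTimeoutException("read timed out")));
    }
}
```

Keeping the mapping in one place means both tabs can report failures with the same wording.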

📊 Example Output

=== ARTICLE ANALYSIS ===

HEADLINE: Breaking: Major Economic Policy Changes Announced
AUTHOR: John Smith
PUBLISHED: 2024-08-07 10:30:00
SENTIMENT: Negative (-0.23)
WORD COUNT: 847 words

=== SENTIMENT KEYWORDS ===
Positive: progress, improve, success
Negative: crisis, problem, decline, concern

🤝 Contributing

Fork the repository
Create a feature branch (git checkout -b feature/amazing-feature)
Commit your changes (git commit -m 'Add amazing feature')
Push to the branch (git push origin feature/amazing-feature)
Open a Pull Request

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🐛 Known Issues

JavaScript-heavy sites: JSoup cannot execute JavaScript, so dynamic content may not be captured
Anti-bot protection: Some sites actively block automated requests
Image loading: Some images may fail to load because of hotlink protection (Referer checks) or authentication requirements

💡 Tips for Best Results

Use major news sites: BBC, CNN, and Reuters work best
Check robots.txt: Respect website scraping policies
Don't overwhelm servers: Built-in delays prevent server overload
Try different URLs: If one site blocks requests, try an alternative

🔮 Future Enhancements

Export analysis results to PDF/CSV
Advanced sentiment analysis with machine learning
Support for RSS feeds
Batch article analysis
Custom keyword tracking
Article comparison features

Built with ❤️ using Spring Boot and Java Swing

For questions or issues, please open a GitHub issue or contact the maintainer.
