Page Scraper

Page Scraper is a lightweight Chrome extension that extracts and cleans web page content for analysis, machine learning, or research. It filters out boilerplate content, navigation elements, and ads, and returns structured text data.

✨ Features

🧠 Intelligent Content Extraction

Smart Content Detection: Automatically identifies and extracts main content while filtering out navigation, ads, and boilerplate text
Quality Scoring: Scores content blocks for relevance and assigns a quality score, so low-value text can be filtered out
Multi-format Support: Extracts headings, paragraphs, and list items with proper hierarchical structure

📊 Content Analytics

Word Count: Real-time word and sentence counting
Quality Metrics: Content quality scoring based on relevance and structure
Reading Statistics: Average sentence length and readability analysis
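
As a rough illustration of how these statistics could be derived, the sketch below computes word count, sentence count, and average sentence length from a list of content blocks. The function name computeMetadata and the tokenization rules (whitespace-separated words, sentences ending in ., ! or ?) are assumptions for illustration, not the extension's actual code. The quality score is left out because its formula is not described here.

```javascript
// Hypothetical helper: derives basic reading statistics from content blocks.
// Each block is expected to look like { type: "paragraph", text: "..." }.
function computeMetadata(blocks) {
  // Words: runs of non-whitespace characters across all blocks.
  const words = blocks
    .flatMap((block) => block.text.split(/\s+/))
    .filter(Boolean);

  // Sentences: split each block after ., ! or ? followed by whitespace.
  // This is a rough heuristic; abbreviations such as "e.g." will be miscounted.
  const sentences = blocks
    .flatMap((block) => block.text.split(/(?<=[.!?])\s+/))
    .map((sentence) => sentence.trim())
    .filter(Boolean);

  const average =
    sentences.length === 0 ? 0 : words.length / sentences.length;

  return {
    wordCount: words.length,
    sentenceCount: sentences.length,
    // Rounded to one decimal place.
    avgSentenceLength: Math.round(average * 10) / 10,
  };
}
```

A block without closing punctuation, such as a heading, counts as one sentence under this heuristic.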

💾 Data Management

History Tracking: Automatically saves up to 20 recently scraped pages
Persistent Storage: Content survives browser restarts
Quick Access: One-click access to previously scraped content
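
The 20-page limit can be pictured as a capped, most-recent-first list. The following is a minimal sketch under stated assumptions: the helper name addToHistory is hypothetical, and replacing an older entry for the same URL is an assumed policy that the extension may or may not follow.

```javascript
// Limit taken from the feature list: up to 20 recently scraped pages.
const HISTORY_LIMIT = 20;

// Hypothetical helper: returns a new history array with `entry` first.
// Assumption: an older entry with the same URL is replaced, not duplicated.
function addToHistory(history, entry, limit = HISTORY_LIMIT) {
  const withoutDuplicate = history.filter((item) => item.url !== entry.url);
  // Newest first; anything beyond the limit falls off the end.
  return [entry, ...withoutDuplicate].slice(0, limit);
}
```

Inside the extension, the returned array could then be persisted with chrome.storage.local.set, for example under a key such as history (the key name is an assumption).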

📤 Export Options

Copy to Clipboard: Instant copying with formatted metadata
Single Download: Export individual pages as text files
Bulk Export: Download all history as a combined file with timestamps
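
To make the bulk export concrete, here is one way a combined file could be assembled from the history. This is a sketch, not the extension's actual output format: the function name formatBulkExport, the separator line, and the scrapedAt timestamp field are all assumptions introduced for illustration.

```javascript
// Hypothetical formatter: joins all history entries into one text document.
// Assumption: each entry carries a `scrapedAt` timestamp (ms since epoch)
// in addition to the url, title, and content fields.
function formatBulkExport(history) {
  const separator = "\n\n" + "=".repeat(60) + "\n\n";

  return history
    .map((page) => {
      const header = [
        `Title: ${page.title}`,
        `URL: ${page.url}`,
        `Scraped: ${new Date(page.scrapedAt).toISOString()}`,
      ].join("\n");
      // One line per content block; block types are ignored in this sketch.
      const body = page.content.map((block) => block.text).join("\n");
      return `${header}\n\n${body}`;
    })
    .join(separator);
}
```

In the popup, the resulting string could be wrapped in a Blob and offered through a temporary download link.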

🎨 User Experience

Progress Tracking: Real-time scraping progress with animated indicators
Responsive Design: Adaptive interface that expands when content is available
Visual Feedback: Button states and animations for all actions

🚀 Installation

Download or clone this repository
Open Chrome and navigate to chrome://extensions/
Enable "Developer mode" in the top right
Click "Load unpacked" and select the extension folder
The Page Scraper icon will appear in your Chrome toolbar

📖 How to Use

Navigate to any webpage you want to scrape
Click the Page Scraper extension icon in your toolbar
Press the "Scrape & Clean" button
View the extracted content with quality metrics
Export using copy, download, or bulk download options

Content Structure

The extension organizes content into three types:

Headings (H1-H6): Marked with # prefix
Paragraphs: Clean body text
Lists: Bulleted items marked with • prefix
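
The prefixes above could be applied by a small rendering step like the one sketched here. The function names renderBlock and renderContent are hypothetical, and using one # per heading level is an assumption, since only a "# prefix" is specified. The type names match those in the Data Format section.

```javascript
// Hypothetical renderer: turns one content block into a line of text.
function renderBlock(block) {
  switch (block.type) {
    case "heading":
      // Assumption: heading depth is shown by repeating "#" (H2 -> "##").
      return `${"#".repeat(block.level || 1)} ${block.text}`;
    case "list-item":
      return `• ${block.text}`;
    default:
      // Paragraphs (and unknown types) are emitted as plain body text.
      return block.text;
  }
}

// Renders a whole page, separating blocks with a blank line.
function renderContent(content) {
  return content.map(renderBlock).join("\n\n");
}
```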

🔧 Technical Details

Content Filtering Algorithm

Removes navigation elements, footers, and sidebars
Filters out cookie notices and advertisement content
Scores content blocks based on text density and semantic relevance
Normalizes text formatting and removes special characters
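
The steps above can be approximated with a simple heuristic filter. The sketch below is not the extension's algorithm: it swaps in link density and text length as a stand-in for "text density and semantic relevance", and the token list, thresholds, and block shape ({ hint, text, linkTextLength }) are all assumptions chosen for illustration.

```javascript
// Assumed markers of boilerplate containers (class names, ids, or tag names).
const BOILERPLATE_TOKENS = new Set([
  "nav", "navbar", "navigation", "menu", "footer", "sidebar",
  "cookie", "cookies", "ad", "ads", "advert", "advertisement", "banner",
]);

// True if any token in the hint (e.g. "main-nav") names a boilerplate region.
function looksLikeBoilerplate(hint) {
  return hint
    .toLowerCase()
    .split(/[\s_-]+/)
    .some((token) => BOILERPLATE_TOKENS.has(token));
}

// Scores a block from 0 to 1: long text with few links scores high.
function scoreBlock(block) {
  const length = block.text.trim().length;
  if (length === 0) return 0;
  // Link density: share of the text that sits inside links.
  const linkDensity = Math.min((block.linkTextLength || 0) / length, 1);
  // Length factor saturates at 200 characters (arbitrary threshold).
  const lengthFactor = Math.min(length / 200, 1);
  return (1 - linkDensity) * lengthFactor;
}

// Drops boilerplate, normalizes whitespace, and keeps blocks above minScore.
function filterBlocks(blocks, minScore = 0.3) {
  return blocks
    .filter((block) => !looksLikeBoilerplate(block.hint || ""))
    .map((block) => ({ ...block, text: block.text.replace(/\s+/g, " ").trim() }))
    .filter((block) => scoreBlock(block) >= minScore);
}
```

Token matching on whole words avoids false positives such as "heads" or "loads" matching "ad".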

Data Format

{
  url: "https://example.com",
  title: "Page Title",
  content: [
    { type: "heading", level: 1, text: "Main Heading" },
    { type: "paragraph", text: "Content paragraph..." },
    { type: "list-item", text: "List item content" }
  ],
  metadata: {
    wordCount: 1250,
    sentenceCount: 45,
    avgSentenceLength: 27.8,
    qualityScore: "1.00"
  }
}

Permissions Required

activeTab: Access current webpage content
scripting: Inject content extraction scripts
storage: Save scraped content history
<all_urls> (host permission): Work on any website
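
To show how activeTab and scripting fit together, the sketch below finds the active tab and injects content.js into it using chrome.tabs.query and chrome.scripting.executeScript. It rests on two assumptions: that content.js ends with an expression evaluating to the scraped record, and that the popup calls something like the hypothetical scrapeActiveTab function. The actual wiring in popup.js may differ.

```javascript
// Sketch only: `chromeApi` defaults to the global `chrome` object available
// in extension pages; passing it in keeps the function easy to test.
async function scrapeActiveTab(chromeApi = globalThis.chrome) {
  // Find the tab the user is currently looking at.
  const [tab] = await chromeApi.tabs.query({
    active: true,
    currentWindow: true,
  });
  if (!tab || tab.id === undefined) {
    throw new Error("No active tab found");
  }

  // The scripting permission allows running content.js inside that tab.
  const results = await chromeApi.scripting.executeScript({
    target: { tabId: tab.id },
    files: ["content.js"],
  });

  // Each injection result carries the script's final value in `result`.
  return results[0]?.result;
}
```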

🎯 Use Cases

Research: Collect clean text data for academic research
Machine Learning: Prepare training datasets from web content
Content Analysis: Analyze text structure and quality metrics
Documentation: Archive important webpage content
Data Science: Gather structured text for analysis projects

📁 File Structure

page-scraper/
├── manifest.json    # Extension configuration
├── popup.html       # Main interface
├── popup.js         # UI logic and data management
├── content.js       # Content extraction
├── background.js    # Service worker
├── icon.png         # Extension icon
└── ex.png           # Demo screenshot

🤝 Contributing

Contributions are welcome! Feel free to:

Report bugs or request features
Submit pull requests with improvements
Share feedback on content extraction accuracy

📄 License

Feel free to use Page Scraper in any way you want.

Repository: TaylorBeck/page-scraper