A powerful and user-friendly web crawler designed to find and download PDF files from websites. Built with Python (Flask) and modern JavaScript, featuring a beautiful UI and real-time progress tracking.
- 🔍 Smart PDF Detection: Automatically finds PDFs in links, embedded content, and various URL patterns
- 📊 Customizable Depth: Crawl websites from 1 to 10 levels deep
- 💾 Auto-Download: Download PDFs immediately as they're found or review them first
- 📁 System Directory Picker: Native folder selection dialog for choosing download location
- 🎯 Real-time Progress: Live updates showing crawl progress and found PDFs
- 🔎 Search & Filter: Filter results by filename or download status
- 🌐 Smart URL Handling: Automatically handles various URL formats (with/without http://, www, etc.)
- ⚡ Concurrent Crawling: Multi-threaded crawling for faster results
- 🎨 Modern UI: Clean, responsive interface with smooth animations
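The "Smart URL Handling" feature boils down to normalizing user input before the crawl starts. A minimal sketch of that idea (the helper name is hypothetical, not taken from this repo's code):

```python
from urllib.parse import urlparse

def normalize_url(raw: str) -> str:
    # Prepend a scheme when the user omits it, so urlparse
    # sees a proper netloc. Hypothetical helper for illustration.
    raw = raw.strip()
    if not raw.startswith(("http://", "https://")):
        raw = "https://" + raw
    return urlparse(raw).geturl()

print(normalize_url("example.com/docs"))  # https://example.com/docs
```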
- Python 3.7+
- pip (Python package manager)
- Modern web browser (Chrome, Firefox, Safari, Edge)
- Clone the repository:

  ```bash
  git clone https://github.com/rimomcosta/crawler.git
  cd crawler
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Run the application:

  ```bash
  python app.py
  ```

- Open your browser and navigate to `http://localhost:8080`
- Enter Website URL: Type or paste the website URL you want to crawl
  - The app automatically adds `https://` if needed
  - Works with or without `www`
- Set Crawl Depth: Choose how deep to crawl (1-10 levels)
  - Level 1: Only the main page
  - Level 2: Main page + directly linked pages
  - Level 3+: Deeper crawling for comprehensive results
- Select Download Directory:
  - Click "Browse" to open the system folder picker
  - Or manually enter the full path
- Enable Auto-Download (Optional):
  - Toggle ON: PDFs download immediately when found
  - Toggle OFF: Review PDFs first, then download selectively
- Start Crawling: Click the "Start Crawling" button
  - Watch real-time progress
  - See PDFs being discovered
  - Stop anytime with the "Stop" button
- Manage Results:
  - Search PDFs by filename
  - Filter by status (All/Downloaded/Pending/Failed)
  - Download individual PDFs
  - View source pages where PDFs were found
- Framework: Flask 2.3.3
- Web Scraping: BeautifulSoup4 + Requests
- Concurrency: ThreadPoolExecutor for parallel crawling
- PDF Detection: Multiple strategies including URL patterns, content-type headers, and embedded content
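The `ThreadPoolExecutor`-based concurrency above follows a standard pattern: submit one fetch task per URL at the current depth, then collect results as they complete. A minimal sketch, with a stub `fetch` standing in for the real request-and-parse step:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def fetch(url: str) -> tuple[str, int]:
    # Stand-in for a real HTTP GET + link extraction;
    # returns (url, number of links found).
    return url, 3

urls = [f"https://example.com/page{i}" for i in range(8)]
results = []

# Crawl one depth level with a bounded pool (5 workers, matching
# the default noted in the Configuration section below).
with ThreadPoolExecutor(max_workers=5) as pool:
    futures = {pool.submit(fetch, u): u for u in urls}
    for fut in as_completed(futures):
        results.append(fut.result())

print(len(results))  # 8
```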
- Pure JavaScript: No framework dependencies
- Responsive Design: Works on desktop and mobile
- Real-time Updates: Polling for live progress
- Modern CSS: Custom properties, flexbox, and grid layouts
- `app.py`: Flask application and API endpoints
- `crawler.py`: Core crawling logic and PDF detection
- `static/js/app.js`: Frontend application logic
- `static/css/style.css`: Modern styling
- `templates/index.html`: Main UI template
- Max Workers: 5 concurrent threads (configurable in `crawler.py`)
- Timeout: 10 seconds per request
- User Agent: Chrome 91.0 (customizable)
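As module-level constants, these settings might look like the following (the identifier names are assumptions; check `crawler.py` for the actual ones):

```python
# Hypothetical constant names -- see crawler.py for the real identifiers.
MAX_WORKERS = 5        # concurrent crawl threads
REQUEST_TIMEOUT = 10   # seconds per request
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/91.0.4472.124 Safari/537.36"
)
```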
- Direct PDF links (`.pdf` extension)
- PDF URLs with patterns (`/pdf/`, `/download/`, `/file/`)
- Embedded PDFs (`<embed>`, `<iframe>`, `<object>`)
- Content-Type header detection
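The URL-pattern and Content-Type strategies can be sketched as below (a simplified illustration, not the repo's actual code; the embedded-tag scan via BeautifulSoup is omitted for brevity):

```python
import requests
from urllib.parse import urlparse

PDF_PATH_HINTS = ("/pdf/", "/download/", "/file/")

def looks_like_pdf(url: str) -> bool:
    # Cheap checks first: file extension, then known URL patterns.
    path = urlparse(url).path.lower()
    return path.endswith(".pdf") or any(h in path for h in PDF_PATH_HINTS)

def is_pdf(url: str, timeout: int = 10) -> bool:
    # Fall back to a HEAD request and inspect the Content-Type header.
    if looks_like_pdf(url):
        return True
    try:
        resp = requests.head(url, timeout=timeout, allow_redirects=True)
        return resp.headers.get("Content-Type", "").startswith("application/pdf")
    except requests.RequestException:
        return False

print(looks_like_pdf("https://example.com/report.pdf"))  # True
```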
- `GET /`: Main application UI
- `POST /api/start-crawl`: Start crawling a website
- `POST /api/stop-crawl`: Stop current crawl
- `GET /api/status`: Get current crawl status
- `GET /api/results`: Get found PDFs
- `POST /api/select-directory`: Open system directory picker
- `POST /api/download-pdf`: Download a specific PDF
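These endpoints can also be driven programmatically. A sketch of a start-then-poll client (the JSON payload and response keys are assumptions; check `app.py` for the exact schema):

```python
import time
import requests

def run_crawl(base: str, url: str, depth: int = 2) -> list:
    # Kick off a crawl, poll status until it finishes, return results.
    # Payload/response keys are guesses -- see app.py for the real schema.
    requests.post(f"{base}/api/start-crawl",
                  json={"url": url, "depth": depth, "auto_download": False})
    while True:
        status = requests.get(f"{base}/api/status").json()
        if not status.get("running"):
            break
        time.sleep(1)
    return requests.get(f"{base}/api/results").json()

# Usage (with the app running locally):
# pdfs = run_crawl("http://localhost:8080", "https://example.com")
```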
If port 8080 is busy, you can change it in `app.py`:

```python
app.run(debug=True, port=8081)  # Change to any available port
```

If downloads fail:

- Ensure you have write permissions to the selected directory
- Check that the directory path is absolute (e.g., `/Users/username/Downloads`)
- Verify auto-download is enabled if you want immediate downloads

If crawling stalls or returns few results:

- Some websites may have rate limiting
- Try reducing the crawl depth
- Check the console for any error messages
Contributions are welcome! Please feel free to submit a Pull Request.
- Fork the repository
- Create your feature branch (`git checkout -b feature/AmazingFeature`)
- Commit your changes (`git commit -m 'Add some AmazingFeature'`)
- Push to the branch (`git push origin feature/AmazingFeature`)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
- Built with Flask and BeautifulSoup4
- UI inspired by modern web design principles
- Icons from Font Awesome
For questions or support, please open an issue on GitHub.
Made with ❤️
