DigiMoviez Movie Scraper
A robust Rust-based web scraper designed to collect movie information and download links from DigiMoviez.com, storing the data in MongoDB.
The scraper systematically collects movie metadata, including titles, ratings, cast information, and download links. It handles request errors, rate-limits its requests, and tracks progress so that an interrupted run can resume where it stopped.
- Scrapes comprehensive movie metadata (title, IMDB rating, duration, genres, etc.)
- Collects download links with quality and size information
- Stores data in MongoDB with upsert functionality
- Rust
- MongoDB
- Environment variables configuration (.env)
MONGO_URI=your_mongodb_connection_string (example: mongodb://localhost:27017)
DB_NAME=your_database_name (example: "digimoviez")
DM_COOKIE_NAME=your_cookie_name (the cookie name from digimoviez.com)
DM_COOKIE_VALUE=your_cookie_value (the cookie value from digimoviez.com)
DM_COOKIE_EXPIRES=your_cookie_expiration (the cookie expiration from digimoviez.com, example: "2025-02-19T06:11:29.470Z")
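A minimal sketch of how these five variables could be read at startup, using only the standard library. The `Config` struct and its `from_lookup` and `from_env` functions are hypothetical names chosen for illustration, not the scraper's actual code. The sketch assumes the `.env` file has already been loaded into the process environment (the project lists `dotenv` for that purpose).

```rust
use std::env;

/// Hypothetical container for the settings listed above.
#[derive(Debug, PartialEq)]
pub struct Config {
    pub mongo_uri: String,
    pub db_name: String,
    pub cookie_name: String,
    pub cookie_value: String,
    pub cookie_expires: String,
}

impl Config {
    /// Builds a Config from any key lookup.
    /// On failure, the error is the name of the first missing or empty variable.
    pub fn from_lookup<F>(lookup: F) -> Result<Config, String>
    where
        F: Fn(&str) -> Option<String>,
    {
        let get = |key: &str| {
            lookup(key)
                .filter(|value| !value.is_empty())
                .ok_or_else(|| key.to_string())
        };
        Ok(Config {
            mongo_uri: get("MONGO_URI")?,
            db_name: get("DB_NAME")?,
            cookie_name: get("DM_COOKIE_NAME")?,
            cookie_value: get("DM_COOKIE_VALUE")?,
            cookie_expires: get("DM_COOKIE_EXPIRES")?,
        })
    }

    /// Reads the same keys from the process environment.
    pub fn from_env() -> Result<Config, String> {
        Config::from_lookup(|key| env::var(key).ok())
    }
}
```

Taking the lookup as a closure keeps the validation testable without touching real environment variables; failing early on a missing key avoids starting a scrape with incomplete credentials.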
struct Movie {
    title: String,
    imdb_id: String,
    imdb_rating: String,
    duration: String,
    genres: Vec<String>,
    director: String,
    stars: Vec<String>,
    country: String,
    description: String,
    metacritic_score: String,
    awards: String,
    image_url: String,
    has_subtitle: bool,
    trailer_link: String,
    page_number: u32,
    content_type: String,
    slug: Option<String>,
    source: String,
}

struct DownloadLinks {
    imdb_id: String,
    slug: String,
    last_updated: DateTime,
    sections: Vec<DownloadSection>,
    source: String,
}
To run the scraper, you need valid authentication cookies from DigiMoviez.com. Follow these steps:
- Log in to DigiMoviez.com with your account
- Open browser Developer Tools:
- Chrome/Edge: Press F12 or Ctrl+Shift+I
- Firefox: Press F12 or Ctrl+Shift+I
- Safari: Enable developer menu in Preferences → Advanced
- Navigate to:
- Chrome/Edge: Application → Cookies
- Firefox: Storage → Cookies
- Safari: Storage → Cookies
- Find the "wordpress_logged_in" cookie
- Extract the following information:
- Cookie Name example: wordpress_logged_in_d13b2bvd21d06301434df5f427acb040
- Cookie Value example: your-user-name-on-digi-i-think%7C1739974278%7C182Q7p5IpD7eQ8gDwqNEYdAk21wsXtPwLJcxlUb656v%7C0263e859b2eefcf214d19ce002445da249116a01b792dbc06bfa4cbd6e0325d8
- Cookie Expiration example: Thu, 20 Feb 2025 02:11:18 GMT
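The example value above appears to consist of four fields separated by `%7C` (a URL-encoded `|`): a username, a Unix timestamp, and two opaque tokens. As a rough sketch under that assumption, the following shows how a copied value could be sanity-checked before a run and how the name/value pair maps onto an HTTP `Cookie` header. `LoggedInCookie`, `parse_cookie_value`, and `cookie_header` are hypothetical helpers; the scraper itself may simply pass the value through unchanged.

```rust
/// First two fields of the cookie value (hypothetical struct for illustration).
#[derive(Debug, PartialEq)]
pub struct LoggedInCookie {
    pub username: String,
    pub expires_unix: u64,
}

/// Splits the URL-encoded value on "%7C" and reads the username and timestamp.
/// Returns None if the value does not have four fields or the timestamp is not numeric.
pub fn parse_cookie_value(value: &str) -> Option<LoggedInCookie> {
    let mut parts = value.split("%7C");
    let username = parts.next()?.to_string();
    let expires_unix = parts.next()?.parse::<u64>().ok()?;
    // The two trailing opaque fields must be present, but are not inspected.
    parts.next()?;
    parts.next()?;
    if username.is_empty() {
        return None;
    }
    Some(LoggedInCookie { username, expires_unix })
}

/// Formats the pair as it would appear in an HTTP `Cookie` request header value.
pub fn cookie_header(name: &str, value: &str) -> String {
    format!("{}={}", name, value)
}
```

A check like this catches a truncated copy-paste from the Developer Tools panel before any request is sent.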
- Clone the repository:
git clone [repository-url]
- Install dependencies:
cargo build
- Set up environment variables in a .env file
- Run the scraper:
cargo run
- Progress Tracking: The scraper starts from the last scraped page (defaults to 889 if no progress is found)
- Movie Collection:
- Fetches movie metadata from each page
- Extracts download links for each movie
- Data Storage:
- Stores movie data in the movies collection
- Stores download links in the download_links collection, which you can query by "imdb_id" or "slug"
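To illustrate the upsert behaviour and the lookups by `imdb_id` or `slug`, here is an in-memory stand-in built on a `HashMap`. It is not MongoDB driver code: `LinkStore`, `LinkRecord`, and their methods are hypothetical, and the choice of `imdb_id` as the upsert key is an assumption made for this sketch.

```rust
use std::collections::HashMap;

/// Reduced, hypothetical version of a download-links document.
#[derive(Debug, Clone, PartialEq)]
pub struct LinkRecord {
    pub imdb_id: String,
    pub slug: String,
    pub source: String,
}

/// In-memory stand-in for the download_links collection.
#[derive(Default)]
pub struct LinkStore {
    by_imdb_id: HashMap<String, LinkRecord>,
}

impl LinkStore {
    /// Inserts the record, or replaces the existing one with the same imdb_id.
    /// Returns true when an existing record was replaced.
    pub fn upsert(&mut self, record: LinkRecord) -> bool {
        self.by_imdb_id
            .insert(record.imdb_id.clone(), record)
            .is_some()
    }

    /// Looks a record up by its key.
    pub fn find_by_imdb_id(&self, imdb_id: &str) -> Option<&LinkRecord> {
        self.by_imdb_id.get(imdb_id)
    }

    /// Looks a record up by slug with a linear scan over all records.
    pub fn find_by_slug(&self, slug: &str) -> Option<&LinkRecord> {
        self.by_imdb_id.values().find(|record| record.slug == slug)
    }

    /// Number of stored records.
    pub fn len(&self) -> usize {
        self.by_imdb_id.len()
    }
}
```

The point of an upsert is that re-scraping a page updates existing documents instead of creating duplicates, so the record count stays at one per movie however often a page is processed.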
- 1-second delay between successful requests
- 5-second delay after errors
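The two delays above can be sketched as a small helper that picks the pause from the outcome of the last request. The constant and function names are hypothetical; only the 1-second and 5-second values come from this README.

```rust
use std::time::Duration;

/// Pause after a successful request.
pub const SUCCESS_DELAY: Duration = Duration::from_secs(1);
/// Longer pause after a failed request.
pub const ERROR_DELAY: Duration = Duration::from_secs(5);

/// Chooses how long to wait before the next request, based on the last outcome.
pub fn delay_after<T, E>(outcome: &Result<T, E>) -> Duration {
    if outcome.is_ok() {
        SUCCESS_DELAY
    } else {
        ERROR_DELAY
    }
}
```

In an async loop the returned value could be passed to `tokio::time::sleep(...).await`; in blocking code, to `std::thread::sleep`.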
- Stores last scraped page in MongoDB
- Enables resume functionality
- Updates progress after successful page processing
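A minimal in-memory sketch of this resume logic, assuming the stored progress amounts to a single optional page number. The `Progress` type and its methods are hypothetical; in the scraper the value is kept in MongoDB rather than in a struct field. The default of 889 is the one stated earlier in this README.

```rust
/// Page to start from when no progress has been stored.
pub const DEFAULT_START_PAGE: u32 = 889;

/// Hypothetical in-memory stand-in for the stored progress record.
#[derive(Default)]
pub struct Progress {
    last_scraped_page: Option<u32>,
}

impl Progress {
    /// Page at which a run begins: the last scraped page, or the default.
    pub fn start_page(&self) -> u32 {
        self.last_scraped_page.unwrap_or(DEFAULT_START_PAGE)
    }

    /// Records a page only if it was processed successfully,
    /// so a failed page is retried on the next run.
    pub fn record_page<E>(&mut self, page: u32, outcome: &Result<(), E>) {
        if outcome.is_ok() {
            self.last_scraped_page = Some(page);
        }
    }
}
```

Updating the record only after success is what makes the resume safe: a crash mid-page leaves the stored value pointing at the last page that was fully processed.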
- movies: Stores movie metadata
- download_links: Stores download links and quality information
- progress: Tracks scraping progress
- tokio: Async runtime
- reqwest: HTTP client
- mongodb: MongoDB driver
- scraper: HTML parsing
- serde: Serialization/Deserialization
- lazy_static: Static initialization
- dotenv: Environment variable management
- Dependent on site structure stability
- Requires valid cookie credentials
- Sequential page processing
- Single-threaded operation
- Implement parallel processing
- Add proxy support
- Enhance error recovery
- Add data validation
- Implement retry queues
- Add metrics collection
- Implement backup functionality
Created and maintained by "PocketJack (Rez Khaleghi)"
- GitHub: https://github.com/rezkhaleghi
- Email: rezaxkhaleghi@gmail.com