Scrape images from Google Images

Depends on the Requests and BeautifulSoup libraries.

Downloads images (up to 100 per run currently). To use it, download `scraper.py` and run:

```
python3 scraper.py -s <search term>
```
- `-s`: the (s)earch term.
- `-c`: include this flag to (c)ache the search query. Searches are cached as simple pickle files in a `caches` subfolder at the location of `scraper.py`, each with the same filename as the search query and a `.cache` extension. The pickle file is a plain list of URLs.
- `-p`: to download from a pre-existing cache file, pass the (p)ath of the cache file after this flag.
- `-d`: the location to (d)ownload the files to. A `downloads` folder is created at that location, and the image files are stored in a subdirectory named after the search term.
- `-v`: (v)erbose mode, to see the intermediate steps.
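The caching scheme described above can be sketched as follows. This is a minimal illustration, not the script's actual code; the helper names (`cache_path`, `save_cache`, `load_cache`) are assumptions.

```python
import os
import pickle

CACHE_DIR = "caches"  # subfolder next to scraper.py, per the -c flag description

def cache_path(search_term):
    # One cache file per search query, named after it, with a .cache extension
    return os.path.join(CACHE_DIR, search_term + ".cache")

def save_cache(search_term, urls):
    # The cache is just a pickled list of URLs
    os.makedirs(CACHE_DIR, exist_ok=True)
    with open(cache_path(search_term), "wb") as f:
        pickle.dump(urls, f)

def load_cache(path):
    # Used for -p: load a pre-existing cache file from an explicit path
    with open(path, "rb") as f:
        return pickle.load(f)
```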
The implementation lives in a single class. The `downloader` class is initialized with:

- `search_term`: the term to search for
- `verbose_mode`: whether verbose mode is enabled
The methods provided are:

- `get_urls`: takes the cache status as a parameter and collects the image URLs into `downloadurls`.
- `printprogress`: prints the progress of the download; takes the current number of the file being downloaded as a parameter.
- `download`: downloads the images into the location specified by the `download_location` parameter.
- `load_from_cache`: loads the download URLs from the cache file into `downloadurls`.
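A minimal skeleton of that interface might look like this. The method bodies are placeholders; only the names and parameters come from the description above, and the progress message format is invented for illustration.

```python
import pickle

class downloader:
    """Scrapes and downloads Google Images results for one search term."""

    def __init__(self, search_term, verbose_mode=False):
        self.search_term = search_term
        self.verbose_mode = verbose_mode
        self.downloadurls = []  # filled by get_urls or load_from_cache

    def get_urls(self, use_cache):
        # Fetch the results page and collect image URLs into self.downloadurls;
        # when use_cache is set, also write them to the cache file.
        raise NotImplementedError

    def printprogress(self, currentnumber):
        # Report which file (by index) is currently being downloaded.
        if self.verbose_mode:
            print("Downloading image %d of %d"
                  % (currentnumber, len(self.downloadurls)))

    def download(self, download_location):
        # Save each URL in self.downloadurls under
        # <download_location>/downloads/<search term>/
        raise NotImplementedError

    def load_from_cache(self, path):
        # Restore self.downloadurls from a pickled list of URLs.
        with open(path, "rb") as f:
            self.downloadurls = pickle.load(f)
```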
Planned improvements:

- Automatically check for a matching cache file first, and hit Google Images only on a cache miss
- Parallelize the downloads across threads
- Increase the number of images downloaded per run (currently 100)
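The threaded-download item above could be sketched with `concurrent.futures`; since downloads are I/O-bound, a thread pool is a natural fit. The `fetch` function and URL list here are stand-ins, not part of the script.

```python
from concurrent.futures import ThreadPoolExecutor

def fetch(url):
    # Placeholder for the real per-image download; here it just echoes the URL.
    return url

urls = ["http://example.com/%d.jpg" % i for i in range(10)]  # stand-in URL list

# map() runs fetch across worker threads but preserves input order in results.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(fetch, urls))
```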