Skip to content

Cescollino/Reverse-Image-Search-Scraper

Repository files navigation

Reverse-Image-Search-Scraper

This python notebook can search simultaneously across the 3 following reverse image search engines to scrape image data:

Deployment

  1. X virtual framebuffer abbreviated as Xvfb allows you to execute graphical apps without having to use a monitor by connecting some input device. Virtual memory is used to perform graphical operations and it allows the program to run headlessly.

sudo apt install xvfb

  1. git clone https://github.com/Cescollino/Reverse-Image-Search-Scraper.git
  2. cd Reverse-Image-Search-Scraper
  3. Dependencies Conda: The requirements.txt can be used to create a conda virtual environment with: conda create --name <env> --file requirements.txt

Virtualenv: python3 -m venv scraper source scraper/bin/activate pip install -r requirements.txt

Once the environment is made, you may open Jupyter Lab and load the notebook, you may use the following command on a terminal: jupyter lab

Open the Reverse Image Search Scraper.ipynb in jupyter lab. Uncomment the first cell and run it once, then comment it again. If there are no errors, you are good to go.

You should see this output: image

  1. Running

Defining Paths

Go to the " # Hardcoded File Paths & Run " group of cells

File Path

You may change the file_path variable to be the absolute path of the image you are trying to upload (.jpg) ex. (.../Reverse-Image-Search-Scraper/IMAGES/INPUT/1.jpg)

Result Path

You may change result_path to be the absolute path of the directory you are trying to store your images in. ex. (.../Reverse-Image-Search-Scraper/IMAGES/OUTPUT)

Number of Results

You may change the number of results gathered by each engine by changing: "max_count=100" in the last cell of the notebook

Running

Just press "run all cells". It may be that there arent any images left for the scraper to download. In that case simply press the square button on the top of the notebook to kill the kernel.

Output

This program will output image files to the directory with the following nomenclature:

jpg_"Image Number"_EngineID_ "Search Engine ID".jpg

This is an image

Deduplicating Images

For image nearly similar image deduplication using perceptual hashing, please refer to imgdupes

About

This python notebook can search across 3 different reverse image search engines to scrape image data.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors