scrapy_playwright_example

This repo contains a scraping script that crawls a JavaScript-rendered webpage using the Scrapy framework and the scrapy-playwright package in Python.

Objective of the Project

I created this script to test how well the scrapy-playwright Python package crawls a JavaScript-rendered webpage.

image

To scrape dynamic websites in Python, one of these three options can be used:

I prefer option #1 for low-volume scraping and option #3 for high-volume scraping because these proxy services also re-route your requests and overcome the anti-bot mechanisms that e-commerce websites use. Option #2 also works pretty well, but you need to be familiar with Docker and have it installed on your computer. scrapy-playwright does not need a Docker image to work and acts as a direct plugin to Scrapy, which makes it pretty easy to use.

Usability and Reproducibility

Step 0: To find out whether a website is dynamically rendered, press F12 to open the browser's developer tools, then press Ctrl-Shift-P, type Disable JavaScript, select that command, and reload the page. If the text or numbers you want to scrape disappear, the website is JS-rendered.
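The same check can be approximated from Python by looking for the target text in the raw HTML, which is what a plain Scrapy request receives before any JavaScript runs. The following is only a minimal sketch: the helper name text_in_static_html and the two sample pages are made up for illustration, and on a real site you would pass in the HTML you downloaded yourself (for example with urllib.request.urlopen).

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects visible text and ignores the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def text_in_static_html(html, needle):
    """Return True if `needle` appears in the visible text of the raw HTML."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return needle in " ".join(parser.chunks)


# Hypothetical sample pages: the first ships the price in the HTML itself,
# the second only contains it inside a script that would render it later.
static_page = '<html><body><span class="price">19.99</span></body></html>'
js_page = (
    '<html><body><div id="app"></div>'
    '<script>render({"price": "19.99"})</script></body></html>'
)

print(text_in_static_html(static_page, "19.99"))
print(text_in_static_html(js_page, "19.99"))
```

If the function returns False for text you can see in the browser, the page is most likely rendered by JavaScript and needs a tool such as scrapy-playwright.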

Step 1: scrapy-playwright does not work natively on Windows. It only works on Linux and macOS. If you use Windows, you'll need to use Windows Subsystem for Linux (WSL). Otherwise, the spider will always fail.
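If you are not sure which environment your terminal is running in, a small script can tell native Windows apart from WSL. This is a rough sketch, not part of the repo: classify_platform is a hypothetical helper, and the WSL detection relies on the assumption that WSL kernels include "microsoft" in their release string.

```python
import platform


def classify_platform(system, release):
    """Classify values from platform.system() / platform.release().

    Returns "windows", "wsl", or "unix" (Linux or macOS outside WSL).
    """
    if system == "Windows":
        return "windows"
    # Heuristic (assumption): WSL kernels report "microsoft" in the release string.
    if system == "Linux" and "microsoft" in release.lower():
        return "wsl"
    return "unix"


if __name__ == "__main__":
    kind = classify_platform(platform.system(), platform.release())
    if kind == "windows":
        print("Native Windows detected: switch to a WSL terminal first.")
    else:
        print(f"Detected environment: {kind}")
```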

If you are using Windows, please follow the steps in this video from 4:30 to 14:00 to install WSL, VSCode, and Windows Terminal on your machine. The video is courtesy of YouTube user freakingud. It is not in English (probably Hindi), but you will be able to follow the steps without any problems from the screen recordings. I found this to be one of the most straightforward guides to install WSL despite the fact that I did not understand the language.

After installing WSL, you will need to do two additional steps:

  • Upgrade it from WSL1 to WSL2. To do this, follow the steps in this guide.
  • Install the VSCode extensions shown in the image below. The ones specifically needed for WSL to work are WSL, Pylance, and Python, but the others are pretty useful for other use cases, and I recommend you keep them in your standard toolbox.

image

Note 1: You will need to install these extensions again in the WSL: Ubuntu environment once you connect to the WSL remote container (steps explained below).

Note 2: The name of the distro in the wsl --set-version <distro-name> 2 step is Ubuntu.

Step 2: In VSCode, click the green/purple icon in the bottom-left corner, then click New WSL Window using Distro, and finally select Ubuntu.

image

You should land on a page that looks like this

image

Step 3: Open your terminal and type in git clone https://github.com/omar-elmaria/scrapy_playwright_example.git

Step 4: After the repo is cloned, type cd scrapy_playwright_example in your terminal, then python -m venv venv_scraping to create a virtual environment

Step 5: Activate the virtual environment by typing source venv_scraping/bin/activate

Step 6: Type pip3 install -r requirements.txt to install the dependencies

Step 7: If it is your first time using scrapy-playwright, you will also need to install the headless browsers by typing playwright install in your terminal
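Before moving on, it can help to confirm that Steps 4 to 7 took effect, namely that the virtual environment is active and the packages are importable. The sketch below uses only the standard library. The helper names are hypothetical, and the module names scrapy, scrapy_playwright, and playwright are assumed to be what requirements.txt installs. It does not check whether playwright install has downloaded the browsers.

```python
import importlib.util
import sys


def in_virtualenv():
    """True when the interpreter runs inside a venv (prefix differs from the base install)."""
    return sys.prefix != sys.base_prefix


def missing_modules(names):
    """Return the top-level module names that cannot be found on the import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    if not in_virtualenv():
        print("No virtual environment is active: run source venv_scraping/bin/activate")
    # Assumed module names for the dependencies in requirements.txt.
    missing = missing_modules(["scrapy", "scrapy_playwright", "playwright"])
    if missing:
        print("Not installed yet:", ", ".join(missing))
    else:
        print("All required packages are importable.")
```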

Step 8: Before running the crawler, please enter the following lines in your settings.py file

# Playwright
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}

TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

This comes directly from the scrapy-playwright official documentation. I encourage you to go through it to get acquainted with more use cases of the plugin.
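A typo in either setting is easy to miss, so here is a small, hedged sketch of a check that compares a settings dictionary against the two values shown above. The function playwright_setting_problems is hypothetical and not part of Scrapy or scrapy-playwright. It only inspects a plain dict, so you would pass in the values from your own settings.py.

```python
PLAYWRIGHT_HANDLER = "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler"
ASYNCIO_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"


def playwright_setting_problems(settings):
    """Return a list of human-readable problems; an empty list means the settings match."""
    problems = []
    handlers = settings.get("DOWNLOAD_HANDLERS") or {}
    for scheme in ("http", "https"):
        if handlers.get(scheme) != PLAYWRIGHT_HANDLER:
            problems.append(
                f"DOWNLOAD_HANDLERS[{scheme!r}] does not point to the Playwright handler"
            )
    if settings.get("TWISTED_REACTOR") != ASYNCIO_REACTOR:
        problems.append("TWISTED_REACTOR is not the asyncio reactor")
    return problems


# The values from Step 8, written as a plain dictionary.
example_settings = {
    "DOWNLOAD_HANDLERS": {"http": PLAYWRIGHT_HANDLER, "https": PLAYWRIGHT_HANDLER},
    "TWISTED_REACTOR": ASYNCIO_REACTOR,
}

for problem in playwright_setting_problems(example_settings):
    print(problem)
```

The same dictionary shape is what CrawlerProcess accepts in Step 10, so one check covers both ways of launching the spider.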

Step 9: To run the crawler, type cd scrapy_playwright_example/site_crawler in your terminal, then run scrapy crawl spanish_site_crawler. This launches the spider and crawls the product name, discount tag, and price of the product. The argument passed to scrapy crawl must match the name variable of the SiteCrawlerSpider class, which you can change to rename the spider. If Scrapy reports that it cannot find the spider, scrapy list prints the spider names registered in the project.

The end result should look like this...

image

Step 10 (Optional): If you want to launch the spider by running the script itself with the play button in the top-right corner instead of through the terminal, add the import from scrapy.crawler import CrawlerProcess at the start of the script. Then insert the following lines at the end of the script, unindented and outside the class code block:

# Same settings as the ones you put in settings.py
process = CrawlerProcess(settings={
    "DOWNLOAD_HANDLERS": {
        "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    },
    "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
})
process.crawl(SiteCrawlerSpider)  # Name of the spider class
process.start()  # Blocks until the crawl finishes

Extra Resources

Here are two nice YouTube videos that walk you through how to install and use the package: