In this project, we'll learn the basics of web scraping in Python by parsing the New York Times bestseller list. We'll use playwright, a browser automation tool, to automate our scraping.
You'll gain an understanding of:
- What web scraping is and why you'd do it
- How to extract elements from HTML using BeautifulSoup
- How to use playwright to automate scraping pages
- How to load scraped data into pandas and analyze it
We'll start with an overview of web scraping: what it is, why you'd want to do it, and how to tell whether you're allowed to scrape a site.
Then we'll explore the NYT bestsellers list and find the elements that we want to extract. We'll use BeautifulSoup to parse the web page and get the items we want.
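As a rough preview of the parsing step, here is a minimal sketch of pulling items out of HTML with BeautifulSoup. The sample markup, the class names (`book`, `title`, `author`), and the `extract_books` helper are all made up for illustration; the real bestseller page has its own structure, which you would inspect in the browser first.

```python
from typing import Dict, List

from bs4 import BeautifulSoup

# Hypothetical markup for illustration only; not the real NYT page structure.
SAMPLE_HTML = """
<ol>
  <li class="book"><h3 class="title">Example Title One</h3><p class="author">Author A</p></li>
  <li class="book"><h3 class="title">Example Title Two</h3><p class="author">Author B</p></li>
</ol>
"""


def extract_books(html: str) -> List[Dict[str, str]]:
    """Return one dict per list item, holding the title and author text."""
    soup = BeautifulSoup(html, "html.parser")
    books = []
    # select() takes a CSS selector and returns every matching element.
    for item in soup.select("li.book"):
        books.append(
            {
                "title": item.select_one("h3.title").get_text(strip=True),
                "author": item.select_one("p.author").get_text(strip=True),
            }
        )
    return books


books = extract_books(SAMPLE_HTML)
```

The same pattern should carry over to a real page once you swap in selectors that match its actual elements.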
Next, we'll use playwright to automate getting data from the bestseller list. We'll also use it to click elements on the page and navigate through the bestseller lists from multiple weeks.
Finally, we'll explore the data a bit in Python and talk about how you can use data after scraping it.
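For a feel of that last step, here is a small sketch of loading scraped rows into pandas and summarizing them. The rows, column names, and book titles below are invented placeholders, not real bestseller data.

```python
import pandas as pd

# Placeholder rows standing in for scraped results; not real data.
rows = [
    {"week": "week 1", "rank": 1, "title": "Book A", "author": "Author A"},
    {"week": "week 1", "rank": 2, "title": "Book B", "author": "Author B"},
    {"week": "week 2", "rank": 1, "title": "Book B", "author": "Author B"},
    {"week": "week 2", "rank": 2, "title": "Book A", "author": "Author A"},
    {"week": "week 3", "rank": 1, "title": "Book B", "author": "Author B"},
]

df = pd.DataFrame(rows)

# Number of distinct weeks each title appeared on the list, most first.
weeks_on_list = df.groupby("title")["week"].nunique().sort_values(ascending=False)

# Best (lowest) rank each title reached.
best_rank = df.groupby("title")["rank"].min()
```

A list of dicts, one per scraped book, is a convenient shape because `pd.DataFrame` turns the keys into columns directly.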
You can find the code for this project here.
File overview:
- web_scraping.ipynb: a Jupyter notebook where we parse downloaded data
- single_page/1.py: a script to visit a single page with playwright and screenshot it
- single_page/2.py: a script to visit a single page with playwright and download the articles
- multi_page/1.py: a script to visit multiple pages with playwright and download the articles
- playwright_in_jupyter.ipynb: an example of using the playwright async API in Jupyter notebook
To follow this project, please install the following:
- Python 3 (at least 3.7)
- JupyterLab
- pandas
  - Run pip install pandas
- BeautifulSoup
  - Run pip install beautifulsoup4
- playwright
  - Run pip install playwright
- playwright browsers
  - Run playwright install
You won't need to download a specific data set for this project. The page we'll be scraping is here.