Scraper in Python for forkked
- Clone the repository
- On the terminal, create a virtual environment by typing
$virtualenv -p python3 .
This project was conceived using Python 3.7.
- To install the requirements, type on the terminal
$. bin/activate
$pip install -r requirements.txt
- The script uses the excellent object-relational mapper (ORM) peewee, which you probably don't have installed. To get it, type
$pip install peewee
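For orientation, a peewee model for this project might look something like the sketch below. The database name matches the `albums.db` file created later, but the field names are illustrative guesses; `models.py` defines the actual schema.

```python
from peewee import SqliteDatabase, Model, CharField, DateField, FloatField

db = SqliteDatabase("albums.db")

class Album(Model):
    # illustrative fields only -- models.py holds the real schema
    review_url = CharField(unique=True)
    pub_date = DateField()
    score = FloatField()
    artist = CharField()
    album = CharField()

    class Meta:
        database = db

if __name__ == "__main__":
    # presumably what `python3 models.py` does: create albums.db with the tables
    db.connect()
    db.create_tables([Album])
    db.close()
```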
- It also uses the requests-html library for the heavy lifting (parsing the HTML pages). To install it, type
$pip install requests-html
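A rough illustration of the requests-html workflow (the URL pattern and CSS selector here are assumptions for the example, not necessarily what `forkkit.py` uses):

```python
from requests_html import HTMLSession

session = HTMLSession()
r = session.get("https://pitchfork.com/reviews/albums/?page=1")
# the selector below is hypothetical -- inspect the page for the real class name
for link in r.html.find("a.review__link"):
    print(link.attrs.get("href"))
```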
- To fetch the artworks' URLs, I had to use BeautifulSoup, because those URLs sit in the `src` attribute of an `img` tag nested inside a `div`. Since `src` is an attribute and not a proper HTML tag, the requests-html method does not really work for fetching a `src` URL under an `img` tag. To install it, type
$pip install beautifulsoup4
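A minimal sketch of that BeautifulSoup step, using a made-up snippet of HTML and an invented class name:

```python
from bs4 import BeautifulSoup

# toy markup standing in for a review page; the class name is invented
html = '<div class="album-art"><img src="https://example.com/cover.jpg"></div>'
soup = BeautifulSoup(html, "html.parser")
img = soup.find("div", class_="album-art").find("img")
print(img["src"])  # https://example.com/cover.jpg
```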
- To make dates easier for the SQL database to handle, they are parsed and formatted into the YYYY-MM-DD format instead of 'January 1 2020'. For that, the htmldate library was used. It can be installed, upgraded, or pulled straight from its repository with:
$pip install htmldate
$pip install --upgrade htmldate
$pip install git+https://github.com/adbar/htmldate.git
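htmldate's entry point is `find_date`, which accepts a URL or raw HTML and returns the publication date as a YYYY-MM-DD string. A minimal example with a toy page:

```python
from htmldate import find_date

# toy page carrying its publication date in a meta tag
html = ('<html><head><meta property="article:published_time" '
        'content="2020-01-01"/></head><body><p>review</p></body></html>')
print(find_date(html))  # 2020-01-01
```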
- To create the database file with the preset tables, type
$python3 models.py
- Voilà! You should now have an `albums.db` file in your folder.
- The script parses all of Pitchfork's album reviews. Yes, that's right. There are album reviews dating back to 1999... and they will be parsed too. As of today (May 2020) there are 1,876 published review pages, amounting to 20,141 unique album reviews.
- As you can probably guess, I ain't got no time to browse each one of them manually.
- The scraper therefore parses every single album review published on Pitchfork, then collects and inserts the following data into the database (see the insertion sketch after this list):
  - database id
  - Pitchfork's album review URL
  - publication date
  - album score
  - album year
  - record label
  - genre
  - artwork URL
  - review title
  - artist
  - album
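Assuming the illustrative `Album` model sketched earlier, inserting one scraped review would be a single peewee call, roughly:

```python
import datetime

# assumes the Album model and db from the earlier sketch; all values are made up
Album.create(
    review_url="https://pitchfork.com/reviews/albums/example-album/",
    pub_date=datetime.date(2020, 5, 1),
    score=8.2,
    artist="Example Artist",
    album="Example Album",
)
```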
- To run the scraper, type on the terminal
$python3 forkkit.py
and wait - gathering all this data may take a while!
- In the `forkkit.py` file, you can change a couple of variables:
- MAX_WORKERS = can be increased to speed up the script (see the concurrency sketch below).
- RANGE = the page range scraped per run. What worked best for me was 1-501, then 501-1001, 1001-1501, and so on: batches of 500 pages per run, for a smooth run and data check on my computer.
- RECURSION_DEPTH = should be kept at 1 to avoid duplicates.
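For illustration, variables like these typically drive a thread pool over a page range, along these lines (a sketch, not the actual implementation in `forkkit.py`):

```python
from concurrent.futures import ThreadPoolExecutor

MAX_WORKERS = 8        # more workers means faster (and heavier) scraping
RANGE = range(1, 501)  # one 500-page batch, as suggested above

def scrape_page(page_number):
    # placeholder for fetching and parsing one page of reviews
    print(f"scraping page {page_number}")

with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
    pool.map(scrape_page, RANGE)
```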
Special thanks to @nabaskes