This package provides an easy and maintainable way to build a Rentswatch scraper. Rentswatch is a cross-border investigation that collects data on flat rents across Europe. Its scrapers mainly focus on classified ads.
Install using pip:

```shell
pip install rentswatch-scraper
```
Let's take a look at a quick example of using Rentswatch Scraper to build a simple model-backed scraper to collect data from a website.
First, import the package components to build your scraper:
```python
#!/usr/bin/env python
from rentswatch_scraper.scraper import Scraper
from rentswatch_scraper.browser import geocode, convert
from rentswatch_scraper.fields import RegexField, ComputedField
from rentswatch_scraper import reporting
```
To factor out as much code as possible, we created an abstract class that every scraper implements. For the sake of simplicity, we'll use a dummy website as follows:
```python
class DummyScraper(Scraper):
    # These are the basic meta-properties that define the scraper's behavior
    class Meta:
        country = 'FR'
        site = "dummy"
        baseUrl = 'http://dummy.io'
        listUrl = baseUrl + '/rent/city/paris/list.php'
        adBlockSelector = '.ad-page-link'
```
Without any further configuration, this scraper will start to collect ads from the list page of dummy.io. To find links to the ads, it will use the CSS selector `.ad-page-link` to select `<a>` tags and follow their `href` attributes.
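Conceptually, this link-discovery step boils down to selecting the elements matching `adBlockSelector` and collecting their `href` attributes. A minimal stand-alone sketch of that idea, using only the standard library's `html.parser` (this is an illustration, not the package's actual implementation):

```python
from html.parser import HTMLParser

class AdLinkParser(HTMLParser):
    """Collect href attributes of <a> tags carrying a given CSS class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # A class attribute may hold several space-separated class names.
        if tag == "a" and self.css_class in attrs.get("class", "").split():
            self.links.append(attrs.get("href"))

html = '<a class="ad-page-link" href="/rent/ad/1.php">Ad 1</a>'
parser = AdLinkParser("ad-page-link")
parser.feed(html)
print(parser.links)  # ['/rent/ad/1.php']
```

The real package resolves these relative links against `baseUrl` before fetching each ad page.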
We now have to teach the scraper how to extract key figures from the ad page.
```python
class DummyScraper(Scraper):
    # HEADS UP: Meta declarations are hidden here
    # ...

    # Extract data using a CSS selector.
    realtorName = RegexField('.realtor-title')
    # Extract data using a CSS selector and a regex.
    serviceCharge = RegexField('.description-list', r'charges : (.*)\s€')
    # Extract data using a CSS selector and a regex.
    # This will throw a custom exception if the field is missing.
    livingSpace = RegexField('.description-list', r'surface :(\d*)',
                             required=True, exception=reporting.SpaceMissingError)
    # Extract the value directly, without using a regex.
    totalRent = RegexField('.description-price', required=True,
                           exception=reporting.RentMissingError)
    # Store this value as a private property (beginning with an underscore).
    # It won't be saved in the database but it can be helpful, as you'll see.
    _address = RegexField('.description-address')
```
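The extraction logic of a `RegexField` amounts to "take the text of the element matched by the CSS selector, then keep the first capture group of the regex (or the raw text if no regex is given)". A simplified stand-alone illustration of that two-step idea, using plain `re` (not the package's actual code):

```python
import re

def extract(text, pattern=None):
    """Mimic a RegexField on already-selected element text:
    return the first capture group, or the raw text if no pattern."""
    if pattern is None:
        return text.strip()
    match = re.search(pattern, text)
    return match.group(1) if match else None

description = "surface : 42 m2, charges : 150 €"
print(extract(description, r"surface :\s*(\d+)"))  # '42'
print(extract(description, r"charges : (.*)\s€"))  # '150'
```

Using raw strings (`r'...'`) for the regex patterns avoids Python warnings about invalid escape sequences such as `\s` and `\d`.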
Every attribute will be saved as an Ad's property, according to the Ad model.
Some properties may not be extractable directly from the HTML. You may need a custom function that receives the already-extracted properties. For this reason we created a second field type named `ComputedField`. Since the order in which properties are declared is recorded, we can use previously declared (and extracted) values to compute new ones.
```python
class DummyScraper(Scraper):
    # ...

    # Use the existing properties `totalRent` and `livingSpace`, as they
    # were extracted before this one.
    pricePerSqm = ComputedField(fn=lambda s, values: values["totalRent"] / values["livingSpace"])
    # This full example uses private properties to find the latitude and
    # longitude. To do so we use a built-in function named `geocode` that
    # transforms an address into a dictionary of coordinates.
    _latLng = ComputedField(fn=lambda s, values: geocode(values['_address'], 'FRA'))
    # Get the dictionary fields we want.
    latitude = ComputedField(fn=lambda s, values: values['_latLng']['lat'])
    longitude = ComputedField(fn=lambda s, values: values['_latLng']['lng'])
```
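The declaration-order guarantee is what makes `pricePerSqm` work: by the time its function runs, `totalRent` and `livingSpace` are already present in `values`. A minimal illustration of that accumulation pattern with a plain list and dict (not the package internals):

```python
# Fields are evaluated in declaration order; each computed field can read
# every value extracted or computed before it.
fields = [
    ("totalRent", lambda values: 900.0),    # pretend this was extracted
    ("livingSpace", lambda values: 45.0),   # pretend this was extracted
    ("pricePerSqm", lambda values: values["totalRent"] / values["livingSpace"]),
]

values = {}
for name, fn in fields:
    values[name] = fn(values)

print(values["pricePerSqm"])  # 20.0
```

Declaring `pricePerSqm` before `totalRent` or `livingSpace` would fail with a `KeyError`, which is why computed fields must come after the fields they depend on.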
All you need to do now is create an instance of your class and run the scraper.

```python
# When your script is executed directly
if __name__ == "__main__":
    dummyScraper = DummyScraper()
    dummyScraper.run()
```
As seen above, any attribute of the Ad model can be declared as a Scraper attribute to specify what to extract.
Name | Type | Description |
---|---|---|
`status` | String | "listed" if the ad needs more scraping, "scraped" if it's done |
`site` | String | Name of the website |
`createdAt` | DateTime | Date the ad was first scraped |
`siteId` | String | The unique ID from the site the ad is scraped from |
`serviceCharge` | Float | Extra costs (mostly heating) |
`baseRent` | Float | Base costs (without heating) |
`totalRent` | Float | Total cost |
`livingSpace` | Float | Surface in square meters |
`pricePerSqm` | Float | Price per square meter |
`furnished` | Bool | True if the flat or house is furnished |
`realtor` | Bool | True if offered by a realtor, False if rented by a private person |
`realtorName` | Unicode | The name of the realtor or person offering the flat |
`latitude` | Float | Latitude |
`longitude` | Float | Longitude |
`balcony` | Bool | True if there is a balcony/terrace |
`yearConstructed` | String | The year the building was built |
`cellar` | Bool | True if the flat comes with a cellar |
`parking` | Bool | True if the flat comes with a parking space or a garage |
`houseNumber` | String | House number in the street |
`street` | String | Street name (incl. "street") |
`zipCode` | String | ZIP code |
`city` | Unicode | City |
`lift` | Bool | True if a lift is present |
`typeOfFlat` | String | Type of flat (no typology) |
`noRooms` | String | Number of rooms |
`floor` | String | Floor the flat is on |
`garden` | Bool | True if there is a garden |
`barrierFree` | Bool | True if the flat is wheelchair accessible |
`country` | String | Country, 2-letter code |
`sourceUrl` | String | URL of the page |
The Scraper class defines many methods that we encourage you to override in order to have full control over your scraper's behavior.
Name | Description |
---|---|
`extract_ad` | Extract the ad list from a page's soup. |
`fail` | Print out an error message. |
`fetch_ad` | Fetch a single ad page from the target website, then create Ad instances by calling `extract_ad`. |
`fetch_series` | Fetch a single list page from the target website, then fetch each ad by calling `fetch_ad`. |
`find_ad_blocks` | Extract ad blocks from a list page. Called within `fetch_series`. |
`get_ad_href` | Extract the `href` attribute from an ad block. Called within `fetch_series`. |
`get_ad_id` | Extract the `siteId` from an ad block. Called within `fetch_series`. |
`get_fields` | Used internally to generate the list of properties to extract from the ad. |
`get_series` | Fetch a list page from the target website. |
`has_issue` | True if we met issues with this ad before. |
`is_scraped` | True if we already scraped this ad before. |
`ok` | Print out a success message. |
`prepare` | Called just before saving the values. |
`run` | Run the scraper. |
`transform_page` | Transform the HTML content of the series page before parsing it. |
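For instance, `transform_page` can be overridden when the raw HTML needs cleanup before parsing. The sketch below is hypothetical (the base class here is a stand-in for `rentswatch_scraper.scraper.Scraper`, and the cleanup rule is an invented example, not from the package):

```python
class Scraper:
    """Stand-in for rentswatch_scraper.scraper.Scraper (illustration only)."""
    def transform_page(self, page):
        return page

class DummyScraper(Scraper):
    def transform_page(self, page):
        # Hypothetical cleanup: some sites hide the listing markup inside
        # HTML comments; strip the markers before the page is parsed.
        return page.replace("<!--", "").replace("-->", "")

page = '<!--<a class="ad-page-link" href="/ad/1">Ad</a>-->'
cleaned = DummyScraper().transform_page(page)
print(cleaned)  # <a class="ad-page-link" href="/ad/1">Ad</a>
```

The same override pattern applies to the other methods in the table, such as `get_ad_href` or `prepare`.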
To create a new database migration, use Yoyo:

```shell
yoyo new ./migrations -m "Your migration's description"
```

And apply it:

```shell
yoyo apply --database mysql://user:password@host/db ./migrations
```
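A Yoyo migration is a plain Python file in the `./migrations` directory that declares a list of `step(apply, rollback)` pairs. A hypothetical example (the table and column names below are illustrative, not part of the package's schema):

```python
# migrations/0001_add_energy_rating.py -- hypothetical migration file
from yoyo import step

steps = [
    step(
        # applied by `yoyo apply`
        "ALTER TABLE ad ADD COLUMN energyRating VARCHAR(8)",
        # run on rollback
        "ALTER TABLE ad DROP COLUMN energyRating",
    )
]
```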