Rentswatch Scraper Framework

This package provides an easy and maintenable way to build a Rentswatch scraper. Rentswatch is a cross-borders investigation that collects data on flat rents in Europe. Its scrapers mainly focus on classified ads.

How to install

Install using pip...

pip install rentswatch-scraper

How to use

Let's take a look at a quick example of using Rentswatch Scraper to build a simple model-backed scraper to collect data from a website.

First, import the package components to build your scraper:

#!/usr/bin/env python
from rentswatch_scraper.scraper import Scraper
from rentswatch_scraper.browser import geocode, convert
from rentswatch_scraper.fields import RegexField, ComputedField
from rentswatch_scraper import reporting

To factorize as much code as possible we created an abstract class that every scraper will implement. For the sake of simplicity we'll use a dummy website as follow:

class DummyScraper(Scraper):
    # Those are the basic meta-properties that define the scraper behavior
    class Meta:
        country         = 'FR'
        site            = "dummy"
        baseUrl         = 'http://dummy.io'
        listUrl         = baseUrl + '/rent/city/paris/list.php'
        adBlockSelector = '.ad-page-link'

Without any further configuration, this scraper will start to collect ads from the list page of dummy.io. To find links to the ads, it will use the CSS selector .ad-page-link to get <a> markups and follow their href attributes.

We have now to teach the scraper how to extract key figures from the ad page.

class DummyScraper(Scraper):
    # HEADS UP: Meta declarations are hidden here
    # ...
    # ...

    # Extract data using a CSS Selector.
    realtorName = RegexField('.realtor-title')
    # Extract data using a CSS Selector and a Regex.
    serviceCharge = RegexField('.description-list', 'charges : (.*)\s€')
    # Extract data using a CSS Selector and a Regex.
    # This will throw a custom exception if the field is missing.
    livingSpace = RegexField('.description-list', 'surface :(\d*)', required=True, exception=reporting.SpaceMissingError)
    # Extract the value directly, without using a Regex
    totalRent = RegexField('.description-price', required=True, exception=reporting.RentMissingError)
    # Store this value as a private property (begining with a underscore).
    # It won't be saved in the database but it can be helpful as you we'll see.
    _address = RegexField('.description-address')

Every attribute will be saved as an Ad's property, according to the Ad model.

Some properties may not be extractable from the HTML. You may need to use a custom function that received existing properties. For this reason we created a second field type named ComputedField. Since the properties order of declaration is recorded, we can use previously declared (and extracted) values to compute new ones.

class DummyScraper(Scraper):
    # ...
    # ...

    # Use existing properties `totalRent` and `livingSpace` as they were
    # extracted before this one.
    pricePerSqm = ComputedField(fn=lambda s, values: values["totalRent"] / values["livingSpace"])
    # This full exemple uses private properties to find latitude and longitude.
    # To do so we use a buid-in function named `convert` that transforms an
    # address into a dictionary of coordinates.
    _latLng = ComputedField(fn=lambda s, values: geocode(values['_address'], 'FRA') )
    # Gets a the dictionary field we want.
    latitude = ComputedField(fn=lambda s, values: values['_latLng']['lat'])
    longitude = ComputedField(fn=lambda s, values: values['_latLng']['lng'])

All you need to do now is to create an instance of your class and run the scraper.

# When you script is executed directly
if __name__ == "__main__":
  dummyScraper = DummyScraper()
  dummyScraper.run()

API Doc

`class` Ad

Attributes

As seen above, every Ad attribute might be used as a Scraper attribute to declare which attribute extract.

Name	Type	Description
`status`	String	"listed" if needs more scraping, "scraped" if it's done
`site`	String	Name of the website
`createdAt`	DateTime	Date the ad was first scraped
`siteId`	String	The unique ID from the site where it's scrapped from
`serviceCharge`	Float	Extra costs (heating mostly)
`baseRent`	Float	Base costs (without heating)
`totalRent`	Float	Total cost
`livingSpace`	Float	Surface in square meters
`pricePerSqm`	Float	Price per square meter
`furnished`	Bool	True if the flat or house is furnished
`realtor`	Bool	True if realtor, n if rented by a physical person
`realtorName`	Unicode	The name of the realtor or person offering the flat
`latitude`	Float	Latitude
`longitude`	Float	Longitude
`balcony`	Bool	True if there is a balcony/terrasse
`yearConstructed`	String	The year the building was built
`cellar`	Bool	True if the flat comes with a cellar
`parking`	Bool	True if the flat comes with a parking or a garage
`houseNumber`	String	House Number in the street
`street`	String	Street name (incl. "street")
`zipCode`	String	ZIP code
`city`	Unicode	City
`lift`	Bool	True if a lift is present
`typeOfFlat`	String	Type of flat (no typology)
`noRooms`	String	Number of rooms
`floor`	String	Floor the flat is at
`garden`	Bool	True if there is a garden
`barrierFree`	Bool	True if the flat is wheelchair accessible
`country`	String	Country, 2 letter code
`sourceUrl`	String	URL of the page

`class` Scraper

Methods

The Scraper class defines a lot of method that we encourage you to redefine in order to have the full control of your scraper behavior.

Name	Description
`extract_ad`	Extract ads list from a page's soup.
`fail`	Print out an error message.
`fetch_ad`	Fetch a single ad page from the target website then create Ad instances by calling `èxtract_ad`.
`fetch_series`	Fetch a single list page from the target website then fetch an ad by calling `fetch_ad`.
`find_ad_blocks`	Extract ad block from a page list. Called within `fetch_series`.
`get_ad_href`	Extract a href attribute from an ad block. Called within `fetch_series`.
`get_ad_id`	Extract a siteId from an ad block. Called within `fetch_series`.
`get_fields`	Used internally to generate a list of property to extract from the ad.
`get_series`	Fetch a list page from the target website.
`has_issue`	True if we met issues with this ad before.
`is_scraped`	True if we already scraped this ad before.
`ok`	Print out an success message.
`prepare`	Just before saving the values.
`run`	Run the scrapper.
`transform_page`	Transform HTML content of the series page before parsing it.

Start a migration

Use Yoyo:

yoyo new ./migrations -m "Your migration's description"

And apply it:

yoyo apply --database mysql://user:password@host/db ./migrations

Name		Name	Last commit message	Last commit date
Latest commit History 87 Commits
migrations		migrations
rentswatch_scraper		rentswatch_scraper
.gitignore		.gitignore
LICENSE		LICENSE
MANIFEST.in		MANIFEST.in
Makefile		Makefile
README.rst		README.rst
setup.py		setup.py
yoyo.ini		yoyo.ini

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Rentswatch Scraper Framework

How to install

How to use

API Doc

`class` Ad

Attributes

`class` Scraper

Methods

Start a migration

About

Releases

Packages

Contributors 2

Languages

License

jplusplus/rentswatch-scraper

Folders and files

Latest commit

History

Repository files navigation

Rentswatch Scraper Framework

How to install

How to use

API Doc

class Ad

Attributes

class Scraper

Methods

Start a migration

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

`class` Ad

`class` Scraper

Packages