Check, download, and parse local council agendas for relevant housing and planning matters.
Users can easily set up notifications to be alerted by email (or, once implemented, Discord) when new agendas are released.
This enables YIMBY Melbourne and other organisations to keep easy track of relevant Council activities.
Scraper details, including links and current status, can be found in the docs (`docs/councils.md`).
## Write a Scraper! (Instructions)
- Set up and activate the Python environment of your choosing.
- Ensure you have `poetry` installed (e.g. with `pip install poetry`).
- Run `poetry shell` to ensure you've activated the correct virtual env.
- Run `poetry install` to install dependencies.
The preferred code formatter is Black.
`poetry run pytest` will run all the tests, including on any new scrapers added to the `scrapers/` directory. These tests are also run through GitHub Actions on pull requests.
Within your environment, run `python ./aus_council_scrapers/main.py`. Logs will print to your terminal and are also saved into `/logs/`, with key results written to `agendas.db`.
You can run an individual scraper with `python ./aus_council_scrapers/main.py --council council_string`. For instance, `python ./aus_council_scrapers/main.py --council yarra` will run the Yarra Council scraper.

A list of councils and their strings can be found in `docs/councils.md`.
There is optional functionality you can configure to extend the application's utility.
In the `.env.example` file, there is the basic variable `GMAIL_FUNCTIONALITY`. This functionality is turned off by default. If you want to use the email-sending features, you'll need to include your Gmail authentication details in a `.env` file.

This may require setting up an app-specific password, for which you can find setup instructions here. This functionality is optional, and the app should work fine without this setup.
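As a sketch, your `.env` might look something like the following. Only `GMAIL_FUNCTIONALITY` is named above; the other variable names here are illustrative, so check `.env.example` for the actual ones.

```
# Turn on the email-sending features.
GMAIL_FUNCTIONALITY=1

# Illustrative names -- copy the real variables from .env.example.
GMAIL_ACCOUNT=you@gmail.com
GMAIL_APP_PASSWORD=your-app-specific-password
```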
Instructions for setting up Discord can be found in `docs/discord.md`.
Australia has many, many councils! As such, we need many, many scrapers!
You can find a full list of active scrapers at `docs/councils.md`. Additionally, you can find a starting file at `docs/scraper_template.py`.
Scrapers for each council are contained within the `scrapers/[state]/` directory.
A scraper should be able to reliably find the most recent agenda on a Council's website. Once that link is found, it is checked against an existing database—if the link is new, then the agenda is downloaded, scanned, and a notification can be sent.
In addition to the link, the scraper function should return an object of the following shape, outlined in `base.py`:
```python
from dataclasses import dataclass

@dataclass
class ScraperReturn:
    name: str  # The name of the meeting (e.g. City Development Delegated Committee).
    date: str  # The date of the meeting (e.g. 2021-08-01).
    time: str  # The time of the meeting (e.g. 18:00).
    webpage_url: str  # The URL of the webpage where the agenda is found.
    download_url: str  # The URL of the PDF of the agenda.
```
It is not always possible to scrape the date and time of meetings from Council websites. In these cases, these values should be returned as empty strings.
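For example, a populated return might look like this (all values here are illustrative):

```python
ScraperReturn(
    name="City Development Delegated Committee",
    date="2021-08-01",
    time="",  # the meeting time couldn't be scraped, so it's an empty string
    webpage_url="https://example.gov.au/council-meetings",
    download_url="https://example.gov.au/agendas/2021-08-01-agenda.pdf",
)
```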
The `scraper` function is then included within a Scraper class, which extends the `BaseScraper` class.
Thanks to the phenomenal work of @catatonicChimp, a lot of the scraping can now be done by extending the BaseScraper class.
For writing a new scraper, you can refer to and duplicate the template at `docs/scraper_template.py`. The Yarra scraper in `scrapers/vic/yarra.py` is a good, straightforward, functional example.
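As a rough sketch of that structure (the import path, class name, and URLs below are assumptions, so copy the real details from `docs/scraper_template.py`):

```python
from aus_council_scrapers.base import BaseScraper, ScraperReturn  # import path may differ


class ExampleScraper(BaseScraper):  # placeholder class for a hypothetical council
    def scraper(self) -> ScraperReturn:
        # Fetch the agenda page HTML (fetching options are covered below).
        output = self.fetcher.fetch_with_requests("https://example.gov.au/agendas")

        # ...parse `output` for the meeting name, date, time, and PDF link...

        return ScraperReturn(
            name="Council Meeting",
            date="",  # empty string when the date can't be scraped
            time="",
            webpage_url="https://example.gov.au/agendas",
            download_url="https://example.gov.au/latest-agenda.pdf",
        )
```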
For most councils, you will be able to use the `self.fetcher.fetch_with_requests(url)` method to return the agenda page HTML as output.
For more complex JavaScript-heavy pages, you may need to use `self.fetcher.fetch_with_selenium(url)`.
For pages requiring interactivity using a headless browser, you may need to write a Selenium script using the driver returned by `self.fetcher.get_selenium_driver()`, and then utilise the Selenium library to navigate the page effectively.
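Inside your scraper, that pattern might look like this (a sketch assuming the driver is a standard Selenium WebDriver; the URL and link text are illustrative):

```python
from selenium.webdriver.common.by import By

driver = self.fetcher.get_selenium_driver()
driver.get("https://example.gov.au/meetings")  # illustrative URL

# Perform whatever interaction the page needs, e.g. opening an "Agendas" tab.
driver.find_element(By.LINK_TEXT, "Agendas").click()

# Grab the rendered HTML for parsing with BeautifulSoup (below).
output = driver.page_source
```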
Load the HTML into BeautifulSoup like this:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(output, 'html.parser')
```

Then use the BeautifulSoup documentation to navigate the HTML and grab the relevant elements and information.
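For instance, to grab the first PDF link on the page (the selector here is an assumption about a typical council page, not a universal rule):

```python
# Find the first <a> tag whose href ends in ".pdf" -- a common agenda-link pattern.
link = soup.find("a", href=lambda href: href and href.endswith(".pdf"))
if link:
    download_url = link["href"]
    meeting_name = link.get_text(strip=True)
```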
You may also need to use regular expressions (regexes) to parse dates etc.
Luckily, ChatGPT is quite good at both BeautifulSoup and regexes, so you'll likely save a great deal of time by feeding your HTML into ChatGPT, GitHub Copilot, or the shockingly reliable Phind.com and iterating from there.
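As an illustration, here's one way to pull a date out of scraped text and normalise it to the `YYYY-MM-DD` format used by `ScraperReturn` (the input text and format are assumptions; tailor the pattern to your council's page):

```python
import re
from datetime import datetime

text = "Council Meeting - Tuesday 1 August 2023, 6:00pm"  # illustrative scraped text

match = re.search(r"(\d{1,2} \w+ \d{4})", text)
date = ""  # fall back to an empty string if no date is found
if match:
    date = datetime.strptime(match.group(1), "%d %B %Y").strftime("%Y-%m-%d")
```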
Once you have the agenda download link and all other available, scrapeable information, return a `ScraperReturn` object.
To register the Scraper, import the scraper in the relevant folder's `__init__.py` file.
As an example, to add the scraper for the Yarra council, open `council_scrapers/scrapers/vic/__init__.py` and add:

```python
from council_scrapers.scrapers.vic.yarra import YarraScraper
```
Once you have your scraper working locally, run `pytest` in the root directory (`council-meeting-agenda-scraper/`) and add the cached results to the commit when successful. This is done to prevent spamming requests to council pages during the development of scrapers.