Climate_News_Scraper_ETL

This project scrapes the BBC News website for the latest climate news. The scraper is packaged in an ETL pipeline built on Prefect and Dask, which loads the scraped articles into a SQLite database.

Webscraping

Web scraping is done with the requests and beautifulsoup libraries. Only the latest articles can be scraped, so the script is intended to run on a periodic schedule.
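A minimal sketch of the scraping step, assuming a listing page where each article sits in an `<article>` element with an `<h3>` title, a `<p>` summary, and a `<time datetime="...">` tag. The markup, the selectors, and the function names `parse_articles` and `fetch_articles` are illustrative assumptions, not the project's actual code; the real BBC markup differs and changes over time.

```python
from typing import Dict, List


def parse_articles(html: str) -> List[Dict[str, str]]:
    """Extract title, content and date from every <article> element in the page."""
    # Imported inside the function so parsing can be swapped or tested in isolation.
    from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

    soup = BeautifulSoup(html, "html.parser")
    articles = []
    for node in soup.find_all("article"):  # assumed container element
        title = node.find("h3")
        summary = node.find("p")
        stamp = node.find("time")
        if title is None or summary is None:
            continue  # skip teasers that lack a title or a summary
        articles.append({
            "title": title.get_text(strip=True),
            "content": summary.get_text(strip=True),
            "date": stamp.get("datetime", "") if stamp is not None else "",
        })
    return articles


def fetch_articles(url: str) -> List[Dict[str, str]]:
    """Download a listing page and parse the articles currently shown on it."""
    import requests  # third-party: pip install requests

    response = requests.get(url, timeout=10)
    response.raise_for_status()  # fail loudly on HTTP errors
    return parse_articles(response.text)
```

Because a listing page only shows recent items, calling `fetch_articles` on a schedule is what accumulates history over time.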

ETL

Prefect orchestrates the ETL flow. The flow can easily be monitored from the Prefect Cloud platform.
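The shape of such a flow can be sketched as follows, assuming the Prefect 2.x style `flow` and `task` decorators (this README does not state which Prefect version the project uses). The three stage functions are hard-coded stand-ins, and `build_flow` and the flow name are hypothetical. Dask execution is left out of the sketch; in Prefect 2.x it is typically supplied through the separate prefect-dask package.

```python
from typing import Dict, List


def extract() -> List[Dict[str, str]]:
    """Stand-in for the scraper: returns one hard-coded article."""
    return [{"title": "Example headline", "content": "Example summary.", "date": "2023-01-27"}]


def transform(articles: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Stand-in for sentiment classification: tags every article as Neutral."""
    return [{**article, "sentiment": "Neutral"} for article in articles]


def load(articles: List[Dict[str, str]]) -> int:
    """Stand-in for the database write: only reports how many rows it received."""
    return len(articles)


def build_flow():
    """Wrap the three stages as Prefect tasks inside one flow (assumes Prefect 2.x or later)."""
    from prefect import flow, task  # third-party: pip install prefect

    extract_task, transform_task, load_task = task(extract), task(transform), task(load)

    @flow(name="climate-news-etl")  # hypothetical flow name
    def climate_news_etl() -> int:
        articles = extract_task()
        enriched = transform_task(articles)
        return load_task(enriched)

    return climate_news_etl
```

Calling `build_flow()()` runs the three stages once, in order. Keeping the stages as plain functions means each one can be tested without Prefect installed.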

Sentiment classification

Using the TextBlob library, a sentiment label (negative/neutral/positive) is added to each article.
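One way to derive the three labels is to threshold TextBlob's polarity score, which ranges from -1.0 to 1.0. The sketch below assumes a cut-off of 0.05 around zero; the threshold and the function names `label_from_polarity` and `classify` are illustrative choices, not taken from the project.

```python
def label_from_polarity(polarity: float, threshold: float = 0.05) -> str:
    """Map a polarity score in [-1.0, 1.0] to one of three labels."""
    if polarity > threshold:
        return "Positive"
    if polarity < -threshold:
        return "Negative"
    return "Neutral"  # scores within the threshold band around zero


def classify(text: str) -> str:
    """Score the text with TextBlob and convert the polarity to a label."""
    from textblob import TextBlob  # third-party: pip install textblob

    return label_from_polarity(TextBlob(text).sentiment.polarity)
```

Widening `threshold` sends more borderline articles to "Neutral"; the right value depends on how the labels are used downstream.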

Database

All scraped articles are written to a local SQLite database in the load stage of the ETL flow. Duplicate entries are not allowed. Here is an example of a record inserted into the CLIMATENEWS table:

{ 
  "title" : "Warning climate change impacting on avalanche risk", 
  "content" : "Forecasters said a likely effect in Scotland was avalanches occurring in tighter periods of time.", 
  "date" : "2023-01-27T06:04:00.000000000", 
  "sentiment" : "Negative" 
}
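A minimal sketch of a load step that rejects duplicates, using Python's built-in `sqlite3` module. It assumes that an article is identified by its title and date, enforced with a `UNIQUE` constraint, and that duplicates are skipped with `INSERT OR IGNORE`; the project may define duplicates differently. The column names follow the record above, and `load_articles` is a hypothetical helper.

```python
import sqlite3
from typing import Dict, Iterable

SCHEMA = """
CREATE TABLE IF NOT EXISTS CLIMATENEWS (
    title     TEXT NOT NULL,
    content   TEXT,
    date      TEXT,
    sentiment TEXT,
    UNIQUE (title, date)  -- assumed definition of a duplicate
)
"""


def load_articles(conn: sqlite3.Connection, articles: Iterable[Dict[str, str]]) -> int:
    """Insert articles, skipping rows that already exist; return the number of rows added."""
    conn.execute(SCHEMA)
    before = conn.total_changes
    conn.executemany(
        "INSERT OR IGNORE INTO CLIMATENEWS (title, content, date, sentiment) "
        "VALUES (:title, :content, :date, :sentiment)",
        articles,
    )
    conn.commit()
    return conn.total_changes - before  # ignored duplicates are not counted
```

With this constraint in place, re-running the pipeline on an overlapping set of articles adds only the new ones, so a periodic schedule can be rerun safely.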
