A very simple news crawler with a funny name

felixvonberlin/fundus

 
 

A very simple news crawler in Python. Developed at Humboldt University of Berlin.


Fundus is:

  • A static news crawler. Fundus lets you crawl online news articles with only a few lines of Python code, whether from live websites or from the CC-NEWS dataset.

  • An open-source Python package. Fundus is built on the idea of building something together. We welcome your contribution to help Fundus grow!


Quick Start

To install with pip, run:

pip install fundus

Fundus requires Python 3.8+.
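
If you are unsure whether your interpreter meets this requirement, a quick check like the following can help. This is only a sketch using the standard library; `MIN_PYTHON` and `python_is_supported` are illustrative names and not part of Fundus.

```python
import sys

# Minimum version stated in this README (illustrative constant, not a Fundus name).
MIN_PYTHON = (3, 8)


def python_is_supported(version_info=sys.version_info) -> bool:
    """Return True if the given (major, minor, ...) version is at least MIN_PYTHON."""
    return tuple(version_info[:2]) >= MIN_PYTHON


if __name__ == "__main__":
    if python_is_supported():
        print("This interpreter is new enough for Fundus.")
    else:
        print("Please upgrade to Python 3.8 or newer before installing Fundus.")
```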

Example 1: Crawl a bunch of English-language news articles

Let's use Fundus to crawl 2 articles from publishers based in the US.

from fundus import PublisherCollection, Crawler

# initialize the crawler for news publishers based in the US
crawler = Crawler(PublisherCollection.us)

# crawl 2 articles and print
for article in crawler.crawl(max_articles=2):
    print(article)

That's it!

If you run this code, it should print out something like this:

Fundus-Article:
- Title: "Feinstein's Return Not Enough for Confirmation of Controversial New [...]"
- Text:  "Democrats jammed three of President Joe Biden's controversial court nominees
          through committee votes on Thursday thanks to a last-minute [...]"
- URL:    https://freebeacon.com/politics/feinsteins-return-not-enough-for-confirmation-of-controversial-new-hampshire-judicial-nominee/
- From:   FreeBeacon (2023-05-11 18:41)

Fundus-Article:
- Title: "Northwestern student government freezes College Republicans funding over [...]"
- Text:  "Student government at Northwestern University in Illinois "indefinitely" froze
          the funds of the university's chapter of College Republicans [...]"
- URL:    https://www.foxnews.com/us/northwestern-student-government-freezes-college-republicans-funding-poster-critical-lgbtq-community
- From:   FoxNews (2023-05-09 14:37)

This printout tells you that you successfully crawled two articles!

For each article, the printout details:

  • the "Title" of the article, i.e. its headline
  • the "Text", i.e. the main article body text
  • the "URL" from which it was crawled
  • the news source it is "From"
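
To make the layout of this printout concrete, here is a minimal sketch that reproduces it with a plain dataclass. `ArticleSummary` and its field names are hypothetical stand-ins chosen for illustration; they are not the Fundus article class, and the truncation widths are assumptions rather than the values Fundus uses.

```python
import textwrap
from dataclasses import dataclass
from datetime import datetime


@dataclass
class ArticleSummary:
    """Hypothetical stand-in for a crawled article; NOT the Fundus Article class."""

    title: str
    text: str
    url: str
    publisher: str
    published: datetime

    def __str__(self) -> str:
        # Shorten long fields and mark the cut with " [...]" as in the printout above.
        # The widths here are assumptions made for this sketch.
        title = textwrap.shorten(self.title, width=70, placeholder=" [...]")
        text = textwrap.shorten(self.text, width=140, placeholder=" [...]")
        return (
            "Fundus-Article:\n"
            f'- Title: "{title}"\n'
            f'- Text:  "{text}"\n'
            f"- URL:    {self.url}\n"
            f"- From:   {self.publisher} ({self.published:%Y-%m-%d %H:%M})"
        )


if __name__ == "__main__":
    example = ArticleSummary(
        title="An example headline",
        text="An example body text.",
        url="https://example.com/news/an-example-headline",
        publisher="ExamplePublisher",
        published=datetime(2023, 5, 9, 14, 37),
    )
    print(example)
```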

Example 2: Crawl a specific news source

Maybe you want to crawl a specific news source instead. Let's crawl news articles from The New Yorker only:

from fundus import PublisherCollection, Crawler

# initialize the crawler for The New Yorker
crawler = Crawler(PublisherCollection.us.TheNewYorker)

# crawl 2 articles and print
for article in crawler.crawl(max_articles=2):
    print(article)

Example 3: Crawl articles from CC-NEWS

If you're not familiar with CC-NEWS, check out their paper.

from fundus import PublisherCollection, CCNewsCrawler

# initialize the crawler for news publishers based in the US
crawler = CCNewsCrawler(*PublisherCollection.us)

# crawl 2 articles and print
for article in crawler.crawl(max_articles=2):
    print(article)
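
All three examples drive the crawler through the same `crawl(max_articles=...)` loop, so a small helper can collect the printouts regardless of which crawler you pass in. The sketch below relies only on that one call from the examples above; `SupportsCrawl`, `collect_printouts`, and `StubCrawler` are hypothetical names introduced here, and the stub exists only so the snippet runs without Fundus or network access.

```python
from typing import Any, Iterable, Iterator, List, Protocol


class SupportsCrawl(Protocol):
    """Anything exposing the crawl(max_articles=...) call used in the examples."""

    def crawl(self, max_articles: int) -> Iterable[Any]: ...


def collect_printouts(crawler: SupportsCrawl, max_articles: int = 2) -> List[str]:
    """Run the crawl loop and return each article's printout as a string."""
    return [str(article) for article in crawler.crawl(max_articles=max_articles)]


class StubCrawler:
    """Hypothetical stand-in crawler; replace it with a real Fundus crawler."""

    def crawl(self, max_articles: int) -> Iterator[str]:
        for index in range(max_articles):
            yield f"Fundus-Article: stub #{index}"


if __name__ == "__main__":
    for printout in collect_printouts(StubCrawler(), max_articles=2):
        print(printout)
```

In real use you would pass a crawler built as in the examples, such as `Crawler(PublisherCollection.us)`, in place of the stub.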

Tutorials

We provide quick tutorials to get you started with the library:

  1. Tutorial 1: How to crawl news with Fundus
  2. Tutorial 2: How to crawl articles from CC-NEWS
  3. Tutorial 3: The Article Class
  4. Tutorial 4: How to filter articles
  5. Tutorial 5: How to search for publishers

If you wish to contribute, check out these tutorials:

  1. How to contribute
  2. How to add a publisher

Currently Supported News Sources

You can find the publishers currently supported here.

Also: Adding a new publisher is easy - consider contributing to the project!

Contact

Please email your questions or comments to Max Dallabetta.

Contributing

Thanks for your interest in contributing! There are many ways to get involved; start with our contributor guidelines and then check these open issues for specific tasks.

License

MIT
