Automatic News Summariser

If you find this repository useful for solving your problem, please leave a Star! :)

As always, feel free to open a pull request or an issue if you think something is wrong and/or can be improved.

Introduction

The goal of this project is to help me stay up to date with the latest news about topics I care about, while saving some precious time. I also think that in this way I will be able to drastically reduce the number of bookmarks in my browser, so it is a win-win situation.

You can choose between two types of summaries:

  • Extractive summaries: only the most meaningful phrases in a given article are returned to the user, without any modification of the original text. This is comparable to highlighting the main sentences in a text.
  • Abstractive summaries: using more complex models such as BART and T5, it is possible to generate a summary that is based on the original text but rephrases the various sentences. This is comparable to retelling a friend what you have just read, in your own words.
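For illustration, here is how an abstractive summary can be generated with the Hugging Face transformers library; the checkpoint below is a public BART model chosen for the example, not necessarily the one this project ships with:

```python
from transformers import pipeline

# Load a public BART checkpoint fine-tuned for summarisation
summariser = pipeline("summarization", model="facebook/bart-large-cnn")

article = "..."  # full text of the article to summarise
result = summariser(article, max_length=300, min_length=30)
print(result[0]["summary_text"])  # the rephrased, abstractive summary
```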

High-level pipeline:

To achieve this goal, the project is structured as follows:

  1. The most recent articles are obtained by subscribing to the RSS feeds of the websites of interest
  2. Each article is scraped in order to obtain the full text. This is necessary because in most cases an RSS feed contains only the first couple of sentences, redirecting the user to the original website for the rest of the article
  3. An extractive summary is created for each article
  4. Summaries are stored in the DB
  5. (Optional): Summaries are sent to a Telegram Bot of your choice, so that you can read the news using the app
  6. (Optional): Steps 1-5 are repeated every X minutes
  7. (Optional): You can summarise a text of your choice through an HTML page
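Put together, a minimal sketch of one pipeline pass could look like the following. The library choices (feedparser, BeautifulSoup) and the summarise, store_summary and send_to_telegram helpers are illustrative placeholders, not the project's actual code:

```python
import time

import feedparser
import requests
from bs4 import BeautifulSoup

def run_once(websites, settings):
    """Steps 1-5 for every configured website."""
    for site in websites.values():
        feed = feedparser.parse(site["rss"])                      # 1. read the RSS feed
        for entry in feed.entries:
            html = requests.get(entry.link).text                  # 2. scrape the full article
            node = BeautifulSoup(html, "html.parser").find(class_=site["main_class"])
            full_text = node.get_text() if node else entry.summary
            summary = summarise(full_text, settings)              # 3. summarise (placeholder)
            store_summary(entry.link, summary)                    # 4. store in the DB (placeholder)
            if settings["send_summaries_via_telegram"]:
                send_to_telegram(summary, settings)               # 5. optional Telegram delivery (placeholder)

# 6. optional always-on mode
while settings["always_on_execution_mode"]:
    run_once(websites, settings)
    time.sleep(settings["scheduling_minutes"] * 60)
```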

How to run the news summariser

The preferred way to run this project is via Docker: it keeps your environment clean and it is safe to assume that everything will work. Assuming you have Docker installed on your machine, you will need to execute the following instructions:

1. (Optional) Create a Telegram Bot and get your chat ID

If you want to receive the summaries on Telegram, you will need to create a bot. To do this:

  1. Open the Telegram app and look for @BotFather
  2. Type /start to begin the conversation with this bot
  3. Type /newbot to create your own bot
  4. Choose an appropriate name and type it
  5. Done! You will now get a message with the token for accessing your bot via the HTTP API. Keep this token safe; you will need it later on
  6. Now access your bot by searching for the name specified at step 4 and type /start
  7. To retrieve your chat ID, please follow the first answer in this StackOverflow thread
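Once you have both values, a quick way to check them is to send yourself a test message through the Bot API (the token and chat ID below are placeholders):

```python
import requests

TOKEN = "123456:ABC-your-bot-token"  # placeholder: the token from @BotFather
CHAT_ID = "987654321"                # placeholder: your chat ID

# Telegram Bot API call: if the reply contains "ok": true, both values are correct
resp = requests.post(
    f"https://api.telegram.org/bot{TOKEN}/sendMessage",
    data={"chat_id": CHAT_ID, "text": "News summariser test message"},
)
print(resp.json())
```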

2. Review the various configurations

In the src/config directory you will find two files: settings.json, which holds the project settings, and websites.json, where all the RSS feeds are specified.

2.1 Settings.json

In settings.json the following parameters are specified:

  • log_fn: where application logs are stored
  • db_path: path to the TinyDB instance where already summarised articles are stored
  • db_telegram_path: path to the TinyDB instance where already sent summaries are stored
  • summaries_dir: folder where summaries are stored
  • summaries_fn: complete filename used for storing summaries
  • min_words_in_sentence: the minimum number of words a sentence must have in order to be included in the summary
  • reduction_factor: how much the original article should be reduced. If set to 3, the summary of an article with N sentences will contain N/3 sentences. This does NOT apply to BART and T5, whose output has a maximum length of 300 tokens.
  • algorithm: which algorithm to use for summarising articles. At the moment you can choose between:
    • pagerank: sentences in a given article, after being vectorised using a Word Embedding model, are compared to each other in terms of their cosine similarity. Once the similarity matrix is built, PageRank is used for finding the most diverse sentences
    • tf_idf: in this case a tf-idf matrix is built for each article. Sentences with the highest tf-idf average value are included in the summary
    • bart: summaries are created by reformulating the given article using BART. If you choose this option, please keep in mind that it is resource hungry and you may need to increase the Docker daemon resources to at least 4 GB of RAM
    • t5: the procedure works as described in the previous point, but in this case the T5 model is used for creating abstractive summaries. T5-based summaries, like BART ones, are computationally intensive.
  • distance_metric: for PageRank summaries it is possible to evaluate sentence similarity using Word Mover's Distance (wmd) or cosine (cosine) distance
  • send_summaries_via_telegram: whether or not to send the summaries via Telegram, expressed as a boolean
  • telegram_chat_id: the chat ID of your chat with the bot
  • telegram_token: the token associated with your bot
  • always_on_execution_mode: whether or not to execute the entire project every X minutes, expressed as a boolean
  • scheduling_minutes: how frequently (in minutes) the entire project is run
  • empty_strategy: if set to fill, words not in the given Word Embedding model will be replaced by a vector of zeros; otherwise they will be skipped (PageRank only)
  • activate_endpoint: boolean flag that activates a Rest API for summarising text. More info in Section 4.
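To put the parameters in context, a settings.json could look like this; the values are illustrative examples, not the defaults shipped with the project:

```json
{
  "log_fn": "logs/app.log",
  "db_path": "db/articles.json",
  "db_telegram_path": "db/telegram.json",
  "summaries_dir": "summaries",
  "summaries_fn": "summaries/summaries.txt",
  "min_words_in_sentence": 5,
  "reduction_factor": 3,
  "algorithm": "pagerank",
  "distance_metric": "cosine",
  "send_summaries_via_telegram": true,
  "telegram_chat_id": "987654321",
  "telegram_token": "123456:ABC-your-bot-token",
  "always_on_execution_mode": true,
  "scheduling_minutes": 60,
  "empty_strategy": "fill",
  "activate_endpoint": true
}
```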
2.1.1 Which summarisation algorithm should I choose?

As you probably know, no Machine Learning algorithm is decisively better than all the others in every possible scenario (the No Free Lunch theorem). This holds true here as well; however, based on the literature and on my experience developing and testing this project, you should choose:

  • TF-IDF if you have limited hardware (e.g. a Raspberry Pi) and/or you want to summarise a lot of articles and read them as quickly as possible.
  • PageRank if you plan to run the tool on a common PC and you are willing to wait a bit longer for potentially better summaries. Keep in mind that the quality of the specified Word Embedding is what makes the difference, so choose it appropriately.
    • The distance metric does not change the end results much. The main difference between the two is that Word Mover's Distance takes word positions within a sentence into account, so on paper it returns a better similarity estimate; however, it requires more computational resources than the classic cosine similarity.
  • T5 or BART if your main goal is summaries that are as coherent as possible, without worrying about summarisation time. Please note that the models sometimes behave erratically because of their abstractive nature.
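For reference, here is a minimal sketch of the PageRank-based extractive approach described above, assuming you already have one embedding vector per sentence; the names are illustrative and this is not the project's actual code:

```python
import numpy as np
import networkx as nx

def extractive_summary(sentences, sentence_vectors, reduction_factor=3):
    """Rank sentences with PageRank over a cosine-similarity graph and
    keep the top n/reduction_factor ones, in their original order."""
    n = len(sentences)
    sim = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i != j:
                a, b = sentence_vectors[i], sentence_vectors[j]
                sim[i, j] = (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)
    scores = nx.pagerank(nx.from_numpy_array(sim))        # node -> centrality score
    top = sorted(scores, key=scores.get, reverse=True)[: max(1, n // reduction_factor)]
    return " ".join(sentences[i] for i in sorted(top))    # keep original ordering
```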

2.2 Websites.json

In websites.json you can specify one entry for each website you intend to summarise. Each entry has the following keys:

  • rss: the URL of the given RSS feed
  • main_class: the HTML div class that contains the article(s)
  • number_of_first_paragraphs_to_ignore: in websites like Politico the first paragraphs contain no relevant information (e.g. datetime, author name(s)). If you specify a number n, the first n paragraphs will be ignored.
  • number_of_last_paragraphs_to_ignore: in websites like Wired UK the last paragraphs contain no relevant information (e.g. social media links, related articles). If you specify a number n, the last n paragraphs will be ignored.
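An entry might therefore look like this; the site name, feed URL and class name below are made up for the example:

```json
{
  "example_news_site": {
    "rss": "https://www.example.com/feed.xml",
    "main_class": "article-body",
    "number_of_first_paragraphs_to_ignore": 1,
    "number_of_last_paragraphs_to_ignore": 2
  }
}
```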

I have uploaded an example configuration with some of the websites I usually read. Feel free to open pull requests to add the websites you care about; I will be more than happy to accept them.

3. Build & launch the project

Docker Compose is the way to go for running and stopping the project. The first step is to build the Docker image: to do that, just launch the docker-compose build command. Once the project is successfully built, the following commands will be useful:

  • docker-compose up -d: starts the project; the -d flag runs the containers in the background (detached mode)
  • docker-compose down: stops the container(s)
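A typical session therefore looks like this (docker-compose logs is standard Docker Compose and is handy for checking that the pipeline is running):

```sh
docker-compose build     # build the image
docker-compose up -d     # start the project in detached mode
docker-compose logs -f   # follow the application logs (Ctrl+C to stop following)
docker-compose down      # stop the container(s)
```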

4. (Optional) Use the Rest API

The News Summariser now also has a Rest API, made with Flask, that lets you quickly summarise a given text. To use this functionality, assuming you have set the activate_endpoint flag to true in the settings file, you need to:

  • Access the backend by going to http://localhost:5000/ with a browser of your choice
  • Paste in the box on the left the text you want to summarise
  • Press Submit

You'll find the summarised text on the right. Currently the page uses the settings specified in the settings.json configuration file, but in the next weeks I'll add the possibility to personalise some of the summarisation parameters directly through the web page.
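If you prefer calling the endpoint from code instead of the browser, something along these lines should work; note that the route and the form field name are assumptions based on the web form, so check the Flask app for the actual ones:

```python
import requests

# NOTE: the "/" route and the "text" field are hypothetical; verify them
# against the Flask application before relying on this snippet.
resp = requests.post(
    "http://localhost:5000/",
    data={"text": "Paste here the article you want to summarise."},
)
print(resp.text)  # the returned page contains the summary
```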

Next steps

In the next weeks I will work on the following points in order to improve the news summariser:

  • Use a proper DB for storing parsed articles: at the moment I am using a JSON file as a temporary solution
  • Pick the most diverse (and meaningful) sentences in an alternative/simpler way (currently PageRank is used)
  • Send output summaries via Telegram and/or email
  • Use other similarity functions
  • Personalize Rest API summaries (e.g. let the user decide the algorithm, reduction factor etc.)
  • Enable summaries of files (e.g. PDF, Word, TXT)
  • Multilang support (at least for one summarisation strategy)
  • Find a way for getting articles' text without having to specify the div class

Sources

The course "Text Mining and Search" of my M.Sc. in Data Science at the University of Milan-Bicocca, along with other courses and my working experience, surely helped me in creating all these software components. I would also like to cite the relevant articles from which I took (large) inspiration while building some of them: