diff --git a/.gitignore b/.gitignore
new file mode 100644
index 0000000..b632fa5
--- /dev/null
+++ b/.gitignore
@@ -0,0 +1,7 @@
+data/*
+files/*
+!data/test_data.jsonl
+
+.vscode/
+bak/
+test_scripts/
diff --git a/LICENSE b/LICENSE
new file mode 100644
index 0000000..f2f7908
--- /dev/null
+++ b/LICENSE
@@ -0,0 +1,21 @@
+MIT License
+
+Copyright (c) 2020 TU/e and EPFL
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
diff --git a/README.md b/README.md
new file mode 100644
index 0000000..1d81ddb
--- /dev/null
+++ b/README.md
@@ -0,0 +1,63 @@
+# IRIS Virtual Patent Marking Pages Classifier
+Tool to help a human classify a list of potential VPM (Virtual Patent Marking) pages into several categories. Part of the IRIS project.
+
+The classifier is written in Python, using the PyQt5 library.
+
+It creates a GUI browser that shows the detected pages one at a time.
+
+You can interact with the browser with the mouse, and you can also use the numeric keypad to select one of the categories.
+
+Once you have chosen the right category for a page, the software moves on to the next one.
+
+## Set up the classifier
+The recommended steps are:
+1. Install [Git](https://git-scm.com/)
+2. Clone this repository with ``git clone https://gitlab.tue.nl/iris/iris-vpm-pages-classifier.git``
+3. Install [Miniconda](https://docs.conda.io/en/latest/miniconda.html)
+4. Create an environment with
+ * ``conda create -n iris-vpm-pages-classifier python=3.9``
+ * ``conda activate iris-vpm-pages-classifier``
+ * ``pip install -r requirements.txt``
+ * ``pip install git+https://gitlab.tue.nl/iris/iris-utils.git``
+5. If you need to use the pre-classifier, you must also install a headless browser with the following command
+ ``playwright install chromium``
+ Note: the code has been tested with Chromium v857950, but the latest version of the browser will be installed
+
+### GUI classifier on WSL2
+1. Install ``qt5-default`` on the WSL2 distro
+2. Install X410 on Windows (the free alternatives did not work for me) and select ``Allow Public Access`` from its menu
+3. Add the following line to the ``~/.bashrc`` file of the WSL2 distro (before the Conda initialization block)
+``export DISPLAY=$(awk '/nameserver / {print $2; exit}' /etc/resolv.conf 2>/dev/null):0.0``
+However, do not add ``export LIBGL_ALWAYS_INDIRECT=1``, even though many online guides advise it.
+
+## Pre-processing
+Before you start classifying the pages by hand, you must run ``pre-classify.py`` to automatically classify some of them.
+This script will create a file with five main categories: cases that are (a) very likely true positives; (b) very likely false positives; (c) maybe positive; (d) maybe negative; (e) unknown.
+
+The first two cases are classified automatically. For the next two, a hint is provided and the person is asked to decide whether the page is actually a VPM page or not. The last case is left entirely to the person, without any hint.
+
+To use it you need software that is as easy to install on GNU/Linux as it is hard to get working on MS-Windows. The advice, therefore, is to use a GNU/Linux machine (the instructions that follow are for Debian GNU/Linux) or WSL2 (running the GUI classifier from WSL2 is not trivial but possible; follow the instructions above).
+1. Install [Tesseract](https://tesseract-ocr.github.io/) with
+``sudo apt install tesseract-ocr``
+2. Install [Poppler](https://poppler.freedesktop.org/) with
+``sudo apt install poppler-utils``
+
+To run the automatic classifier, please run
+``python pre-classify.py -I data/scraping_results.jsonl data/websites_to_exclude.txt -o data/pre_classified.jsonl``
+
+## Populate the database
+Once the data have been analyzed by the pre-classifier, you must use its output to populate a database that will be used by the classifier. To do so, please run
+``python write-database.py -I data/scraping_results.jsonl data/pre_classified.jsonl -o data/database.json``
+
+If you want to split the data into sub-databases, so that more than one person can have her/his own data to classify, you can run
+``python write-database.py -I data/scraping_results.jsonl data/pre_classified.jsonl -o data/database.json -O N``
+where ``N`` is the number of files that you want to generate.
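+For example, ``-O 3`` splits the data into three sub-databases.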
+
+Note: you cannot overwrite the database once it has been created (you can only update it, unless you use the specific commands of [Flata](https://github.com/harryho/flata) directly). If you want to rebuild it, you must delete the written files and re-run the script.
+
+## Run the classifier
+1. Each time, remember to activate the conda environment created during setup with ``conda activate iris-vpm-pages-classifier``
+2. Run ``python classify.py -i data/database.json``
+
+## Acknowledgements
+The authors thank the EuroTech Universities Alliance for sponsoring this work. Carlo Bottai was supported by the European Union's Marie Skłodowska-Curie programme for the project Insights on the "Real Impact" of Science (H2020 MSCA-COFUND-2016 Action, Grant Agreement No 754462).
diff --git a/classify.py b/classify.py
new file mode 100644
index 0000000..b818e89
--- /dev/null
+++ b/classify.py
@@ -0,0 +1,247 @@
+#!/usr/bin/env python
+
+"""
+Tool to help a human classify the scraped VPM pages
+ into several categories
+
+It creates a GUI browser that shows the detected pages one at a time.
+ You can interact with the browser with the mouse, and you can also use the
+ numeric keypad to select one of the categories. Once you have chosen the
+ right category for a page, the software moves on to the next one.
+
+Author: Carlo Bottai
+Copyright (c) 2020 - TU/e and EPFL
+License: See the LICENSE file.
+Date: 2020-10-16
+
+"""
+
+from PyQt5.QtCore import *
+from PyQt5.QtWidgets import *
+from PyQt5.QtGui import *
+from PyQt5.QtWebEngineWidgets import *
+import qtawesome as qta
+import sys
+import webbrowser
+from flata import Flata, Query, JSONStorage
+import requests
+from iris_utils.parse_args import parse_io
+
+
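+# User agent used for the HEAD requests sent before loading a page,
+# so that the probe looks like a regular browser visit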
+USER_AGENT = ('Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.7) '
+ 'Gecko/2009021910 Firefox/3.0.7')
+
+
+class MainWindow(QMainWindow):
+ def __init__(self, *args, **kwargs):
+ super(MainWindow, self).__init__(*args, **kwargs)
+
+ args = parse_io()
+ self.f_in = args.input
+
+ self.read_data()
+
+ self.view = QWebEngineView()
+ self.view.settings() \
+ .setAttribute(QWebEngineSettings.PluginsEnabled, True)
+ self.setCentralWidget(self.view)
+
+ self.status = QStatusBar()
+ self.setStatusBar(self.status)
+
+ navtb = QToolBar('Navigation')
+ self.addToolBar(navtb)
+
+ back_btn = QAction(qta.icon('fa5s.arrow-left'), 'Back', self)
+ back_btn.triggered.connect(lambda: self.view.back())
+ navtb.addAction(back_btn)
+
+ next_btn = QAction(qta.icon('fa5s.arrow-right'), 'Forward', self)
+ next_btn.triggered.connect(lambda: self.view.forward())
+ navtb.addAction(next_btn)
+
+ navtb.addSeparator()
+
+ self.urlbar = QLineEdit()
+ self.urlbar.returnPressed.connect(self.go_to_url)
+ navtb.addWidget(self.urlbar)
+
+ navtb.addSeparator()
+
+ reload_btn = QAction(qta.icon('fa5s.redo'), 'Reload', self)
+ reload_btn.triggered.connect(lambda: self.view.reload())
+ navtb.addAction(reload_btn)
+
+ stop_btn = QAction(qta.icon('fa5s.stop'), 'Stop', self)
+ stop_btn.triggered.connect(lambda: self.view.stop())
+ navtb.addAction(stop_btn)
+
+ open_btn = QAction(
+ qta.icon('fa5s.external-link-square-alt'), 'Open', self)
+ open_btn.triggered.connect(lambda: \
+ webbrowser.open_new_tab(self.urlbar.text()))
+ navtb.addAction(open_btn)
+
+ labtb = QToolBar('Labeling')
+ self.addToolBar(Qt.RightToolBarArea, labtb)
+
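+        # One action per category; the number in parentheses is also the
+        # keyboard shortcut that applies that label to the current page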
+ for name, idx in [
+ ('VPM page | True patent-product link', 1),
+ ('Brochure or description of the product | True patent-product link', 2),
+ ('Hybrid document | True patent-product link', 3),
+ ('List of patents or metadata of a patent | False patent-product link', 4),
+ ('A scientific publication | False patent-product link', 5),
+ ('News about the patent | False patent-product link', 6),
+ ('CV/resume | False patent-product link', 7),
+ ('Something else in a website to keep | False patent-product link', 8),
+ ('Something else in a website to exclude | False patent-product link', 9),
+ ('The document is unreachable | False patent-product link', 0)]:
+ label = QAction(f'{name} ({idx})', self)
+ label.setShortcut(str(idx))
+ label.triggered.connect(lambda checked, lbl=name: self.label_page(lbl))
+ labtb.addAction(label)
+
+ #labtb.addSeparator()
+
+ urls_len_lbl = f'{self.data_to_classify_len} URLs left to classify'
+ self.status.showMessage(urls_len_lbl)
+
+ self.open_next_page()
+
+ self.show()
+
+ self.setWindowTitle('VPM pages handmade classifier')
+
+ def read_data(self):
+ DB = Flata(self.f_in, storage=JSONStorage)
+ self.database = DB.table('iris_vpm_pages_classifier')
+
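+        # Select the entries that still need a manual label: those with a
+        # 'vpm_page' set but no 'vpm_page_classification' yet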
+ to_classify = \
+ (Query().vpm_page_classification==None) & \
+ (Query().vpm_page!=None)
+ self.data_to_classify = iter(self.database.search(to_classify))
+ self.data_to_classify_len = self.database.count(to_classify)
+
+ def go_to_url(self, url=None):
+ if url is None:
+ url = self.urlbar.text()
+ else:
+ self.urlbar.setText(url)
+ self.urlbar.setCursorPosition(0)
+
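+        # Probe the URL with a HEAD request: if the response does not look
+        # like something the embedded browser can display (HTML, PDF or
+        # plain text), warn the user that the document may be downloaded
+        # instead of shown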
+ try:
+ response = requests.head(
+ url,
+ headers={'User-Agent': USER_AGENT},
+ verify=False,
+ allow_redirects=True,
+ timeout=10)
+ headers = response.headers
+ content_type = headers['Content-Type']
+ if 'Content-Disposition' in headers:
+ content_disposition = headers['Content-Disposition']
+ else:
+ content_disposition = ''
+ if not (content_type.startswith('text/html') or \
+ content_type.startswith('application/pdf') or \
+ content_type.startswith('text/plain')) or \
+ content_disposition.startswith('attachment'):
+ self.msgBox = QMessageBox.about(
+ self,
+ 'Additional information (DOWNLOAD)',
+                    ('The next document may need to be downloaded.\n'
+                     'If you do not see the page change, try opening it '
+                     'in a browser by clicking the appropriate button'))
+ except:
+ pass
+
+ url = QUrl(url)
+
+ if url.scheme() == '':
+ url.setScheme('https')
+
+ self.view.setUrl(url)
+
+ def open_next_page(self):
+ try:
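+            # Take the next entry, skipping any page that has already
+            # been classified in the meantime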
+ self.current_data = next(self.data_to_classify)
+ while self.current_data['vpm_page_classification']:
+ self.current_data = next(self.data_to_classify)
+
+ INFO_MSG = {
+ 'COPYRIGHT':
+ ('The information about the patent(s) has been '
+ 'detected close to the copyright information '
+ 'at the bottom of the document.\n'
+ 'Please, confirm whether or not there is a link '
+ 'between a patent and a product in this document'),
+ 'NOCORPUS':
+ ('No information about any of the patents has been '
+ 'detected in the document.\nPlease, confirm whether '
+ 'or not there is a link between a patent and a '
+ 'product in this document'),
+ 'NOCORPUS+IMG':
+ ('The only information about the patent(s) '
+ 'has been detected in one of the pictures '
+ 'of the document.\nPlease, confirm whether '
+ 'or not there is a link between a patent '
+ 'and a product in this document'),
+ 'NOCORPUS+PATNUMINURL':
+ ('The only information about the patent(s) '
+ 'has been detected in the URL '
+ 'of the document.\nPlease, confirm whether '
+ 'or not there is a link between a patent '
+ 'and a product in this document')}
+ vpm_page_automatic_classification = self.current_data[
+ 'vpm_page_automatic_classification']
+ vpm_page_automatic_classification_info = \
+ vpm_page_automatic_classification \
+ .split(' | ')[1]
+ if vpm_page_automatic_classification_info in INFO_MSG.keys():
+ vpm_page_automatic_classification_msg = INFO_MSG[
+ vpm_page_automatic_classification_info]
+ self.msgBox = QMessageBox.about(
+ self,
+ f'Additional information ({vpm_page_automatic_classification_info})',
+ vpm_page_automatic_classification_msg)
+
+ print('\n+++++++++++++++++++++++++++')
+ print(f"Patent assignee: {self.current_data['patent_assignee']}")
+ try:
+ print(f"Award recipient: {self.current_data['award_recipient']}")
+ except Exception:
+ pass
+ print(f"Patents: {self.current_data['patent_id']}")
+ print('+++++++++++++++++++++++++++\n')
+
+ url = self.current_data['vpm_page']
+ self.go_to_url(url)
+
+ except:
+ print('\n+++++++++++++++++++++++++++')
+ print('No other pages left. Well done!')
+ print('+++++++++++++++++++++++++++\n')
+ self.close()
+
+ def label_page(self, label):
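+        # Store the chosen label for the current page in the database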
+ updated_info = self.database.update(
+ {'vpm_page_classification': label},
+ Query().vpm_page==self.current_data['vpm_page'])
+ updated_ids = updated_info[0]
+
+ # Reduce the number of pages left by one
+ # and show this information in the status bar
+ self.data_to_classify_len -= len(updated_ids)
+ urls_len_lbl = f'{self.data_to_classify_len} URLs left to classify'
+ self.status.showMessage(urls_len_lbl)
+
+ self.open_next_page()
+ self.update()
+
+if __name__ == "__main__":
+ app = QApplication(sys.argv)
+ app.setApplicationName('VPM pages handmade classifier')
+ window = MainWindow()
+ app.exec_()
+
diff --git a/data/test_data.jsonl b/data/test_data.jsonl
new file mode 100644
index 0000000..e69de29
diff --git a/post-classify.py b/post-classify.py
new file mode 100644
index 0000000..3ff25b1
--- /dev/null
+++ b/post-classify.py
@@ -0,0 +1,671 @@
+#!/usr/bin/env python
+
+"""
+Tool to post-process the output of the classification phase.
+
+For each page identified as a "true VPM page" the script checks
+ which patents, among the possible ones for that specific page,
+ are actually present in the page and which are not.
+It returns, for each entry of the database, a JSON line of the type
+ {'vpm_page': 'URL_OF_THE_PAGE',
+ 'is_true_vpm_page': true/false,
+ 'is_patent_in_page': [(PATENT_NUMBER: true/false),
+ (PATENT_NUMBER: true/false)]}
+
+Author: Carlo Bottai
+Copyright (c) 2021 - TU/e and EPFL
+License: See the LICENSE file.
+Date: 2021-05-08
+
+"""
+
+
+## LIBRARIES ##
+
+import numpy as np
+import pandas as pd
+import os
+import pathlib
+from io import BytesIO
+from hashlib import md5
+from nltk.tokenize import sent_tokenize
+import re
+from urllib.parse import urlparse
+from os.path import splitext
+from datetime import datetime
+import json
+
+from flata import Flata, JSONStorage
+
+from bs4 import BeautifulSoup as beautiful_soup
+import html5lib
+
+import pdfminer.high_level as pdfminer
+from pdfminer.pdfparser import PDFParser
+from pdfminer.pdfdocument import PDFDocument
+import pdf2image
+import pytesseract
+
+from striprtf.striprtf import rtf_to_text
+
+import asyncio
+import aiofiles
+
+from aiohttp import ClientSession, BadContentDispositionHeader
+
+from tqdm.asyncio import tqdm as aio_tqdm
+import warnings
+
+from iris_utils.parse_args import parse_io
+
+
+## TYPE HINTS ##
+
+from typing import List, Tuple, Set, TypedDict
+from pathlib import PosixPath
+from flata.database import Table as fa_Table
+class LineDict(TypedDict):
+ db_id: int
+ vpm_page: str
+ patent_ids: int
+
+
+## WARNINGS SUPPRESSION ##
+
+# Suppress PDF text extraction not allowed warning
+# and any other warning from the `pdfminer` module
+warnings.filterwarnings('ignore', module = 'pdfminer')
+
+# Suppress BadContentDispositionHeader warning
+# from the `aiohttp` module
+warnings.simplefilter('ignore', BadContentDispositionHeader)
+
+
+#################
+# SETTINGS #
+#################
+
+# Name of the folder where the local copies of the pages have been saved
+files_folder = 'files'
+
+# User agent
+# Useful for both types of documents (HTML and others) considered in the script
+USER_AGENT = ('Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.7) '
+ 'Gecko/2009021910 Firefox/3.0.7')
+
+# Length of the text snippets in the corpus of each document:
+# number of characters, extracted from the full text of each document,
+# kept before and after the keywords defined later in the script
+CONTEXT_SPAN = 500
+
+# Choose the name of the log file where any errors will be reported
+# The file will have a name like post_classify_%Y_%m_%d_%H_%M.log
+LOG_FILE = 'post_classify'
+
+## ASYNCIO SETTINGS ##
+
+# Run no more than 25 tasks at a time
+NUM_CONCURRENT_TASKS = 25
+SEMAPHORE = asyncio.Semaphore(NUM_CONCURRENT_TASKS)
+
+################
+# REGEX #
+################
+
+# Whitespace and control characters (newlines, form feeds, tabs, etc.) that will be removed
+PUNCT_RE = re.compile(r'[\n\f\r\t\x0A\x0C\x0D\x09\s]+')
+
+# Regular expressions used to convert a URL into a file name
+HTTPWWW_RE = re.compile(r'^(.*:\/\/)?(www\.)?', flags = re.IGNORECASE)
+NOALPHA_RE = re.compile(r'\W')
+
+# Regular expression used to remove the sentences about cookie or privacy policy
+# Useful to remove useless portions of the headers and footers
+COOKIE_RE = re.compile(r'(cookie)|(privacy policy)', flags=re.IGNORECASE)
+
+# Regular expression used to detect patent numbers
+# (7- or 8-digit numbers, with optional thousands separators)
+PATNUM_RE = re.compile(r'\d{1,2},?\d{3},?\d{3}')
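+# Illustrative matches (hypothetical numbers): '7,654,321', '7654321', '10,123,456'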
+
+
+#################
+# FUNCTIONS #
+#################
+
+def read_input(f_in: str) -> fa_Table:
+ """
+ Read the input file
+ """
+
+ DB = Flata(f_in, storage=JSONStorage)
+ database = DB.table('iris_vpm_pages_classifier')
+
+ return database
+
+def generate_file_name(url: str) -> str:
+ """
+ Given the URL provided, return a standardized file name
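+    For instance (hypothetical URL), 'https://www.example.com/patents.pdf'
+     is turned into 'example_com_patents_pdf'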
+ """
+
+ # Remove 'https://', 'ftp://' and similar things, and remove 'www'
+ file_name = HTTPWWW_RE.sub('', url)
+
+ # Replace any non-alphanumeric chars with '_'
+ file_name = NOALPHA_RE.sub('_', file_name)
+
+    # If the generated filename is longer than 250 bytes
+    # (i.e., about the length limit for an ext4 file system),
+    # then use an MD5 hash hexdigest string as the name
+ if len(file_name.encode()) >= 250:
+ file_name = md5(file_name.encode()).hexdigest()
+
+ return file_name
+
+def which_content_type_exists(file_path: str) -> str:
+ """
+ Returns the content type based on which file exists locally
+ Returns None if no file exists for the document of interest
+ """
+ for content_type in ['html', 'txt', 'rtf', 'pdf', 'other']:
+        # NB: PDF must always be the last one, since HTML contents also
+        # have a PDF version (and potentially other types
+        # will do the same in the future)
+ type_path = file_path.replace('.pdf', f'.{content_type}')
+ if os.path.exists(type_path):
+ return content_type.upper()
+ return None
+
+async def get_content_type(url: str, file_path: str, requests_session: ClientSession) -> str:
+ """
+ Determine the type of content returned by a GET request to the URL provided
+ The possible answers are:
+     - HTML, PDF, RTF, TXT (documents handled by the script)
+ - OTHER (documents unhandled by the script)
+ - FAILED (generic error while connecting with the remote source)
+ """
+
+ local_content_type = which_content_type_exists(
+ file_path = file_path)
+ if local_content_type:
+ return local_content_type
+
+    # If the URL names a file that ends in *.pdf (*.txt), it's a PDF (TXT)
+ url_path = urlparse(url).path
+ url_root, url_ext = splitext(url_path.lower())
+ if url_ext.endswith('pdf'):
+ return 'PDF'
+ if url_ext.endswith('txt'):
+ return 'TXT'
+
+ try:
+ # Require the HEAD for the URL
+ response = await requests_session.request(
+ method = 'HEAD',
+ url = url,
+ headers = {'User-Agent': USER_AGENT},
+ allow_redirects = True,
+ ssl = False)
+
+ # assert response.status in [200, 403]
+
+ # Take the content-type from the HEAD
+ remote_content_type = response.content_type
+
+ except:
+ return 'FAILED'
+
+ # Is the content-type a PDF?
+ if remote_content_type and remote_content_type.startswith('application/pdf'):
+ return 'PDF'
+
+ # Is the content-type an RTF?
+ if remote_content_type and remote_content_type.startswith('application/rtf'):
+ return 'RTF'
+
+ # Is the content-type a plain text?
+ if remote_content_type and remote_content_type.startswith('text/plain'):
+ return 'TXT'
+
+ # Is the content-type a stream of data?
+ if remote_content_type and remote_content_type.startswith('application/octet-stream'):
+        try:
+            # Take the Content-Disposition header from the HEAD response
+            content_disposition = response.headers.get('Content-Disposition', '')
+            # Take the filename field from the Content-Disposition header
+            content_disposition = re.search(
+                r'filename\s*=\s*"(.*)"', content_disposition)
+        except:
+            return 'FAILED'
+ # Is the file a PDF?
+ if content_disposition and \
+ any([splitext(group.lower())[1].endswith('pdf') \
+ for group in content_disposition.groups()]):
+ return 'PDF'
+ if content_disposition and \
+ any([splitext(group.lower())[1].endswith('rtf') \
+ for group in content_disposition.groups()]):
+ return 'RTF'
+ # Is the file a TXT?
+ if content_disposition and \
+ any([splitext(group.lower())[1].endswith('txt') \
+ for group in content_disposition.groups()]):
+ return 'TXT'
+ # Is the file something else?
+ else:
+ return 'OTHER'
+
+ # Is the content-type an HTML?
+ if remote_content_type and remote_content_type.startswith('text/html'):
+ return 'HTML'
+
+ # Is the content-type something else?
+ return 'OTHER'
+
+async def get_content_from_url(url: str, requests_session: ClientSession) -> bytes:
+ """
+ Download the document from the URL provided, store it locally and return it
+ """
+
+ try:
+ # Download the content from the URL
+ response = await requests_session.request(
+ method = 'GET',
+ url = url,
+ headers = {'User-Agent': USER_AGENT},
+ allow_redirects = True,
+ ssl = False)
+ assert response.status == 200
+ except:
+ text_bytes = b''
+ else:
+ # Read the downloaded content
+ try:
+ text_bytes = await response.read()
+ except:
+ text_bytes = b''
+
+ # Return the content
+ return text_bytes
+
+async def get_text_from_txt(url: str, file_path: str, requests_session: ClientSession) -> str:
+ """
+ Extract the text from the TXT file provided (or downloaded from the URL provided)
+ """
+
+ if os.path.exists(file_path):
+ with open(file_path, 'rb') as f_in:
+ text_bytes = f_in.read()
+ else:
+ text_bytes = await get_content_from_url(
+ url = url,
+ requests_session = requests_session)
+ text = text_bytes.decode(errors='ignore')
+
+ return text
+
+async def get_text_from_pdf(url: str, file_path: str, requests_session: ClientSession, use_ocr = False) -> str:
+ """
+ Extract the text from the PDF file provided (or downloaded from the URL provided)
+ """
+
+ if os.path.exists(file_path):
+ with open(file_path, 'rb') as f_in:
+ text_bytes = f_in.read()
+ else:
+ text_bytes = await get_content_from_url(
+ url = url,
+ requests_session = requests_session)
+
+ if use_ocr:
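+        # OCR path: render each page as an image with pdf2image and
+        # extract its text with Tesseract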
+ try:
+ pdf_parser = PDFParser(BytesIO(text_bytes))
+ pdf = PDFDocument(pdf_parser)
+ n_pages = pdf.catalog['Pages'].resolve()['Count']
+ # Analyze the document only if it is shorter than 30 pages
+ if n_pages<30:
+ pages = pdf2image.convert_from_bytes(text_bytes, grayscale = True)
+
+ text = ''
+ for page in pages:
+ page = pytesseract.image_to_string(page, lang = 'eng')
+ text += page
+ except:
+ text = ''
+ else:
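+        # Standard path: extract the embedded text layer with pdfminer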
+ try:
+ text = pdfminer.extract_text(BytesIO(text_bytes))
+ except:
+ text = ''
+
+ return text
+
+async def get_text_from_rtf(url: str, file_path: str, requests_session: ClientSession) -> str:
+ """
+ Extract the text from the RTF file provided (or downloaded from the URL provided)
+ """
+
+ if os.path.exists(file_path):
+ with open(file_path, 'rb') as f_in:
+ text_bytes = f_in.read()
+ else:
+ text_bytes = await get_content_from_url(
+ url = url,
+ requests_session = requests_session)
+
+ try:
+ text = text_bytes.decode(errors='ignore')
+ text = rtf_to_text(text)
+ except:
+ text = ''
+
+ return text
+
+async def get_text_from_html(url: str, file_path: str, requests_session: ClientSession) -> Tuple[str, List[str]]:
+ """
+ Extract the text from the body of the document,
+ using the local version of the website (or try to create one)
+ """
+
+ html_path = file_path.replace('.pdf', '.html')
+
+ # Use the, previously stored, local HTML version of the URL, if exists
+ try:
+ if html_path and os.path.exists(html_path):
+ with open(html_path, 'r') as f_in:
+ html_soup = beautiful_soup(f_in, 'html5lib')
+ else:
+ text_bytes = await get_content_from_url(
+ url = url,
+ requests_session = requests_session)
+ html_soup = beautiful_soup(text_bytes, 'html5lib')
+ except:
+ text = ''
+ else:
+ try:
+ # Remove