Comprehensive Guide to Machine Learning Software for Text Screening

This project aims to provide a comparison of different software tools for machine learning-assisted text screening. The comparison is designed to help researchers and practitioners make informed decisions when selecting a suitable tool for their needs. We compare various aspects, such as software functionality, data handling capabilities, and machine learning properties.

Inclusion Criteria

The selection process for the software tools is documented on the Open Science Framework. Each included tool meets the following inclusion criteria:

Implements a Researcher-in-the-Loop (RITL)-based active learning cycle for systematically screening large volumes of textual data.
Achieves a Technology Readiness Level of at least TRL7.
Offers user-friendly software that is accessible to a broad audience.
Provides a generic application that is not limited to specific content, fields, or types of interventions.
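
The Researcher-in-the-Loop cycle named in the first criterion can be illustrated with a minimal sketch. It is not taken from any of the listed tools: the token-count scorer, the function names (`tokenize`, `score`, `screen`), and the `ask` callback standing in for the human reviewer are all assumptions made for illustration. The loop retrains on all labels so far, shows the reviewer the record it ranks as most likely relevant, and records the decision.

```python
from collections import Counter
from typing import Callable, Dict, List


def tokenize(text: str) -> List[str]:
    """Lowercase whitespace tokenizer (deliberately simplistic)."""
    return text.lower().split()


def score(text: str, relevant: Counter, irrelevant: Counter) -> float:
    """Toy relevance score: tokens seen in relevant records count up,
    tokens seen in irrelevant records count down."""
    return sum(relevant[t] - irrelevant[t] for t in set(tokenize(text)))


def screen(
    records: List[str],
    prior: Dict[int, int],
    ask: Callable[[str], int],
    budget: int,
) -> Dict[int, int]:
    """Run a hypothetical researcher-in-the-loop cycle.

    prior  -- index -> label (1 relevant, 0 irrelevant) known beforehand
    ask    -- stands in for the human reviewer; returns 1 or 0
    budget -- maximum number of records the reviewer will label
    """
    labels = dict(prior)
    for _ in range(budget):
        pool = [i for i in range(len(records)) if i not in labels]
        if not pool:
            break
        # (Re-)train: count tokens per class from every label so far.
        relevant, irrelevant = Counter(), Counter()
        for i, y in labels.items():
            target = relevant if y == 1 else irrelevant
            target.update(set(tokenize(records[i])))
        # Query: certainty-based, i.e. the highest-scoring unlabeled record.
        best = max(pool, key=lambda i: score(records[i], relevant, irrelevant))
        # Label: the reviewer decides, and the decision feeds the next round.
        labels[best] = ask(records[best])
    return labels
```

A real tool would replace the scorer with a feature extractor and classifier such as those compared under Machine Learning Properties; only the train, query, label structure of the loop is the point here.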

Overview

The table below offers a concise overview of various software tools designed for systematically screening large volumes of textual data using machine learning techniques. Each software is evaluated based on the following properties:

Is there a website?
Is the software open-source (provide a 🔗 to the source code)?
Is the software peer-reviewed in a scientific article?
Is documentation or a manual available (provide a 🔗)?
Is the full version of the software free of charge?

Software	Website	Open-Source	Published	Documentation	Free
Abstrackr	🔗	❌		❌	✅
ASReview	🔗	✅🔗		✅🔗	✅
Colandr	🔗	❌		✅🔗	✅
DistillerSR	🔗	❌		✅🔗	❌
EPPI-Reviewer	🔗	❌	❌	✅🔗	❌
FASTREAD	❌	✅🔗		✅🔗	✅
Rayyan	🔗	❌		✅🔗	❌
RobotAnalyst	🔗	❌		❌	❔¹
SWIFT-Active Screener	🔗	❌		✅🔗	❌

✅ Yes/Implemented; ❌ No/Not implemented; ❔ Unknown (requires an issue).

¹ See issue #29

Installation

This table summarizes the various installation options available for each software tool, highlighting whether:

The software can be installed locally, ensuring that data and labeling decisions are only stored on the user's device (yes/no)?
The software can be installed on a server (yes/no)?
The software is available as an online service (Software as a Service, SaaS; yes/no; provide a link to the registration page)?

Software	Local	Server	Online Service
Abstrackr	❌	❌	✅🔗
ASReview	✅	✅	❌
Colandr	❌	❌	✅🔗
DistillerSR	❌	❌	✅🔗
EPPI-Reviewer	❌	❌	✅🔗
FASTREAD	✅	✅	❌
Rayyan	❌	❌	✅🔗
RobotAnalyst	❌	❌	✅🔗¹
SWIFT-Active Screener	❌	❌	❌🔗

✅ Yes; ❌ No; ❔ Unknown (requires an issue).

¹ To use RobotAnalyst, you need to request an account via email.

Data Handling

This table provides an overview of the data input/output capabilities of each software, including:

Supported import data formats.
Whether partially labeled data can be imported (yes/no; if yes, as S(ingle) or M(ultiple) files)?
Supported export data formats.
Whether the export file includes the labeling decisions (yes/no).
Whether the export file can be re-imported into the same software, retaining the labeling decisions (Re-Import-1: yes/no)?
Whether the export file can be re-imported into reference manager software, retaining the labeling decision (Re-Import-2: yes/no)?

Software	Input data format	Partly labeled	Output data format	Labeling decisions	Re-Import-1	Re-Import-2
Abstrackr	RIS, TAB, TXT¹	❌	CSV, XML, RIS	✅	❌	✅
ASReview	RIS, TSV, CSV, XLSX, TAB, `+`²	✅(S)`+`²	RIS, TSV, CSV, XLSX, TAB	✅	✅	✅
Colandr	RIS, BIB, TXT	✅(M)	CSV	✅	❌	❌
DistillerSR	ENLX, RIS, CSV, ZIP	✅(M)	RIS, CSV, XLSX, Word	❔⁴	❔⁴	❔⁴
EPPI-Reviewer	RIS, TXT, `+`³	✅(M)	RIS, XLSX	❔⁵	❔⁵	❔⁵
FASTREAD	CSV	✅(S)	CSV	✅	✅	❌
Rayyan	RIS, ENW, BIB, CSV, XML, CIW, NBIB	✅(M)	RIS, BIB, ENW, CSV	✅	❌	✅
RobotAnalyst	RIS, NBIB	✅❔⁶	❔⁶	✅	❔⁶	❌
SWIFT-Active Screener	TXT, RIS, XML, BibTex	✅(M)	CSV, RIS	✅	❔⁷	✅

✅ Yes/Implemented; ❌ No/Not implemented; ⚡ Only for some extensions (add a footnote for more explanation); ❔ Unknown (requires an issue).

¹ List of PubMed IDs

² ASReview provides several open-source tools to convert file formats (e.g., CSV->RIS or RIS->XLSX), combine datasets (labeled, partly labeled, or unlabeled), and deduplicate records based on title/abstract/DOI.

³ EPPI-Reviewer provides a closed-source online file converter to convert several file formats to RIS.

⁴ See issue #54

⁵ See issue #21

⁶ See issue #29

⁷ See issue #40
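
Footnote ² mentions deduplication based on title, abstract, or DOI. The sketch below shows one simple way such a step could work. It is not ASReview's implementation: the record layout (dictionaries with `doi` and `title` keys) and the function names are assumptions, and the abstract is ignored for brevity.

```python
import re
from typing import Dict, Iterable, List, Optional


def dedup_key(record: Dict[str, Optional[str]]) -> str:
    """Build a comparison key: normalized DOI if present, else normalized title."""
    doi = (record.get("doi") or "").strip().lower()
    # Strip a leading resolver URL so "https://doi.org/10.x/y" equals "10.x/y".
    doi = re.sub(r"^https?://(dx\.)?doi\.org/", "", doi)
    if doi:
        return "doi:" + doi
    title = (record.get("title") or "").lower()
    # Collapse punctuation and whitespace so minor formatting differences match.
    title = re.sub(r"[^a-z0-9]+", " ", title).strip()
    return "title:" + title


def deduplicate(records: Iterable[Dict[str, Optional[str]]]) -> List[Dict[str, Optional[str]]]:
    """Keep the first record for each key; records with no DOI and no title are kept."""
    seen = set()
    unique = []
    for record in records:
        key = dedup_key(record)
        if key == "title:":
            unique.append(record)  # nothing to compare on, so keep it
        elif key not in seen:
            seen.add(key)
            unique.append(record)
    return unique
```

One limitation of this simplification: a record that carries a DOI and a copy of the same record without one are not matched, because they receive different kinds of keys.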

Machine Learning Properties

The tables below provide an overview of the machine learning properties of each software.

Active Learning

Training Data

Can the user select training data (prior knowledge) to train the first iteration of the model (yes/no)?
What is the minimum training data size (provide a number for Relevant and Irrelevant records)?

Software	Tr.Data by user	Minimum Tr.data
Abstrackr	❌	❔¹
ASReview	✅	≥1R+≥1I
Colandr	✅	10
DistillerSR	✅	25 or 2%²
EPPI-Reviewer	✅	≥5R
FASTREAD	✅	≥1R
Rayyan	✅	≥50 with ≥5R
RobotAnalyst	✅	≥1R
SWIFT-Active Screener	✅⁴	≥1R⁵

✅ Yes/Implemented; ❌ No/Not implemented; ⚡ With some effort (add a footnote for more explanation); ❔ Unknown (requires an issue).

¹ See issue #34

² Training takes place after screening 25 records or after screening 2% of the dataset, whichever is greater.

⁴ Only relevant records can be provided as training data prior to screening.

⁵ If no relevant records are uploaded prior to screening, training is initiated after screening ≥30 records, including at least 1R and 1I.
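
The rules in footnotes ² and ⁵ can be written out as small checks. This is only a reading of the footnotes, not vendor code: the function names are hypothetical, rounding 2% up to a whole record is an assumption, and the behavior when relevant records are uploaded beforehand follows the ≥1R entry in the table.

```python
from typing import Sequence


def distillersr_training_threshold(n_records: int) -> int:
    """Records to screen before training starts, per footnote 2:
    25 records or 2% of the dataset, whichever is greater.
    Rounding 2% up is an assumption; integer math avoids float error."""
    two_percent = -(-n_records // 50)  # ceiling of n_records / 50
    return max(25, two_percent)


def swift_can_train(screened_labels: Sequence[int], n_seed_relevant: int = 0) -> bool:
    """Whether training can start, per the table and footnote 5.
    screened_labels holds 1 (relevant) or 0 (irrelevant) per screened record;
    n_seed_relevant counts relevant records uploaded before screening."""
    if n_seed_relevant >= 1:
        return True
    return (
        len(screened_labels) >= 30
        and 1 in screened_labels
        and 0 in screened_labels
    )
```

For a dataset of 1,000 records the 25-record floor applies, since 2% is only 20 records; for 5,000 records the 2% rule takes over and gives 100.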

Model Selection

The table below provides an overview of the model selection properties for each software.

Can the user select the active learning model (yes/no)?
Can a user upload their own model (yes/no)?
Can the feature extraction results be stored (yes/no)?
Does (re-)training proceed Automatically or is it triggered Manually?
Can the user continue labeling during training (yes/no)?
Can the user select batch size (yes/no; provide the default)?
Is it possible to switch to a different model during screening (yes/no)?

Software	Select model	User model	Store Feat.matrix	Training	Continue	Batch size	Switch
Abstrackr	❌	❌	❌	A	✅	❌	❌
ASReview	✅	✅	✅	A	✅	❌ (1)	⚡¹
Colandr	❌	❌	❌	A	✅	❌ (10)	❌
DistillerSR	❌	❌	❌	A, M	✅	❌	❌
EPPI-Reviewer	❌	❌	❌	M	✅	❌	❌
FASTREAD	❌	❌	❌	M	❌	❌	❌
Rayyan	❌	❌	❌	M	✅	❌	❌
RobotAnalyst	❌	❌	❌	M	❔²	❌	❌
SWIFT-Active Screener	❌	❌	❌	A	❔³	❌ (30)	❌

✅ Yes/Implemented; ❌ No/Not implemented; ⚡ With some effort (add a footnote with more explanation); ❔ Unknown (requires an issue).

¹ Switching to a different model in ASReview is available by exporting the data of the first model and importing the data back into ASReview. The software will recognize all previous labeling decisions, and a new model can be trained.

² See issue #29

³ See issue #40

Overview of Available Models

Which feature extraction methods are available? BOW = bag of words; Doc2Vec = document to vector; sBERT = sentence bidirectional encoder representations from transformers; TF-IDF = term frequency-inverse document frequency; Word2Vec = words to vector; ML = multi-language.
Which classifiers are available? CNN = convolutional neural network; DNN = dense neural network; LDA = latent Dirichlet allocation; LL = log linear; LR = logistic regression; LSTM = long short-term memory; NB = naive Bayes; RF = random forests; SGD = stochastic gradient descent; SVM = support vector machine.
Which balancing strategies are available? S / Simple = no balancing; D / Double = double balance strategy; T / Triple = triple balance strategy; U / Under = undersampling; A / Aggressive = aggressive undersampling (after the classifier is stable); W / Weighting = weighting for data balancing (before and after the classifier is stable); M / Mixing = weighting before the classifier is stable and aggressive undersampling after it is stable.
Which query strategies are available? R / Random = records are selected randomly; C / Certain = certainty-based; U / Uncertain = uncertainty-based; M / Mixed = a combination of query strategies, for example 90% certainty-based and 10% random; Cl / Clustering = clustering query strategy.

Software	Feature Extr.	Classifiers	Balancing	Query Stra.
Abstrackr	TF-IDF ❔¹	SVM	❔¹	R, C, U
ASReview	TF-IDF, Doc2Vec, sBERT, ML	CNN, DNN, LR, LSTM, NB, RF, SVM	S, D, U, T	R, C, U, M, Cl
Colandr	Word2Vec ❔²	SGD ❔²	❔²	C
DistillerSR	❔³	SVM	❔³	R, C
EPPI-Reviewer	TF-IDF	SVM	❔⁴	R, C, Cl
FASTREAD	TF-IDF	SVM	S, A, W, M	C, U
Rayyan	❔⁵	SVM	❔⁵	C, U
RobotAnalyst	TF-IDF + BOW + LDA2vec	SVM	❔⁶	R, C, U, Cl
SWIFT-Active Screener	TF-IDF	LL	S ❔⁷	C

✅ Yes/Implemented; ❌ No/Not implemented; ❔ Unknown (requires an issue).

¹ See issue #34

² See issue #16

³ See issue #54

⁴ See issue #21

⁵ See issue #19

⁶ See issue #29

⁷ See issue #40
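
The query strategies abbreviated in the table differ only in which unlabeled record is shown to the reviewer next. The following sketch makes that concrete for R, C, U, and M, assuming the classifier outputs a relevance probability per record. The function name `select_next` and the parameter `random_share` are hypothetical, clustering (Cl) is left out, and none of this reflects how a specific tool implements its strategies.

```python
import random
from typing import Optional, Sequence


def select_next(
    scores: Sequence[float],
    strategy: str,
    rng: Optional[random.Random] = None,
    random_share: float = 0.1,
) -> int:
    """Return the index of the next record to present to the reviewer.

    scores       -- predicted probability of relevance per unlabeled record
    strategy     -- "R" random, "C" certainty, "U" uncertainty, "M" mixed
    random_share -- for "M": fraction of queries drawn at random
    """
    if len(scores) == 0:
        raise ValueError("no unlabeled records left")
    rng = rng or random.Random()
    indices = range(len(scores))
    if strategy == "R":
        return rng.randrange(len(scores))
    if strategy == "C":
        # Certainty-based: the record most likely to be relevant.
        return max(indices, key=lambda i: scores[i])
    if strategy == "U":
        # Uncertainty-based: the record closest to the 0.5 decision boundary.
        return min(indices, key=lambda i: abs(scores[i] - 0.5))
    if strategy == "M":
        # Mixed: mostly certainty-based, occasionally random.
        pick = "R" if rng.random() < random_share else "C"
        return select_next(scores, pick, rng)
    raise ValueError(f"unknown strategy: {strategy}")
```

With scores `[0.10, 0.95, 0.55, 0.40]`, certainty-based selection returns index 1 and uncertainty-based selection returns index 2. The default `random_share` of 0.1 mirrors the 90%/10% example given in the definitions above.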

Supervised Learning

Software	Feature Extr.	Classifiers	Balancing	Query Stra.
EPPI-Reviewer¹	TF-IDF	SVM ❔²	❔²	R, C, Cl

¹ EPPI-Reviewer offers the option to choose from, or use custom, pre-trained models to find a specific type of literature, e.g., for RCTs.

² See issue #21

Unsupervised Learning

Software	Q1

Software

This section briefly describes the software in alphabetical order.

Abstrackr

Abstrackr is a collaborative (i.e., multiple reviewers can simultaneously screen citations for a review), web-based annotation tool for the citation screening task.

ASReview

ASReview, developed at Utrecht University, helps scholars and practitioners to get an overview of the most relevant records for their work as efficiently as possible while being transparent in the process. It allows multiple machine learning models, and ships with exploration and simulation modes, which are especially useful for comparing and designing algorithms. Furthermore, it is intended to be easily extensible, allowing third parties to add modules that enhance the pipeline with new models, data, and other extensions.

Colandr

Colandr is a free, web-based, open-access tool for conducting evidence synthesis projects.

DistillerSR

DistillerSR automates the management of literature collection, screening, and assessment using AI and intelligent workflows. From a systematic literature review to a rapid review to a living review, DistillerSR makes any project simpler to manage and configure to produce transparent, audit-ready, and compliant results.

EPPI-Reviewer

EPPI-Reviewer is a web-based software program for managing and analysing data in literature reviews. It has been developed for all types of systematic review (meta-analysis, framework synthesis, thematic synthesis, etc.) but also has features that would be useful in any literature review. It manages references, stores PDF files, and facilitates qualitative and quantitative analyses such as meta-analysis and thematic synthesis. It also contains new ‘text mining’ technology that promises to make systematic reviewing more efficient.

FASTREAD

FASTREAD (FAST2) is a tool to support primary study selection in systematic literature review.

Rayyan

Rayyan is a free web and mobile app that helps expedite the initial screening of abstracts and titles using a process of semi-automation while incorporating a high level of usability.

RobotAnalyst

RobotAnalyst was developed as part of the Supporting Evidence-based Public Health Interventions using Text Mining project to support the literature screening phase of systematic reviews.

SWIFT-Active Screener

SWIFT-Active Screener (SWIFT is an acronym for “Sciome Workbench for Interactive computer-Facilitated Text-mining”) is a freely available interactive workbench which provides numerous tools to assist with problem formulation and literature prioritization.

Contributing

If you know of other software that meets the inclusion criteria, please make a Pull Request and add it to the overview. If you find any missing, incorrect, or incomplete information, please open an issue to discuss it.

By collaborating on this repository, we can create a valuable resource for researchers, practitioners, and other stakeholders interested in leveraging machine learning for text screening purposes.

License

This project is licensed under CC-BY 4.0.

Contact

For suggestions, questions, or comments, please file an issue in the issue tracker.

This comparison is maintained by Rens van de Schoot. The goal is to provide a fair and unbiased comparison. If you have any concerns regarding the comparison, please open an issue in the issue tracker so that it can be discussed openly.
