Deduplication of EndNote RIS files:
- deduplicate one file: produces a new RIS file with the unique records
- deduplicate two files (NEW-RECORDS and OLD-RECORDS): deduplicates both files and produces a RIS file with the unique records from NEW-RECORDS
- mark the duplicates of one file: produces a RIS file with the Label field containing the ID of the duplicate record
DedupEndNote is available at http://dedupendnote.nl:9777
- Export one or two EndNote databases as RIS file(s)
- Upload the file(s)
- Choose the action
- Download the result file (RIS)
- Import the result file into a new EndNote database
DedupEndNote is a Java web application (Java 17, Spring Boot 2.7, fat jar). It can be started locally with:
java -jar DedupEndNote-[VERSION].jar
and the application will be available at
http://localhost:9777
Deduplication in EndNote misses many duplicate records. Building and maintaining a Journals List within EndNote can partly solve this problem, but many cases remain where EndNote is too strict when comparing records. Some bibliographic databases offer deduplication for their own records (OVID: Medline and EMBASE), but this does not help PubMed, Cochrane or Web of Science users.
DedupEndNote deduplicates an EndNote RIS file and writes a new RIS file with the unique records, which can be imported into a new EndNote database. It is more forgiving than EndNote itself when comparing records, and tests have shown that it identifies many more duplicates (see below under "Performance").
The program has been tested on EndNote databases with records from:
- CINAHL (EBSCOHost)
- Cochrane Library (Trials)
- EMBASE (OVID)
- Medline (OVID)
- PsycINFO (OVID)
- PubMed
- Scopus
- Web of Science
The program has been tested with files of up to 50,000 records.
Each pair of records is compared in 5 different ways. The general rule is:
Comparison | Result | Action |
---|---|---|
1 ... 5 | YES (or insufficient data for comparison) | go to the next comparison if there is one, else mark the records as duplicates |
1 ... 5 | NO | stop the comparisons for this pair of records |
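The short-circuit evaluation in the table above can be sketched as follows (a minimal sketch: the record fields, method names and the single comparison shown are illustrative, not DedupEndNote's actual API):

```java
import java.util.List;
import java.util.function.BiPredicate;

public class DedupPipeline {

    // Simplified record with only a few of the fields DedupEndNote compares.
    public record Rec(Integer year, String startPage, String doi) {}

    // Every comparison must answer YES (insufficient data usually counts
    // as YES); the first NO stops the evaluation for this pair.
    public static boolean isDuplicatePair(Rec a, Rec b,
                                          List<BiPredicate<Rec, Rec>> comparisons) {
        for (BiPredicate<Rec, Rec> comparison : comparisons) {
            if (!comparison.test(a, b)) {
                return false;
            }
        }
        return true;
    }

    // Comparison 1: publication years at most 1 year apart; a record
    // without a year passes (insufficient data).
    public static final BiPredicate<Rec, Rec> YEARS_CLOSE =
            (a, b) -> a.year() == null || b.year() == null
                      || Math.abs(a.year() - b.year()) <= 1;

    public static void main(String[] args) {
        Rec r1 = new Rec(2016, "123", null);
        Rec r2 = new Rec(2017, "123", null);
        Rec r3 = new Rec(2013, "123", null);
        System.out.println(isDuplicatePair(r1, r2, List.of(YEARS_CLOSE))); // true
        System.out.println(isDuplicatePair(r1, r3, List.of(YEARS_CLOSE))); // false
    }
}
```

Running the pipeline with all 5 comparisons works the same way: the list simply contains one predicate per comparison, in the order given below.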
The following comparisons are used (in this order, chosen for performance reasons):
- Publication year: Are they at most 1 year apart?
- Preprocessing: publication years before 1900 are removed (see "Insufficient data")
- Insufficient data: records without a publication year are compared to all records, unless they have already been identified as a duplicate.
- Starting page or DOI: Are they the same?
If the starting pages are different or one or both are absent, the DOIs are compared.
- Preprocessing: the article number is treated as the starting page if the Pages field itself is empty or does not contain a "-".
- Preprocessing: Starting pages are compared only for number: "S123" and "123" are considered the same.
- Preprocessing: In DOIs 'http://dx.doi.org/', 'http://doi.org/', ... are left out. URL- and HTML-encoded DOIs are decoded ('10.1002/(SICI)1098-1063(1998)8:6%3C627::AID-HIPO5%3E3.0.CO;2-X' becomes '10.1002/(SICI)1098-1063(1998)8:6<627::AID-HIPO5>3.0.CO;2-X'). DOIs are lowercased.
- Insufficient data: If one or both DOIs are missing and one or both of the starting pages are missing, the answer is YES. This is important because of PubMed ahead of print publications.
- Authors: Is the Jaro-Winkler similarity of the authors > 0.67?
- Preprocessing: The author "Anonymous," is treated as no author.
- Preprocessing: Group author names are removed. "Author" names which contain "consortium", "grp", "group", "nct" or "study" are considered group author names.
- Preprocessing: First names are reduced to initials ("Moorthy, Ranjith K." to "Moorthy, R. K.").
- Preprocessing: All authors from each record are joined by "; ".
- Insufficient data: If one or both records have no authors, the answer is YES (except if one of the records is a reply (see below) and one of the records has no starting page or DOI).
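The author preprocessing above (initials, joining with "; ") can be sketched as follows; the joined strings are then what the Jaro-Winkler similarity is computed on. Method names are illustrative, not DedupEndNote's actual code:

```java
import java.util.List;
import java.util.stream.Collectors;

public class AuthorNormalizer {

    // "Moorthy, Ranjith K." -> "Moorthy, R. K."
    public static String toInitials(String author) {
        String[] parts = author.split(",", 2);
        if (parts.length < 2) {
            return author; // no first names present
        }
        StringBuilder initials = new StringBuilder();
        for (String name : parts[1].trim().split("[\\s.]+")) {
            if (!name.isEmpty()) {
                initials.append(Character.toUpperCase(name.charAt(0))).append(". ");
            }
        }
        return parts[0].trim() + ", " + initials.toString().trim();
    }

    // All authors of a record are joined into one string with "; ".
    public static String joinAuthors(List<String> authors) {
        return authors.stream()
                .map(AuthorNormalizer::toInitials)
                .collect(Collectors.joining("; "));
    }

    public static void main(String[] args) {
        System.out.println(toInitials("Moorthy, Ranjith K.")); // Moorthy, R. K.
        System.out.println(joinAuthors(List.of("Moorthy, Ranjith K.", "Rajshekhar, Vedantam")));
    }
}
```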
- Title: Is the Jaro-Winkler similarity of (one of) the normalized titles > 0.9?
The fields Original publication (OP), Short Title (ST), Title (TI) and sometimes Book section (T3, see below) are treated as titles. Because the Jaro-Winkler similarity algorithm puts a heavy penalty on differences at the beginning of a string, the normalized titles are also reversed.
- Preprocessing: The titles are normalized (converted to lower case, text between "<...>" removed, all characters which are not letters or numbers are replaced by a space character, ...).
- Insufficient data: If one of the records is a reply (see below), the titles are not compared / the answer is YES (but the Jaro-Winkler similarity of the authors should be > 0.75 and the comparison between the journals is more strict).
Reply: a publication is considered a reply if the title (field TI) contains "reply", or contains "author(...)respon(...)", or is nothing but "response" (all case insensitive).
T3 field: Especially EMBASE (OVID) uses this field for (1) Conference title (majority of cases), (2) an alternative journal title, and (3) original (non English) title. Case 1 (identified as containing a number or "Annual", "Conference", "Congress", "Meeting" or "Society") is skipped. All other T3 fields are treated as Journals and as titles.
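The reply test can be sketched as below; this follows the description literally, and the exact patterns in DedupEndNote may differ:

```java
public class ReplyDetector {

    // A title marks a reply if it contains "reply", contains
    // "author(...)respon(...)", or is nothing but "response"
    // (all case insensitive).
    public static boolean isReply(String title) {
        String t = title.toLowerCase().strip();
        return t.contains("reply")
                || t.matches(".*author.*respon.*")
                || t.equals("response");
    }

    public static void main(String[] args) {
        System.out.println(isReply("Reply from the authors"));  // true
        System.out.println(isReply("Authors' response"));       // true
        System.out.println(isReply("Response"));                // true
        System.out.println(isReply("Stress response in mice")); // false
    }
}
```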
- ISSN or Journal: Are they the same (ISSN) or similar (Journal)?
The fields Journal / Book Title (T2), Alternate Journal (J2) and sometimes Book section (T3, see below) are treated as journals, ISBNs as ISSNs. All ISSNs and journal titles (including abbreviations) in the records are used. Abbreviated and full journal titles are compared in a sensible way (see examples below). If the ISSNs are different or one or both records have no ISSN, the journals are compared.
- Preprocessing: ISSNs are normalized (dashes are removed, lowercased). For ISBN-10 the first 9 digits are used, for ISBN-13 the 9 digits starting at position 4.
- Preprocessing: Journal titles of the form "Zhonghua wai ke za zhi [Chinese journal of surgery]" or "Zhonghua wei chang wai ke za zhi = Chinese journal of gastrointestinal surgery" or "The Canadian Journal of Neurological Sciences / Le Journal Canadien Des Sciences Neurologiques" are split into 2 journal titles.
- Preprocessing: the journal titles are normalized (hyphens, dots and apostrophes are replaced with space, end part between round or square brackets is removed, initial article is removed, ...).
If two records get 5 YES answers, they are considered duplicates. Only the first record of a set of duplicate records is copied to the output file.
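Two of the preprocessing steps above, title normalization (with the reversed variant that offsets Jaro-Winkler's prefix bias) and ISSN/ISBN normalization, can be sketched as follows. Method names and the exact normalization rules are illustrative, not DedupEndNote's actual code:

```java
public class FieldNormalizer {

    // Lowercase, drop text between "<...>", replace everything that is
    // not a letter or number with a space.
    public static String normalizeTitle(String title) {
        return title.toLowerCase()
                .replaceAll("<[^>]*>", " ")
                .replaceAll("[^\\p{L}\\p{N}]+", " ")
                .strip();
    }

    // The normalized title is also compared in reversed form, because
    // Jaro-Winkler penalizes differences at the start of a string heavily.
    public static String reverse(String s) {
        return new StringBuilder(s).reverse().toString();
    }

    // ISSNs: dashes removed, lowercased. ISBN-10: first 9 digits;
    // ISBN-13: the 9 digits after the 978/979 prefix, so both forms of
    // the same ISBN normalize to the same key.
    public static String normalizeIssn(String issn) {
        String s = issn.replace("-", "").toLowerCase();
        if (s.length() == 10) return s.substring(0, 9);
        if (s.length() == 13) return s.substring(3, 12);
        return s;
    }

    public static void main(String[] args) {
        System.out.println(normalizeTitle("Portal vein thrombosis: a review."));
        // ISBN-13 and ISBN-10 of the same book normalize to the same key:
        System.out.println(normalizeIssn("978-0-306-40615-7")
                .equals(normalizeIssn("0-306-40615-2"))); // true
    }
}
```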
When writing the output file (except in Mark Mode), the following fields can be changed:
- Author (AU):
- if the (only) author is "Anonymous", the author is omitted
- DOI (DO):
- the DOIs of the removed duplicate records are copied to the saved record and deduplicated. The DOI field is important for finding the full text in EndNote.
- DOIs of the form "10.1038/ctg.2014.12", "http://dx.doi.org/10.1038/ctg.2014.12", ... are rewritten in the prescribed form "https://doi.org/10.1038/ctg.2014.12". DOIs of this form are clickable links in EndNote.
- Publication year (PY):
- if the saved record has no value for its Publication year but one of the removed duplicate records has, the first not empty Publication year of the duplicates is copied to the saved record.
- Starting page (SP) and Article Number (C7):
- the article number from field C7 is put in the Pages field (SP) if the Pages field is empty or does not contain a "-", overwriting the Pages field content.
- the article number field (C7) is omitted
- if the saved record has no value for its Pages field (e.g. PubMed ahead of print publications) but one of the removed duplicate records has, the first not empty pages of the duplicates are copied to the saved record.
- the Pages field gets an unabbreviated form: e.g. "482-91" is rewritten as "482-491".
- if the ending page is the same as the starting page, only the starting page is written ("192" instead of "192-192").
- Title (TI):
- If the publication is a reply, the title is replaced with the longest title from the duplicates (e.g. "Reply from the authors" is replaced by "Coagulation parameters and portal vein thrombosis in cirrhosis Reply")
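Two of the output rewrites above, the prescribed "https://doi.org/..." DOI form and the unabbreviated page range, can be sketched as follows (method names are illustrative, not DedupEndNote's actual code):

```java
public class OutputRewriter {

    // "10.1038/ctg.2014.12" and "http://dx.doi.org/10.1038/ctg.2014.12"
    // both become "https://doi.org/10.1038/ctg.2014.12".
    public static String canonicalDoi(String doi) {
        String d = doi.replaceFirst("(?i)^https?://(dx\\.)?doi\\.org/", "");
        return "https://doi.org/" + d;
    }

    // "482-91" -> "482-491"; "192-192" -> "192"
    public static String rewritePages(String pages) {
        String[] parts = pages.split("-");
        if (parts.length != 2) return pages;
        String start = parts[0], end = parts[1];
        if (end.length() < start.length()) {
            // complete the abbreviated ending page with the starting page's prefix
            end = start.substring(0, start.length() - end.length()) + end;
        }
        return end.equals(start) ? start : start + "-" + end;
    }

    public static void main(String[] args) {
        System.out.println(canonicalDoi("http://dx.doi.org/10.1038/ctg.2014.12"));
        System.out.println(rewritePages("482-91"));  // 482-491
        System.out.println(rewritePages("192-192")); // 192
    }
}
```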
The output file is a new RIS file which can be imported into a new EndNote database.
DedupEndNote is slower than EndNote at deduplicating records because its comparisons are more time-consuming. EndNote can deduplicate an EndNote database of ca. 15,000 records in less than 5 seconds. DedupEndNote needs around 20 seconds to deduplicate the export file in RIS format (115MB).
Data are from:
- [SRA] Rathbone, J., Carter, M., Hoffmann, T. et al.
Better duplicate detection for systematic reviewers: evaluation of Systematic Review Assistant-Deduplication Module.
Syst Rev 4, 6 (2015). https://doi.org/10.1186/2046-4053-4-6
The data sets are available at https://osf.io/dyvnj/
- [McKeown] McKeown, S., Mir, Z.M. Considerations for conducting systematic reviews: evaluating the performance of different methods for de-duplicating references. Syst Rev 10, 38 (2021). https://doi.org/10.1186/s13643-021-01583-y
- [BIG_SET] Own test database for DedupEndNote on portal vein thrombosis (52,828 records, with 4923 records validated)
Name | Tool | True pos | False neg | Sensitivity | True neg | False pos | Specificity | Accuracy |
---|---|---|---|---|---|---|---|---|
SRA: Cytology screening (1856 rec) | EndNote X9 | 885 | 518 | 63.1% | 452 | 1 | 99.8% | 72.0% |
| | SRA-DM | 1265 | 139 | 90.1% | 452 | 0 | 100.0% | 92.5% |
| | DedupEndNote | 1359 | 61 | 95.7% | 436 | 0 | 100.0% | 96.8% |
SRA: Haematology (1415 rec) | EndNote | 159 | 87 | 64.6% | 1165 | 4 | 99.7% | 93.6% |
| | SRA-DM | 208 | 38 | 84.6% | 1169 | 0 | 100.0% | 97.3% |
| | DedupEndNote | 222 | 14 | 94.1% | 1179 | 0 | 100.0% | 99.0% |
SRA: Respiratory (1988 rec) | EndNote X9 | 410 | 391 | 51.2% | 1185 | 2 | 99.8% | 80.2% |
| | SRA-DM | 674 | 125 | 84.4% | 1189 | 0 | 100.0% | 93.7% |
| | DedupEndNote | 766 | 34 | 95.7% | 1188 | 0 | 100.0% | 97.8% |
SRA: Stroke (1292 rec) | EndNote X9 | 372 | 134 | 73.5% | 784 | 2 | 99.7% | 89.5% |
| | SRA-DM | 426 | 81 | 84.0% | 785 | 0 | 100.0% | 93.7% |
| | DedupEndNote | 503 | 7 | 98.6% | 782 | 0 | 100.0% | 99.5% |
McKeown (3130 rec) | OVID | 1982 | 90 | 95.7% | 1058 | 0 | 100.0% | 97.1% |
| | EndNote | 1541 | 531 | 74.4% | 850 | 208 | 80.3% | 76.4% |
| | Mendeley | 1877 | 195 | 90.6% | 1041 | 17 | 98.4% | 93.2% |
| | Zotero | 1473 | 599 | 71.1% | 1038 | 20 | 98.1% | 80.2% |
| | Covidence | 1952 | 120 | 94.2% | 1056 | 2 | 99.8% | 96.1% |
| | Rayyan | 2023 | 49 | 97.6% | 1006 | 52 | 95.1% | 96.8% |
| | DedupEndNote | 2010 | 62 | 97.0% | 1058 | 0 | 100.0% | 98.0% |
BIG_SET (4923 rec) | DedupEndNote | 3685 | 271 | 93.1% | 966 | 1 | 99.9% | 94.5% |
- Input file size: The maximum size of the input file is limited to 150MB.
- Input file format: only EndNote RIS file (at present)
- Input file encoding: The program assumes that the input file is encoded as UTF-8.
- The program uses a bibliographic point of view: an article or conference abstract that has been published in more than one (issue of a) journal is not considered a duplicate publication.
- If authors AND (all) titles AND (all) journal names for a record use a non-Latin script, results for this record may be inaccurate.
- Each input file must be an export from ONE EndNote database: the ID fields are used internally to identify the records, so they must be unique within a file. When two files are compared, the two files may share ID values.
- The program has been developed and tested for biomedical databases (PubMed, EMBASE, ...) and some general databases (Web of Science, Scopus). Deduplicating records from other databases is not guaranteed to work.
- Records for each publication year are compared to records from the same and the following year: a record from 2016 is compared to the records from 2015 (when treating the records from 2015) and from 2016 and 2017 (when treating the records from 2016). A PubMed ahead-of-print record from 2013 and a corresponding record from 2017 (when it was 'officially' published) will not be compared (and possibly deduplicated).
- Bibliographic databases are not always accurate about the starting page of a publication. Because the starting page is part of the comparisons, DedupEndNote misses duplicates when bibliographic databases disagree on the starting page (and one or both records have no DOI).
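The year-window pairing described in the limitations can be sketched as follows: records are grouped by publication year, and each group is only compared with itself and the following year, which is why a 2013 ahead-of-print record and its 2017 published version are never paired. The helper below is a minimal sketch, not DedupEndNote's actual code:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class YearWindow {

    // Records from `year` are compared against records from the same
    // year and the following year only.
    public static List<String> candidatesFor(int year, Map<Integer, List<String>> byYear) {
        List<String> candidates = new ArrayList<>(byYear.getOrDefault(year, List.of()));
        candidates.addAll(byYear.getOrDefault(year + 1, List.of()));
        return candidates;
    }

    public static void main(String[] args) {
        Map<Integer, List<String>> byYear = Map.of(
                2013, List.of("ahead-of-print record"),
                2016, List.of("rec A"),
                2017, List.of("rec B", "official record"));
        System.out.println(candidatesFor(2016, byYear)); // 2016 and 2017 records
        System.out.println(candidatesFor(2013, byYear)); // only the 2013 record
    }
}
```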