Update arXiv regex
Simplifies the arXiv matching regex to make it more readable and to allow extracting the ID with its version number, the ID without its version number, or the version number alone. The multiple capture groups are overkill for the current code, but they might be helpful for future changes to the arXiv translator.
thebluepotato authored Oct 4, 2024
1 parent e00d98d commit ab4f8df
Showing 1 changed file with 4 additions and 5 deletions.
9 changes: 4 additions & 5 deletions utilities.js
@@ -430,13 +430,12 @@ var Utilities = {

 // Next try arXiv
 if (!identifiers.length) {
-// arXiv identifiers are extracted without version number
-// i.e. 0706.0044v1 is extracted as 0706.0044,
-// because arXiv OAI API doesn't allow to access individual versions
-let arXiv_RE = /((?:[^A-Za-z]|^)([\-A-Za-z\.]+\/\d{7})(?:(v[0-9]+)|)(?!\d))|((?:\D|^)(\d{4}\.\d{4,5})(?:(v[0-9]+)|)(?!\d))/g;
+// arXiv identifiers are extracted with and without version number
+// i.e. 0706.0044v1 is extracted as [0706.0044v1, 0706.0044, 1]
+let arXiv_RE = /\b(([-A-Za-z.]+\/\d{7}|\d{4}\.\d{4,5})(?:v(\d+))?)(?!\d)/g; // 1: full ID, 2: ID without version, 3: version #
 let m;
 while ((m = arXiv_RE.exec(text))) {
-let arXiv = m[2] || m[5];
+let arXiv = m[1];
 if (arXiv && !foundIDs.has(arXiv)) {
 identifiers.push({arXiv: arXiv});
 foundIDs.add(arXiv);
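
For readers who want to see what the three capture groups of the new regex return, here is a minimal sketch. The regex is copied from the added line of the diff. The helper `extractArXivParts` and the sample string are hypothetical, written only for illustration, and are not part of utilities.js.

```javascript
// Regex copied from the new side of the diff.
// Group 1: full ID, group 2: ID without version, group 3: version number.
const arXiv_RE = /\b(([-A-Za-z.]+\/\d{7}|\d{4}\.\d{4,5})(?:v(\d+))?)(?!\d)/g;

// Hypothetical helper (not in utilities.js): collects all three groups
// for every arXiv identifier found in the text.
function extractArXivParts(text) {
  const results = [];
  // matchAll works on a copy of the regex, so the shared lastIndex
  // of arXiv_RE is not modified between calls.
  for (const m of text.matchAll(arXiv_RE)) {
    results.push({
      full: m[1],                                        // e.g. "0706.0044v1"
      id: m[2],                                          // e.g. "0706.0044"
      version: m[3] === undefined ? null : Number(m[3]), // e.g. 1, or null
    });
  }
  return results;
}

// Illustrative input mixing new-style and old-style identifiers.
const sample = "See arXiv:0706.0044v1, hep-th/9901001 and 2101.12345v12.";
console.log(extractArXivParts(sample));
```

The loop in the diff reads only `m[1]`, so an identifier is stored with its version suffix whenever one is present. A caller that needs the versionless form could read group 2 instead.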
