cleanDOI accepts malformed DOI prefix; general behaviour when input doesn't match #24

Open
zoe-translates opened this issue Apr 13, 2023 · 0 comments

utilities/utilities.js

Lines 415 to 426 in b93f16d

/**
 * Strip info:doi prefix and any suffixes from a DOI
 * @type String
 */
cleanDOI: function(/**String**/ x) {
	if(typeof(x) != "string") {
		throw new Error("cleanDOI: argument must be a string");
	}
	var doi = x.match(/10(?:\.[0-9]{4,})?\/[^\s]*[^\s\.,]/);
	return doi ? doi[0] : null;
},

The regular expression will match a string like "10/example" or "10/10" (as in ten out of ten, great, A+), which is not a valid DOI (the prefix does not start with "10.").
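To make the report easy to reproduce, here is a minimal standalone sketch. The pattern is copied verbatim from the snippet above; `matchCurrent` is a stand-in name used only for this illustration and is not part of utilities.js.

```javascript
// Regular expression copied verbatim from cleanDOI in utilities.js.
const currentPattern = /10(?:\.[0-9]{4,})?\/[^\s]*[^\s\.,]/;

// Stand-in for cleanDOI's matching step, with no Zotero dependencies.
// (Name is hypothetical, for illustration only.)
function matchCurrent(x) {
  const m = x.match(currentPattern);
  return m ? m[0] : null;
}

const samples = [
  "10/example",             // no "10." prefix, still matched
  "10/10",                  // a score, still matched
  "I rate it 10/10, great", // matched inside ordinary prose
  "10.1000/182",            // well-formed prefix, matched as expected
];

for (const s of samples) {
  console.log(JSON.stringify(s), "->", JSON.stringify(matchCurrent(s)));
}
```

The first three inputs match because the group `(?:\.[0-9]{4,})?` is optional, so a bare `10/` is enough to start a match.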

In general, cleanDOI() returns the first matching substring if any. I'm not sure whether this is the intended behaviour that we expect the caller to depend on. Its doc says "Strip info:doi prefix and any suffixes from a DOI", which isn't very clear to me.
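As a rough illustration of what "returns the first matching substring" means in practice, the sketch below wraps the same logic in a standalone function (`firstMatch` is a name made up for this example) and runs it on a few hand-written inputs. The inputs are illustrative, not taken from any translator.

```javascript
// Standalone copy of the cleanDOI logic shown above, for experimentation only.
function firstMatch(x) {
  if (typeof x != "string") {
    throw new Error("cleanDOI: argument must be a string");
  }
  const doi = x.match(/10(?:\.[0-9]{4,})?\/[^\s]*[^\s\.,]/);
  return doi ? doi[0] : null;
}

// Leading "info:doi/" is dropped because matching starts at "10".
console.log(firstMatch("info:doi/10.1000/182"));

// Only the first candidate is returned; the second is ignored.
console.log(firstMatch("see 10.1000/abc and 10.1000/xyz"));

// Non-whitespace trailing text such as a query string is kept.
console.log(firstMatch("https://doi.org/10.1000/182?foo=bar"));

// A trailing period or comma is dropped by the final character class.
console.log(firstMatch("10.1000/182."));

// No candidate at all yields null rather than the input.
console.log(firstMatch("no identifier here"));
```

So the only "suffixes" actually stripped are whitespace-delimited text and a trailing period or comma; anything else that follows the DOI without a space stays in the result.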

In the translators repository, there are currently 36 translators making use of cleanDOI. I haven't checked the other Zotero components. I think we need to check how many of those calls depend on the current behaviour, before we go on to improve the regexp here.
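One way to approach that check is to run the current pattern and a candidate replacement side by side and list the inputs where they disagree. The sketch below assumes one possible tightening (making the `.` plus digits mandatory after `10`); that stricter pattern, the helper names, and the sample inputs are all assumptions for illustration, not a proposed patch.

```javascript
// Pattern currently used by cleanDOI.
const currentPattern = /10(?:\.[0-9]{4,})?\/[^\s]*[^\s\.,]/;

// Hypothetical tightening: "." plus at least four digits is required after "10".
const stricterPattern = /10\.[0-9]{4,}\/[^\s]*[^\s\.,]/;

// Return the first match of `pattern` in `x`, or null.
function firstMatchWith(pattern, x) {
  const m = x.match(pattern);
  return m ? m[0] : null;
}

// Return the inputs for which the two patterns give different results.
function behaviourChanges(inputs) {
  return inputs.filter(
    (x) => firstMatchWith(currentPattern, x) !== firstMatchWith(stricterPattern, x)
  );
}

// Illustrative inputs; real ones would come from translator test cases.
const inputs = [
  "10.1000/182",
  "10/10",
  "score: 10/10, DOI: 10.1000/182",
];

console.log(behaviourChanges(inputs));
```

Any input that shows up in the output is a case where a caller relying on the current behaviour would see a different return value.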


My own thoughts

In general, the best that any code here could realistically do is to provide some reasonable baseline accuracy. There are bound to be false positives and false negatives when trying to "parse" or clean a DOI, because DOIs don't obey a grammar. The DOI spec basically says they can be anything (printable Unicode graphic characters). A DOI is explicitly said to be an opaque string, and nothing is supposed to be deduced from features of the string alone.

So for a general-purpose utility function like this, we need to set realistic expectations and clearly document exactly what it does. We may have to leave it to the callers (especially translators) to implement their own further processing. The reason is that, given the opaque and arbitrary nature of DOIs and the diverse ways in which they can appear, the best improvements should come not from the sophistication of a generic filter but from domain-specific knowledge, which is what goes into the individual translators.
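To sketch what caller-side processing could look like: suppose a translator targets a site whose article URLs append a view name such as `/full` after the DOI. The site, the URL shape, and both function names below are hypothetical; the point is only that the translator knows something a generic filter cannot.

```javascript
// Standalone copy of the generic cleanDOI logic, for illustration only.
function genericCleanDOI(x) {
  if (typeof x != "string") {
    throw new Error("cleanDOI: argument must be a string");
  }
  const doi = x.match(/10(?:\.[0-9]{4,})?\/[^\s]*[^\s\.,]/);
  return doi ? doi[0] : null;
}

// Hypothetical translator-side helper. It assumes this one site appends
// "/full", "/abstract" or "/pdf" to the DOI in its article URLs.
function cleanDOIForExampleSite(url) {
  const doi = genericCleanDOI(url);
  if (doi === null) {
    return null;
  }
  // Safe only because of site-specific knowledge: elsewhere, "/full"
  // could legitimately be part of the DOI itself.
  return doi.replace(/\/(?:full|abstract|pdf)$/, "");
}

const url = "https://journals.example.org/doi/10.1000/182/full";
console.log(genericCleanDOI(url));        // generic result keeps "/full"
console.log(cleanDOIForExampleSite(url)); // site-aware result drops it
```

Putting the `/full` rule into the shared utility would risk truncating valid DOIs from other sources, which is why it belongs with the caller.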

But we need to check how many of the translators calling cleanDOI agree with this...
