`cleanDOI` accepts malformed DOI prefix; general behaviour when input doesn't match #24

zoe-translates · 2023-04-13T06:15:52Z

Lines 415 to 426 in b93f16d

    
    /**
     * Strip info:doi prefix and any suffixes from a DOI
     * @type String
     */
    cleanDOI: function(/**String**/ x) {
        if(typeof(x) != "string") {
            throw new Error("cleanDOI: argument must be a string");
        }
        var doi = x.match(/10(?:\.[0-9]{4,})?\/[^\s]*[^\s\.,]/);
        return doi ? doi[0] : null;
    },

Because the `\.[0-9]{4,}` group is optional, the regular expression also matches strings like "10/example" or "10/10" (as in ten out of ten, great, A+), which are not valid DOIs: the prefix does not start with "10.".

In general, cleanDOI() returns the first matching substring, if any, and null otherwise. I'm not sure whether this is the intended behaviour that we expect callers to depend on. Its doc comment says "Strip info:doi prefix and any suffixes from a DOI", which isn't very clear to me.
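To make the current behaviour concrete, here is a small standalone check. It copies the regular expression from the snippet above into a free function (in the quoted source it is a property of an object, so the free-standing form is only for illustration), and the sample inputs are made up.

```javascript
// Standalone copy of the logic quoted above, for experimentation only.
function cleanDOI(x) {
	if (typeof x != "string") {
		throw new Error("cleanDOI: argument must be a string");
	}
	var doi = x.match(/10(?:\.[0-9]{4,})?\/[^\s]*[^\s\.,]/);
	return doi ? doi[0] : null;
}

// Made-up sample inputs, not taken from any translator.
var samples = [
	"info:doi/10.1000/182.",           // prefixed DOI with trailing period
	"see 10.1000/182 and 10.1000/183", // two DOIs in one string
	"10/example",                      // no "10." prefix
	"Score: 10/10, great",             // not a DOI at all
	"no DOI here"
];
for (var s of samples) {
	console.log(JSON.stringify(s), "->", JSON.stringify(cleanDOI(s)));
}
```

The first two inputs both yield "10.1000/182" (the trailing period is dropped, and only the first match is returned), the next two yield "10/example" and "10/10", and the last yields null.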

In the translators repository, there are currently 36 translators making use of cleanDOI. I haven't checked the other Zotero components. I think we need to check how many of those calls depend on the current behaviour, before we go on to improve the regexp here.

My own thoughts

In general, the best that any code here can realistically do is provide some reasonable baseline accuracy. There are bound to be false positives and false negatives when trying to "parse" or clean a DOI, because DOIs don't obey a grammar. The DOI spec basically says they can be anything (printable Unicode graphic characters). A DOI is explicitly said to be an opaque string, and nothing is supposed to be deduced from features of the string alone.

So for a general-purpose utility function like this, we need to set realistic expectations and clearly document exactly what it does. We may have to leave it to the callers (especially translators) to implement their own further processing. The reason is that, given the opaque and arbitrary nature of DOIs and the diverse ways they can appear, the best improvements will not come from the sophistication of a generic filter but from domain-specific knowledge, which is what goes into the individual translators.
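As a rough sketch of that split, the generic filter could make the "10." prefix mandatory and leave anything more specific to the caller. The names `cleanDOIStrict` and `hasRegistrantPrefix` are hypothetical, and the assumption of at least four digits after "10." is simply carried over from the existing regexp. This is not a proposed final implementation.

```javascript
// Hypothetical stricter variant: the "10.<digits>" prefix is required.
// Assumption: the registrant code has 4+ digits (as in the current regexp),
// optionally followed by dot-separated numeric subdivisions.
function cleanDOIStrict(x) {
	if (typeof x != "string") {
		throw new Error("cleanDOIStrict: argument must be a string");
	}
	var doi = x.match(/10\.[0-9]{4,}(?:\.[0-9]+)*\/[^\s]*[^\s.,]/);
	return doi ? doi[0] : null;
}

// Hypothetical caller-side check: a translator that knows which prefix
// its source uses can reject anything else.
function hasRegistrantPrefix(doi, prefix) {
	return typeof doi == "string" && doi.startsWith(prefix + "/");
}

var doi = cleanDOIStrict("info:doi/10.1000/182.");
console.log(doi);                                 // cleaned DOI
console.log(hasRegistrantPrefix(doi, "10.1000")); // domain-specific check
console.log(cleanDOIStrict("Score: 10/10"));      // no longer matches
```

Even the stricter pattern will still produce false positives and negatives, for example with DOIs that contain whitespace or end in a period, which is why the domain-specific check stays with the caller.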

But we need to check how many of the translators calling cleanDOI agree with this...

northword mentioned this issue Jan 4, 2024

[Bug] cleanDOI may mishandle DOIs northword/zotero-format-metadata#114

Closed
