Web of Science data schema
Issues found:
-
Fields appear in multiple places in the XML files. See this breakdown: https://github.com/iuni-cadre/DataPipelineAndProvenanceForCADRE/blob/master/wosParseYan/WoSfieldTagsCompact.csv
-
Duplicate records were found in the wos_summary_names table when keyed on the two indices "id" and "seq_no". The original SQL parser may have parsed some records redundantly, because paths can be nested and matched more than once: https://github.com/cns-iu/generic_parser/blob/master/generic_parser.py
-
Duplicate records still exist after applying DISTINCT on "id" and "seq_no". In cases where the role of an "author" entry is not "author", group or corporate authors can appear under the same name with different "seq_no" values. For example: https://atlas.cern/discover/collaboration https://journals.aps.org/prd/abstract/10.1103/PhysRevD.101.012002
-
Author-address mapping is unreliable before 2008. Is there an "institution enhanced" label in the data set?
-
We need labels for back-files for more granular access control
-
References with fractional numbers represent citations outside of the WoS collection
-
Paragraphs and keywords need to be concatenated from the individual paragraphs; some duplication exists
-
We need a detailed official data dictionary. Right now we only have the 2013 version. https://iuni.iu.edu/files/WoS_Documents/WoKRawXML20130509.pdf
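
The duplicate-record and fractional-reference issues above can be checked with a short script. The following is only a minimal sketch, not the pipeline's actual code: the sample rows, the "role" and "name" column names, the role value "corp", and the reference ID format are all hypothetical, and only the "id" and "seq_no" columns come from the notes above. It counts rows that share a key, drops repeats on ("id", "seq_no"), then shows that the same name can still remain under different "seq_no" values. It also flags reference IDs that end in a fractional part, assuming that the fraction is written as a dot followed by digits.

```python
import re
from collections import Counter

# Hypothetical rows standing in for wos_summary_names.
# Only "id" and "seq_no" are column names taken from the notes;
# "role", "name", and all values are illustrative assumptions.
rows = [
    {"id": "WOS:A", "seq_no": 1, "role": "author", "name": "Doe, J"},
    {"id": "WOS:A", "seq_no": 1, "role": "author", "name": "Doe, J"},
    {"id": "WOS:A", "seq_no": 2, "role": "corp", "name": "ATLAS Collaboration"},
    {"id": "WOS:A", "seq_no": 3, "role": "corp", "name": "ATLAS Collaboration"},
]


def duplicate_keys(records, key_fields):
    """Return {key: count} for every key that occurs more than once."""
    counts = Counter(tuple(r[f] for f in key_fields) for r in records)
    return {key: n for key, n in counts.items() if n > 1}


def dedupe(records, key_fields):
    """Keep the first record seen for each key, preserving input order."""
    seen = set()
    kept = []
    for r in records:
        key = tuple(r[f] for f in key_fields)
        if key not in seen:
            seen.add(key)
            kept.append(r)
    return kept


# Assumption: a "fractional" reference ID ends in a dot followed by digits.
FRACTIONAL_REF = re.compile(r"\.\d+$")


def is_fractional_ref(ref_id):
    """True if the reference ID ends with a fractional part."""
    return bool(FRACTIONAL_REF.search(ref_id))


# Exact repeats on ("id", "seq_no"), as described in the first duplicate note.
exact_dupes = duplicate_keys(rows, ("id", "seq_no"))

# After DISTINCT on ("id", "seq_no"), the same name can still repeat.
distinct_rows = dedupe(rows, ("id", "seq_no"))
name_dupes = duplicate_keys(distinct_rows, ("id", "name"))
```

Grouping on ("id", "name") after deduplication is one possible way to surface the group/corporate author case; whether those rows should be merged depends on how "seq_no" is meant to be used downstream.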