Web of Science data schema

Issues found:

Fields appear in multiple places in the XML files. See this link for a breakdown: https://github.com/iuni-cadre/DataPipelineAndProvenanceForCADRE/blob/master/wosParseYan/WoSfieldTagsCompact.csv
Duplicate records were found in the wos_summary_names table when keyed on the two indices "id" and "seq_no". The original SQL parser may have parsed some elements redundantly, because paths can be nested and matched more than once: https://github.com/cns-iu/generic_parser/blob/master/generic_parser.py
Duplicate records still exist even with DISTINCT "id" and "seq_no". In cases where the role of an "author" entry is not "author", group or corporate authors can lead to the same name appearing with different "seq_no" values. For example: https://atlas.cern/discover/collaboration https://journals.aps.org/prd/abstract/10.1103/PhysRevD.101.012002
Author-address mapping is unreliable before 2008. Is there an "institution enhanced" label in the data set?
We need labels for back-files for more granular access control
References with fractional numbers represent citations outside of the WoS collection
Paragraphs and keywords need to be concatenated from the individual paragraphs; some duplication exists
We need a detailed official data dictionary. Right now we only have the 2013 version. https://iuni.iu.edu/files/WoS_Documents/WoKRawXML20130509.pdf
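
A minimal sketch of how the duplicate and concatenation issues above could be checked, assuming the rows have already been loaded into Python dictionaries. Only "id" and "seq_no" come from the notes above; the "role" and "display_name" keys, the sample values, and the function names are hypothetical and used for illustration only. This is not the logic of the original parser.

```python
from collections import Counter

# Hypothetical sample rows. Only "id" and "seq_no" are names taken from the
# notes above; "role", "display_name" and all values are made up.
rows = [
    {"id": "WOS:0001", "seq_no": 1, "role": "author", "display_name": "Doe, J"},
    {"id": "WOS:0001", "seq_no": 1, "role": "author", "display_name": "Doe, J"},
    {"id": "WOS:0001", "seq_no": 2, "role": "author", "display_name": "Roe, R"},
    {"id": "WOS:0002", "seq_no": 1, "role": "author", "display_name": "ATLAS Collaboration"},
    {"id": "WOS:0002", "seq_no": 2, "role": "corp", "display_name": "ATLAS Collaboration"},
]


def duplicate_keys(rows):
    """Return the (id, seq_no) pairs that occur more than once."""
    counts = Counter((r["id"], r["seq_no"]) for r in rows)
    return sorted(key for key, n in counts.items() if n > 1)


def repeated_names(rows):
    """Return (id, display_name) pairs that appear under more than one seq_no."""
    seqs = {}
    for r in rows:
        seqs.setdefault((r["id"], r["display_name"]), set()).add(r["seq_no"])
    return sorted(key for key, s in seqs.items() if len(s) > 1)


def join_paragraphs(paragraphs):
    """Concatenate paragraphs in their original order, dropping exact repeats."""
    seen = set()
    kept = []
    for p in paragraphs:
        text = p.strip()
        if text and text not in seen:
            seen.add(text)
            kept.append(text)
    return " ".join(kept)


print(duplicate_keys(rows))
print(repeated_names(rows))
print(join_paragraphs(["First part.", "First part.", "Second part."]))
```

The first check flags keys that a DISTINCT on "id" and "seq_no" would collapse; the second flags the group or corporate author case, where the same name survives under different "seq_no" values. The paragraph join only removes exact repeats, so near-duplicates would still need manual review.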
