Web of Science data schema

XiaoranYan edited this page Feb 18, 2020 · 17 revisions

Issues found:

  1. Fields appear in multiple places in the XML files. See this link for a breakdown: https://github.com/iuni-cadre/DataPipelineAndProvenanceForCADRE/blob/master/wosParseYan/WoSfieldTagsCompact.csv

  2. Duplicate records were found in the wos_summary_names table when keyed on the two indices "id" and "seq_no". The original SQL parser may have parsed some elements redundantly, since paths can be nested and matched more than once: https://github.com/cns-iu/generic_parser/blob/master/generic_parser.py

  3. Duplicate records still exist after applying DISTINCT on "id" and "seq_no". In cases where the role of an "author" is not "author", group/corporate authors can lead to the same name appearing with different "seq_no" values. For example: https://atlas.cern/discover/collaboration https://journals.aps.org/prd/abstract/10.1103/PhysRevD.101.012002

  4. Author-address mapping is unreliable before 2008. Is there an "institution enhanced" label in the data set?

  5. We need labels for back-files to allow more granular access control.

  6. References with fractional numbers represent citations outside the WoS collection.

  7. Paragraphs and keywords need to be concatenated from the paragraphs; some duplication exists.

  8. We need a detailed official data dictionary. Right now we only have the 2013 version. https://iuni.iu.edu/files/WoS_Documents/WoKRawXML20130509.pdf
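
The two kinds of duplication described in issues 2 and 3 can be checked with a pair of GROUP BY queries. The following is a minimal sketch, not the project's actual pipeline: it uses an in-memory SQLite table, only the table name wos_summary_names and the columns "id" and "seq_no" come from the notes above, and the "role" and "display_name" columns, the role value "group", and all sample rows are hypothetical placeholders.

```python
import sqlite3

# In-memory stand-in for the real database; the schema is an assumption
# except for the table name and the "id" / "seq_no" columns.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE wos_summary_names "
    "(id TEXT, seq_no INTEGER, role TEXT, display_name TEXT)"
)

# Made-up rows for illustration only.
rows = [
    ("WOS:A", 1, "author", "Doe, J"),
    ("WOS:A", 1, "author", "Doe, J"),  # same (id, seq_no) twice: issue 2
    ("WOS:A", 2, "author", "Roe, R"),
    ("WOS:B", 1, "group", "Example Collaboration"),
    ("WOS:B", 2, "group", "Example Collaboration"),  # same name, new seq_no: issue 3
]
conn.executemany("INSERT INTO wos_summary_names VALUES (?, ?, ?, ?)", rows)

# Issue 2: (id, seq_no) pairs that occur more than once.
dup_keys = conn.execute(
    """
    SELECT id, seq_no, COUNT(*) AS n
    FROM wos_summary_names
    GROUP BY id, seq_no
    HAVING COUNT(*) > 1
    ORDER BY id, seq_no
    """
).fetchall()

# Issue 3: names that appear under several seq_no values within one record,
# which DISTINCT on (id, seq_no) does not remove.
repeated_names = conn.execute(
    """
    SELECT id, display_name, COUNT(DISTINCT seq_no) AS n_seq
    FROM wos_summary_names
    GROUP BY id, display_name
    HAVING COUNT(DISTINCT seq_no) > 1
    ORDER BY id, display_name
    """
).fetchall()

print("duplicate (id, seq_no) keys:", dup_keys)
print("names repeated across seq_no:", repeated_names)
```

Running the same two queries against the real table (with the actual name column substituted) would give a count of each kind of duplicate before deciding how to deduplicate.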