Data Consistency and Data Pipeline Issues
This wiki page records how data integrity needs to be preserved while loading data into the different databases. Ideally we do not touch or change the data at all; if we absolutely have to change it, we need to agree on a standard so that our data stays consistent.
Link for the MAG Parser: https://github.com/iuni-cadre/MicrosoftAcademicGraphParser/blob/master/mag2parser/src/main/java/iuni/msacademics/parser/MSAcademicsParser.java
Link for the WOS XML Parser: https://github.com/cns-iu/generic_parser/blob/master/generic_parser.py
In Postgres we are trying to find a delimiter and a quote character that will let us load the data without making any changes to it. The only constraint on the delimiter and the quote character is that each has to be a single one-byte character.
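One way to search for such characters is to scan the raw file and list the single-byte values that never occur in it; any two of them could then be tried as the delimiter and the quote character. The following is only a minimal sketch of that idea, not the procedure actually used for our loads. The function names `read_chunks` and `unused_single_byte_chars` are hypothetical, and the sketch assumes the file is UTF-8 encoded, so only byte values 1 to 127 are single-byte characters.

```python
def read_chunks(path, chunk_size=1 << 20):
    """Yield the raw bytes of a file in fixed-size chunks (hypothetical helper)."""
    with open(path, "rb") as fh:
        while True:
            chunk = fh.read(chunk_size)
            if not chunk:
                break
            yield chunk


def unused_single_byte_chars(chunks):
    """Return the byte values in 1..127 that never occur in the given chunks.

    Assumption: the data is UTF-8, so values 1..127 are single-byte characters.
    NUL, newline and carriage return are excluded because they are not usable
    as a delimiter.
    """
    seen = set()
    for chunk in chunks:
        seen.update(chunk)  # iterating over bytes yields integers
    excluded = {0, 10, 13}  # NUL, \n, \r
    return [b for b in range(1, 128) if b not in seen and b not in excluded]


# Usage with in-memory sample data; for a real file, pass read_chunks(path).
sample = [b"id,abstract\n", b'1,"He said ""hi"" | ok"\n']
candidates = unused_single_byte_chars(sample)
delimiter, quote = candidates[0], candidates[1]  # the two must be different
print(f"delimiter=0x{delimiter:02x} quote=0x{quote:02x}")
```

Characters found this way still need to be checked against the restrictions of the Postgres COPY command for the chosen format before they are used.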
Ben preprocessed the Abstracts.csv file using sed so that it can be loaded into the Postgres database: https://github.com/iuni-cadre/cadre-wiki/wiki/Postgresql-Copy-command