Ideas for WoS Database design
Ideas for WoS Database
-
Property Graph Database. This is what we're working on with the RDS group. I've developed a preliminary schema based on the MAG data set. There are a few different ways a property graph could work:
(a) A complete copy of the data stored in the graph database, with distinct node classes, relationships between nodes, and multiple properties stored on each node. This is the approach currently being taken with MAG in the benchmarking study.
(b) Only papers and citation/reference relationships. Since the publication is the discrete unit in the XML file, each record can become a node, with all the metadata associated with each paper stored as properties on that node. This should be much simpler to build from the WoS data but will limit network queries to citations only.
(c) Only citation relationships and unique ids are stored in the database. This model would function more as an extension of a traditional RDBMS: it would store only the citation network and return ids, which would then need to be looked up in the other database. This method would be the simplest to implement from a database perspective but would require more work in the middleware.
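To make the trade-off in option (c) concrete, here is a minimal in-memory sketch in Python. It is not the schema being built with the RDS group; the class `CitationGraph`, the helper `resolve`, and the sample ids are all hypothetical, and a plain dict stands in for the separate metadata database.

```python
from collections import defaultdict


class CitationGraph:
    """Hypothetical id-only store: paper ids and directed citation edges."""

    def __init__(self):
        self.references = defaultdict(set)  # citing id -> ids it cites
        self.cited_by = defaultdict(set)    # cited id -> ids that cite it

    def add_citation(self, citing_id, cited_id):
        self.references[citing_id].add(cited_id)
        self.cited_by[cited_id].add(citing_id)

    def citing_ids(self, paper_id):
        """Return the ids of papers that cite paper_id (ids only, no metadata)."""
        return sorted(self.cited_by.get(paper_id, ()))


def resolve(ids, metadata_store):
    """Middleware step: look up ids returned by the graph in the other database."""
    return [metadata_store[i] for i in ids if i in metadata_store]


# Stand-in for the separate metadata database (assumed sample data).
metadata_store = {
    "P1": {"title": "Paper one", "year": 2001},
    "P2": {"title": "Paper two", "year": 2005},
    "P3": {"title": "Paper three", "year": 2010},
}

graph = CitationGraph()
graph.add_citation("P2", "P1")
graph.add_citation("P3", "P1")
graph.add_citation("P3", "P2")

ids = graph.citing_ids("P1")            # graph query returns ids only
records = resolve(ids, metadata_store)  # second lookup happens in middleware
```

The graph side stays small and simple, but every query needs the extra `resolve` step, which is the additional middleware work mentioned in (c).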
-
Document database. This method is based on converting the XML data to JSON and storing that. This can be done in PostgreSQL using the JSONB format or with a native document database like MongoDB. Because XML is structurally similar to JSON, this requires less manipulation of the data and less time spent on schema design. The difficulty may be searching the data; it's not clear to me what the performance of this type of database would be. The other challenge is splitting the XML files from Clarivate into a separate JSON document for each publication for indexing and searching. I started this process by loading the data into PostgreSQL, but it crashed after 44 million records. This type of database could be used in conjunction with option (c) of the property graph approach. It might also work well with something like Lucene to accelerate search. It would be much less work to load the data from WoS, but MAG comes as TSV and so would need to be converted.
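The step of splitting one large XML file into one JSON document per publication could look roughly like the sketch below. It uses only the Python standard library and streams the file so the whole export never has to sit in memory. The tag names (`records`, `record`, `title`, `ref`) and the sample data are assumptions for illustration; the real Clarivate files use their own element names and may use XML namespaces, which would change the tag comparison.

```python
import io
import json
import xml.etree.ElementTree as ET


def element_to_dict(elem):
    """Recursively convert an element into a dict of attributes, text, and children."""
    node = dict(elem.attrib)
    text = (elem.text or "").strip()
    children = list(elem)
    if not children and not node:
        return text  # a leaf with no attributes becomes a plain string
    if text:
        node["#text"] = text
    for child in children:
        value = element_to_dict(child)
        if child.tag in node:
            # Repeated tags are collected into a list.
            if not isinstance(node[child.tag], list):
                node[child.tag] = [node[child.tag]]
            node[child.tag].append(value)
        else:
            node[child.tag] = value
    return node


def iter_record_json(source, record_tag="record"):
    """Yield one JSON string per publication record, streaming through the XML."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == record_tag:
            yield json.dumps(element_to_dict(elem))
            elem.clear()  # drop the record's children once it has been emitted


# Assumed sample input, not real WoS data.
sample = b"""<records>
  <record id="A1"><title>First paper</title><ref>B7</ref><ref>C9</ref></record>
  <record id="A2"><title>Second paper</title></record>
</records>"""

docs = [json.loads(d) for d in iter_record_json(io.BytesIO(sample))]
```

Each yielded string could then be inserted as one row in a JSONB column or one document in a document database. Loading in batches rather than in a single transaction may help with failures partway through a large load, though that is a guess and not a diagnosis of the crash described above.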
-
RDF Graph database. This database would use a similar database engine to the property graph but would have a very different schema. Based on the semantic web, RDF requires organizing the data into a nested hierarchy, or ontology. The advantage of this approach is that ontologies have already been developed for scholarly data, and the same ontology can be applied to different data sets in the same domain, e.g. WoS and MAG. Additionally, XML to RDF is a relatively simple conversion. The challenge would be the performance and complexity of the resulting graph. In the RDF model, every value is represented as its own subject-predicate-object triple rather than as a property stored on a node, which multiplies the number of statements and can make storing complex data sets more challenging.
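The growth in graph size can be illustrated with a small Python sketch that expands one publication record into triples. This uses plain tuples, not an RDF library, and the `paper:` and `ex:` prefixes, the predicate names, and the sample record are hypothetical; a real conversion would take its predicates from an existing scholarly ontology.

```python
def record_to_triples(paper_id, record):
    """Expand one record into (subject, predicate, object) triples."""
    subject = f"paper:{paper_id}"
    triples = []
    for key, value in record.items():
        # Multi-valued fields produce one triple per value.
        values = value if isinstance(value, list) else [value]
        for v in values:
            triples.append((subject, f"ex:{key}", v))
    return triples


# Assumed sample record, not real WoS or MAG data.
record = {
    "title": "First paper",
    "year": 2001,
    "cites": ["paper:B7", "paper:C9"],
}

triples = record_to_triples("A1", record)
```

In a property graph this record would be one node with two properties and two citation edges. In the triple form it becomes four separate statements, and the count grows with every additional metadata field.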