joss/paper.md (+2 -2)
@@ -44,7 +44,7 @@ Loading, cleansing, and organizing data can dominate the time spent on a data sc
Existing extract, transform, and load (ETL) technologies such as [Microsoft SQL Server Integration Services](https://docs.microsoft.com/en-us/sql/integration-services/sql-server-integration-services) help with data staging. Similarly, data manipulation tools like [pandas](https://pandas.pydata.org) facilitate transformation of series and matrix data. **Carnival** distinguishes itself by offering a lightweight data caching mechanism coupled with data manipulation services built on a property graph rather than arrays and data frames. Graphs present an alternative to relational data structures, one that more naturally represents complex and highly relational data and is more adaptive to change. A property graph database is an implementation of the graph structure that represents data as nodes and directed edges (relationships) between the nodes, where nodes and edges can have properties (key/value pairs) associated with them. Carnival’s combination of features and graph data representation empowers informaticians and programmers working in complex data domains to build pipelines, utilities, and applications that are comparatively richer in semantics and provenance.
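The property graph structure described above can be illustrated with a minimal sketch in plain Python. This is not Carnival's API or TinkerPop's; the class names, the `out_neighbors` helper, the labels, and the property values are all hypothetical and exist only to show nodes, directed edges, and key/value properties on both.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Node:
    """A vertex with a label and arbitrary key/value properties."""
    label: str
    properties: dict[str, Any] = field(default_factory=dict)


@dataclass
class Edge:
    """A directed relationship from `source` to `target`, with its own properties."""
    label: str
    source: Node
    target: Node
    properties: dict[str, Any] = field(default_factory=dict)


def out_neighbors(node: Node, edges: list[Edge], label: str) -> list[Node]:
    """Follow outgoing edges with the given label from `node`."""
    return [e.target for e in edges if e.source is node and e.label == label]


# Hypothetical example data: one patient participating in one encounter.
patient = Node("Patient", {"id": "p1", "age": 54})
encounter = Node("Encounter", {"date": "2020-03-01"})
edges = [Edge("PARTICIPATED_IN", patient, encounter, {"role": "patient"})]

reached = out_neighbors(patient, edges, "PARTICIPATED_IN")
```

Because edges are directed, following `PARTICIPATED_IN` from the encounter node yields nothing, while following it from the patient node reaches the encounter.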
-Knowledge bases in Resource Description Framework (RDF) triplestores can be valuable tools to harmonize and enrich complex data. Transforming source relational data into RDF triples reflecting a data model is challenging. While there exist relational-to-RDF mappers such as Karma[@10.1007/978-3-662-46641-4_40], the configuration process is labor intensive and the resulting triples may not match a data model particularly one of sufficient complexity.
+Knowledge bases in Resource Description Framework (RDF) triplestores can be valuable tools to harmonize and enrich complex data. Transforming source relational data into RDF triples reflecting a data model is challenging. While there exist relational-to-RDF mappers such as Karma[@10.1007/978-3-662-46641-4_40], the configuration process is labor intensive and the resulting triples may not match a data model particularly one of sufficient complexity.
**Carnival** was developed to create domain-specific property graph data models, and provide tools to create robust pipelines to import and manage data in that model. There are two main components to Carnival. The primary component is a layer built on top of [Apache Tinkerpop](https://tinkerpop.apache.org) that seeks to provide more standardized and semantically driven methods of interacting with a property graph. An additional component is a data caching mechanism that supports the efficient aggregation of data from disparate sources.
@@ -61,7 +61,7 @@ Knowledge bases in Resource Description Framework (RDF) triplestores can be valu
Carnival was initially developed to facilitate the production of analytical data sets for human subjects research. The source data repositories included a relational data warehouse accessible by SQL, a REDCap [@HARRIS2019103208; @HARRIS2009377] installation accessible by API, and manually curated data files in CSV format. Data pertaining to the set of study subjects was distributed across each of these data sources. Using Carnival, a data pipeline was implemented to pull data from the data sources, instantiate them in a property graph, clean and harmonize them, and produce analytical data sets at required intervals.
#### Queries over enriched data
-A key challenge of human subjects research is to locate patients to recruit to a study, frequently done by searching a research data set containing raw patient data. Potential recruits need to be stratified by attributes, such as age, race, and ethnicity, matched against inclusion criteria, such as the presence of a diagnosis code, and filtered by exclusion criteria, such as a treatment modality. **Carnival** has been used effectively in this area by loading the relevant raw data into a graph, stratifying and categorizing patients by the relevant criteria, then using graph traversals to extract the patients who are potential recruits[@FREEDMAN2020100086; @carnivalcohort].
+A key challenge of human subjects research is to locate patients to recruit to a study, frequently done by searching a research data set containing raw patient data. Potential recruits need to be stratified by attributes, such as age, race, and ethnicity, matched against inclusion criteria, such as the presence of a diagnosis code, and filtered by exclusion criteria, such as a treatment modality. **Carnival** has been used effectively in this area by loading the relevant raw data into a graph, stratifying and categorizing patients by the relevant criteria, then using graph traversals to extract the patients who are potential recruits[@FREEDMAN2020100086; @carnivalcohort].
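The recruitment logic in this paragraph (stratify, match inclusion criteria, filter by exclusion criteria) can be sketched in plain Python. This is a simplified stand-in that uses dictionaries rather than graph traversals, and it does not reflect Carnival's actual implementation. The patient records, the diagnosis and treatment values, the age cutoff, and the function names are all assumptions made for illustration.

```python
# Hypothetical patient records; a real pipeline would read these from a graph.
patients = [
    {"id": "p1", "age": 34, "diagnoses": {"E11"}, "treatments": set()},
    {"id": "p2", "age": 67, "diagnoses": {"E11"}, "treatments": {"insulin"}},
    {"id": "p3", "age": 71, "diagnoses": {"I10"}, "treatments": set()},
    {"id": "p4", "age": 70, "diagnoses": {"E11", "I10"}, "treatments": set()},
]


def age_stratum(age: int) -> str:
    """Assign a patient to an age stratum (the cutoff of 65 is an assumption)."""
    return "under_65" if age < 65 else "65_and_over"


def potential_recruits(patients, inclusion_code, exclusion_treatment):
    """Group eligible patient ids by age stratum.

    A patient is eligible when they have the inclusion diagnosis code
    and have not received the exclusion treatment.
    """
    strata = {}
    for p in patients:
        if inclusion_code not in p["diagnoses"]:
            continue  # fails the inclusion criterion
        if exclusion_treatment in p["treatments"]:
            continue  # removed by the exclusion criterion
        strata.setdefault(age_stratum(p["age"]), []).append(p["id"])
    return strata


recruits = potential_recruits(patients, "E11", "insulin")
```

In a graph setting, each of these checks would correspond to a traversal step from a patient node to diagnosis or treatment nodes rather than a set membership test.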
#### Integration with [OBO Foundry](https://obofoundry.org) Ontologies
We drew upon ontology modeling in the OBO Foundry as inspiration for the Carnival graph data model. For example, a ‘process’ is an event that occurs at some time on some material entity. A ‘planned process’ extends ‘process’ to include a pre-defined plan, participants, inputs, and outputs. In the Carnival graph, healthcare encounters are modeled as planned processes, where participants include the patient and clinician and the outputs may be diagnoses and medications.
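The relationship between ‘process’, ‘planned process’, and healthcare encounters can be sketched as a small type hierarchy. This is only an illustration of the modeling idea in plain Python; it is not the Carnival graph model or an OBO Foundry artifact, and every class name, field name, and value below is hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Process:
    """An event that occurs at some time on some material entity."""
    occurs_at: str   # time of the event
    occurs_on: str   # the material entity involved


@dataclass
class PlannedProcess(Process):
    """A process that also carries a plan, participants, inputs, and outputs."""
    plan: str = ""
    participants: list[str] = field(default_factory=list)
    inputs: list[str] = field(default_factory=list)
    outputs: list[str] = field(default_factory=list)


@dataclass
class HealthcareEncounter(PlannedProcess):
    """A healthcare encounter modeled as a planned process."""


# Hypothetical encounter: patient and clinician participate,
# and the outputs are a diagnosis and a medication.
encounter = HealthcareEncounter(
    occurs_at="2020-03-01",
    occurs_on="patient:p1",
    plan="outpatient visit",
    participants=["patient:p1", "clinician:c7"],
    outputs=["diagnosis:E11", "medication:metformin"],
)
```

In a property graph, the participants and outputs would more naturally be separate nodes connected to the encounter node by labeled edges, rather than lists of strings on one object.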