Commit
dataset-RDFlib
antaldaniel committed Dec 25, 2024
1 parent 2181c20 commit 75bedc3
Showing 5 changed files with 74 additions and 4 deletions.
Binary file modified data/iris_dataset.rda
Binary file not shown.
1 change: 0 additions & 1 deletion tests/testthat/test-creator.R
@@ -11,7 +11,6 @@ test_that("creator() <- value works with overwrite", {
})



test_that("creator() <- value works without overwrite", {
iris_dataset_3 <- iris_dataset
creator(x=iris_dataset_3, overwrite=FALSE) <- person("Jane", "Doe")
3 changes: 2 additions & 1 deletion tests/testthat/test-n_triple.R
@@ -34,7 +34,8 @@ test_that("create_iri()", {
role = c("aut", "cre"),
comment = c(ORCID = "0000-0001-7513-6760"))
expect_error(create_iri(list(a=1:2)))
expect_equal(create_iri(as.POSIXct(10000, origin = "2024-01-01", tz="UTC")), "\"2024-01-01T03:46:40Z\"^^<xs:dateTime>")
expect_output(print(create_iri(as.POSIXct(10000, origin = "2024-01-01", tz="UTC"))), "2024-01-01T03:46:40Z")
expect_output(print(create_iri(as.POSIXct(10000, origin = "2024-01-01", tz="UTC"))), "\\^\\^<xs:dateTime>")
expect_equal(create_iri(author_person), "<https://orcid.org/0000-0001-7513-6760>")
jane_doe <- person(given="Jane", family="Doe", role = "aut", email = "example@example.com")
expect_equal(create_iri(x=jane_doe), "\"Jane Doe [aut]\"^^<http://www.w3.org/2001/XMLSchema#string>")
1 change: 1 addition & 0 deletions tests/testthat/test-publication_year.R
@@ -4,6 +4,7 @@ test_that("publication_year() works", {
expect_warning(publication_year(iris_dataset, overwrite=F) <- 1934)
})


value <- 1936

test_that("publication_year() <- assignment works", {
73 changes: 71 additions & 2 deletions vignettes/rdf.Rmd
@@ -19,15 +19,70 @@ library(dataset)
library(rdflib)
```

**RDF** (Resource Description Framework) annotation significantly enhances the interoperability and exchangeability of datasets in data repositories by using a standardised, machine-readable format for describing and linking data. This vignette shows how to combine the _dataset_ package with [rdflib](https://docs.ropensci.org/rdflib/index.html), an R-user-friendly rOpenSci wrapper around the _redland_ library, to perform common tasks on RDF data: parsing and converting between formats including rdfxml, turtle, nquads, ntriples, and trig, creating RDF graphs, and performing SPARQL queries.

## Standardised Semantic Framework
RDF provides a common framework to describe resources and their relationships using triples (subject-predicate-object). This standardisation ensures that data from different systems can be understood in a unified way, regardless of the original source or format. Notice that this format is a stricter version of the tidy dataset concept: not only is every observation in a row, but there are always exactly three columns.

```{r iris}
head(iris)[1:2,]
```
Instead of placing each measurement of an observed flower at the intersection of a row and a column, the triple format puts the observed unit, the measured variable and the value next to each other:

- the first flower's sepal length is 5.1
- the second flower's sepal length is 4.9


```{r triples}
dataset_to_triples(iris[1:2,])[1:10,]
```
We describe the `dataset_df` datasets in such triples, where each triple is a semantic statement: it connects a single observation unit with a single measurement.
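
The reshaping described above can be sketched in base R. This is only an illustration of the idea, not the implementation behind `dataset_to_triples()`; the object names `wide` and `triples_sketch` and the column names `s`, `p` and `o` are assumptions made for this example.

```{r triples-sketch}
# Two flowers and two measured variables from the built-in iris data
wide <- head(iris, 2)[, c("Sepal.Length", "Sepal.Width")]

# One row per statement: which flower (s), which variable (p), which value (o)
triples_sketch <- data.frame(
  s = rep(rownames(wide), times = ncol(wide)),        # subject: the observed flower
  p = rep(colnames(wide), each = nrow(wide)),         # predicate: the measured variable
  o = as.character(unlist(wide, use.names = FALSE)),  # object: the measured value
  stringsAsFactors = FALSE
)

triples_sketch
```

Two rows and two columns of the wide table become four statements, each held in exactly three columns.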

## Enhanced Interoperability
RDF uses globally unique identifiers (URIs) for resources, ensuring that different datasets can reference the same entities unambiguously. This allows seamless data integration and querying across repositories, even if the datasets come from diverse domains.

Our `defined` class supports this enhanced interoperability. In the example below, an application can look up that the numeric values in your table conform to the statistical definition of GDP, and that they are expressed in millions of dollars; this means that you have to multiply them by 1000 if you want to join them with other data expressed in thousands of dollars.

```{r defined}
gdp_vector <- defined(
c(3897, 7365, 6753),
label = "Gross Domestic Product",
unit = "https://rdf.vegdata.no/V440/v440-doc/v440-brudata-owl-doc/unit_MillionUSD.html",
definition = "http://data.europa.eu/83i/aa/GDP"
)
```
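
The unit conversion mentioned above is simple arithmetic once the unit is known. The sketch below uses plain numeric vectors rather than the `defined` class, and the variable names are hypothetical.

```{r unit-conversion}
# The same three values as above, known to be in millions of US dollars
gdp_million_usd <- c(3897, 7365, 6753)

# One million equals one thousand thousands, so multiply by 1000
gdp_thousand_usd <- gdp_million_usd * 1000

gdp_thousand_usd
```

Without a machine-readable unit, a program has no way of knowing that this multiplication is needed before the two sources are joined.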

There are several ways to add permanent identifiers to observational units, variable definitions, and specific observed values. The simplest standard format for writing them into a plain text file that you can share online (though certainly not the easiest for a human to read) is the [RDF 1.1 N-Triples](https://www.w3.org/TR/n-triples/) format. The N-Triples format uses URIs (formatted similarly to URLs) for definitions that can be looked up in an online resource. These can be combined with literal values, which may carry a datatype annotation that tells a system whether to read them back as strings, doubles, integers, dates or date-times.

```{r ntriples}
n_triple(s="https://doi.org/10.5281/zenodo.10396807", # permanent, global ID of the dataset
p="http://purl.org/dc/terms/description", # library definition of 'description'
o="The famous (Fisher's or Anderson's) iris data set.") # literal string
```
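
To show what a typed literal looks like, the following sketch writes one N-Triples statement by hand. The helper `typed_statement()` is hypothetical and not part of the _dataset_ package; it does not escape quotation marks or backslashes in the literal, and the `example.com` URIs are placeholders.

```{r ntriples-sketch}
# Hypothetical helper for illustration: <subject> <predicate> "literal"^^<datatype> .
typed_statement <- function(s, p, o, datatype) {
  sprintf('<%s> <%s> "%s"^^<%s> .', s, p, o, datatype)
}

# A measured value stored as text, with the information that it is a double
typed_statement(
  s = "https://example.com/iris#observation1",      # placeholder subject URI
  p = "https://example.com/iris#Sepal.Length",      # placeholder predicate URI
  o = "5.1",
  datatype = "http://www.w3.org/2001/XMLSchema#double"
)
```

The part after `^^` is what allows a reader of the file to turn `"5.1"` back into a number instead of a string.
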
## Richer metadata
RDF supports linking datasets through shared URIs, enabling the creation of interconnected knowledge graphs. Linked Data principles help relate datasets in meaningful ways, making it easier to discover, navigate, and integrate information. RDF annotations allow datasets to include detailed metadata about their structure, provenance, usage rights, and content. This metadata provides critical context, enabling automated tools to interpret and process the data effectively.

Most scientific researchers are familiar with the *findability*, *accessibility*, *interoperability*, and *reuse* of data. Your dataset improves significantly on all four if you add standard metadata used by libraries globally (following the Dublin Core standards) or by data repositories (following the DataCite standards). Such standards use globally shared definitions of how a title or a subtitle should be added to your dataset, or how you can add keywords with IRIs that every user interprets the same way, even if they do not speak English or your language.

RDF supports the use of ontologies and controlled vocabularies (e.g., DataCite, Dublin Core, Schema.org), allowing datasets to be described consistently within and across domains.

The `as_dublincore` function exports your dataset's metadata in the Dublin Core format, and `as_datacite` in the DataCite format. Some of the metadata are generated behind the scenes, for example, timestamps or size measurements.

```{r bibliography}
as_dublincore(iris_dataset, type="ntriples")
```

Interoperability and reusability can increase further if the next user can trust your dataset and has to perform fewer checks on it, or can reproduce what you did. Data provenance is the metadata that provides a comprehensive record of the origins, history, and transformations of data throughout its lifecycle. Our `provenance` function records some of this information automatically, and allows you to add more, for example, about your data sources, the R packages used, the persons involved in the creation and review process, or the statistical transformations carried out.

```{r prov}
provenance(iris_dataset)
```
## Adding your dataset into an RDF triplestore

RDF data can be stored in triplestores and queried using SPARQL, a powerful query language.
This makes it easier to retrieve specific subsets of data or infer new information based on existing annotations.


```{r rdf}
# initialise an rdf triplestore:
dataset_describe <- rdf()
@@ -45,3 +100,17 @@ rdf_parse(rdf = dataset_describe, doc=temp_prov, format="ntriples")
dataset_describe
```
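
Once statements are in a triplestore, a SPARQL query can retrieve a subset of them. The sketch below builds a small, separate graph with `rdf_add()` so that it does not depend on the chunks above, then queries it with `rdf_query()`; the names `example_graph`, `sparql` and `descriptions` are assumptions made for this example.

```{r sparql-sketch}
library(rdflib)

# A small graph with a single statement about the dataset
example_graph <- rdf()
rdf_add(example_graph,
        subject   = "https://doi.org/10.5281/zenodo.10396807",
        predicate = "http://purl.org/dc/terms/description",
        object    = "The famous (Fisher's or Anderson's) iris data set.")

# Select every resource that has a Dublin Core description
sparql <- 'SELECT ?dataset ?description
           WHERE { ?dataset <http://purl.org/dc/terms/description> ?description . }'

descriptions <- rdf_query(example_graph, sparql)
descriptions
```

The result is a data frame with one column per query variable, so it can be processed further with ordinary R tools.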

By using RDF, datasets can be exchanged as interoperable graphs (e.g., in formats like RDF/XML, Turtle, or JSON-LD).

```{r jsonld}
options(rdf_print_format = "jsonld")
dataset_describe
```
## Make the entire dataset interoperable

Eventually you can make the entire dataset interoperable by making every observation and every statement independent of R, your computer, your OS, and to a large extent the natural language that you use. _This will be further developed until we can automatically express an entire dataset in a semantically correct way._

```{r}
n_triples(dataset_to_triples(iris[1:4,]))
```
