
final version for the defense
dachafra committed May 18, 2021
1 parent a0fc48b commit c192d87
Showing 11 changed files with 57 additions and 20 deletions.
20 changes: 20 additions & 0 deletions 0_frontmatter/abstract.tex
@@ -3,12 +3,32 @@


\begin{abstractslong}
Over the last years, a large and constantly growing amount of data has been made available on the Web. These data are published in many different formats and follow several schemas. The Semantic Web, and more specifically Knowledge Graphs, have gained momentum as a result of this explosion of available data and the demand for expressive models to integrate factual knowledge spread across various data sources. Although these results endorse the success of Semantic Web technologies, they also call for the development of computational tools able to scale knowledge graphs up to the data growth expected over the next years. Proposing robust methods able to integrate these data sources across the Web is the first problem that has to be solved in order to start seeing the Web as one integrated database.

This thesis addresses the problem of constructing knowledge graphs by exploiting declarative mapping rules. The contributions presented in this document are:
\begin{itemize}
\item A complete evaluation framework for knowledge graph construction engines.
\item The concept of \textit{mapping translation} and its desirable properties.
\item Optimizations and enhancements in the access to heterogeneous data sources during the construction of virtual knowledge graphs, exploiting the mapping translation concept.
\item Optimizations in the construction of materialized knowledge graphs in complex data integration scenarios, translating mapping rules among different specifications.
\end{itemize}

The final conclusions of this thesis reflect that the optimization of the construction of knowledge graphs at scale has been approached for the first time using translations among mapping languages, a novel concept in the state of the art. This has been accompanied by a complete evaluation framework that allows the identification of the weaknesses and strengths of these engines. Finally, the future lines of work reflect the need to continue researching new methods and techniques that ensure the wide adoption of this type of technology at large scale in industry.
\end{abstractslong}

\cleardoublepage
\begin{abstractslongSpanish}
Nowadays, a huge amount of data has been accumulated on the Web. These data are represented in very diverse formats and make use of very diverse vocabularies and schemas. The Semantic Web, and more specifically Knowledge Graphs, have positioned themselves as a solution at scale capable of integrating this large amount of data following a common model (an ontology). Although this reflects the success of these technologies, which are used by many companies and projects for data management, it also reveals the need to develop and conceptualize systems capable of constructing these knowledge graphs in scenarios where characteristics such as the volume or the variety of the data make the task complex.

This thesis addresses the problem of constructing knowledge graphs through the translation of declarative mapping rules. The contributions presented are:
\begin{itemize}
\item A complete evaluation framework for knowledge graph construction engines.
\item The concept of \textit{translation between mapping languages} and its properties.
\item Enhancements in the access to heterogeneous data during the construction of virtual knowledge graphs, exploiting the translation between mapping languages.
\item Optimizations for the construction of materialized knowledge graphs in complex data integration scenarios, translating rules among different specifications.
\end{itemize}

The final conclusions of this thesis reflect that the optimization of knowledge graph construction at scale has been approached for the first time through the translation of mapping languages, a novel concept in the state of the art. This has been accompanied by a complete framework for the evaluation of these engines. Finally, the future lines of work reflect the need to continue researching new methods and techniques that ensure the wide adoption of these technologies at large scale in industry.



2 changes: 1 addition & 1 deletion 0_frontmatter/acknowledgement.tex
@@ -9,7 +9,7 @@

To the people from Ghent, especially to Pieter and Anastasia, my ``postdocs'' in the shadows. Thanks for giving me the opportunity to work and collaborate with you and your teams, especially with Ben, Pieter and Julián. I hope that this is only a promising beginning. Lucie and Sam, I am extremely glad to have met you, first in Bertinoro and then in Hannover. Lucie, you were my support during the good and the bad moments of my internship in Germany. I am really happy to have a friend like you in my life. Sam, I am never going to forget our legendary week in Rhodes and all the awesome things that came after it. I really admire your passion and I enjoyed working with you; you will be a researcher of reference for many people. Finally, to the rest of the SDM-TIB family: Enrique, Kemele, Ahmad, Ariam, Maria Isabel...

And last, but not least, I want to thank my lifelong people, my Galicians. To my parents, who were a constant support during this long journey; from them I keep the strength with which they face the delicate moments and the passion with which they confront everyday life. Mum, thank you for all the discussions and for your point of view on what a good researcher is; you are a role model for me. Dad, I hope that one day I will be able to enjoy every moment with the same intensity and joy as you do. Martín, my brother, I believe that, even though we are at different stages of life, our paths have gradually come together during these years, and your unconditional support has always been a great help. Thanks to my lifelong friends: Rafa, Marta, Clau, Paula, Erle, Ánxela, Carme, Olga and Minia... But also to those I met in Madrid: Santi, Patricia, Pamela, Carolina... They always managed to bring a smile out of me and made me see that academia and the university are not the only things that exist. Nothing would be the same without them.
And last, but not least, I want to thank my lifelong people, my Galicians. To my parents, who were a constant support during this long journey; from them I keep the strength with which they face the delicate moments and the passion with which they confront everyday life. Mum, thank you for all the discussions and for your point of view on what a good researcher is; you are a role model for me. Dad, I hope that one day I will be able to enjoy every moment with the same intensity and joy as you do. Martín, my brother, I believe that, even though we are at different stages of life, our paths have gradually come together during these years, and your unconditional support has always been a great help. Thanks to my lifelong friends: Rafa, Marta, Clau, Paula, Erle, Ánxela, Carme, Polly, Olga and Minia... But also to those I met in Madrid: Santi, Patricia, Pamela, Carolina... They always managed to bring a smile out of me and made me see that academia and the university are not the only things that exist. Nothing would be the same without them.

The last lines are dedicated to all those people who, often without knowing it, have helped me reach this goal. The long mornings of experiments with 180º and Virginia Díaz; the tennis evenings with Rafa, Iñaki and Andrés, among others; the last stretch, writing the thesis in the flat in Madrid and going out to ``ride'' a few kilometres with Ismael; thanks to all the students who have taken part in an edition of the Open Summer of Code Spain. Thank you, Jorge, for your hospitality and friendship during my stay in Santiago de Chile.

4 changes: 2 additions & 2 deletions 2_stateoftheart/stateofart.tex
@@ -239,7 +239,7 @@ \subsubsection{Other approaches to Construct Knowledge Graphs}
\label{fig:soa_mapping_rules_story}
\end{figure}

\noindent\textbf{SPARQL-Generate.} The solution proposed in~\citep{lefranccois2017sparql} presents a template-based language that extends SPARQL 1.1 to construct knowledge graphs. It exploits the representation capabilities of SPARQL to declare the transformation rules inside the query. Using SPARQL to define transformation rules has also been proposed in previous approaches focused on specific data formats, such as Tarql for CSV files\footnote{\url{https://tarql.github.io/}} or Triplify~\citep{auer2009triplify}. However, this kind of solution follows a procedural approach to defining rules instead of a declarative one. Therefore, in comparison with the rest of the proposals, they cannot exploit the benefits of a declarative definition of rules, such as maintainability, reproducibility and understandability.
\noindent\textbf{SPARQL-Generate.} The solution proposed in~\citep{lefranccois2017sparql} presents a template-based language that extends SPARQL 1.1 to construct knowledge graphs. It exploits the representation capabilities of SPARQL to declare the transformation rules inside the query. Using SPARQL to define transformation rules has also been proposed in previous approaches focused on specific data formats, such as Tarql for CSV files\footnote{\url{https://tarql.github.io/}} or Triplify~\citep{auer2009triplify}. %However, this kind of solution follows a procedural approach to defining rules instead of a declarative one. Therefore, in comparison with the rest of the proposals, they cannot exploit the benefits of a declarative definition of rules, such as maintainability, reproducibility and understandability.
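To make the template-based idea concrete, here is a minimal sketch in Python with rdflib, rather than SPARQL-Generate's own grammar: iterate over a non-RDF source (JSON here) and instantiate triple templates. The data and the ex: namespace are hypothetical; this illustrates the idea only, not the actual SPARQL-Generate engine.

```python
# Minimal sketch of template-based RDF generation from a JSON source,
# in the spirit of SPARQL-Generate/Tarql. Illustrative only.
import json
from rdflib import Graph, Literal, Namespace

FOAF = Namespace("http://xmlns.com/foaf/0.1/")
EX = Namespace("http://example.org/")  # hypothetical namespace

people = json.loads('[{"id": "1", "name": "Alice"}, {"id": "2", "name": "Bob"}]')

g = Graph()
for person in people:
    subject = EX[f"person/{person['id']}"]  # subject template: ex:person/{id}
    g.add((subject, FOAF.name, Literal(person["name"])))

print(g.serialize(format="turtle"))
```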

\noindent\textbf{ShExML.} In~\citep{garcia2020shexml}, the authors propose a solution based on ShEx~\citep{prud2014shape}, a validation language for knowledge graphs. Although it is based on this specification, it uses its own syntax and grammar. In comparison to R2RML-based proposals, ShExML separates the declarations, which specify how to extract the data from the input sources, from the shapes, which specify how to generate the desired RDF graph. The main objective of this proposal is to help users create the rules, and it also provides a translation engine that transforms ShExML rules into RML mappings.
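Such translations between mapping specifications can be pictured as rewritings of the mapping graph itself. The following is a minimal sketch, using rdflib and assuming a CSV-backed source, of a translation in the direction discussed throughout this thesis (R2RML to RML): rr:logicalTable/rr:tableName becomes an rml:logicalSource, and rr:column becomes rml:reference. Real translation engines handle many more constructs.

```python
# Minimal sketch of mapping translation (R2RML -> RML) as a graph rewriting.
# Only two constructs are handled: rr:logicalTable/rr:tableName becomes an
# rml:logicalSource over a CSV file, and rr:column becomes rml:reference.
from rdflib import BNode, Graph, Literal, Namespace

RR = Namespace("http://www.w3.org/ns/r2rml#")
RML = Namespace("http://semweb.mmlab.be/ns/rml#")
QL = Namespace("http://semweb.mmlab.be/ns/ql#")

def r2rml_to_rml(r2rml: Graph, csv_file: str) -> Graph:
    """Rewrite an R2RML mapping graph into an RML one targeting csv_file."""
    rml = Graph()
    for s, p, o in r2rml:
        if p == RR.logicalTable:
            source = BNode()
            rml.add((s, RML.logicalSource, source))
            rml.add((source, RML.source, Literal(csv_file)))
            rml.add((source, RML.referenceFormulation, QL.CSV))
        elif p == RR.column:
            rml.add((s, RML.reference, o))
        elif p == RR.tableName:
            continue  # the table is replaced by the CSV source above
        else:
            rml.add((s, p, o))  # constructs shared by both languages
    return rml
```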

@@ -330,7 +330,7 @@ \subsection{Materialized Knowledge Graph Construction Engines}
\subsection{Virtual Knowledge Graph Construction Engines}
Most of the works proposed under this framework focus on providing access to relational databases~\citep{priyatna2014formalisation,calvanese2017ontop,sequeda2013ultrawrap} and on optimizations of the SPARQL-to-SQL translation process. The first approach to the translation between these two query languages is proposed in~\citep{chebotko2009semantics}, where the authors define an algorithm to create an equivalent SQL query from an input SPARQL query. With the appearance of the R2RML recommendation, both Morph-RDB~\citep{priyatna2014formalisation} and Ontop~\citep{calvanese2017ontop} propose the use of these mapping rules and optimize the algorithm proposed in~\citep{chebotko2009semantics}. For example, there are recent studies and optimizations for an efficient translation of the SPARQL \texttt{OPTIONAL} operator in this process in~\citep{xiao2018efficient}. Nowadays, Morph-RDB and Ontop are the two most well-known open-source engines that construct virtual knowledge graphs from relational databases. Ultrawrap~\citep{sequeda2013ultrawrap,sequeda2014obda} is another SPARQL-to-SQL engine, which uses SQL views to enhance the execution of the SQL queries resulting from the translation from SPARQL. In~\citep{sequeda2014obda}, the authors propose a cost function that helps to decide when a SQL view should or should not be physically materialized in the RDB.
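As a rough illustration of the kind of rewriting these algorithms perform, the toy sketch below translates a single star-shaped basic graph pattern into one SELECT query, given a predicate-to-column mapping of the sort an R2RML TriplesMap provides. The table and mapping are hypothetical, and real algorithms additionally cover joins, OPTIONAL and filters.

```python
# Toy SPARQL-to-SQL sketch: each predicate of a star-shaped BGP is mapped to a
# column of one table (as an R2RML mapping would), yielding a single SELECT.
# Hypothetical mapping; real engines implement far more complete algorithms.

# predicate IRI -> column name, as extracted from an R2RML TriplesMap
PRED2COL = {
    "http://xmlns.com/foaf/0.1/name": "name",
    "http://xmlns.com/foaf/0.1/mbox": "email",
}

def star_bgp_to_sql(triple_patterns, table="person"):
    """Translate [(?s, p, ?var), ...] sharing one subject into a SELECT."""
    columns = [f"{PRED2COL[p]} AS {var.lstrip('?')}"
               for _s, p, var in triple_patterns]
    return f"SELECT {', '.join(columns)} FROM {table};"

bgp = [("?s", "http://xmlns.com/foaf/0.1/name", "?n"),
       ("?s", "http://xmlns.com/foaf/0.1/mbox", "?m")]
print(star_bgp_to_sql(bgp))  # SELECT name AS n, email AS m FROM person;
```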

In this context, the term constraint has been used in~\citep{hovland2016obda}, where the authors define two new properties that extend the concept of an OBDA instance. They propose a set of optimizations during the SPARQL-to-SQL translation process with techniques that take these constraints into account. However, the main assumptions made over the OBDA framework (e.g., the data source is an RDB or has an RDB wrapper, or the schema contains a set of constraints) are maintained. There are other works, such as~\citep{michel2015translation,botoeva2019ontology}, that apply the OBDA framework over document-based databases. For example, in Morph-xR2RML~\citep{michel2015translation}, the authors formally define the translation from SPARQL to NoSQL databases.
In this context, the term constraint has been used in~\citep{hovland2016obda}, where the authors define two new properties that extend the concept of an OBDA instance. They propose a set of optimizations during the SPARQL-to-SQL translation process with techniques that take these constraints into account. However, the main assumptions made over the OBDA framework (e.g., the data source is an RDB or has an RDB wrapper, or the schema contains a set of constraints) are maintained. There are other works, such as~\citep{michel2016generic,botoeva2019ontology}, that apply the OBDA framework over document-based databases. For example, in Morph-xR2RML~\citep{michel2016mapping}, the authors formally define the translation from SPARQL to NoSQL databases.
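As a very rough sketch of what query translation against a document store involves, the snippet below rewrites one triple pattern into a MongoDB-style filter and projection as plain dictionaries. The field mapping is hypothetical, and the actual xR2RML translation is considerably more involved.

```python
# Rough illustration of translating one triple pattern into a MongoDB-style
# query document (filter + projection). Field names are hypothetical; the
# actual SPARQL-to-MongoDB translation in xR2RML covers far more cases.
PRED2FIELD = {"http://xmlns.com/foaf/0.1/name": "name"}

def triple_to_mongo(subject_id, predicate):
    """?s <p> ?o with an optionally bound subject -> (filter, projection)."""
    field = PRED2FIELD[predicate]
    query_filter = {"_id": subject_id} if subject_id is not None else {}
    projection = {field: 1, "_id": 0}
    return query_filter, projection

print(triple_to_mongo("person1", "http://xmlns.com/foaf/0.1/name"))
# ({'_id': 'person1'}, {'name': 1, '_id': 0})
```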

There are two main proposals for the construction of virtual knowledge graphs beyond relational databases: Ontario and Squerall. Ontario~\citep{endris2019ontario} is based on the concept of RDF molecule templates~\citep{endris2017mulder} and aims to perform efficient source selection in a data lake composed of heterogeneous data sources in their original formats. It creates a set of star-shaped sub-queries that match the RDF Molecule Templates (RDF-MTs), and applies optimization techniques to define the query plan that will be executed. Similarly, Squerall~\citep{mami2019querying} takes input data and mappings, and offers a middleware that is able to aggregate the intermediate results in a distributed manner. Finally, Polyweb~\citep{khan2019one} is another proposal that is able to translate and distribute queries over relational databases and CSV files using RML mappings.
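The first step of this decomposition can be sketched by grouping the triple patterns of a basic graph pattern by subject; matching the resulting stars against RDF-MTs (omitted here) is what drives the actual source selection. A simplified sketch:

```python
# Sketch of the star-shaped decomposition used for source selection: group the
# triple patterns of a BGP by subject. Matching each star against RDF Molecule
# Templates (not shown) is what Ontario then uses to pick the sources.
from collections import defaultdict

def star_shaped_subqueries(bgp):
    stars = defaultdict(list)
    for s, p, o in bgp:
        stars[s].append((s, p, o))
    return dict(stars)

bgp = [("?d", "rdf:type", "ex:Drug"),
       ("?d", "ex:interactsWith", "?t"),
       ("?t", "ex:name", "?n")]
print(star_shaped_subqueries(bgp))  # two stars: one on ?d, one on ?t
```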

6 changes: 3 additions & 3 deletions 4_mappingtranslation/translation.tex
@@ -94,9 +94,9 @@ \subsection{From declarative mappings to programmed adapters}
We propose the use of the mapping translation concept to facilitate the generation of GraphQL resolvers. We propose specifying mappings in R2RML, a well-defined and formalized mapping language, and applying a mapping translation technique to automatically generate the corresponding GraphQL schemas and resolvers in different programming languages. Our intuition is that, following this approach, GraphQL resolvers will be easier to maintain, as the mapping rules are declarative and independent of any programming language. The details of this approach are presented in Section \ref{chap6_morphgraphql}.
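For illustration, a resolver that such a generator might emit for a hypothetical R2RML TriplesMap over a table person(id, name) could look as follows; the database file, schema and names are assumptions, and the actual generated code and target languages are those described in Section \ref{chap6_morphgraphql}.

```python
# Hypothetical example of the kind of resolver such a generator could emit
# from an R2RML TriplesMap mapping table "person" (columns id, name) to a
# Person type; real generated code is engine- and language-specific.
import sqlite3

def resolve_person(obj, info, id):
    """GraphQL resolver for: person(id: ID!): Person"""
    conn = sqlite3.connect("data.db")  # hypothetical database file
    row = conn.execute(
        "SELECT id, name FROM person WHERE id = ?", (id,)
    ).fetchone()
    conn.close()
    return {"id": row[0], "name": row[1]} if row else None
```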

\subsection{Providing access to semi-structured data}
Semi-structured data formats are among the most widely used formats to publish data on the Web. Although existing mapping languages provide support for this type of data source, existing engines mostly focus on the generation (materialization) of RDF-based knowledge graphs, with only a few proposals (e.g., xR2RML~\citep{michel2015translation}) focused on the application of query-translation techniques (virtualization) over such data sources.
Semi-structured data formats (e.g., JSON, XML, CSV) are among the most widely used formats to publish data on the Web. Although existing mapping languages provide support for this type of data source, existing engines mostly focus on the generation (materialization) of RDF-based knowledge graphs, with only a few proposals (e.g., xR2RML~\citep{michel2016generic}) focused on the application of query-translation techniques (virtualization) over such data sources.

In the specific case of spreadsheets (CSV), providing access to this format is difficult for two main reasons: (i) CSV does not provide its own query language, and (ii) some transformations are commonly needed when treating data available in CSV files. To solve the first issue, query translation techniques have been applied over this format by considering a CSV file as a single table that can be loaded into an RDB. For the second issue, extensions of well-known mapping languages (RML together with the Function Ontology~\citep{de2017declarative}) and annotations following the CSVW specification~\citep{tennison2015model} can be used.
In the specific case of spreadsheets (CSV), providing access to this format is difficult for two main reasons: (i) CSV does not provide its own query language, and (ii) some transformations are commonly needed when treating data available in CSV files. To solve the first issue, query translation techniques have been applied over this format by considering a CSV file as a single table that can be loaded into an RDB. For the second issue, extensions of well-known mapping languages (RML together with the Function Ontology~\citep{de2017declarative}) and annotations following the CSVW specification~\citep{tennison2015model} can be used to enforce implicit constraints over the input data sources.
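A minimal sketch of the first idea, assuming a hypothetical CSV file and a CSVW-style datatype annotation: the file is loaded as a single table of an in-memory SQLite database, casting columns while loading, after which standard SQL (and hence SPARQL-to-SQL translation) applies.

```python
# Sketch: load a CSV file as one table of an in-memory SQLite database,
# casting a column according to a CSVW-style datatype annotation, so that
# SPARQL-to-SQL translation can then be applied. All names are hypothetical.
import csv, io, sqlite3

csv_data = io.StringIO("station,passengers\nAtocha,1200\nSol,950\n")
annotations = {"passengers": int}  # e.g., from a CSVW "datatype": "integer"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trips (station TEXT, passengers INTEGER)")
for row in csv.DictReader(csv_data):
    typed = [annotations.get(col, str)(val) for col, val in row.items()]
    conn.execute("INSERT INTO trips VALUES (?, ?)", typed)

print(conn.execute("SELECT SUM(passengers) FROM trips").fetchone())  # (2150,)
```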

Morph-CSV applies the concept of mapping translation to enhance OBDA query translation from SPARQL over CSV files. It exploits the information in the CSVW annotations and the RML+FnO mappings to create an enriched RDB representation of the CSV files, together with the corresponding R2RML mappings, allowing the use of existing query translation (SPARQL-to-SQL) techniques implemented in R2RML-compliant OBDA engines. This contribution is detailed in Section \ref{chap6_morphgcsv}.

@@ -116,7 +116,7 @@ \subsection{Understanding the semantics of mapping languages}
\section{Use Case: Virtual Knowledge Graph Construction in the Statistics Domain}
\label{sec:chap4_rmlc}

Statistical data is one of the most common ways of sharing public information nowadays. PDF, HTML and, especially, CSV are among the most used formats for tabular data published on the Web by statistics agencies. While this is still the main trend, many agencies worldwide are embracing semantic technologies to publish their resources as Statistics Knowledge Graphs (SKG)\footnote{\url{https://semstats.org}}. In many cases both formats co-exist, allowing the information to be accessed in different ways.
Statistical data is one of the most common ways of sharing public-domain information through open data portals. PDF, HTML and, especially, CSV are among the most used formats for tabular data published on the Web by statistics agencies. While this is still the main trend, many agencies worldwide are embracing semantic technologies to publish their resources as Statistics Knowledge Graphs (SKG)\footnote{\url{https://semstats.org}}. In many cases both formats co-exist, allowing the information to be accessed in different ways.

Due to the high volume and variability of the data, the transformation from tabular formats to SKG-oriented formats requires a process that is standard and maintainable. We identify two main approaches for transforming tabular data into a Statistics Knowledge Graph (SKG). The first is an ad-hoc approach, such as the one reported in~\citep{corcho2017publishing}, in which CSV data is converted into RDF Data Cube~\citep{cyganiak2012rdf} using a set of custom rules. The second approach is defined on the basis of mapping languages, such as the RDB2RDF W3C Recommendation R2RML, in which transformations are codified in a standard language, with several tools available for applying them.
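For the second approach, the fragment below gives an idea of what such a standard mapping looks like: an illustrative R2RML TriplesMap, with a hypothetical table and column names, that turns each row of a statistics table into a qb:Observation; it is parsed with rdflib only to check that the mapping itself is well-formed RDF.

```python
# Illustrative R2RML fragment mapping each row of a statistics table to a
# qb:Observation; table and column names are hypothetical. Parsing it with
# rdflib simply checks that the mapping is valid Turtle/RDF.
from rdflib import Graph

R2RML_MAPPING = """
@prefix rr: <http://www.w3.org/ns/r2rml#> .
@prefix qb: <http://purl.org/linked-data/cube#> .
@prefix ex: <http://example.org/> .

ex:UnemploymentMap a rr:TriplesMap ;
    rr:logicalTable [ rr:tableName "unemployment" ] ;
    rr:subjectMap [ rr:template "http://example.org/obs/{region}/{year}" ;
                    rr:class qb:Observation ] ;
    rr:predicateObjectMap [ rr:predicate ex:rate ;
                            rr:objectMap [ rr:column "rate" ] ] .
"""

g = Graph().parse(data=R2RML_MAPPING, format="turtle")
print(len(g), "mapping triples parsed")
```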


