background.tex

%!TEX root = main.tex

\section{Background and Related Work}\label{sec:background}

% Data quality in general
The problem of ``vocabulary quality'' is closely related to the more general one of ``data quality'' and has been discussed in data and information systems research (cf.~\cite{Batini2009}). Pipino et al.~\cite{Pipino2002} argue that dealing with data quality should involve both ``subjective perceptions of the individuals'' and ``objective measurements based on the data''. We see our work as a contribution to the latter.

The SKOS specification does not mention the notion of quality, but defines in total six integrity conditions~\cite{SkosReference2008}, each of which is a statement that defines under which circumstances data are consistent with the SKOS data model. For example, ``a resource has no more than one value of \texttt{skos:prefLabel} per language tag''. Tools that can check whether these conditions are met are already available: two of the six conditions are defined formally in the OWL representation of SKOS and can therefore be validated by any OWL reasoner. For validating a SKOS vocabulary against the other integrity conditions, one can use tools such as the PoolParty Thesaurus Consistency Checker\footnote{\url{http://demo.semantic-web.at:8080/SkosServices/check}}, or the Skosify\footnote{\url{http://code.google.com/p/skosify/}} validator, which can also correct some detected quality problems.

% MOVE TO RELATED WORK: Gueret et al. [\todo{CM}{find the right reference in google scholar}], for instance, calculate network measures to support human judgement on the quality of Linked Data \todo{CM}{?} graphs.


%In the course of the LATC project, Gueret et al. propose an infrastructure that utilizes crowd-sourcing technology to semi-automatically create links between datasets of the LOD cloud. To ensure the quality of the resulting network, human judgement is supported by calculating a report of locally approximated network measures.

% SKOS Integrity Conditions (http://www.w3.org/TR/skos-reference/):
% S9    skos:ConceptScheme is disjoint with skos:Concept.
% S13   skos:prefLabel, skos:altLabel and skos:hiddenLabel are pairwise disjoint properties. (* NOT expressed formally *)
% S14   A resource has no more than one value of skos:prefLabel per language tag. (* NOT expressed formally *)
% S27   skos:related is disjoint with the property skos:broaderTransitive. (* NOT expressed formally *)
% S37   skos:Collection is disjoint with each of skos:Concept and skos:ConceptScheme.
% S46   skos:exactMatch is disjoint with each of the properties skos:broadMatch and skos:relatedMatch. (* NOT expressed formally *)
% -> 2 of 6 are expressed formally in OWL; the others cannot be expressed because OWL doesn't support disjoint properties

% Poolparty quality checker implementation:
% S13, S14, S27, S46 plus the formally defined S9, S37
% syntactic checks: URI, character encoding, whitespaces
% missing language tags: in SKOS labels and textual content)
% missing labels: prefLabels for skos:Concepts and rdfs:labels for skos:ConceptSchemes)
% loose concepts: concepts that are no top concepts and have no broaders

% SKOSify
% S13, S14, S27

% Mapping qSKOS Integrity Conditions <-> qSKOS quality criteria
% S13 + S14 combined in "Ambigious labels"
% S27 + S46 combined in "Associative vs. Hierarchical Relation Clashes" and "Exact vs. Associative and Hierarchical Mapping Clashes"


Typical applications of controlled vocabularies are classification, indexing, auto-completion, query reformulation, or serving as a glossary. As we discussed in detail in earlier work~\cite{Nagy2011}, these areas impose specific requirements on vocabulary features, such as structure, availability, and documentation. Quality aspects of controlled vocabularies have already been discussed in standardized guidelines~\cite{ISO25964-1:2011,Z39.19:2005}, manuals~\cite{Aitchison2000,Harpring2010,Hedden2010,Svenonius2003}, tutorials~\cite{Soergel2002}, and scholarly articles \cite{Coronado2009,Kless2010}. These most often rely on manual, precise analysis of individual statements in the data, as in~\cite{spero2008}. Our work builds on this literature, but focuses on the less intellectually loaded checks, which can be automatized to assist vocabulary users or publishers. 

% Quality in SW / Linked Data Research

Data quality is also being discussed in Semantic Web and Linked Data research. Hogan et. al~\cite{Hogan2010} identify four categories of common errors and shortcomings in RDF documents and Heath and Bizer~\cite{Heath2011} summarize best practices for publishing data on the Web. Ontology evaluation, i.e., measuring the quality of an ontology, has also been discussed extensively~\cite{Vrandecic2010}. However, the authors focus on RDF datasets and ontologies in general. While we could use some criteria suggested here, such as consistent tagging of literals, these need to be completed by considering SKOS-specific properties.

One issue when assessing the quality of SKOS data is the so-called ``Open World Assumption'', which underlies the Web of Data itself. Established quality notions from closed-world systems, such as referential integrity or schema validation, do not hold anymore, because available information may be incomplete and non-explicitly stated facts cannot be determined as true or false. Work-arounds, often ad-hoc, are thus currently used to evaluate quality in Linked Data sets~\cite{Heath2011}, as is done in the (rule-based) SKOS tools mentioned above, or the Pellet ICV\footnote{\url{http://clarkparsia.com/pellet/icv/}}, which re-interprets OWL axioms with integrity constraint semantics.