KR Conference 2014 Conference Paper
The central problem in the database field is of course query answering, and in the presence of incompleteness, one looks for certain answers: those that do not depend on the interpretation of unknown data. The concept was first mentioned in (Grant 1977) and formally defined 35 years ago in (Lipski 1979) as follows. Assume that the semantics [[D]] of an incomplete database D is given as the set of all complete databases D0 which D can represent. Then the certain answer to Q on D is defined as cert∩ (Q, D) = T {Q(D0) | D0 ∈ [[D]]}. Since database queries produce collections (sets, multisets, etc.), it makes sense to talk about their intersection. This definition has been universally applied to all the semantics of incompleteness, and in all the scenarios such as those listed above. The intuition is that this gives us the set of tuples independent of the interpretation of the missing information in D. Such a universal adoption of this basic definition has another consequence: certain answers themselves contain no missing information. In fact many algorithms for computing certain answers have, as the last step, elimination of any objects (say, rows in relational databases) with missing data. The question that we address here is the following: Is this standard “one-size-fits-all” definition really the right one to use for all the semantics, and all the applications? The answer, as we shall argue, is negative: the standard intersection semantics, as well as the assumption that no missing information is present in the answers, leads to many problems, and crucially to producing meaningless query answers. To argue that this is the case, and to explain some basic ideas behind the alternative approach we present here, note that in the database field, one tends to operate with objects (i. e., relations, XML documents, graph databases, etc.); in particular, queries take objects and return objects. Thus, the idea behind certain answers is to find an object A representing the set of objects Q([[D]]) = {Q(D0) | D0 ∈ [[D]]}. Such an object A must contain information common to all the objects in Q([[D]]): that is, it must be no more informative than any of the objects in Q([[D]]). Now take a simple example: we have a relation R = {(1, 2), (3, ⊥)} in a database, where ⊥ represents a null, or a missing value. The query Q returns R itself. Then cert∩ (Q, D) = {(1, 2)} under every reasonable semantics of incompleteness. But is it less informative than all of Q(D0) for D0 ∈ [[D]]? The answer depends on the semantics. Under The standard way of answering queries over incomplete databases is to compute certain answers, defined as the intersection of query answers on all complete databases that the incomplete database represents. But is this universally accepted definition correct? We argue that this “one-size-fits-all” definition can often lead to counterintuitive or just plain wrong results, and propose an alternative framework for defining certain answers. The idea of the framework is to move away from the standard, in the database literature, assumption that query results be given in the form of a database object, and to allow instead two alternative representations of answers: as objects defining all other answers, or as knowledge we can deduce with certainty about all such answers. We show that the latter is often easier to achieve than the former, that in general certain answers need not be defined as intersection, and may well contain missing information in them. We also show that with a proper choice of semantics, we can often reduce computing certain answers – as either objects or knowledge – to standard query evaluation. We describe the framework in the most general way, applicable to a variety of data models, and test it on three concrete relational semantics of incompleteness: open, closed, and weak closed world.