Summarizing my PhD research

Oddly enough, I don't think I ever mentioned my PhD research on this blog. Part of the reason is that I easily get fascinated by technical questions, but I find it less natural to explain what I do in non-technical terms. However, I have to write the introduction to my PhD thesis, so I thought it was a nice excuse to start by summarizing here what I am doing.

For readers who don't know, I'm doing my PhD in Télécom ParisTech, in the DBWeb team. Before this, I stayed for some months in Tel Aviv to work with Tova Milo, and in Oxford to work with Michael Benedikt, which explains some of my current collaborations.

The general context: Uncertain data

The great success story of database research is the relational model. It is a rare example of something that is both extremely relevant in practice¹ and nice from a theoretical perspective, with close connections to logics and finite model theory. The relational model gives an abstract model of how data should be represented and queried, which allows people to use databases as a black box, leaving fine but domain-agnostic optimisations to database engines. Relational databases, like compilers, filesystems, or virtual memory, have become an abstraction that everyone relies on.

One thing that the relational model does not handle is uncertainty and incompleteness in the data. In the relational semantics, the database is perfectly accurate, and it contains all facts that may be of interest. In practice, however, this is rarely satisfactory:

Some values may be missing: for instance, the user did not fill in some fields when registering, or some fields were added since they registered.
The data may be incomplete: for instance, you are constructing your database by harvesting data from the Web, and you haven't harvested everything yet; but you still wonder whether the data that you have allows you to give a definitive answer to your query.
Data may be outdated: for instance, the user moved but they did not update their address.
Data may simply be wrong.

Indeed, nowadays, people manage data that may have been inferred by machine learning algorithms, generated by data mining techniques, or automatically extracted from natural language text. Unlike a hand-curated database, you can't just assume mistakes will not happen, or that you will fix all of them when you find them. The data is incomplete and contains errors, and you have to deal with that.

These important problems are already being addressed, but in unsatisfactory ways. The general domain-agnostic solutions are fairly naive: e.g., set a confidence threshold, discard everything below the threshold, consider everything above the threshold as certain. In domains where uncertainty is paramount and where simple solutions do not suffice, e.g., in character recognition (OCR) or speech recognition (ASR), people use more elaborate solutions, e.g., weighted automata. The problem is that these solutions are often domain-specific, i.e., they are tailored to the domain at hand. The vision of uncertain data management is that there should be principled and generic tools to manage such noisy and incomplete data in all domains where it occurs; in the same way that relational databases today are both highly optimized and domain-agnostic tools.

For now, the most visible aspect of uncertainty management in concrete databases are nulls, which are part of the SQL standard. NULL is a special value that indicates that a record does not have a value for a specific attribute: for instance, the attribute did not exist when the record was created. Of course, there are many problems that NULLs do not address:

NULL can only represent missing values in a record, not missing records ("There may be more entries that I don't know about"), or uncertainty about whether the record exists altogether ("I think I know this, but it may be wrong.")
NULL just means that we know nothing about the value, it doesn't allow us to say that there are multiple possible values, or to model them using a probability distribution, or that two unknown values must be the same.
The semantics of NULLs (as defined by the SQL standard) are very counter-intuitive. For instance, trying to select all values which are either equal to 42 or different to 42 will not return the values marked NULL. This is because NULLs are understood as a three-valued logic, but it still does not make sense in terms of the possible worlds : NULL represents an unknown value, and no matter the value, it must be equal to 42 or different from 42...

The semantics of NULLs for SQL databases are now set in stone by the standard, even though people have complained about their imperfections as early as 1977², and even though they still confuse people today. There is still much debate about what the semantics of NULLs could have been instead, see, e.g., the keynote by Libkin, which contains more counter-intuitive examples with NULLs, a review of other theoretical models of NULLs to address some of these problems, as well as proposed fixes.

Research is also studying uncertainty management approaches that have nothing to do with NULLs. For instance, incompleteness can be managed by designing a different semantics to evaluate queries on databases, which is called the open-world assumption, and posits that all missing data is unknown rather than being wrong. To represent uncertainty about the existence of tuples, one can use, e.g., tuple-independent databases: they are standard relational databases, but with records carrying an independent "probability of existence". This formalism cannot represent correlations between tuples (if this tuple is here, then this tuple must be here, or cannot be here), and the result of queries on such databases may not be representable in the same formalism, but more expressive formalisms exist, such as pc-tables. There are also formalisms that can represent uncertainty on other structures, such as XML documents³.

The goal would be to develop a system where relations can be marked as incomplete, and data can simply be marked as uncertain, or inserted directly with probabilities or confidence scores; the system would maintain these values during query evaluation, to be able to tell how certain its answers are. For now, though, real-world uncertain database systems are quite experimental. There are many reasons for that:

It's hard to define principled semantics on uncertain data.
It's hard to find ways to represent uncertainty about any kind of data.
Somewhat surprisingly, probabilistic calculations on data are often computationally expensive, much harder than without probabilities!

This is why we are still trying to understand the basic principles about the right way to represent uncertainty and query it in practice, which is the general area of my PhD. The main way in which it differs from existing work is that I focused on restricting the structure of uncertain data, so that it can be queried in a decidable and tractable fashion. More specifically, my PhD research has studied three main directions:

Connecting approaches for incomplete data reasoning that originated in various communities, to query incomplete data whose structure is constrained by expressive but decidable rule languages.
Representing uncertainty on new structures, namely, on order relations, which had not been investigated before.
Achieving computational efficiency for query evaluation, when we add qualitative uncertainty in the form of probabilistic distributions, by restricting the structure of data, e.g., imposing bounded treewidth.

Connecting different approaches to reason on incomplete data

The first direction deals with incompleteness in databases, and reasoning about an incomplete database in an open-world setting: What is possible? What is certain? The challenge is that this problem has been studied from various communities. On the one hand, database researchers have studied how to reason about databases. On the other hand, people who study reasoning and artificial intelligence are interested in drawing inference from data, and have tried to make their approach work on large data collections.

An important such community is that of description logics (DLs). The goal of description logics is to do reasoning, with the following goals:

Separate the data and the rules used to reason about the data.
Remain computationally tractable when the data is large, as long as the rules remain manageable.
Study precisely how the computational cost varies depending on which kind of reasoning operators⁴ you allow in the rules.

The data managed in the DL context differs from relational databases, however. The semantics is not exactly the same, but, most importantly, the data representation is different:

DLs represent data as a graph, where relations connect at most two objects. For instance, "John lives in Paris."
Relational databases, on the other hand, are a tabular formalism, where each row has a fixed schema that relates multiple objects at once. For instance, "John was born in Paris in 1942".

A different formalism, that works with relational databases, is that of existential rules (or tuple-generating dependencies in database parlance). Such rules allow you to deduce more facts from existing facts: "If someone was born in France and they have a parent born in France, then they are French." However, they cannot express other constraints such as negation: "It is not possible that someone is their own child." Further, existential rules can assert the existence of new objects: "If someone wrote a literary prize, then they wrote some book."

For people from databases, DLs, and existential rules, the goal is to design languages to reason about incomplete data, imposing the following properties:

principled, though the languages may make different semantic assumptions;
expressive, so they can accurately model the rules of the real world;
decidable, so that they are not so expressive that it would be fundamentally impossible to reason about them;
efficient, so they machines can reason with them with reasonable performance.

Description logics and existential rules

A first part of my PhD research is about bridging the gap between these communities. The goal is to design languages to reason about incomplete data that combine rules from these various communities. My work with Michael Benedikt (Oxford), Combining Existential Rules and Description Logics, connects the two formalisms I presented.

Say that you have a database, which you know is incomplete, and you have rules to reason about the implicit consequences of the data. Our work studies hybrid languages where you can express both existential rules and DL constraints. More precisely, we determine when such languages are decidable, and when they are not, i.e., it is fundamentally impossible to reason about them. The goal is to have the "best of both worlds": on parts of the data that can be represented as a graph, we would use the expressive DL rules; on parts of the data which are more naturally described as a relational database, we would have less expressive constraints inspired by existential rules.

In our work, we pinpointed which features of these two languages are dangerous and give an undecidable language if you mix them. The big problematic feature of DLs are functionality assertions: being able to say that, e.g., a person has only one place of birth. If we want to express such constraints, we must do away with two problematic features of existential rules. The first is exporting two variables: "If someone was born in some country, then that person has lived in that country." What we deduce should only depend on one element from the hypothesis: "If someone won a literary prize, then they wrote some book." The second is a restriction on the facts which the rules can deduce, which shouldn't form complex cyclic patterns in a certain technical sense.

Apart from these problematic features, however, we show that we can combine existential rules and expressive DL rules. This gives us hope: it could be possible to reconcile the two approaches and design very expressive languages that capture both kinds of constraints.

Open-world query answering and finiteness

I have explained how we connected two non-database approaches to reason about incomplete data. I also worked with Michael on reasoning about relational databases, in our work Finite Open-World Query Answering with Number Restrictions.

The general context is the same: we have incomplete data and we want to reason about its consequences, using logical rules. This time, however, our rule language is just defined using classical integrity constraints from relational database theory. Indeed, people have designed many ways to constrain the structure of databases to detect errors in the data: "Any customer in the Orders table must also appear in the Customers table". We use these as rules to reason about the complete state of the data, even though our database is incomplete and may violate them.

The language of constraints that we use includes functional dependencies, which are like⁵ the functionality assertions in DLs that I already mentioned. We also study unary inclusion dependencies, which are existential rules⁶ but of a very restricted kind.

The main problem comes from the semantics. In database theory, people usually assume databases to be finite. In particular, when using the rules to derive consequences, the set of consequences should be finite as well. This is justified by common sense ("the world is finite") but this problem is usually neglected in works about DLs and existential rules. So we studied finite reasoning in this database context, for the open-world query answering task of reasoning with our rules and incomplete information.

It is not hard to see that assuming finiteness makes a difference in some cases. Consider the following information about an organization:

Jane advises John (the database)
Each advisee is also the advisor of someone (an inclusion dependency)
Each advisee has a single advisor (a functional dependency)

Is it true, then, that someone advises Jane? The rule allows us to deduce that John, as he is advised by Jane, advises someone else, say Janice; Janice herself advises someone else, say Jack. In general, this could go on indefinitely, and we cannot deduce that someone advises Jane. However, if we also assume that the organization is finite, then it has to stop somewhere: someone along this chain (Jennifer, say) must advise someone that we already know about. And by the rule that no one is advised by two different people, we deduce that Jennifer must be advising Jane.

As we show in our work, the only difference that finiteness makes is that it causes more rules of the same kind to be added⁷. Once this is done, it turns out that we can essentially forget about the finiteness hypothesis, because we can no longer distinguish between finite and infinite consequences. This is surprisingly difficult to show, though; and to establish this correspondence, we need much more complex tools that those used in the infinite case to actually do the reasoning!

Representing uncertainty on ordered data

The second part of my PhD research deals with the representation of uncertainty on new kinds of data, namely, ordered data. Order can appear at different levels: on values ("retrieve all pairs of people when one is born before the other") or on records ("retrieve all page views that occurred yesterday"). In SQL databases, order is supported with the standard comparison operators on numbers ("WHERE length > 4200"), with sorting operators ("ORDER BY date"), and with the LIMIT operator to select the first results.

Order on facts

Why is it important to have uncertain ordered data? Well, an interesting phenomenon is that combining certain ordered data may cause uncertainty to appear. Consider a travel website where you can search for hotels and which ranks them by quality. You are a party of four, and you want either two twin rooms, or one room with four beds; but the website doesn't allow you to search for both possibilities at once⁸. So you do one search, and then the other, and you get two ordered lists of hotels. In real life you would then browse through both lists, but this is inefficient! What you would really want, in database parlance, is the union of these two lists, i.e., the list of hotels that have either two twin rooms or one 4-bed room. But how should this list be ordered? This depends on the website's notion of quality, which you don't know. Hence, the order on the union is uncertain. It is not fully unspecified, though: if two hotels occur only in one list, then their order relation should probably be reflected in the union. How can you formalize this?

To solve this kind of problems, people have studied rank aggregation: techniques to reconcile ordered lists of results. However, these methods are quantitative, i.e., if something appears near the top in most lists then it should appear near the top in the result. What we set out instead is to represent what we know for sure, i.e., the set of all possible consistent ways to reconcile the order, without trying to make a choice between them.

Couldn't we just use existing uncertainty representation frameworks? If you try, you will realize that they do not work well for ordered data. You don't want tuple-level uncertainty, because you know what the results are. You could say that the position of each result is uncertain: "this result is either the first or the third". Yet, when you know that hotel a must come before hotel b, you must introduce some complicated dependency between the possible ranks of a and b, and this gets very tedious.

With M. Lamine Ba (former fellow PhD student in Télécom, now in Qatar), Daniel Deutch (from Israel) and my advisor Pierre Senellart, we have studied this question in our work, Possible and Certain Answers for Queries over Order-Incomplete Data. We use the mathematical notion of a partial order as a way to represent uncertain ordered data. More precisely, we explain how we can apply to partial orders the usual operators of the relational algebra (select, project, union, product). We extend this to accumulation, to answer queries such as "is it certain that all hotels in this district are better than all hotels in that district?"

This poses several interesting challenges. The most technically interesting one is that of efficiency: how hard it is to compute the result of such queries? Somewhat surprisingly, we show that it is intractable (NP-hard). In fact, it is already hard to determine, given such a partially ordered database, whether some totally ordered database is possible⁹. In other words, it is intractable already to determine whether some list is a consistent way to reorder the data! To mitigate this, we study "simplicity measures" of the original ordered relations (are they totally ordered, totally unordered, or not too ordered, not too unordered?), and we see when queries become tractable in this case, intuitively because the result of integrating the relations must itself be simple.

Order on values

I also work on uncertainty on orders with Tova Milo and Yael Amsterdamer (from Tel Aviv). This time, however, the partial order itself is known, and we want to represent uncertainty on numerical values that satisfy the order.

Our work, Top-k Queries on Unknown Values under Order Constraints, is inspired by crowdsourcing scenarios. When working on data that you collect from the crowd, every query that you make introduces latency (you need to wait for the answers to arrive) and monetary costs (you have to pay the workers). Hence, you cannot simply acquire data about everything that you are interested in: instead, you must extrapolate from what you can afford. Following our earlier works on similar questions, we are interested about classifying an items in a taxonomy of products. In this example application, the order is induced by the monotonicity property: shirts are more specific than clothing, so if we classify an item as being a shirt, it must also be a piece of clothing.

The underlying technical problem is to interpolate missing values under total order constraints, which I think it is an interesting question in its own right. We study a principled way to define this formally, based on geometry in polytopes, and study again the computational complexity of this task. Why is this computationally intractable in general, and what are circumstances where it can reasonably be done? For instance, we can tractably perform the task when the taxonomy of products is a tree.

Reasoning about order

More recently, I also started working with Michael Benedikt on open-world query answering tasks that involve order relations. Again, this is useful in practice, because order often shows up in queries and rules; however it is challenging because it is hard to talk about orders in rules. Essentially, to write a rule that says that something is an order, you have to express transitivity: if a < b and b < c then a < c. Unfortunately, such rules are prohibited by many existing rule languages, because they could make the rules too expressive to be decidable. Hence, we study in which cases the specific axioms of order relations can be supported, without causing undecidability. We are still working on it and the results are not ready yet.

Tractability of probabilistic query evaluation

The last direction of my PhD research, and indeed the only one I really started during the PhD, is about the tractability of query evaluation on probabilistic database formalisms.

In fact, apart from the work with Tova and Yael and Pierre that deals with numerical values, everything that I described so far was about uncertainty in a logical sense, not in a quantitative sense. Things are either possible or impossible, either certain or not, but there is no numerical value that quantifies how likely something is over something else.

There are multiple reasons to focus on logical uncertainty. The first reason is that defining a set of possibilities is easier than additionally defining a probability distribution on them. It is hard to define meaningful and principled probabilities over the space of all possible consequences of an incomplete database, or over the linear extensions of a partial order.¹⁰ This is why people usually study probability distributions on comparatively simple models, e.g., tuple-independent databases (TID). With TID, the set of possible worlds has a clear structure (i.e., all subsets of the database) and a simple probability density: the probability of a subset is the probability of keeping exactly these facts and dropping the others, assuming independence.

The second difficulty is that, even on the TID model, computing probabilities is computationally intractable. Imagine that a dating website has a table indicating who sent a message to whom, in addition to a table indicating in which city users live. We pose a simple query asking for pairs of users that live in the same city, such that one messaged the other. On a standard relational database, this is easy to write in SQL, and can be efficiently evaluated. Now, imagine that we give a probability to each fact. Say, e.g., we are not sure about the possible cities a user may be in, and we only look at messages expressive a positive sentiment: in both cases we use machine learning to obtain probability values for each fact. Now, we want to find the probability that two people live in the same city and one sent the other a positive message.

As it turns out, this is much harder than plain query evaluation!¹¹ The problem is that correlations between facts can make the probability very intricate to compute: intuitively, we cannot hope to do much better than just enumerating the exponential number of possible subsets of the data and count over them. An important dichotomy result by Dalvi and Suciu classifies the queries (such as this example) for which this task is intractable, and the ones for which it is tractable.

With my advisor and Pierre Bourhis from Lille (whom I met at Oxford), we wondered whether everything was lost for hard queries. We had hope, because the result by Dalvi and Suciu allows arbitrary input databases, which is quite pessimistic: real data is usually simpler than the worst possible case. We thought of this in connection with a well-known result by Courcelle: some queries which are generally hard to evaluate become tractable when the data has bounded treewidth, intuitively meaning that it is close to a tree. In the probabilistic XML context, it had already been noted¹² that probabilistic evaluation is tractable on trees, e.g., on probabilistic XML data, even for very expressive queries. We thought that this could generalize to bounded treewidth data.

We show such a result in our work Provenance Circuits for Trees and Treelike Instances, which studies the more general question of computing provenance information. The field of provenance studies how to represent the dependency of query results on the initial data; it relates to probability evaluation because determining the probability of a query amounts to determining the probability of its provenance information, or lineage. Intuitively, the lineage of my example query would say "There is a match if: Jean messaged Jamie and Jean lives in Paris and Jamie lives in Paris, or..." To determine the probability that the query has a match, it suffices to evaluate the probability of this statement (which may still be intractable).

We rephrase probabilistic query evaluation in terms of provenance because there is a very neat presentation of provenance in terms as semiring annotations manipulated through relational algebra operators. Our work shows that how an analogous notion can be defined in the context of trees and automata, and can be "transported" through Courcelle's correspondence between bounded treewidth graphs and trees. As it turns out, the lineages that we can construct are circuits¹³ and themselves have bounded treewidth, so we incidentally obtain that probability evaluation is tractable on bounded treewidth instances. We extended this to show the tractability of many existing database formalisms, and the existence of tractable lineage representations.

In a more recent work, Tractable Lineages on Treelike Instances: Limits and Extensions, we show that bounded-treewidth tractability essentially cannot be improved: there are queries which are hard to evaluate for any restriction on input instances which does not imply that the treewidth is bounded (or that the instances are non-constructible in a strong sense). This lead us to investigate questions about the extraction of topological minors, using a recent breakthrough in this area. We developed a similar dichotomy result for other tasks, including a meta-dichotomy of the queries that are intractable in a certain sense¹⁴ on all unbounded-treewidth constructible instance families, a result reminiscent of the dichotomy of queries by Dalvi and Suciu, but with a very different landscape.

The hope would be to develop query evaluation methods on probabilistic data which use the structure of both the instance and query to achieve tractability, or to make well-principled approximations. We hope that such techniques could lead to realistic ways to evaluate queries on probabilistic databases.

Other things

This post only presents my past or current work that fits well with the general story that I plan to tell about my PhD research. I'm sorry to my past and current collaborators who are not featured in the list. In particular, some of my other collaborators are:

Luis Galárraga and Fabian Suchanek at Télécom;
Benoît Groz from LRI in Paris and Sophie Tison from Lille;
David Naccache and other fine people from ENS and beyond;
Tang Ruiming and Stéphane Bressan at National University of Singapore.

It is true that NoSQL databases have become popular as well, but it says something about the ubiquity of SQL that we refer to other databases as being "not SQL". Besides, the new trend of NewSQL databases further suggests that the relational model is here to stay. To give a practical example, Android and Firefox contain SQL database implementations, e.g., with sqlite. ↩
An example from 1977 is John Grant, Null values in a relational data base, which complains about the semantics. (Sadly, this paper seems paywalled, but I just give it for the reference and the date, not for its specific contents.) ↩
If you want to know more about uncertainty management frameworks on relational databases, a good reference is the Probabilistic Databases book by Suciu, Olteanu, Ré and Koch (but it doesn't seem available online). For probabilistic models, a good reference is this article by Benny Kimelfeld and my advisor Pierre Senellart. If you're curious about the connections between the two, I wrote an article with Pierre about precisely that. ↩
An example of an expensive feature is disjunction: if you can infer from the data that something or something else is true, it becomes harder to reason about the possible consequences of your data. ↩
Functional dependencies are slightly more complicated than DLs functionality assertions. This is because databases are tabular, so we can write functional dependencies like "the pair (first name, last name) determines the customer ID in the Customers table". This means that the Customers table cannot have two rows with the same first and last name but different customer IDs. Such rules do not make sense with DLs, because the "rows" have at most two columns. ↩
In particular, they cannot export two variables, because as I pointed out, this causes undecidability with functional dependencies. The other restrictions are that they can only deduce a single fact (so they respect the second condition about non-cyclic heads), they only use a single fact as hypothesis, and they do not any variable appear at multiple positions of an atom. ↩
In the example, we would deduce that each advisor is an advisee, and also that no advisor advises two different people. ↩
Yes, I speak from experience, and I have a specific website in mind, but I won't do them the favor of offering them free advertising as a reward for bad usability. ↩
Formally, we want to determine whether a labeled partial order has a compatible total order whose label sequence falls in a certain language. This interesting direction was not explored by standard poset theory, as far as we know, probably because it is too closely related to computational questions.
Intuitively, hardness is caused by the ambiguity introduced by duplicate values. In other words, if two different hotels have the same name in the input lists, it is hard to determine how they should be paired with the elements of the candidate combined list that we are checking. ↩
These are hard questions, but it doesn't mean that they wouldn't be interesting to study — quite the opposite in fact. :) For probabilities on linear extensions, see internship offer 2. For probabilities on infinite instance collections, as it turns out, we seriously tried to define such things via recursive Markov chains before we got sidetracked by simpler questions that were also unsolved. ↩
Formally, evaluating this query is #P-complete in the input database, whereas it is in AC⁰ if the input database is not probabilistic. ↩
Cohen, Kimelfeld, and Sagiv, Running Tree Automata on Probabilistic XML. ↩
Circuit representations of provenance are a recent idea by Daniel Deutch, Tova Milo, Sudeepa Roy, and Val Tannen. The fact that we use their work, however, is not connected to my collaborations with Daniel or Tova. It's a small world. ↩
Formally, they do not have polynomial-width OBDD representations of their provenance, which is a common reason for queries to be tractable; they may still be tractable for other reasons, though we do not show it. ↩