Oddly enough, I don't think I ever mentioned my PhD research on this blog.
Part of the reason is that I easily get fascinated by technical
questions, but I find it less natural to explain what I do in non-technical
terms.
However, I have to write the introduction to my PhD thesis,
so I thought it was a nice excuse to start by summarizing here what I am doing.

For readers
who don't know, I'm doing my PhD at Télécom
ParisTech, in the DBWeb
team. Before this, I stayed for some months in Tel Aviv
to work with Tova Milo, and in Oxford to work
with Michael Benedikt, which
explains some of my current collaborations.

**Shameless ad:** If you are
currently pursuing a Master's degree in computer science and would like to work at Télécom
ParisTech for your Master's internship on topics like those presented here,
please don't hesitate to apply to our offers for 2015–2016!
Offer 1,
Offer 2,
Offer 3.

## The general context: Uncertain data

The great success story of database research is the
relational model. It is a rare
example of something that is both extremely relevant in
practice^{1} and nice from
a theoretical perspective, with close connections to logics and finite model
theory. The relational model gives an abstract model of how
data should be represented and queried, which allows people to use databases as a
black box, leaving fine but domain-agnostic optimisations to database engines.
Relational databases, like compilers, filesystems, or virtual memory,
have become an abstraction that everyone relies on.

One thing that the relational model does *not* handle is uncertainty and
incompleteness in the data. In the relational semantics,
the database is perfectly accurate, and it contains all facts that may be of
interest. In practice, however, this is rarely satisfactory:

- Some values may be
**missing**: for instance, the user did not fill in some fields
when registering, or some fields were added since they registered.
- The data may be
**incomplete**: for instance, you are constructing your database
by harvesting data from the Web, and you haven't harvested everything yet;
but you still wonder whether the data that you have allows you to give a
definitive answer to your query.
- Data may be
**outdated**: for instance, the user moved but they did not update
their address.
- Data may simply be
**wrong**.

Indeed, nowadays, people manage data that may have been inferred by machine
learning algorithms, generated
by data mining techniques, or automatically
extracted from natural
language text. Unlike with a hand-curated database, you can't just assume that mistakes will
not happen, or that you will fix all of them when you find them. The data is
incomplete and contains errors, and you have to deal with that.

These important problems are already being addressed, but in unsatisfactory
ways. The general domain-agnostic solutions are fairly naive: e.g., set a confidence
threshold, discard everything below the threshold, consider everything above the
threshold as certain. In domains where uncertainty is paramount and where simple solutions do not suffice, e.g., in
character recognition (OCR) or
speech recognition
(ASR), people use
more elaborate solutions, e.g., weighted
automata.
The problem is that these solutions are often *domain-specific*, i.e., they are
tailored to the domain at hand. The vision of uncertain data
management is that there should be principled and generic tools to manage such
noisy and incomplete data in all domains where it occurs; in the same way that
relational databases today are both highly optimized and domain-agnostic tools.

For now, the most visible aspect of uncertainty management in concrete databases is
nulls, which are part of the SQL
standard. NULL is a special value indicating that
a record has no value for a specific attribute: for instance, because the
attribute did not exist when the record was created. Of course, there are many
problems that NULLs do not address:

- NULL can only represent missing values in a record, not missing
records ("There may be more entries that I don't know about"),
or uncertainty about whether the record exists altogether ("I think I know
this, but it may be wrong.")
- NULL just means that we know nothing about the value: it doesn't allow us to
restrict the set of possible values, to model them using a
probability distribution, or to state that two unknown values must be the same.
- The semantics of NULLs (as defined by the SQL standard) are very
counter-intuitive. For instance, trying to select all values which are
either
*equal to 42* or *different from 42* will not return the values marked
NULL (see the example after this list). This is because NULLs are interpreted using a three-valued
logic, but it still does
not make sense in terms of possible worlds: NULL represents an unknown
value, and no matter what the value is, it must be equal to 42 or different from
42...
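
To make this concrete, here is a small experiment you can run (a minimal sketch using SQLite from Python; the table and the values are of course made up):

```python
import sqlite3

# A small illustration of SQL's three-valued logic for NULL,
# using an in-memory SQLite database (table and values made up).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (val INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(42,), (7,), (None,)])

# Intuitively, every row should match: any value is either 42 or not 42.
rows = conn.execute(
    "SELECT val FROM t WHERE val = 42 OR val <> 42").fetchall()
print(rows)  # [(42,), (7,)] -- the NULL row is not returned

# The reason: NULL = 42 and NULL <> 42 both evaluate to UNKNOWN,
# and WHERE only keeps rows where the condition is TRUE.
```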

The semantics of NULLs for SQL databases are now set in stone by the standard,
even though people have complained about their imperfections as early as
1977^{2}, and even though they
still
confuse
people today.
There is still much debate about what the semantics of NULLs could have been
instead;
see, e.g.,
the keynote by
Libkin, which contains more
counter-intuitive examples with NULLs, a review of other theoretical models of
NULLs to address some of these problems, as well as proposed fixes.

Researchers are also studying uncertainty management approaches that have nothing to
do with NULLs.
For instance, *incompleteness* can be managed by evaluating queries on databases
under a different semantics, the *open-world
assumption*,
which posits that facts missing from the database are unknown rather than false. To represent
uncertainty about the *existence* of tuples, one can use, e.g., *tuple-independent
databases*: they are standard relational databases, but with records
carrying an independent "probability of existence". This formalism cannot
represent *correlations* between tuples (if this tuple is here, then this tuple
must be here, or cannot be here), and the result of queries on such databases
may not be representable in the same formalism, but more expressive formalisms
exist, such as *pc-tables*.
There are also
formalisms that can represent uncertainty on other structures, such as
XML documents^{3}.

The goal would be to develop a system where relations can be marked as
incomplete, and data can simply be marked as
uncertain, or inserted directly with probabilities or confidence scores; the
system would maintain these values during query evaluation, to be able to tell
how certain its answers are. For now, though, real-world uncertain database
systems are quite experimental.
There are many reasons for that:

- It's hard to
*define* principled semantics on uncertain data.
- It's hard to find ways to
*represent* uncertainty about any kind of data.
- Somewhat surprisingly, probabilistic calculations on data are often
*computationally
expensive*, much harder than without probabilities!

This is why we are still trying to understand the basic principles of the right way to
represent and query uncertainty in practice, which is the general area of my
PhD.
The main way in which my work differs from existing work is that I focused on
restricting the *structure* of uncertain data, so that it can be
queried in a decidable and tractable fashion. More specifically, my PhD research has
studied three main directions:

- Connecting approaches for
incomplete data reasoning that originated in various communities,
to query incomplete data whose
*structure* is constrained by expressive but decidable rule
languages.
- Representing uncertainty on new
structures, namely, on order relations, which had not been investigated
before.
- Achieving computational
efficiency for query evaluation, when we
add quantitative uncertainty in the form of probability distributions, by
restricting the
*structure* of data, e.g., imposing bounded treewidth.

## Connecting different approaches to reason on incomplete data

The first direction deals with incompleteness in databases, and reasoning
about an incomplete database in an open-world setting: What is *possible*? What is
*certain*? The challenge is that this problem has been studied by various
communities. On the one hand, database researchers have studied how to reason
about databases. On the other hand, people who study reasoning and artificial
intelligence are interested in drawing inferences from data, and have tried to
make their approaches work on large data collections.

An important such community is that of description
logics (DLs). Description
logics are designed for reasoning, with the following goals:

- Separate the data and the rules used to reason about the data.
- Remain computationally tractable when the data is large, as long as the rules
remain manageable.
- Study precisely how the computational cost varies depending on which kind of
reasoning operators
^{4} you allow in the rules.

The data managed in the DL context differs from that of relational databases, however.
The semantics is not exactly the same, but, most
importantly, the data representation is different:

- **DLs** represent data as a **graph**,
where relations connect at most two objects. For instance, "John lives in
Paris."
- **Relational databases**, on the other hand, are a **tabular** formalism, where each
row has a fixed schema that relates multiple objects at once. For instance,
"John was born in Paris in 1942".

A different formalism, which works with relational databases, is that of
*existential rules* (or *tuple-generating
dependencies* in
database parlance). Such rules allow you to deduce
more facts from existing facts: "If someone was born in France and they have a parent
born in France, then they are French." However, they cannot express other constraints
such as *negation*: "It is *not* possible that someone is their own child." Further, *existential rules* can assert the
*existence* of new objects: "If someone won a literary prize, then they wrote
*some* book."
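
For instance, the literary prize rule and the negation example could be written as follows (the predicate names here are mine); only the first is an existential rule:

```latex
% The prize rule, as an existential rule:
\forall x \; \big( \mathrm{WonLiteraryPrize}(x) \;\rightarrow\; \exists y \;
  \mathrm{Book}(y) \wedge \mathrm{Wrote}(x, y) \big)

% The negation example is *not* an existential rule, since its conclusion
% is falsehood rather than a conjunction of facts:
\forall x \; \big( \mathrm{ChildOf}(x, x) \;\rightarrow\; \bot \big)
```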

For people working on databases, DLs, and existential rules, the goal is to design
languages to reason about incomplete data, with the following properties:

- **principled**, though the languages may make different semantic assumptions;
- **expressive**, so they can accurately model the rules of the real world;
- **decidable**, so that they are not so expressive that it would be
fundamentally impossible to
reason about them;
- **efficient**, so that machines can reason with them with
reasonable performance.

### Description logics and existential rules

A first part of my PhD research is about bridging the gap between these
communities. The goal is to design languages to reason about incomplete data
that *combine* rules from these various communities. My work
with Michael Benedikt (Oxford),
*Combining
Existential Rules and Description Logics*,
connects the two formalisms I presented.

Say that you have a database, which you know is incomplete, and you have rules
to reason about the implicit consequences of the data.
Our work studies hybrid languages where you can express both existential
rules and DL constraints. More precisely, we determine when such languages are
decidable, and when they are
not, i.e., when it is fundamentally impossible to reason about them. The goal is to have
the "best of both worlds":
on parts of the data that can be represented as a graph, we would use the
expressive DL rules; on parts of the data which are more naturally described as
a relational database, we would have less expressive constraints inspired by
existential rules.

In our work, we pinpointed which features of these two languages are dangerous
and yield an undecidable language if you mix them. The big problematic feature
of DLs is *functionality assertions*: being able to say that, e.g., a person
has *only one* place of birth. If we want to express such constraints, we must do
away with two problematic features of existential rules. The first is
*exporting two variables*: "If *someone* was born in *some* country, then *that
person* has lived in *that country*." What we deduce should only depend on one
element from the hypothesis: "If *someone* won a literary prize, then *they* wrote
some book." The second is a restriction on the facts that the rules
can deduce: they must not form complex **cyclic** patterns, in a certain technical sense.
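
To illustrate the difference with made-up predicates: the first rule below exports two variables from its hypothesis to its conclusion, while the second exports only one, which is the restricted shape we can keep.

```latex
% Exports both x and y from the hypothesis to the conclusion:
\forall x \, \forall y \; \big( \mathrm{BornInCountry}(x, y)
  \;\rightarrow\; \mathrm{LivedInCountry}(x, y) \big)

% Exports only x (the existential y is fresh):
\forall x \; \big( \mathrm{WonLiteraryPrize}(x) \;\rightarrow\; \exists y \;
  \mathrm{Book}(y) \wedge \mathrm{Wrote}(x, y) \big)
```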

Apart from these problematic features, however, we show that we can combine
existential rules and expressive DL rules. This gives us hope: it could be
possible to reconcile the two approaches and design very expressive languages that
capture both kinds of constraints.

### Open-world query answering and finiteness

I have explained how we connected two *non-database* approaches to reason about
incomplete data. I also worked with Michael on reasoning
about relational databases, in our work *Finite Open-World
Query Answering with Number Restrictions*.

The general context is the same: we have incomplete data and we want to reason
about its consequences, using logical rules. This time, however, our rule
language is just defined using classical integrity
constraints from relational
database theory. Indeed, people have designed many ways to constrain the
structure of databases to detect errors in the data: "Any customer in the Orders
table must also appear in the Customers table". We use these as rules to reason
about the *complete* state of the data, even though our database is incomplete
and may violate them.

The language of constraints that we use includes functional
dependencies, which are
like^{5} the functionality assertions in DLs that I already
mentioned. We also study unary
inclusion
dependencies, which are
existential rules^{6} but of a very restricted kind.

The main problem comes from the semantics.
In database theory, people usually assume databases to be *finite*. In particular,
when using the rules to derive consequences, the set of consequences should be
finite as well. This is
justified by common sense ("the world is finite") but this problem is usually
neglected in works about DLs and existential rules. So we studied *finite
reasoning* in this database context, for the open-world query answering task of
reasoning with our rules and incomplete information.

It is not hard to see that assuming finiteness makes a difference in some cases.
Consider the following information about an organization:

- Jane advises John (the database)
- Each advisee is also the advisor of someone (an inclusion dependency)
- Each advisee has a single advisor (a functional dependency)

Is it true, then, that someone advises Jane? The rule allows us to deduce that
John, as he is advised by Jane, advises someone else, say Janice; Janice herself
advises someone else, say Jack. In general, this could go on indefinitely, and
we cannot deduce that someone advises Jane.
However, if we also assume that the organization is finite, then it has to stop
somewhere: someone along this chain (Jennifer, say) must advise someone that we
already know about. And by the rule that no one is advised by two different
people, we deduce that Jennifer must be advising Jane.
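
To make this concrete, here is one way to write the example as constraints (the relation and attribute names are mine), along with the extra rule that finiteness forces, as discussed next:

```latex
% The database: a single fact in a relation Advises(advisor, advisee).
\mathrm{Advises}(\mathrm{Jane}, \mathrm{John})

% Inclusion dependency: each advisee also appears as an advisor.
\mathrm{Advises}[\mathrm{advisee}] \subseteq \mathrm{Advises}[\mathrm{advisor}]

% Functional dependency: an advisee determines their (single) advisor.
\mathrm{advisee} \rightarrow \mathrm{advisor}

% Over finite databases, the two rules above imply the reverse inclusion,
% i.e., every advisor is also an advisee: this is why someone must advise Jane.
\mathrm{Advises}[\mathrm{advisor}] \subseteq \mathrm{Advises}[\mathrm{advisee}]
```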

As we show in our work, the only difference that finiteness makes is that it
causes more rules to be added^{7}.
Once this is done, it turns out that we can
essentially *forget* about the finiteness hypothesis, because we can no longer
distinguish between finite and infinite consequences. This is surprisingly
difficult to show, though; to establish this correspondence, we need much
more complex tools than those used in the infinite case to actually do the
reasoning!

## Representing uncertainty on ordered data

The second part of my PhD research deals with the representation of uncertainty
on new kinds of data, namely, ordered data. Order can appear at different levels:
on values ("retrieve all pairs of people when one is born before the other")
or on records ("retrieve all page views that occurred yesterday").
In SQL databases, order is supported with the standard comparison operators on
numbers ("WHERE length > 4200"), with sorting operators ("ORDER BY date"), and
with the LIMIT operator to select the first results.

### Order on facts

Why is it important to have *uncertain* ordered data? Well, an interesting
phenomenon is that combining certain ordered data may cause uncertainty to
appear. Consider a travel website where you can search for hotels and which
ranks them by quality. You are a party of four, and you want either two twin
rooms, or one room with four beds; but the website doesn't allow you to search
for both possibilities at once^{8}. So you do one search, and then the other, and you get two
ordered lists of hotels. In real life you would then browse through both lists, but
this is inefficient! What you would really want, in database parlance, is the
union
of these two lists, i.e., the list of hotels that have either two twin rooms or
one 4-bed room. But *how should this list be ordered*? This depends on the
website's notion of quality, which you don't know. Hence, the order on the
union is *uncertain*. It is not fully unspecified, though: if two hotels occur
together in one of the lists, then their relative order there should probably be reflected in the
union. How can you formalize this?

To solve this kind of problem, people have studied *rank
aggregation*: techniques to reconcile ordered lists of results. However, these
methods are *quantitative*, i.e., if something appears near the top in most
lists then it should appear near the top in the result. What we set out to do instead
is to *represent* what we know for sure, i.e., the set of all
possible *consistent* ways to reconcile the order, without trying to make a
choice between them.

Couldn't we just use existing uncertainty representation frameworks? If you try,
you will realize that they do not work well for ordered data. You don't want
tuple-level uncertainty, because you know what the results are. You could say
that the *position* of each result is uncertain: "this result is either the first
or the third". Yet, when you know that hotel *a* must come *before* hotel *b*,
you must introduce some complicated dependency between the possible ranks of *a*
and *b*, and this gets very tedious.

With M. Lamine Ba (a former fellow PhD
student at Télécom, now in Qatar),
Daniel Deutch (from Israel) and
my advisor Pierre Senellart,
we have studied this question in our work,
*Possible and Certain Answers for Queries over Order-Incomplete Data*.
We use the mathematical notion of a partial
order as
a way to represent uncertain ordered data. More precisely, we explain how we can
apply to partial orders the usual operators of the relational
algebra (select, project,
union, product). We extend this to
*accumulation*, to answer
queries such as "is it certain that all hotels in this district are better than
all hotels in that district?"
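
Here is a toy sketch in Python of this idea on the hotel example (hotel names invented; this only illustrates the definitions, not the algorithms of the paper): within each list the order is known, across lists it is not, and we can then ask which comparisons are certain and which total rankings remain possible.

```python
def make_order(lists):
    """Known order facts from several totally ordered lists: within a
    list the order is known, across lists it is unknown."""
    edges = set()
    for ranking in lists:
        for i, a in enumerate(ranking):
            for b in ranking[i + 1:]:
                edges.add((a, b))  # a is ranked before b
    return edges

def certainly_before(edges, a, b):
    """a precedes b in every possible completion iff the known facts
    already imply it (edges are transitively closed by construction)."""
    return (a, b) in edges

def is_possible_ranking(edges, ranking):
    """Is this total ranking consistent with the known order facts?"""
    pos = {x: i for i, x in enumerate(ranking)}
    return all(pos[a] < pos[b] for (a, b) in edges)

# Two search results, assumed disjoint here for simplicity.
twin_rooms = ["Hotel A", "Hotel B", "Hotel C"]
four_bed = ["Hotel D", "Hotel E"]
edges = make_order([twin_rooms, four_bed])

print(certainly_before(edges, "Hotel A", "Hotel C"))  # True: same list
print(certainly_before(edges, "Hotel A", "Hotel D"))  # False: uncertain
print(is_possible_ranking(edges, ["Hotel D", "Hotel A", "Hotel B",
                                  "Hotel E", "Hotel C"]))  # True
```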

This poses several interesting challenges. The most technically interesting one
is that of
*efficiency*: how hard is it to compute the result of such queries? Somewhat
surprisingly, we show that it is intractable
(NP-hard). In fact, it is already
hard to determine, given such a *partially ordered database*, whether some
totally ordered database is possible^{9}. In other words, it is intractable already
to determine whether some list is a consistent way to reorder the
data!
To mitigate this, we study "simplicity
measures" of the original ordered relations (are they totally ordered, totally
unordered, or not too ordered, not too unordered?), and we determine when queries become
tractable in these cases, intuitively because the result of integrating the
relations must itself be simple.

### Order on values

I also work on uncertainty on orders with
Tova Milo and
Yael Amsterdamer (from Tel Aviv). This
time, however, the partial order itself is known, and we want to represent uncertainty
on *numerical values* that satisfy the order.

Our work,
*Top-k Queries on Unknown Values under Order
Constraints*, is
inspired by crowdsourcing
scenarios. When working
on data that you collect from the crowd, every query that you make introduces
latency (you need to wait for the answers to arrive) and monetary costs (you
have to pay the workers). Hence, you cannot
simply acquire data about everything that you are interested in:
instead, you must extrapolate
from what you can afford. Following our earlier
works
on similar questions,
we are interested in classifying
items in a taxonomy of products. In this example application, the order is
induced by the *monotonicity
property*: shirts are more specific than clothing, so if we classify an item as
being a shirt, it must also be a piece of clothing.

The underlying technical problem is to
interpolate missing values under
order constraints, which I think is an interesting question in its own
right.
We study a principled way to define this formally, based on geometry in
polytopes, and again study the computational complexity of this task: why is it computationally
intractable in general, and under which circumstances can it reasonably be
done? For instance, we can tractably perform the task when the taxonomy of
products is a tree.
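
As a toy illustration of the interpolation problem, here is a naive midpoint heuristic on a made-up taxonomy (this is not the polytope-based method of the paper, just a sketch of the kind of question being asked):

```python
def interpolate(tree, known):
    """Toy interpolation of unknown node values in a tree taxonomy, under
    the monotonicity constraint value(parent) >= value(child).
    `tree` maps each node to its parent (None for the root); `known` maps
    some nodes to values in [0, 1].  Unknown nodes get the midpoint of
    their feasible interval, processing parents before children."""

    def ancestors(n):
        while tree[n] is not None:
            n = tree[n]
            yield n

    def depth(n):
        return sum(1 for _ in ancestors(n))

    result = dict(known)
    for node in sorted(tree, key=depth):          # parents come first
        if node in result:
            continue
        # Upper bound: smallest value already fixed on an ancestor (or 1).
        hi = min([result[a] for a in ancestors(node)] + [1.0])
        # Lower bound: largest known value below this node (or 0).
        lo = max([known[d] for d in tree
                  if d in known and node in set(ancestors(d))] + [0.0])
        result[node] = (lo + hi) / 2
    return result

# Made-up toy taxonomy: clothing is more general than shirts, which is
# more general than t-shirts; the value of "shirts" is unknown.
tree = {"clothing": None, "shirts": "clothing", "t-shirts": "shirts"}
known = {"clothing": 0.9, "t-shirts": 0.3}
print(interpolate(tree, known))  # interpolates "shirts" to 0.6
```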

### Reasoning about order

More recently, I also started working with Michael
Benedikt on open-world query answering
tasks that involve order relations. Again,
this is useful in practice, because order often shows up in queries and rules;
however, it is challenging because it is hard to talk about orders in rules.
Essentially, to write a rule that says that something is an order, you have to express
transitivity: if *a* <
*b* and *b* < *c* then *a* < *c*.
Unfortunately, such rules are prohibited by many
existing rule languages, because they could make the rules too expressive to be
decidable. Hence, we study in which cases the specific axioms of order relations
can be supported, without causing undecidability.
We are still working on it and the results are not ready yet.

## Tractability of probabilistic query evaluation

The last direction of my PhD research, and indeed the only one I really
*started* during the PhD, is about the tractability of query evaluation on
probabilistic database formalisms.

In fact, apart from the work with Tova and Yael and Pierre that deals with
numerical values, everything that I described so far was about uncertainty in a
*logical* sense, not in a *quantitative* sense. Things are either possible or
impossible, either certain or not, but there is no numerical value that quantifies how
likely one possibility is relative to another.

There are multiple reasons to focus on logical uncertainty.
The first reason is that defining a set of possibilities is easier than
additionally defining a probability distribution on them. It is hard to define meaningful and
principled probabilities over the space of all possible consequences of an
incomplete database, or over the linear extensions of a partial
order.^{10} This is
why people usually study probability distributions on comparatively simple
models, e.g., tuple-independent databases (TID). With TID, the set of possible
worlds has a clear structure (i.e., all subsets of the database) and a simple
probability distribution: the probability of a subset is the probability of keeping
exactly these facts and dropping the others, assuming independence.
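
Concretely, if each fact t of the database D carries a probability p(t), the probability of a possible world W ⊆ D, and of a Boolean query q, are:

```latex
\Pr(W) \;=\; \prod_{t \in W} p(t) \, \prod_{t \in D \setminus W} \bigl(1 - p(t)\bigr)
\qquad\text{and}\qquad
\Pr(q) \;=\; \sum_{W \subseteq D,\; W \models q} \Pr(W)
```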

The second reason is that, even on the TID model, computing probabilities is
computationally intractable. Imagine
that a dating website has a table indicating who sent a message to whom, in addition to a
table indicating in which city users live. We pose a
simple query asking for pairs of users that live in the same city, such that one
messaged the other. On a standard
relational database, this is easy to write in SQL, and can be efficiently
evaluated. Now, imagine
that we give a *probability* to each fact. Say, e.g., we are not sure about the
possible cities a user may be in, and we only look at messages expressing a positive
sentiment: in both cases we
use machine learning to obtain probability values for each fact.
Now, we want to
find the *probability* that two people live in the same city and one sent the
other a positive message.

As it turns out, this is *much* harder than plain query
evaluation!^{11}
The
problem is that correlations between facts can make the probability very
intricate to compute: intuitively, we cannot hope to do much better than just
enumerating the exponentially many possible subsets of the data and counting
over them.
An important
dichotomy result
by Dalvi and
Suciu classifies the queries (such as
this example) for which this task is intractable, and the ones for which it is
tractable.
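
To see why this gets expensive, here is a brute-force computation of the example query's probability on a tiny made-up TID instance (all names, facts, and probabilities are invented): it enumerates the exponentially many possible worlds, which is exactly what we cannot afford on real data.

```python
from itertools import combinations

# Facts of a tiny tuple-independent database, with their probabilities.
messaged = {("Jean", "Jamie"): 0.8, ("Joe", "Jean"): 0.5}
lives_in = {("Jean", "Paris"): 0.9, ("Jamie", "Paris"): 0.6,
            ("Joe", "Lyon"): 0.7}
facts = [("M",) + k + (p,) for k, p in messaged.items()] + \
        [("L",) + k + (p,) for k, p in lives_in.items()]

def query_holds(world):
    """True if, in this possible world, one user messaged another
    and both live in the same city."""
    msgs = {(a, b) for (rel, a, b, _) in world if rel == "M"}
    homes = {(a, c) for (rel, a, c, _) in world if rel == "L"}
    return any((u, c) in homes and (v, c) in homes
               for (u, v) in msgs for (_, c) in homes)

def query_probability(facts):
    """Sum the probabilities of all the worlds where the query holds."""
    total = 0.0
    for r in range(len(facts) + 1):
        for world in combinations(facts, r):   # exponentially many worlds
            p = 1.0
            for f in facts:
                p *= f[3] if f in world else 1 - f[3]
            if query_holds(world):
                total += p
    return total

print(query_probability(facts))  # about 0.432 (= 0.8 * 0.9 * 0.6 here)
```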

With my advisor and Pierre Bourhis
from Lille (whom I met at Oxford), we wondered whether everything was lost for
hard queries. We had hope, because the result by
Dalvi and
Suciu
allows arbitrary input databases, which is quite pessimistic: real data is usually simpler
than the worst possible case. We thought of this in connection with a
well-known result by
Courcelle:
some queries which are generally hard to evaluate become tractable when the data
has bounded treewidth, intuitively
meaning that it is close to a tree. In the probabilistic XML context, it
had already been noted^{12} that probabilistic evaluation is
tractable on trees, e.g., on probabilistic XML data, even for very expressive
queries. We thought that this could generalize to bounded treewidth data.

We show such a result in our work
*Provenance Circuits for Trees and Treelike
Instances*, which
studies the more general question of computing
*provenance information*. The field of *provenance* studies how to represent the
dependency of query results on the initial data; it relates to probability evaluation
because determining the probability of a query amounts to determining the
probability of its provenance information, or lineage. Intuitively, the
lineage of my example query would say "There is a match if: Jean messaged Jamie
and Jean lives in Paris and Jamie lives in Paris,
or..." To determine the probability that the query has a match, it suffices to
evaluate the probability of this statement (which may still be intractable).
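
In symbols, writing a Boolean variable for each fact of the database, the lineage of the example query is a formula of the following shape (the disjunction ranges over the people and cities actually in the data), and the probability of the query is the probability that this formula is true:

```latex
\Phi \;=\; \bigvee_{u,\, v,\, c}
  \mathrm{Messaged}(u, v) \wedge \mathrm{Lives}(u, c) \wedge \mathrm{Lives}(v, c)
\qquad\text{and}\qquad
\Pr(\text{query has a match}) \;=\; \Pr(\Phi)
```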

We rephrase probabilistic query evaluation in terms of provenance because there is a very
neat presentation
of provenance as semiring
annotations manipulated through relational
algebra operators. Our work shows
how an analogous notion can be defined in the context of trees and automata, and
can be "transported" through Courcelle's correspondence between bounded treewidth
graphs and trees. As it turns out, the lineages that we can construct are
circuits^{13} and themselves have bounded treewidth, so we
incidentally obtain that probability evaluation is tractable on bounded
treewidth instances. We extended this to show the tractability of many
existing database formalisms, and the existence of tractable lineage
representations.

In a more recent work,
*Tractable Lineages on Treelike Instances: Limits and
Extensions*,
we show that bounded-treewidth tractability essentially cannot be
improved: there are queries which are hard to evaluate for *any* restriction on
input instances which does not imply that the treewidth is bounded (or that the
instances are non-constructible in a strong sense). This led us to investigate
questions about the extraction of
topological
minors, using a
recent
breakthrough in this area. We developed a
similar dichotomy result for other tasks, including a *meta-dichotomy* of the
queries that are intractable in a certain sense^{14} on all unbounded-treewidth constructible instance
families, a result reminiscent of the dichotomy of queries by
Dalvi and
Suciu, but with a very different
landscape.

The goal would be to develop query evaluation methods on
probabilistic data which use the *structure* of both the instance and query to
achieve tractability, or to make well-principled approximations. We hope that
such techniques could lead to realistic ways to evaluate queries on
probabilistic databases.

## Other things

This post only presents my past or current work that fits well
with the general story that I plan to tell about my PhD research. I'm sorry to
my past and current collaborators who are not featured in the list. In
particular, some of my other collaborators are: