My research is in theoretical computer science, i.e., I prove mathematical results on abstract topics inspired by computers. I specifically work in database theory, which studies problems inspired by data management: how to efficiently evaluate complex queries over large datasets. I typically publish at open-access conferences like ICDT and ICALP and journals like LMCS.
This page is an high-level overview of my research, grouped by its main themes. I also have a chronological list of publications, which includes errata notes. You may also be interested by my list of open questions, or my list of talks. Last, for completeness, I have a list of old drafts.
Enumeration studies how to handle queries that return a very large number of results. Instead of computing all results, we want to enumerate them efficiently. In other words, we first compute a compressed representation of the results of the query. Then, we can produce the results one after the other, in a streaming fashion. We try to minimize both the preprocessing time required to produce the compressed representation, and the maximal delay between two consecutive answers when enumerating.
My approach for enumeration is to use variants of circuits as the compressed representations. To do so, we use restricted circuit classes inspired by the field of knowledge compilation. (I also study them for their own sake.) I started working on this in 2017 with Pierre Bourhis, Louis Jachiet, and Stefan Mengel. In an ICALP paper, we explained how to efficiently enumerate the satisfying valuations of certain classes of circuits, and showed how this recaptures a known enumeration theorem: we can enumerate the results of any fixed query in monadic second-order logic over trees with the best possible complexity, i.e., preprocessing linear in the size of the input tree, and delay linear in each produced output.
We showed in an ICDT 2019 paper with Matthias Niewerth how to use a similar approach to enumerate the results of such queries (represented as automata with capture variables) over words, while ensuring polynomial-time complexity in the automaton. There is an implementation of this approach, with benchmarks.
Combining both approaches, we showed in a 2019 paper how to achieve this enumeration result over trees, while remaining polynomial in a tree automaton representation of the query. We also explained how this structure can be maintained under changes to the underlying tree (but there is an important erratum to that result as of 2022).
In a 2022 paper, with Louis Jachiet, Martín Muñoz and Cristian Riveros, we went beyond regular languages and automata: we showed efficient enumeration approaches for context-free grammars with capture variables.
Dynamic data is data that can be updated, i.e., modified by the user. When evaluating a query over dynamic data, we may need to recompute the query result every time the data is modified: this is called maintaining the query.
I have studied how to enumerate query results over dynamic data, as explained above. But I have also studied the problem of maintaining Boolean queries, i.e., queries simply returning a yes or no answer. With Louis Jachiet and Charles Paperman, we studied the problem of maintaining membership to a fixed regular language. Specifically, we are given a word, which may be changed by character substitutions, and we want to re-compute after each substitution if the word falls in the fixed language or not. This is always possible in logarithmic time in the input word, but we studied in which cases it can be done more efficiently. Our work was presented at ICALP 2021 (best paper award for track B).
Probabilistic data is data whose exact state is unknown. The more common model is to have data items (e.g., database tuples, table rows, graph edges) annotated with a probability of existence, where we assume that the probabilities are independent. This implicitly represents a probability distribution, where we keep each item with the indicated probability. In this model, the task is to compute the probability that the data satisfies a specific query. This is related to the study of data provenance, i.e., describing which facts in the data are used to obtain a query result.
I worked on this problem since my PhD with Pierre Senellart and Pierre Bourhis. We have shown in 2015 how to compute circuit representations of provenance, and how to solve probabilistic query evaluation, when the input data has bounded treewidth, i.e., is intuitively close to a tree. We also showed in 2016 that, on probabilistic graphs and under some technical hypotheses, this was the best possible result in the sense that probabilistic query evaluation is always intractable if the treewidth is not bounded.
With our PhD student Mikaël Monet, we showed in 2017 other complexity results for this problem under the measure of combined complexity, and applied some circuit tools for the efficient evaluation of the Datalog query language.
More recently, I showed in 2020 with İsmail İlkan Ceylan that queries that are homomorphism-closed are never tractable on probabilistic graphs unless they are part of the known tractable queries in the already-studied class of unions of conjunctive queries. This work was presented at ICDT 2020 (best paper award). I also showed in 2021 with Benny Kimelfeld that evaluating non-hierarchical self-join-free conjunctive queries on probabilistic data is also intractable even in the unweighted case, i.e., all facts have the same probability 1/2.
In a 2018 article with Florent Capelli, Mikaël Monet, and Pierre Senellart, we studied how to connect these restricted circuit classes to graph-theoretic measures over the circuit, e.g., its treewidth or pathwidth.
This research started with my PhD work in 2017 with M. Lamine Ba, Daniel Deutch and Pierre Senellart about query evaluation on ordered data where the order relation is partial. I then studied a much more abstract version of this problem with Charles Paperman: we fix a regular language, and are given as input a directed acyclic graph whose elements are labeled with letters of the alphabet of that language. We must determine if we can achieve a word of the language as a topological sort of the graph. This task is computationally intractable in general, but we tried to determine for which languages it was tractable. Our work was presented at ICALP 2018.
I have been working on the complexity of reasoning with logical languages. In particular, I have studied a task commonly investigated in database theory: open-world query answering. In this problem, we want to determine if a query is entailed by a database under some logical constraints. I have studied this with Michael Benedikt, Pierre Bourhis, and Michael Vanden Boom for so-called frontier-guarded constraints (a restricted class ensuring that the problem is decidable), extended to express additional features such as transitive closure: this work was published at JAIR. I have also studied this problem with Michael Benedikt in the case which only looks at entailment over finite structures, showing how to solve this problem for a language of constraints featuring functional dependencies and unary inclusion dependencies; and also studied how to work on this problem with the two paradigms of description logics and of existential rules.