This page is a collection of open problems in theoretical computer science. I have not investigated all of them thoroughly, but I find them interesting. It also features a list of other lists of open problems.
Some problems are very precisely formulated, others are fuzzier. The categories are here for convenience but they are mostly random.
Some of these problems are existing conjectures, though I avoided those that everyone should know, like P vs NP. For problems which are well-known conjectures with existing descriptions that are better than what I could write, I have a dedicated category. Other problems are open problems left as future work in publications that I found interesting. Yet other problems come from my research: however I usually do not list problems that I am working on right now, because the phrasing of the problem (or my understanding of possible solutions) could change too quickly for me to keep this page in sync. Also, some of my questions on TCS.SE may not be immediately reflected here, as I first wait to see whether I get an answer.
For more information about directions for new research (featuring questions that are less precise but more related to my interests), you can look at my internship and PhD offers. If you are a student interested in working with me, not all problems on that page are a good choice: the problems in this list on which I believe I could reasonably supervise someone are marked with an (*).
If you know the answer to a problem, or if you have new information about it, I would be very interested to know! You can write me at a3nm. @a3nm.net
This document is likely to contains errors of various kinds, and I make no promises about its correctness. Please report any mistakes to the address above.
With Yael Amsterdamer and Tova Milo, we worked on the problem of learning monotone predicates on lattices via crowd queries. We used the result that it is #P-hard to count the number of antichains in a poset1. However, our results could be strengthened if we knew that hardness also held for distributive lattices. Is it also #P-hard to count antichains in distributive lattices?
More generally, the question is: on which classes of posets is it #P-hard to count antichains?
Our work about learning monotone predicates on lattices via crowd queries shows that (under some formulations of the problem) it is computationally intractable to do so in a way that minimizes the oracle complexity (number of oracle queries). However, we were unable to analyse the oracle performance of non-optimal strategies which are computationally inexpensive.
Formally, the task is the following: we are given a poset (V,<) and we wish to learn exactly a Boolean predicate P:V↦{0,1} which is monotone: if P(x)=1 and x<y then P(y)=1. A simple algorithm is to pick (uniformly at random) an element x such that P(x) is unknown, evaluate P(x) with the oracle, integrate the consequences of the answer (set P(y)=0 for all y<x if P(x)=0, set P(y)=1 for all y>x if P(x)=1), and repeat while there are unclassified element. Are there posets where the performance of this strategy (expected number of oracle queries) is significantly worse than the (computationally intractable) optimal strategy?
More generally, are there learning algorithms for this task which are computationally tractable and yet have near-optimal oracle complexity?
Our work on order-incomplete data studies labeled posets, which consist of a poset (P,<) and a function μ from P to some finite alphabet Σ. The label of a linear extension x1,…,xn of (P,<,μ) is the word μ(x1)⋯μ(xn). A simple question is whether we can efficiently determine, given a labeled poset (P,<,μ) and a word w in Σ∗, whether (P,<,μ) has a linear extension of label w. Our work shows that this problem is hard even for comparatively simple P, and shows classes of labeled posets where it is tractable. It also studies a generalization of this question where you do not consider just the free monoid Σ∗ on Σ, but an arbitrary monoid structure: given a monoid element w and a labeled poset, we ask whether there is a linear extension whose label evaluates to w according to the monoid law.
Of course, another variant of this problem is to consider a language L on Σ, for instance a regular language, and ask whether the labeled poset has a linear extension whose label is in L. The more general question is: for which regular languages is this class tractable? Could one show a dichotomy result separating tractable and intractable languages?
For instance, with the alphabet Σ={a,b} and language L=(ab)∗, I do not know whether it is tractable, given an input poset (P,<,μ), to test whether it has a linear extension where we alternate between elements labeled a and elements labeled b.
A related question is to characterize the complexity of enumerating linear extensions of labeled posets: the problem can be done in constant amortized delay for classical posets3, but I am not aware of analogous results for labeled posets.
With M. Lamine Ba, Daniel Deutch and Pierre Senellart, we worked on how to represent and query order-incomplete data. In this context, we studied labeled posets (defined as above) and study the question of how to consolidate duplicates in a labeled poset. By this, we mean that we want to compute a labeled poset where the element labels are unique (all elements with the same label are collapsed to the same element), and the remaining order relations between the collapsed elements are sensible.
However, if you see a labeled poset as a way to represent a set of possible orders (its linear extensions), and if you want the duplicate removal operation to act consistently when looking at each linear extension (i.e., the linear extensions of the duplicate consolidation should be the result of removing duplicates in the linear extensions of the labeled posets), then it is not clear at all how to proceed. In fact, the result of duplicate consolidation may even not be representable as a labeled poset.
Is there a formalism that can represent the result of duplicate consolidation on labeled posets?
Define labeled posets as in the previous section. Consider a sequence S of words given as input. A labeled poset (P,<,μ) represents S if S is exactly the set of labels that can be achieved by the linear extensions of the poset. Of course, a sequence S cannot be thus represented unless all its sequences are built using the same multiset of elements, which should be the image of μ (seen as a multiset). However, if S respects this condition, it does not seem easy to determine whether it can be represented or not.
What is the complexity of determining, given a set of sequences, whether it can be represented by a labeled poset in this sense?
In our work with Yael Amsterdamer, Tova Milo, and Pierre Senellart, we study an interpolation scheme on partial orders, defined as the center of mass of the convex polytope defined by the order constraints. As it turns out, however, the scheme is not stable: if we fix some variables to their interpolation result, the interpolation of other variables may change.
Is there a principled stable interpolation scheme on posets? Is it unique?
In our work with Yael Amsterdamer, Tova Milo, and Pierre Senellart, we consider partial orders whose Hasse diagram (or its reverse) is a directed tree. We do not know whether our study would extend to partial orders whose Hasse diagram has bounded treewidth.
Is there a structure on the set of linear extensions of a poset whose Hasse diagram has bounded treewidth? Alternatively, is there a structure on the convex polytope defined by such order constraints?
A related question: are there languages for which constrained topological sorting is tractable on partial orders whose Hasse diagram is a tree, or has bounded treewidth, but is hard on arbitrary DAGs?
For all n∈N, what is the minimal number of elements of a poset with exactly n linear extensions, if one exists? What is the asymptotic growth of this function?
The analogous question for antichains has been studied4, with posets of size logarithmic in the desired number of antichains. However, the argument is much simpler, because the posets can be easily constructed from smaller posets. By contrast, the number of linear extensions of a poset seems hard to determine from elementary "constituent parts".
Given a directed graph G, a source vertex s, and a sink vertex t reachable from s, we want to know if there is a simple directed path from s to t whose length is greater than that of the shortest path from s to t. Is this problem in polynomial time?
A seemingly related question is that of simple paths on DAGs with backward edges. Consider an input DAG G, two vertices s and t, and additional back edges such that whenever (u,v) is an additional edge then u is reachable from v in G. Is it NP-hard to determine whether there is a simple path from s to t that uses at least one additional edge?
An undirected graph has diameter 2 if, for every pair of vertices u≠v, there is a path of length at most 2 connecting u and v (i.e., with at most one intermediate vertex). A 3-coloring of a graph is a function mapping each vertex to a value in {1,2,3} such that no two adjacent vertices are mapped to the same color.
Given a graph of diameter 2, can we decide in polynomial time whether it admits a 3-coloring?
We are given a directed graph G whose edges are annotated with independent probabilities of existence, and we want to estimate the probability of obtaining a subgraph of G which contains a directed cycle. Is there an FPRAS for this task?
Given an undirected graph and two nodes, can one decide in polynomial time whether there is a simple path of length 0 modulo 3 between the two nodes?
The problem without the "simple" requirement is clearly in PTIME, and the same problem on a directed graph is NP-hard by a reduction from the 2 disjoint paths problem. The question is also open for paths of length p modulo q for other values of p and q. It can be determined in PTIME if all simple paths between two vertices of an undirected graphs are of length p modulo q, solving the problem for q=28.
Up to subdividing each edge by 2 and connecting the source and target nodes by an edge, we can reduce to the problem of deciding if an undirected graph has a simple cycle of length 2p+1 mod 2q. This question is also open: a related comment is at the end of these slides (2015). A sufficient condition for tractability was given in 20049, but it does not cover the right cases, so the problem remains open.
I have written (in 2023) a more detailed writeup on the related work around this problem.
The hunters and rabbit problem is a pursuit-evasion game played on a undirected graph. A strategy for k hunters is a sequence s1,…,sn of subsets of k vertices. A strategy is winning if, for every walk v1,…,vn in the graph (not necessarily simple), there is i such that vi∈si. Intuitively, the evader is walking on the graph (it has to move at each turn by following exactly one edge) and must avoid the k vertices that the pursuers examine at each time step.
The evasion number of a graph G is the smallest k for which there is a winning strategy with k hunters. What is the complexity, given an undirected graph, of computing its evasion number? e.g., it is the case that, for any constant k, we can recognize the graphs of evasion number k in polynomial time?
This relates to the question of whether graphs with evasion number ≥k can be somehow characterized, similarly to the links between treewidth and pursuit-evasion via havens. It also relates to the question of how the maximal length of a strategy can be bounded as a function of the graph size and of k.
Treewidth measures how much an undirected graph is close to a tree. It is known that, for any fixed k∈N, we can check in linear time in an input graph G whether its treewidth is ≤k; but that, when both k and G are given as input, it is NP-hard to determine whether G has treewidth ≤k.
Is this last statement still true if G is planar, or can the treewidth of a planar graph G be computed in PTIME?
The parameterized cliquewidth computation problem for k∈N asks, given an input graph G, whether G has cliquewidth ≤k. Is this problem in PTIME for any fixed k? Is there k∈N such that the problem is NP-hard?
Membership in PTIME was recently shown17 for k≤3, but the problem is open for larger k. In particular, we do not know whether the problem is in FPT. Membership in FPT is known for some restricted classes of graphs of unbounded cliquewidth18, and membership in cubic FPT is known for the related parameter of rankwidth19.
The grid minor theorem20 of Robertson and Seymour shows that, if a family of graphs has unbounded treewidth, then one can find arbitrary large grid graphs as minors of the family. Specifically, the result can be stated as follows21: there exists a function f:N>0→N>0 such that, for every g∈N>0, every graph of treewidth ≥g has the (g×g)-grid as a minor.
The best known upper bound21 on f is ~O(g19), where the ~O notation neglects polylogarithmic factors. The best known lower bound22 is Ω(g2logg).
What is the correct bound on the function f?
Continuing on the grid minor theorem from the previous entry, the following result is known23: there is c∈N such that for any n∈N, for any graph G of treewidth ≥nc, one can find the n×n grid as a minor of G. This work shows how the minor can be found in randomized polynomial time.
Is it possible to extract the minor in deterministic PTIME?
We extend monadic second-order logic (MSO) over graphs with constructs to check the equality of the cardinalities of second-order variables, and check that their cardinalities are in fixed recursive sets. We fix a formula Φ in this logic and a treewidth bound k∈N. Is it PTIME in an input graph G of treewidth ≤k to check whether it satisfies Φ?
Courcelle's theorem states that this is true (and in linear time) for normal MSO without the extension.
A labeling of a directed acyclic graph (DAG) G=(V,E) is a function μ from V to some set Σ of labels such that the labels of a pair of vertices suffice to test reachability. Formally, there is a decoding function f:Σ2→{0,1} independent of G such that for any two vertices u,v∈V, we have f(μ(u),μ(v))=1 iff v is reachable from u in G.
I am interested in labeling schemes where Σ is Nd for a certain d, and where for simplicity we assume that μ cannot use the same number twice: there are no u≠v in V such that, letting ℓ=μ(u) and ℓ′=μ(v), we have ℓi=ℓ′j for some 1≤i<j≤d. Further, the decoding function f can only use the labels to do comparisons on their components, i.e., f, when applied on (ℓ,ℓ′) can only see whether ℓi<ℓ′j for all 1≤i,j≤d. Given d∈N and a choice of decoding function f, the labelable graphs of f are those for which there is a reachability labeling scheme for f.
These definitions generalize several known notions:
The case d=1 is uninteresting: there is only a single possible scheme f, and the labelable graphs are the directed path graphs.
For d=2, considering the decoding function f(ℓ,ℓ′) that checks whether ℓ2<ℓ′1, the labelable graphs are those that represent interval orders
For d=2, considering the decoding function f(ℓ,ℓ′) which is true iff ℓ1<ℓ′1 and ℓ2<ℓ′2, the labelable graphs are those that represent posets with order dimension at most two. More generally, for arbitrary d∈N, for the function that takes the AND of the pointwise comparability relations, the labelable graphs are those that represent a partial order with dimension ≤d.
For other values of d and decoding functions, how can we characterize the labelable graphs? In particular, for a given value of d, are some decoding functions more expressive than others? (For d=2 and the two decoding functions that I presented, interval orders and orders of dimension ≤2 are incomparable.) Further, is there d∈N and some decoding function such that any graph has a reachability labeling with Nd? (This sounds extremely unlikely but I'm not sure of how to prove it.)
Let G=(V,E) be a directed acyclic graph (DAG). A Boolean realizer of G is intuitively a way to label each vertex of G with integers, such that reachability in G can be determined only by looking at some Boolean function over pairwise comparisons of the vertex labels. Formally, a Boolean realizer of G of dimension d∈N consists of a Boolean function ϕ on d variables, and of d labellings ℓ1,…,ℓd, each of which is an injective function from V to N: we require that for any two vertices u,v∈V, letting xi for all 1≤i≤d be 1 if ℓi(u)<ℓi(v) and 0 otherwise, we have ϕ(x1,…,xd)=1 iff v is reachable from u in G. The Boolean dimension of G is then the smallest d for which G has a Boolean realizer of dimension d.
Is the Boolean dimension of planar DAGs bounded? In other words, is there a constant k∈N such that, whenever G is planar (i.e., it can be drawn in a way such that no two edges cross), then the dimension of G is no greater than k?
A simplicial decomposition of a graph G is a tree decomposition such that, for any two adjacent bags b and b′ whose set of vertices is respectively X and X′, the subgraph induced by G on X∩X′ is a clique. The width of the decomposition is the size of the largest bag minus one, as usual for treewidth. The simplicial width of G is the smallest possible width of a simplicial decomposition. What is the complexity, given a graph G, of computing its simplicial width and a simplicial decomposition? One can ask the question when allowing arbitrary G, or when assuming that the simplicial width of the input graph is bounded by a constant.
Note that there are known results33 on the complexity of computing a clique minimal separator decomposition, i.e., a simplicial decomposition where we minimize the size of the cliques, not of the bags.
What is the complexity of determining if an input undirected graph can be covered by cycles, each of which have length at least 5? The problem is known to be PTIME for a length constraint of "at least 3" or "at least 4", and NP-hard for a length constraint of "at least k" for all k≥6.
Consider a set of d-dimensional vectors V=v1,…,vn of numbers (or indeed any totally ordered set). Let us assume for simplicity that all the vil are distinct for 1≤i≤n and 1≤l≤d. In this context, we say that vi dominates vj if vil≥vjl for all 1≤l≤d. The skyline (or Pareto frontier) of V is the set of maximal vectors for the domination relation.
Assume that the only way you can access the vectors of V is by atomic comparisons, i.e., you can evaluate whether vil<vjm for any 1≤i,j≤n, 1≤l,m≤d. What are bounds on the minimal number of comparisons required to determine the skyline? Such bounds can be stated as a function of k and n, or as a function of the output size (the actual skyline).
Groz and Milo10 have studied this problem when the comparisons are additionally noisy, but good bounds for d>3 are not yet known, even without noise.
Consider a set Σ of tuple-generating dependencies (TGDs) and a set Φ of equality-generating dependencies (EGDs). We say that a Boolean conjunctive query (CQ) Q is entailed by a relational database instance I and by Σ∧Φ if, for every superinstance I′⊇I such that I′ satisfies Σ∧Φ, it is the case that I′ satisfies Q. Intuitively, Q is implied by the instance I and the constraints Σ∧Φ under open-world semantics: Q is true on all superinstances of I that satisfy the constraints Σ∧Φ.
We call Σ and Φ separable if, for any Boolean CQ Q and instance I that satisfies Φ, the following equivalence holds: Q is entailed by I and Σ∧Φ iff Q is entailed by I and Σ. In other words, the equality-generating dependencies Φ can be checked separately on I, and have no impact afterwards in terms of entailment.
For general TGDs and EGDs, the problem is undecidable11, but this is not so surprising as reasoning with TGDs by themselves is also undecidable if arbitrary TGDs are allowed. Are there less expressive languages for which separability is decidable? One could think, e.g., of inclusion dependencies and functional dependencies, maybe with some restrictions.
A linear rule is a tuple-generating dependency (TGD) with exactly one atom in the body and one atom in the head. A transitivity assertion on an arity-two relation R is a TGD of the form ∀xyz R(x,y)∧R(y,z)⇒R(x,z).
It is decidable25 whether an instance I and constraints Σ of linear rules and transitivity assertions entail a CQ Q, in the sense of the previous problem, but under some restrictions (all relations have arity at most two, or Q consists of a single atom, or an additional condition holds). Does decidability hold in the general case of arbitrary CQ and arbitrary arity signatures for linear rules and transitivity assertions?
With Michael Benedikt we introduced a language formed of expressive constraints on an arity-two signature (including number restrictions, namely counting quantifiers) and tuple-generating dependencies on arbitrary arity predicates, also supporting functional dependencies of a restricted kind on such predicates. We showed that it was decidable to determine (in the sense above) whether a query is entailed by an instance and such constraints, which we call the open-world query answering (OWQA) problem.
This raises the question of whether a more expressive and more uniform such language could be designed. It should also have decidable entailment, and should also feature (1) arbitrary arity constraints, (2) number restrictions such as functional dependencies, and (3) expressive logical operators such as disjunction. Can such a language be designed? Relevant results include the following (beyond the ones shown in the paper mentioned above):
A general idea would be to design a language that generalizes frontier-one TGDs to have disjunction and some kind of "non-conflicting" counting quantification, and establish decidability.
Following this paper, another natural direction would be to study finite model reasoning for such a language. However, one should first show the decidability of the finite implication problem for it: decide whether a set of constraints of the language implies other constraints. Yet, as far as I know, the only decidability result for finite implication on an arbitrary-arity language that features number restrictions only covers40 unary inclusion dependencies and functional dependencies. Can finite implication be decided for a more expressive language of this form? E.g., in the spirit of existing works41 but for arbitrary arity constraints.
This question uses the definition of OWQA from the previous question. It was recently shown42 that the finite satisfiability problem is decidable for the two-variable guarded fragment of first-order logic with counting quantifiers (GC²) to which one adds path-functional dependencies. Is the same true of the finite OWQA problem? This is already known if there are no path dependencies26.
If this is true, it would imply an independent proof of the result of this paper when assuming that all functional dependencies are unary. Indeed, unary inclusion dependencies and functional dependencies on arbitrary signatures can clearly be equivalently translated to GC² plus path functional dependencies (with paths of length 2). To re-prove this result for arbitrary functional dependencies, one would need some generalization of path functional dependencies (to capture the image of this translation).
Is conjunctive query containment under bag semantics decidable, and what is its complexity?
For k∈N, the path query qk of length k on a directed graph G=(V,E) returns the set of pairs of nodes (u,v)∈V2 such that there is a path of length k from u to v. A union of path queries qS is defined by a (possibly infinite) set S of integers and returns the sets of pairs of nodes (u,v) such that there is a path from u to v whose length is in S.
A set of union of path queries Q=qS1,…,qSn determines a path query qk if, for all finite graphs G and G′, if qSi(G)=qSi(G′) for all 1≤i≤n, then qk(G)=qk(G′). Under mild assumptions on the representation of infinite sets in union of path queries, the determinacy problem was shown44 to be decidable but only for path queries qk where k is larger than some function of Q. Is it decidable to determine whether a path query is determined by a set of union of path queries, without this restriction?
Another question is whether this result extends to more general query languages. For instance, is it decidable to determine whether a union of path queries is determined by a set of union of path queries?
A natural further generalization of this problem is going from paths of a fixed length to paths labeled by some language, when edges are labeled by letters of some alphabet. Formally, we define a regular path query qL as a query on edge-labeled graphs that returns the pairs of vertices connected by a path in L, where L is a regular language over the alphabet of edge labels. In this context, however, it was shown that it was undecidable whether a regular path query was determined by a set of regular path queries45, and it is even undecidable whether using only finite regular languages46 (or, equivalently, union of labeled path queries, i.e., unions of queries returning all pairs of vertices separated by a path labeled by some finite word). So with these generalisations of the above problem being undecidable, it is not clear whether we should expect decidability for (unlabeled) path queries, or unions thereof.
Thanks to Nadime Francis and to Bartosz Bednarczyk for helping me to prepare this entry.
A tuple-independent database (TID) is a probabilistic database consisting of a relational database I where each tuple is given a probability in [0,1]. The semantics is a probability distribution on subsets of I (the possible worlds), obtained by considering that each tuple is either kept or removed with the indicated probability, independently across tuples.
A Boolean union of conjunctive queries (UCQ) is a disjunction of conjunctive queries with no free variables. The probability evaluation problem for a fixed Boolean UCQ q asks, given a TID I, what is the probability that I satisfies q, meaning, what is the total probability mass of possible worlds of I that satisfy q. A dichotomy result is known47: Boolean UCQs are partitioned between those for which the probability evaluation problem can be solved in polynomial time, and those where it is #P-hard.
The lineage of a Boolean query q on a TID instance I is a Boolean formula whose variables are the facts of I, such that a valuation makes the circuit evaluate to true iff the corresponding possible world (defined by keeping exactly the facts set to true) satisfies q. One way to show that a Boolean UCQ is tractable is to show that, on any instance, one can represent its lineage in a form that allows for tractable probability evaluation.
A d-DNNF¬ is a Boolean circuit (with AND, OR, NOT gates, and input gates) such that every AND gate is on disjoint set of variables (meaning, there is no input gate reachable from two distinct children of an AND gate), and every OR gate is exclusive (meaning, there is no valuation of the input gates that makes two distinct children of an OR gate evaluate to true). A Boolean UCQ has polynomial-size d-DNNF¬ if there is a constant c∈N such that, for any TID instance I, there is a d-DNNF¬ of size O(|I|c) that expresses the lineage of q on I.
It is known48 that many Boolean UCQs for which probabilistic query evaluation is in PTIME have polynomial-size d-DNNF¬, but the converse is open. Is there a Boolean UCQ for which probabilistic query evaluation is in PTIME, but that has no polynomial-size d-DNNF¬? In particular, is this the case of the conjectured48 counterexample Q9? If this can be shown, is it possible to design a generalization of d-DNNF¬ that still enjoys probabilistic query evaluation, and is such that any Boolean UCQ with PTIME probabilistic query evaluation has a polynomial-size lineage in this generalization?
Update: the most recent work in this area is this paper by Mikaël Monet
We talk of a Boolean conjunctive query Q being entailed by an instance I and set Σ of tuple-generating dependencies (TGDs) if for any instance I′⊇I such that I′⊨Σ, we have I′⊨Q.
The set Σ of tuple-generating dependencies (TGDs) has bounded derivation depth if, for any conjunctive query Q, there exists a union of conjunctive queries Q′ such that, for any relational instance I, the following equivalence holds: Q is entailed by I and Σ iff I⊨Q′.
The set Σ of TGDs is finitely controllable iff for any instance I and Boolean conjunctive query Q, the following equivalence holds: Q is entailed by I and Σ iff Q is entailed by I and Σ over finite models (impose that I′ is finite in the definition above).
Is it the case that if a set of TGDs has bounded derivation depth then it is finitely controllable? In the work where this conjecture was posed49, it was solved for signatures of arity two, but the general question remains open.
This question refers to the dichotomy47 on the Boolean UCQs for which the probability evaluation problem is in PTIME: see this question for background. The queries which are tractable according to this dichotomy are safe.
What is the complexity, given a Boolean UCQ, to determine whether it is safe?
The best known algorithm for this is super-exponential15.
The query evaluation problem asks, given a Boolean conjunctive query Q and a relational database I, whether Q holds on I. The treewidth of I is the treewidth of its Gaifman graph, and the treewidth of Q is that of its canonical instance (i.e., we just see the query as a database on its variables). What is the parameterized complexity of the query evaluation problem when parameterized by the treewidth of I and of Q? (In other words, the parameter of the problem is an upper bound on the treewidth of both I and Q.)
In this paper with Pierre Bourhis and Pierre Senellart, we showed (Theorem 4.2) that there is a query for which the probabilistic query evaluation problem (see this question for the definition) is intractable on any unbounded-treewidth instance family. However, the lower bound of this result has some limitations, and I do not know whether they can be lifted:
In the same paper as in the previous question, we showed (in Lemma 8.2) that, for any graph signature, there is a UCQ with inequalities Q such that, given any instance I, the width of any OBDD representing the lineage of Q on I (see this question for the definition) is bounded by an exponential function of the treewidth of I: specifically, there is a constant c∈N>0 such that it is in Ω(2(width(I))1/c).
Can the same result be shown for other lineage representations? In particular, can it be shown for d-DNNFs?
Update: our ICDT'19 paper with Mikaël Monet and Pierre Senellart shows a similar lower bound for the class of d-SDNNF (structured d-DNNF). The question of showing a similar bound for d-DNNF is very challenging and relates to the open problem of separating DNFs in general and d-DNNFs.
The prime CNF of a monotone Boolean function over variables x1,…,xn is the (unique up to order) expression of the function as a conjunction of disjunctions of the xi from which no variable occurrence can be removed. The dual of a monotone Boolean function f is the monotone Boolean function mapping x1,…,xn to ¬f(¬x1,…,¬xn). By De Morgan's laws, a Boolean expression for the dual can be obtained by replacing ∧'s by ∨'s and vice-versa in the prime CNF of f.
The Dual problem asks, given two prime CNFs, whether the functions defined by these CNFs are dual of one another. It is known to be in quasipolynomial time16. Is the Dual problem in PTIME?
Shannon proved in 1949 with a counting argument that most Boolean functions cannot be represented by a circuit of linear size. However, we do not know yet of any explicit Boolean function for which no linear size circuit exists. Can we construct such a function?
Boolean functions can be represented as circuits or as formulae. Circuits seem much more concise, because they can reuse common subexpressions. Yet the best conciseness gap known is the following: there are Boolean functions that can be represented by linear circuits but for which any formula representation has size at least n3−o(1) (over circuits with AND, OR, and NOT). Can we do better?
A Boolean circuit is structured if there is a fixed full binary tree whose leaves are labeled by the variables (the vtree) and a mapping from the gates of the circuit to the nodes of the tree such that every variable is mapped to the leaf corresponding to itself and the inputs to every gate g are mapped to descendants of the node to which g is mapped. A Boolean circuit is smooth if, intuitively, no variable is omitted, i.e., whenever we take the disjunction of two gates then the set of variables reachable from each gate is the same.
Given a Boolean structured circuit, what is the complexity of computing an equivalent circuit which is structured but still smooth? What about the same question while preserving other desirable circuit properties such as being deterministic?
A circuit is decomposable if the inputs to every AND-gate depend on pairwise disjoint sets of variables. What can be said about the same question for smoothing decomposable circuits? There is a partial result (applying only to a certain class of "smoothing-gate algorithms") as Theorem 5.2 in the paper above.
An n-superpermutation is a word w over {1,…,n} such that each permutation of {1,…,n} occurs as a subsequence of w. What is the length of the shortest n-superpermutations as a function of n?
Which languages can be recognized by a (not necessarily uniform) family of deterministic finite automata of polynomial size?
A primitive word is a word that cannot be represented as a power of another word. Is it true that, on any alphabet with more than one letter, the set of all primitive words is not context-free?
Fix an alphabet Σ. We say that a word w∈Σ∗ is the shuffle of two words u,v∈Σ∗ if we can obtain w by interleaving u and v. A word w∈Σ∗ is a shuffle square if there is a word u∈Σ∗ such that we can obtain w as the shuffle of u and u. (Note that this implies that, for each letter a∈Σ, there is an even number of occurrences of a in w. While the number of occurrences of each letter in u are uniquely defined from w, the order of the letters is not.) We say that a word is shuffle-square-free if it contains no substring which is a shuffle-square.
If the alphabet Σ has at least 6 letters then it is known that there exist infinitely many shuffle-square-free words50, following an earlier result51 on alphabet size 7. If the alphabet Σ has 3 letters or less, then it is known52 that such words do not exist: the longest such words on Σ={a,b,c} are abcacbacabc, acbabcabacb, bacbcabcbac, bcabacbabca, cabcbacbcab, cbacabcacba, of length 11 each. (I'm adding them to this entry to ensure that people interested in this computation can easily find the page. ;)) The question is open for alphabets of size 4 and 5.
Are there infinitely many shuffle-square-free words on an alphabet of size 4?
Given a tree alphabet Σ, we define the alphabet ¯Σ to consist of {a∣a∈Σ}∪{¯a∣a∈Σ}. Given an unranked tree T on a tree alphabet Σ, the XML representation of T is the word on ¯Σ recursively defined as follows: the coding of a leaf labeled a∈Σ is a¯a, and the coding of an internal node labeled a∈Σ with children n1,…,nk is ac1⋯ck¯a where each ci is the coding of ni.
Given a tree automaton A on unranked trees on Σ, it defines a so-called regular tree language L(A), and we denote by XML(L(A)) the word language on ¯Σ of the XML codings of trees of L(A). Of course, for all regular tree languages except finite ones, XML(L(A)) is not regular as a word language, because a word automaton cannot check if the opening and closing tags (i.e., the a's and ¯a's) are properly nested. The question is intuitively to understand the regular tree languages for which matching the opening and closing tags is the only difficulty.
Formally, we say that a word automaton A weakly recognises a tree language L if, given a word w on ¯Σ which is the XML representation of some tree T, then A accepts w iff T∈L. (The behavior of A on words of ¯Σ that do not represent any tree is not specified.)
Given a tree automaton A, is it decidable to determine if there exists a word automaton that weakly recognizes the language XML(L(A))?
The problem, in the specific case of DTDs (a special case of tree automata), has been shown53 to be equivalent to a variant of the word problem for groups, whose decidability status is open. The status of the general problem is also open.
Thanks to Bartosz Bednarczyk for pointing out this problem to me, and to Charles Paperman for the problem phrasing used here.
The Why-provenance54 of a Boolean query q in the positive relational algebra on a relational instance I is a Boolean function ϕ defined on the facts of I such that, for any valuation ν mapping each fact of I to true or false, ν(ϕ) is true iff {F∈I∣ν(F) is true} satisfies q.
It is known54 that, for any such fixed query q, its Why-provenance on I can be represented as a Boolean formula of polynomial size in the instance I. Are there instance families with known lower bounds on the representation of Why-provenance? In particular, is there a query q and family of relational instances I1,…,In,…, such that, letting ϕi for all i be the Why-provenance of q on In, the size of ϕi is superlinear in Ii, namely, there is no constant K such that, for all i, the provenance ϕi can be written as a Boolean formula of size less than K⋅|I|, where |I| is the number of facts of I.
The question generalizes to other representations of provenance, such as provenance circuits55, or to provenance expressed in different provenance semirings54, such as the universal semiring N[X] of polynomials in X (standing for the facts of I).
We have a certain number m of machines, and a number n of jobs. All jobs have the same duration (an integer p), and each job has a minimum start date and a maximal end date, both of which are integers. The scheduling problem asks whether there is a way to schedule all n jobs on the m machines. We can phrase it in two variants: the decision variant simply asks whether it is possible, and the computation variant asks us to compute a possible schedule, i.e., a partition of the n jobs into m sequences such that each machine can perform its sequence of m jobs in a way that respects the start date and end dates of all jobs, and the job duration. The problem can equivalently be phrased with jobs of unit duration and start and end dates which are rationals.
Somewhat surprisingly (to me at least), this problem can be solved in PTIME. The best known algorithm60 (which additionally optimizes an additional criterion) runs in time O(min(1,pm)n2). What is the most efficient algorithm for this problem?
For the case m=1, a more efficient algorithm is known61, which runs in O(nlogn). In this setting, there is a clear lower bound of Ω(nlogn) for the computation problem, as the scheduling problem for jobs with disjoint intervals amounts to sorting their bounds; but for the decision problem, this is unclear.
Thanks to the judges of the 2017 ACM-ICPC World Finals for bringing this problem to my attention (see Problem H, "Scenery").
In this question, all numbers are represented as rationals. Consider a convex polytope described as an intersection of half-spaces, each of which is described as inequalities between a linear function of the coordinates and a constant (e.g., x1+742x2≤3). We assume that the polytope is bounded, i.e., it is not infinite. Given such a representation, we wish to compute the volume of the polytope. It turns out62 that this cannot be done in polynomial time (or even in nondeterministic PTIME) because the volume is not always of polynomial length in the input description. Hence the question: For which classes of polytopes is the volume always of polynomial length in the input?
Here are links to some other specific open questions found on other places on the Web:
This section regroups problems that occurred in a previous version of this list, and were solved since then.
It is known that a maximum matching can be found in PTIME. In particular, this is the case of bipartite graphs. However, what about the situation where we are allowed to remove vertices in one of the parts, and where we want to maximize the number of vertices that must remain unmatched in a maximum matching? What is the complexity of this problem?
Proved to be in PTIME by Chao Xu in this TCS.SE answer. Thanks!
The query determinacy problem asks, given a set Q1,…,Qn of conjunctive queries (CQs) and an additional CQ Q, whether the Qi determine Q, i.e., whether, for any two (possibly infinite) relational databases I and I′, having Qi(I)=Qi(I′) for all 1≤i≤n implies we must have Q(I)=Q(I′). The query determinacy problem was shown43 to be undecidable but this work left open the problem of finite determinacy, which is defined as determinacy except that I and I′ must be finite.
Is finite determinacy also undecidable for conjunctive queries?
Proved by the authors of the original work43 in a new paper. As this was accepted for publication at PODS 2016, I think we can say that this has been accepted by the community, although I haven't personally checked the proof.
Grötzch's theorem says that planar graphs without 3-cycles (triangles) are 3-colorable. Is every planar graph without 4-cycles and 5-cycles 3-colorable?
Disproved in this paper. I haven't personally checked it, but it has been published in the Journal of Combinatorial Theory, so I'm assuming it is correct.
Letting Γ be a digraph, the CSP for Γ, written CSP(Γ), is the following problem: given a digraph G, decide whether there is a homomorphism from G to Γ. The Feder-Vardi conjecture34 asks: Is there a dichotomy on Γ, with CSP(Γ) being always either in PTIME or NP-hard?
An analogous result is already known for undirected graphs35, for some restricted classes of directed graphs36, and when Γ has two elements37 or more recently three elements38.
Proved: two recent preprints by Bulatov and by Zhuk claim a proof of this result. The papers were accepted for publication at FOCS 2017 so I'm classifying this as solved (accepted by the community), even though it may still be a good idea to verify these proofs in more detail (see this post). I have not personally checked any of this.
Thanks to Mikaël Monet for pointing out that the conjecture for digraphs implies39 the conjecture for relational structures.
It is #P-hard to count the number of linear extensions of an input poset2. The proof uses posets of height 3, and leaves open the natural question: is it #P-hard to count linear extensions in posets of height 2?
An equivalent formulation is in terms of directed bipartite graphs (all edges go from one part to the other part): given a directed bipartite graph as input, is it #P-hard to count the number of topological sorts of the graph?
Solved in this paper (shown to be #P-hard). Note that I haven't personally checked it. Thanks to Kuldeep S. Meel for pointing me to that paper.
This is similar to the question above, but for order dimension: is it #P-hard to count linear extensions in posets of dimension 2?
Solved in the same paper as above (shown to be #P-hard).
The so-called monk problem12 is a pursuit-evasion game played on a directed graph. A strategy for k pursuers is a sequence s1,…,sn of subsets of k vertices. A strategy is winning if, for every (non-simple) walk v1,…,vn in the graph, there is i such that vi∈si. Intuitively, the evader is walking on the graph (it has to move at each turn by following exactly one edge) and must avoid the k vertices that the pursuers examine at each time step.
The evasion number of a digraph G is the smallest k for which the pursuers have a winning strategy. What is the computational complexity, given an input digraph, of computing its evasion number?
Solved in this paper: it is already NP-hard to check if an input digraph has evasion number 1.
This section contains former questions which I no longer think are interesting or sensible, or for which my interest has waned.
The fluted fragment is a fragment of function-free first-order logic for which the satisfiability problem is decidable13 (but non-elementary). This definition is reminiscent of inversion-free queries, which are known14 to be exactly the UCQs that admit OBDD lineages on TID databases (see this question for relevant definitions).
Is there any connection between these two classes?
The original version of this question incorrectly claimed that inversion-free queries are the positive existential fragment of fluted queries, but this is in fact not correct: there are non-hierarchical CQs, e.g., ∃xyR(x),S(x,y),T(y), which are fluted. So while there is still a connection in the definitions, the link is no longer as compelling, and I don't think the question is so interesting anymore.
Thanks to the anonymous Reviewer 3 of our paper at PODS 2016 for pointing out this connection. Thanks to Bartosz Bednarczyk for bringing back this problem to my attention, which made me realize the error.
The PrXML formalism is a way to represent probability distribution on labeled, unranked, unordered trees56. It does so by representing uncertainty with special kinds of nodes:
We give names to families of probabilistic XML documents according to the nodes that are allowed, e.g., PrXMLind, mux refers to probabilistic documents where nodes of type ind and mux are allowed. A family is efficiently translatable in another if any document of the first family can be rewritten to a document in the second. Some families are known to be efficiently translatable in others57, but open questions remain: is PrXMLexp efficiently translatable in PrXMLmux,det, or even to PrXMLcie?
Thanks to Pierre Senellart for helping me to prepare this entry.
Using the previous definitions, the semantic equivalence problem for two PrXML documents asks whether the two documents define the same probability distribution on outcomes (same possible worlds, with same probabilities). Given two PrXMLcie documents, what is the probability of deciding whether they are semantically equivalent? The problem is in EXPTIME but no lower bound or hardness result is known58. What is the complexity of this problem?
The structural equivalence problem asks, given two PrXMLcie documents, whether, for each valuation of the variables, the corresponding valuations of both documents are the same. (Note that the probabilities assigned to the events is irrelevant.) The problem is in coNP, but no lower bound or hardness result is known59. What is the complexity of this problem?
Thanks to Pierre Senellart for helping me to prepare this entry.
J. S. Provan, M. O. Ball. The Complexity of Counting Cuts and of Computing the Probability that a Graph is Connected, 1983. ↩
G. Brightwell, P. Winkler. Counting Linear Extensions is #P-Complete, 1991. ↩
G. Pruesse, F. Ruskey. Generating Linear Extensions Fast, 1994. (closed access) ↩
K. Ragnarsson, B. Eileen Tenner. Obtainable Sizes of Topologies on Finite Sets, 2008. To understand the connection, remember that a topology with m points and k open sets is the same as a poset with m elements and k order ideals (as they state in the abstract), and this is the same as a poset with m elements and k antichains (take the set of maximal elements of each order ideal). Their Theorem on page 2 gives the logarithmic bound. We independently derived a similar result with a similar proof but a worse bound in our paper with Yael Amsterdamer and Tova Milo, Lemma B.2, and the size of the posets is in O(log2n). ↩
S. Akmal, V. V. Williams, R. Williams, Z. Xu., Faster Detours in Undirected Graphs, 2023. Look for "it remains open whether k-Longest Detour is in P even for the special case of k = 1!" ↩
M. Piecyk. C2k+1-coloring of bounded-diameter graphs, 2024. ↩
W. Martens, T. Popp. The Complexity of Regular Trail and Simple Path Queries on Undirected Graphs, 2022. ↩
E. M. Arkin, C. H. Papadimitriou, M. Yannakakis. Modularity of Cycles and Paths in Graphs, 1991. ↩
E. Hemaspaandra, H. Spakowski, M. Thakur. Complexity of Cycle Length Modularity Problems in Graphs, 2004. ↩
B. Groz, T. Milo. Skyline Queries with Noisy Comparisons, 2015. ↩
A. Calì, G. Gottlob, G. Orsi, A. Pieris. On the Interaction of Existential Rules and Equality Constraints in Ontology Querying, 2012. (closed access) ↩
B. Fredriksson, E. Lundberg. The Monk Problem: Verifier, heuristics and graph decompositions for a pursuit-evasion problem with a node-located evader, 2015. ↩
I. Pratt-Hartmann, W. Szwast, L. Tendera. Quine’s Fluted Fragment is Non-Elementary, 2016. ↩
A. Jha, D. Suciu. Knowledge Compilation Meets Database Theory: Compiling Queries to Decision Diagrams, 2011. See Theorem 4.2. ↩
D. Suciu, D. Olteanu, C. Ré, C. Koch. Probabilistic Databases, 2011. (Closed access.) See end of Section 4.1.6.1. ↩↩
T. Eiter, K. Makino, G. Gottlob. Computational Aspects of Monotone Dualization: A Brief Survey, 2008. ↩
D. G. Corneil, M. Habib, J.-M. Lalignel, B. Reed, U. Rotics. Polynomial-time recognition of clique-width ≤ 3 graphs, 2012 (only available behind a paywall). ↩↩
P. Heggernes, D. Meister, U. Rotics. Computing the clique-width of large path powers in linear time via a new characterisation of clique-width), ↩
P. Hliněný, S. Oum. Finding Branch-Decompositions and Rank-Decompositions, 2007. ↩
N. Roberston, P. D. Seymour. Graph Minors. V. Excluding a Planar Graph, 1986. ↩
J. Chuzhoy, Improved Bounds for the Excluded Grid Theorem, 2016. See Theorem 1.1, and the three following paragraphs. ↩↩↩↩
N. Roberston, P. D. Seymour. R. Thomas. Quickly Excluding a Planar Graph, 1994. See last paragraph of Section 1. ↩
C. Chekuri, J. Chuzhoy. Polynomial Bounds for the Grid-Minor Theorem, 2014. ↩
C. Chekuri, personal communication, 2015. (That's academic lingo to say that we emailed them and they don't know yet.) ↩
J.-F. Baget, M. Bienvenu, M.-L. Mugnier, S. Rocher. Combining Existential Rules and Transitivity: Next Steps, 2015. ↩
I. Pratt-Hartmann. Data-complexity of the two-variable fragment with counting quantifiers, 2009. ↩↩
V. Bárány, G. Gottlob, M. Otto. Querying the Guarded Fragment, 2014. ↩
A. Calì, G. Gottlob, A. Pieris. Towards more expressive ontology languages: The query answering problem, 2012. See Section 6.2.2. ↩
J. Mitchell. The Implication Problem for Functional and Inclusion Dependencies, 1983 ↩
A. Calì, D. Lembo, R. Rosati. Query rewriting and answering under constraints in data integration systems, Theorem 3.4. However, the proof of this result has a slight inaccuracy; see Appendix A.2 of this paper for details. ↩
G. Gambosi, J. Nešetřil, M. Talamo. On locally presented posets, 1990. ↩
S. Felsner, T. Mészáros, P. Micek. Boolean dimension and tree-width, 2018. ↩↩
H.-G. Leimer. Optimal decomposition by clique separators, 1993. ↩
T. Feder, M. Y. Vardi. The Computational Structure of Monotone Monadic SNP and Constraint Satisfaction: A Study through Datalog and Group Theory, 1998. ↩
P. Hell, J. Nešetřil. On the complexity of H-coloring, 1990. ↩
L. Barto, M. Kozik, M. Maróti, T. Niven. CSP dichotomy for special triads, 2009. ↩
T. J. Schaeffer. The complexity of satisfiability problems, 1978. ↩
A. A. Bulatov. A dichotomy theorem for constraints on a three-element set, 2002. ↩
T. Feder, M. Y. Vardi. The Computational Structure of Monotone Monadic SNP and Constraint Satisfaction: A Study through Datalog and Group Theory, 1998. See Theorem 10 and subsequent theorems. ↩
S. S. Cosmadakis, P. C. Kanellakis, M. Y. Vardi. Polynomial-time implication problems for unary inclusion dependencies (only available behind a paywall), 1990. ↩
Y. Ibañez-García, C. Lutz, T. Schneider. Finite Model Reasoning in Horn Description Logics, 2014. ↩
G. Kourtis, I. Pratt-Hartmann. Adding Path-Functional Dependencies to the Guarded Two-Variable Fragment with Counting, 2016. ↩
T. Gogacz, J. Marcinkowski. The Hunt for a Red Spider: Conjunctive Query Determinacy Is Undecidable, 2015. ↩↩
N. Francis. Asymptotic Determinacy of Path Queries using Union-of-Paths Views, 2015. ↩
G. Głuch, J. Marcinkowski, P. Ostropolski-Nalewaja. Can One Escape Red Chains? Regular Path Queries Determinacy is Undecidable, 2018. ↩
G. Głuch, J. Marcinkowski, P. Ostropolski-Nalewaja. The First Order Truth behind Undecidability of Regular Path Queries Determinacy, 2019. ↩
N. Dalvi, D. Suciu. The Dichotomy of Probabilistic Inference for Unions of Conjunctive Queries, 2012. ↩↩
A. Jha, D. Suciu. Knowledge Compilation Meets Database Theory: Compiling Queries to Decision Diagrams, 2011. ↩↩
T. Gogacz, J. Marcinkowski. On the BDD/FC Conjecture, 2014. ↩
L. Bulteau, V. Jugé, S. Vialette. On shuffled-square-free words, 2023. ↩
G. Guégan, P. Ochem. A Short Proof that Shuffle Squares are 7-Avoidable, 2016. ↩
L. Bulteau, S. Vialette. Recognizing Binary Shuffle Squares is NP-Hard, 2019 (only available behind [a paywall}(https://www.sciencedirect.com/science/article/pii/S0304397519300258)). Look for "Exhaustive enumeration shows..." ↩
L. Segoufin, C. Sirangelo. Constant-Memory Validation of Streaming XML Documents Against DTDs, 2007. ↩
T. Green, G. Karvounarakis, V. Tannen. Provenance Semirings, 2007. ↩↩↩
D. Deutch, T. Milo, S. Roy, V. Tannen. Circuits for Datalog Provenance, 2014. ↩
B. Kimelfeld, P. Senellart. Probabilistic XML: Models and Complexity, 2013. ↩
E. Kharlamov, W. Nutt, P. Senellart. Updating Probabilistic XML, 2010. See in particular Figure 3. ↩
P. Senellart. Understanding the Hidden Web. PhD thesis, 2007. See Section 2.6.2. ↩
P. Senellart. Understanding the Hidden Web. PhD thesis, 2007. See Section 2.4. ↩
A. López-Ortiz, C.-G. Quimper. A Fast Algorithm for Multi-Machine Scheduling Problems with Jobs of Equal Processing Times, 2011. ↩
M.R. Garey, D.S. Johnson, B.B. Simons, R.E. Tarjan. Scheduling Unit–Time Tasks with Arbitrary Release Times and Deadlines, 1981 (only available behind a paywall). ↩
J. Lawrence. Polytope volume computation, 1991. See pages 11 and 12. ↩