This page is a collection of open problems in theoretical computer science. I have not investigated all of them thoroughly, but I find them interesting. It also features a list of other lists of open problems.

Some problems are very precisely formulated, others are fuzzier. The categories are here for convenience but they are mostly random.

Some of these problems are existing conjectures, though I avoided those that everyone should know, like P vs NP. For problems which are well-known conjectures with existing descriptions that are better than what I could write, I have a dedicated category. Other problems are open problems left as future work in publications that I found interesting. Yet other problems come from my research: however I usually do not list problems that I am working on right now, because the phrasing of the problem (or my understanding of possible solutions) could change too quickly for me to keep this page in sync. Also, some of my questions on TCS.SE may not be immediately reflected here, as I first wait to see whether I get an answer.

If you know the answer to a problem, or if you have new information about it, I would be very interested to know! You can write me at a3nm<REMOVETHIS>

This document is likely to contains errors of various kinds, and I make no promises about its correctness. Please report any mistakes to the address above.


Complexity of counting antichains in restricted poset classes

With Yael Amsterdamer and Tova Milo, we worked on the problem of learning monotone predicates on lattices via crowd queries. We used the result that it is #P-hard to count the number of antichains in a poset1. However, our results could be strengthened if we knew that hardness also held for distributive lattices. Is it also #P-hard to count antichains in distributive lattices?

More generally, the question is: on which classes of posets is it #P-hard to count antichains?

Tradeoffs between computational and oracle complexity to learn monotone predicates on posets

Our work about learning monotone predicates on lattices via crowd queries shows that (under some formulations of the problem) it is computationally intractable to do so in a way that minimizes the oracle complexity (number of oracle queries). However, we were unable to analyse the oracle performance of non-optimal strategies which are computationally inexpensive.

Formally, the task is the following: we are given a poset (V,<) and we wish to learn exactly a Boolean predicate P:V{0,1} which is monotone: if P(x)=1 and x<y then P(y)=1. A simple algorithm is to pick (uniformly at random) an element x such that P(x) is unknown, evaluate P(x) with the oracle, integrate the consequences of the answer (set P(y)=0 for all y<x if P(x)=0, set P(y)=1 for all y>x if P(x)=1), and repeat while there are unclassified element. Are there posets where the performance of this strategy (expected number of oracle queries) is significantly worse than the (computationally intractable) optimal strategy?

More generally, are there learning algorithms for this task which are computationally tractable and yet have near-optimal oracle complexity?

Complexity of finding linear extensions of a labeled poset in a regular language

Our work on order-incomplete data studies labeled posets, which consist of a poset (P,<) and a function μ from P to some finite alphabet Σ. The label of a linear extension x1,,xn of (P,<,μ) is the word μ(x1)μ(xn). A simple question is whether we can efficiently determine, given a labeled poset (P,<,μ) and a word w in Σ, whether (P,<,μ) has a linear extension of label w. Our work shows that this problem is hard even for comparatively simple P, and shows classes of labeled posets where it is tractable. It also studies a generalization of this question where you do not consider just the free monoid Σ on Σ, but an arbitrary monoid structure: given a monoid element w and a labeled poset, we ask whether there is a linear extension whose label evaluates to w according to the monoid law.

Of course, another variant of this problem is to consider a language L on Σ, for instance a regular language, and ask whether the labeled poset has a linear extension whose label is in L. The more general question is: for which regular languages is this class tractable? Could one show a dichotomy result separating tractable and intractable languages?

For instance, with the alphabet Σ={a,b} and language L=(ab), I do not know whether it is tractable, given an input poset (P,<,μ), to test whether it has a linear extension where we alternate between elements labeled a and elements labeled b.

A related question is to characterize the complexity of enumerating linear extensions of labeled posets: the problem can be done in constant amortized delay for classical posets3, but I am not aware of analogous results for labeled posets.

Representing the result of duplicate consolidation in labeled posets

With M. Lamine Ba, Daniel Deutch and Pierre Senellart, we worked on how to represent and query order-incomplete data. In this context, we studied labeled posets (defined as above) and study the question of how to consolidate duplicates in a labeled poset. By this, we mean that we want to compute a labeled poset where the element labels are unique (all elements with the same label are collapsed to the same element), and the remaining order relations between the collapsed elements are sensible.

However, if you see a labeled poset as a way to represent a set of possible orders (its linear extensions), and if you want the duplicate removal operation to act consistently when looking at each linear extension (i.e., the linear extensions of the duplicate consolidation should be the result of removing duplicates in the linear extensions of the labeled posets), then it is not clear at all how to proceed. In fact, the result of duplicate consolidation may even not be representable as a labeled poset.

Is there a formalism that can represent the result of duplicate consolidation on labeled posets?

Testing whether a set of words can be the set of linear extensions of a labeled poset

Define labeled posets as in the previous section. Consider a sequence S of words given as input. A labeled poset (P,<,μ) represents S if S is exactly the set of labels that can be achieved by the linear extensions of the poset. Of course, a sequence S cannot be thus represented unless all its sequences are built using the same multiset of elements, which should be the image of μ (seen as a multiset). However, if S respects this condition, it does not seem easy to determine whether it can be represented or not.

What is the complexity of determining, given a set of sequences, whether it can be represented by a labeled poset in this sense?

Stable interpolation on partial orders

In our work with Yael Amsterdamer, Tova Milo, and Pierre Senellart, we study an interpolation scheme on partial orders, defined as the center of mass of the convex polytope defined by the order constraints. As it turns out, however, the scheme is not stable: if we fix some variables to their interpolation result, the interpolation of other variables may change.

Is there a principled stable interpolation scheme on posets? Is it unique?

Representing bounded treewidth partial orders

In our work with Yael Amsterdamer, Tova Milo, and Pierre Senellart, we consider partial orders whose Hasse diagram (or its reverse) is a directed tree. We do not know whether our study would extend to partial orders whose Hasse diagram has bounded treewidth.

Is there structure on the set of linear extensions of a poset whose Hasse diagram has bounded treewidth? Alternatively, is there a structure on the convex polytope defined by such order constraints?

Smallest posets with prescribed number of linear extensions

For all nN, what is the minimal number of elements of a poset with exactly n linear extensions, if one exists? What is the asymptotic growth of this function?

The analogous question for antichains has been studied4, with posets of size logarithmic in the desired number of antichains. However, the argument is much simpler, because the posets can be easily constructed from smaller posets. By contrast, the number of linear extensions of a poset seems hard to determine from elementary "constituent parts".


The monk problem

The so-called monk problem5 is a pursuit-evasion game played on a directed graph. A strategy for k pursuers is a sequence s1,,sn of subsets of k vertices. A strategy is winning if, for every (non-simple) walk v1,,vn in the graph, there is i such that visi. Intuitively, the evader is walking on the graph (it has to move at each turn by following exactly one edge) and must avoid the k vertices that the pursuers examine at each time step.

The evasion number of a digraph G is the smallest k for which the pursuers have a winning strategy. What is the computational complexity, given an input digraph, of computing its evasion number? This relates to the question of whether graphs with evasion number k can be somehow characterised, similarly to the links between treewidth and pursuit-evasion via havens. It also relates to the question of how the maximal length of a strategy can be bounded as a function of the graph size and of k.

Simple path on DAG with backward edges

Consider an input DAG G, two vertices s and t, and additional back edges such that whenever (u,v) is an additional edge then u is reachable from v in G. Is it NP-hard to determine whether there is a simple path from s to t that uses at least one additional edge?

Computing the treewidth of planar graphs

Treewidth measures how much an undirected graph is close to a tree. It is known that, for any fixed kN, we can check in linear time in an input graph G whether its treewidth is k; but that, when both k and G are given as input, it is NP-hard to determine whether G has treewidth k.

Is this last statement still true if G is planar, or can the treewidth of a planar graph G be computed in PTIME?

Computing cliquewidth

The parameterized cliquewidth computation problem for kN asks, given an input graph G, whether G has cliquewidth k. Is this problem in PTIME for any fixed k? Is there kN such that the problem is NP-hard?

Membership in PTIME was recently shown12 for k3, but the problem is open for larger k. In particular, we do not know whether the problem is in FPT. Membership in FPT is known for some restricted classes of graphs of unbounded cliquewidth13, and membership in cubic FPT is known for the related parameter of rankwidth14.

Optimal bound on the size of a grid minor

The grid minor theorem15 of Robertson and Seymour shows that, if a family of graphs has unbounded treewidth, then one can find arbitrary large grid graphs as minors of the family. Specifically, the result can be stated as follows16: there exists a function f:N>0N>0 such that, for every gN>0, every graph of treewidth g has the (g×g)-grid as a minor.

The best known upper bound16 on f is ~O(g19), where the ~O notation neglects polylogarithmic factors. The best known lower bound17 is Ω(g2logg).

What is the correct bound on the function f?

Extracting a polynomial grid minor in PTIME

Continuing on the grid minor theorem from the previous entry, the following result is known18: there is cN such that for any nN, for any graph G of treewidth nc, one can find the n×n grid as a minor of G. This work shows how the minor can be found in randomized polynomial time.

Is it possible to extract the minor in deterministic PTIME?

Monadic second-order logic with cardinality predicates

We extend monadic second-order logic (MSO) over graphs with constructs to check the equality of the cardinalities of second-order variables, and check that their cardinalities are in fixed recursive sets. We fix a formula Φ in this logic and a treewidth bound kN. Is it PTIME in an input graph G of treewidth k to check whether it satisfies Φ?

Courcelle's theorem states that this is true (and in linear time) for normal MSO without the extension.

Graph reachability labellings with total orders

A labeling of a directed acyclic graph (DAG) G=(V,E) is a function μ from V to some set Σ of labels such that the labels of a pair of vertices suffice to test reachability. Formally, there is a decoding function f:Σ2{0,1} independent of G such that for any two vertices u,vV, we have f(μ(u),μ(v))=1 iff v is reachable from u in G.

I am interested in labeling schemes where Σ is Nd for a certain d, and where for simplicity we assume that μ cannot use the same number twice: there are no uv in V such that, letting =μ(u) and =μ(v), we have i=j for some 1i<jd. Further, the decoding function f can only use the labels to do comparisons on their components, i.e., f, when applied on (,) can only see whether i<j for all 1i,jd. Given dN and a choice of decoding function f, the labelable graphs of f are those for which there is a reachability labeling scheme for f.

These definitions generalize several known notions:

For other values of d and decoding functions, how can we characterize the labelable graphs? In particular, for a given value of d, are some decoding functions more expressive than others? (For d=2 and the two decoding functions that I presented, interval orders and orders of dimension 2 are incomparable.) Further, is there dN and some decoding function such that any graph has a reachability labeling with Nd? (This sounds extremely unlikely but I'm not sure of how to prove it.)

Boolean dimension of planar posets

Let G=(V,E) be a directed acyclic graph (DAG). A Boolean realizer of G is intuitively a way to label each vertex of G with integers, such that reachability in G can be determined only by looking at some Boolean function over pairwise comparisons of the vertex labels. Formally, a Boolean realizer of G of dimension dN consists of a Boolean function ϕ on d variables, and of d labellings 1,,d, each of which is an injective function from V to N: we require that for any two vertices u,vV, letting xi for all 1id be 1 if i(u)<i(v) and 0 otherwise, we have ϕ(x1,,xd)=1 iff v is reachable from u in G. The Boolean dimension of G is then the smallest d for which G has a Boolean realizer of dimension d.

Is the Boolean dimension of planar DAGs bounded? In other words, is there a constant kN such that, whenever G is planar (i.e., it can be drawn in a way such that no two edges cross), then the dimension of G is no greater than k?

Complexity of computing a simplicial decomposition

A simplicial decomposition of a graph G is a tree decomposition such that, for any two adjacent bags b and b whose set of vertices is respectively X and X, the subgraph induced by G on XX is a clique. The width of the decomposition is the size of the largest bag minus one, as usual for treewidth. The simplicial width of G is the smallest possible width of a simplicial decomposition. What is the complexity, given a graph G, of computing its simplicial width and a simplicial decomposition? One can ask the question when allowing arbitrary G, or when assuming that the simplicial width of the input graph is bounded by a constant.

Note that there are known results28 on the complexity of computing a clique minimal separator decomposition, i.e., a simplicial decomposition where we minimize the size of the cliques, not of the bags.

Databases and logic

Oracle complexity of skyline queries

Consider a set of d-dimensional vectors V=v1,,vn of numbers (or indeed any totally ordered set). Let us assume for simplicity that all the vil are distinct for 1in and 1ld. In this context, we say that vi dominates vj if vilvjl for all 1ld. The skyline (or Pareto frontier) of V is the set of maximal vectors for the domination relation.

Assume that the only way you can access the vectors of V is by atomic comparisons, i.e., you can evaluate whether vil<vjm for any 1i,jn, 1l,md. What are bounds on the minimal number of comparisons required to determine the skyline? Such bounds can be stated as a function of k and n, or as a function of the output size (the actual skyline).

Groz and Milo6 have studied this problem when the comparisons are additionally noisy, but good bounds for d>3 are not yet known, even without noise.

Constraint classes where separability is decidable

Consider a set Σ of tuple-generating dependencies (TGDs) and a set Φ of equality-generating dependencies (EGDs). We say that a Boolean conjunctive query (CQ) Q is entailed by a relational database instance I and by ΣΦ if, for every superinstance II such that I satisfies ΣΦ, it is the case that I satisfies Q. Intuitively, Q is implied by the instance I and the constraints ΣΦ under open-world semantics: Q is true on all superinstances of I that satisfy the constraints ΣΦ.

We call Σ and Φ separable if, for any Boolean CQ Q and instance I that satisfies Φ, the following equivalence holds: Q is entailed by I and ΣΦ iff Q is entailed by I and Σ. In other words, the equality-generating dependencies Φ can be checked separately on I, and have no impact afterwards in terms of entailment.

For general TGDs and EGDs, the problem is undecidable7, but this is not so surprising as reasoning with TGDs by themselves is also undecidable if arbitrary TGDs are allowed. Are there less expressive languages for which separability is decidable? One could think, e.g., of inclusion dependencies and functional dependencies, maybe with some restrictions.

Open-world query answering with linear rules and transitivity assertions

A linear rule is a tuple-generating dependency (TGD) with exactly one atom in the body and one atom in the head. A transitivity assertion on an arity-two relation R is a TGD of the form xyz R(x,y)R(y,z)R(x,z).

It is decidable20 whether an instance I and constraints Σ of linear rules and transitivity assertions entail a CQ Q, in the sense of the previous problem, but under some restrictions (all relations have arity at most two, or Q consists of a single atom, or an additional condition holds). Does decidability hold in the general case of arbitrary CQ and arbitrary arity signatures for linear rules and transitivity assertions?

Decidable unary language with number restrictions

With Michael Benedikt we introduced a language formed of expressive constraints on an arity-two signature (including number restrictions, namely counting quantifiers) and tuple-generating dependencies on arbitrary arity predicates, also supporting functional dependencies of a restricted kind on such predicates. We showed that it was decidable to determine (in the sense above) whether a query is entailed by an instance and such constraints, which we call the open-world query answering (OWQA) problem.

This raises the question of whether a more expressive and more uniform such language could be designed. It should also have decidable entailment, and should also feature (1) arbitrary arity constaints, (2) number restrictions such as functional dependencies, and (3) expressive logical operators such as disjunction. Can such a language be designed? Relevant results include the following (beyond the ones shown in the paper mentioned above):

A general idea would be to design a language that generalizes frontier-one TGDs to have disjunction and some kind of "non-conflicting" counting quantification, and establish decidability.

Following this paper, another natural direction would be to study finite model reasoning for such a language. However, one should first show the decidability of the finite implication problem for it: decide whether a set of constraints of the language implies other constraints. Yet, as far as I know, the only decidability result for finite implication on an arbitrary-arity language that features number restrictions only covers35 unary inclusion dependencies and functional dependencies. Can finite implication be decided for a more expressive language of this form? E.g., in the spirit of existing works36 but for arbitrary arity constraints.

Decidability of finite query answering with path-functional dependencies and two-variable guarded constraints

This question uses the definition of OWQA from the previous question. It was recently shown37 that the finite satisfiability problem is decidable for the two-variable guarded fragment of first-order logic with counting quantifiers (GC²) to which one adds path-functional dependencies. Is the same true of the finite OWQA problem? This is already known if there are no path dependencies21.

If this is true, it would imply an independent proof of the result of this paper when assuming that all functional dependencies are unary. Indeed, unary inclusion dependencies and functional dependencies on arbitrary signatures can clearly be equivalently translated to GC² plus path functional dependencies (with paths of length 2). To re-prove this result for arbitrary functional dependencies, one would need some generalization of path functional dependencies (to capture the image of this translation).

Decidability of conjunctive query containment under bag semantics

Is conjunctive query containment under bag semantics decidable, and what is its complexity?

Determinacy of path queries by unions of path views

For kN, the path query qk of length k on a directed graph G=(V,E) returns the set of pairs of nodes (u,v)V2 such that there is a path of length k from u to v. A union of path queries qS is defined by a (possibly infinite) set S of integers and returns the sets of pairs of nodes (u,v) such that there is a path from u to v whose length is in S.

A set of union of path queries Q=qS1,,qSn determines a path query qk if, for all finite graphs G and G, if qSi(G)=qSi(G) for all 1in, then qk(G)=qk(G). Under mild assumptions on the representation of infinite sets in union of path queries, the determinacy problem was shown39 to be decidable but only for path queries qk where k is larger than some function of Q. Is it decidable to determine whether a path query is determined by a set of union of path queries, without this restriction?

Another question is whether this result extends to more general query languages. For instance, is it decidable to determine whether a union of path queries is determined by a set of union of path queries?

A natural further generalization of this problem is going from paths of a fixed length to paths labeled by some language, when edges are labeled by letters of some alphabet. Formally, we define a regular path query qL as a query on edge-labeled graphs that returns the pairs of vertices connected by a path in L, where L is a regular language over the alphabet of edge labels. In this context, however, it was shown that it was undecidable whether a regular path query was determined by a set of regular path queries40, and it is even undecidable whether using only finite regular languages41 (or, equivalently, union of labeled path queries, i.e., unions of queries returning all pairs of vertices separated by a path labeled by some finite word). So with these generalisations of the above problem being undecidable, it is not clear whether we should expect decidability for (unlabeled) path queries, or unions thereof.

Thanks to Nadime Francis and to Bartosz Bednarczyk for helping me to prepare this entry.

Do tractable queries on probabilistic instances have tractable lineages?

A tuple-independent database (TID) is a probabilistic database consisting of a relational database I where each tuple is given a probability in [0,1]. The semantics is a probability distribution on subsets of I (the possible worlds), obtained by considering that each tuple is either kept or removed with the indicated probability, independently across tuples.

A Boolean union of conjunctive queries (UCQ) is a disjunction of conjunctive queries with no free variables. The probability evaluation problem for a fixed Boolean UCQ q asks, given a TID I, what is the probability that I satisfies q, meaning, what is the total probability mass of possible worlds of I that satisfy q. A dichotomy result is known42: Boolean UCQs are partitioned between those for which the probability evaluation problem can be solved in polynomial time, and those where it is #P-hard.

The lineage of a Boolean query q on a TID instance I is a Boolean formula whose variables are the facts of I, such that a valuation makes the circuit evaluate to true iff the corresponding possible world (defined by keeping exactly the facts set to true) satisfies q. One way to show that a Boolean UCQ is tractable is to show that, on any instance, one can represent its lineage in a form that allows for tractable probability evaluation.

A d-DNNF¬ is a Boolean circuit (with AND, OR, NOT gates, and input gates) such that every AND gate is on disjoint set of variables (meaning, there is no input gate reachable from two distinct children of an AND gate), and every OR gate is exclusive (meaning, there is no valuation of the input gates that makes two distinct children of an OR gate evaluate to true). A Boolean UCQ has polynomial-size d-DNNF¬ if there is a constant cN such that, for any TID instance I, there is a d-DNNF¬ of size O(|I|c) that expresses the lineage of q on I.

It is known43 that many Boolean UCQs for which probabilistic query evaluation is in PTIME have polynomial-size d-DNNF¬, but the converse is open. Is there a Boolean UCQ for which probabilistic query evaluation is in PTIME, but that has no polynomial-size d-DNNF¬? In particular, is this the case of the conjectured43 counterexample Q9? If this can be shown, is it possible to design a generalization of d-DNNF¬ that still enjoys probabilistic query evaluation, and is such that any Boolean UCQ with PTIME probabilistic query evaluation has a polynomial-size lineage in this generalization?

Does bounded derivation depth imply finite controllability?

We talk of a Boolean conjunctive query Q being entailed by an instance I and set Σ of tuple-generating dependencies (TGDs) if for any instance II such that IΣ, we have IQ.

The set Σ of tuple-generating dependencies (TGDs) has bounded derivation depth if, for any conjunctive query Q, there exists a union of conjunctive queries Q such that, for any relational instance I, the following equivalence holds: Q is entailed by I and Σ iff IQ.

The set Σ of TGDs is finitely controllable iff for any instance I and Boolean conjunctive query Q, the following equivalence holds: Q is entailed by I and Σ iff Q is entailed by I and Σ over finite models (impose that I is finite in the definition above).

Is it the case that if a set of TGDs has bounded derivation depth then it is finitely controllable? In the work where this conjecture was posed44, it was solved for signatures of arity two, but the general question remains open.

What is the connection between the fluted fragment and inversion-free queries?

The fluted fragment is a fragment of function-free first-order logic for which the satisfiability problem is decidable8 (but non-elementary). The positive formulae of the fluted fragment (i.e., those which do not use negation) were independently studied as inversion-free queries, which are known9 to be exactly the UCQs that admit OBDD lineages on TID databases (see this question for relevant definitions).

Is there any connection between these two results? Can one be obtained from the other? Does the result on OBDD extend to queries with negation?

Thanks to the anonymous Reviewer 3 of our paper at PODS 2016 for pointing out this connection.

What is the complexity of testing if a query is safe?

This question refers to the dichotomy42 on the Boolean UCQs for which the probability evaluation problem is in PTIME: see this question for background. The queries which are tractable according to this dichotomy are safe.

What is the complexity, given a Boolean UCQ, to determine whether it is safe?

The best known algorithm for this is super-exponential10.

Complexity of query evaluation parameterized by treewidth

The query evaluation problem asks, given a Boolean conjunctive query Q and a relational database I, whether Q holds on I. The treewidth of I is the treewidth of its Gaifman graph, and the treewidth of Q is that of its canonical instance (i.e., we just see the query as a database on its variables). What is the parameterized complexity of the query evaluation problem when parameterized by the treewidth of I and of Q? (In other words, the parameter of the problem is an upper bound on the treewidth of both I and Q.)

How can one strengthen lower bounds for probabilistic query evaluation on unbounded-treewidth families?

In this paper with Pierre Bourhis and Pierre Senellart, we showed (Theorem 4.2) that there is a query for which the probabilistic query evaluation problem (see this question for the definition) is intractable on any unbounded-treewidth instance family. However, the lower bound of this result has some limitations, and I do not know whether they can be lifted:

Lower bounds on lineage sizes

In the same paper as in the previous question, we showed (in Lemma 8.2) that, for any graph signature, there is a UCQ with inequalities Q such that, given any instance I, the width of any OBDD representing the lineage of Q on I (see this question for the definition) is bounded by an exponential function of the treewidth of I: specifically, there is a constant cN>0 such that it is in Ω(2(width(I))1/c).

Can the same result be shown for other lineage representations? In particular, can it be shown for FBDDs, or for d-DNNFs?

Boolean functions and circuits

Monotone dualization

The prime CNF of a monotone Boolean function over variables x1,,xn is the (unique up to order) expression of the function as a conjunction of disjunctions of the xi from which no variable occurrence can be removed. The dual of a monotone Boolean function f is the monotone Boolean function mapping x1,,xn to ¬f(¬x1,,¬xn). By De Morgan's laws, a Boolean expression for the dual can be obtained by replacing 's by 's and vice-versa in the prime CNF of f.

The Dual problem asks, given two prime CNFs, whether the functions defined by these CNFs are dual of one another. It is known to be in quasipolynomial time11. Is the Dual problem in PTIME?

Explicit Boolean functions with supralinear circuits

Shannon proved in 1949 with a counting argument that most Boolean functions cannot be represented by a circuit of linear size. However, we do not know yet of any explicit Boolean function for which no linear size circuit exists. Can we construct such a function?

Conciseness gap between formulae and circuits

Boolean functions can be represented as circuits or as formulae. Circuits seem much more concise, because they can reuse common subexpressions. Yet the best conciseness gap known is the following: there are Boolean functions that can be represented by linear circuits but for which any formula representation has size at least n3o(1) (over circuits with AND, OR, and NOT). Can we do better?

Formal languages

Shortest complete sequences

An n-complete sequence is a word w over {1,,n} such that each permutation of {1,,n} occurs as a subsequence of w. What is the length of the shortest n-complete subsequences as a function of n?

Languages recognized by polynomial-size DFAs

Which languages can be recognized by a (not necessarily uniform) family of deterministic finite automata of polynomial size?

Context-freeness of primitive words

A primitive word is a word that cannot be represented as a power of another word. Is it true that, on any alphabet with more than one letter, the set of all primitive words is not context-free?

Words without shuffle squares

Fix an alphabet Σ. We say that a word wΣ is the shuffle of two words u,vΣ if we can obtain w by interleaving u and v. A word wΣ is a shuffle square if there is a word uΣ such that we can obtain w as the shuffle of u and u. (Note that this implies that, for each letter aΣ, there is an even number of occurrences of a in w. While the number of occurrences of each letter in u are uniquely defined from w, the order of the letters is not.) We say that a word is shuffle-square-free if it contains no substring which is a shuffle-square.

If the alphabet Σ has at least 7 letters then it is known that there exist infinitely many shuffle-square-free words45. If the alphabet Σ has 3 letters or less, then it is known46 that such words do not exist: the longest such words on Σ={a,b,c} are abcacbacabc, acbabcabacb, bacbcabcbac, bcabacbabca, cabcbacbcab, cbacabcacba, of length 11 each. (I'm adding them to this entry to ensure that people interested in this computation can easily find the page. ;)) The question is open for alphabets of size 4, 5, and 6.

Are there infinitely many shuffle-square-free words on an alphabet of size 4?

Which regular tree languages can be recognised by a word automaton?

Given a tree alphabet Σ, we define the alphabet ¯Σ to consist of {aaΣ}{¯aaΣ}. Given an unranked tree T on a tree alphabet Σ, the XML representation of T is the word on ¯Σ recursively defined as follows: the coding of a leaf labeled aΣ is a¯a, and the coding of an internal node labeled aΣ with children n1,,nk is ac1ck¯a where each ci is the coding of ni.

Given a tree automaton A on unranked trees on Σ, it defines a so-called regular tree language L(A), and we denote by XML(L(A)) the word language on ¯Σ of the XML codings of trees of L(A). Of course, for all regular tree languages except finite ones, XML(L(A)) is not regular as a word language, because a word automaton cannot check if the opening and closing tags (i.e., the a's and ¯a's) are properly nested. The question is intuitively to understand the regular tree languages for which matching the opening and closing tags is the only difficulty.

Formally, we say that a word automaton A weakly recognises a tree language L if, given a word w on ¯Σ which is the XML representation of some tree T, then A accepts w iff TL. (The behavior of A on words of ¯Σ that do not represent any tree is not specified.)

Given a tree automaton A, is it decidable to determine if there exists a word automaton that weakly recognizes the language XML(L(A))?

The problem has been shown decidable in the case of DTDs47, which are a special case of tree automata, but the status of the general problem is open.

Thanks to Bartosz Bednarczyk for pointing out this problem to me, and to Charles Paperman for the problem phrasing used here.


Lower bounds on representations of provenance

The Why-provenance48 of a Boolean query q in the positive relational algebra on a relational instance I is a Boolean function ϕ defined on the facts of I such that, for any valuation ν mapping each fact of I to true or false, ν(ϕ) is true iff {FIν(F) is true} satisfies q.

It is known48 that, for any such fixed query q, its Why-provenance on I can be represented as a Boolean formula of polynomial size in the instance I. Are there instance families with known lower bounds on the representation of Why-provenance? In particular, is there a query q and family of relational instances I1,,In,, such that, letting ϕi for all i be the Why-provenance of q on In, the size of ϕi is superlinear in Ii, namely, there is no constant K such that, for all i, the provenance ϕi can be written as a Boolean formula of size less than K|I|, where |I| is the number of facts of I.

The question generalizes to other representations of provenance, such as provenance circuits49, or to provenance expressed in different provenance semirings48, such as the universal semiring N[X] of polynomials in X (standing for the facts of I).

Compactness difference between probabilistic XML document formalisms

The PrXML formalism is a way to represent probability distribution on labeled, unranked, unordered trees50. It does so by representing uncertainty with special kinds of nodes:

We give names to families of probabilistic XML documents according to the nodes that are allowed, e.g., PrXMLind, mux refers to probabilistic documents where nodes of type ind and mux are allowed. A family is efficiently translatable in another if any document of the first family can be rewritten to a document in the second. Some families are known to be efficiently translatable in others51, but open questions remain: is PrXMLexp efficiently translatable in PrXMLmux,det, or even to PrXMLcie?

Thanks to Pierre Senellart for helping me to prepare this entry.

Complexity of testing the equivalence of PrXMLcie

Using the previous definitions, the semantic equivalence problem for two PrXML documents asks whether the two documents define the same probability distribution on outcomes (same possible worlds, with same probabilities). Given two PrXMLcie documents, what is the probability of deciding whether they are semantically equivalent? The problem is in EXPTIME but no lower bound or hardness result is known52. What is the complexity of this problem?

The structural equivalence problem asks, given two PrXMLcie documents, whether, for each valuation of the variables, the corresponding valuations of both documents are the same. (Note that the probabilities assigned to the events is irrelevant.) The problem is in coNP, but no lower bound or hardness result is known53. What is the complexity of this problem?

Thanks to Pierre Senellart for helping me to prepare this entry.

Complexity of multi-machine scheduling of jobs with start dates, end dates, and equal duration

We have a certain number m of machines, and a number n of jobs. All jobs have the same duration (an integer p), and each job has a minimum start date and a maximal end date, both of which are integers. The scheduling problem asks whether there is a way to schedule all n jobs on the m machines. We can phrase it in two variants: the decision variant simply asks whether it is possible, and the computation variant asks us to compute a possible schedule, i.e., a partition of the n jobs into m sequences such that each machine can perform its sequence of m jobs in a way that respects the start date and end dates of all jobs, and the job duration. The problem can equivalently be phrased with jobs of unit duration and start and end dates which are rationals.

Somewhat surprisingly (to me at least), this problem can be solved in PTIME. The best known algorithm54 (which additionally optimizes an additional criterion) runs in time O(min(1,pm)n2). What is the most efficient algorithm for this problem?

For the case m=1, a more efficient algorithm is known55, which runs in O(nlogn). In this setting, there is a clear lower bound of Ω(nlogn) for the computation problem, as the scheduling problem for jobs with disjoint intervals amounts to sorting their bounds; but for the decision problem, this is unclear.

Thanks to the judges of the 2017 ACM-ICPC World Finals for bringing this problem to my attention (see Problem H, "Scenery").

Which convex polytopes have volumes of polynomial bit-length?

In this question, all numbers are represented as rationals. Consider a convex polytope described as an intersection of half-spaces, each of which is described as inequalities between a linear function of the coordinates and a constant (e.g., x1+742x23). We assume that the polytope is bounded, i.e., it is not infinite. Given such a representation, we wish to compute the volume of the polytope. It turns out56 that this cannot be done in polynomial time (or even in nondeterministic PTIME) because the volume is not always of polynomial length in the input description. Hence the question: For which classes of polytopes is the volume always of polynomial length in the input?

Well-known conjectures

Other lists of open problems

Here are links to some other specific open questions found on other places on the Web:

Formerly open problems

This section regroups problems that occurred in a previous version of this list, and were solved since then.

Complexity of an assignment problem with subsets

It is known that a maximum matching can be found in PTIME. In particular, this is the case of bipartite graphs. However, what about the situation where we are allowed to remove vertices in one of the parts, and where we want to maximize the number of vertices that must remain unmatched in a maximum matching? What is the complexity of this problem?

Proved to be in PTIME by Chao Xu in this TCS.SE answer. Thanks!

Decidability of conjunctive query determinacy in the finite

The query determinacy problem asks, given a set Q1,,Qn of conjunctive queries (CQs) and an additional CQ Q, whether the Qi determine Q, i.e., whether, for any two (possibly infinite) relational databases I and I, having Qi(I)=Qi(I) for all 1in implies we must have Q(I)=Q(I). The query determinacy problem was shown38 to be undecidable but this work left open the problem of finite determinacy, which is defined as determinacy except that I and I must be finite.

Is finite determinacy also undecidable for conjunctive queries?

Proved by the authors of the original work38 in a new paper. As this was accepted for publication at PODS 2016, I think we can say that this has been accepted by the community, although I haven't personally checked the proof.

Steinberg's conjecture

Grötzch's theorem says that planar graphs without 3-cycles (triangles) are 3-colorable. Is every planar graph without 4-cycles and 5-cycles 3-colorable?

Disproved in this paper. I haven't personally checked it, but it has been published in the Journal of Combinatorial Theory, so I'm assuming it is correct.

Feder-Vardi conjecture

Letting Γ be a digraph, the CSP for Γ, written CSP(Γ), is the following problem: given a digraph G, decide whether there is a homomorphism from G to Γ. The Feder-Vardi conjecture29 asks: Is there a dichotomy on Γ, with CSP(Γ) being always either in PTIME or NP-hard?

An analogous result is already known for undirected graphs30, for some restricted classes of directed graphs31, and when Γ has two elements32 or more recently three elements33.

Proved: two recent preprints by Bulatov and by Zhuk claim a proof of this result. The papers were accepted for publication at FOCS 2017 so I'm classifying this as solved (accepted by the community), even though it may still be a good idea to verify these proofs in more detail (see this post). I have not personally checked any of this.

Thanks to Mikaël Monet for pointing out that the conjecture for digraphs implies34 the conjecture for relational structures.

Complexity of counting linear extensions in posets of height 2

It is #P-hard to count the number of linear extensions of an input poset2. The proof uses posets of height 3, and leaves open the natural question: is it #P-hard to count linear extensions in posets of height 2?

An equivalent formulation is in terms of directed bipartite graphs (all edges go from one part to the other part): given a directed bipartite graph as input, is it #P-hard to count the number of topological sorts of the graph?

Solved in this paper (shown to be #P-hard). Note that I haven't personally checked it, and it does not seem to have been peer-reviewed yet. Thanks to Kuldeep S. Meel for pointing me to that paper.

Complexity of counting linear extensions in posets of dimension 2

This is similar to the question above, but for order dimension: is it #P-hard to count linear extensions in posets of dimension 2?

Solved in the same paper as above (shown to be #P-hard).

  1. J. S. Provan, M. O. Ball. The Complexity of Counting Cuts and of Computing the Probability that a Graph is Connected, 1983. 

  2. G. Brightwell, P. Winkler. Counting Linear Extensions is #P-Complete, 1991. 

  3. G. Pruesse, F. Ruskey. Generating Linear Extensions Fast, 1994. (closed access) 

  4. K. Ragnarsson, B. Eileen Tenner. Obtainable Sizes of Topologies on Finite Sets, 2008. To understand the connection, remember that a topology with m points and k open sets is the same as a poset with m elements and k order ideals (as they state in the abstract), and this is the same as a poset with m elements and k antichains (take the set of maximal elements of each order ideal). Their Theorem on page 2 gives the logarithmic bound. We independently derived a similar result with a similar proof but a worse bound in our paper with Yael Amsterdamer and Tova Milo, Lemma B.2, and the size of the posets is in O(log2n)

  5. B. Fredriksson, E. Lundberg. The Monk Problem: Verifier, heuristics and graph decompositions for a pursuit-evasion problem with a node-located evader, 2015. 

  6. B. Groz, T. Milo. Skyline Queries with Noisy Comparisons, 2015. 

  7. A. Calì, G. Gottlob, G. Orsi, A. Pieris. On the Interaction of Existential Rules and Equality Constraints in Ontology Querying, 2012. 

  8. I. Pratt-Hartmann, W. Szwast, L. Tendera. Quine’s Fluted Fragment is Non-elementary, 2016. 

  9. A. Jha, D. Suciu. Knowledge Compilation Meets Database Theory: Compiling Queries to Decision Diagrams, 2011. See Theorem 4.2. 

  10. D. Suciu, D. Olteanu, C. Ré, C. Koch. Probabilistic Databases, 2011. See end of Section 

  11. T. Eiter, K. Makino, G. Gottlob. Computational Aspects of Monotone Dualization: A Brief Survey, 2008. 

  12. D. G. Corneil, M. Habib, J.-M. Lalignel, B. Reed, U. Rotics. Polynomial-time recognition of clique-width ≤ 3 graphs, 2012 (only available behind a paywall). 

  13. P. Heggernes, D. Meister, U. Rotics. Computing the clique-width of large path powers in linear time via a new characterisation of clique-width)

  14. P. Hliněný, S. Oum. Finding Branch-Decompositions and Rank-Decompositions, 2007. 

  15. N. Roberston, P. D. Seymour. Graph Minors. V. Excluding a Planar Graph, 1986. 

  16. J. Chuzhoy, Improved Bounds for the Excluded Grid Theorem, 2016. See Theorem 1.1, and the three following paragraphs. 

  17. N. Roberston, P. D. Seymour. R. Thomas. Quickly Excluding a Planar Graph, 1994. See last paragraph of Section 1. 

  18. C. Chekuri, J. Chuzhoy. Polynomial Bounds for the Grid-Minor Theorem, 2014. 

  19. C. Chekuri, personal communication, 2015. (That's academic lingo to say that we emailed them and they don't know yet.) 

  20. J.-F. Baget, M. Bienvenu, M.-L. Mugnier, S. Rocher. Combining Existential Rules and Transitivity: Next Steps, 2015. 

  21. I. Pratt-Hartmann. Data-complexity of the two-variable fragment with counting quantifiers, 2009. 

  22. V. Bárány, G. Gottlob, M. Otto. Querying the Guarded Fragment, 2014. 

  23. A. Calì, G. Gottlob, A. Pieris. Towards more expressive ontology languages: The query answering problem, 2012. See Section 6.2.2. 

  24. J. Mitchell. The Implication Problem for Functional and Inclusion Dependencies, 1983 

  25. A. Calì, D. Lembo, R. Rosati. Query rewriting and answering under constraints in data integration systems, Theorem 3.4. However, the proof of this result has a slight inaccuracy; see Appendix A.2 of this paper for details. 

  26. G. Gambosi, J. Nešetřil, M. Talamo. On locally presented posets, 1990. 

  27. S. Felsner, T. Mészáros, P. Micek. Boolean dimension and tree-width, 2018. 

  28. H.-G. Leimer. Optimal decomposition by clique separators, 1993. 

  29. T. Feder, M. Y. Vardi. The Computational Structure of Monotone Monadic SNP and Constraint Satisfaction: A Study through Datalog and Group Theory, 1998. 

  30. P. Hell, J. Nešetřil. On the complexity of H-coloring, 1990. 

  31. L. Barto, M. Kozik, M. Maróti, T. Niven. CSP dichotomy for special triads, 2009. 

  32. T. J. Schaeffer. The complexity of satisfiability problems, 1978. 

  33. A. A. Bulatov. A dichotomy theorem for constraints on a three-element set, 2002. 

  34. T. Feder, M. Y. Vardi. The Computational Structure of Monotone Monadic SNP and Constraint Satisfaction: A Study through Datalog and Group Theory, 1998. See Theorem 10 and subsequent theorems. 

  35. S. S. Cosmadakis, P. C. Kanellakis, M. Y. Vardi. Polynomial-time implication problems for unary inclusion dependencies (only available behind a paywall), 1990. 

  36. Y. Ibañez-García, C. Lutz, T. Schneider. Finite Model Reasoning in Horn Description Logics, 2014. 

  37. G. Kourtis, I. Pratt-Hartmann. Adding Path-Functional Dependencies to the Guarded Two-Variable Fragment with Counting, 2016. 

  38. T. Gogacz, J. Marcinkowski. The Hunt for a Red Spider: Conjunctive Query Determinacy Is Undecidable, 2015. 

  39. N. Francis. Asymptotic Determinacy of Path Queries using Union-of-Paths Views, 2015. 

  40. G. Głuch, J. Marcinkowski, P. Ostropolski-Nalewaja. Can One Escape Red Chains? Regular Path Queries Determinacy is Undecidable, 2018. 

  41. G. Głuch, J. Marcinkowski, P. Ostropolski-Nalewaja. The First Order Truth behind Undecidability of Regular Path Queries Determinacy, 2019. 

  42. N. Dalvi, D. Suciu. The Dichotomy of Probabilistic Inference for Unions of Conjunctive Queries, 2012. 

  43. A. Jha, D. Suciu. Knowledge Compilation Meets Database Theory: Compiling Queries to Decision Diagrams, 2011. 

  44. T. Gogacz, J. Marcinkowski. On the BDD/FC Conjecture, 2014. 

  45. G. Guégan, P. Ochem. A Short Proof that Shuffle Squares are 7-Avoidable, 2016. 

  46. L. Bulteau, S. Vialette. Recognizing Binary Shuffle Squares is NP-Hard, 2019 (only available behind [a paywall}( Look for "Exhaustive enumeration shows..." 

  47. L. Segoufin, C. Sirangelo. Constant-Memory Validation of Streaming XML Documents Against DTDs, 2007. 

  48. T. Green, G. Karvounarakis, V. Tannen. Provenance Semirings, 2007. 

  49. D. Deutch, T. Milo, S. Roy, V. Tannen. Circuits for Datalog Provenance, 2014. 

  50. B. Kimelfeld, P. Senellart. Probabilistic XML: Models and Complexity, 2013. 

  51. E. Kharlamov, W. Nutt, P. Senellart. Updating Probabilistic XML, 2010. See in particular Figure 3. 

  52. P. Senellart. Understanding the Hidden Web. PhD thesis, 2007. See Section 2.6.2. 

  53. P. Senellart. Understanding the Hidden Web. PhD thesis, 2007. See Section 2.4. 

  54. A. López-Ortiz, C.-G. Quimper. A Fast Algorithm for Multi-Machine Scheduling Problems with Jobs of Equal Processing Times, 2011. 

  55. M.R. Garey, D.S. Johnson, B.B. Simons, R.E. Tarjan. Scheduling Unit–Time Tasks with Arbitrary Release Times and Deadlines, 1981 (only available behind a paywall). 

  56. J. Lawrence. Polytope volume computation, 1991. See pages 11 and 12.