a3nm's blog

A list of open research questions

Researchers publish papers about the problems that they manage to solve, but they say much less about the other problems, those that remain open even after a long fight. These problems are sometimes briefly mentioned in papers, others are folklore, and only the most famous end up on Wikipedia. Still, I didn't like it that I did not have a central public list of the open problems I had heard about or struggled with.

So I compiled open problems from my own papers, problems that seemed interesting and were left open by other papers, some of my unanswered questions on CStheory, and other sources.

Here is the list, which I will try to keep up-to-date over time to add new problems and update those on which progress is made. Comments and feedback welcome!

Summarizing my PhD research

Oddly enough, I don't think I ever mentioned my PhD research on this blog. Part of the reason is that I easily get fascinated by technical questions, but I find it less natural to explain what I do in non-technical terms. However, I have to write the introduction to my PhD thesis, so I thought it was a nice excuse to start by summarizing here what I am doing.

For readers who don't know, I'm doing my PhD in Télécom ParisTech, in the DBWeb team. Before this, I stayed for some months in Tel Aviv to work with Tova Milo, and in Oxford to work with Michael Benedikt, which explains some of my current collaborations.

Shameless ad: If you are currently pursuing a computer science Master and would like to work at Télécom ParisTech for your Master's internship on topics like those presented here, please don't hesitate to apply to our offers for 2015—2016! Offer 1, Offer 2, Offer 3.

The general context: Uncertain data

The great success story of database research is the relational model. It is a rare example of something that is both extremely relevant in practice1 and nice from a theoretical perspective, with close connections to logics and finite model theory. The relational model gives an abstract model of how data should be represented and queried, which allows people to use databases as a black box, leaving fine but domain-agnostic optimisations to database engines. Relational databases, like compilers, filesystems, or virtual memory, have become an abstraction that everyone relies on.

One thing that the relational model does not handle is uncertainty and incompleteness in the data. In the relational semantics, the database is perfectly accurate, and it contains all facts that may be of interest. In practice, however, this is rarely satisfactory:

  • Some values may be missing: for instance, the user did not fill in some fields when registering, or some fields were added since they registered.
  • The data may be incomplete: for instance, you are constructing your database by harvesting data from the Web, and you haven't harvested everything yet; but you still wonder whether the data that you have allows you to give a definitive answer to your query.
  • Data may be outdated: for instance, the user moved but they did not update their address.
  • Data may simply be wrong.

Indeed, nowadays, people manage data that may have been inferred by machine learning algorithms, generated by data mining techniques, or automatically extracted from natural language text. Unlike a hand-curated database, you can't just assume mistakes will not happen, or that you will fix all of them when you find them. The data is incomplete and contains errors, and you have to deal with that.

These important problems are already being addressed, but in unsatisfactory ways. The general domain-agnostic solutions are fairly naive: e.g., set a confidence threshold, discard everything below the threshold, consider everything above the threshold as certain. In domains where uncertainty is paramount and where simple solutions do not suffice, e.g., in character recognition (OCR) or speech recognition (ASR), people use more elaborate solutions, e.g., weighted automata. The problem is that these solutions are often domain-specific, i.e., they are tailored to the domain at hand. The vision of uncertain data management is that there should be principled and generic tools to manage such noisy and incomplete data in all domains where it occurs; in the same way that relational databases today are both highly optimized and domain-agnostic tools.

For now, the most visible aspect of uncertainty management in concrete databases are nulls, which are part of the SQL standard. NULL is a special value that indicates that a record does not have a value for a specific attribute: for instance, the attribute did not exist when the record was created. Of course, there are many problems that NULLs do not address:

  • NULL can only represent missing values in a record, not missing records ("There may be more entries that I don't know about"), or uncertainty about whether the record exists altogether ("I think I know this, but it may be wrong.")
  • NULL just means that we know nothing about the value, it doesn't allow us to say that there are multiple possible values, or to model them using a probability distribution, or that two unknown values must be the same.
  • The semantics of NULLs (as defined by the SQL standard) are very counter-intuitive. For instance, trying to select all values which are either equal to 42 or different to 42 will not return the values marked NULL. This is because NULLs are understood as a three-valued logic, but it still does not make sense in terms of the possible worlds : NULL represents an unknown value, and no matter the value, it must be equal to 42 or different from 42...

The semantics of NULLs for SQL databases are now set in stone by the standard, even though people have complained about their imperfections as early as 19772, and even though they still confuse people today. There is still much debate about what the semantics of NULLs could have been instead, see, e.g., the keynote by Libkin, which contains more counter-intuitive examples with NULLs, a review of other theoretical models of NULLs to address some of these problems, as well as proposed fixes.

Research is also studying uncertainty management approaches that have nothing to do with NULLs. For instance, incompleteness can be managed by designing a different semantics to evaluate queries on databases, which is called the open-world assumption, and posits that all missing data is unknown rather than being wrong. To represent uncertainty about the existence of tuples, one can use, e.g., tuple-independent databases: they are standard relational databases, but with records carrying an independent "probability of existence". This formalism cannot represent correlations between tuples (if this tuple is here, then this tuple must be here, or cannot be here), and the result of queries on such databases may not be representable in the same formalism, but more expressive formalisms exist, such as pc-tables. There are also formalisms that can represent uncertainty on other structures, such as XML documents3.

The goal would be to develop a system where relations can be marked as incomplete, and data can simply be marked as uncertain, or inserted directly with probabilities or confidence scores; the system would maintain these values during query evaluation, to be able to tell how certain its answers are. For now, though, real-world uncertain database systems are quite experimental. There are many reasons for that:

  • It's hard to define principled semantics on uncertain data.
  • It's hard to find ways to represent uncertainty about any kind of data.
  • Somewhat surprisingly, probabilistic calculations on data are often computationally expensive, much harder than without probabilities!

This is why we are still trying to understand the basic principles about the right way to represent uncertainty and query it in practice, which is the general area of my PhD. The main way in which it differs from existing work is that I focused on restricting the structure of uncertain data, so that it can be queried in a decidable and tractable fashion. More specifically, my PhD research has studied three main directions:

  1. Connecting approaches for incomplete data reasoning that originated in various communities, to query incomplete data whose structure is constrained by expressive but decidable rule languages.
  2. Representing uncertainty on new structures, namely, on order relations, which had not been investigated before.
  3. Achieving computational efficiency for query evaluation, when we add qualitative uncertainty in the form of probabilistic distributions, by restricting the structure of data, e.g., imposing bounded treewidth.

Connecting different approaches to reason on incomplete data

The first direction deals with incompleteness in databases, and reasoning about an incomplete database in an open-world setting: What is possible? What is certain? The challenge is that this problem has been studied from various communities. On the one hand, database researchers have studied how to reason about databases. On the other hand, people who study reasoning and artificial intelligence are interested in drawing inference from data, and have tried to make their approach work on large data collections.

An important such community is that of description logics (DLs). The goal of description logics is to do reasoning, with the following goals:

  • Separate the data and the rules used to reason about the data.
  • Remain computationally tractable when the data is large, as long as the rules remain manageable.
  • Study precisely how the computational cost varies depending on which kind of reasoning operators4 you allow in the rules.

The data managed in the DL context differs from relational databases, however. The semantics is not exactly the same, but, most importantly, the data representation is different:

  • DLs represent data as a graph, where relations connect at most two objects. For instance, "John lives in Paris."
  • Relational databases, on the other hand, are a tabular formalism, where each row has a fixed schema that relates multiple objects at once. For instance, "John was born in Paris in 1942".

A different formalism, that works with relational databases, is that of existential rules (or tuple-generating dependencies in database parlance). Such rules allow you to deduce more facts from existing facts: "If someone was born in France and they have a parent born in France, then they are French." However, they cannot express other constraints such as negation: "It is not possible that someone is their own child." Further, existential rules can assert the existence of new objects: "If someone wrote a literary prize, then they wrote some book."

For people from databases, DLs, and existential rules, the goal is to design languages to reason about incomplete data, imposing the following properties:

  • principled, though the languages may make different semantic assumptions;
  • expressive, so they can accurately model the rules of the real world;
  • decidable, so that they are not so expressive that it would be fundamentally impossible to reason about them;
  • efficient, so they machines can reason with them with reasonable performance.

Description logics and existential rules

A first part of my PhD research is about bridging the gap between these communities. The goal is to design languages to reason about incomplete data that combine rules from these various communities. My work with Michael Benedikt (Oxford), Combining Existential Rules and Description Logics, connects the two formalisms I presented.

Say that you have a database, which you know is incomplete, and you have rules to reason about the implicit consequences of the data. Our work studies hybrid languages where you can express both existential rules and DL constraints. More precisely, we determine when such languages are decidable, and when they are not, i.e., it is fundamentally impossible to reason about them. The goal is to have the "best of both worlds": on parts of the data that can be represented as a graph, we would use the expressive DL rules; on parts of the data which are more naturally described as a relational database, we would have less expressive constraints inspired by existential rules.

In our work, we pinpointed which features of these two languages are dangerous and give an undecidable language if you mix them. The big problematic feature of DLs are functionality assertions: being able to say that, e.g., a person has only one place of birth. If we want to express such constraints, we must do away with two problematic features of existential rules. The first is exporting two variables: "If someone was born in some country, then that person has lived in that country." What we deduce should only depend on one element from the hypothesis: "If someone won a literary prize, then they wrote some book." The second is a restriction on the facts which the rules can deduce, which shouldn't form complex cyclic patterns in a certain technical sense.

Apart from these problematic features, however, we show that we can combine existential rules and expressive DL rules. This gives us hope: it could be possible to reconcile the two approaches and design very expressive languages that capture both kinds of constraints.

Open-world query answering and finiteness

I have explained how we connected two non-database approaches to reason about incomplete data. I also worked with Michael on reasoning about relational databases, in our work Finite Open-World Query Answering with Number Restrictions.

The general context is the same: we have incomplete data and we want to reason about its consequences, using logical rules. This time, however, our rule language is just defined using classical integrity constraints from relational database theory. Indeed, people have designed many ways to constrain the structure of databases to detect errors in the data: "Any customer in the Orders table must also appear in the Customers table". We use these as rules to reason about the complete state of the data, even though our database is incomplete and may violate them.

The language of constraints that we use includes functional dependencies, which are like5 the functionality assertions in DLs that I already mentioned. We also study unary inclusion dependencies, which are existential rules6 but of a very restricted kind.

The main problem comes from the semantics. In database theory, people usually assume databases to be finite. In particular, when using the rules to derive consequences, the set of consequences should be finite as well. This is justified by common sense ("the world is finite") but this problem is usually neglected in works about DLs and existential rules. So we studied finite reasoning in this database context, for the open-world query answering task of reasoning with our rules and incomplete information.

It is not hard to see that assuming finiteness makes a difference in some cases. Consider the following information about an organization:

  • Jane advises John (the database)
  • Each advisee is also the advisor of someone (an inclusion dependency)
  • Each advisee has a single advisor (a functional dependency)

Is it true, then, that someone advises Jane? The rule allows us to deduce that John, as he is advised by Jane, advises someone else, say Janice; Janice herself advises someone else, say Jack. In general, this could go on indefinitely, and we cannot deduce that someone advises Jane. However, if we also assume that the organization is finite, then it has to stop somewhere: someone along this chain (Jennifer, say) must advise someone that we already know about. And by the rule that no one is advised by two different people, we deduce that Jennifer must be advising Jane.

As we show in our work, the only difference that finiteness makes, is that it causes more rules to be added7. Once this is done, it turns out that we can essentially forget about the finiteness hypothesis, because we can no longer distinguish between finite and infinite consequences. This is surprisingly difficult to show, though; and to establish this correspondence, we need much more complex tools that those used in the infinite case to actually do the reasoning!

Representing uncertainty on ordered data

The second part of my PhD research deals with the representation of uncertainty on new kinds of data, namely, ordered data. Order can appear at different levels: on values ("retrieve all pairs of people when one is born before the other") or on records ("retrieve all page views that occurred yesterday"). In SQL databases, order is supported with the standard comparison operators on numbers ("WHERE length > 4200"), with sorting operators ("ORDER BY date"), and with the LIMIT operator to select the first results.

Order on facts

Why is it important to have uncertain ordered data? Well, an interesting phenomenon is that combining certain ordered data may cause uncertainty to appear. Consider a travel website where you can search for hotels and which ranks them by quality. You are a party of four, and you want either two twin rooms, or one room with four beds; but the website doesn't allow you to search for both possibilities at once8. So you do one search, and then the other, and you get two ordered lists of hotels. In real life you would then browse through both lists, but this is inefficient! What you would really want, in database parlance, is the union of these two lists, i.e., the list of hotels that have either two twin rooms or one 4-bed room. But how should this list be ordered? This depends on the website's notion of quality, which you don't know. Hence, the order on the union is uncertain. It is not fully unspecified, though: if two hotels occur only in one list, then their order relation should probably be reflected in the union. How can you formalize this?

To solve this kind of problems, people have studied rank aggregation: techniques to reconcile ordered lists of results. However, these methods are quantitative, i.e., if something appears near the top in most lists then it should appear near the top in the result. What we set out instead is to represent what we know for sure, i.e., the set of all possible consistent ways to reconcile the order, without trying to make a choice between them.

Couldn't we just use existing uncertainty representation frameworks? If you try, you will realize that they do not work well for ordered data. You don't want tuple-level uncertainty, because you know what the results are. You could say that the position of each result is uncertain: "this result is either the first or the third". Yet, when you know that hotel a must come before hotel b, you must introduce some complicated dependency between the possible ranks of a and b, and this gets very tedious.

With M. Lamine Ba (former fellow PhD student in Télécom, now in Qatar), Daniel Deutch (from Israel) and my advisor Pierre Senellart, we have studied this question in our work, Possible and Certain Answers for Queries over Order-Incomplete Data. We use the mathematical notion of a partial order as a way to represent uncertain ordered data. More precisely, we explain how we can apply to partial orders the usual operators of the relational algebra (select, project, union, product). We extend this to accumulation, to answer queries such as "is it certain that all hotels in this district are better than all hotels in that district?"

This poses several interesting challenges. The most technically interesting one is that of efficiency: how hard it is to compute the result of such queries? Somewhat surprisingly, we show that it is intractable (NP-hard). In fact, it is already hard to determine, given such a partially ordered database, whether some totally ordered database is possible9. In other words, it is intractable already to determine whether some list is a consistent way to reorder the data! To mitigate this, we study "simplicity measures" of the original ordered relations (are they totally ordered, totally unordered, or not too ordered, not too unordered?), and we see when queries become tractable in this case, intuitively because the result of integrating the relations must itself be simple.

Order on values

I also work on uncertainty on orders with Tova Milo and Yael Amsterdamer (from Tel Aviv). This time, however, the partial order itself is known, and we want to represent uncertainty on numerical values that satisfy the order.

Our work, Top-k Queries on Unknown Values under Order Constraints, is inspired by crowdsourcing scenarios. When working on data that you collect from the crowd, every query that you make introduces latency (you need to wait for the answers to arrive) and monetary costs (you have to pay the workers). Hence, you cannot simply acquire data about everything that you are interested in: instead, you must extrapolate from what you can afford. Following our earlier works on similar questions, we are interested about classifying an items in a taxonomy of products. In this example application, the order is induced by the monotonicity property: shirts are more specific than clothing, so if we classify an item as being a shirt, it must also be a piece of clothing.

The underlying technical problem is to interpolate missing values under total order constraints, which I think it is an interesting question in its own right. We study a principled way to define this formally, based on geometry in polytopes, and study again the computational complexity of this task. Why is this computationally intractable in general, and what are circumstances where it can reasonably be done? For instance, we can tractably perform the task when the taxonomy of products is a tree.

Reasoning about order

More recently, I also started working with Michael Benedikt on open-world query answering tasks that involve order relations. Again, this is useful in practice, because order often shows up in queries and rules; however it is challenging because it is hard to talk about orders in rules. Essentially, to write a rule that says that something is an order, you have to express transitivity: if a < b and b < c then a < c. Unfortunately, such rules are prohibited by many existing rule languages, because they could make the rules too expressive to be decidable. Hence, we study in which cases the specific axioms of order relations can be supported, without causing undecidability. We are still working on it and the results are not ready yet.

Tractability of probabilistic query evaluation

The last direction of my PhD research, and indeed the only one I really started during the PhD, is about the tractability of query evaluation on probabilistic database formalisms.

In fact, apart from the work with Tova and Yael and Pierre that deals with numerical values, everything that I described so far was about uncertainty in a logical sense, not in a quantitative sense. Things are either possible or impossible, either certain or not, but there is no numerical value that quantifies how likely something is over something else.

There are multiple reasons to focus on logical uncertainty. The first reason is that defining a set of possibilities is easier than additionally defining a probability distribution on them. It is hard to define meaningful and principled probabilities over the space of all possible consequences of an incomplete database, or over the linear extensions of a partial order.10 This is why people usually study probability distributions on comparatively simple models, e.g., tuple-independent databases (TID). With TID, the set of possible worlds has a clear structure (i.e., all subsets of the database) and a simple probability density: the probability of a subset is the probability of keeping exactly these facts and dropping the others, assuming independence.

The second difficulty is that, even on the TID model, computing probabilities is computationally intractable. Imagine that a dating website has a table indicating who sent a message to whom, in addition to a table indicating in which city users live. We pose a simple query asking for pairs of users that live in the same city, such that one messaged the other. On a standard relational database, this is easy to write in SQL, and can be efficiently evaluated. Now, imagine that we give a probability to each fact. Say, e.g., we are not sure about the possible cities a user may be in, and we only look at messages expressive a positive sentiment: in both cases we use machine learning to obtain probability values for each fact. Now, we want to find the probability that two people live in the same city and one sent the other a positive message.

As it turns out, this is much harder than plain query evaluation!11 The problem is that correlations between facts can make the probability very intricate to compute: intuitively, we cannot hope to do much better than just enumerating the exponential number of possible subsets of the data and count over them. An important dichotomy result by Dalvi and Suciu classifies the queries (such as this example) for which this task is intractable, and the ones for which it is tractable.

With my advisor and Pierre Bourhis from Lille (whom I met at Oxford), we wondered whether everything was lost for hard queries. We had hope, because the result by Dalvi and Suciu allows arbitrary input databases, which is quite pessimistic: real data is usually simpler than the worst possible case. We thought of this in connection with a well-known result by Courcelle: some queries which are generally hard to evaluate become tractable when the data has bounded treewidth, intuitively meaning that it is close to a tree. In the probabilistic XML context, it had already been noted12 that probabilistic evaluation is tractable on trees, e.g., on probabilistic XML data, even for very expressive queries. We thought that this could generalize to bounded treewidth data.

We show such a result in our work Provenance Circuits for Trees and Treelike Instances, which studies the more general question of computing provenance information. The field of provenance studies how to represent the dependency of query results on the initial data; it relates to probability evaluation because determining the probability of a query amounts to determining the probability of its provenance information, or lineage. Intuitively, the lineage of my example query would say "There is a match if: Jean messaged Jamie and Jean lives in Paris and Jamie lives in Paris, or..." To determine the probability that the query has a match, it suffices to evaluate the probability of this statement (which may still be intractable).

We rephrase probabilistic query evaluation in terms of provenance because there is a very neat presentation of provenance in terms as semiring annotations manipulated through relational algebra operators. Our work shows that how an analogous notion can be defined in the context of trees and automata, and can be "transported" through Courcelle's correspondence between bounded treewidth graphs and trees. As it turns out, the lineages that we can construct are circuits13 and themselves have bounded treewidth, so we incidentally obtain that probability evaluation is tractable on bounded treewidth instances. We extended this to show the tractability of many existing database formalisms, and the existence of tractable lineage representations.

In a more recent work, Tractable Lineages on Treelike Instances: Limits and Extensions, we show that bounded-treewidth tractability essentially cannot be improved: there are queries which are hard to evaluate for any restriction on input instances which does not imply that the treewidth is bounded (or that the instances are non-constructible in a strong sense). This lead us to investigate questions about the extraction of topological minors, using a recent breakthrough in this area. We developed a similar dichotomy result for other tasks, including a meta-dichotomy of the queries that are intractable in a certain sense14 on all unbounded-treewidth constructible instance families, a result reminiscent of the dichotomy of queries by Dalvi and Suciu, but with a very different landscape.

The hope would be to develop query evaluation methods on probabilistic data which use the structure of both the instance and query to achieve tractability, or to make well-principled approximations. We hope that such techniques could lead to realistic ways to evaluate queries on probabilistic databases.

Other things

This post only presents my past or current work that fits well with the general story that I plan to tell about my PhD research. I'm sorry to my past and current collaborators who are not featured in the list. In particular, some of my other collaborators are:

  1. It is true that NoSQL databases have become popular as well, but it says something about the ubiquity of SQL that we refer to other databases as being "not SQL". Besides, the new trend of NewSQL databases further suggests that the relational model is here to stay. To give a practical example, Android and Firefox contain SQL database implementations, e.g., with sqlite

  2. An example from 1977 is John Grant, Null values in a relational data base, which complains about the semantics. (Sadly, this paper seems paywalled, but I just give it for the reference and the date, not for its specific contents.) 

  3. If you want to know more about uncertainty management frameworks on relational databases, a good reference is the Probabilistic Databases book by Suciu, Olteanu, and Koch (but it doesn't seem available online). For probabilistic models, a good reference is this article by Benny Kimelfeld and my advisor Pierre Senellart. If you're curious about the connections between the two, I wrote an article with Pierre about precisely that. 

  4. An example of an expensive feature is disjunction: if you can infer from the data that something or something else is true, it becomes harder to reason about the possible consequences of your data. 

  5. Functional dependencies are slightly more complicated than DLs functionality assertions. This is because databases are tabular, so we can write functional dependencies like "the pair (first name, last name) determines the customer ID in the Customers table". This means that the Customers table cannot have two rows with the same first and last name but different customer IDs. Such rules do not make sense with DLs, because the "rows" have at most two columns. 

  6. In particular, they cannot export two variables, because as I pointed out, this causes undecidability with functional dependencies. The other restrictions are that they can only deduce a single fact (so they respect the second condition about non-cyclic heads), they only use a single fact as hypothesis, and they do not any variable appear at multiple positions of an atom. 

  7. In the example, we would deduce that each advisor is an advisee, and also that no advisor advises two different people. 

  8. Yes, I speak from experience, and I have a specific website in mind, but I won't do them the favor of offering them free advertising as a reward for bad usability. 

  9. Formally, we want to determine whether a labeled partial order has a compatible total order whose label sequence falls in a certain language. This interesting direction was not explored by standard poset theory, as far as we know, probably because it is too closely related to computational questions.
    Intuitively, hardness is caused by the ambiguity introduced by duplicate values. In other words, if two different hotels have the same name in the input lists, it is hard to determine how they should be paired with the elements of the candidate combined list that we are checking. 

  10. These are hard questions, but it doesn't mean that they wouldn't be interesting to study — quite the opposite in fact. :) For probabilities on linear extensions, see internship offer 2. For probabilities on infinite instance collections, as it turns out, we seriously tried to define such things via recursive Markov chains before we got sidetracked by simpler questions that were also unsolved. 

  11. Formally, evaluating this query is #P-complete in the input database, whereas it is in AC0 if the input database is not probabilistic. 

  12. Cohen, Kimelfeld, and Sagiv, Running Tree Automata on Probabilistic XML, which sadly seems paywalled

  13. Circuit representations of provenance are a recent idea by Daniel Deutch, Tova Milo, Sudeepa Roy, and Val Tannen. The fact that we use their work, however, is not connected to my collaborations with Daniel or Tova. It's a small world. 

  14. Formally, they do not have polynomial-width OBDD representations of their provenance, which is a common reason for queries to be tractable; they may still be tractable for other reasons, though we do not show it. 

Lettre ouverte à ma députée sur le vote prorogeant l'état d'urgence

To my English-speaking readers: France is currently busy destroying public freedoms in the wake of the november attacks. Both the lower and upper legislative chambers approved by a near-unanimous vote a 3-month state of emergency. This seems sufficiently dire to me to write to my representative and post the letter on this blog.

Je ne parle pas beaucoup de politique ici, mais ce qui se passe en ce moment me semble suffisamment grave pour être mentionné. Ce jeudi, l'Assemblée nationale a voté trois mois d'état d'urgence, à l'unanimité moins six voix, suivie vendredi par le Sénat en un vote unanime. Ma députée, Julie Sommaruga, ne faisait pas partie des six qui ont eu le courage de s'opposer, et je lui ai donc écrit pour lui manifester ma désapprobation. (Les liens sont des ajouts par rapport à la version envoyée.)

Objet : Lettre ouverte sur votre vote prorogeant l'état d'urgence

Madame la députée,

C'est vous qui me représentez à l'Assemblée nationale, et je souhaitais réagir à votre vote sur le projet de loi prorogeant l'état d'urgence.

Le 16 novembre, sur votre site Web, votre réaction aux attentats du 13 se concluait ainsi : « Notre fermeté, notre unité, notre sang-froid et notre fidélité à nos valeurs républicaines triompheront de la barbarie, vaincront le terrorisme. »

Le 19 novembre, vous avez permis à l'Assemblée d'entériner, pour trois mois, un état d'urgence qui représente une grave atteinte aux libertés fondamentales. Vous avez permis à l'exécutif d'assigner à résidence sur la base de comportements, de perquisitionner, d'interdire manifestations et rassemblements ; tout cela au nom de la menace terroriste, cette perpétuelle urgence des dernières années et de celles à venir.

Madame la députée, les valeurs dont vous vous revendiquez sont précisément celles qui m'amènent à désapprouver votre vote. Ce n'est pas agir avec fermeté, que de céder à la faiblesse de ceux qui veulent qu'on les rassure, ceux qui sont réconfortés par un spectacle de mesures sécuritaires inefficaces, ceux-là même qui applaudissent les actions militaires revanchardes et aveugles où nous nous enlisons déjà. Ce n'est pas faire preuve de sang-froid, que d'adopter si tôt de telles mesures, alors que nous sommes tous encore sous le choc de ces atrocités et du tapage sensationnaliste des médias.

Il faut se méfier de cette unité à laquelle on nous demande de croire, comme si le débat politique devait se suspendre sous la violence de ces crimes, et une opinion commune s'imposer immédiatement à chacun. N'est-il pas un peu suspect, finalement, que nous nous soyons tous mis d'accord si vite ? Ne pensez-vous pas, comme moi, que l'on entendrait des avis dissidents s'élever, s'il leur était permis de manifester, s'il leur était possible de s'opposer à l'alarmisme ambiant sans qu'on leur objecte le respect dû aux victimes ?

Comme vous, je crois à nos valeurs républicaines, comme vous je suis convaincu que c'est elles qui doivent nous guider en ces temps difficiles. Pourtant, nombre d'entre elles me semblent gravement menacées par le texte que vous avez défendu. Ainsi, le pouvoir législatif, que vous exercez en me représentant, ne bafoue-t-il pas le principe de la séparation des pouvoirs, quand il accède aux demandes d'un pouvoir exécutif qui veut court-circuiter le pouvoir judiciaire ? Peut-on encore parler de présomption d'innocence, quand l'exécutif prive arbitrairement ses citoyens de liberté, au nom d'une urgence qu'il veut étaler sur trois mois ? L'État de droit peut-il s'accommoder de cet état d'urgence permanent avec lequel on parle de le concilier ? En somme, n'est-ce pas laisser triompher le terrorisme, que de rogner ainsi nos chères valeurs, pour mieux les protéger ?

Madame la députée, je tenais à vous écrire pour vous dire que votre vote ne représente pas mes convictions. Comme d'autres de vos électeurs, peut-être, je regrette amèrement de vous voir approuver les graves erreurs que nous commettons aujourd'hui, et j'ai peur de cette France que l'on nous prépare pour demain.

Bien cordialement,

Antoine Amarilli (Montrouge), doctorant

Je recommande aussi fortement la lecture de J'ai peur, un article écrit par Pablo, qui correspond bien à mon avis sur ce qui est en train de se passer.

Managing disk image files, and compression with lrzip

Here is a random collection of tricks I figured out when trying to manage disk image files and compressing them (and other stuff) with lrzip.

Getting disk image files

When I stop to use a machine and think I have migrated all important things, or when a hard drive seems about to die, or when a machine dies and I salvage its hard drive, the easy solution is to just take an image of the drive to keep as a file on some other medium.

The standard tool for this is dd. It is very powerful, though its syntax is arcane and entirely different from any other command I know of, for historical reasons probably. Interesting things about it:

  • Its default operation is quite slow, but it can be made faster by specifying a different block size. However, if you make the size too large, it may become slower again. Apparently the only way to figure out the fastest size is by testing different possibilities, which I think is also what gparted does, but from experience it is often around 1M. Beware that all parameters like count=, seek=, skip= are always given in blocks (not as an absolute size). The size of input and output blocks may be different.

  • While dd does not display any progress information, it can be made to display some by sending it SIGUSR1, as in pkill -USR1 dd. I often use watch to issue this command periodically to see how the operation progresses. [For simpler tasks, an alternative way to get nice progress information is to just use pv, e.g., pv /dev/sda > sda.img. Thanks Mc for the suggestion.]

When you are copying information from a faulty or failing drive, a better tool is GNU ddrescue. It will attempt to copy the data, but it will skip the sectors that cannot be accessed, to get as much data as possible; it will then progressively refine the missing sectors. Don't miss the possibility of using a logfile as a third argument, to save the progress information and be able to resume the operation later (see manpage).

When taking disk images, you can either retrieve a partition (e.g., /dev/sda1), or the entire drive (e.g., /dev/sda). Partition image files can be mounted, e.g., with sudo mount -o loop sda1.img /mnt/sda, you can also use -o ro for readonly. Whole device files can be examined with, e.g., fdisk -l sda.img, to see the partitions. You can also mount each partition manually:

LOOP=$(losetup -f)
fdisk -l sda.img
losetup --offset N "$LOOP" sda.img
mount "$LOOP" /mnt/mountpoint

where N is the offset of the partition in blocks indicated by fdisk, multiplied by the block size (also indicated by fdisk, usually 512 bytes).

Shrinking disk image files

When you are keeping image files for archival purposes, you want to make them as small as possible. For instance, if the file has free space, you want to resize it to the smallest possible size. Note that if you do not do this, the free space will be filled with random garbage, so you cannot even hope that it compresses well. (Of course, before doing this, you must be OK with modifying the image file, and you must be sure that you wouldn't want to recover anything from this space, e.g., a deleted file on the partition.)

For ext2, ext3, and ext4 filesystems, the standard filesystems on GNU/Linux, it turns out that resize2fs works just fine on image files, and has a convenient option to resize to the smallest size: -M. It can also show progress with -p. It requires e2fsck to be run first, which you should run with -C 0 to see progress on stdout, -tt if you want more timing information, -f to force checking even when it seems unnecessary, and -p to automatically fix errors when it is safe to do so.

For FAT filesystems, you should first run fsck.vfat -p, and you can then use fatresize, with -p, and with -s giving the size (to be obtained with fatresize -i IMAGE | grep '^Min'). Afterwards, you need to truncate the file (which fatresize does not do) with truncate -s SIZE, with the SIZE obtained from fatresize -i IMAGE | grep '^Size'.

All of the above applies to image files for single partitions; to perform them on full device images you can do it on a loop device created as explained above, though of course the image file will not be shrunk then.

In some cases it is not safe to resize partitions. For instance, if you took an image of a full device of an embedded system, it would be possible that resizing partitions and moving them around would confuse the bootloader. In this case, an alternative is to zero out the unneeded space, so that the file is not smaller but at least it will compress better. For extN systems, you can use zerofree to do this.

I wrote a script, shrinkimg, to automate such tasks (also compression and decompression). I do not guarantee that it will work for you, but you can always adapt it.

Compressing files

Another important thing to do when archiving files is to compress them. First, using tar without any compression option, you can always reduce your files to a single .tar file to be compressed. There is a tradeoff between compressing each file individually, which makes it easier to access each of them, and tar-ing multiple files together to compress them together, which will yield a smaller result (redundancies across files can be used). My heuristic is to tar together files that are related (e.g., different versions of the same codebase).

Not all compression algorithms are alike. The common ones I have used are the following (from fastest, oldest, and least efficient, to slowest, most recent, and most efficient):

  • gzip, using DEFLATE, so essentially like most ZIP files
  • bzip2, similar but with some improvements, e.g., the BWT
  • xz, using LZMA, essentially the same as 7-zip
  • lrzip, the most recent, which I will describe in more detail.

lrzip is packaged for Debian, and performs two independent steps. First, it uses a large sliding window to find and eliminate redundancies in the file even when the occurrences are far apart. Second, it compresses the result using a configurable algorithm, the default one being LZMA, the most efficient but slowest being ZPAQ. It is multithreaded by default at the expense of a small loss of compression efficiency.

lrzip provides its own benchmarks but I wanted to test it on my own data. I used a 19 GiB image file. I tried xz and lrzip with the default compression level or the increased one -L 9, with the default LZMA algorithm or with ZPAQ -z. Take these timings with a grain of salt, the machine was often busy doing other things: they are wall clock timings (so in particular lrzip can take advantage of parallelization).

Command   Time (s)    Filesize (bytes)
xz -9 10588.75 7300161660
lrzip 4871.58s 5734885539
lrzip -L9 5980.53s 5693722686
lrzip -z 17852.32s   5544631223
lrzip -z -L9 39809.40s   5332451600

The default lrzip version took around 20 minutes to decompress, the version obtained with lrzip -L9 -z took around 12 hours (!). I checked that the file decompressed fine, and with the same SHA1 image, to verify that no corruption had occurred.

So, in brief, lrzip is noticeably more efficient than xz, at the expense of being slower if you use the advanced settings. So I think it is worth it to use it on very large images where one can expect long-distance redundancy.

The main drawbacks of lrzip are that it is RAM-hungry. It tries to allocate a sensible amount only, but you definitely shouldn't run multiple copies in parallel unless you know what you are doing: it may get killed by the OOM killer, starve the machine, or fail to allocate with a message like Failed to malloc ckbuf in hash_search2.

I should also point out an impressive feat of lrzip, where it did amazingly well when compressing a tar archive of a bunch of database backups (each of them being mostly a prefix of the latest one). The .tar was 1.6 GiB, gzip compressed it to 518 MiB, xz to 112 MiB, and lrzip with advanced settings to just 2.5 MiB. In this case lrzip was much faster than the others, because the redundancy elimination went fast, and the size of the relevant data to be compressed was very small indeed!

Thanks to Pablo Rauzy for proofreading this.

La consultation République Numérique : mes réponses et mes scripts

Une consultation République numérique sur le projet de loi pour une République numérique a été lancée et se déroulera jusqu'au 18 octobre.

Je partage à cet égard les mêmes réserves que Stéphane Bortzmeyer ou La Quadrature Du Net : cette consultation arrive bien tard, et difficile de croire que l'expression des citoyens sera véritablement prise en compte. Ceci dit, je me dis qu'il vaut mieux donner son avis quand on le peut, plutôt que de se plaindre après coup de ne pas avoir été écouté...

J'encourage donc ceux qui le peuvent à voter. Si vous pensez que vous n'avez pas le temps, mais que vous avez quelques compétences techniques, j'ai fait un script Python pour vous permettre de voter rapidement de la même façon que moi, allez directement à la section correspondante.

Votes et soutiens

Vous pouvez voir mes votes sur le site de la consultation. Dans de nombreux cas, j'ai dû me résigner à voter pour des amendements dont la rédaction précise n'est pas satisfaisante, ou le champ d'application pas assez large, ou le rapport avec le texte assez peu clairement établi, faute de mieux et dans l'espoir de susciter un débat intéressant... J'ai aussi souvent lu en diagonale, car il y a beaucoup de matériel et je n'ai pas une quantité illimitée de temps...

C'est assez décourageant (mais pas si surprenant) de voir que la plupart des propositions sont de bien piètre qualité. En fait, quelque chose qui m'a davantage surpris, c'est que les propositions initiales du gouvernement (clairement identifiées), ou celles du Conseil national du numérique, sont parfois assez judicieuses, même si ça se limite trop souvent à de bonnes intentions, sans sanctions, sans garanties, sans recours, et sans conséquences réelles. J'ai tout de même voté pour certaines d'entre elles...

Une bonne façon de trouver des propositions que l'on a envie de soutenir, c'est de regarder celles d'associations ou d'individus qui font du bon travail. En particulier, j'ai voté pour toutes les propositions de La Quadrature Du Net, de l'April, de Wikimédia France (pour l'importante question de la liberté de panorama). Pour les questions de libre accès des travaux de recherche (voir ici pour une excellente introduction documentée sur ce problème), j'ai soutenu le CNRS, INRIA, le Consortium Couperin, Roberto di Cosmo et SavoirsCom1, entre autres.

Autres idées

Il y a trois points que j'aurais bien ajoutés à la consultation, mais je ne le fais pas faute de temps, et parce que je n'ai guère espoir que ce soit utile :

  • Personne ne s'est spécifiquement plaint de la pénibilité, dans les textes législatifs, des modifications qui sont décrites en langue naturelle, du genre « à l'alinéa (ii.) de l'article L-42, au point (b.), le mot "est" est remplacé par "n'est pas" ». En termes d'open data, ce serait bien de faire quelque chose pour que le jeu de données de Légifrance et autres sources soit plus utile, en demandant que des diffs exploitables par un ordinateur soient fournis en plus de (ou à la place de) la version actuelle (qui est illisible pour les ordinateurs, plus encore que pour les humains).

  • Une proposition de l'April indique que le code source des logiciels publics sont susceptibles de faire l'objet d'une communication aux citoyens. Ceci fait suite à un avis de la CADA commenté par exemple par Next INpact. Je suis tout à fait d'accord, mais j'irais même plus loin. En effet, l'avis en question indique que des "droits de propriété intellectuelle détenus par des tiers à l’administration" seraient susceptibles, s'il y en avait, de limiter le droit du demandeur (Monsieur X) à réutiliser le code concerné. Je pense qu'en pratique cette restriction pourrait être très problématique, vu que les administrations ont tendance à faire appel à des prestataires pour développer leurs logiciels. À mon avis, il devrait être entendu que tout prestataire développant ou hébergeant du logiciel pour l'administration doit être prêt à le fournir, sous licence libre, sur demande des citoyens, sans pouvoir s'opposer à sa réutilisation ultérieure.

  • Ce serait utile d'avoir un portail pour demander l'ouverture d'un jeu de données dont on estime qu'une administration (sans devoir chercher laquelle) doit bien le détenir. En particulier, j'avais un jour cherché à trouver la base de données des valeurs locatives cadastrales utilisées pour le calcul de la taxe d'habitation, sans rien trouver, et j'aurais bien aimé pouvoir demander quelque part ce jeu de données (apparemment fixe depuis 40 ans).


Je partage l'insatisfaction de Bortzmeyer sur la plateforme elle-même, et l'absence de moyen de remonter des problèmes avec celles-ci. En particulier, le diff mot par mot est parfois fautif, souvent illisible. Cependant, à mon avis, le plus grave problème est de ne pas pouvoir simplement déléguer son vote à d'autres institutions. Suivant le principe de la démocratie liquide, je voudrais pouvoir confier ma voix à ceux en qui j'ai confiance, sans avoir besoin de me documenter sur le détail de chacune de leurs propositions. Cependant, la plateforme ne propose rien de mieux que d'aller voir leurs propositions et voter pour elles une par une. Cette délégation manuelle ne sera pas mise à jour si le mandataire change ses votes, et, même si on oublie cet inconvénient, elle est évidemment beaucoup trop pénible à faire en pratique.

C'est pour cette raison que j'ai écrit un jeu de scripts pour pouvoir facilement voter en soutien à un autre utilisateur, de façon automatique. L'objectif n'est pas de faire du bourrage d'urnes, simplement de permettre à qui le souhaite de voter de la même manière que quelqu'un en qui il a confiance, ou de voter pour toutes les propositions d'autres utilisateurs, sans devoir le faire vote par vote.

S'il vous plaît, utilisez ces scripts de façon raisonnable, et en particulier ne créez pas d'utilisateurs factices. Je n'ai écrit ce code que parce que la plateforme de consultation ne permet pas facilement de soutenir toutes les propositions d'un utilisateur, ou de reprendre à son compte tous les votes d'un autre utilisateur. J'estime que ces fonctionnalités devraient en principe être proposées par le site lui-même. À ma connaissance, l'utilisation de tels scripts à de telles fins n'est pas contraire à la charte. Cependant, je n'accepte aucune responsabilité pour ce que vous en ferez.

TL;DR: comment voter

Commencez par créer un créer un compte (n'utilisez pas Facebook ou Google+, c'est mal, et ça ne fonctionnera pas). Renseignez les informations, recevez le mail de vérification, et validez votre compte. Mettons que votre email soit toto@example.com et votre mot de passe "4242".

Ensuite, installez mes scripts (j'indique les prérequis pour Debian, mais vous pouvez adapter pour votre distribution) :

sudo apt-get install git python3-lxml python3-bs4 python3-requests
git clone https://a3nm.net/git/republique
cd republique

Pour voter pour les propositions des associations que je soutiens en utilisant le compte que vous avez créé :

./get_propositions.py april laquadraturedunet wikimediafrance inria \
    cnrsdistrenaudfabre consortiumcouperin robertodicosmo savoircom1 \
    > propositions
./vote.py toto@example.com "4242" < propositions

C'est normal que ces commandes prennent quelques minutes, vu qu'elles doivent faire deux requêtes par proposition. En cas d'échec, assurez-vous bien d'avoir créé et validé le compte au préalable.

Pour reprendre à votre compte tous les votes que j'ai faits :

./get_votes.py antoineamarilli > votes
./vote.py toto@example.com "4242" < votes

Si vous vous appelez Jean Dupond et que vous n'avez pas d'homonymes, vous pouvez vérifier que vos votes sont bien en ligne à l'URL https://www.republique-numerique.fr/profile/user/jeandupond.

N'hésitez pas à utiliser mes scripts puis à réajuster vos votes, car vous pouvez toujours les modifier ou les effacer individuellement sur le site de la consultation. Si vous votez plusieurs fois (avec l'outil ou en ligne) sur un sujet, seul le vote le plus récent est pris en compte. En particulier, si plusieurs de vos sources de propositions ou de votes sont en désaccord, c'est la dernière mentionnée qui prévaut.

J'ai corrigé un bug qui faisait que des votes favorables à un argument portant sur une proposition ou une modification pouvaient être interprétés à tort comme un vote sur la proposition ou modification elle-même. Si vous avez utilisé la vieille version de get_votes.py, vous avez peut-être voté à tort pour certaines propositions ou modifications. Si vous avez utilisé les commandes que je suggérais, vérifiez en particulier votre vote sur l'article 9, la modification du CNRS, et la commercialisation des données publiques.

Je produis régulièrement quelques tables (moches et faites très rapidement) pour pouvoir suivre de façon commode les propositions, modifications et arguments ayant reçu le plus de votes favorables et le plus de votes au total. Le code qui les génère est dans le dépôt git.