a3nm's blog

Privacy in public space

— updated

We can roughly divide the world in two regions: public space, where anyone can enter and watch what happens, and private space, where only selected people can get in and see what takes place inside. Of course, there are borderline conditions, such as space being either private or public depending on time of day (e.g., parks that close at night), some space being "public" but technically private property subjected to house regulations (e.g., inside a shop), and some space being "private" but with the technical possibility for outsiders to look inside (e.g., through windows, into a private home or vehicle, with convoluted mechanisms to explain why this is still a violation of privacy). Still, by and large, the distinction holds.

It seems that, by definition, there can be no expectation of privacy in public space, because, in principle, everything that happens there can be witnessed and recorded by anyone (and possibly shared publicly, although personality rights may apply). Yet, people rely on it to some extent. First, in terms of proximity: when you are with someone else (e.g., in a park) and there is no one around, you assume that your conversations are private, and (except in crowded environments) a third party cannot intrude and listen (because of the fuzzy assumption that, as long as they can sit elsewhere and enjoy an equivalent portion of the space, you should be entitled to your own spot; and that, when you talk and there is no one around, the information will not be recorded from far away). Second, in terms of unrelatedness, as people will sometimes indulge in a conversation with third parties within earshot under the assumption that they are not concerned by what is said (and politeness would require them not to pay attention). Third, in terms of continuity: you assume that no one knows your whereabouts at all times, because to do so they would have to stay close to you always, that is, follow you, and there is again an assumption that other's use of public space should not be "guided" by yours. Fourth, in terms of ephemerality: even if someone were to see or hear, you may assume that they will not retain a record of it forever, and that it is not possible to look up public space information about the past, because it was not archived.

My point is to focus on the last two privacy expectations, and show that they can break down when the notion of "public space" becomes altered by technology.

Today, a variety of entities (shops, police, etc.) have already installed CCTV cameras (within their private property or with applicable permits) to monitor public space. Cameras tend to become more and more widespread, so that more and more public space is filmed. The resulting trail of data is not eternal, for practical and legal reasons; but both limitations tend to disappear as time passes. So eventually we may assume that an uncoordinated bunch of actors will store traces which, together, could be used to reconstitute the entire history of everything visible in public space.

Now consider a second step which is currently starting to happen: CCTV cameras that upload their recordings to the cloud rather than storing them locally. This seems natural as more and more computing and storage is centralized in datacenters rather than on individual devices. Now, as I see no reason why cloud providers should not remain an oligopoly (or even become a monopoly), suddenly a growing proportion of the acquired data (in raw form) is available to a small number of actors. Incidentally, wiretapping ensures that various secret agencies also get access to the data.

Add a third step where the storage space, processing power, and algorithmic sophistication of the cloud providers go to infinity. Suddenly, all those actors have a different kind of access to public space, which is not limited by the notion of presence which intuitively applies for humans. They can know everything that happens everywhere, or happened at any point in time. I call this total public space access. This marks the collapse of the two privacy expectations I mentioned.

Of course, this has far-reaching consequences. Organizations with total public space access can know where everyone is located and the history of everywhere they went. This is problematic because of all the private information (love affairs, political organization, etc.) which is revealed by location information. (There are currently easier ways to retrieve this information, in a less precise manner, for people carrying mobile phones, but with CCTV opting out becomes much harder.) Note that this also implies you cannot privately go from point A to point B through public space, even if A and B are private... A tentative workaround would be to cover your face so that you are not recognizable, but this may be illegal, and does not suffice: people tend to usually return to their private dwellings, so that total access to public space is sufficient to establish a continuous trail for them, and thus identify them even if their appearances are indistinguishable.

Of course, this is not the only way in which unrestricted public space access challenges usual privacy expectations. Consider names on doorbells. To my knowledge, there is currently no database harvested from them that provides all addresses where a certain name appears, and people therefore do not consider that putting their real name on the doorbell divulges the information in that direction, from their name to the address. Yet this is all information available in public space, so I am not sure about the general legal framework that would prohibit the construction of a reverse database as I described.

The disappearance of privacy in public space is not necessarily a bad thing in itself: unrestricted public space access is a power, so it can be used for good, or for evil. It can be used to fight crime: while it cannot ensure that crime is altogether prevented, it ensures that crimes committed in the public space always leave a trace that can be investigated. Under the (non-obvious) assumption that this trace cannot be tampered with, it means that the objective truth of any claim about public space can be assessed. It implies that criminals can no longer run away (assuming interference powers from the police to extract criminals from a hideout in private space, and assuming that private space regions are not well-connected, as is the case in real life).

It is not clear that the provability of public space crime would make it impractical, because some criminals may not care if they will get caught; but assuming that it does, the benefits for society is not just the crimes that are no longer committed, it is much higher: it means that precautions to prevent the crimes are no longer needed (bikes, doors no longer need to be locked, stuff can be left in public space without risk), and also that some efficient rental schemes become practically applicable (if, e.g., there is no longer a risk that the rented good is not returned). Beyond crime, unrestricted access to public space gives opportunity for smarter decisions in terms of traffic, queues, shops being opened or closed, bus schedules, etc. Indeed, a lot of practical inefficiencies are the result of insufficient knowledge of public space, which (currently, and assuming that algorithms are not a problem) is usually caused by insufficient available data.

I have claimed that total public space access, under the assumptions that I outlined, will eventually become a technological possibility, and the default situation would be that a small number of organisations get it and the general public doesn't. What should be done about this?

A first option would be to legally prohibit total access to public space, or make it impossible. A good umbrella term (coined, to my knowledge, by Louis Jachiet) is that of indiscriminate data acquisition in public space. The rationale is that while people taking pictures, tourists filming monuments, etc., are acquiring information in a targeted manner, total public space access would result from CCTV, Google cars, and other technologies which perform such broad captures. Such acquisition should not necessarily be prohibited, but should become a target for regulation.

A second option would be to ensure that the resulting public space archive is available to everyone under the same terms. Indeed, much of the reasons why total public space access is scary is because of the asymmetry between those who have it and those who don't. It means that certain companies, secret services, can know anything about you (and could, e.g., prosecute you for any minor offense you commit), and yet protect themselves so that others do not know anything about them (especially, their wrongdoings would remain unpunished). Of course, organizations with more means will always stand a better chance of finding something to use against you, but society could try to ensure that citizens can at least access the data and organize to scavenge it.

In this second case, I am not sure about whether I think the resulting society would be a good one. The panopticon is usually thought of as a bad thing, but, in another way, the fact that you have non-total visibility and memory of public space seems to me like a bug that should be fixed, not a feature. I wonder what the best compromise is.

Quick notes on graphical models

— updated

For my PhD work I had to read a bit about graphical models. As I found the topic quite interesting, I thought I'd publish here the notes that I took. Those notes are probably not very helpful without detailed explanations and examples from Wikipedia and from e.g. Chapter 8 of Bishop's Pattern recognition and machine learning which is the clearest source I found about those things. Yet, maybe it might save some time to people trying to understand the relation between the various concepts.

Joint distributions and their properties

We consider a set of random variables X1, ..., Xn, and we wish to define a joint probability distribution over these variables, namely, a function from valuations of the Xi to reals, which must sum up (or integrate, for continuous variables) to 1. This implies that we are not assuming independence of the Xi.

The "structure" of the resulting function might be entirely general, with no way to decompose the computation of its value into "clusters" of the Xi. However, most distributions in practice do not depend in an arbitrary way on the Xi, but do so in a "structured" way.

There are two main ideas to characterize this "structure":

Factorization
Write the probability distribution as a product of functions (called "potential functions") over subsets of the Xi. If such a factorization is possible, then this justifies that the probability distribution depends on the Xi in a "structured" way because it can be computed by first computing a certain function on those subsets separately, and then combining them by product.
Conditional independence
Figure out for which variables Xi, Xj and set S it holds that Xi and Xj are independent knowing the value of the variables of S (for all possible values of those variables). Whenever a distribution has these properties, it justifies that it is "structured" by the fact that the distribution on Xi and Xj is the product distribution whenever some other variables have been fixed.

Graphical models

We can use three main formalisms to represent a distribution while giving some information about its structure:

Bayesian Networks

A DAG over the variables that intuitively represents that a variable only depends "directly" from its immediate parents. This is a very natural representation for processes where the random variables denote the complete state of a process that generates some values from other values.

A Bayesian Network guarantees that the distribution can be factored as the product of one potential function per variable involving that variable and its immediate parents in the graph. Additionally, each such potential is a conditional distribution for the variable it defines, which is not the case in general. In terms of conditional independence, a criterion called d-separation allows you to determine which conditional independence relations hold.

Markov Random Fields

An undirected graph over the variables. In terms of factorization it guarantees that the joint distribution can be factored as a product of potential functions (which must be strictly positive) over the cliques of the graph (or the maximal cliques). However, note that in this case the potential functions do not correspond to conditional distributions, or to probability distributions altogether; they are not normalized.

The main selling point of MRFs is in terms of conditional independence, as there is a very simple and nice criterion: Xi is independent of Xj given S if there is no path connecting Xi to Xj in the graph where S is removed. (This is simpler than d-separation.)

Factor graphs

Factor graphs represent the distribution as a bipartite graph between the variables and the factors. (For legibility, rather than grouping the variable and factors in two sets, the graph is usually represented as a regular graph but with variable vertices denoted by circles, and factor vertices denoted by squares.) They are essentially equivalent to MRFs except that you can use them to represent multiple factors on a set of variables, or factors on a non-maximal clique, in a finer-grained way.

Expressiveness of the models

In terms of expressiveness of these models, it is clearly the case that any joint distribution can be represented in any of the models in a "trivial" way (a Bayesian network that enumerates the variables in an arbitrary way, with variable Xi depending on Xj for all j < i, or a MRF which is the complete graph). The more interesting question is whether, for a distribution, there is a graphical model that encodes exactly how the distribution can be factored, or which conditional independence relations occur.

For factorization, there is something called the Hammersley–Clifford theorem that ensures that, if a probability distribution has a factorization with potential functions that are strictly positive, then it can be represented by the corresponding MRF.

For conditional independence, Bayesian nets and MRFs are incomparable. The counter-examples are on page 53 (number 393) of Chapter 8 of Bishop's book.

Marginalization for tree-like factor graphs

To marginalize a variable (or several variables) is to compute the probability distribution of this single variable, according to the joint probability distribution. Marginalization is an especially important task to perform on graphical models. For instance, in situations where a Bayesian net represents a Boolean formula, we could use marginalization to compute the probability that the formula is true. In the context of processes represented by Bayes nets, it is often the case that you want to incorporate evidence (observations, known values) for some of the leaf variables, and then compute the marginal of the root variables (latent variables), that are the hidden values that you wish to figure out from the observations.

Of course, marginalization can always be performed by looking at the variable(s) to be marginalized and, for each of their possible values, counting the total mass of the possible worlds which achieve the value (and match the observations, if any). Of course, it is intractable to do so. The point is that for distributions that can be represented by a sparse graphical model, more efficient algorithms can be performed. In this section, we assume that we are dealing with a factor graph that is actually a tree.

I show that we can compute the marginal distribution of a selected variable x in this case, in linear time (assuming constant arithmetic cost, PTIME otherwise), on the factor graph which we root at variable x and traverse as a tree in one bottom-up pass. We store for each variable the (unnormalized) marginal distribution of this variable in the subtree of which it is the root, and for each factor the (unnormalized) distribution of the factor on the children variables for each value of the parent variable.

The distribution of a variable is simply, for each value, the product of the overall probability of the children factors for this variable value. The distribution of a factor given a parent variable value is the factor's distribution on assignments for this value times the probability of each child variable for each value in the assignment.

In the setting where the factor graph comes from a Bayesian network, the marginal at the root variable should be correctly normalized, but other conditional marginals in the tree will not; however it is straightforward to renormalize them from their total mass.

In graphical model terms, the algorithm that I have sketched is called "belief propagation" or "sum-product algorithm". Variations exist to compute the marginals for all variables, still in a single pass.

Bounded treewidth factor graphs

Now I sketch why marginalization can also be computed in PTIME on factor graphs which we assume have a tree decomposition of width bounded by a constant. I first explain how the notion of bounded treewidth is defined on the various models, and maintained when translating from one model to the other.

The treewidth of a factor graph is just the treewidth of the resulting bipartite graph. The treewidth of a MRF is the usual graph treewidth. The treewidth of a Bayes net is the treewidth of its moral graph, obtained by connecting, for each node, its set of immediate parents into a clique, and then forgetting about the orientation.

It is easy to see that when converting a MRF or a Bayes net to a factor graph (which is the representation I use for marginalization), the property of having treewidth less than a constant is always preserved; indeed, starting with a tree decomposition of the original graph, consider the node where the elements of each maximal clique (of the MRF, or of the moralized graph of a Bayes net) to be the one node where you add the factor vertex of the factor graph. It's easy to see that this gives you a tree decomposition of the factor graph (factor vertices have a single occurrence, occurrences of variable vertices are OK because they were in the original decomposition, factor vertices co-occur in their node with all the variable vertices to which they are connected), and in so doing the treewidth remains constant (the number of maximal cliques involving a node is at most k! if k is the treewidth).

Now, if a factor graph is not a tree but has treewidth bounded by a constant, the algorithm of the previous section generalizes (it can be performed bottom-up on the tree decomposition). In graphical models, this is called the "junction tree" algorithm.

Diff current version of Debian config file against original version

— updated

Say you want to identify which changes you made to a configuration file in /etc on a Debian system. If you have set up a tol like etckeeper you can just use it, but what if you haven't? A simple idea is to just diff your file against the original configuration file that shipped with the corresponding Debian package.

I was surprised that there was no automatic tool to identify the right Debian package with apt-file, retrieve it with apt-get download, and perform the diff, so I wrote it: debdiffconf.

Customizing your keyboard layout with xkbcomp

— updated

This is really going to be a poor excuse for a tutorial, it is just aimed at covering my current setup and nothing else, but maybe it is useful to someone.

I type using the Dvorak Simplified Keyboard (though I'm not especially convinced it has any intrinsic benefits, but that's another story). However, as I type French (but don't use a French Dvorak or the Bépo layout -- another complicated story), I need to have mappings to type the diacritics used in French. (French keyboard mappings provide the diacritics at the expense of some characters being relegated to AltGr combinations, and they have 105 keys, one more than Qwerty keyboards, with the additional key at the bottom right being used for '<' and '>'.)

Debian provides the Dvorak Simplified Keyboard, along with an international option that allows you to have AltGr combinations to invoke dead keys for common diacritics, you can obtain it using the following invocation. (Note that in all of this I am thinking of the X server world, not the ttys, which I don't use and which use a different mechanism, see loadkeys.)

setxkbmap -layout dvorak -variant intl -model pc105

However, those combinations are not especially convenient (AltGr-6 for the circumflex, for instance...), so it is tempting to modify them. I used to carry an entire dump of the keyboard configuration, that I edited haphazardly and loaded with xkbcomp ~/.xmodmaprc $DISPLAY. I now figured out by skimming through some documentation (mostly this) how to do so more cleanly.

Let me assume that the configuration will be stored in config/xkb. First create a file map to load your current configuration. In my case, I issued:

setxkbmap -layout dvorak -variant intl -model pc105 -option compose:caps -print > map 

You should be able to load this file by issuing xkbcomp ~/config/xkb/map $DISPLAY. Next, create a subfolder symbols that will contain the various files storing the customizations. I have altgr that sets up AltGr (the right alt key) to be used as a modifier, space that sets up a combination to make non-breaking spaces (both AltGr and Maj as I don't want to make them accidentally), and a file accents that stores all of the accents that I need: both dead keys, and shortcuts to make the most common accented characters directly. On Debian systems, see package x11proto-core-dev, file /usr/include/X11/keysymdef.h, for the possible symbols.

Now, you can add to the "xkb_symbols" line of the map file a reference to your extension files: separated by spaces, first the name of the file and then between brackets the name of the stanza (they match in my examples). This gives the final map file.

Last, you need to be able to invoke it. You need to specify where to search for the extension files, so the complete invocation is:

xkbcomp -I$HOME/config/xkb ~/config/xkb/map $DISPLAY

Do not worry about the many warnings. I run this script at the start of my X session to set up the layout (and to set the delay and rate for typematic (the delay after which keys repeat when you keep them pressed) to something that's reasonably fast).

A cute allocation problem

— updated

In the spirit of something I already did, I'm going to discuss a cute problem that suggested itself spontaneously over dinner, except this time I will study its complexity from a theoretical angle.

The problem is as follows. You applied to a number of companies, and each company's board issued a list of the candidates ranked by their order of preference. You know, for each company, how many people they will hire, and you know that they will take the highest ranked people who accept the job. Your hope is to get a job at any of the companies. Maybe you are ranked sufficiently high at some company to be sure to get a job there, in which case you've won; but maybe you aren't, and you hope that some of the people ranked higher than you will not accept their offer(s) so that you can be selected instead.

Of course, you always have some hope that everyone will desist and that you will get what you want. A harder question is: can you be sure that you will get a position, even though you are not ranked well enough to be sure to have any position? If you think about it, the answer is clearly yes in some situations, assuming that other candidates can only accept at most one position. If some super-strong candidate is ranked first at all positions, then they'll take only one of them at most, so if there are two positions where you're the first candidate below the bar, then you're sure to get one of them no matter what happens.

The problem is now to decide this algorithmically. Given as input the list of candidates that beat you at each company, given the number of people recruited by each company, can you decide efficiently whether you are sure to get a job at any of the companies?

It might seem that this problem is NP-hard (computationally intractable) because it looks a lot like the set cover problem or the Boolean satisfiability problem, but actually you can determine in polynomial time whether you are sure to get some job.

The method is to encode the problem as a flow problem. Build a graph that has a source vertex s, a target vertex t, one vertex ci for each candidate i, one vertex lj for each list j, and consider the graph with one edge from s to every ci (with capacity 1), one edge from each ci to the lists lj where candidate i appears before you (with capacity 1), and one edge from each lj to t whose capacity is the number of positions offered by company j (in other words, the number of candidates that need to accept a position at this company for you not to get the job).

Now, we ask whether the maximal flow of this graph saturates all the edges to t. If it does, and if the flow is integral, then observe that the flow gives you a way to allocate candidates to lists so that you do not get any job (the one saturated edge leaving ci tells you which job candidate i accepts). Conversely, if there is such an allocation, then there is a maximal flow saturating all edges to t. But now, the integral flow theorem ensures that there is always a maximal flow that is integral. It remains to observe that the maximum flow problem can be solved in polynomial time to conclude that the same is true of our problem.

This implies the tractability of the following (equivalent) rephrasing of this problem in terms of satisfiability. You have a set of variables (xi) which can each be assigned a value from a domain Di (so, multivalued variables). You have a conjunction of clauses which are sets of equalities between such variables and a constant from their domain. For each clause Cj you have a number nj and you say that the clause is true if at least nj of its litterals are true. This is of course much worse than the usual NP-hard Boolean satisfiability problem, except that you require that the same equality (between a variable and a constant) can never occur in two different clauses. Now, by the above, this restriction makes the satisfiability problem tractable.