Antoine Amarilli's blog

Quick notes on graphical models

— updated

For my PhD work I had to read a bit about graphical models. As I found the topic quite interesting, I thought I'd publish here the notes that I took. Those notes are probably not very helpful without detailed explanations and examples from Wikipedia and from e.g. Chapter 8 of Bishop's Pattern recognition and machine learning which is the clearest source I found about those things. Yet, maybe it might save some time to people trying to understand the relation between the various concepts.

Joint distributions and their properties

We consider a set of random variables X1, ..., Xn, and we wish to define a joint probability distribution over these variables, namely, a function from valuations of the Xi to reals, which must sum up (or integrate, for continuous variables) to 1. This implies that we are not assuming independence of the Xi.

The "structure" of the resulting function might be entirely general, with no way to decompose the computation of its value into "clusters" of the Xi. However, most distributions in practice do not depend in an arbitrary way on the Xi, but do so in a "structured" way.

There are two main ideas to characterize this "structure":

Write the probability distribution as a product of functions (called "potential functions") over subsets of the Xi. If such a factorization is possible, then this justifies that the probability distribution depends on the Xi in a "structured" way because it can be computed by first computing a certain function on those subsets separately, and then combining them by product.
Conditional independence
Figure out for which variables Xi, Xj and set S it holds that Xi and Xj are independent knowing the value of the variables of S (for all possible values of those variables). Whenever a distribution has these properties, it justifies that it is "structured" by the fact that the distribution on Xi and Xj is the product distribution whenever some other variables have been fixed.

Graphical models

We can use three main formalisms to represent a distribution while giving some information about its structure:

Bayesian Networks

A DAG over the variables that intuitively represents that a variable only depends "directly" from its immediate parents. This is a very natural representation for processes where the random variables denote the complete state of a process that generates some values from other values.

A Bayesian Network guarantees that the distribution can be factored as the product of one potential function per variable involving that variable and its immediate parents in the graph. Additionally, each such potential is a conditional distribution for the variable it defines, which is not the case in general. In terms of conditional independence, a criterion called d-separation allows you to determine which conditional independence relations hold.

Markov Random Fields

An undirected graph over the variables. In terms of factorization it guarantees that the joint distribution can be factored as a product of potential functions (which must be strictly positive) over the cliques of the graph (or the maximal cliques). However, note that in this case the potential functions do not correspond to conditional distributions, or to probability distributions altogether; they are not normalized.

The main selling point of MRFs is in terms of conditional independence, as there is a very simple and nice criterion: Xi is independent of Xj given S if there is no path connecting Xi to Xj in the graph where S is removed. (This is simpler than d-separation.)

Factor graphs

Factor graphs represent the distribution as a bipartite graph between the variables and the factors. (For legibility, rather than grouping the variable and factors in two sets, the graph is usually represented as a regular graph but with variable vertices denoted by circles, and factor vertices denoted by squares.) They are essentially equivalent to MRFs except that you can use them to represent multiple factors on a set of variables, or factors on a non-maximal clique, in a finer-grained way.

Expressiveness of the models

In terms of expressiveness of these models, it is clearly the case that any joint distribution can be represented in any of the models in a "trivial" way (a Bayesian network that enumerates the variables in an arbitrary way, with variable Xi depending on Xj for all j < i, or a MRF which is the complete graph). The more interesting question is whether, for a distribution, there is a graphical model that encodes exactly how the distribution can be factored, or which conditional independence relations occur.

For factorization, there is something called the Hammersley–Clifford theorem that ensures that, if a probability distribution has a factorization with potential functions that are strictly positive, then it can be represented by the corresponding MRF.

For conditional independence, Bayesian nets and MRFs are incomparable. The counter-examples are on page 53 (number 393) of Chapter 8 of Bishop's book.

Marginalization for tree-like factor graphs

To marginalize a variable (or several variables) is to compute the probability distribution of this single variable, according to the joint probability distribution. Marginalization is an especially important task to perform on graphical models. For instance, in situations where a Bayesian net represents a Boolean formula, we could use marginalization to compute the probability that the formula is true. In the context of processes represented by Bayes nets, it is often the case that you want to incorporate evidence (observations, known values) for some of the leaf variables, and then compute the marginal of the root variables (latent variables), that are the hidden values that you wish to figure out from the observations.

Of course, marginalization can always be performed by looking at the variable(s) to be marginalized and, for each of their possible values, counting the total mass of the possible worlds which achieve the value (and match the observations, if any). Of course, it is intractable to do so. The point is that for distributions that can be represented by a sparse graphical model, more efficient algorithms can be performed. In this section, we assume that we are dealing with a factor graph that is actually a tree.

I show that we can compute the marginal distribution of a selected variable x in this case, in linear time (assuming constant arithmetic cost, PTIME otherwise), on the factor graph which we root at variable x and traverse as a tree in one bottom-up pass. We store for each variable the (unnormalized) marginal distribution of this variable in the subtree of which it is the root, and for each factor the (unnormalized) distribution of the factor on the children variables for each value of the parent variable.

The distribution of a variable is simply, for each value, the product of the overall probability of the children factors for this variable value. The distribution of a factor given a parent variable value is the factor's distribution on assignments for this value times the probability of each child variable for each value in the assignment.

In the setting where the factor graph comes from a Bayesian network, the marginal at the root variable should be correctly normalized, but other conditional marginals in the tree will not; however it is straightforward to renormalize them from their total mass.

In graphical model terms, the algorithm that I have sketched is called "belief propagation" or "sum-product algorithm". Variations exist to compute the marginals for all variables, still in a single pass.

Bounded treewidth factor graphs

Now I sketch why marginalization can also be computed in PTIME on factor graphs which we assume have a tree decomposition of width bounded by a constant. I first explain how the notion of bounded treewidth is defined on the various models, and maintained when translating from one model to the other.

The treewidth of a factor graph is just the treewidth of the resulting bipartite graph. The treewidth of a MRF is the usual graph treewidth. The treewidth of a Bayes net is the treewidth of its moral graph, obtained by connecting, for each node, its set of immediate parents into a clique, and then forgetting about the orientation.

It is easy to see that when converting a MRF or a Bayes net to a factor graph (which is the representation I use for marginalization), the property of having treewidth less than a constant is always preserved; indeed, starting with a tree decomposition of the original graph, consider the node where the elements of each maximal clique (of the MRF, or of the moralized graph of a Bayes net) to be the one node where you add the factor vertex of the factor graph. It's easy to see that this gives you a tree decomposition of the factor graph (factor vertices have a single occurrence, occurrences of variable vertices are OK because they were in the original decomposition, factor vertices co-occur in their node with all the variable vertices to which they are connected), and in so doing the treewidth remains constant (the number of maximal cliques involving a node is at most k! if k is the treewidth).

Now, if a factor graph is not a tree but has treewidth bounded by a constant, the algorithm of the previous section generalizes (it can be performed bottom-up on the tree decomposition). In graphical models, this is called the "junction tree" algorithm.

Diff current version of Debian config file against original version

— updated

Say you want to identify which changes you made to a configuration file in /etc on a Debian system. If you have set up a tol like etckeeper you can just use it, but what if you haven't? A simple idea is to just diff your file against the original configuration file that shipped with the corresponding Debian package.

I was surprised that there was no automatic tool to identify the right Debian package with apt-file, retrieve it with apt-get download, and perform the diff, so I wrote it: debdiffconf.

Customizing your keyboard layout with xkbcomp

— updated

This is really going to be a poor excuse for a tutorial, it is just aimed at covering my current setup and nothing else, but maybe it is useful to someone.

I type using the Dvorak Simplified Keyboard (though I'm not especially convinced it has any intrinsic benefits, but that's another story). However, as I type French (but don't use a French Dvorak or the Bépo layout -- another complicated story), I need to have mappings to type the diacritics used in French. (French keyboard mappings provide the diacritics at the expense of some characters being relegated to AltGr combinations, and they have 105 keys, one more than Qwerty keyboards, with the additional key at the bottom right being used for '<' and '>'.)

Debian provides the Dvorak Simplified Keyboard, along with an international option that allows you to have AltGr combinations to invoke dead keys for common diacritics, you can obtain it using the following invocation. (Note that in all of this I am thinking of the X server world, not the ttys, which I don't use and which use a different mechanism, see loadkeys.)

setxkbmap -layout dvorak -variant intl -model pc105

However, those combinations are not especially convenient (AltGr-6 for the circumflex, for instance...), so it is tempting to modify them. I used to carry an entire dump of the keyboard configuration, that I edited haphazardly and loaded with xkbcomp ~/.xmodmaprc $DISPLAY. I now figured out by skimming through some documentation (mostly this) how to do so more cleanly.

Let me assume that the configuration will be stored in config/xkb. First create a file map to load your current configuration. In my case, I issued:

setxkbmap -layout dvorak -variant intl -model pc105 -option compose:caps -print > map 

You should be able to load this file by issuing xkbcomp ~/config/xkb/map $DISPLAY. Next, create a subfolder symbols that will contain the various files storing the customizations. I have altgr that sets up AltGr (the right alt key) to be used as a modifier, space that sets up a combination to make non-breaking spaces (both AltGr and Maj as I don't want to make them accidentally), and a file accents that stores all of the accents that I need: both dead keys, and shortcuts to make the most common accented characters directly. On Debian systems, see package x11proto-core-dev, file /usr/include/X11/keysymdef.h, for the possible symbols.

Now, you can add to the "xkb_symbols" line of the map file a reference to your extension files: separated by spaces, first the name of the file and then between brackets the name of the stanza (they match in my examples). This gives the final map file.

Last, you need to be able to invoke it. You need to specify where to search for the extension files, so the complete invocation is:

xkbcomp -I$HOME/config/xkb ~/config/xkb/map $DISPLAY

Do not worry about the many warnings. I run this script at the start of my X session to set up the layout (and to set the delay and rate for typematic (the delay after which keys repeat when you keep them pressed) to something that's reasonably fast).

A cute allocation problem

— updated

In the spirit of something I already did, I'm going to discuss a cute problem that suggested itself spontaneously over dinner, except this time I will study its complexity from a theoretical angle.

The problem is as follows. You applied to a number of companies, and each company's board issued a list of the candidates ranked by their order of preference. You know, for each company, how many people they will hire, and you know that they will take the highest ranked people who accept the job. Your hope is to get a job at any of the companies. Maybe you are ranked sufficiently high at some company to be sure to get a job there, in which case you've won; but maybe you aren't, and you hope that some of the people ranked higher than you will not accept their offer(s) so that you can be selected instead.

Of course, you always have some hope that everyone will desist and that you will get what you want. A harder question is: can you be sure that you will get a position, even though you are not ranked well enough to be sure to have any position? If you think about it, the answer is clearly yes in some situations, assuming that other candidates can only accept at most one position. If some super-strong guy is ranked first at all positions, then he'll take only one of them at most, so if there are two positions where you're the first candidate below the bar, then you're sure to get one of them no matter what happens.

The problem is now to decide this algorithmically. Given as input the list of candidates that beat you at each company, given the number of people recruited by each company, can you decide efficiently whether you are sure to get a job at any of the companies?

It might seem that this problem is NP-hard (computationally intractable) because it looks a lot like the set cover problem or the Boolean satisfiability problem, but actually you can determine in polynomial time whether you are sure to get some job.

The method is to encode the problem as a flow problem. Build a graph that has a source vertex s, a target vertex t, one vertex ci for each candidate i, one vertex lj for each list j, and consider the graph with one edge from s to every ci (with capacity 1), one edge from each ci to the lists lj where candidate i appears before you (with capacity 1), and one edge from each lj to t whose capacity is the number of positions offered by company j (in other words, the number of candidates that need to accept a position at this company for you not to get the job).

Now, we ask whether the maximal flow of this graph saturates all the edges to t. If it does, and if the flow is integral, then observe that the flow gives you a way to allocate candidates to lists so that you do not get any job (the one saturated edge leaving ci tells you which job candidate i accepts). Conversely, if there is such an allocation, then there is a maximal flow saturating all edges to t. But now, the integral flow theorem ensures that there is always a maximal flow that is integral. It remains to observe that the maximum flow problem can be solved in polynomial time to conclude that the same is true of our problem.

This implies the tractability of the following (equivalent) rephrasing of this problem in terms of satisfiability. You have a set of variables (xi) which can each be assigned a value from a domain Di (so, multivalued variables). You have a conjunction of clauses which are sets of equalities between such variables and a constant from their domain. For each clause Cj you have a number nj and you say that the clause is true if at least nj of its litterals are true. This is of course much worse than the usual NP-hard Boolean satisfiability problem, except that you require that the same equality (between a variable and a constant) can never occur in two different clauses. Now, by the above, this restriction makes the satisfiability problem tractable.

New GPG key

— updated

It's a bit of a pain, but it turns out that my old OpenPGP key was using suboptimal settings, and so I've regenerated a new one.

I did this after reading this fine best practices tutorial (for which I incidentally helped write a French translation). The gist of it is (1.) that you should set up GPG correctly to fetch keys from key servers (there's the parcimonie-related paranoia, but there's the very embarrassing fact that by default it seems that GPG never manages to talk to a key server; and (2.) that you should check that your key is secure by issuing the following and checking for things in red:

sudo apt-get install hopenpgp-tools
# FINGERPRINT is the actual fingerprint, not a key ID
hkt export-pubkeys "FINGERPRINT" | hokey lint

Generating a new key isn't especially hard but here is a reminder of what you have to do. You then have to re-sign the keys that you had signed with the old key, using the new one...

The formal transition statement signed by both keys is here, so that you can sign the new key if you had signed the old one.