For my PhD work I had to read a bit about graphical models. As I found the topic quite interesting, I thought I'd publish here the notes that I took. Those notes are probably not very helpful without detailed explanations and examples from Wikipedia and from e.g. Chapter 8 of Bishop's Pattern recognition and machine learning which is the clearest source I found about those things. Yet, maybe it might save some time to people trying to understand the relation between the various concepts.
Joint distributions and their properties
We consider a set of random variables X1, ..., Xn, and we wish to define a joint probability distribution over these variables, namely, a function from valuations of the Xi to reals, which must sum up (or integrate, for continuous variables) to 1. This implies that we are not assuming independence of the Xi.
The "structure" of the resulting function might be entirely general, with no way to decompose the computation of its value into "clusters" of the Xi. However, most distributions in practice do not depend in an arbitrary way on the Xi, but do so in a "structured" way.
There are two main ideas to characterize this "structure":
- Write the probability distribution as a product of functions (called "potential functions") over subsets of the Xi. If such a factorization is possible, then this justifies that the probability distribution depends on the Xi in a "structured" way because it can be computed by first computing a certain function on those subsets separately, and then combining them by product.
- Conditional independence
- Figure out for which variables Xi, Xj and set S it holds that Xi and Xj are independent knowing the value of the variables of S (for all possible values of those variables). Whenever a distribution has these properties, it justifies that it is "structured" by the fact that the distribution on Xi and Xj is the product distribution whenever some other variables have been fixed.
We can use three main formalisms to represent a distribution while giving some information about its structure:
- Bayesian Networks
A DAG over the variables that intuitively represents that a variable only depends "directly" from its immediate parents. This is a very natural representation for processes where the random variables denote the complete state of a process that generates some values from other values.
A Bayesian Network guarantees that the distribution can be factored as the product of one potential function per variable involving that variable and its immediate parents in the graph. Additionally, each such potential is a conditional distribution for the variable it defines, which is not the case in general. In terms of conditional independence, a criterion called d-separation allows you to determine which conditional independence relations hold.
- Markov Random Fields
An undirected graph over the variables. In terms of factorization it guarantees that the joint distribution can be factored as a product of potential functions (which must be strictly positive) over the cliques of the graph (or the maximal cliques). However, note that in this case the potential functions do not correspond to conditional distributions, or to probability distributions altogether; they are not normalized.
The main selling point of MRFs is in terms of conditional independence, as there is a very simple and nice criterion: Xi is independent of Xj given S if there is no path connecting Xi to Xj in the graph where S is removed. (This is simpler than d-separation.)
- Factor graphs
Factor graphs represent the distribution as a bipartite graph between the variables and the factors. (For legibility, rather than grouping the variable and factors in two sets, the graph is usually represented as a regular graph but with variable vertices denoted by circles, and factor vertices denoted by squares.) They are essentially equivalent to MRFs except that you can use them to represent multiple factors on a set of variables, or factors on a non-maximal clique, in a finer-grained way.
Expressiveness of the models
In terms of expressiveness of these models, it is clearly the case that any joint distribution can be represented in any of the models in a "trivial" way (a Bayesian network that enumerates the variables in an arbitrary way, with variable Xi depending on Xj for all j < i, or a MRF which is the complete graph). The more interesting question is whether, for a distribution, there is a graphical model that encodes exactly how the distribution can be factored, or which conditional independence relations occur.
For factorization, there is something called the Hammersley–Clifford theorem that ensures that, if a probability distribution has a factorization with potential functions that are strictly positive, then it can be represented by the corresponding MRF.
For conditional independence, Bayesian nets and MRFs are incomparable. The counter-examples are on page 53 (number 393) of Chapter 8 of Bishop's book.
Marginalization for tree-like factor graphs
To marginalize a variable (or several variables) is to compute the probability distribution of this single variable, according to the joint probability distribution. Marginalization is an especially important task to perform on graphical models. For instance, in situations where a Bayesian net represents a Boolean formula, we could use marginalization to compute the probability that the formula is true. In the context of processes represented by Bayes nets, it is often the case that you want to incorporate evidence (observations, known values) for some of the leaf variables, and then compute the marginal of the root variables (latent variables), that are the hidden values that you wish to figure out from the observations.
Of course, marginalization can always be performed by looking at the variable(s) to be marginalized and, for each of their possible values, counting the total mass of the possible worlds which achieve the value (and match the observations, if any). Of course, it is intractable to do so. The point is that for distributions that can be represented by a sparse graphical model, more efficient algorithms can be performed. In this section, we assume that we are dealing with a factor graph that is actually a tree.
I show that we can compute the marginal distribution of a selected variable x in this case, in linear time (assuming constant arithmetic cost, PTIME otherwise), on the factor graph which we root at variable x and traverse as a tree in one bottom-up pass. We store for each variable the (unnormalized) marginal distribution of this variable in the subtree of which it is the root, and for each factor the (unnormalized) distribution of the factor on the children variables for each value of the parent variable.
The distribution of a variable is simply, for each value, the product of the overall probability of the children factors for this variable value. The distribution of a factor given a parent variable value is the factor's distribution on assignments for this value times the probability of each child variable for each value in the assignment.
In the setting where the factor graph comes from a Bayesian network, the marginal at the root variable should be correctly normalized, but other conditional marginals in the tree will not; however it is straightforward to renormalize them from their total mass.
In graphical model terms, the algorithm that I have sketched is called "belief propagation" or "sum-product algorithm". Variations exist to compute the marginals for all variables, still in a single pass.
Bounded treewidth factor graphs
Now I sketch why marginalization can also be computed in PTIME on factor graphs which we assume have a tree decomposition of width bounded by a constant. I first explain how the notion of bounded treewidth is defined on the various models, and maintained when translating from one model to the other.
The treewidth of a factor graph is just the treewidth of the resulting bipartite graph. The treewidth of a MRF is the usual graph treewidth. The treewidth of a Bayes net is the treewidth of its moral graph, obtained by connecting, for each node, its set of immediate parents into a clique, and then forgetting about the orientation.
It is easy to see that when converting a MRF or a Bayes net to a factor graph (which is the representation I use for marginalization), the property of having treewidth less than a constant is always preserved; indeed, starting with a tree decomposition of the original graph, consider the node where the elements of each maximal clique (of the MRF, or of the moralized graph of a Bayes net) to be the one node where you add the factor vertex of the factor graph. It's easy to see that this gives you a tree decomposition of the factor graph (factor vertices have a single occurrence, occurrences of variable vertices are OK because they were in the original decomposition, factor vertices co-occur in their node with all the variable vertices to which they are connected), and in so doing the treewidth remains constant (the number of maximal cliques involving a node is at most k! if k is the treewidth).
Now, if a factor graph is not a tree but has treewidth bounded by a constant, the algorithm of the previous section generalizes (it can be performed bottom-up on the tree decomposition). In graphical models, this is called the "junction tree" algorithm.