The success of this method is illustrated in Figure 10.2, where the normalized conjugate density discussed earlier is compared with a smoothed density estimate computed from a random sample constructed in this way, starting from the gamma approximation.
Figure 10.2 Approximation to the conjugate gamma distribution.
We have used sampling with replacement as recommended by Albert (2009, Section 5.10), although Gelman et al. (2004, Section 10.5) prefer sampling without replacement, claiming that ‘in a bad case, with a few very large values and many small values [s]ampling with replacement will pick the same few values … again and again; in contrast, sampling without replacement yields a more desirable intermediate approximation’. Nevertheless, at least in some cases such as the one discussed here, substituting replace=F results in a much poorer fit.
We can, of course, use the sample thus obtained to estimate the mean, this time as 1.689, and the variance, this time as 0.531. Further, we can use these values to estimate the median as 1.616 where the true value is 1.622.
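The resampling step is easily coded in R. The sketch below uses purely illustrative densities (a gamma target approximated by another gamma) rather than the conjugate density of the example above; the point is the use of sample() with probabilities proportional to the importance weights, after which the mean, variance and median can be read off the resampled values.

set.seed(1)
m <- 100000                                     # size of the initial sample
target <- function(theta) dgamma(theta, 3, 2)   # density we wish to sample from
approx <- function(theta) dgamma(theta, 2, 1.5) # approximating density
theta <- rgamma(m, 2, 1.5)                      # sample from the approximation
w <- target(theta)/approx(theta)                # importance weights
thetastar <- sample(theta, 10000, replace=TRUE, prob=w/sum(w))
mean(thetastar); var(thetastar); median(thetastar)
# Compare with replace=F, which in such cases can give a much poorer fit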
10.1.3 Multidimensional applications
The SIR algorithm is most commonly applied in situations when we are trying to find the properties of the posterior distribution of a multidimensional unknown parameter $\theta$. Quite often, it turns out that a convenient choice of an approximating density p from which to generate a random sample is the multivariate t distribution. Since the multivariate t distribution has been avoided in this book, we shall not discuss this approach further; for more details about it see Albert (2009, Section 5.10).
10.2 Variational Bayesian methods: simple case
The integrals which arise in Bayesian inference are, as we have seen, frequently intractable unless we use priors which are jointly conjugate, and these are often difficult to deal with, as in the case of the normal/chi-squared distribution introduced in Section 2.13. The idea of variational Bayesian methods is to approximate the posterior by a density of a simpler form. They can be thought of as extensions of the EM technique discussed in Section 9.2.
When approximating one distribution by another, it is useful to have a measure of the closeness of the approximation, and for this purpose we shall use the Kullback–Leibler divergence or information
$$ I(q:p) = \int q(\theta)\log\frac{q(\theta)}{p(\theta)}\,\mathrm{d}\theta $$
(cf. Section 3.11; the integral sign denotes a multiple integral when $\theta$ is multidimensional and, of course, is replaced by a summation in discrete cases). This function satisfies
$$ I(q:p) \geqslant 0 \qquad\text{with equality if and only if } q = p \text{ (almost everywhere).} $$
To show this, note that if we use natural logarithms it is easily shown that $\log x \leqslant x - 1$, and hence for any densities $p$ and $q$ we have
$$ -I(q:p) = \int q(\theta)\log\frac{p(\theta)}{q(\theta)}\,\mathrm{d}\theta \leqslant \int q(\theta)\left\{\frac{p(\theta)}{q(\theta)} - 1\right\}\mathrm{d}\theta = \int p(\theta)\,\mathrm{d}\theta - \int q(\theta)\,\mathrm{d}\theta = 0. $$
Note, however, that it is not symmetric [i.e. it is not generally true that $I(q:p) = I(p:q)$], nor does it satisfy the triangle inequality, so that it is not a metric.
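The lack of symmetry is easily checked numerically. The two normal densities used in the R fragment below are arbitrary choices for illustration; the integration is restricted to a finite range simply to keep the integrand well behaved in the far tails.

kl <- function(q, p) integrate(function(x) q(x)*log(q(x)/p(x)), -15, 15)$value
q <- function(x) dnorm(x, mean=0, sd=1)
p <- function(x) dnorm(x, mean=1, sd=2)
kl(q, p)    # approximately 0.443
kl(p, q)    # approximately 1.307 -- a quite different value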
In variational inference, we try to find a distribution of a simpler form which approximates an intractable posterior distribution, so with a posterior $p(\theta\mid x)$ we seek a density $q(\theta)$ which minimizes
$$ I\bigl(q : p(\cdot\mid x)\bigr) = \int q(\theta)\log\frac{q(\theta)}{p(\theta\mid x)}\,\mathrm{d}\theta $$
among a suitable class of densities $q$.
We note that since $p(\theta\mid x) = p(x,\theta)/p(x)$ we have
$$ I\bigl(q : p(\cdot\mid x)\bigr) = \int q(\theta)\log\frac{q(\theta)}{p(x,\theta)}\,\mathrm{d}\theta + \log p(x). $$
It follows that
$$ \log p(x) = \mathcal{L}(q) + I\bigl(q : p(\cdot\mid x)\bigr), $$
where
$$ \mathcal{L}(q) = \int q(\theta)\log\frac{p(x,\theta)}{q(\theta)}\,\mathrm{d}\theta. $$
Since $\log p(x)$ is not a function of $q$, it follows that we minimize the divergence by maximizing $\mathcal{L}(q)$.
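This decomposition can be checked directly in a small discrete example, where the integrals become sums. The prior, likelihood and trial density q in the following R fragment are arbitrary values chosen purely for illustration.

theta <- 1:4
prior <- c(0.1, 0.2, 0.3, 0.4)        # p(theta)
lik   <- c(0.05, 0.20, 0.10, 0.40)    # p(x|theta) for the observed x
joint <- prior*lik                    # p(x,theta)
px    <- sum(joint)                   # p(x)
post  <- joint/px                     # p(theta|x)
q     <- rep(0.25, 4)                 # an arbitrary approximating density
Lq    <- sum(q*log(joint/q))          # the lower bound L(q)
KL    <- sum(q*log(q/post))           # the divergence I(q : p(.|x))
c(log(px), Lq + KL)                   # the two numbers agree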
10.2.1 Independent parameters
For the moment, let us suppose that there are just two unknown parameters $\theta_1$ and $\theta_2$, so that $\theta = (\theta_1, \theta_2)$. It is often the case that the exact posterior is such that $\theta_1$ and $\theta_2$ are correlated, and then it is useful to look for an approximation to the posterior which factorizes, so that
$$ q(\theta) = q_1(\theta_1)\,q_2(\theta_2). $$
When this happens, the unknown parameters are, of course, independent. We look for a density of this form which minimizes the Kullback–Leibler divergence. We proceed sequentially, first taking the distribution of $\theta_2$ as known and choosing that of $\theta_1$ to minimize the divergence, and then taking that of $\theta_1$ as known and choosing the distribution of $\theta_2$ to minimize the divergence, and then iterating. Because these functions are densities, we have to apply the subsidiary conditions
$$ \int q_1(\theta_1)\,\mathrm{d}\theta_1 = 1, \qquad \int q_2(\theta_2)\,\mathrm{d}\theta_2 = 1. $$
For this purpose, we make use of a theorem in the calculus of variations given by Gelfand and Fomin (1963, Section 12.1, Theorem 1) which gives a way of minimizing an integral J[y] subject to the condition that K[y] takes a fixed value l. The result we need is:
Theorem 10.1 Given the functional
$$ J[y] = \int_a^b F(x, y, y')\,\mathrm{d}x, $$
let the admissible curves satisfy the conditions
$$ y(a) = A, \qquad y(b) = B, \qquad K[y] = \int_a^b G(x, y, y')\,\mathrm{d}x = l, $$
where K[y] is another functional, and let J[y] have an extremum for y=y(x). Then, if y=y(x) is not an extremal of K[y], there exists a constant $\lambda$ such that y=y(x) is an extremal of the functional
$$ \int_a^b \bigl(F + \lambda G\bigr)\,\mathrm{d}x, $$
that is, y=y(x) satisfies the differential equation
$$ F_y - \frac{\mathrm{d}}{\mathrm{d}x}F_{y'} + \lambda\left(G_y - \frac{\mathrm{d}}{\mathrm{d}x}G_{y'}\right) = 0. $$
Proof. See Gelfand and Fomin (1963, Section 12.1).
We can extend this result to integrals over the whole real line assuming convergence of the integrals.
Regarding the distribution of $\theta_2$ as fixed for the moment, we now apply this theorem to $q_1$ in the case where we wish to maximize
$$ \mathcal{L}(q) = \iint q_1(\theta_1)\,q_2(\theta_2)\,\log\frac{p(x,\theta_1,\theta_2)}{q_1(\theta_1)\,q_2(\theta_2)}\,\mathrm{d}\theta_2\,\mathrm{d}\theta_1 \quad\text{subject to}\quad \int q_1(\theta_1)\,\mathrm{d}\theta_1 = 1. $$
We now take $\theta_1$ for x, $q_1$ for y,
$$ F(\theta_1, q_1) = \int q_1(\theta_1)\,q_2(\theta_2)\,\log\frac{p(x,\theta_1,\theta_2)}{q_1(\theta_1)\,q_2(\theta_2)}\,\mathrm{d}\theta_2 $$
for F and $G(\theta_1, q_1) = q_1(\theta_1)$ for G (with l = 1) in the theorem. As none of the integrals depend on the derivative $q_1'$, the terms
$$ \frac{\mathrm{d}}{\mathrm{d}x}F_{y'} \quad\text{and}\quad \frac{\mathrm{d}}{\mathrm{d}x}G_{y'} $$
vanish automatically, while the term $F_y$ becomes
$$ \frac{\partial F}{\partial q_1} = \int q_2(\theta_2)\log p(x,\theta_1,\theta_2)\,\mathrm{d}\theta_2 - \log q_1(\theta_1) - 1 - \int q_2(\theta_2)\log q_2(\theta_2)\,\mathrm{d}\theta_2. $$
It follows that the stationarity condition given by the differential equation in the theorem takes the form
$$ \int q_2(\theta_2)\log p(x,\theta_1,\theta_2)\,\mathrm{d}\theta_2 - \log q_1(\theta_1) + \text{constant} = 0 $$
and hence that
$$ q_1(\theta_1) \propto \exp\left(\int q_2(\theta_2)\log p(x,\theta_1,\theta_2)\,\mathrm{d}\theta_2\right) = \exp\bigl(\mathrm{E}_{q_2}\log p(x,\theta_1,\theta_2)\bigr). $$
The constant of proportionality is determined by the fact that $\int q_1(\theta_1)\,\mathrm{d}\theta_1 = 1$.
The theorem quoted earlier from Gelfand and Fomin requires us to check that y is not an extremal of K[y]. In our case $G(x, y, y') = y = q_1$, so that
$$ G_y - \frac{\mathrm{d}}{\mathrm{d}x}G_{y'} = 1 \neq 0, $$
and hence y is not an extremal of K[y].
It should be clear that, when we need to determine the distribution of $\theta_2$ whilst regarding the distribution of $\theta_1$ as fixed, we take
$$ q_2(\theta_2) \propto \exp\bigl(\mathrm{E}_{q_1}\log p(x,\theta_1,\theta_2)\bigr). $$
The argument proceeds with no essential difference in cases where $\theta_1$ and $\theta_2$ are multidimensional.
10.2.2 Application to the normal distribution
As a simple example, we consider a case where we take $n$ observations $x = (x_1, \dots, x_n)$ from a normal $N(\mu, \phi)$ distribution with a log-likelihood L satisfying
$$ L(\mu,\phi\mid x) = -\tfrac{1}{2}n\log\phi - \frac{\sum_i (x_i-\mu)^2}{2\phi} + \text{constant}. $$
We have already shown in Section 2.12 that the posteriors for $\mu$ and $\phi$ cannot be expected to be independent, but we wish to find an approximation to the posterior by a density $q(\mu,\phi) = q_\mu(\mu)\,q_\phi(\phi)$. We assume independent conjugate priors, so a normal prior for $\mu$ and an inverse chi-squared prior for $\phi$. We thus have four ‘hyperparameters’, $\mu_0$, $\phi_0$, $S_0$ and $\nu_0$, determining the prior distributions
$$ \mu \sim N(\mu_0, \phi_0), \qquad \phi \sim S_0\chi_{\nu_0}^{-2}, $$
which satisfy
$$ p(\mu) \propto \exp\bigl\{-(\mu-\mu_0)^2/2\phi_0\bigr\}, \qquad p(\phi) \propto \phi^{-(\nu_0/2+1)}\exp(-S_0/2\phi) $$
(cf. Sections A.1 and A.5). Because
$$ p(x, \mu, \phi) = p(\mu)\,p(\phi)\,\exp\bigl\{L(\mu,\phi\mid x)\bigr\}, $$
it follows that (up to a function of $\phi_0$, $S_0$ and $\nu_0$ alone)
$$ \log p(x,\mu,\phi) = -\frac{(\mu-\mu_0)^2}{2\phi_0} - \left(\frac{\nu_0+n}{2}+1\right)\log\phi - \frac{S_0 + \sum_i(x_i-\mu)^2}{2\phi}. $$
We seek an approximation to $p(\mu,\phi\mid x)$ of the form $q(\mu,\phi) = q_\mu(\mu)\,q_\phi(\phi)$, and with this in mind we begin by initializing the distribution of $\phi$ using the prior value $S_0$, update $q_\mu$ using
$$ q_\mu(\mu) \propto \exp\bigl(\mathrm{E}_{q_\phi}\log p(x,\mu,\phi)\bigr) $$
and then update $q_\phi$ using
$$ q_\phi(\phi) \propto \exp\bigl(\mathrm{E}_{q_\mu}\log p(x,\mu,\phi)\bigr). $$
This whole process has to be repeated until we obtain convergence.
10.2.3 Updating the mean
We update the mean using
$$ \log q_\mu(\mu) = \mathrm{E}_{q_\phi}\log p(x,\mu,\phi) + \text{constant} = \int q_\phi(\phi)\log p(x,\mu,\phi)\,\mathrm{d}\phi + \text{constant}, $$
where $\phi$ has an inverse chi-squared distribution $S\chi_\nu^{-2}$ under the current $q_\phi$. Writing $q_\phi(\phi)$ for the density of an $S\chi_\nu^{-2}$ random variable, the right hand side becomes (ignoring some terms constant with respect to $\mu$)
$$ -\left(\frac{\nu_0+n}{2}+1\right)\int q_\phi(\phi)\log\phi\,\mathrm{d}\phi \;-\; \frac{(\mu-\mu_0)^2}{2\phi_0}\int q_\phi(\phi)\,\mathrm{d}\phi \;-\; \frac{\sum_i(x_i-\mu)^2}{2}\int\frac{q_\phi(\phi)}{\phi}\,\mathrm{d}\phi \;-\; \frac{S_0}{2}\int\frac{q_\phi(\phi)}{\phi}\,\mathrm{d}\phi. $$
The first and last terms are now seen to be constant with respect to $\mu$, while $q_\phi$ as a density integrates to unity and $\int q_\phi(\phi)/\phi\,\mathrm{d}\phi = \mathrm{E}(1/\phi)$ is easily shown to be $\nu/S$. Proceeding to complete the square much as we did in Section 2.3, it follows that (with $c_1$, $c_2$, $c_3$ constant with respect to $\mu$)
$$ \log q_\mu(\mu) = -\frac{(\mu-\mu_0)^2}{2\phi_0} - \frac{\nu\sum_i(x_i-\mu)^2}{2S} + c_1 = -\frac{(\mu-\mu_*)^2}{2\phi_*} + c_2, \qquad q_\mu(\mu) = c_3\exp\bigl\{-(\mu-\mu_*)^2/2\phi_*\bigr\}, $$
so that $q_\mu$ is an $N(\mu_*, \phi_*)$ density, where
$$ \frac{1}{\phi_*} = \frac{1}{\phi_0} + \frac{n\nu}{S}, \qquad \mu_* = \phi_*\left(\frac{\mu_0}{\phi_0} + \frac{n\bar{x}\nu}{S}\right). $$
Note the similarity of the expression for $\mu_*$ to the corresponding expression found using the EM algorithm in Subsection 9.2.3, headed ‘Semiconjugate prior with a normal likelihood’.
10.2.4 Updating the variance
Updating of the variance proceeds similarly, using
$$ \log q_\phi(\phi) = \mathrm{E}_{q_\mu}\log p(x,\mu,\phi) + \text{constant} = \int q_\mu(\mu)\log p(x,\mu,\phi)\,\mathrm{d}\mu + \text{constant}. $$
The distribution of $\mu$ is $N(\mu_*, \phi_*)$ and consequently, writing $q_\mu(\mu)$ for the density of a $N(\mu_*, \phi_*)$ distribution, the right-hand side of the expression for $\log q_\phi(\phi)$ becomes (ignoring some terms constant with respect to $\phi$)
$$ -\left(\frac{\nu_0+n}{2}+1\right)\log\phi \;-\; \frac{S_0}{2\phi} \;-\; \int q_\mu(\mu)\frac{(\mu-\mu_0)^2}{2\phi_0}\,\mathrm{d}\mu \;-\; \frac{1}{2\phi}\int q_\mu(\mu)\sum_i(x_i-\mu)^2\,\mathrm{d}\mu $$
(using $\int q_\mu(\mu)\,\mathrm{d}\mu = 1$). The third term is a constant with respect to $\phi$. We then use $\mathrm{E}_{q_\mu}\mu = \mu_*$ and $\mathrm{E}_{q_\mu}\mu^2 = \mu_*^2 + \phi_*$ to deduce that
$$ \int q_\mu(\mu)\sum_i(x_i-\mu)^2\,\mathrm{d}\mu = \sum_i x_i^2 - 2n\bar{x}\mu_* + n(\mu_*^2 + \phi_*) = (SS + n\bar{x}^2) - 2n\bar{x}\mu_* + n(\mu_*^2 + \phi_*), $$
writing
$$ SS = \sum_i(x_i - \bar{x})^2, \quad\text{so that}\quad \sum_i x_i^2 = SS + n\bar{x}^2. $$
We thus find that (with $c_4$ and $c_5$ constant with respect to $\phi$)
$$ \log q_\phi(\phi) = -\left(\frac{\nu_*}{2}+1\right)\log\phi - \frac{S_*}{2\phi} + c_4, \qquad q_\phi(\phi) = c_5\,\phi^{-(\nu_*/2+1)}\exp(-S_*/2\phi), $$
that is, $q_\phi$ is an $S_*\chi_{\nu_*}^{-2}$ density, where
$$ \nu_* = \nu_0 + n, \qquad S_* = S_0 + (SS + n\bar{x}^2) - 2n\bar{x}\mu_* + n(\mu_*^2 + \phi_*). $$
10.2.5 Iteration
It will be seen that the expressions for $\mu_*$ and $\phi_*$ depend on $S$ and $\nu$, and that those for $S_*$ and $\nu_*$ depend on $\mu_*$ and $\phi_*$. Consequently, we have to proceed iteratively, first using prior values of $S$ and $\nu$ to derive values $\mu_1$ and $\phi_1$, then using these values to find values $S_1$ and $\nu_1$, then using these values to derive values $\mu_2$ and $\phi_2$, and then values $S_2$ and $\nu_2$, and so on. The resultant sequences should reasonably quickly converge to values $\mu_*$, $\phi_*$, $S_*$ and $\nu_*$.
10.2.6 Numerical example
In the subsection ‘Semi-conjugate prior with a normal likelihood’ of Section 9.2, we investigated how the EM algorithm could be used with the data on wheat yield originally considered in the example towards the end of Section 2.13, in which n = 12, $\bar{x} = 119$ and the sum of squares about the mean SS = 13 045. We took a prior for the variance which was $S_0\chi_{\nu_0}^{-2}$ with $S_0 = 2700$ and $\nu_0 = 11$, and an independent prior for the mean which was $N(\mu_0, \phi_0)$ where $\mu_0 = 110$ and $\phi_0 = 20$, and we shall take the same priors now.
The following program converges within half a dozen iterations to limiting values $\mu_*$, $\phi_*$, $S_*$ and $\nu_*$. In particular, the distribution of $\mu$ is N(112.259, 3.872). Note that the value of 112.259 is close to the value 112.278 which we derived from the EM algorithm in Section 9.2.
r <- 10
phi <- rep(NA,r); mu <- rep(NA,r); S <- rep(NA,r)
n <- 12; xbar <- 119; SS <- 13045                 # data summaries
phi0 <- 20; mu0 <- 110; S0 <- 2700; nu0 <- 11     # prior hyperparameters
S[1] <- S0
nustar <- nu0 + n
for (i in 2:r)
{
  phi[i] <- (1/phi0 + n*nustar/S[i-1])^{-1}       # update the mean's variance
  mu[i] <- phi[i]*(mu0/phi0 + n*xbar*nustar/S[i-1])
  S[i] <- S0 + (SS+n*xbar^2) - 2*n*xbar*mu[i] + n*(mu[i]^2 + phi[i])
  cat("i",i,"phi",phi[i],"mu",mu[i],"S",S[i],"\n")
}
mustar <- mu[r]; phistar <- phi[r]; Sstar <- S[r]
cat("mu has mean",mustar,"and s.d.",sqrt(phistar),"\n")
cat("phi has mean",Sstar/(nustar-2),"and s.d.",
    (Sstar/(nustar-2))*sqrt(2/(nustar-4)),"\n")
10.3 Variational Bayesian methods: general case
The variational Bayes approach can also be used in cases where we have more than two parameters or sets of parameters. In cases where we seek an approximation to the density of the form
$$ q(\theta) = \prod_{i} q_i(\theta_i), $$
we need to write
$$ \mathrm{E}_{-i}\log p(x,\theta) = \int \log p(x,\theta)\prod_{j\neq i} q_j(\theta_j)\,\mathrm{d}\theta_j $$
for the expectation of $\log p(x,\theta)$ over all the factors other than $q_i$. With this notation, a straightforward generalization of the argument in the previous section leads to
$$ \log q_i(\theta_i) = \mathrm{E}_{-i}\log p(x,\theta) + \text{constant} $$
and so to
$$ q_i(\theta_i) \propto \exp\bigl(\mathrm{E}_{-i}\log p(x,\theta)\bigr). $$
10.3.1 A mixture of multivariate normals
We sometimes encounter data in two or more dimensions which is clearly not adequately modelled by a multivariate normal distribution but which might well be modelled by a mixture of two or more such distributions. An example is the data on waiting time between eruptions and the duration of the eruption for the Old Faithful geyser in Yellowstone National Park, Wyoming, USA taken from Härdle (1991), which is illustrated in Figure 10.3. A more complicated example of variational Bayes techniques can be illustrated by outlining a treatment of this problem.
Figure 10.3 Härdle’s old faithful data.
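The same measurements are available in R as the built-in data set faithful, whose documentation also cites Härdle (1991) as its source, so a plot resembling Figure 10.3 can be produced directly:

data(faithful)
plot(faithful$eruptions, faithful$waiting,
     xlab = "Duration of eruption (minutes)",
     ylab = "Waiting time (minutes)")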
We shall write $\pi = (\pi_1, \dots, \pi_M)$ for a vector of probabilities, so that $\sum_j \pi_j = 1$, $\mu = (\mu_1, \dots, \mu_M)$ for a set of M mean vectors of dimension k, and $\Sigma = (\Sigma_1, \dots, \Sigma_M)$ for a set of M variance–covariance ($k \times k$) matrices.
Suppose that we have a data set X comprising N observations $x_i$ (for $i = 1, \dots, N$) which come from a mixture of M sub-populations, each of which has a k-dimensional multivariate normal distribution, so that with probability $\pi_j$ (for $j = 1, \dots, M$) their distribution is multivariate normal with mean $\mu_j$ and variance–covariance matrix $\Sigma_j$, that is $N(\mu_j, \Sigma_j)$. Thus,
$$ x_i \sim \sum_{j=1}^{M} \pi_j\,N(\mu_j, \Sigma_j) $$
(where by abuse of notation the expression on the right-hand side means a variable which with probability $\pi_j$ has an $N(\mu_j, \Sigma_j)$ distribution). We suppose that all the parameters are unknown, and we seek maximum likelihood estimates of them, so that we can find what the component distributions are and which observations come from which component.
In order to deal with this problem, it is helpful to augment the data with variables $s_{ij}$ which are indicator variables denoting the sub-population to which each observation belongs, so that
$$ s_{ij} = \begin{cases} 1 & \text{if observation } i \text{ comes from sub-population } j \\ 0 & \text{otherwise} \end{cases} $$
and the values $s_{ij}$ for different values of i are independent of one another. We write $s$ for the complete set of the $s_{ij}$. Estimating the values of the $s_{ij}$ amounts to determining which sub-population each observation comes from.
In dealing with this problem, we begin by working as if the value of $\pi$ were known. We then take independent multivariate normal priors, each with a given mean vector and variance–covariance matrix, for the $\mu_j$. For the variance–covariance matrices $\Sigma_j$, we take independent inverse Wishart distributions; since we are only outlining a treatment of this problem it is not important exactly what this means and you can take it as given that this is the appropriate multidimensional analogue of the (multiples of) inverse chi-squared distributions we use as a prior for the variance in the one-dimensional case.
We can specify the prior distribution of $s$ for a given value of $\pi$ by saying that an individual i (for $i = 1, \dots, N$) belongs to sub-population j (for $j = 1, \dots, M$) with probability $\pi_j$.
It then follows that
$$ p(s\mid\pi) = \prod_{i=1}^{N}\prod_{j=1}^{M}\pi_j^{\,s_{ij}}. $$
In order to evaluate the likelihood $p(X\mid\pi)$ it is necessary to marginalize this expression, together with the priors for $\mu$ and $\Sigma$, with respect to $s$, $\mu$ and $\Sigma$, so that we seek
$$ p(X\mid\pi) = \sum_{s}\iint p(X\mid s,\mu,\Sigma)\,p(s\mid\pi)\,p(\mu)\,p(\Sigma)\,\mathrm{d}\mu\,\mathrm{d}\Sigma. $$
Now as we found earlier
$$ \log p(X\mid\pi) = \mathcal{L}(q) + I\bigl(q : p(\cdot\mid X,\pi)\bigr), \qquad \mathcal{L}(q) = \sum_{s}\iint q(s,\mu,\Sigma)\log\frac{p(X,s,\mu,\Sigma\mid\pi)}{q(s,\mu,\Sigma)}\,\mathrm{d}\mu\,\mathrm{d}\Sigma. $$
We then seek a variational posterior distribution of the form
$$ q(s,\mu,\Sigma) = q_1(s)\,q_2(\mu)\,q_3(\Sigma). $$
Since we used appropriate conjugate priors, we end up with a posterior $q_2$ for $\mu$ which is multivariate normal and a posterior $q_3$ for $\Sigma$ which is inverse Wishart. The posterior distribution $q_1$ of $s$ for a given value of $\pi$ is determined by a matrix $(p_{ij})$ where $p_{ij}$ is the probability that individual i belongs to sub-population j.
In this way we obtain a variational lower bound $\mathcal{L}(q)$ which approximates the true marginal log-likelihood $\log p(X\mid\pi)$. This bound is, of course, dependent on $\pi$ through the $\mathrm{E}_q\log p(s\mid\pi)$ component of $\mathcal{L}(q)$. By maximizing it with respect to $\pi$ we obtain our required estimates for the mixing coefficients. But, of course, the value of $q$ and hence the value of the lower bound will depend on $\pi$. We, therefore, adopt an EM procedure in which we alternately maximize $\mathcal{L}(q)$ with respect to $\pi$ (maximization step) and then optimize $q$ by updating the variational solutions for $q_1$, $q_2$ and $q_3$ (expectation step). We have so far assumed that M is fixed and known, but it is possible to try the procedure for various values of M and choose that value which maximizes the variational likelihood.
We omit the details, which can be found in Corduneanu and Bishop (2001). They find that the Old Faithful geyser data is well fitted by a three-component mixture of bivariate normals. This corresponds to the visual impression given by Figure 10.3 in which it appears that there is one sub-population at the top right, one at the bottom left and possibly a small sub-population between the two.
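Readers who wish to experiment can fit such a mixture to these data directly in R, for example with the mclust package. Note that Mclust fits the mixture by maximum likelihood using the EM algorithm rather than by the variational method of Corduneanu and Bishop (2001), so this is only a rough check of the component structure rather than a reproduction of their analysis.

# install.packages("mclust")    # if the package is not already installed
library(mclust)
fit <- Mclust(faithful, G = 3)  # force a three-component bivariate normal fit
summary(fit, parameters = TRUE) # mixing proportions, means and covariances
plot(fit, what = "classification")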
Some further applications of variational Bayesian methods can be found in Ormerod and Wand (2010). In their words, ‘Variational methods are a much faster alternative to MCMC, especially for large models’.
10.4 ABC: Approximate Bayesian Computation
10.4.1 The ABC rejection algorithm
We sometimes find that we have a model with a likelihood which is mathematically difficult to deal with, and for such cases a technique called Approximate Bayesian Computation or ABC (sometimes referred to as likelihood-free computation) has been developed and has been quite widely employed, especially in genetics. There are several variants of this technique and we will illustrate each of them with the genetic linkage example we considered earlier in Sections 9.2 and 9.3.
Suppose we want to find the posterior distribution $p(\theta\mid x)$ when we have observations $x$ coming from a distribution $p(x\mid\theta)$ and a (proper) prior distribution $p(\theta)$ for θ. Then if the observations come from a discrete distribution we can use the following algorithm to generate a sample of size k from the required posterior distribution.
10.4.1.1 ABC algorithm
1. Generate θ from the prior density $p(\theta)$.
2. Generate $\tilde{x}$ from the (discrete) density $p(\tilde{x}\mid\theta)$.
3. If $\tilde{x} = x$, accept θ as an observation from $p(\theta\mid x)$, and otherwise reject θ.
4. Repeat until a sample of size k has been obtained or a set maximum number of iterations has taken place.
This algorithm works because for any set A of possible values of θ we find (using the ‘tilde’ notation introduced in Chapter 1)
$$ P(\theta\in A\mid \tilde{x} = x) = \frac{\int_A p(\theta)\,P(\tilde{x} = x\mid\theta)\,\mathrm{d}\theta}{\int p(\theta)\,P(\tilde{x} = x\mid\theta)\,\mathrm{d}\theta} = \frac{\int_A p(\theta)\,p(x\mid\theta)\,\mathrm{d}\theta}{p(x)} = \int_A p(\theta\mid x)\,\mathrm{d}\theta $$
as required.
Although it works, this is an extremely inefficient algorithm. If, for example, we take six observations from a Poisson distribution $P(\lambda)$, such as the misprint data considered in Section 3.4, then the probability that a proposed value of λ is accepted in step 3 is the predictive probability
$$ P(\tilde{x} = x) = \int p(\lambda)\prod_{i=1}^{6}\frac{\lambda^{x_i}\mathrm{e}^{-\lambda}}{x_i!}\,\mathrm{d}\lambda. $$
We could calculate the value of this for any conjugate prior, but for simplicity we consider a case with $S_0 = 2$, so that our prior takes a particularly simple form. With the data we considered in that example, the required probability becomes very small indeed.
Clearly this technique is not practical without modification, even in the discrete case, and since continuous random variables attain any particular value with probability zero, it cannot be used at all in the continuous case.
We can make a slight improvement by use of sufficient statistics. So in the case of
a sample of size 6 from a Poisson distribution $P(\lambda)$, we need consider only the sum of the observations, which has a $P(6\lambda)$ distribution. With this in mind, we can replace step 3 of the algorithm by
3*. If $\sum_i \tilde{x}_i = \sum_i x_i$, accept θ as an observation from $p(\theta\mid x)$ and otherwise reject θ.
By working with the sufficient statistic, the acceptance probability with the aforementioned data rises considerably, although it remains small.
This is clearly not enough.
If there is a distance metric d available defining distances between values of the statistic T(x), we can define an ε-approximate posterior distribution as
$$ p_\varepsilon(\theta\mid x) \propto p(\theta)\,P\bigl(d(T(\tilde{x}), T(x)) \leqslant \varepsilon \mid \theta\bigr), $$
where
$$ P\bigl(d(T(\tilde{x}), T(x)) \leqslant \varepsilon \mid \theta\bigr) = \int_{\{\tilde{x}\,:\,d(T(\tilde{x}),T(x))\leqslant\varepsilon\}} p(\tilde{x}\mid\theta)\,\mathrm{d}\tilde{x}. $$
10.4.1.2 ABC-REJ algorithm
1. Generate θ from the prior density $p(\theta)$.
2. Generate $\tilde{x}$ from the density $p(\tilde{x}\mid\theta)$.
3. If $d(T(\tilde{x}), T(x)) \leqslant \varepsilon$, accept θ as an observation from $p_\varepsilon(\theta\mid x)$ and otherwise reject θ.
4. Repeat until a sample of size k has been obtained or a set maximum number of iterations has taken place.
With this modification, the method can be used with continuous random variables as well as with discrete ones. We observe that if $\varepsilon = 0$ this algorithm reduces to the previous one with the modification concerning sufficient statistics.
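A minimal R sketch of ABC-REJ for a Poisson sample is given below, using the sum of the observations as the statistic T. The data vector and the gamma prior used here are illustrative placeholders only, not the misprint data or the prior of Section 3.4.

set.seed(1)
x <- c(3, 1, 4, 2, 0, 2)                  # hypothetical observed counts
n <- length(x); Tobs <- sum(x)
k <- 1000                                 # required sample size
eps <- 1                                  # tolerance; eps = 0 gives exact matching
maxit <- 1000000                          # maximum number of iterations
post <- numeric(0)
for (it in 1:maxit) {
  theta <- rgamma(1, shape=3, rate=1)     # step 1: draw from the (illustrative) prior
  xtilde <- rpois(n, theta)               # step 2: simulate data from the model
  if (abs(sum(xtilde) - Tobs) <= eps)     # step 3: distance between the statistics
    post <- c(post, theta)
  if (length(post) >= k) break            # step 4
}
length(post)                              # number of accepted values
mean(post); var(post)                     # summaries of the approximate posterior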
Sometimes there is not a suitable sufficient statistic and we instead use a statistic which is in some sense close to being sufficient.