by Peter M Lee
10.4.2 The genetic linkage example
We shall now apply this method to the genetic linkage. Recall that we considered observations with cell probabilities
The values actually observed were x1=125, x2=18, x3=20, x4=34, and we are interested in the posterior distribution of . The likelihood is then
and so is sufficient. We take a uniform U(0, 1) prior for , use the distance measure
and fix a value for ε. The algorithm takes the form:
10.4.2.1 ABC-REJ algorithm for the genetic linkage example
1. Generate from U(0, 1).
2. Generate data which has a trinomial distribution with index n and parameter
(We have not formally considered the trinomial distribution, although it was briefly mentioned in Exercise 15 on Chapter 2 and in Exercise 10 on Chapter 3, but it arises as a simple generalization of the binomial, giving the joint distribution of the total number of individuals taken from a population of size n falling in each of three classes when the individuals are independently assigned to the classes with the probabilities given by .)
3. If , accept θ as an observation from and otherwise reject .
4. Repeat N times.
A program in to do this is as follows:
N <- 100000 etastar <- c(125,18+20,34) n <- sum(etastar) eps <- 3 d <- rep(NA,N) trial <- rep(NA,N) for (j in 1:N) { etatrial <- runif(1) inds <- sample(3,n,replace=T, p=c(0.5+etatrial/4,(1-etatrial)/2,etatrial/4)) samp <- c(sum(inds==1),sum(inds==2),sum(inds==3)) d <- sqrt(sum((etastar-samp)^2)) if (d <= eps) trial[j] <- etatrial } eta <- trial[!is.na(trial)] k <- length(eta) m <- mean(eta) s <- sd(eta) cat("k",k,"m",m,"s",s,"n")
A run of this program produced a sample of values of of size k=1040, mean m=0.625 and s.d. s=0.0511. For comparison, we recall that we found a value for of 0.630 from the EM algorithm in Section 9.2, and a distribution with a posterior mode of 0.627 by data augmentation in Section 9.3.
An ad hoc rule sometimes quoted is that ε should be chosen such that approximately 1% of trials of should be accepted, and the choice of 3 as a value for ε was made for this reason.
10.4.3 The ABC Markov Chain Monte Carlo algorithm
We can incorporate the idea of the Metropolis–Hastings algorithm introduced in Section 9.6 into approximate Bayesian computation. In the aforementioned algorithm is a suitable proposal density (a transition probability density for a Markov chain) and
where
the statistic being in some sense close to being sufficient.
10.4.3.1 ABC-MCMC algorithm
1. Generate from the prior density .
2. Generate from the proposal density .
3. Generate X from the density .
4. Set with probability α and otherwise set .
5. Repeat until a sample of size k has been obtained or a set maximum number of iterations has taken place.
In the genetic linkage example [using the sufficient statistic as our )] a convenient choice is a uniform prior U(0, 1) and a normal proposal density
because, as is easily checked, then reduces to unity, and so
The algorithm thus becomes:
10.4.3.2 ABC-MCMC algorithm for the genetic linkage example
1. Generate from the prior density U(0, 1).
2. Generate from the proposal density .
3. Generate X from the density .
4. Set if and otherwise set .
5. Repeat until a sample of size k has been obtained or a set maximum number of iterations has taken place.
A program in to do this is as follows:
N <- 100000 epsilon <- 3 sdev <- 0.05 k <- 0 etaset <- 0.5 xstar <- c(125,18+20,34) n <- sum(xstar) eta <- rep(NA,N) for (j in 1:N) { etatrial = rnorm(1,etaset,sdev) if ((0
A run of this program produced a sample of values of of size k=4627, mean m=0.624 and s.d. s=0.0499. Note the number of acceptances is some four and a half times as great as with the ABC-REJ algorithm.
A problem which can occur with this algorithm is pointed out by Sisson et al. (2007). In their words, ‘if the ABC-MCMC sampler enters an area of relatively low probability with a poor proposal mechanism, the efficiency of the algorithm is strongly reduced because it then becomes difficult to move anywhere with a reasonable chance of acceptance, and so the sampler “sticks” in that part of the state space for long periods of time.’
10.4.4 The ABC Sequential Monte Carlo algorithm
In the ABC-SMC algorithm we try to improve on simple rejection sampling by adopting a sequential Monte Carlo based simulation approach. Here, a collection of values from the parameter space (termed particles) is propagated from an initial prior distribution through a sequence of intermediary distributions, until it ultimately represents the target distribution. The method can be thought of as a type of importance sampling (cf. Section 10.1).
We begin by taking from an initial distribution (which may or may not be the prior distribution p). We then seek a target distribution which represents the posterior of θ conditional on for observed data . The standard importance sampling procedure would then indicate how well each particle adheres to by specifying the importance weight it should receive in the full population of N particles. The effectiveness of such a procedure is sensitive to the choice of and it can be highly inefficient if is diffuse relative to fT.
The idea behind sequential sampling methods is to avoid the potential disparity between and fT by specifying a sequence of intermediary distributions such that they evolve gradually from the initial distribution towards the target distribution. In the ABC setting, we may naturally define the sequence of distributions as
where is a strictly decreasing sequence of tolerances.
The degree of sample degeneracy in a population is measured by the effective sample size ESS which represents the equivalent number of random samples needed to obtain an estimate such that its Monte Carlo variation is equal to that of the N weighted particles and it may be estimated as , which is between 1 and N. It is usual to set a re-sampling threshold E, which is often taken as N/2.
A difficulty with this approach is that, while some action occurs (e.g. a realization or move proposal is accepted) each time a non-zero likelihood is encountered, there is a large probability that the likelihood, and therefore the particle weight will be zero. Fortunately, the idea of partial rejection control leads to the modified notion of the ABC-PRC algorithm.
This process can be described in the following algorithm which assumes that data is available and is taken from the 2009 corrections to Sisson et al. (2007):
10.4.4.1 ABC-PRC algorithm
1. Initialize and specify the initial sampling density (which may or may not coincide with the prior distribution ). Set the population indicator t to 1.
2. Set the particle indicator to 1.
a. If t=1 sample independently from . If t> 1 sample from the previous population with weights (W(i)t–1), and move the particle to according to a Markov transition density Kt which may or may not vary with t. Generate a data set from the density . If then go back to 2(a).
b. Set and
If i< N increment i to i+1 and go back to 2(a).
3. Normalize the weights, so that . If then re-sample with replacement the particles with weights W(i)t and then set all weights W(i)t to 1/N.
4. If t< T increment t to t+1 and go back to step 2.
Note that as is the prior density, when t=1 step 2(b) gives the same weights as in importance sampling in Section 10.1, while the weights when t> 1 come from an obvious updating process.
An program to carry this out for the genetic linkage example, using the normal random walk (restricted, so that is always between 0 and 1) as the Markov transition density, is as follows:
Estimated
density functions of based on , and are shown in Figure 10.4.
Figure 10.4 The ABC-PRC algorithm for the genetic linkage example.
10.4.5 The ABC local linear regression algorithm
Another variant of ABC which has been proposed uses local linear regression. We shall illustrate the technique in the case where there is a single unknown parameter θ. We define the Epanechnikov kernel as the function
so that and . We then use this function to give most weights to those simulations which result in the smallest values of . Thus, we could estimate the mean value of θ by
But actually we can go further than this. We can seek values of α and which minimize
If instead of using the Epanechnikov kernel, we had taken for all t this would result in ordinary linear regression as we considered in Section 6.3, but instead we are giving more weight to observations for which is small. Straightforward differentiation shows that
By setting we see that the minimum occurs when and , where
and these equations are easily solved, albeit not quite as simply as in Section 6.3 since does not drop out of the first equation.
The natural way of solving such equations is by matrix algebra and with matrix techniques it is almost as easy to deal with cases where the statistic is multidimensional. For details see Beaumont et al. (2002).
10.4.6 Other variants of ABC
Various other variants of the ABC technique have been considered, such as non-linear regression models, in particular fast-forward neural network regression models. For details, see Blum and François (2010).
It has also been applied to model selection (see Toni and Stumpf, 2010).
A forthcoming paper likely to be of interest in connection with ABC is Fearnhead and Prangle (2012) and a useful survey is Sisson and Fan (2011).
10.5 Reversible jump Markov chain Monte Carlo
The Metropolis–Hastings algorithm introduced in Section 9.6 can be generalized to deal with a situation in which we have a number of available models, each with unknown parameters, and we wish to choose between them.
The basic idea is that as well as considering moves within a model as in the basic Metropolis–Hastings algorithm we also consider possible moves between models. We can regard the chain as moving between states which are specified by a model and a set of parameters for that model. For the time being, we shall suppose that we have two models M(1) with parameters and M(2) with parameters , so that at time t we have model M(i[t]) with parameters .
10.5.1 RJMCMC algorithm
1. Initialize by setting t=0 and choosing i[0] and hence model M(i[0]), and then initial parameters .
2. For
a. Update the parameters of the current model M(i[t–1]) as in the usual Metropolis–Hastings algorithm.
b. Propose to move to a new model with parameters . Calculate the acceptance probability for the move and decide whether to accept or reject the new model (and parameter values).
An example arises in connection with the coal-mining disaster data in which we could consider a model M(1) in which all the data are supposed to be such that for and compare it with the model M(2) discussed in Section 9.4 in which we suppose there is an unknown value k such that for and for (with and unknown). In this case typical states of the chain would be and .
A problem which is immediately seen with models M(1) and M(2) described earlier is that while M(1) has only one unknown parameter, namely , M(2) has three, namely . We accordingly add auxiliary variables to the parameters of model M1, so that the parameter spaces of the two models are of the same dimension. The auxiliary variables do not affect the modelling of the data but allow a one-to-one mapping between the two parameter spaces. We can easily map from the parameters of M(2) to M(1) by setting , but in order to map from the parameters of M(1) to uniquely determined parameters M(2) we need these auxiliary variables. We can then, for example, take , and k=w, where the random variables v and w are chosen to have suitable distributions.
Because of the fact that we change the variables we use, we need to re-examine the detailed balance equation
from which we deduced the Metropolis–Hastings algorithm in Section 9.6. We note that since we need to work in terms of probability elements in the continuous multivariate case this should be thought of as
In one dimension, the ratio of differentials would give rise to an ordinary derivative, but the analogue in many dimensions is, of course, the Jacobian . For this reason, we find
Since the models have different parameters, it is clear that the priors cannot cancel out and so must be proper.
We can give a more explicit form to the probability of switching as follows. Let be the parameters of a model M(i) and let be the associated auxiliary variables. Let be the parameters of a model M(j) and let be the associated auxiliary variables. Write and . Let be a proposal distribution for and let be the proposal distribution for . Then the probability of moving from model i with parameters to model j with parameters is
where q(i | j) is the probability that a change of model from model i results in model j and is the Jacobian.
In the case of the coal-mining disasters, we have only two possible models, so that q(2 | 1)=q(1 | 2)=1. Further, as , and k=w, we find
so that, depending on which way a jump is proposed, either or .
In this case, a suitable proposal density is obtained by giving v a normal distribution with mean zero and a suitable variance, so , and w a discrete uniform distribution over (2, n–1), so that . Since is null, we may take . In fact, it turns out that there is strong evidence for the existence of a change point; if the chain is started in model M(2) it always remains in that model (and the change point still being some time in 1889), although, if we allow the possibility of more than one change point, then Green (1995) shows that there is weak evidence for a second change point.
Another case where RJMCMC can be applied is in mixture models where we are unsure how many components are present. As a simple case, we might wish to tell whether a data set X consists of a single normal population of mean 0 or a 50:50 mixture of populations with mean 0 but different variances. More generally, we could allow for more components, for different means and for variable mixing coefficients. The method can also be applied in choosing between various regression models. In cases where there are m> 2 models, values of q(j | i) will have to be supplied, but in some cases it will suffice simply to take q(j | i)=1/(m–1) for all i and j (with ), or at least to make all jumps to models which are in some sense ‘adjacent’ equally likely.
Some more details about this approach, including a detailed treatment of the coal-mining disaster data, can be found in Green (1995); see also Richardson and Green (1997).
More information can be found in Fan and Sisson (2011) or Gamerman and Lopes (2006). RJMCMC is in the process of being incorporated into WinBUGS; for information about this see the website associated with this book.
10.6 Exercises on Chapter 10
1. Show that in importance sampling the choice
minimizes even in cases where f(x) is not of constant sign.
2. Suppose that has a Cauchy distribution. It is easily shown that , but we will consider Monte Carlo methods of evaluating this probability.
a. Show that if k is the number of values taken from a random sample of size n with a Cauchy distribution, then k/n is an estimate with variance 0.125 802 7/n.
b. Let p(x)=2/x2, so that . Show that if is uniformly distributed over the unit interval then y=2/x has the density p(x) and that all values of y satisfy and hence that
gives an estimate of by importance sampling.
c. Deduce that if are independent U(0, 1) variates then
gives an estimate of .
d. Check that is an unbiased estimate of and show that
and deduce that
so that this estimator has a notably smaller variance than the estimate considered in (a).
3. Apply sampling importance re-sampling starting from random variables uniformly distribute
d over (0, 1) to estimate the mean and variance of a beta distribution Be(2, 3).
4. Use the sample found in Section 10.5 to find a 90% HDR for Be(2, 3) and compare the resultant limits with the values found using the methodology of Section 3.1. Why do the values differ?
5. Apply the methodology used in the numerical example in Section 10.2 to the data set used in both Exercise 16 on Chapter 2 and Exercise 5 on Chapter 9.
6. Find the Kullback–Leibler divergence when p is a binomial distribution and q is a binomial distribution . When does ?
7. Find the Kullback–Leibler divergence when p is a normal distribution and q is a normal distribution .
8. Let p be the density (x> 0) of the modulus x=|z| of a standard normal variate z and let q be the density (x> 0) of an distribution. Find the value of such that q is as close an approximation to p as possible in the sense that the Kullback–Leibler divergence is a minimum.
9. The paper by Corduneanu and Bishop (2001) referred to in Section 10.3 can be found on the web at
Härdle’s data set is available in by going data(faithful). Fill in the details of the analysis of a mixture of multivariate normals given in that section.
10. Carry out the calculations in Section 10.4 for the genetic linkage data quoted by Smith which was given in Exercise 3 on Chapter 9.
11. A group of n students sit two exams. Exam one is on history and exam two is on chemistry. Let xi and yi denote the ith student’s score in the history and chemistry exams, respectively. The following linear regression model is proposed for the relationship between the two exam scores: