the expectation being taken over values of or over values of , that is,
using the result at the very end of Section 8.2 and bearing in mind that . Now writing k=g–h we have
and hence
for all , which can happen only if k vanishes identically by the uniqueness of Laplace transforms.
8.5 Bayesian analysis for an unknown overall mean
In Section 8.2, we derived the posterior for supposing that a priori
where μ was known. We shall now go on to an approach introduced by Lindley (1969) and developed in his contribution to Godambe and Sprott (1971) and in Lindley and Smith (1972) for the case where μ is unknown.
We suppose that
are independent given the and . This is the situation which arises in one way analysis of variance (analysis of variance between and within groups). In either of the practical circumstances described above, the means will be thought to be alike. More specifically, the joint distribution of these means must have the property referred to by de Finetti (1937 or 1974–1975, Section 11.4) as exchangeability; that is, the joint distribution remains invariant under any permutation of the suffices. A famous result in de Finetti (1937) [for a good outline treatment see Bernardo and Smith (1994, Sections 4.2 and 4.3)] says that exchangeability implies that the have the probability structure of a random sample from a distribution. It might seem sensible to add the additional assumption that this distribution is normal (as we often do in statistics). It would then be appropriate to assume that
the being assumed independent for given μ and .
To complete the specification of the prior distribution, it is necessary to discuss μ, and . For the moment, we shall suppose that the two variances are known and that the prior knowledge of μ is weak, so that, over the range for which the likelihood is appreciable, the prior density of μ is constant (cf. Section 2.5).
We thus have
so that
We shall show that we can write the posterior distribution in the form
where the ti are defined by
in which
We thus see that the posterior means of the take the form of a weighted average of the mean (the least-squares estimate) and an overall mean , depending in a natural way on the sample sizes and the ratio of the two variance components. The effect of this weighted average is to shift each estimate away from its own sample mean towards the overall mean. It is clear that these estimates are of the same type as the Efron–Morris (or Stein) estimators derived earlier.
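As a concrete illustration, here is a minimal numerical sketch in Python of estimates of this general form. It assumes weights of the form wi = ni psi/(phi + ni psi) and an overall mean taken as the w-weighted average of the group means; the data and the values of the variance components phi and psi are purely illustrative.

import numpy as np

# Illustrative data: sample means and sample sizes for r groups (invented values)
xbar = np.array([12.0, 15.5, 9.8, 13.2])   # group sample means
n = np.array([4, 8, 5, 10])                # group sample sizes
phi = 4.0                                  # within-group variance (assumed known)
psi = 2.5                                  # between-group variance (assumed known)

# Weights of the assumed form w_i = n_i * psi / (phi + n_i * psi)
w = n * psi / (phi + n * psi)

# Overall mean taken as the w-weighted average of the group means
mu_hat = np.sum(w * xbar) / np.sum(w)

# Posterior means: weighted average of each group mean and the overall mean,
# shrinking every group estimate towards mu_hat
t = w * xbar + (1 - w) * mu_hat

print("weights:", w)
print("overall mean:", mu_hat)
print("posterior means:", t)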
The proof of this result is given in Section 8.5.1, but can be omitted by readers willing to take the result for granted. It must be admitted that the result is mainly of theoretical interest because it is difficult to think of real-life cases where both and are known.
In the case where and are unknown and conjugate (inverse chi-squared) priors are taken for them, somewhat similar results are possible with and in the expression for wj replaced by suitable estimators; the details can be found in Lindley’s contribution to Godambe and Sprott (1971). Unfortunately, while it is possible to use a reference prior for , there are severe difficulties about using a similar prior for . In the words of Lindley, op. cit.,
The difficulty can be viewed mathematically by remarking that if a prior proportional to … which is improper … – is used, then the posterior remains improper whatever size of sample is taken. Heuristically it can be seen that the between-sample variance provides information directly about , – that is, confounded with – and not about itself, so that the extreme form of the prior cannot be overcome by sampling.
We shall discuss numerical methods for use in connection with the hierarchical normal model in Sections 9.2 and 9.4.
8.5.1 Derivation of the posterior
Because
where , we see that
Noting that (because )
we can integrate over μ to get
Minus twice the coefficient of in the above exponential is
while the coefficient of of is
from which it follows that if we set
we can write the posterior distribution in the form
where the ti are yet to be determined.
By equating coefficients of we see that
where
Writing
it follows that
so that
where
and so
We have thus proved that the posterior means of the do indeed take the form of a weighted average of the mean (the least-squares estimate) and an overall mean , depending in a natural way on the sample sizes and the ratio of the two variance components, and so shift each estimate away from its own sample mean towards the overall mean.
A further discussion of related matters can be found in Leonard and Hsu (2001, Section 6.3).
8.6 The general linear model revisited
8.6.1 An informative prior for the general linear model
This section follows on from Section 6.7 on ‘The general linear model’ and like that section presumes a knowledge of matrix theory.
We suppose as in that section that
(where is r-dimensional, so is ), but this time we take a non-trivial prior for , namely
(where is s-dimensional, so is ). If the hyperparameters are known, we may as well take r=s and , and in practice dispense with , but although for the moment we assume that is known, in due course we shall let have a distribution, and it will then be useful to allow other values for .
Assuming that , and are known, the log of the posterior density is (up to an additive constant)
Differentiating with respect to the components of we get a set of equations which can be written as one vector equation
Equating this to zero to find the mode of the posterior distribution, which by symmetry equals its mean, we get
so that
In particular, if is taken as zero, so that the vector of prior means vanishes, then this takes the form
The usual least squares estimators reappear if .
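To make the calculation concrete, the following Python sketch computes a posterior mean of this general form under the assumptions x ~ N(A theta, phi I) and theta ~ N(theta0, psi I); the design matrix, data and variance values are invented for illustration, and the sketch simply implements the standard posterior-mean calculation under those assumptions rather than transcribing the expressions above.

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical design matrix and data for x = A theta + error (invented values)
A = rng.normal(size=(20, 3))
theta_true = np.array([1.0, -2.0, 0.5])
x = A @ theta_true + rng.normal(scale=1.0, size=20)

phi = 1.0                      # error variance (assumed known)
psi = 4.0                      # prior variance of each component of theta (assumed known)
theta0 = np.zeros(3)           # vector of prior means

# Posterior mean under x ~ N(A theta, phi I), theta ~ N(theta0, psi I):
# (A'A/phi + I/psi)^(-1) (A'x/phi + theta0/psi)
precision = A.T @ A / phi + np.eye(3) / psi
theta_bayes = np.linalg.solve(precision, A.T @ x / phi + theta0 / psi)

# With theta0 = 0 this is (A'A + k I)^(-1) A'x where k = phi/psi, and as
# psi -> infinity (k -> 0) the usual least squares estimator reappears
theta_ls = np.linalg.solve(A.T @ A, A.T @ x)

print("Bayes posterior mean:", theta_bayes)
print("least squares:      ", theta_ls)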
8.6.2 Ridge regression
This result is related to a technique known as ridge regression, which has become popular in recent years among classical statisticians. This was originally developed by Hoerl and Kennard (1970), and a good account of it can be found in the article entitled ‘Ridge Regression’ in Kotz et al. (2006); alternatively, see Weisberg (2005, Section 11.2). Some further remarks about the connection between ridge regression and Bayesian analysis can be found in Rubin (1988).
What they pointed out was that the appropriate (least squares) point estimator for was
From a classical standpoint, it is then of interest to find the variance–covariance matrix of this estimator in repeated sampling, which is easily shown to be
(since ), so that the sum of the variances of the regression coefficients is
(the trace of a matrix being defined as the sum of the elements down its main diagonal) and the mean square error in estimating θ is
However, there can be considerable problems in carrying out this analysis. It has been found that the least squares estimates are sometimes inflated in magnitude, sometimes have the wrong sign, and are sometimes unstable in that radical changes to their values can result from small changes or additions to the data. Evidently if is large, so is the mean square error, which we can summarize by saying that the poorer the conditioning of the matrix, the worse the deficiencies referred to above are likely to be. The suggestion of Hoerl and Kennard was to add small positive quantities to the main diagonal, that is, to replace by where k > 0, so obtaining the estimator
which we derived earlier from a Bayesian standpoint. On the other hand, Hoerl and Kennard have some rather ad hoc mechanisms for deciding on a suitable value for k.
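Here is a small Python sketch of the ridge estimator in this form, using an invented, nearly collinear design matrix and an arbitrary value of k; it illustrates how adding k to the main diagonal stabilizes an ill-conditioned matrix.

import numpy as np

rng = np.random.default_rng(1)

# Invented data with two nearly collinear columns, so that A'A is ill-conditioned
z = rng.normal(size=30)
A = np.column_stack([z, z + 0.01 * rng.normal(size=30), rng.normal(size=30)])
theta_true = np.array([2.0, -1.0, 0.5])
x = A @ theta_true + rng.normal(scale=0.5, size=30)

def ridge(A, x, k):
    # Ridge estimator (A'A + kI)^(-1) A'x; k = 0 gives ordinary least squares
    p = A.shape[1]
    return np.linalg.solve(A.T @ A + k * np.eye(p), A.T @ x)

print("condition number of A'A:", np.linalg.cond(A.T @ A))
print("least squares (k = 0):  ", ridge(A, x, 0.0))
print("ridge (k = 1):          ", ridge(A, x, 1.0))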
8.6.3 A further stage to the general linear model
We now explore a genuinely hierarchical model. We supposed that , or slightly more generally that
(see the description of the multivariate normal distribution in Appendix A). Further a priori , or slightly more generally
At the next stage, we can suppose that our knowledge of is vague, so that . We can then find the marginal density of as
on completing the square by taking such that
that is
Since the second exponential is proportional to a normal density, it integrates to a constant and we can deduce that
that is , where
We can then find the posterior distribution of given as
Again completing the square, it is easily seen that this posterior distribution is where
8.6.4 The one way model
If we take the formulation of the general linear model much as we discussed it in Section 6.7, so that
we note that . We assume that the xi are independent and have variance so that reduces to and hence . The situation where we assume that the are independently fits into this framework if we take (an r-dimensional column vector of 1s, so that while is an matrix with 1s everywhere) and have just one scalar hyperparameter μ of which we have vague prior knowledge. Then reduces to and to giving
and so has diagonal elements ai+b and all off-diagonal elements equal to b, where
These are of course the same values that we found in Section 8.5. It is also possible to deduce the form of the posterior means found there from the approach used here.
8.6.5 Posterior variances of the estimators
Writing
and (remember that b < 0) it is easily seen that
and hence
using the Sherman–Morrison formula for the inverse of with . [This result is easily established; in case of difficulty, refer to Miller (1987, Section 3) or Horn and Johnson (1991, Section 0.7.4).] Consequently the posterior variance of is
Now substituting and , we see that
from which it follows that
We thus confirm that the incorporation of prior information has resulted in a reduction of the variance.
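As a numerical check on the Sherman–Morrison step used above, the following Python sketch inverts a matrix with diagonal elements ai + b and off-diagonal elements b both directly and via the formula; the values of ai and b are invented for the example.

import numpy as np

# Invented values: diagonal entries a_i and common off-diagonal entry b
a = np.array([3.0, 5.0, 2.0, 4.0])
b = -0.6
ones = np.ones_like(a)

M = np.diag(a) + b * np.outer(ones, ones)   # diagonal a_i + b, off-diagonal b

# Direct inversion
M_inv_direct = np.linalg.inv(M)

# Sherman-Morrison: (D + u v')^(-1) = D^(-1) - D^(-1) u v' D^(-1) / (1 + v' D^(-1) u)
# with D = diag(a_i), u = b * 1, v = 1
D_inv = np.diag(1.0 / a)
u = b * ones
v = ones
M_inv_sm = D_inv - (D_inv @ np.outer(u, v) @ D_inv) / (1.0 + v @ D_inv @ u)

print("maximum discrepancy:", np.max(np.abs(M_inv_direct - M_inv_sm)))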
8.7 Exercises on Chapter 8
1. Show that the prior
suggested in connection with the example on risk of tumour in a group of rats is equivalent to a density uniform in .
2. Observations x1, x2, … , xn are independently distributed given parameters , , … , according to the Poisson distribution . The prior distribution for is constructed hierarchically. First, the s are assumed to be independently identically distributed given a hyperparameter according to the exponential distribution for and then is given the improper uniform prior for . Provided that , prove that the posterior distribution of has the beta form
Thereby show that the posterior means of the are shrunk by a factor relative to the usual classical procedure which estimates each of the by xi.
What happens if ?
3. Carry out the Bayesian analysis for known overall mean developed earlier in Section 8.2 (a) with the loss function replaced by a weighted mean
and (b) with it replaced by
4. Compare the effect of the Efron–Morris estimator on the baseball data in Section 8.3 with the effect of a James–Stein estimator which shrinks the values of towards or equivalently shrinks the values of Xi towards .
5. The Helmert transformation is defined by the matrix
so that the element aij in row i, column j is
It is also useful to write for the (column) vector which consists of the jth column of the matrix . Show that if the variates Xi are independently , then the variates are independently normally distributed with unit variance and such that for j > 1 and
By taking for i > j, aij=0 for i < j and ajj such that , extend this result to the general case and show that . Deduce that the distribution of a non-central chi-squared variate depends only on r and .
6. Show that where
(Lehmann 1983, Section 4.6, Theorem 6.2).
7. Writing
for the least-squares and ridge regression estimators for regression coefficients , show that
and that the bias of is
while its variance–covariance matrix is
Deduce expressions for the sum of the squares of the biases and for the sum of the variances of the regression coefficients, and hence show that the mean square error is
Assuming that is continuous and monotonic decreasing with and that is continuous and monotonic increasing with , deduce that there always exists a k such that MSEk < MSE0 (Theobald, 1974).
8. Show that the matrix in Section 8.6 satisfies and that if is square and non-singular then vanishes.
9. Consider the following particular case of the two way layout. Suppose that eight plots are harvested, on four of which one variety has been sown, while a different variety has been sown on the other four. Of the four plots with each variety, two different fertilizers have been used on two each. The yield will be normally distributed with a mean θ dependent on the fertilizer and the variety and with variance . It is supposed a priori that the means for plots sown with the two different varieties are independently normally distributed with mean α and variance , while the effect of the two different fertilizers will add an amount which is independently normally distributed with mean β and variance . This fits into the situation described in Section 8.6 with being times an identity matrix and
Find the matrix needed to find the posterior of θ.
10. Generalize the theory developed in Section 8.6 to deal with the case where and and knowledge of is vague to deal with the case where (Lindley and Smith, 1972).
11. Find the elements of the variance–covariance matrix for the one way model in the case where ni=n for all i.
9 The Gibbs sampler and other numerical methods
9.1 Introduction to numerical methods
9.1.1 Monte Carlo methods
Bayesian statistics proceeds smoothly and easily as long as we stick to well-known distributions and use conjugate priors. But in real life, it is often difficult to model the situations which arise in practice in that way; instead, we often arrive at a posterior
but have no easy way of finding the constant multiple and must have recourse to some numerical technique. This is even more true when we come to evaluate the marginal density of a single component of . A simple example was given earlier in Section 3.9 on ‘The circular normal distribution’, but the purpose of this chapter is to discuss more advanced numerical techniques, and the Gibbs sampler in particular.
There are, of course, many techniques available for numerical integration, good discussions of which can be found in, for example, Davis and Rabinowitz (1984), Evans (1993) or Evans and Swartz (2000). Most of the methods we shall discuss are related to the idea of ‘Monte Carlo’ integration as a method of finding an expectation. In the simplest version of this, we write
where the points x1, x2, … , xn are chosen as independent ‘pseudo-random’ numbers with density p(x) on the interval (a, b), which in the simplest case is the uniform distribution U(a, b). Although this technique compares relatively well with other methods of numerical integration in higher dimensions, ‘In all [dimensions], crude Monte Carlo exists as a last resort, especially for integrals defined over nonstandard regions and for integrands of low-order continuity. It is usually quite reliable but very expensive and not very accurate’ (Davis and Rabinowitz, op. cit., Section 5.10). We shall not, however, use crude Monte Carlo as such, but rather certain refinements of it.
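A minimal sketch of crude Monte Carlo in this simplest form, in Python, using uniform pseudo-random numbers; the integrand and the interval are chosen purely as an example.

import numpy as np

rng = np.random.default_rng(2)

def crude_monte_carlo(f, a, b, n=100_000):
    # Estimate the integral of f over (a, b) as (b - a) times the average of f
    # at n points drawn uniformly from (a, b)
    x = rng.uniform(a, b, size=n)
    return (b - a) * np.mean(f(x))

# Example: the integral of exp(-x**2/2) over (0, 2), whose exact value is about 1.196
estimate = crude_monte_carlo(lambda x: np.exp(-x**2 / 2), 0.0, 2.0)
print("crude Monte Carlo estimate:", estimate)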
An extension of crude Monte Carlo is provided by importance sampling. This method is useful when we want to find a parameter θ which is defined as the expectation of a function f(x) with respect to a density q(x), but we cannot easily generate random variables with that density, although we can generate variates xi with a density p(x) such that p(x) roughly approximates |f(x)|q(x) over the range of integration. Then
The function p(x) is called an importance function.
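The following Python sketch illustrates importance sampling as just described: the expectation of f(x) with respect to a density q(x) is estimated from draws generated with an importance function p(x). The particular choices of f, q and p here are assumptions made only for the illustration.

import numpy as np

rng = np.random.default_rng(3)

def normal_pdf(x, mu, sigma):
    return np.exp(-((x - mu) ** 2) / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))

# Target: theta = E_q[f(X)] with q the standard normal density and f(x) = x**4,
# so that theta = 3.  Importance function p: a normal density with a larger spread,
# which roughly covers |f(x)|q(x) over the effective range of integration.
f = lambda x: x**4
n = 200_000
x = rng.normal(0.0, 2.0, size=n)                             # draws from p
weights = normal_pdf(x, 0.0, 1.0) / normal_pdf(x, 0.0, 2.0)  # q(x)/p(x)

theta_hat = np.mean(f(x) * weights)
print("importance sampling estimate:", theta_hat)   # should be close to 3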
9.1.2 Markov chains
Another key notion is that of a Markov chain, which can be thought of as a model for a system which moves randomly through a series of ‘states’ without having any memory of where it has been – where it jumps to next depends solely on where it is now. Another way of putting this is to say that given the present, the past and the future are independent. We thus have a probability density, called the transition probability density, representing the chance of taking the state y at time t given that its state at time t–1 is x. In the cases we are most interested in, this density will not depend on t and so takes the form p(y|x). If the probability density of its state at time 0 is p(0)(x), then clearly the density of its state at time 1 is given by the generalized addition law as
A similar result with the sum replaced by an integral holds when the state space, that is, the set of possible states, is continuous. Iterating this process we can clearly find the distribution of the state at any time t in terms of p(0)(x) and p(y|x). The key point for our purposes is that in many cases this density converges to a limit p(y) which does not depend on p(0)(x) but rather is uniquely determined by p(y|x). The limiting distribution is known as the stationary distribution or the invariant distribution. Properties of Markov chains which can take only a finite or countable number of values are discussed in Feller (1968, Volume 1; Chapter XV) and an indication of the general theory is given in Breiman (1968, Section 7.4). A more detailed account can be found in Meyn and Tweedie (1993) and a more applied one in Norris (1997).
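To illustrate this convergence numerically, the following Python sketch iterates p(t) = p(t−1)P for a small finite-state chain with an invented transition matrix; two different initial distributions lead to the same stationary distribution.

import numpy as np

# Invented transition matrix: row i gives the distribution of the next state
# given that the current state is i
P = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3],
              [0.2, 0.2, 0.6]])

def iterate(p0, P, t=200):
    p = np.asarray(p0, dtype=float)
    for _ in range(t):
        p = p @ P   # generalized addition law: p_new(y) = sum over x of p(x) p(y|x)
    return p

print(iterate([1.0, 0.0, 0.0], P))   # starting in state 0
print(iterate([0.0, 0.0, 1.0], P))   # starting in state 2: same limiting distribution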