6. A random sample is available from an distribution and a second independent random sample is available from an distribution. Obtain, under the usual assumptions, the posterior distributions of and of .
7. Verify the formula for S1 given towards the end of Section 5.2.
8. The following data consists of the lengths in mm of cuckoo’s eggs found in nests belonging to the dunnock and to the reed warbler:
Investigate the difference θ between these lengths without making any particular assumptions about the variances of the two populations, and in particular give an interval in which you are 90% sure that it lies.
9. Show that if m = n then the expression f1²/f2 in Patil’s approximation reduces to
10. Suppose that Tx, Ty and θ are defined as in Section 5.3 and that
Show that the transformation from (Tx, Ty) to (T, U) has unit Jacobian and hence show that the density of T satisfies
11. Show that if then
12. Two different microscopic methods, A and B, are available for the measurement of very small dimensions in microns. As a result of several such measurements on the same object, estimates of variance are available as follows:
Give an interval in which you are 95% sure that the ratio of the variances lies.
13. Measurement errors when using two different instruments are more or less symmetrically distributed and are believed to be reasonably well approximated by a normal distribution. Ten measurements with each show a sample standard deviation three times as large with one instrument as with the other. Give an interval in which you are 99% sure that the ratio of the true standard deviations lies.
14. Repeat the analysis of Di Raimondo’s data in Section 5.6 on the effects of penicillin on mice, this time assuming that you have prior knowledge worth about six observations in each case suggesting that the mean chance of survival is about a half with the standard injection but about two-thirds with the penicillin injection.
15. The undermentioned table [quoted from Jeffreys (1961, Section 5.1)] gives the relationship between grammatical gender in Welsh and psychoanalytical symbolism according to Freud:
Find the posterior probability that the log odds-ratio is positive and compare it with the comparable probability found by using the inverse root-sine transformation.
16. Show that if then the log odds-ratio is such that
17. A report issued in 1966 about the effect of radiation on patients with inoperable lung cancer compared the effect of radiation treatment with placebos. The numbers surviving after a year were:
What are the approximate posterior odds that the one-year survival rate of irradiated patients is at least 0.01 greater than that of those who were not irradiated?
18. Suppose that x ~ P(8.5), that is, x is Poisson of mean 8.5, and . What is the approximate distribution of x − y?
6
Correlation, regression and the analysis of variance
6.1 Theory of the correlation coefficient
6.1.1 Definitions
The standard measure of association between two random variables, which was first mentioned in Section 1.5 on ‘Means and Variances’, is the correlation coefficient

ρ(x, y) = Cov(x, y) / √{(Var x)(Var y)}.
It is used to measure the strength of linear association between two variables, most commonly in the case where it might be expected that both have, at least approximately, a normal distribution. It is most important in cases where it is not thought that either variable is dependent on the other. One example of its use would be an investigation of the relationship between the height and the weight of individuals in a population, and another would be in finding how closely related barometric gradients and wind velocities were. You should, however, be warned that it is very easy to conclude that measurements are closely related because they have a high correlation, when, in fact, the relationship is due to their having a common time trend or a common cause and there is no close relationship between the two (see the relationship between the growth of money supply and Scottish dysentery as pointed out in a letter to The Times dated 6 April 1977). You should also be aware that two closely related variables can have a low correlation if the relationship between them is highly non-linear.
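As an illustration of the last point (a sketch of mine, not taken from the text, and assuming numpy is available), the following Python fragment computes the sample correlation for data in which y is an exact but purely quadratic function of x; the symmetry of the relationship makes the correlation essentially zero even though the two variables are perfectly dependent.

import numpy as np

x = np.linspace(-3, 3, 101)     # symmetric range of x values
y = x**2                        # y is an exact function of x, but not a linear one
r = np.corrcoef(x, y)[0, 1]     # sample correlation coefficient
print(round(r, 6))              # approximately 0: symmetry destroys the linear association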
We suppose, then, that we have a set of n ordered pairs of observations, the pairs being independent of one another but members of the same pair being, in general, not independent. We shall denote these observations (xi, yi) and, as usual, we shall write x̄ = Σ xi/n and ȳ = Σ yi/n. Further, suppose that these pairs have a bivariate normal distribution with

E xi = λ,   E yi = μ,   Var xi = φ,   Var yi = ψ,   ρ(xi, yi) = ρ,

and we shall use the notation

Sxx = Σ (xi − x̄)²,   Syy = Σ (yi − ȳ)²

(Sxx and Syy have previously been denoted Sx and Sy), and

Sxy = Σ (xi − x̄)(yi − ȳ).

It is also useful to define the sample correlation coefficient r by

r = Sxy / √(Sxx Syy)
so that .
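The following Python sketch (not part of the text; the data shown are hypothetical, purely for illustration) simply implements these definitions of Sxx, Syy, Sxy and r.

import numpy as np

def sample_correlation(x, y):
    """Return S_xx, S_yy, S_xy and r for paired observations (x_i, y_i)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    xbar, ybar = x.mean(), y.mean()
    Sxx = np.sum((x - xbar) ** 2)
    Syy = np.sum((y - ybar) ** 2)
    Sxy = np.sum((x - xbar) * (y - ybar))
    return Sxx, Syy, Sxy, Sxy / np.sqrt(Sxx * Syy)

# Hypothetical paired data
x = [22.1, 21.8, 23.0, 22.4, 21.5]
y = [16.2, 16.0, 16.9, 16.5, 15.8]
print(sample_correlation(x, y))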
We shall show that, with standard reference priors for λ, μ, φ and ψ, a reasonable approximation to the posterior density of ρ is given by

p(ρ | x, y) ∝ p(ρ) (1 − ρ²)^((n−1)/2) / (1 − ρr)^(n−3/2),

where p(ρ) is its prior density. Making the substitution

ζ = tanh⁻¹ ρ,   z = tanh⁻¹ r,

we will go on to show that after another approximation

ζ ~ N(z, 1/n).
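For readers who want to experiment numerically, the sketch below (mine, not the book's; the function name and grid are my own choices) evaluates the approximate posterior density just quoted on a grid, taking a uniform prior p(ρ) ∝ 1.

import numpy as np

def approx_posterior_rho(n, r, grid=2001):
    """Approximate posterior of rho with a uniform prior:
    density proportional to (1 - rho^2)^((n-1)/2) / (1 - rho*r)^(n - 3/2)."""
    rho = np.linspace(-0.999, 0.999, grid)
    dens = (1 - rho**2) ** ((n - 1) / 2) / (1 - rho * r) ** (n - 1.5)
    dens /= dens.sum() * (rho[1] - rho[0])   # crude numerical normalization
    return rho, dens

rho, dens = approx_posterior_rho(n=9, r=0.810)   # values as in Section 6.2.1 below
print(rho[np.argmax(dens)])                      # posterior mode, a little below r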
These results will be derived after quite a complicated series of substitutions [due to Fisher (1915, 1921)]. Readers who are prepared to take these results for granted can omit the rest of this section.
6.1.2 Approximate posterior distribution of the correlation coefficient
As before, we shall have use for the formulae

Σ (xi − λ)² = Sxx + n(x̄ − λ)²,   Σ (yi − μ)² = Syy + n(ȳ − μ)²,

and also for a similar one not used before

Σ (xi − λ)(yi − μ) = Sxy + n(x̄ − λ)(ȳ − μ).

Now the (joint) density function of a single pair (x, y) of observations from a bivariate normal distribution is

p(x, y | λ, μ, φ, ψ, ρ) = {2π √(φψ(1 − ρ²))}⁻¹ exp(−Q/2),

where

Q = (1 − ρ²)⁻¹ { (x − λ)²/φ − 2ρ(x − λ)(y − μ)/√(φψ) + (y − μ)²/ψ }
and hence the joint density of the vector is
where
It follows that the vector (x̄, ȳ, Sxx, Syy, Sxy) is sufficient for (λ, μ, φ, ψ, ρ). For the moment, we shall use independent priors of a simple form. For λ, μ, φ and ψ, we shall take the standard reference priors, and for ρ a perfectly general prior p(ρ), so that

p(λ, μ, φ, ψ, ρ) ∝ p(ρ) / φψ
and hence
The last factor is evidently the (joint) density of λ and μ considered as bivariate normal with means x̄ and ȳ, variances φ/n and ψ/n and correlation ρ. Consequently it integrates to unity, and so, as the first factor does not depend on λ or μ,
To integrate φ and ψ out, it is convenient to define
so that and . The Jacobian is
and hence
where
The substitution (so that ) reduces the integral over to a standard gamma function integral, and hence we can deduce that
Finally, integrating over ω
By substituting for ω it is easily checked that the integral from 0 to 1 is equal to that from 1 to ∞, so that, as constant multiples are irrelevant, the lower limit of the integral can be taken to be 1 rather than 0.
By substituting , the integral can be put in the alternative form
The exact distribution corresponding to has been tabulated in David (1954), but for most purposes it suffices to use an approximation. The usual way to proceed is by yet a further substitution, in terms of u where , but this is rather messy and gives more than is necessary for a first-order approximation. Instead, note that for small t
while the contribution to the integral from values where t is large will, at least for large n, be negligible. Using this approximation
On substituting
the integral is seen to be proportional to
Since the integral in this last expression does not depend on ρ, we can conclude that
Although evaluation of the constant of proportionality would still require the use of numerical methods, it is much simpler to calculate the distribution of ρ using this expression than to have to evaluate an integral for every value of ρ. In fact, the approximation is quite good [some numerical comparisons can be found in Box and Tiao (1992, Section 8.4.8)].
6.1.3 The hyperbolic tangent substitution
Although the exact mode does not usually occur at ρ = r, it is easily seen that for plausible choices of the prior p(ρ), the approximate density derived earlier is greatest when ρ is near r. However, except when r = 0, this distribution is asymmetrical. Its asymmetry can be reduced by writing

ζ = tanh⁻¹ ρ,   z = tanh⁻¹ r,

so that ρ = tanh ζ and r = tanh z.
It follows that
If n is large, since the factor does not depend on n, it may be regarded as approximately constant over the range over which is appreciably different from zero, so that
Finally put
and note that if ζ is close to z then . Putting this into the expression for and using the exponential limit
so that approximately ζ ~ N(z, 1/n), or equivalently tanh⁻¹ ρ ~ N(tanh⁻¹ r, 1/n).
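A quick numerical check of this result (my own sketch, with illustrative values n = 50 and r = 0.6 that are not from the text) is to compute the mean and variance of ζ = tanh⁻¹ ρ under the approximate density derived earlier, with a uniform prior, and compare them with z and 1/n.

import numpy as np

n, r = 50, 0.6                 # illustrative values, not from the text
z = np.arctanh(r)

rho = np.linspace(-0.999, 0.999, 20001)
dens = (1 - rho**2) ** ((n - 1) / 2) / (1 - rho * r) ** (n - 1.5)  # uniform prior
step = rho[1] - rho[0]
dens /= dens.sum() * step

zeta = np.arctanh(rho)
mean = np.sum(zeta * dens) * step
var = np.sum((zeta - mean) ** 2 * dens) * step
print(mean, z)        # posterior mean of zeta, roughly z = arctanh(r)
print(var, 1 / n)     # posterior variance of zeta, roughly 1/n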
A slightly better approximation to the mean and variance can be found by using approximations based on the likelihood as in Section 3.10. If we take a uniform prior for ρ or at least assume that the prior does not vary appreciably over the range of values of interest, we get
We can now approximate ρ by r (we could write and so get a better approximation, but it is not worth it). We can also approximate by n, so getting the root of the likelihood equation as
Further
so that again approximating ρ by r, we have at
It follows that the distribution of ζ is given slightly more accurately by
This approximation differs a little from that usually given by classical statisticians, who quote the variance as (n − 3)⁻¹, but the difference is not of great importance.
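The practical effect of this difference is easy to see numerically; the short sketch below (mine, not the book's) compares the half-widths of 95% intervals for ζ computed with variance 1/n and with the classical (n − 3)⁻¹.

import numpy as np

# Half-widths of 95% intervals for zeta under the two variance conventions
for n in (10, 25, 100):
    print(n, 1.96 * np.sqrt(1 / n), 1.96 * np.sqrt(1 / (n - 3)))

Even for n = 10 the two widths differ by less than 20%, and for n = 100 by only about 1.5%.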
6.1.4 Reference prior
Clearly, the results will be simplest if the prior used has the form

p(ρ) ∝ (1 − ρ²)^c

for some c. The simplest choice is to take c = 0, that is, a uniform prior with p(ρ) ∝ 1, and it seems quite a reasonable choice. It is possible to use the multi-parameter version of Jeffreys’ rule to find a prior for (φ, ψ, ρ), though it is not wholly simple. The easiest way is to write for the covariance and to work in terms of the inverse of the variance–covariance matrix, that is, in terms of where
It turns out that , where Δ is the determinant , and that the Jacobian determinant
so that . Finally, transforming to the parameters that are really of interest, it transpires that

p(φ, ψ, ρ) ∝ (1 − ρ²)^(−3/2) / φψ,

which corresponds to the choice c = −3/2 and the standard reference priors for φ and ψ.
6.1.5 Incorporation of prior information
It is not difficult to adapt the aforementioned analysis to the case where prior information from the conjugate family [i.e. inverse chi-squared for φ and ψ and of the form (1 − ρ²)^c for ρ] is available. In practice, this information will usually be available in the form of previous measurements of a similar type, and in this case it is best dealt with by transforming all the information about ρ into statements about ζ = tanh⁻¹ ρ, so that the theory we have built up for the normal distribution can be used.
6.2 Examples on the use of the correlation coefficient
6.2.1 Use of the hyperbolic tangent transformation
The following data is a small subset of a much larger quantity of data on the length and breadth (in mm) of the eggs of cuckoos (C. canorus).
Here n = 9, , , Sxx = 12.816, Syy = 6.842, Sxy = 7.581, r = 0.810 and so z = tanh⁻¹ 0.810 = 1.127 and 1/n = 1/9. We can conclude that with 95% posterior probability ζ is in the interval 1.127 ± 1.96√(1/9), that is, (0.474, 1.780), giving rise to (0.441, 0.945) as a corresponding interval for ρ, using Lindley and Scott (1995, Table 17) or Neave (1978, Table 6.3).
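The interval just quoted can be reproduced with a few lines of Python (a sketch, not from the text), using numpy's arctanh and tanh for the transformation and its inverse.

import numpy as np

n, r = 9, 0.810
z = np.arctanh(r)                  # 1.127
half = 1.96 * np.sqrt(1 / n)       # 95% half-width for zeta
lo, hi = z - half, z + half
print(round(lo, 3), round(hi, 3))                    # about (0.474, 1.780)
print(round(np.tanh(lo), 3), round(np.tanh(hi), 3))  # about (0.441, 0.945)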
6.2.2 Combination of several correlation coefficients
One of the important uses of the hyperbolic tangent transformation lies in the way in which it makes it possible to combine different observations of the correlation coefficient. Suppose, for example, that on one occasion we observe that r = 0.7 on the basis of 19 observations and on another we observe that r = 0.9 on the basis of 25 observations. Then after the first set of observations, our posterior for ζ is N(tanh⁻¹ 0.7, 1/19). The second set of observations now puts us into the situation of a normal prior and likelihood, so the posterior after all the observations is still normal, with variance

(19 + 25)⁻¹ = 1/44 ≈ 0.0227

and mean

(19 tanh⁻¹ 0.7 + 25 tanh⁻¹ 0.9)/44 = 1.210

(using rounded values), that is, the posterior for ζ is N(1.210, 0.0227), suggesting a point estimate of tanh 1.210 = 0.8367 for ρ.
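The same calculation can be written as a short Python sketch (again mine rather than anything in the text).

import numpy as np

n1, r1 = 19, 0.7
n2, r2 = 25, 0.9
z1, z2 = np.arctanh(r1), np.arctanh(r2)

post_var = 1 / (n1 + n2)                      # 1/44, about 0.0227
post_mean = (n1 * z1 + n2 * z2) / (n1 + n2)   # about 1.21
print(round(post_mean, 3), round(post_var, 4))
print(round(np.tanh(post_mean), 4))           # point estimate for rho, about 0.837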
The transformation also allows one to investigate whether or not the correlations on the two occasions really were from the same population or at least from reasonably similar populations.
6.2.3 The squared correlation coefficient
There is a temptation to take r itself too seriously and to think that if it is very close to 1 then the two variables are closely related, but we will see shortly, when we come to consider regression, that r², which measures the proportion of the variance of one variable that can be accounted for by the other variable, is in many ways at least as useful a quantity to consider.
6.3 Regression and the bivariate normal model
6.3.1 The model
The problem we will consider in this section is that of using the values of one variable to explain or predict values of another. We shall refer to an explanatory and a dependent variable, although it is conventional to refer to an independent and a dependent variable. An important reason for preferring the phrase explanatory variable is that the word ‘independent’ if used in this context has nothing to do with the use of the word in the phrase ‘independent random variable’. Some authors, for example, Novick and Jackson (1974, Section 9.1), refer to the dependent variable as the criterion variable. The theory can be applied, for example, to finding a way of predicting the weight (the dependent variable) of typical individuals in terms of their height (the explanatory variable). It should be noted that the relationship which best predicts weight in terms of height will not necessarily be the best relationship for predicting height in terms of weight.
The basic situation and notation are the same as in the last two sections, although in this case there is not the symmetry between the two variables that there was there. We shall suppose that the xs represent the explanatory variable and the ys the dependent variables.
There are two slightly different situations. In the first, the experimenters are free to set the values of xi, whereas in the second both values are random, although one is thought of as having a causal or explanatory relationship with the other. The analysis, however, turns out to be the same in both cases.
The most general model is
where in the first situation described above is a null vector and the distribution of is degenerate. If it is assumed that and have independent priors, so that , then
It is now obvious that we can integrate over to get
Technically, given , the vector is sufficient for and, given , the vector is ancillary for . It follows that insofar as we wish to make inferences about , we may act as if were constant.
6.3.2 Bivariate linear regression
We will now move on to a very important particular case. Suppose that conditional on we have
Thus,
unless one or more of , and are known, in which case the ones that are known can be dropped from . Thus, we are supposing that, on average, the dependence of the ys on the xs is linear. It would be necessary to use rather different methods if there were grounds for thinking, for example, that or that . It is also important to suppose that the ys are homoscedastic, that is, that the variance has the same constant value whatever the value of xi; modifications to the analysis would be necessary if it were thought that, for example, so that the variance increased with xi.
It simplifies some expressions to write as where, of course, , so that and , hence . The model can now be written as
Because a key feature of the model is the regression line on which the expected values lie, the parameter β is usually referred to as the slope and α is sometimes called the intercept, although this term is also sometimes applied to . For the rest of this section, we shall take a reference prior that is independently uniform in α, β and log φ, so that

p(α, β, φ) ∝ 1/φ.
In addition to the notation used in Sections 6.1 and 6.2, it is helpful to define

a = ȳ,   b = Sxy/Sxx,   See = Syy − S²xy/Sxx = Syy(1 − r²).
It then turns out that
Now since is a constant and the sum of squares can be written as
Thus, the joint posterior is
It is now clear that for given φ the posterior for β is N(b, φ/Sxx), and so we can integrate β out to get
(note the change in the exponent of φ).
In Section 2.12 on ‘Normal mean and variance both unknown’, we showed that if x1, x2, …, xn is a sample from N(μ, φ) and we take the standard reference prior p(μ, φ) ∝ 1/φ, then

(μ − x̄)/√(s²/n) ~ t_{n−1},   where s² = Σ (xi − x̄)²/(n − 1).

It follows from just the same argument that in this case the posterior for α given the data is such that if s² = See/(n − 2) then

(α − a)/√(s²/n) ~ t_{n−2}.
Similarly, the posterior of β can be found by integrating α out to show that

(β − b)/√(s²/Sxx) ~ t_{n−2}.
Finally, note that the posterior distribution of φ is such that See/φ ~ χ²_{n−2}.
It should, however, be noted that the posteriors for α and β are not independent, although they are independent for given φ.
It may be noted that the posterior means of α and β are a and b and that these are the values that minimize the sum of squares

Σ [yi − α − β(xi − x̄)]²,

and that See is the minimum sum of squares. This fact is clear because the sum is

See + n(α − a)² + Sxx(β − b)²,

and it constitutes the principle of least squares, for which reason a and b are referred to as the least squares estimates of α and β. The regression line

y = a + b(x − x̄),
which can be plotted for all x as opposed to just those xi observed, is called the line of best fit for y on x. The principle is very old; it was probably first published by Legendre but first discovered by Gauss; for its history see Harter (1974, 1975, 1976). It should be noted that the line of best fit for y on x is not, in general, the same as the line of best fit for x on y.
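To pull the results of this section together, here is a small Python sketch (mine, using hypothetical data, and assuming scipy is available for the t quantile) that computes a, b and See for the centred model E yi = α + β(xi − x̄) and gives posterior intervals for α and β based on the t results with n − 2 degrees of freedom quoted above.

import numpy as np
from scipy import stats

def regression_summary(x, y, level=0.95):
    """Least squares fit and reference-prior posterior intervals for the
    centred model E y_i = alpha + beta (x_i - xbar); a sketch of the usual results."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    xbar = x.mean()
    Sxx = np.sum((x - xbar) ** 2)
    Sxy = np.sum((x - xbar) * (y - y.mean()))
    a = y.mean()                                  # posterior mean of alpha
    b = Sxy / Sxx                                 # posterior mean of beta (the slope)
    See = np.sum((y - a - b * (x - xbar)) ** 2)   # minimum sum of squares
    s2 = See / (n - 2)
    tcrit = stats.t.ppf(0.5 + level / 2, df=n - 2)
    alpha_int = (a - tcrit * np.sqrt(s2 / n), a + tcrit * np.sqrt(s2 / n))
    beta_int = (b - tcrit * np.sqrt(s2 / Sxx), b + tcrit * np.sqrt(s2 / Sxx))
    return a, b, See, alpha_int, beta_int

# Hypothetical data, purely for illustration
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 2.9, 3.8, 5.2, 5.9, 7.1]
print(regression_summary(x, y))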