Our intention is to fit a smooth curve like the one shown in Figure 9.12 through these points in the same way that a straight line is fitted in ordinary regression. The procedure here has to be different because proportions, which estimate probabilities, have to lie between 0 and 1.
Figure 9.12 Logistic curve.
We define
$$\operatorname{logit}(\pi) = \log\left(\frac{\pi}{1-\pi}\right)$$
and then observe that a suitable family of curves is given by
$$\operatorname{logit}(\pi) = \beta_0 + \beta_1 x,$$
or equivalently
$$\pi = \frac{\exp(\beta_0 + \beta_1 x)}{1 + \exp(\beta_0 + \beta_1 x)}.$$
We have a binomial likelihood
$$l(\beta_0, \beta_1 \mid \boldsymbol{y}) \propto \prod_i \pi_i^{y_i} (1 - \pi_i)^{n_i - y_i}$$
(ignoring the binomial coefficients) with corresponding log-likelihood
$$L(\beta_0, \beta_1 \mid \boldsymbol{y}) = \sum_i \left\{ y_i \log \pi_i + (n_i - y_i) \log(1 - \pi_i) \right\},$$
where the binomial probabilities $\pi_i$ are related to the values of the explanatory variable $x_i$ by the above logistic equation. We then consider how to choose values of $\beta_0$ and $\beta_1$ so that the chosen curve best fits the data. The obvious general principle used by classical statisticians is that of maximum likelihood, and a simple iteration gives the maximum likelihood values for the data in the example (a sketch of how these can be computed in R is given below).
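The following R fragment is a minimal sketch of the classical fit, assuming the data quoted in the WinBUGS listing later in this section.

y <- c(0, 9, 21, 47, 60, 63)                        # responses, as in the data listing below
n <- rep(70, 6)                                     # numbers at risk
x <- 0:5                                            # explanatory variable
fit <- glm(cbind(y, n - y) ~ x, family = binomial)  # maximum likelihood logistic regression
coef(fit)                                           # maximum likelihood estimates of beta0 and beta1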
There are nowadays many books which deal with logistic regression, usually from a classical standpoint, for example Kleinbaum (1994) and Hosmer and Lemeshow (1989).
In this book, we have tried to show how most of the standard statistical models can be fitted into a Bayesian framework, and logistic regression is no exception. All we need to do is to give $\beta_0$ and $\beta_1$ suitable prior distributions, noting that while we have no idea about the sign of $\beta_0$, we would expect $\beta_1$ to be positive. Unfortunately, we cannot produce a model with conjugate priors, but we are able to use Gibbs sampling, and a suitable program for use in WinBUGS is as follows.
model;
{
for (i in 1:6) {
y[i] ~ dbin(pi[i],n[i])
logit(pi[i]) <- beta0 + beta1*x[i]
}
beta0 ~ dnorm(0,0.001)
beta1 ~ dgamma(0.001,0.001)
}
data;
list(y=c(0,9,21,47,60,63),
n=c(70,70,70,70,70,70),
x=c(0,1,2,3,4,5))
inits;
list(beta0=0,beta1=1)
A run of 5000 after a burn-in of 1000 resulted in a posterior mean of –3.314 for $\beta_0$ and of 1.251 for $\beta_1$. Unsurprisingly, since we have chosen relatively uninformative priors, these are close to the maximum likelihood estimates.
Of course, a Bayesian analysis allows us to incorporate any prior information we may have by taking different priors for $\beta_0$ and $\beta_1$.
9.8.2 A general framework
We note that in Section 9.7, the parameter π of the binomial distribution is not itself linearly related to the explanatory variable x, but instead we have a relationship of the form
$$g(\pi) = \beta_0 + \beta_1 x,$$
where in the case of logistic regression the function g, which is known as the link function, is the logit function.
This pattern can be seen in various other models. For example, Gelman et al. (2004, Section 16.5) model the number $y_{ep}$ of persons of ethnicity e stopped by the police in precinct p by defining $n_{ep}$ as the number of arrests by the Division of Criminal Justice Services (DCJS) of New York State for that ethnic group in that precinct and then supposing that $y_{ep}$ has a Poisson distribution
$$y_{ep} \sim \mathrm{P}(\lambda_{ep}),$$
where
$$\log \lambda_{ep} = \log n_{ep} + \mu + \alpha_e + \beta_p,$$
where the coefficients $\alpha_e$ control for the ethnic group and the $\beta_p$ adjust for variation among precincts. If we write the model in this log-linear form and take indicator variables for ethnic group and precinct as the explanatory variates, this is a similar relationship to the one we encountered with logistic regression.
Similarly, a standard model often used for the table we encountered in Section 5.6 is to suppose that the entries in the four cells of the table each have Poisson distributions. More precisely, we suppose that the number in the top left-hand corner has some mean μ, that numbers in the second column have a mean α times bigger than corresponding values in the first column, and that numbers in the second row have means β times bigger than those in the first row (where, of course, α and β can be either greater or less than unity). It follows that the means in the four cells are μ, αμ, βμ and αβμ, respectively. In this case, we can regard the explanatory variables as the row and column in which any entry occurs, and the Poisson distributions we encounter have means satisfying the equation
$$\log(\text{mean}) = \log\mu + x_1 \log\alpha + x_2 \log\beta,$$
where $x_1$ is 1 for entries in the second column and 0 otherwise, and $x_2$ is 1 for entries in the second row and 0 otherwise.
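As an illustration of how such a log-linear model can be fitted classically in R (a sketch only, using purely hypothetical counts rather than the data of Section 5.6):

count <- c(10, 20, 15, 45)                          # hypothetical cell counts, in row order
row2  <- c(0, 0, 1, 1)                              # indicator of the second row
col2  <- c(0, 1, 0, 1)                              # indicator of the second column
fit <- glm(count ~ row2 + col2, family = poisson)   # Poisson regression with a log link
exp(coef(fit))                                      # estimates of mu, beta (row effect) and alpha (column effect)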
Further discussion of generalized linear models can be found in Dobson (2002) or McCullagh and Nelder (1989), and in a Bayesian context in Gelman et al. (2004, Chapter 16) and Dey et al. (2000).
9.9 Exercises on Chapter 9
1. Find the value of by crude Monte Carlo integration using a sample size of n=10 values from a uniform distribution U(0, 1) taken from tables of random numbers [use, e.g. groups of random digits from Lindley and Scott (1995, Table 27) or Neave (1978, Table 8.1)]. Repeat the experiment ten times and compute the overall mean and the sample standard deviation of the values you obtain. What is the theoretical value of the population standard deviation and how does the value you obtained compare with it?
2. Suppose that, in a Markov chain with just two states, the probabilities of going from state i to state j in one time unit are given by the entries of the matrix
in which i represents the row and j the column. Show that the probability of getting from state i to state j in t time units is given by the tth power of the matrix and that
Deduce that, irrespective of the state the chain started in, after a long time it will be in the first state with probability and in the second state with probability .
3. Smith (1969, Section 21.10) quotes an example on genetic linkage in which we have observations with cell probabilities
The values quoted are x1=461, x2=130, x3=161 and x4=515. Divide x1 into y0 and y1 and x4 into y4 and y5 to produce augmented data and use the EM algorithm to estimate η.
4. Dempster et al. (1977) define a generalized EM algorithm (abbreviated as a GEM algorithm) as one in which $Q(\theta^{(t+1)}, \theta^{(t)}) \geqslant Q(\theta^{(t)}, \theta^{(t)})$. Give reasons for believing that GEM algorithms converge to the posterior mode.
5. In question 16 in Chapter 2, we supposed that the results of a certain test were known, on the basis of general theory, to be normally distributed about the same mean μ with the same variance φ, neither of which is known. In that question, we went on to suppose that your prior beliefs about (μ, φ) could be represented by a normal/chi-squared distribution with
Find a semi-conjugate prior which has marginal distributions that are close to the marginal distributions of the normal/chi-squared prior but is such that the mean and variance are independent a priori. Now suppose as previously that 100 observations are obtained from the population with mean 89 and sample variance $s^2 = 30$. Find the posterior distribution of μ and φ. Compare the posterior mean obtained by the EM algorithm with that obtained from the fully conjugate prior.
6. A textile company weaves a fabric on a large number of looms. Four looms are selected at random from those available, and four observations of the tensile strength of fabric woven on each of these looms are available (there is no significance to the order of the observations from each of the looms), and the resulting data are as follows:
Estimate the means for each of the looms, the overall mean, the variance of observations from the same loom, and the variance of means from different looms in the population.
7. Write computer programs in C++ equivalent to the programs in R in this chapter.
8. Use the data augmentation algorithm to estimate the posterior density of the parameter η in the linkage model in question 3.
9. Suppose that $x \sim \mathrm{B}(n, \pi)$ and $\pi \sim \mathrm{Be}(\alpha, \beta)$, where n is a Poisson variable of mean λ as opposed to being fixed as in Section 9.4. Use the Gibbs sampler (chained data augmentation) to find the unconditional distribution of n for particular values of α, β and λ (cf. Casella and George, 1992).
10. Use the Gibbs sampler (chained data augmentation) to find the mean and variance of the posterior distribution of θ for the data in question 5 above, using the prior you derived in answer to that question.
11. The following data represent the weights of r=30 young rats measured weekly for n=5 weeks as quoted by Gelfand et al. (1990), Tanner (1996, Table 1.3 and Section 6.2.1), Carlin and Louis (2000, Example 5.6):
The weight of the ith rat in week j is denoted $x_{ij}$ and we suppose that weight growth is linear, that is,
$$\mathrm{E}(x_{ij}) = \alpha_i + \beta_i j,$$
but that the slope and intercept vary from rat to rat. We further suppose that $\alpha_i$ and $\beta_i$ have a bivariate normal distribution, so that we have a random effects model. At the third stage, we give prior distributions to the parameters of this bivariate normal distribution, the precision matrix being assigned a Wishart prior, where we use the notation $W(\boldsymbol{\Omega}, \nu)$ for the Wishart distribution of a random symmetric positive definite $k \times k$ matrix $\boldsymbol{V}$, which has density
$$p(\boldsymbol{V}) \propto |\boldsymbol{V}|^{(\nu - k - 1)/2} \exp\left\{-\tfrac{1}{2}\operatorname{tr}\left(\boldsymbol{\Omega}^{-1}\boldsymbol{V}\right)\right\}.$$
Methods of sampling from this distribution are described in Odell and Feiveson (1966), Kennedy and Gentle (1990, Section 6.5.10) and Gelfand et al. (1990). [This example was omitted from the main text because we have avoided use of the Wishart distribution elsewhere in the book. A slightly simpler model in which the covariance matrix of the random effects is assumed to be diagonal is to be found as the example ‘Rats’ distributed with WinBUGS.]
Explain in detail how you would use the Gibbs sampler to estimate the posterior distributions of the unknown parameters, and if possible carry out this procedure.
12. Use the Metropolis–Hastings algorithm to estimate the posterior density of the parameter η in the linkage model in Sections 9.2 and 9.3 using candidate values generated from a uniform distribution on (0, 1) [cf. Tanner (1996, Section 6.5.2)].
13. Write a WinBUGS program to analyze the data on wheat yield considered towards the end of Section 2.13 and in Section 9.3.
14. In bioassays, the response may vary with a covariate termed the dose. A typical example involving a binary response is given in the following table, where R is the number of beetles killed after 5 hours of exposure to gaseous carbon disulphide at various concentrations (data from Bliss, 1935, quoted by Dobson, 2002, Example 7.3.1).
Dose $x_i$ (log$_{10}$ CS$_2$ mg l$^{-1}$)   Number of insects, $n_i$   Number killed, $r_i$
1.6907 59 6
1.7242 60 13
1.7552 62 18
1.7842 56 28
1.8113 63 52
1.8369 59 53
1.8610 62 61
1.8839 60 60
Fit a logistic regression model and plot the proportion killed against dose and the fitted line.
1. Sometimes called a jumping distribution.
10
Some approximate methods
10.1 Bayesian importance sampling
Importance sampling was first mentioned towards the start of Section 9.1. We said then that it is useful when we want to find a parameter θ which is defined as the expectation of a function f(x) with respect to a density q(x) but we cannot easily generate random variables with that density, although we can generate variates $x_i$ with a density p(x) which is such that p(x) roughly approximates |f(x)|q(x) over the range of integration. Then
$$\theta = \int f(x)\,q(x)\,\mathrm{d}x = \int \frac{f(x)\,q(x)}{p(x)}\,p(x)\,\mathrm{d}x \approx \frac{1}{n}\sum_{i=1}^{n} \frac{f(x_i)\,q(x_i)}{p(x_i)}.$$
The function p(x) is called an importance function. In the words of Wikipedia, ‘Importance sampling is a variance reduction technique that can be used in the Monte Carlo method. The idea behind importance sampling is that certain values of the input random variables in a simulation have more impact on the parameter being estimated than others. If these ‘important’ values are emphasized by sampling more frequently, then the estimator variance can be reduced.’
A case where this technique is easily seen to be valuable occurs in evaluating tail areas of the normal density function (although this function is, of course, well enough tabulated for it to be unnecessary in this case). If, for example, we wanted to find the probability that a standard normal variate was greater than 4, so that
$$\theta = \int_4^{\infty} \frac{1}{\sqrt{2\pi}} \exp\left(-\tfrac{1}{2}x^2\right)\,\mathrm{d}x,$$
a naïve Monte Carlo method would be to find the proportion of values of such a variate in a large sample that turned out to be greater than 4. However, even with a sample as large as n=10 000 000 a trial found just 299 such values, implying that the required integral came to 0.000 030 0, whereas a reference to tables shows that the correct value of θ is 0.000 031 67, which is, therefore, underestimated by 5%. In fact, the number we observe will have a binomial distribution and our estimate will have a standard deviation of $\sqrt{\theta(1-\theta)/n}$.
We can improve on this slightly by noting that, by symmetry, $\theta = \tfrac{1}{2}\mathrm{P}(|x| > 4)$, and therefore the fact that in the same trial 619 values were found to exceed 4 in absolute value leads to an estimate of 0.000 030 95, this time an underestimate by about 2%. This time the standard deviation can easily be shown to be $\sqrt{\theta(1-2\theta)/(2n)}$. There are various other methods by which slight improvements can be made (see Ripley, 1987, Section 5.1).
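A minimal R sketch of this crude approach (the counts obtained will of course vary from run to run):

z <- rnorm(10000000)            # 10 000 000 standard normal variates
mean(z > 4)                     # crude Monte Carlo estimate of P(Z > 4)
mean(abs(z) > 4)/2              # the slightly improved estimate using symmetry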
A considerably improved estimate can be obtained by importance sampling with p(x) being the density function of an exponential distribution E(1) of mean 1 (see Appendix A.4) truncated to the range $x > 4$, so that
$$p(x) = \mathrm{e}^{-(x-4)} \qquad (x > 4).$$
It is easily seen that we can generate a random sample from this density by simply generating values from the usual exponential distribution E(1) of mean 1 and adding 4 to these values. By using this truncated distribution, we avoid needing to generate values which will never be used. We can then take as an importance sampling estimator
$$\widehat{\theta} = \frac{1}{n}\sum_{i=1}^{n} \frac{q(x_i)}{p(x_i)} = \frac{1}{n}\sum_{i=1}^{n} \frac{\exp(-\tfrac{1}{2}x_i^2)/\sqrt{2\pi}}{\exp\{-(x_i - 4)\}}.$$
Using this method, a sample of size only 1000 resulted in an estimate of 0.000 031 5 to three significant figures, which is within 1% of the correct value.
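A minimal R sketch of this importance sampling estimator (the exact figure obtained depends on the random numbers generated):

m <- 1000
x <- 4 + rexp(m)                    # sample from E(1) shifted to lie in (4, infinity)
w <- dnorm(x)/exp(-(x - 4))         # weights q(x)/p(x), here with f(x) = 1
mean(w)                             # importance sampling estimate of P(Z > 4)
sd(w)/sqrt(m)                       # rough standard error of the estimate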
With any importance sampler, each observed value x has density p(x) and so w(x) = f(x)q(x)/p(x) has mean
$$\mathrm{E}\,w(x) = \int \frac{f(x)\,q(x)}{p(x)}\,p(x)\,\mathrm{d}x = \int f(x)\,q(x)\,\mathrm{d}x = \theta$$
and variance
$$\mathcal{V}\,w(x) = \int \frac{\{f(x)\,q(x)\}^2}{p(x)}\,\mathrm{d}x - \theta^2.$$
If $f(x) \geqslant 0$ for all x and we take
$$p(x) = \frac{f(x)\,q(x)}{\theta},$$
then w(x) is constant and so has variance 0, which is necessarily a minimum. This choice, however, is impossible because to make it we would need to know the value of θ, which is what we are trying to find. Nevertheless, if fq/p is roughly constant, then we will obtain satisfactory results. Very poor results occur when fq/p is small with a high probability but can be very large with a low probability, as, for example, when fq has very wide tails compared with p.
For a more complicated example, suppose has a gamma density
(see Section A.4). Then there is a family of conjugate priors of the form
where and are hyperparameters, since the posterior is then equal to
(we write q rather than p in accordance with our notation for importance sampling). This does not come from any of the well-known distributions and the constant of proportionality is unknown, although it can be approximated by normal or gamma densities. In Figure 10.1, we show the unnormalized density when and . It appears symmetric about a value near 1.5 with most of the values between 0.75 and 2.25, so it seems reasonable to use as an importance function an approximation such as N(1.5, 0.75) or, since it is asymmetric, a gamma density with the same mean and variance, so a G(4, 0.375) distribution, and these two are also illustrated in the figure. There is a discrepancy in scale because the constant of proportionality of the density we are interested in is unknown.
Figure 10.1 The conjugate gamma distribution.
While in the middle of the range either of these distributions fits reasonably well, the gamma approximation is considerably better in the tails. We can begin by finding the constant of proportionality using the importance sampling formula with f(x)=1 and p(x) taken as the G(4, 0.375) gamma density suggested above. Use of this method with a sample size of 10 000 resulted in a value for the constant of 2.264, where the true value is approximately 2.267. We can use the same sample to find a value of the mean, this time taking f(x)=x/k, and finding a value of 1.694 where the true value is 1.699 (we would have taken a different importance function if we were concerned only with this integral, but when we want results for both f(x)=1 and f(x)=x/k it saves time to use the same sample). In the same way, we can obtain a value of 0.529 for the variance, where the true value is 0.529.
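The same device can be sketched in R for a generic unnormalized density. Purely for illustration, the target below is taken to be q(x) = exp(−x²/2), whose true constant of proportionality is √(2π) ≈ 2.507 and whose true mean is 0 (this is not the density discussed above), with a roughly matching normal importance function:

q <- function(x) exp(-x^2/2)              # an illustrative unnormalized density (not the one above)
p <- function(x) dnorm(x, 0, 1.5)         # importance function roughly matching q
x <- rnorm(10000, 0, 1.5)                 # sample from the importance function
k <- mean(q(x)/p(x))                      # f(x) = 1 gives the constant of proportionality
m <- mean(x*q(x)/p(x))/k                  # f(x) = x/k gives the mean, re-using the same sample
c(k, m)                                   # compare with sqrt(2*pi) and 0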
10.1.1 Importance sampling to find HDRs
It has been pointed out (see Chen and Shao, 1998) that we can use importance sampling to find HDRs. As we noted in Section 2.6, an HDR is an interval which is (for a given probability level) as short as possible. This being so, all we need to do is to take the sample generated by importance sampling, sort it and note that, for any desired value α, if the sorted values are $y_1 \leqslant y_2 \leqslant \cdots \leqslant y_n$ and $k = (1-\alpha)n$, then any interval of the form $(y_j, y_{j+k})$ will give an interval with an estimated posterior probability of $1-\alpha$, so for an HDR we merely need to find the shortest such interval. It may help to consider the following segment of R code:
alpha <- 0.1                  # 1 - alpha is the desired probability level
le <- (1-alpha)*n             # number of order statistics the interval must span
lo <- 1:(n-le)                # candidate lower end points
hi <- (le+1):n                # corresponding upper end points
y <- sort(samp)               # sort the sampled values
r <- y[hi]-y[lo]              # lengths of all candidate intervals
rm <- min(r)                  # length of the shortest interval
lom <- min(lo[r==rm])         # index of its lower end point
him <- min(hi[r==rm])         # index of its upper end point
abline(v=y[lom])              # mark the HDR limits on an existing plot
abline(v=y[him])
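The segment assumes that samp holds the sampled values, that n is their number and that a plot is already open for abline to draw on; an illustrative usage sketch with a sample from a known density is:

samp <- rnorm(10000)          # for illustration only; in practice the importance sample
n <- length(samp)
hist(samp, freq = FALSE)      # a plot for abline() to mark the interval limits on
# running the segment above should now give limits close to the 90% HDR (-1.645, 1.645)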
10.1.2 Sampling importance re-sampling
Another technique sometimes employed is sampling importance re-sampling (SIR). This provides a method for generating samples from a probability distribution which is hard to handle and for which the constant of proportionality may be unknown. We suppose that the (unnormalized) density in question is q(x). What we do is to generate a sample $x_1, x_2, \ldots, x_n$ of size n from a distribution with density p(x) which is easier to handle and is in some sense close to the distribution we are interested in. We then find the values of
$$w_i = \frac{q(x_i)}{p(x_i)}$$
and then normalize these to find
$$\pi_i = \frac{w_i}{\sum_{j=1}^{n} w_j}.$$
We then take a sample of the same size n with replacement from the values $x_i$ with probabilities $\pi_i$ to obtain a new sample. This operation is easily carried out in R by
x <- random sample of size n from density p
w <- q(x)/p(x)
pi <- w/sum(w)
samp <- sample(x,size=n,prob=pi,replace=T)
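For illustration (an assumption made purely for this sketch, not an example from the text), the target can be taken to be an unnormalized N(0, 1) kernel and the easy-to-handle density a Cauchy:

n <- 10000
x <- rcauchy(n)                               # sample from the easy-to-handle density p
w <- exp(-x^2/2)/dcauchy(x)                   # q(x)/p(x) with q an unnormalized N(0,1) kernel
pi <- w/sum(w)                                # normalized re-sampling probabilities
samp <- sample(x, size = n, prob = pi, replace = TRUE)
c(mean(samp), var(samp))                      # should be roughly 0 and 1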