by Peter M Lee
   so that the rejection region takes the form
   which is the likelihood ratio test prescribed by Neyman–Pearson theory. A difference is that in the Neyman–Pearson theory, the ‘critical value’ of the rejection region is determined by fixing the size α, that is, the probability that x lies in the rejection region R if the null hypothesis is true, whereas in a decision theoretic approach, it is fixed in terms of the loss function and the prior probabilities of the hypotheses.
   7.7.2 Composite hypotheses
   If the hypotheses are composite (i.e. not simple), then, again as in Section 4.1 on ‘Hypothesis testing’,
   so that there is still a rejection region that can be interpreted in a similar manner. However, it should be noted that classical statisticians faced with similar problems are more inclined to work in terms of a likelihood ratio
   (cf. Lehmann, 1986, Section 1.7). In fact, it is possible to express quite a lot of the ideas of classical statistics in a language involving loss functions.
   It may be noted that it is easy to extend the above discussion about dichotomies (i.e. situations where a choice has to be made between two hypotheses) to deal with trichotomies or polytomies, although some theories of statistical inference find choices between more than two hypotheses difficult to deal with.
   7.8 Empirical Bayes methods
   7.8.1 Von Mises’ example
   Only a very brief idea about empirical Bayes methods will be given in this chapter; more will be said about this topic in Chapter 8 and a full account can be found in Maritz and Lwin (1989). One of the reasons for this brief treatment is that, despite their name, very few empirical Bayes procedures are, in fact, Bayesian; for a discussion of this point see, for example, Deely and Lindley (1981).
   The problems we will consider in this section are concerned with a sequence $x_i$ of observations such that the distribution of the $i$th observation $x_i$ depends on a parameter $\theta_i$, typically in such a way that $p(x_i\mid\theta_i)$ has the same functional form for all $i$. The parameters $\theta_i$ are themselves supposed to be a random sample from some (unknown) distribution, and it is this unknown distribution that plays the role of a prior distribution and so accounts for the use of the name of Bayes. There is a clear contrast with the situation in the rest of the book, where the prior distribution represents our prior beliefs, and so by definition it cannot be unknown. Further, the prior distribution in empirical Bayes methods is usually given a frequency interpretation, by contrast with the situation arising in true Bayesian methods.
   One of the earliest examples of an empirical Bayes procedure was due to von Mises (1942). He supposed that in examining the quality of a batch of water for possible contamination by certain bacteria, m = 5 samples of a given volume were taken, and he was interested in determining the probability θ that a sample contains at least one bacterium. Evidently, the probability of x positive results in the 5 samples is
   $$p(x\mid\theta) = \binom{5}{x}\,\theta^x (1-\theta)^{5-x}$$
   for a given value of θ. If the same procedure is to be used with a number of batches of different quality, then the predictive distribution (denoted $p_{\tilde x}(x)$ to avoid ambiguity) is
   $$p_{\tilde x}(x) = \int \binom{5}{x}\,\theta^x (1-\theta)^{5-x}\,p(\theta)\,d\theta$$
   where the density $p(\theta)$ represents the variation of the quality θ of batches. [If $p(\theta)$ comes from the beta family, and there is no particular reason why it should, then $p_{\tilde x}(x)$ is a beta-binomial distribution, as mentioned at the end of Section 3.1 on ‘The binomial distribution’.] In his example, von Mises wished to estimate the density function $p(\theta)$ on the basis of n = 3420 observations.
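   As an illustration of the predictive distribution in the beta case, the following minimal sketch evaluates the beta-binomial probabilities for m = 5 samples. The function names and the choice a = b = 1 (a uniform density for θ) are assumptions made purely for illustration and are not taken from von Mises' analysis.

```python
from math import comb, exp, lgamma


def log_beta(a, b):
    """Logarithm of the beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def beta_binomial_pmf(x, m, a, b):
    """Predictive probability of x positives in m samples when theta ~ Be(a, b).

    This is the integral of C(m, x) theta^x (1 - theta)^(m - x) against the
    Be(a, b) density, which reduces to C(m, x) B(a + x, b + m - x) / B(a, b).
    """
    return comb(m, x) * exp(log_beta(a + x, b + m - x) - log_beta(a, b))


m = 5              # number of samples per batch, as in the text
a, b = 1.0, 1.0    # hypothetical beta parameters (uniform density for theta)
predictive = [beta_binomial_pmf(x, m, a, b) for x in range(m + 1)]
```

   With a uniform density for θ, every value of x from 0 to 5 receives the same predictive probability 1/6; other choices of a and b give less symmetric predictive distributions.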
   7.8.2 The Poisson case
   Instead of considering the binomial distribution further, we shall consider a problem to do with the Poisson distribution which, of course, provides an approximation to the binomial distribution when the number m of samples is large and the probability θ is small. Suppose that we have observations $x_i \sim \mathrm{P}(\lambda_i)$, where the $\lambda_i$ have a distribution with a density $p(\lambda)$, and that we have available n past observations, among which $f_n(x)$ were equal to x for $x = 0, 1, 2, \dots$. Thus, $f_n(x)$ is an empirical frequency and $f_n(x)/n$ is an estimate of the predictive density $p_{\tilde x}(x)$. As x has a Poisson distribution for given λ,
   $$p_{\tilde x}(x) = \int p(x\mid\lambda)\,p(\lambda)\,d\lambda = \int \frac{\lambda^x}{x!}\,e^{-\lambda}\,p(\lambda)\,d\lambda.$$
   Now suppose that, with this past data available, a new observation is made, and we want to say something about the corresponding value of λ. In Section 7.5 on ‘Bayesian decision theory’, we saw that the posterior mean of λ is
   $$\mathrm{E}(\lambda\mid x) = \frac{(x+1)\,p_{\tilde x}(x+1)}{p_{\tilde x}(x)}.$$
   To use this formula, we need to know the prior $p(\lambda)$ or at least to know $p_{\tilde x}(x)$ and $p_{\tilde x}(x+1)$, which we do not know. However, it is clear that a reasonable estimate of $p_{\tilde x}(x)$ is $\{f_n(x)+1\}/(n+1)$, after allowing for the latest observation. Similarly, a reasonable estimate for $p_{\tilde x}(x+1)$ is $f_n(x+1)/(n+1)$. It follows that a possible point estimate for the current value of λ, corresponding to the value resulting from a quadratic loss function, is
   $$\frac{(x+1)\,f_n(x+1)}{f_n(x)+1}.$$
   This formula could be used in a case like that investigated by von Mises if the number m of samples taken from each batch were fairly large and the probability θ that a sample contained at least one bacterium were fairly small, so that the Poisson approximation to the binomial could be used.
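   The frequency-based estimate can be sketched in a few lines of code. The sketch assumes that the point estimate takes the form (x + 1) f_n(x + 1) / {f_n(x) + 1}, where f_n counts how many of the n past observations equal a given value; the function name and the past data are hypothetical and chosen only to show the arithmetic.

```python
from collections import Counter


def frequency_estimate(past, x):
    """Estimate lambda for a new observation x from past Poisson counts.

    Assumes the estimate (x + 1) * f_n(x + 1) / (f_n(x) + 1), where f_n(k)
    is the number of past observations equal to k.
    """
    f = Counter(past)  # missing values count as zero
    return (x + 1) * f[x + 1] / (f[x] + 1)


# Hypothetical past observations (n = 10)
past = [0, 0, 0, 1, 1, 1, 1, 2, 2, 3]
estimates = {x: frequency_estimate(past, x) for x in range(4)}
```

   For these made-up data the estimates at x = 0, 1, 2, 3 are 1.0, 0.8, 1.0 and 0.0, which shows how irregular the estimate can be as a function of x when the frequencies are small.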
   This method can easily be adapted to any case where the posterior mean of the parameter of interest takes the form
   and there are quite a number of such cases (Maritz and Lwin, 1989, Section 1.3).
   Going back to the Poisson case, if it were known that the underlying distribution $p(\lambda)$ were of the form $S_0^{-1}\chi^2_\nu$ for some $S_0$ and $\nu$, then it is known (cf. Section 7.5) that
   $$\mathrm{E}(\lambda\mid x) = \frac{\nu + 2x}{S_0 + 2}.$$
   In this case, we could use $x_1, x_2, \dots, x_n$ to estimate $S_0$ and $\nu$ in some way, by, say, $\hat S_0$ and $\hat\nu$, giving an alternative point estimate for the current value of λ, namely
   $$\frac{\hat\nu + 2x}{\hat S_0 + 2}.$$
   The advantage of an estimate like this is that, because, considered as a function of x, it is smoother than the frequency-based estimate obtained above, it could be expected to do better. This is analogous with the situation in regression analysis, where a fitted regression line can be expected to give a better estimate of the mean of the dependent variable y at a particular value of the independent variable x than you would get by concentrating on values of y obtained at that single value of x. On the other hand, the method just described does depend on assuming a particular form for the prior, which is probably not justifiable. There are, however, other methods of producing a ‘smoother’ estimate.
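   One way of estimating the hyperparameters ‘in some way’ is the method of moments, sketched below. This is an assumption made for illustration and is not a method prescribed in the text. If λ is distributed as a multiple 1/S0 of a chi-squared variable on ν degrees of freedom, then the past observations have mean ν/S0 and variance ν/S0 + 2ν/S0², and equating these to the sample mean and variance gives the estimates used in the code. The function names and data are hypothetical.

```python
from statistics import mean, pvariance


def fit_hyperparameters(past):
    """Method-of-moments estimates of (S0, nu); an illustrative assumption.

    Solves  mean = nu / S0  and  variance = nu / S0 + 2 * nu / S0**2.
    """
    m = float(mean(past))
    v = float(pvariance(past))
    if v <= m:
        raise ValueError("sample variance must exceed sample mean")
    s0 = 2.0 * m / (v - m)
    nu = m * s0
    return s0, nu


def smooth_estimate(past, x):
    """Point estimate (nu_hat + 2x) / (S0_hat + 2) for a new observation x."""
    s0, nu = fit_hyperparameters(past)
    return (nu + 2.0 * x) / (s0 + 2.0)


# Hypothetical past observations with mean 2 and variance 3
past_counts = [0, 0, 1, 1, 2, 3, 4, 5]
s0_hat, nu_hat = fit_hyperparameters(past_counts)
estimate_at_3 = smooth_estimate(past_counts, 3)
```

   Unlike an estimate built from raw frequencies, this one is linear in x, so neighbouring values of x always receive neighbouring estimates. The price is the assumed form of the prior, and the method breaks down when the data show no overdispersion.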
   Empirical Bayes methods can also be used for testing whether a parameter θ lies in one or another of a number of sets, that is, for hypothesis testing and its generalizations.
   7.9 Exercises on Chapter 7
   1. Show that in any experiment E in which there is a possible value y for the random variable such that , then if z is any other possible value of , the statistic t=t(x) defined by
   is sufficient for θ given x. Hence, show that if is a continuous random variable, then a naïve application of the weak sufficiency principle as defined in Section 7.1 would result in for any two possible values y and z of .
   2. Consider an experiment . We say that censoring (strictly speaking, fixed censoring) occurs with censoring mechanism g (a known function of x) when, instead of , one observes y=g(x). A typical example occurs when we report x if x< k for some fixed k, but otherwise simply report that . As a result, the experiment really performed is . A second method with censoring mechanism h is said to be equivalent to the first when
   As a special case, if g is one-to-one then the mechanism is said to be equivalent to no censoring. Show that if two censoring mechanisms are equivalent, then the likelihood principle implies that
   3. Suppose that the density function is defined as follows for and . If θ is even, then
   if θ is odd but θ ≠ 1, then
   while if θ = 1 then
   Show that, for any x the data intuitively give equal support to the three possible values of θ compatible with that observation, and hence that on likelihood grounds any of the three would be a suitable estimate. Consider, therefore, the three possible estimators d1, d2 and d3 corresponding to the smallest, middle and largest possible θ. Show that
   but that
   Does this apparent discrepancy cause any problems for a Bayesian analysis (due to G. Monette and D. A. S. Fraser)?
   4. A drunken soldier, starting at an intersection O in a city which has square blocks, staggers around a random path trailing a taut string. Eventually, he stops at an intersection (after walking at least one block) and buries a treasure. Let θ denote the path of the string from O to the treasure. Let N, S, E and W stand for a path segment one block long in the indicated direction, so that θ can be expressed as a sequence of such letters. (Note that NS, SN, EW and WE cannot appear, as the taut string would be rewound.) After burying the treasure, the soldier walks one block further in a random direction (still keeping the string taut). Let X denote this augmented path, so that X is one of θN, θS, θE and θW, each with probability 1/4. You observe X and are then to find the treasure. Show that if you use a reference prior for all possible paths θ, then all four possible values of θ given X are equally likely. Note, however, that intuition would suggest that θ is three times as likely to extend the path as to backtrack, suggesting that one particular value of θ is more likely than the others after X is observed (due to M. Stone).
   5. Suppose that, starting with a fortune of f0 units, you bet a units each time on evens at roulette (so that you have a probability of 18/37 of winning at Monte Carlo or 18/38 at Las Vegas) and keep a record of your fortune fn and the difference dn between the number of times you win and the number of times you lose in n games. Which of the following are stopping rules? a. The last time n at which .
   b. The first time that you win in three successive games.
   c. The value of n for which .
   6. Suppose that x1, … is a sequential sample from an N(θ, 1) distribution and it is desired to test H0 : θ = θ0 versus H1 : θ ≠ θ0. The experimenter reports that he used a proper stopping rule and obtained the data 3, −1, 2, 1. a. What could a frequentist conclude?
   b. What could a Bayesian conclude?
   7. Let $x_1, x_2, \dots$ be a sequential sample from a Poisson distribution $\mathrm{P}(\lambda)$. Suppose that the stopping rule is to stop sampling at time n with probability
   for (define 0/0=1). Suppose that the first five observations are 3, 1, 2, 5, 7 and that sampling then stops. Find the likelihood function for λ (Berger, 1985).
   8. Show that the mean of the beta-Pascal distribution
   is given by the formula in Section 7.3, namely,
   9. Suppose that you intend to observe the number x of successes in n Bernoulli trials and the number y of failures before the nth success after the first n trials, so that $x \sim \mathrm{B}(n, \pi)$ and $y \sim \mathrm{NB}(n, \pi)$. Find the likelihood function and deduce the reference prior that Jeffreys’ rule would suggest for this case.
   10. The negative of loss is sometimes referred to as utility. Consider a gambling game very unlike most in that you are bound to win at least £2, and accordingly in order to be allowed to play, you must pay an entry fee of £e. A coin is tossed until it comes up heads, and if this occurs for the first time on the nth toss, you receive £$2^n$. Assuming that the utility to you of making a gain of £x is u(x), find the expected utility of this game, and then discuss whether it is plausible that u(x) is directly proportional to x. [The gamble discussed here is known as the St Petersburg Paradox. A fuller discussion of it can be found in Leonard and Hsu (2001, Chapter 4).]
   11. Suppose that you want to estimate the parameter of a binomial distribution . Show that if the loss function is
   then the Bayes rule corresponding to a uniform [i.e. ] prior for is given by d(x)=x/n for any x such that 0< x< n, that is, the maximum likelihood estimator. Is d(x)=x/n a Bayes rule if x = 0 or x=n?
   12. Let and have independent binomial distributions of the same index but possibly different parameters. Find the Bayes rule corresponding to the loss
   when the priors for and ρ are independent uniform distributions.
   13. Investigate possible point estimators for on the basis of the posterior distribution in the example in the subsection of Section 2.10 headed ‘Mixtures of conjugate densities’.
   14. Find the Bayes rule corresponding to the loss function
   15. Suppose that your prior for the proportion π of defective items supplied by a manufacturer is given by the beta distribution Be(2, 12), and that you then observe that none of a random sample of size 6 is defective. Find the posterior distribution and use it to carry out a test of the hypothesis H0: π < 0.1 using a. a ‘0 – 1’ loss function, and
   b. the loss function
   16. Suppose there is a loss function defined by
   On the basis of an observation x you have to take action a0, a1 or a2. For what values of the posterior probabilities p0 and p1 of the hypotheses and would you take each of the possible actions?
   17. A child is given an intelligence test. We assume that the test result x is N(θ, 100) where θ is the true intelligence quotient of the child, as measured by the test (in other words, if the child took a large number of similar tests, the average score would be θ). Assume also that, in the population as a whole, θ is distributed according to an N(100, 225) distribution. If it is desired, on the basis of the intelligence quotient, to decide whether to put the child into a slow, average or fast group for reading, the actions available are: a. a1: Put in slow group, that is, decide
   b. a2: Put in average group, that is, decide
   c. a3: Put in fast group, that is, decide
   A loss function of the following form might be deemed appropriate:
   Assume that you observe that the test result x = 115. By using tables of the normal distribution and the fact that if $\phi(t)$ is the density function of the standard normal distribution, then $\int t\,\phi(t)\,dt = -\phi(t)$, find the appropriate action to take on the basis of this observation. [See Berger (1985, Sections 4.2–4.4).]
   18. In Section 7.8, a point estimator for the current value λ of the parameter of a Poisson distribution was found. Adapt the argument to deal with the case where the underlying distribution is geometric, that is
   Generalize to the case of a negative binomial distribution, that is,
   8
   Hierarchical models
   8.1 The idea of a hierarchical model
   8.1.1 Definition
   So far, we have assumed that we have a single known form to our prior distribution. Sometimes, however, we feel uncertain about the extent of our prior knowledge. In a typical case, we have a first stage in which observations $x$ have a density $p(x\mid\theta)$ which depends on r unknown parameters
   $$\theta = (\theta_1, \theta_2, \dots, \theta_r)$$
   for which we have a prior density $p(\theta)$. Quite often we make one or more assumptions about the relationships between the different parameters $\theta_i$, for example, that they are independently and identically distributed [sometimes abbreviated i.i.d.] or that they are in increasing order. Such relationships are often referred to as structural.
   In some cases, the structural prior knowledge is combined with a standard form of Bayesian prior belief about the parameters of the structure. Thus, in the case where the $\theta_i$ are independently and identically distributed, their common distribution might depend on a parameter $\eta$, which we often refer to as a hyperparameter. We are used to this situation in cases where $\eta$ is known, but sometimes it is unknown. When it is unknown, we have a second stage in which we suppose that we have a hyperprior $p(\eta)$ expressing our beliefs about possible values of $\eta$. In such a case, we say that we have a hierarchical prior; for the development of this idea, see Good (1980). It should be noted that the difficulty of specifying a second stage prior has made common the use of noninformative priors at the second stage (cf. Berger, 1985, Sections 3.6 and 4.6.1).
   In Lindley’s words in his contribution to Godambe and Sprott (1971),
   The type of problem to be discussed … is one in which there is a substantial amount of data whose probability structure depends on several parameters of the same type. For example, an agricultural trial involving many varieties, the parameters being the varietal means, or an educational test performed on many subjects, with their true scores as the unknowns. In both these situations the parameters are related, in one case by the common circumstances of the trial, in the other by the test used, so that a Bayesian solution, which is capable of including such prior feelings of relationship, promises to show improvements over the usual techniques.
   There are obvious generalizations. For one thing, we might have a vector of hyperparameters rather than a single hyperparameter. For another, we sometimes carry this process to a third stage, supposing that the prior for the hyperparameter $\eta$ depends on one or more hyper-hyperparameters $\mu$ and so takes the form $p(\eta\mid\mu)$. If $\mu$ is unknown, then we have a hyper-hyperprior density $p(\mu)$ representing our beliefs about possible values of $\mu$.
   All of this will become clearer when we consider some examples. Examples from various fields are given to emphasize the fact that hierarchical models arise in many different contexts.
   8.1.2 Examples
   8.1.2.1 Hierarchical Poisson model
   In Section 7.8 on ‘Empirical Bayes methods’, we considered a case where we had observations $x_i \sim \mathrm{P}(\lambda_i)$, where the $\lambda_i$ have a distribution with a density $p(\lambda)$, and then went on to specialize to the case where $p(\lambda)$ was of the conjugate form $S_0^{-1}\chi^2_\nu$ for some $S_0$ and $\nu$. This is a structural relationship as defined earlier in which the parameters are the $\lambda_i$ and the hyperparameters are $S_0$ and $\nu$. To fit this situation into the hierarchical framework we only need to take a prior distribution for the hyperparameters. Since they are both in the range $(0, \infty)$, one possibility might be to take independent reference priors $p(S_0) \propto 1/S_0$ and $p(\nu) \propto 1/\nu$ or proper priors close to these over a large range.
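   The three stages of such a model can be made concrete by simulating from it. The sketch below is illustrative only. It assumes hyperparameters S0 and ν with proper priors proportional to 1/S0 and 1/ν on a finite range (the range 0.5 to 5 is arbitrary), parameters λ_i distributed as 1/S0 times a chi-squared variable on ν degrees of freedom, and Poisson observations. All function names are hypothetical, and the Poisson sampler is a simple textbook method suitable only for moderate means.

```python
import math
import random


def sample_log_uniform(rng, low, high):
    """Draw from the proper density proportional to 1/t on [low, high]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


def sample_poisson(rng, lam):
    """Poisson draw by multiplying uniforms until the product falls below exp(-lam)."""
    threshold = math.exp(-lam)
    k, product = 0, rng.random()
    while product > threshold:
        k += 1
        product *= rng.random()
    return k


def simulate_hierarchical_poisson(rng, n, low=0.5, high=5.0):
    """Simulate hyperparameters, then parameters, then observations."""
    s0 = sample_log_uniform(rng, low, high)   # second stage: hyperprior
    nu = sample_log_uniform(rng, low, high)
    # chi-squared on nu d.f. divided by S0 is gamma with shape nu/2, scale 2/S0
    lambdas = [rng.gammavariate(nu / 2.0, 2.0 / s0) for _ in range(n)]
    xs = [sample_poisson(rng, lam) for lam in lambdas]  # first stage: data
    return s0, nu, lambdas, xs


rng = random.Random(0)
s0, nu, lambdas, xs = simulate_hierarchical_poisson(rng, 20)
```

   Reading the function from top to bottom mirrors the hierarchy: the hyperprior generates the hyperparameters, which generate the λ_i, which in turn generate the data. Inference runs in the opposite direction, from the observed x_i back to the λ_i and the hyperparameters.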
   8.1.2.2 Test scores
   Suppose that a number of individuals take intelligence tests (‘IQ tests’) on which their scores are normally distributed with a known variance but with a mean which depends on the ‘true abilities’ of the individuals concerned, so that . It may well happen that the individuals come from a population in which the true abilities are (at least to a reasonable approximation) normally distributed, so that . In this case the hyperparameters are . If informative priors are taken at this stage, a possible form would be to take μ and as independent with and for suitable values of the hyper-hyperparameters λ, , S0 and .
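   A minimal sketch of the first two stages of this model follows. Because symbols are needed, it assumes the names phi for the known test variance, theta for an individual's true ability, and mu and psi for the population mean and variance of abilities; these names are assumptions of the sketch. The hyperparameters are held fixed here rather than given a further prior, and the numerical values (test variance 100, population N(100, 225)) are borrowed from Exercise 17 of Chapter 7 for illustration. The posterior formula is the standard result for a normal observation with a normal prior.

```python
import random


def simulate_scores(rng, n, mu, psi, phi):
    """Draw true abilities from N(mu, psi), then test scores from N(theta, phi)."""
    abilities = [rng.gauss(mu, psi ** 0.5) for _ in range(n)]
    scores = [rng.gauss(theta, phi ** 0.5) for theta in abilities]
    return abilities, scores


def posterior_for_ability(x, mu, psi, phi):
    """Posterior mean and variance of theta given score x, with mu and psi known.

    The posterior mean is a weighted average of the score and the population
    mean, with weight psi / (psi + phi) on the score.
    """
    weight = psi / (psi + phi)
    post_mean = weight * x + (1.0 - weight) * mu
    post_var = psi * phi / (psi + phi)
    return post_mean, post_var


mu, psi, phi = 100.0, 225.0, 100.0   # illustrative values (see lead-in)
abilities, scores = simulate_scores(random.Random(1), 50, mu, psi, phi)
post_mean, post_var = posterior_for_ability(130.0, mu, psi, phi)
```

   For a score of 130 the posterior mean is pulled back towards the population mean of 100, to about 120.8. This shrinkage is what knowledge of the population contributes; in the fully hierarchical version mu and psi would themselves be uncertain and learned from all the scores together.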