P(x, y)dxdy = P(x)dxP(y|x)dy
(11.4)
or
P(x, y)dxdy = P(y)dyP(x|y)dx
(11.5)
Eliminating P(x, y), we obtain Bayes’ theorem:
(11.6)
This theorem formalizes the inverse problem posed in Section 11.1. Here:
P(x|y) is the posterior PDF for the state vector x given the observations y, and defines the solution to the inverse problem.
P(x) is the prior PDF of the state vector x before the measurements are made, i.e., the PDF of xA defined by the error statistics for εA.
P(y|x) is the PDF of the observation vector y given the true value of x and accounting for errors in the measurements and in the forward model, as defined by the error statistics for εΟ (11.1).
P(y) is the PDF of y for all possible values of x.
The optimal estimate for x is defined by the maximum of P(x|y), corresponding to
∇xP(x| y) = 0
(11.7)
where ∇x is the gradient operator in the state vector space operating on all state vector elements. From (11.6) we have
(11.8)
since P(y) is independent of x. It follows that the optimal estimation given by (11.7) can be rewritten as
∇x[P(y| x)P(x)] = 0
(11.9)
This defines the Bayesian optimal estimate solution to the inverse problem. P(y) does not contribute to the solution and does not appear in (11.9). Indeed, P(y) can be viewed simply as a normalizing factor in equation (11.6) to ensure that the integral of P(x|y) over all possible values of x is unity. We ignore it in what follows.
Inverse modeling using Bayes’ theorem as described here is an example of a regularization method where prior information is used to constrain the fitting of x to y. Here the constraint is the prior PDF of x. This is the most commonly used constraint in inverse modeling, but others may also be used. Some methods ignore prior information and fit x to match y using least-squares or some other norm minimization. Other regularization methods enforce different prior constraints on the solution, such as positivity, smoothness, or patterns. Examples are Tikhonov regularizations and geostatistical methods. These are covered in Section 11.7.
11.3 A Simple Scalar Example
Here we apply Bayes’ theorem to solve a simple inverse problem using scalars. This allows us to introduce concepts, terminology, and equations that will be useful for understanding the solution of the more general problem involving vectors.
Consider in this example a single source releasing a species X to the atmosphere with an emission rate x (Figure 11.1). We have a prior estimate xA ± σA for x, where is the prior error variance defined by
(11.10)
Here εA is the error on our prior estimate, and E[] is the expected value operator representing the expected mean value of the bracketed quantity for an infinitely large number of realizations. E[εA] is the mean value of the error, called the mean bias. The prior estimate xA is unbiased by definition so E[εA] = 0. This is an important point. You may think of the prior estimate as “biased” because it differs from the true value; however, before we make observations, it is equally likely to be too high or too low.
Figure 11.1 Simple example of inverse modeling. A point source emits a species X with an estimated prior emission rate xA. We seek to improve this estimate by measuring the concentration y of X at a point downwind, and using a CTM as forward model to relate x to y.
We now make a single measurement of the concentration of X downwind from the source. The measured concentration is y = yT + εI, where yT is the true concentration and εI is the instrument error. We use a CTM as forward model F to relate yT to x:
yT = F(x) + εM + εR
(11.11)
Here, εM describes the forward model error in reproducing the true concentration yT given the true emission rate x. This error includes contributions from model parameters such as winds, model physics, and model numerics. There is in addition a representation error εR reflecting the mismatch between the model resolution and the measurement location (Figure 11.2). Representation error is caused by the numerical discretization of the model equations so that the model provides simulated concentrations only on a discrete spatial grid and at discrete time steps, which may not correspond to the exact location of the measurement. Thus the model is not representative of the measurement, and even with interpolation some error is incurred. Representation error is not intrinsically a forward model error because it could in principle be corrected by adjusting the location and timing of the measurement.
Figure 11.2 Representation error. The forward model computes concentrations as gridbox averages over discrete time steps Δt. The measurement is at a specific location within the gridbox and at an intermediate time t′. Interpolation in space and time is necessary to compare the measurement to the model, and the associated error is called the representation error.
Summing the errors, the measured concentration y is related to the true value of x by
y = F(x) + εI + εR + εM = F(x) + εO
(11.12)
where εO = εI + εR + εM is the observational error which includes instrument, representation, and forward model errors. This terminology might at first seem strange, as we are used to opposing observations to models. A very important conceptual point in inverse modeling is that instrument and model errors are inherently coupled when attempting to estimate x from y. Having a very precise instrument is useless if the model is poor or mismatched; in turn, having a very precise model is useless if the instrument is poor. Instrument and model must be viewed as inseparable partners of the observing system by which we seek to gain knowledge of x.
The instrument, representation, and model errors are uncorrelated so that their variances are additive:
(11.13)
Let us assume for now that the observational error is unbiased so that bO = E[εO] = 0; we will examine the implications of bO ≠ 0 later. Let us further assume that the prior and observational errors are normally distributed. Finally, let us assume that the forward model is linear so that F(x) = kx where k is the model parameter; again, we will examine the implications of nonlinearity later.
We now have all the elements needed for application of Bayes’ theorem to obtain an optimal estimate of x given y. The prior PDF for x is given by
(11.14)
and the conditional PDF for the observation y given the true value of x is given by
(11.15)
Applying Bayes’ theorem (11.6) and ignoring the normalizing terms that are independent of x, we obtain:
(11.16)
where ∝ is the proportionality symbol. Finding the maximum value for P(x|y) is equivalent to finding the minimum for the cost function J(x):
(11.17)
which is a least-squares sum weighted by error variances and is called a χ2 cost function.
The optimal estimate is the solution to ∂J/∂x = 0:
(11.18)
This yields
(11.19)
where g is a gain factor given by
(11.20)
In (11.19), the second term on the right-hand side represents the correction to the prior estimate on the basis of the measurement y. The gain factor is the sensitivity of the optimal estimate to the observation: . We see from (11.20) that the gain factor depends on the relative magnitudes of σA and σO/k. If σA ≪ σO /k, then g → 0 and ; the measurement is useless because the observational error is too large. If by contrast σA ≫ σO /k, then g → 1/k and ; the measurement is so precise that it constrains the solution without recourse to prior information.
We can also express the optimal estimate in terms of its proximity to the true solution x. Replacing (11.12) with F(x) = kx into (11.19) we obtain
(11.21)
or equivalently
(11.22)
where a is the averaging kernel defined as
(11.23)
The averaging kernel describes the relative weights of the prior estimate xA and the
true value x in contributing to the optimal estimate. It represents the sensitivity of the optimal estimate to the true state: . The gain factor is now applied to the observational error in the third term on the right-hand side.
We see that the averaging kernel simply weighs the error variances in state space and (σO /k)2. In the limit σA ≫ σO /k, we have a → 1 and the prior estimate does not contribute to the solution. However, our ability to approach the true solution is still limited by the term gεO with variance (gσO)2. We call (1 – a)(xA – x) the smoothing error since it regularizes the solution by limiting our ability to depart from the prior estimate, and we call gεO the observational error in state space.
Manipulating the PDF of the optimal estimate as given by (11.16) and replacing y using (11.19) yields a Gaussian form where is the harmonic sum of and (σO/k)2:
(11.24)
Here is the error variance on the optimal estimate, called the posterior error variance. It is always less than the prior and observational error variances, and tends toward one of the two in the limiting cases that we described.
Before the measurement the error variance on x was ; after the measurement it is . The amount of information from the measurement can be quantified as the relative error variance reduction . We find from (11.23) and (11.24) that this quantity is equal to the averaging kernel:
(11.25)
The role of the prior estimate in obtaining the optimal solution deserves some discussion. Sometimes an inverse method will be described as “not needing prior information.” But that means either that any prior information is very poor compared to what can be achieved from the observing system, or that the method is suboptimal. In our example, not using prior information will yield as solution y/k with error variance ; but since this is not as good a solution as . Using prior information can lead to confusion about the actual contribution of the measurement to the reported solution . Knowledge of averaging kernels is important to avoid such confusion.
We have assumed in the above a linear forward model y = F(x) = kx. If the forward model is not linear, we can still calculate an optimal estimate as the minimum of the cost function (11.17), where we replace kx by the nonlinear form F(x). We then have
(11.26)
and the optimal estimate is given by solving dJ/dx = 0:
(11.27)
The error on this optimal estimate is not Gaussian though, so (11.24) does not apply. And although we can still define an averaging kernel , this averaging kernel cannot be expressed analytically anymore as a ratio of error variances; it may instead need to be calculated numerically. Obtaining error statistics on the optimal estimate is thus far more difficult. An alternative is to linearize the forward model around xA as ko = ∂F/∂x|xA and solve for the corresponding minimum of the cost function as in the linear case (11.18):
(11.28)
This yields an initial guess x1 for on which we iterate by recalculating k1 = ∂F/∂x|x1, solving (11.28) using k1, obtaining a next guess x2, and so on until convergence. This preserves the ability for analytical characterization of observing system errors.
We have assumed in our analysis that the errors are unbiased. The prior error εA is indeed unbiased because xA is our best prior estimate of x; even though xA is biased its error is not. Another way of stating this is that we don’t know that xA is biased until after making the measurement. However, the observational error could be biased if the instrument is inaccurate or if there are systematic errors in some aspect of the forward model. In that case we must rewrite (11.12) as
y = F(x) + bO + ε'O
(11.29)
where bO = E[εO] is the observation bias and ε'O is the residual random error such that E[ε'O] = 0. The optimal estimate can be derived as above by replacing y with y – bO, and we see in this manner that the bias will be propagated through the equations to cause a corresponding bias in the solution. For a linear model F(x) = kx, the analytical solution given by (11.19) will be biased by gbO.
So far we have limited ourselves to one single measurement. We can reduce the error on the optimal estimate by making m independent measurements yi, each adding a term to the cost function J(x) in (11.17). Assuming for illustrative purpose the same observational error variance and the same linear forward model parameter k for each measurement, and further assuming that the successive measurements are not only independent but uncorrelated, we have the following expression for J(x):
(11.30)
where the overbar denotes the average value and is the variance of the error on . By taking m measurements, we have reduced the observational error variance on the average value by m; this is the central limit theorem. By increasing m, we could thus approach the true solution: . However, this works only if (1) the observational error has an expected value of zero (no mean bias), and (2) the m observations are independent and identically distributed (IID), meaning that they all sample the same PDF in an uncorrelated way.
With regard to (1), systematic error (mean bias) will not be reduced by increasing the number of measurements and will still propagate to affect the solution as discussed here. As the number of observations increases and the importance of the random error component decreases, the effect of bias on the solution increases in relative importance. With regard to (2), instrumental errors (as from photon counting) may be uncorrelated; however, forward model errors rarely are. In our example, two successive measurements at a site may sample the same air mass and thus be subject to the same model transport error in the CTM used as forward model. It is thus important to determine the error correlation between the different observations. This error correlation can best be described by assembling the measurements into a vector and constructing the observational error covariance matrix (Section 11.4.1). Dealing with error correlations, and more generally dealing with a multi-component state vector, requires that we switch to a vector-matrix formalism for the inverse problem. This vector-matrix formalism is central to any practical application of inverse modeling and we introduce the relevant mathematical tools in the next section.
One last word about bias before we move on. Bias in the observing system is the bane of inverse modeling. As we saw, it propagates through the inverse model to bias the solution. Random error in the observing system can be beaten down by making many measurements, but bias is irreducible. Inversions sometimes include a prior estimate for the pattern of the bias (for example, latitude-dependent bias in a satellite retrieval) and optimize it as part of the inversion. But for this we need to know that a bias is there and what form it has, and we generally are not that well informed. Minimizing bias in the observing system through independent calibration is a crucial prelude to inverse modeling. Bias in the instrument can be determined by analysis of known standards or by comparison with highly accurate independent measurements (for example, validation of satellite observations with vertical aircraft profiles during the satellite overpass). Bias in the forward model can be determined by applying the model to conditions where the state is known, though this is easier said than done. See Chapter 10 for discussion on quantifying errors in models. In the rest of this chapter, and unless otherwise noted, we will assume that errors are random.
11.4 Vector-Matrix Tools
Consider the general problem of a state vector x of dimension n with prior estimate xA and associated error εA, for which we seek an optimal estimate on the basis of an ensemble of observations assembled into an observation vector y of dimension m. y is related to x by the forward model F:
y = F(x) + εO
(11.31)
where εO is the observational error vector as in (11.1). We have omitted the model parameters p in the expression for F to simplify notation. Inverse analysis requires definition of error statistics and PDFs for vectors. The error statistics are expressed as error covariance matrices, and the PDFs are constructed in a manner that accounts for covariance between vector elements. Solution of the inverse problem may involve construction of the Jacobian matrix and the adjoint of the fo
rward model. We describe here these different objects. Their application to solving the inverse problem will be presented in the following sections.
11.4.1 Error Covariance Matrix
The error covariance matrix for a vector is the analogue of the error variance for a scalar. Consider an n-dimensional vector that we estimate as x + ε, where x = (x1, … xn)T is the true value and ε = (ε1, … εn)T is the error vector representing the errors on the individual components of x. The error covariance matrix S for x has as diagonal elements (sii) the error variances of the individual components of x, and as off-diagonal elements (sij) the error covariances between components of x:
(11.32)
(11.33)
where r(εi, εj) is Pearson’s correlation coefficient between εi and εj:
(11.34)
The error covariance matrix is thus constructed as:
(11.35)
and can be represented in compact form as S = E[εεT]. It is symmetric since the covariance operator is commutative: cov(εi, εj) = cov(εj, εi). The covariance structure is often derived from error correlation coefficients, as expressed by the error correlation matrix S′:
Modeling of Atmospheric Chemistry Page 60