Adapted from an original figure by David Baker (Colorado State University).
Figure 11.6 gives a graphical representation of the procedure for computing ∇xJ(x) with the adjoint. The observations y collected over a period [t0, tp] constrain a state vector x evaluated at t0. Using the notation introduced in Section 11.4.4 and the expression for KT given in (11.51), we have
KT = (∂y(0)/∂x)T (∂y(1)/∂y(0))T … (∂y(p)/∂y(p − 1))T
(11.103)
where y(i) denotes the ensemble of observations at time step i. Starting from the prior estimate xA and following (11.77), we write the cost function gradient ∇xJ(xA) as
∇xJ(xA) = KT SO−1(F(xA) − y)
(11.104)
where the adjoint is applied to the adjoint forcing SO−1(F(xA) − y) representing the error-weighted differences between the forward model and the observations. We make one pass of the forward model F(xA) through the observational period [t0, tp] and collect the corresponding adjoint forcing terms, which may be scattered over the period. Starting from the observations y(p) at tp, we apply the first adjoint operator (∂y(p)/∂y(p − 1))T from (11.103) to the adjoint forcing terms and obtain the adjoint variables as a 3-D field on the model grid. We then add to these adjoint variables the adjoint forcings from the observations at time tp−1, apply the next adjoint operator (∂y(p − 1)/∂y(p − 2))T to the resulting quantities, and so on until time t0, when the final application of the adjoint operator (∂y(0)/∂x)T returns the cost function gradient ∇xJ(xA) of (11.104). The procedure can be readily adapted to obtain the cost function gradient for a time-invariant state vector; see related discussion in Section 11.4.4.
Figure 11.6 Graphical representation of the adjoint method for computing the cost function gradient ∇xJ(xA). We consider here an ensemble of observations over the period [t0, tp] to constrain a state vector x evaluated at time t0. We start with a pass of the forward model F(xA) over the observation period [t0, tp]. From there we collect adjoint forcings SO−1(F(xA) − y) (in red) for the ensemble of observations. We then force the adjoint model with the adjoint forcings at time tp and propagate these forcings back in time with the adjoint model (in blue), adding new forcings as we march backward from tp to t0 and pick up new observations along the way. At time t0 we apply the final operation (∂y(0)/∂x)T to the adjoint variables to obtain the cost function gradient ∇xJ(xA).
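The forward pass and backward adjoint sweep described above can be sketched numerically for a toy problem. Here a small linear model y(i + 1) = M y(i), with the full state observed at every time step, stands in for the CTM; the operator M, the observation error variance, and all dimensions are illustrative assumptions, not quantities from the text.

```python
import numpy as np

# Toy setup: linear time-stepping model, identity observation operator,
# uncorrelated observation errors with variance sigma^2 (all illustrative).
rng = np.random.default_rng(0)
n, p, sigma = 4, 5, 0.5
M = 0.5 * np.eye(n) + 0.1 * rng.standard_normal((n, n))   # time-step operator
x = rng.standard_normal(n)                                # state vector x = y(0)
obs = [rng.standard_normal(n) for _ in range(p + 1)]      # observations at each step

# Forward pass: run the model over [t0, tp] and store the adjoint
# forcings SO^{-1}(F(x) - y) at every observation time.
y, forcings = x.copy(), []
for i in range(p + 1):
    forcings.append((y - obs[i]) / sigma**2)
    if i < p:
        y = M @ y

# Backward sweep: start from the forcing at tp, apply the transposed
# (adjoint) operators back to t0, adding new forcings along the way.
a = forcings[p].copy()
for i in range(p - 1, -1, -1):
    a = M.T @ a + forcings[i]
grad_adjoint = a          # cost function gradient with respect to x

# Check against the explicit sum over observation times.
grad_explicit = np.zeros(n)
Mi = np.eye(n)            # Mi = M^i, so y(i) = Mi @ x
for i in range(p + 1):
    grad_explicit += Mi.T @ forcings[i]
    Mi = M @ Mi
assert np.allclose(grad_adjoint, grad_explicit)
```

The backward sweep touches each time-step operator once, regardless of how many observations are accumulated at each step.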
The value of ∇xJ(xA) obtained in this manner is passed to the steepest-descent algorithm to make an updated guess x1 for the state vector. We then recalculate ∇xJ(x1) for that updated guess,
∇xJ(x1) = SA−1(x1 − xA) + KT SO−1(F(x1) − y)
(11.105)
using the adjoint as before and adding the prior terms SA−1(x1 − xA), which are now non-zero. We pass the result to the steepest-descent algorithm, which makes an updated guess x2, and so on until convergence to the optimal estimate x̂. Each iteration involves one pass of the forward model over [t0, tp] followed by one pass of the adjoint model over [tp, t0].
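This iteration can be illustrated with a toy linear problem, where the gradient takes the closed form SA−1(x − xA) + KT SO−1(Kx − y). The plain steepest-descent step below is a minimal stand-in for the more efficient descent algorithms used in practice (quasi-Newton methods are typical); all matrices and sizes are illustrative.

```python
import numpy as np

# Hypothetical linear inversion: iterate gradient evaluations and
# descent steps until convergence, then compare with the analytic optimum.
rng = np.random.default_rng(1)
m, n = 30, 5
K = rng.standard_normal((m, n))
y = K @ rng.standard_normal(n) + 0.5 * rng.standard_normal(m)
x_A = np.zeros(n)
SA_inv = np.eye(n)              # inverse prior error covariance (illustrative)
SO_inv = np.eye(m) / 0.5**2     # inverse observational error covariance

H = SA_inv + K.T @ SO_inv @ K   # Hessian, used here only to size the step
step = 1.0 / np.linalg.eigvalsh(H).max()

x = x_A.copy()
for _ in range(20000):          # steepest-descent iterations
    grad = SA_inv @ (x - x_A) + K.T @ (SO_inv @ (K @ x - y))
    x -= step * grad

# The converged x matches the analytic optimal estimate.
x_hat = np.linalg.solve(H, SA_inv @ x_A + K.T @ SO_inv @ y)
assert np.allclose(x, x_hat)
```

Each loop iteration corresponds to one forward-model pass plus one adjoint pass in the full adjoint-based inversion.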
An important feature of the adjoint-based inversion method is that the Jacobian matrix K is never actually constructed. We do not in fact need the sensitivity of individual observations to x, which is what K would give us (and would be very expensive to compute for a large number of observations). All we need is the summed sensitivity KT SO−1(F(x) − y), which is what the adjoint method provides. Increasing the number of observations does not induce additional computing costs, as it just amounts to updating the adjoint variables on the forward model grid to add new adjoint forcings (Figure 11.5). Increasing the number of state variables (that is, the size of x) also does not incur additional computing costs, other than perhaps requiring more iterations to achieve convergence.
Constructing the adjoint requires differentiation of the forward model (Section 11.4.4). Approximations are often made in that differentiation, such as assuming reverse flow for a nonlinear advection operator. The accuracy of the cost function gradients ∇xJ(x) produced by the adjoint method can be checked with finite-difference testing (Henze et al., 2007). The test involves applying the forward model to xA, calculating the cost function J(xA), and repeating the calculation for a small perturbation xA + ΔxA, where ΔxA has value Δxi for element i and zero for all other elements. The resulting finite-difference approximation
∂J/∂xi ≈ [J(xA + ΔxA) − J(xA)]/Δxi
(11.106)
is then compared to the value obtained with the adjoint model.
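A sketch of this finite-difference test, using a small hypothetical nonlinear forward model F and a cost function containing only the observational terms (with a 1/2 factor so that the gradient is KT SO−1(F(x) − y)); the analytic gradient below stands in for the gradient an adjoint model would return.

```python
import numpy as np

# Toy nonlinear forward model mapping a 2-element state to 3 observations.
def F(x):
    return np.array([x[0] * x[1], x[1] ** 2, np.sin(x[0])])

def J(x, y, so_inv):           # observational cost function (1/2 factor)
    d = F(x) - y
    return 0.5 * d @ (so_inv * d)

def grad_J(x, y, so_inv):      # K^T SO^{-1} (F(x) - y), K written by hand
    Kt = np.array([[x[1], 0.0, np.cos(x[0])],
                   [x[0], 2 * x[1], 0.0]])
    return Kt @ (so_inv * (F(x) - y))

y = np.array([1.0, 2.0, 0.3])
so_inv = np.array([4.0, 1.0, 2.0])      # diagonal SO^{-1} (illustrative)
x_A = np.array([0.7, 1.3])
g = grad_J(x_A, y, so_inv)

# Finite-difference test (11.106): perturb one element at a time.
for i in range(2):
    dx = np.zeros(2)
    dx[i] = 1e-6
    fd = (J(x_A + dx, y, so_inv) - J(x_A, y, so_inv)) / 1e-6
    assert abs(fd - g[i]) < 1e-4
```

A large discrepancy in this test points to an error or an overly aggressive approximation in the adjoint construction.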
Box 11.7 illustrates the adjoint-based inversion with an example using satellite observations of atmospheric concentrations to optimize emissions. The size of the emissions state vector is limited solely by the grid resolution of the forward model. Comparison to a coarse-resolution analytical inversion shows large advantages in the amount of information retrieved.
Box 11.7 Adjoint-Based Inversion
Box 11.7 Figure 1 from Kopacz et al. (2009) illustrates the adjoint-based inversion with an optimization of CO emissions over East Asia using satellite observations of CO columns from the MOPITT instrument in March–April 2001 (Box 11.5). Here, 21 569 observations are used to constrain mean scale factors to prior CO emissions on the 2° × 2.5° grid of the CTM taken as forward model. The state vector of emission fluxes has 3013 elements. The CTM with prior emissions shows large differences with observations, and the inversion cost function is consequently large. Successive iterations bring the cost function down to its expected value for a successful inversion. The corrections to the prior emissions from the adjoint-based inversion show fine structure associated with geographical boundaries and the type of source (fuel combustion or open fires). An analytical inversion of the same observations with coarse spatial averaging of the emission state vector not only misses the fine structure but also incurs a large aggregation error (Section 11.5.5), as apparent for example in the wrong-direction correction for Korea.
Box 11.7 Figure 1 Adjoint inversion of CO emissions in East Asia using CO column observations from the MOPITT satellite instrument in March–April 2001. The top left panel shows the mean MOPITT observations for the period, the middle left panel shows the forward model with prior emissions, and the bottom panel shows the forward model with posterior emissions optimized through the inversion. The top right panel shows the evolution of the cost function with the number of iterations by the adjoint method. The bottom right panel shows the mean multiplicative factors to the prior emissions from the adjoint inversion, and the middle right panel shows the same factors for an analytical inversion with only 11 emission regions as state vector elements.
From Kopacz et al. (2009).
A drawback of the adjoint method is that it does not provide the posterior error covariance matrix as part of the solution. The matrix can be estimated by constructing the Hessian (second derivative) of the cost function. By differentiating equation (11.77) we obtain:
∇x2J = SA−1 + KT SO−1 K
(11.107)
from which we see that the posterior error covariance matrix (11.81) is the inverse of the Hessian:
Ŝ = (∇x2J)−1 = (SA−1 + KT SO−1 K)−1
(11.108)
The adjoint method allows an estimate of the Hessian by finite-difference sampling, calculating ∇xJ(x) around the optimal estimate solution. Full construction of the Hessian would require n + 1 such gradient calculations (where n is the state vector dimension), which is not practical for large-dimension state vectors. Targeted sampling can provide the leading eigenvalues and eigenvectors to approximate the Hessian. See Bousserez et al. (2015) for a discussion of methods.
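The construction can be sketched for a toy linear problem, where the Hessian (11.107) and its inverse (11.108) are computable exactly and a low-rank eigenvector approximation can be compared against them. The setup and the Woodbury-identity shortcut below are illustrative assumptions, simpler than the methods discussed by Bousserez et al. (2015).

```python
import numpy as np

# Toy linear problem with decaying sensitivities so that a few
# eigenvectors of K^T SO^{-1} K capture most of the information.
rng = np.random.default_rng(2)
n, m = 50, 200
K = rng.standard_normal((m, n)) @ np.diag(1.0 / (1.0 + np.arange(n)))
SA = np.eye(n)                         # prior error covariance (illustrative)
SO_inv = np.eye(m) / 0.5**2

H = np.linalg.inv(SA) + K.T @ SO_inv @ K   # Hessian (11.107)
S_post = np.linalg.inv(H)                  # full posterior covariance (11.108)

# Low-rank version: keep the r leading eigenpairs of K^T SO^{-1} K and
# apply the Woodbury identity
#   S_post ~ SA - SA V (Lambda^{-1} + V^T SA V)^{-1} V^T SA.
w, V = np.linalg.eigh(K.T @ SO_inv @ K)
r = 10
w_r, V_r = w[-r:], V[:, -r:]
S_lr = SA - SA @ V_r @ np.linalg.inv(np.diag(1.0 / w_r)
                                     + V_r.T @ SA @ V_r) @ V_r.T @ SA

# The low-rank estimate is closer to the true posterior covariance
# than the prior covariance alone.
assert np.linalg.norm(S_lr - S_post) < np.linalg.norm(SA - S_post)
```

In an adjoint setting, the eigenpairs would be obtained from gradient evaluations (e.g., by Lanczos-type iteration) rather than by forming KT SO−1K explicitly as done here.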
11.7 Markov Chain Monte Carlo (MCMC) Methods
Markov chain Monte Carlo (MCMC) methods construct the posterior PDF P(x|y) ~ P(x)P(y|x) from Bayes’ theorem by directly computing P(x) and P(y|x), and their product, for a very large ensemble of values of x that strategically samples the n-dimensional space defined by x. MCMC methods can use any form of the prior and observational PDFs. They allow for non-Gaussian errors, a nonlinear forward model, and any prior constraints. With sufficient sampling, they can return the full structure of P(x|y) with no prior assumption as to its form. A drawback of MCMC methods is their computational cost. In addition, because P(x) and P(x|y) are not Gaussian, one cannot calculate an averaging kernel matrix to quantify the information content of the observations.
The basis for MCMC methods is a Markov chain where successive values of x are selected in such a way that the next value depends on the current value but not on previous values. This is done by randomly sampling a transition PDF T(x′|x), where x is the current value and x′ is the next value. T is often taken to have a Gaussian form so that values close to x are more likely to be selected as the next value. Through this Markov chain, and by including additional criteria to adopt or reject candidate next values, we achieve a targeted random sampling of x to map the function P(x|y).
The general strategy for MCMC methods is as follows. We start from a first choice for x, such as the prior estimate xA, and calculate P(xA|y). We then choose randomly a candidate for the next value x1 by sampling T(x1|xA) and calculate the corresponding P(x1|y). Comparison of P(x1|y) to P(xA|y) tells us whether x1 is more likely than xA or not, and on that basis we may choose to adopt x1 as our next value or reject it. If we adopt it, then we use x1 as a starting point to choose a candidate x2 for the next value by applying T(x2|x1). If we reject it, then we come back to xA and make another tentative choice for x1 by random sampling of T(x1|xA). In this manner we sample the PDF P(x|y) in a representative way. Figure 11.7 gives an illustration.
Figure 11.7 Sampling of a PDF by an MCMC method. The blue points represent the population and the contours are the isolines of the PDF. The black line is the MCMC sampling trajectory. For a sufficiently large sample, the PDF of the sample tends toward the PDF of the population.
From Iain Murray (University of Edinburgh).
A frequently used MCMC method is the Metropolis–Hastings algorithm. In this algorithm, following on the above sampling strategy, we calculate the ratio P(x1|y)/P(xA|y). If the ratio is greater than 1, we are moving in the right direction toward the most likely value; x1 is then adopted as the next value of the Markov chain and we proceed to choose a candidate for x2 on the basis of x1. If the ratio is less than 1, this means that x1 is less likely than xA. In that case, the ratio P(x1|y)/P(xA|y) defines the probability that x1 should be selected as the next iteration and a random decision is made based on that probability. If the decision is made to adopt x1, then we proceed as above. If the decision is made to reject x1, then we go back to xA and make another tentative choice for x1. Given a sufficiently large sampling size, one can show that this sampling strategy will eventually generate the true structure of P(x|y).
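A minimal Metropolis–Hastings sketch for a one-dimensional state with Gaussian prior and likelihood, where the posterior is known analytically for comparison. The linear “forward model” F(x) = kx, the proposal width, and the chain length are all illustrative choices. Note that rejecting a candidate and drawing again from the current value, as described above, is equivalent to repeating the current value in the chain:

```python
import numpy as np

# Illustrative 1-D problem: Gaussian prior on x, one observation y of kx.
rng = np.random.default_rng(3)
x_A, s_A = 1.0, 1.0          # prior mean and standard deviation
k, y, s_O = 2.0, 3.0, 0.5    # forward model slope, observation, obs error

def log_post(x):             # log P(x|y) up to an additive constant
    return -0.5 * ((x - x_A) / s_A) ** 2 - 0.5 * ((y - k * x) / s_O) ** 2

x, samples = x_A, []
for _ in range(50000):
    x_new = x + 0.5 * rng.standard_normal()          # sample T(x'|x)
    # Accept with probability min(1, P(x_new|y) / P(x|y)).
    if np.log(rng.uniform()) < log_post(x_new) - log_post(x):
        x = x_new
    samples.append(x)        # a rejection repeats the current value
samples = np.array(samples[5000:])                   # discard burn-in

# Analytic Gaussian posterior mean for comparison.
mean_exact = (x_A / s_A**2 + k * y / s_O**2) / (1 / s_A**2 + k**2 / s_O**2)
```

For this Gaussian case the sample mean converges to the analytic posterior mean; the same machinery applies unchanged to non-Gaussian PDFs and nonlinear forward models, which is the main appeal of MCMC.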
11.8 Other Optimization Methods
Standard Bayesian optimization regularizes the fitting of the state vector to observations by applying the prior PDF of the state vector as an additional constraint. We may wish to use a different type of prior constraint, or ignore prior information altogether. We briefly discuss these approaches here.
Ignoring prior information is the effective outcome of a Bayesian inversion when the prior terms SA−1(x – xA) are small relative to the observational terms in the computation of the cost function gradient by equation (11.77). This happens when the prior errors are large relative to the observational errors, and/or when the number of state vector elements is very small relative to the number of observations. If we know this from the outset, then there is little point in going through the trouble of including prior information in the cost function. The cost function including only the observational terms amounts to an error-weighted fit to the observations and this is called the maximum likelihood estimator:
J(x) = (y − F(x))T SO−1(y − F(x))
(11.109)
The minimum in J can be found by computing the gradient as in (11.77) without the prior terms. Sequential updating can be applied in computing the gradient by successive ingestion of small data packets (Box 11.6). The optimal estimate after ingesting one data packet is used as the prior estimate for ingesting the next data packet.
If all elements in equation (11.109) have the same observational error variances and zero covariances, so that SO is a multiple of the identity matrix, then the cost function reduces (up to a constant factor) to a simple least-squares fit: J(x) = ‖y − F(x)‖2, where ‖·‖ denotes the Euclidean norm. If, in addition, the forward model is linear and thus fully described by its Jacobian matrix K, then we have the familiar linear least-squares optimization
J(x) = ‖y − Kx‖2
(11.110)
One can show that the corresponding minimum of J(x) is for
x = K+y
(11.111)
where K+ = (KTK)−1KT is the Moore–Penrose pseudoinverse of K (assuming K has full column rank).
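A quick numerical check of (11.111) for an illustrative overdetermined problem, comparing the pseudoinverse solution with the normal-equations form and a library least-squares fit:

```python
import numpy as np

# Illustrative overdetermined linear system (m observations, n unknowns).
rng = np.random.default_rng(4)
m, n = 20, 4
K = rng.standard_normal((m, n))
y = rng.standard_normal(m)

x_pinv = np.linalg.pinv(K) @ y                    # x = K+ y  (11.111)
x_normal = np.linalg.solve(K.T @ K, K.T @ y)      # (K^T K)^{-1} K^T y
x_lstsq = np.linalg.lstsq(K, y, rcond=None)[0]    # library least-squares fit
assert np.allclose(x_pinv, x_normal) and np.allclose(x_pinv, x_lstsq)
```

All three routes give the same minimizer when K has full column rank; the pseudoinverse form remains defined even when it does not.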
A danger of not including prior information in the solution to the inverse problem is that the resulting solution may exhibit non-physical attributes such as negative values or unrealistically large swings between adjacent elements (checkerboard noise). An important role of the prior information is to prevent such non-physical behavior by regularizing the solution. Bayesian regularization as described in Section 11.2 is not the only method for imposing prior constraints. In the Tikhonov regularization, a term is added to the linear least-squares minimization to enforce desired attributes of the solution:
J(x) = ‖y − Kx‖2 + ‖Γx‖2
(11.112)
Here Γ is the Tikhonov matrix and carries the prior information. For example, choosing for Γ a multiple of the identity matrix (Γ = γIn) enforces smallness of the solution. Off-diagonal terms relating adjacent state vector elements enforce smoothness of the solution. Bayesian inference can be seen as a particular form of Tikhonov regularization when we write it as
J(x) = ‖y − Kx‖SO + ‖x − xA‖SA
(11.113)
where ‖a‖S = aT S−1 a denotes the error-weighted squared norm for vector a with error covariance matrix S.
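Tikhonov regularization with Γ = γIn can be sketched as follows; the minimizer (KTK + γ2In)−1KTy is recovered equivalently by stacking K over Γ in an augmented least-squares problem. Sizes and the value of γ are illustrative.

```python
import numpy as np

# Illustrative Tikhonov problem: J(x) = ||y - Kx||^2 + ||gamma * x||^2.
rng = np.random.default_rng(5)
m, n, gamma = 15, 6, 0.8
K = rng.standard_normal((m, n))
y = rng.standard_normal(m)

# Direct solution of the regularized normal equations.
x_direct = np.linalg.solve(K.T @ K + gamma**2 * np.eye(n), K.T @ y)

# Equivalent augmented least-squares problem: stack K over Gamma.
K_aug = np.vstack([K, gamma * np.eye(n)])
y_aug = np.concatenate([y, np.zeros(n)])
x_aug = np.linalg.lstsq(K_aug, y_aug, rcond=None)[0]
assert np.allclose(x_direct, x_aug)

# Larger gamma enforces a smaller solution, as stated in the text.
x_big = np.linalg.solve(K.T @ K + 10.0**2 * np.eye(n), K.T @ y)
assert np.linalg.norm(x_big) < np.linalg.norm(x_direct)
```

Replacing γIn with a difference operator between adjacent state vector elements would instead enforce smoothness rather than smallness.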
In some inverse problems, we have better prior knowledge of the patterns between state vector elements than of the actual magnitudes of the elements. For example, when using observed atmospheric concentrations to optimize a 2-D spatial field of emissions, we may know that emissions relate to population density for which we have good prior information, even if we don’t have good information on the emissions themselves. This type of knowledge can be exploited through a geostatistical inversion. Here we express the cost function as
J(x, β) = (y − F(x))T SO−1(y − F(x)) + (x − Pβ)T S−1(x − Pβ)
(11.114)
where the n × q matrix P describes the q different state vector patterns, with each column of P describing a normalized pattern. The unknown vector β of dimension q gives the mean scaling factor for each pattern over the ensemble of state vector elements. Thus Pβ represents a prior model for the mean of x, with β to be optimized as part of the inversion. The covariance matrix S gives the prior covariances of x, rather than the error covariances.
11.9 Positivity of the Solution
Inverse problems in atmospheric chemistry frequently require positivity of the solution. This is the case, in particular, when the state vector consists of concentrations or emission fluxes (Table 11.1). The standard assumption of Gaussian errors is at odds with the positivity requirement since it allows for the possibility of negative values, but this is not a serious issue as long as the probability of negative values remains small. When small negative values are incurred in the solution for a state vector element, it may be acceptable to simply aggregate them with adjacent elements to restore positivity. In some cases, maintaining positivity in the solution requires stronger measures. Miller et al. (2014) review different approaches for enforcing positivity in inverse modeling of emission fluxes using observations of atmospheric concentrations.
A straightforward way to enforce positivity of the solution is to transform the state variables into their logarithms and optimize the logarithms using Gaussian error statistics. This assumes that the errors on the state variables are lognormally distributed, which is often a realistic assumption for a quantity constrained to be positive. For example, when we say that a state variable is uncertain by a 1-σ factor of 2, we effectively make the statement that the natural logarithm of that variable has a Gaussian error standard deviation of ln 2. Any of the inversion methods described above can be applied to the logarithms of the state variables with Gaussian error statistics, in the same way as for the original state variables. However, if the forward model was linear with respect to the original state variables, then the transformation to logarithms loses that linearity. Analytical solution of the inverse problem then requires an iterative approach with reconstruction of the Jacobian matrix at each iteration. This is not an issue with an adjoint-based inverse method, since the Jacobian is never explicitly constructed in that case. An adjoint-based inverse method incurs no computational penalty when transforming the state variables into their logarithms (or applying any other transformation). The methane flux inversion in Figure 11.4, for example, used an adjoint-based method with logarithms of emission scaling factors as the state variables.
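The factor-of-2 example above can be made concrete with a short sketch (the prior value and the sample size are arbitrary):

```python
import numpy as np

# A state variable "uncertain by a 1-sigma factor of 2" becomes a Gaussian
# variable z = ln(x) with standard deviation ln(2); every sample of
# x = exp(z) is positive by construction.
rng = np.random.default_rng(6)
x_A = 5.0                                  # illustrative prior estimate, > 0
z = np.log(x_A) + np.log(2.0) * rng.standard_normal(100000)
x = np.exp(z)                              # lognormal samples of the state
assert (x > 0).all()

# The 1-sigma interval in log space maps to [x_A/2, 2*x_A] in x space.
lo, hi = x_A / 2, 2 * x_A
frac = np.mean((x >= lo) & (x <= hi))      # fraction within the interval
assert abs(frac - 0.683) < 0.01            # ~68.3%, as for any Gaussian
```

The same mapping applies element by element to a full state vector of emissions or concentrations.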
MCMC methods (Section 11.7) can easily handle the requirement of positivity by restricting the Markov chain to positive values of the state vector elements. However, these methods are computationally expensive. An additional drawback is that they may lead to an unrealistic structure of the posterior error PDF with exaggerated probability density close to zero.
Another approach to enforce positivity of the solution is by applying Karush–Kuhn–Tucker (KKT) conditions to the cost function. KKT conditions are a general method for enforcing inequality constraints in the cost function and are an extension of the Lagrange multipliers method that enforces equality constraints. Here we describe the specific application of minimizing the cost function J(x) subject to the positivity constraints xi ≥ 0 for all n elements of the state vector x. This is done by minimizing the Lagrange function L(x, μ) with respect to x:
L(x, μ) = J(x) − μTx
(11.115)
where μ = (μ1, …, μn)T is the vector of unknown KKT multipliers. The KKT multipliers are constrained to be non-negative (μi ≥ 0) and satisfy the complementary slackness conditions μixi = 0 for each element i.
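As an illustration of these conditions, the sketch below minimizes a simple quadratic cost subject to x ≥ 0 using projected gradient descent (an illustrative solver, not the text's method) and then verifies that μ = ∇xJ(x) at the solution satisfies non-negativity and complementary slackness:

```python
import numpy as np

# Toy cost J(x) = 0.5 ||y - Kx||^2 minimized subject to x >= 0.
rng = np.random.default_rng(7)
m, n = 30, 6
K = rng.standard_normal((m, n))
y = K @ rng.standard_normal(n) - 1.0   # shift so some elements hit the bound

x = np.zeros(n)
step = 1.0 / np.linalg.eigvalsh(K.T @ K).max()
for _ in range(20000):
    # Gradient step followed by projection onto the feasible set x >= 0.
    x = np.maximum(0.0, x - step * (K.T @ (K @ x - y)))

# At the constrained minimum, mu = grad J(x) plays the role of the
# KKT multiplier vector.
mu = K.T @ (K @ x - y)
assert (x >= 0).all()
assert (mu >= -1e-6).all()             # mu_i >= 0 (within round-off)
assert np.abs(mu * x).max() < 1e-6     # complementary slackness mu_i x_i = 0
```

Elements held at zero carry a positive multiplier (the constraint is active), while elements free to move have a vanishing gradient, exactly as the KKT conditions require.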
Modeling of Atmospheric Chemistry Page 64