The right panel of Box 11.5 Figure 1 shows the vertical profile of CO retrieved by MOPITT and compares it to a coincident vertical profile of CO measured from aircraft. The aircraft observations have high accuracy and can be regarded as defining the true vertical profile x. They show a layer of elevated CO at 900–800 hPa that MOPITT does not detect, as would be expected because of the vertical smoothing. To determine if the MOPITT retrieval is consistent with the vertical profile measured from aircraft, we need to smooth the aircraft observations with the averaging kernel matrix in order to simulate what MOPITT should actually observe. Smoothing defines a vertical profile x', shown as the dashed line in the right panel of Box 11.5 Figure 1:
x ' = Ax + (In − A)xA
(11.93)
This is the expected profile (εO = 0) that MOPITT should see if its capability is as advertised by the error analysis that led to the averaging kernel matrix. We see that the smoothed vertical profile from the aircraft agrees closely with the MOPITT observation, supporting MOPITT’s error characterization and implying that MOPITT provides an accurate measurement of the weighted tropospheric column (and not much more). Such aircraft validation of satellite instruments is critical for identifying retrieval biases and inadequate characterization of retrieval errors.
11.5.3 Degrees of Freedom for Signal
The averaging kernel matrix quantifies the number of pieces of information in an observing system toward constraining an n-dimensional state vector. This is called the degrees of freedom for signal (DOFS) (Rodgers, 2000). Before making the observations we had n unknowns representing the state vector elements as constrained solely by the prior error covariance matrix. We express that number of unknowns as
(11.87)
After making the observations the error on x is decreased, and we express this decrease as a reduction in the number of unknowns to . The number of pieces of information from the observations is the reduction in the number of unknowns:
(11.88)
The quantity is a scalar and is thus equal to its trace in matrix notation:
(11.89)
where we have taken advantage of the general property tr(AB) = tr(BA). Thus
(11.90)
so that
(11.91)
The number of pieces of information in an observing system is the trace of its averaging kernel matrix. This concept is analogous to the relative error variance reduction introduced in Section 11.3 with (11.25). If the matrices and SA are diagonal, then we see from (11.91) that the DOFS is simply the sum of the relative reductions in error variances σ2 for the individual state vector elements:
(11.92)
11.5.4 Evaluation of the Inverse Solution
Some basic checks need to be made to evaluate the quality of the optimal estimate obtained from an inverse model. A first check is that the inversion has actually decreased the cost function: < J(xA). One should also check that , verifying that the inverse solution is consistent with the specification of errors. If , the inversion was unable to achieve a solution consistent with the specification of errors. This can happen if errors were greatly underestimated. If , by contrast, errors were likely overestimated. Another important test is to apply the forward model to the optimal estimate and compare the field of (optimal estimate minus observations) to that of F(xA) – y (prior estimate minus observations). The differences with observations should be reduced when using the optimal estimate (this follows from the decrease in the cost function), and the field of should ideally be uniformly distributed as white noise around zero. Large coherent patterns with of consistent sign suggest model bias, poor characterization of errors, or a poor choice of state vector leading to aggregation error. Aggregation error is discussed in Section 11.5.5.
The relative weights of prior and observational errors play an important role in determining the optimal estimate. This weighting is determined not only by the specifications of SA and SO, but also by the relative dimensions of n and m, as these affect the relative weighting of prior and observation terms in the computation of the cost function (see (11.30) and discussion for the scalar example in Section 11.3). If m ≫ n, the solution may be insensitive to the prior estimate because the number of observational terms in the cost function overwhelms the number of prior terms. An implicit assumption in such a construction is that the observations are independent and identically distributed across the observational error PDF. However, we often have little confidence that this is the case. Autocorrelation between observations not captured by the covariance structure of SO will result in excessive weight given to the observations. Beyond this concern, there is often large uncertainty in the specifications of SA and SO. A way of testing the sensitivity to error specification is to introduce a regularization factor γ in the cost function:
(11.94)
which amounts to scaling the observational error covariance matrix SO by 1/γ. The solution can be calculated for different values of γ spanning the range of confidence in error characterization. By plotting the cost function versus γ, we may find that a value of γ other than 1 leads to an improved solution.
We pointed out in Section 11.3 the danger of over-interpreting the reduction in error variance that results from the accumulation of a large number of observations. Idealized assumption of random and representatively sampled observational error may cause to greatly underestimate the actual error on . A more realistic way of assessing the error in is to conduct an ensemble of inverse calculations with various perturbations to model parameters and error covariance statistics (such as through the regularization factor γ) within their expected uncertainties. Model parameters are often a recognized potential source of bias, so that producing an ensemble based on uncertainties in these parameters can be an effective way to address the effect of biases on the optimal estimate.
11.5.5 Limitations on State Vector Dimension: Aggregation Error
One would ideally like to use a state vector as large as possible in order to maximize the amount of information from the inversion. There are two limitations to doing so, one statistical and one computational. There are no such limitations on the size of the observation vector (see Box 11.6).
Box 11.6 Sequential Updating in Sampling the Observation Vector
There is in general no computational limitation on the size m = dim(y) of the observation vector in an analytical inversion, even though one needs to invert a m × m matrix in the construction of G. The reason is that it is usually possible to partition the observation vector into small uncorrelated “packets” of observations that are successively ingested into the inverse analysis. Rodgers (2000) calls this procedure sequential updating. The solution () obtained after ingesting one packet is used as the prior estimate for the next packet, and so on. The final solution is exactly the same as if the entire observation vector were ingested at once. The only limitation is that there must be no observational error correlation between packets; in other words, SΟ for the ensemble of observations must be a block diagonal matrix where the blocks are the individual packets. It is indeed most computationally efficient to ingest uncorrelated data packets sequentially in the inversion. In the extreme case where individual observations have no error correlation, each single observation can be ingested successively and separately.
The statistical limitation in the size of the state vector is imposed by the amount of information actually provided by the observations. As the size of the state vector increases, the number of prior terms in the cost function (11.76) increases while the number of observation terms stays the same. As a result, the optimal estimate is more constrained by the prior estimate, and the smoothing error increases. This is not necessarily a problem if prior error correlations are properly quantified, so that information from the observations can propagate between state vector elements. However, this is generally not the case. In a data assimilation problem where the state vector dimension is by design much larger than the observational vector dimension, the effect of the prior constraints can be moderate
d by allowing observations to modify state variables only locally, or by using a regularization factor as in (11.94) to balance the contributions of the prior estimate and the observations in the cost function.
The computational limitation arises from the task of constructing the Jacobian matrix K and the gain matrix G. Analytical solution to the inverse problem requires these matrices, but the computational cost of constructing them becomes prohibitive as the state vector dimension becomes very large. This constraint can be lifted by using a numerical (variational) rather than analytical method to solve the inverse problem, as described in Sections 11.6 and 11.8. However, numerical methods may not provide error characterization as part of the solution. A major advantage of the analytical solution is to provide a closed form of the posterior error covariance matrix and from there the averaging kernel matrix and the DOFS.
Say that we wish to reduce the state vector dimension in order to decrease the smoothing error or to enable an analytical solution. Starting from an initially large state vector x, we can use various clustering schemes to reduce the state vector dimension (Turner and Jacob, 2015). Clustering introduces additional observational error by not allowing the relationship between the clustered state vector elements to change in the forward model. This is called the aggregation error and is part of the forward model error (the relationship between clustered elements is now a model parameter rather than resolved by the state vector). As the state vector dimension decreases, the smoothing error decreases while the aggregation error increases. We expect therefore an optimum state vector dimension where the total error is minimum. As long as aggregation error is not excessive, it may be advantageous to decrease the state vector dimension below that optimum in order to facilitate an analytical inversion with full error characterization.
The aggregation error for a given choice of reduced-dimension state vector can be characterized following Turner and Jacob (2015). Consider the clustering of an initial state vector x of dimension n to a reduced state vector xω of dimension p. The clustering is described by xω = Γωx where the p × n matrix Γω is called the aggregation matrix. The Jacobian matrix of the forward model is K in the original inversion and Kω in the reduced inversion. For the same ensemble of observations y, the observational errors in the original and reduced inversions are
ε = y − Kx
(11.95)
εω = y − Kωxw
(11.96)
The only difference between the two errors is due to aggregation, causing εω to be greater than ε. Thus the aggregation error εa is
εa = εω − ε = Kx − Kωxω = (K − KωΓω)x
(11.97)
We see from (11.97) that the aggregation error is a function of the true state x, and the aggregation error statistics therefore depends on the PDF of the true state. Let be the mean value of the true state; the covariance matrix of the true states is Se = E[(x – )( – )T]. The aggregation bias is the mean value of the aggregation error:
(11.98)
and the aggregation error covariance matrix Sa is
(11.99)
In general we have no good knowledge of and Se. However, we can still estimate the aggregation error covariance matrix when designing the inversion system to select an optimum dimension for the state vector. For this purpose we use our prior knowledge xA and SA as the best estimates for and Se. When aggregating state vector elements, the relationship between state vector elements in the forward model is not allowed to depart from the prior estimate. It follows that if = xA then there is no aggregation bias since the prior relationship between state vector elements is true and hence K = KωΓω; the forward model with the reduced state vector is identical to that with the original state vector. The aggregation error covariance matrix is given by
Sa = (K − KωΓω)SA(K − KωΓω)T
(11.100)
The corresponding aggregation error covariance matrix in state space (error on x) is GSaGT.
We previously derived in (11.85) the smoothing and observational error covariance matrices for x. We can now write a complete error budget for x including the aggregation error for a reduced-dimension state vector:
(11.101)
where , SA,ω, Aω, and Gω apply to the reduced-dimension state vector. Equation (11.101) separates the aggregation error from the observational error by having SO include forward model error only for the original-dimension state vector (not including the effect of aggregation). We can also express the posterior error covariance matrix in observation space as describing the error on Kωxω:
(11.102)
Figure 11.3 illustrates how the different error components of (11.102) contribute to the overall posterior error covariance matrix. The smoothing error decreases with decreasing state vector size while the aggregation error increases. There is an optimum dimension of x where the posterior error is minimum. Reducing the state vector dimension beyond this minimum may still be desirable for computational reasons and to achieve an analytical solution, and incurs little penalty as long as the aggregation error remains small relative to the observational error. Beyond a certain reduction the aggregation error grows rapidly to exceed the observational error.
Figure 11.3 Total error budget from the aggregation of state vector elements in an inverse model. The application here is to an inversion of methane emissions over North America using satellite observations of methane and with n = 7.366 native-resolution state vector elements representing methane emissions on a 50 × 50 km2 grid. Results are shown as the square roots of the means of the diagonal terms (mean error standard deviation) for the aggregation, smoothing, observational, and total (posterior) error covariance matrices following (11.102) There is an optimum state vector size for which the total error is minimum and this is shown as the circle. However, the aggregation error remains small compared to the observational error down to n ≈ 300 and this could be a suitable choice for a reduced state vector dimension. Gray shading indicates 90% confidence intervals for the total error as diagnosed from the 5th and 95th quantiles of diagonal elements in the posterior error covariance matrix.
From Turner and Jacob (2015).
Figure 11.4 shows the impact of smoothing errors in an inversion of satellite observations of methane columns to constrain methane fluxes over North America at high spatial resolution. The top-left panel gives results from an attempt to constrain emissions at the 50 × 50 km2 native grid resolution of the forward model (state vector with n = 7906). Correction to the prior emissions is less than 50% anywhere because the information from the satellite data is insufficient to constrain emissions at such a high resolution. There results a large smoothing error – the solution is strongly anchored by the prior estimate. The top-right panel gives results from an inversion where the state vector has been reduced to n = 1000 elements by hierarchical clustering of the native grid. Corrections to the prior emissions are much larger. The bottom-right panel compares the quality of inversions with different levels of hierarchical clustering (n ranging from 3 to 7906) in terms of their ability to fit the satellite data. This fit is measured by the observational term in the cost function for the inversion. We find that n = 300–1000 provides the best fit. Coarser clustering (smaller n) incurs large aggregation error.
Figure 11.4 Effect of smoothing and aggregation errors in a high-resolution inversion of methane emissions using satellite observations of methane columns. (a) The correction factors to prior emissions when attempting to constrain emissions at the native 50 × 50 km2 grid resolution of the forward model (n = 7906). (b) The same inversion but with a reduced state vector (n = 1000) constructed by hierarchical clustering of the native-resolution grid cells. The clustering is shown in (c) with arbitrary colors for individual clusters. Panel (d) shows the ability of the inversion to fit the satellite observations as the state vector dimension is decreased from n = 7906 to n = 3 by hierarchical clustering. The quality of the fit is measured by the observational terms of the cost function for the inversion. Optimal results are achieved for n
in the range 300–1000. Finer resolution incurs large smoothing errors, while coarser resolution incurs large aggregation errors.
From Wecht et al. (2014).
11.6 Adjoint-Based Inversion
Analytical solution to the cost function minimization problem ∇xJ(x) = 0 as stated by equation (11.77) requires that the forward model be linear with respect to the state vector and places a practical computational limit on the size of the state vector for the inversion. These limitations can be lifted by minimizing J numerically rather than analytically. Such numerical methods, called variational methods, compute ∇J(x) for successive guesses and converge to the solution by a steepest-descent algorithm. Figure 11.5 illustrates the general strategy. In the adjoint-based inversion, the adjoint of the forward model is used to compute ∇J(x) efficiently for successive iterations.
Figure 11.5 Steepest-descent algorithm to find the minimum of a cost function. The cost function gradient ∇ is first computed for the prior estimate xA and this is used to obtain an improved estimate x1. The cost function gradient is then re-calculated for x1 and this is used to obtain an improved estimate x2, and so on until convergence.
Modeling of Atmospheric Chemistry Page 63