10.4.5 Scatterplots
Comparison of model results to a large ensemble of observations requires statistical metrics to diagnose the significance and extent of discrepancies. A first quantitative evaluation can be made by plotting the N values simulated by the model, denoted M_i (i = 1, …, N), as a function of the observed values O_i (i = 1, …, N) at corresponding locations and times. The resulting scatterplot (Box 10.1) is characterized by a cloud of points, from which a regression line
\hat{M} = a + b\,O \qquad (10.12)
can be derived. Here \hat{M} represents a “statistical model” with coefficients a and b derived from the known values of M_i and O_i by an ordinary least-squares regression method or, if both variables are subject to uncertainties, by a reduced major axis (RMA) regression method (see Appendix E for more details). The departure of a from zero provides information on model offset (absolute bias) and the departure of b from unity provides information on relative bias, assuming that the linear model is correct. Linear regression software packages provide standard errors on a and b, but these generally assume that the linear model is correct and thus will underestimate actual errors. A better estimate of errors on a and b is obtained by jackknife or bootstrap resampling, where regression coefficients are computed for different subsets of the data to yield a spread of a and b values from which the errors can be characterized.
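The fitting and resampling procedure described above can be sketched as follows. This is a minimal illustration using NumPy with synthetic (hypothetical) data; the function names are illustrative, not from any library:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic paired data (hypothetical): "observed" values O and "model" values M
O = rng.uniform(20.0, 120.0, size=200)
M = 5.0 + 0.9 * O + rng.normal(0.0, 8.0, size=200)  # offset 5, relative bias 0.9

def ols_fit(O, M):
    """Ordinary least-squares coefficients (a, b) of the line M-hat = a + b*O."""
    b = np.cov(O, M, bias=True)[0, 1] / np.var(O)
    a = M.mean() - b * O.mean()
    return a, b

def rma_fit(O, M):
    """Reduced major axis (RMA) regression: the slope is the ratio of the
    standard deviations, carrying the sign of the correlation."""
    r = np.corrcoef(O, M)[0, 1]
    b = np.sign(r) * M.std() / O.std()
    return M.mean() - b * O.mean(), b

def bootstrap_errors(O, M, fit, n_boot=1000):
    """Errors on (a, b) by bootstrap resampling: refit on random resamples
    of the data pairs and take the spread of the resulting coefficients."""
    n = len(O)
    coefs = np.empty((n_boot, 2))
    for k in range(n_boot):
        idx = rng.integers(0, n, size=n)    # resample pairs with replacement
        coefs[k] = fit(O[idx], M[idx])
    return coefs.std(axis=0)

a, b = ols_fit(O, M)
a_rma, b_rma = rma_fit(O, M)
se_a, se_b = bootstrap_errors(O, M, ols_fit)
```

Because noise in the data attenuates the OLS slope (|b_OLS| = |r| σ_M/σ_O, while |b_RMA| = σ_M/σ_O), the RMA slope is always at least as large in magnitude as the OLS slope.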
The use of scatterplots for analyzing model results is illustrated in Figure 10.18. The panels compare the ensemble predictions of seven regional models for surface NO2 and ozone concentrations for two different days in Europe with concentrations measured at different monitoring stations. An inspection of the data suggests that the model ensemble produces a weak summertime ozone episode with concentration values (typically 100–150 μg m–3) fairly consistent with the observed values. Discrepancies with observations are worse for NO2, which is strongly influenced by local pollution sources. The scatterplots show that the NO2 concentration values provided by the model ensemble are lower than the concentrations observed at the surface. In addition, the dispersion of the points around the 1:1 line is substantial and hence the correlation coefficient is low. In the case of ozone, the agreement between model and observations is considerably better, and the points are relatively close to the 1:1 line. The model ensemble, however, slightly overestimates the ozone concentrations in areas where the measured values are low and underestimates them where the observed values are high. More elaborate measures of model skill are discussed in the next sections.
Figure 10.18 (a–b) Prediction of the NO2 concentration on October 3, 2014 (a) and the ozone (b) concentrations on June 10, 2014 [μg m–3] from the regional daily forecast for Europe based on the median ensemble of seven European chemistry transport models produced in the frame of the EU-MACC projects (Monitoring Atmospheric Composition and Climate, 2009–2015). (c–d) Measurements of NO2 (c) and ozone (d) concentrations at European monitoring stations. (e–f) Scatterplots representing the calculated concentrations as a function of the measured values.
Courtesy of Virginie Marecal, Meteo-France.
Box 10.1 Construction of Scatterplots
A scatterplot is a diagram that displays in Cartesian coordinates a collection of N points that represent pairs of data (xi, yi). It allows easy examination of different features in the (x, y) relationship including correlations, curvature in the relationship, clustering of points, presence of outliers, etc. A statistical relationship between the two quantities can be established using a best-fit procedure. Details are provided in Appendix E. This relationship is commonly expressed by a linear function:
\hat{y} = a + b\,x

whose intercept a and slope b are determined by applying a linear regression method (method of least squares). The reduced major axis (RMA) regression is most appropriate to account for errors in both variables (see Appendix E).
The strength and direction of the linear relationship between the two variables are expressed by the Pearson correlation coefficient (r) defined by (10.27) and further discussed in Appendix E. When one quantity increases together with the other quantity, the correlation is positive and r > 0. In the opposite case, it is negative with r < 0. The value of r ∈ [−1, +1] measures the dispersion of the data points around the regression line. If all the points fall exactly on the line then |r| = 1. When the values are not linked at all and the data points are fully dispersed in the graph, r is close to zero and the data are not significantly correlated. The fraction of the variance in y that is explained by the statistical model is called the coefficient of determination R2 (see Appendix E). In the case of a linear regression, R2 = r2. Box 10.1 Figure 1 shows an example of a scatterplot with the corresponding regression line and coefficient of determination. The statistical significance of the correlation coefficient is discussed in Box 10.2.
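The identity R2 = r2 for a linear regression can be verified numerically by computing both quantities independently, as in this sketch (NumPy assumed, synthetic data):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=100)
y = 2.0 * x + rng.normal(0.0, 0.5, size=100)   # hypothetical linearly related data

# Pearson correlation coefficient r
r = np.corrcoef(x, y)[0, 1]

# Least-squares regression line y-hat = a + b*x
b = np.cov(x, y, bias=True)[0, 1] / np.var(x)
a = y.mean() - b * x.mean()
y_hat = a + b * x

# Coefficient of determination: fraction of the variance of y
# explained by the regression line
R2 = 1.0 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
```

For a least-squares fit the residual sum of squares equals (1 − r²) times the total sum of squares, so R2 and r² agree to machine precision.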
Box 10.1 Figure 1 Scatterplot between two arbitrary variables x and y with the regression line and the corresponding coefficient of determination R2 deduced from the different data points.
Scatterplots are useful to analyze relationships between different chemical species in the atmosphere. Box 10.1 Figure 2, for example, shows the relationship between the observed concentrations of methane and CO2 measured over southwestern Pennsylvania, color-coded by SO2 concentrations. The diagram differentiates between air masses that are representative of the boundary layer in an urban environment and those that are characteristic of the free troposphere. Some air parcels with high methane concentrations originate from an area where extraction of natural gas from shale rock layers (“fracking”) is taking place.
Box 10.1 Figure 2 Relationship between methane and CO2 concentrations measured by aircraft over southwest Pennsylvania and color-coded by SO2 concentrations.
Figure provided by X. Ren and R. Dickerson, University of Maryland.
When quantity y depends on several variables x1, x2, … the simple regression model must be replaced by a multiple regression approach. A linear relationship between three variables, for example, is represented by a regression plane in a 3-D space. Finally, in some situations, the linear regression model does not provide an adequate statistical model to represent the relationship between variables, and a nonlinear regression method must be applied to derive, for example, the regression coefficients of a degree-n polynomial or of other mathematical expressions.
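A regression plane for two predictor variables can be obtained from a linear least-squares solve, as in the following sketch (NumPy assumed; the plane coefficients are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
x1 = rng.uniform(0.0, 10.0, n)
x2 = rng.uniform(0.0, 10.0, n)
# Hypothetical regression plane y = 1 + 2*x1 - 0.5*x2, plus noise
y = 1.0 + 2.0 * x1 - 0.5 * x2 + rng.normal(0.0, 0.3, n)

# Design matrix with an intercept column; the least-squares solution
# gives the coefficients (a, b1, b2) of the regression plane
X = np.column_stack([np.ones(n), x1, x2])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
a, b1, b2 = coef
```

The same design-matrix approach extends to any number of predictors, or to a polynomial fit by including powers of x as columns.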
10.5 Measures of Model Skill
Model skill is generally measured by the ability of the model to match observations of relevance to the problem of interest. In the case of operational models, such as those used for air quality forecasts, the ability to match observations is the ultimate goal and is often measured as a model score.
10.5.1 Basic Metrics
Different statistical metrics for paired comparisons of model to observed values are used to test the skill of chemical transport models. Metrics allow a general assessment of the confidence in model analyses and forecasts. Consider N paired model and observed quantities Mi and Oi, whose averages are respectively:
\bar{M} = \frac{1}{N}\sum_{i=1}^{N} M_i \,, \qquad \bar{O} = \frac{1}{N}\sum_{i=1}^{N} O_i \qquad (10.13)
and standard deviations are:
\sigma_M = \left[\frac{1}{N}\sum_{i=1}^{N}\left(M_i - \bar{M}\right)^2\right]^{1/2}, \qquad \sigma_O = \left[\frac{1}{N}\sum_{i=1}^{N}\left(O_i - \bar{O}\right)^2\right]^{1/2} \qquad (10.14)
The ensemble N includes data collected at different sites and/or at different times. Individual points included in the ensemble should be independent so that each paired comparison provides independent information. Observations from nearby sites may be strongly correlated and thus not provide independent information. Spatial and temporal scales over which the observations are significantly correlated can be determined with an autocorrelogram that plots the correlation coefficient vs. the spatial or temporal separation between data points.
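An autocorrelogram as described above can be sketched as follows; a synthetic first-order autoregressive series stands in for station observations (NumPy assumed, data hypothetical):

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical AR(1) time series standing in for observations at one station
n, phi = 2000, 0.8
x = np.empty(n)
x[0] = rng.normal()
for t in range(1, n):
    x[t] = phi * x[t - 1] + rng.normal()

def autocorrelogram(x, max_lag):
    """Correlation of the series with itself shifted by each lag.  The lag
    beyond which the correlation decays toward zero indicates the separation
    at which data points provide roughly independent information."""
    d = x - x.mean()
    var = np.sum(d * d)
    return np.array([np.sum(d[: len(d) - k] * d[k:]) / var
                     for k in range(max_lag + 1)])

acf = autocorrelogram(x, 20)   # acf[0] = 1 by construction; decays as ~phi**k
```

For this AR(1) example the correlation at lag 1 is near 0.8 and becomes negligible after roughly 10–20 steps, suggesting how far apart samples must be to count as independent.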
As pointed out in Section 10.4, data outliers require careful consideration because they may weigh heavily in the statistical results. Model fields are generally well-behaved, and any unexplained outliers should be scrutinized as they may reveal basic model errors. Observational outliers are common and may reflect instrumental error or unusual conditions that the model is not intended to represent; if so, they should be screened from the comparison. A standard way to detect outliers is to plot the data against the normal (or log-normal) distribution, revealing extrema that depart from the distribution. These extrema should be discarded if it can be reasonably ascertained that they are not part of the population for which model evaluation is desired or feasible.
Two basic measures of model skill are often adopted: the mean bias (BIAS)
\mathrm{BIAS} = \bar{M} - \bar{O} = \frac{1}{N}\sum_{i=1}^{N}\left(M_i - O_i\right) \qquad (10.15)
that represents the difference between the mean model and observed quantities, and the root mean squared error (RMSE):
\mathrm{RMSE} = \left[\frac{1}{N}\sum_{i=1}^{N}\left(M_i - O_i\right)^2\right]^{1/2} \qquad (10.16)
that represents the spread of the individual errors. These two metrics are expressed in the same units as Mi and Oi. The systematic root mean square error (RMSES) describes the bias between observed data points Oi and the linear least square fit to the observations (see (10.12) and Box 10.1). In a scatterplot, it is determined by the square of the distance between the linear regression line and the 1:1 line:
\mathrm{RMSE_S} = \left[\frac{1}{N}\sum_{i=1}^{N}\left(\hat{M}_i - O_i\right)^2\right]^{1/2}, \qquad \hat{M}_i = a + b\,O_i \qquad (10.17)
The unsystematic root mean square error (RMSEU):
\mathrm{RMSE_U} = \left[\frac{1}{N}\sum_{i=1}^{N}\left(M_i - \hat{M}_i\right)^2\right]^{1/2} \qquad (10.18)
is derived from the distance between the data points M_i and the linear regression line \hat{M}_i = a + b O_i. It represents the scatter of the data about the best-fit line, and can thus be regarded as a measure of model precision. A successful model is characterized by a low value of RMSES and a value of RMSEU close to RMSE because:
\mathrm{RMSE}^2 = \mathrm{RMSE_S}^2 + \mathrm{RMSE_U}^2 \qquad (10.19)
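The decomposition of the RMSE into its systematic and unsystematic parts can be checked numerically, as in this sketch (NumPy assumed, synthetic data):

```python
import numpy as np

rng = np.random.default_rng(4)
O = rng.uniform(10.0, 100.0, 500)                 # hypothetical observations
M = 4.0 + 1.1 * O + rng.normal(0.0, 6.0, 500)     # hypothetical model values

bias = M.mean() - O.mean()                        # mean bias (10.15)
rmse = np.sqrt(np.mean((M - O) ** 2))             # root mean squared error (10.16)

# Least-squares fit of the model values to the observations (10.12)
b = np.cov(O, M, bias=True)[0, 1] / np.var(O)
a = M.mean() - b * O.mean()
M_hat = a + b * O

rmse_s = np.sqrt(np.mean((M_hat - O) ** 2))       # systematic part (10.17)
rmse_u = np.sqrt(np.mean((M - M_hat) ** 2))       # unsystematic part (10.18)

# The decomposition (10.19) holds exactly for a least-squares fit because
# the residuals M - M_hat are orthogonal to any linear function of O
```

The identity (10.19) is exact for an ordinary least-squares fit, since the cross term between the residuals and the systematic departure vanishes.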
Measures such as the mean absolute error (MAE):
\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N}\left|M_i - O_i\right| \qquad (10.20)
and the mean absolute deviation (MAD):
\mathrm{MAD} = \frac{1}{N}\sum_{i=1}^{N}\left|\left(M_i - O_i\right) - \left(\bar{M} - \bar{O}\right)\right| \qquad (10.21)
involving absolute values of the differences are sometimes preferred to measures based on squared differences because they are less sensitive to high values. The mean normalized bias (MNB):
\mathrm{MNB} = \frac{1}{N}\sum_{i=1}^{N}\frac{M_i - O_i}{O_i} \qquad (10.22)
and the mean normalized absolute error (MNAE):
\mathrm{MNAE} = \frac{1}{N}\sum_{i=1}^{N}\frac{\left|M_i - O_i\right|}{O_i} \qquad (10.23)
may be appropriate scoring measures when the data cover an extended range of values to avoid overemphasizing the high tail of the distribution. The MNB has the disadvantage, however, of being asymmetric with respect to under- and overestimation. For example, when the model overestimates the measured value, the MNB can increase to values much larger than unity; however, when it underestimates the observation its value is limited to –1. This issue is addressed by introducing the mean fractional bias (MFB):
\mathrm{MFB} = \frac{2}{N}\sum_{i=1}^{N}\frac{M_i - O_i}{M_i + O_i} \qquad (10.24)
which varies in the range [–2, +2] and is complemented by the mean fractional error (MFE):
\mathrm{MFE} = \frac{2}{N}\sum_{i=1}^{N}\frac{\left|M_i - O_i\right|}{M_i + O_i} \qquad (10.25)
A problem with the last four metrics is that they tend to overemphasize low values where relative errors may be large but of little relevance to the problem of interest. This can be corrected by taking the ratios of the sums, instead of the sums of the ratios. For example, instead of the MNB, one can use the normalized mean bias (NMB) for more robust statistics:
\mathrm{NMB} = \frac{\sum_{i=1}^{N}\left(M_i - O_i\right)}{\sum_{i=1}^{N} O_i} = \frac{\bar{M} - \bar{O}}{\bar{O}} \qquad (10.26)
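The normalized and fractional metrics above are one-liners in practice; this sketch computes them side by side on synthetic data (NumPy assumed, values hypothetical):

```python
import numpy as np

rng = np.random.default_rng(5)
O = rng.uniform(5.0, 100.0, 300)                # hypothetical observations (> 0)
M = O * rng.uniform(0.7, 1.5, 300)              # hypothetical model values

mnb  = np.mean((M - O) / O)                     # mean normalized bias (10.22)
mnae = np.mean(np.abs(M - O) / O)               # mean normalized abs. error (10.23)
mfb  = np.mean(2.0 * (M - O) / (M + O))         # mean fractional bias (10.24)
mfe  = np.mean(2.0 * np.abs(M - O) / (M + O))   # mean fractional error (10.25)
nmb  = np.sum(M - O) / np.sum(O)                # normalized mean bias (10.26)
```

Note that the MFB is bounded in [−2, +2] regardless of how badly the model errs, while the NMB, being a ratio of sums, is dominated by the high values where absolute errors are largest.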
Other dimensionless indices of agreement have been proposed for measuring model skill and are listed in Table 10.3. Willmott et al. (2012) and Chai and Draxler (2014) discuss the advantages and disadvantages of different indices.
Table 10.3 Indices of agreement proposed by different authors to assess model performance
Author Index of agreement
Willmott (1981)
Willmott et al. (1985)
Willmott et al. (2012)
Nash and Sutcliffe (1970)
Legates and McCabe (1999)
Watterson (1996)
Mielke and Berry (2001)
Douglass et al. (1999)
The Pearson correlation coefficient
r = \frac{\frac{1}{N}\sum_{i=1}^{N}\left(M_i - \bar{M}\right)\left(O_i - \bar{O}\right)}{\sigma_M\,\sigma_O} \qquad (10.27)
is the covariance between model and observed values normalized by the product of their standard deviations. It provides different information from the above metrics in that it characterizes the extent to which patterns in the observations are matched by the patterns in the model. Its value may range from –1 to +1. A value of +1 indicates a perfect match. A positive value indicates the level of skill with a statistical significance that depends on sample size (Box 10.2 and Section 10.6). Values of r near zero imply that the variability in the observations is controlled by processes that the model does not capture. A model that is able to capture the observed means but not the observed variability (non-significant r) may be getting the mean right for the wrong reasons. Negative values of r imply large model errors in the simulation of processes.
Box 10.2 Statistical Significance of the Correlation Coefficient
The Pearson correlation coefficient r(x, y) characterizes the strength of a linear relationship between two variables x and y. Whether the correlation is meaningful or not depends on the size of the sample that is being examined. A statistical test must determine the significance of the correlation coefficient for a given probability level.
The correlation of a population is said to be significant if the correlation coefficient ρ(x, y) of the entire population is different from zero. This is different from the correlation r(x, y) that can be determined for a sample of that population. We apply here the “null hypothesis,” which assumes that there is no correlation between x and y [ρ(x, y) = 0] and that any nonzero value of r(x, y) measured from a sample of limited size n is due to sampling error (the sample is too small). If this hypothesis cannot be rejected at the chosen probability level, the correlation is not significant. If, however, the null hypothesis is rejected by the statistical test, the correlation is regarded as significant at a level of confidence defined by the probability that the population is correlated. The risk factor is the complementary probability that the population is in fact not correlated.
Significance of a correlation at a certain level of confidence can be diagnosed by Student’s t-test. For a given correlation coefficient r and sample size n we compute

t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}}

which under the null hypothesis follows a Student distribution with n − 2 degrees of freedom,
and compare it to the corresponding critical value from a statistical table (Appendix E). The correlation is significant if t exceeds the critical value. A confidence level of 95% corresponding to a risk factor of 5% (p < 0.05) is commonly used and Box 10.2 Figure 1 gives the corresponding threshold value of r for different sample sizes.
Box 10.2 Figure 1 Minimum absolute value of the Pearson correlation coefficient for significant correlation at the 95% confidence level, as a function of sample size.
Reproduced from Wikipedia.
As an illustration, consider a sample of ten data points with a calculated r = 0.6. We derive from the above formula a value of t = 2.12. For a risk factor of 5% (confidence level of 95%), the table of Appendix E provides for t a critical value of 2.31. In this case, the null hypothesis of zero correlation cannot be rejected, and the correlation coefficient derived from the sample is therefore not significant. If, however, we consider 52 data points with a measured r of only 0.3, the calculated value of t is 2.24, which is above the critical value of 2.01 found in the table, and the correlation is significant.
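The two worked examples can be reproduced with a few lines of code; this sketch (NumPy assumed) implements the t statistic defined above:

```python
import numpy as np

def t_statistic(r, n):
    """Student's t for testing the null hypothesis of zero correlation;
    it has n - 2 degrees of freedom."""
    return r * np.sqrt(n - 2) / np.sqrt(1.0 - r ** 2)

# The two cases discussed in the text
t1 = t_statistic(0.6, 10)   # ~2.12, below the critical value 2.31 for df = 8
t2 = t_statistic(0.3, 52)   # ~2.22, above the critical value 2.01 for df = 50
```

The first case is not significant at the 95% level despite the larger r, illustrating how significance depends on sample size as much as on the correlation itself.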
Binary prediction of atmospheric events. A model prediction of a specific event such as an air pollution episode at a given location (e.g., concentration of pollutants exceeding a regulatory threshold) can be evaluated as a binary variable by distinguishing four possible situations: (1) the event is predicted and observed; (2) the event is not predicted and not observed; (3) the event is predicted but not observed; and (4) the event is not predicted but is observed. Cases (1) and (2) are correct predictions (hits and correct negatives, respectively), while case (3) is a false alarm and case (4) is a miss. Consider a sample of N predictions covering a certain period of time, with each prediction having an outcome yy, nn, yn, or ny, where the first letter indicates whether the event is predicted (y for yes, n for no) and the second whether it is observed. We have N = yy + nn + yn + ny. The skill of the model for binary prediction (event or no event) can be measured by the fraction of correct predictions (PC):
\mathrm{PC} = \frac{yy + nn}{N} \qquad (10.28)
This is a lenient metric because the PC will be high when events are rare even if the model has no skill at predicting the events. The fraction of observed events that were correctly predicted (called probability of detection or POD) is
\mathrm{POD} = \frac{yy}{yy + ny} \qquad (10.29)
The fraction of predicted events that did not occur (called false alarm ratio or FAR) is
\mathrm{FAR} = \frac{yn}{yy + yn} \qquad (10.30)
The critical success index, also called the threat score (TS) is defined by
\mathrm{TS} = \frac{yy}{yy + yn + ny} \qquad (10.31)
Some of the predictions may be successful by chance. To correct for this, one defines the equitable threat score (ETS)
\mathrm{ETS} = \frac{yy - \alpha}{yy + yn + ny - \alpha} \qquad (10.32)
in which α is the number of events that would be predicted by chance,
\alpha = \frac{(yy + yn)(yy + ny)}{N} \qquad (10.33)
and is subtracted from the number of hits yy. The ETS cannot exceed 1 and is negative if the chance prediction is better than the actual prediction.
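The binary skill scores follow directly from the four contingency counts, as in this sketch (the counts are hypothetical):

```python
# Hypothetical contingency counts for a set of N daily binary forecasts:
# yy = hits, nn = correct negatives, yn = false alarms, ny = misses
yy, nn, yn, ny = 30, 240, 15, 15
N = yy + nn + yn + ny

pc  = (yy + nn) / N                  # fraction of correct predictions (10.28)
pod = yy / (yy + ny)                 # probability of detection (10.29)
far = yn / (yy + yn)                 # false alarm ratio (10.30)
ts  = yy / (yy + yn + ny)            # threat score (10.31)

alpha = (yy + yn) * (yy + ny) / N    # events predicted by chance (10.33)
ets = (yy - alpha) / (yy + yn + ny - alpha)   # equitable threat score (10.32)
```

Note how lenient PC is here: the event occurs on only 45 of 300 days, so PC = 0.90 even though the model detects just two-thirds of the events (POD ≈ 0.67) and TS = 0.5.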
Grading models. When comparing different models, it is useful to attribute to each of them a grade for their ability to correctly simulate atmospheric observations. In the approach adopted by Douglass et al. (1999), the grade gm,j by which a model m represents the observed concentration of a particular species j is given by
g_{m,j} = 1 - \frac{1}{n_g}\,\frac{\left|\bar{M}_{m,j} - \bar{O}_j\right|}{\sigma_j} \qquad (10.34)
Here, the overbars indicate mean values, σj denotes the standard deviation of the observations, and ng is a scaling factor. If this factor is taken to be equal to 3, the grade gm,j is equal to zero when the mean value simulated by model m departs from the mean observed value by 3σj.
When N chemical species are considered for evaluating model skill, and hence N diagnostics are performed, they can be combined to derive for a given model m a single overall grade Gm (Waugh and Eyring, 2008):
G_m = \frac{\sum_{j=1}^{N} w_j\, g_{m,j}}{\sum_{j=1}^{N} w_j} \qquad (10.35)
where wj denotes a weight assigned to the diagnostic for species j as a measure of its importance.
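The grading scheme can be sketched as follows (NumPy assumed; all numerical values are hypothetical, and the clipping of negative grades to zero is an assumption for models falling outside the ng·σ range):

```python
import numpy as np

# Hypothetical values for three species: model means, observed means,
# observed standard deviations, and importance weights
M_mean = np.array([48.0, 120.0, 0.9])
O_mean = np.array([50.0, 110.0, 1.0])
sigma  = np.array([10.0, 25.0, 0.2])
w      = np.array([1.0, 2.0, 1.0])
ng = 3.0                              # scaling factor in (10.34)

# Grade per species (10.34): zero when the model mean departs from the
# observed mean by ng * sigma_j; negative grades are clipped to zero here
g = np.clip(1.0 - np.abs(M_mean - O_mean) / (ng * sigma), 0.0, None)

# Overall grade (10.35): weighted average of the per-species grades
G = np.sum(w * g) / np.sum(w)
```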
Modeling of Atmospheric Chemistry