Modeling of Atmospheric Chemistry
Page 70
MTM = I
where I is the identity matrix. The transformation is said to be unitary if the adjoint matrix M‡ (see Section E.3) is equal to the inverse matrix M–1:
M‡M = I
Under these conditions, the norm of vector y is equal to the norm of vector x. A unitary transformation (such as a Fourier transform, see next section) preserves the norm of a vector.
Laplace and Fourier Transforms
If function f(x) of a real variable x is equal to zero for x < 0, the Laplace transform of f(x) is defined by function F(p) of a complex variable p
where F(p) is said to be the image of f(x). One shows easily that the Laplace transform of the derivative of function f is given by p F(p) – f(0) and, more generally, the Laplace transform of the nth derivative of f(x) denoted f(n) is . Here f(m) is the mth derivative of function f at x = 0. The Laplace transform can thus be applied to transform an ODE into an algebraic equation, or a PDE into an ODE. The solution of these transformed equations is generally easier to obtain than for the original equations, and the solution f(x) is then obtained by applying an inverse transform.
The Fourier transform of f(x) is defined by
and its inverse by
where k is a real variable and i2 = –1. In many global modeling applications, information is repeatedly transferred between spectral and grid point representations (see Chapter 4) by applying Fourier transforms. If independent variable x represents space [m], variable k represents wavenumber [m–1]. If the independent variable x represents time [s], variable k represents frequency [s–1]. The Fourier transform of the derivative of f(x) is equal to i k F(k) if f(0) = 0.
In grid point models, the values of function f(x) are known only at N discrete points xj, (j = 0, 1, …, N – 1) of the spatial domain, for example, on regularly spaced grid points along a longitude circle on a sphere. In this case, the transfer of data between the grid point and spectral spaces is achieved by applying a discrete Fourier transform (DFT), which takes the form
(8)
(9)
Here, fj denotes the value of function f(x) at the geometric point xj, Fk is proportional to the value of function F(k) for a discrete value of the wavenumber k.
The application of expressions (8) and (9) requires N multiplications and additions. Since these operations have to be applied N times, the total number of operations required is of the order of N2. The method can become computationally impractical when the value of N is large. The fast Fourier transform (FFT) algorithm, introduced in its modern version by Cooley and Tukey (1965), but already considered by Gauss in 1805, circumvents this problem by breaking the N-point transform into two N/2 point transforms, one for the even (2 j) points and one for the odd (2 j + 1) points. The splitting is further repeated until the problem is broken into N single-point transforms. The number of steps required to calculate the discrete Fourier Transform is then of the order of N log2N operations, and the computational cost is considerably smaller.
Principal Component Analysis and Empirical Orthogonal Functions
Principal component analysis (PCA) is a standard method to identify the characteristic patterns of variability in a data set. Consider a time series of observations for an ensemble of K variables assembled into a vector x = (x1, x2, …, xK)T. The time series consists of successive values of x at discrete times. The variability of x with time is characterized by the (K × K) covariance matrix
This is a symmetric matrix of full rank and thus has K eigenvalues λk and K corresponding eigenvectors ek forming an orthonormal basis. The eigenvectors represent uncorrelated patterns in the data and are called the empirical orthogonal functions (EOFs) for the data set.
For a given realization of x at an individual time, one defines the principal components (PCs) as the projections of x onto the orthonormal basis defined by the EOFs. Thus the k-th principal component yk is the projection of x onto the k-th eigenvector ek:
Each realization of the vector x can thus be projected onto the basis of EOFs using the PCs:
The PCs form a time series corresponding to the time series of x, and the variance of a given PC is given by the corresponding eigenvalue: var(yk) = λk. We thus see that λk measures the variance associated with the EOF pattern defined by eigenvector ek. In this manner, we can define the dominant independent patterns accounting for the variability in a data set.
E.7 Probability and Statistics
A random (or stochastic) variable X is a variable whose values cannot be predicted deterministically, but can be described probabilistically. These values can be either discrete or continuous. In the first case, the variable may take only a countable number of distinct values xi, while in the second case it may take an infinite number of possible values x. Here, xi and x refer to specific values taken by X.
A random variable is described by its probability density. The probability density of a discrete random variable is given by a list of N probabilities pi associated with each of its possible values. All values of pi must be non-negative and normalized so that their sum is equal to 1:
Continuous random variables are described by a probability density function (PDF), denoted p(x), such that p(x)dx is the probability for X being in the range [x, x + dx], normalized to add up to 1 for all possible realizations of X:
The PDF for a discrete random variable is often represented by a histogram (in blue in Figure E.2) that displays the probability pi (i = 1, N) that X lies in certain ranges over a given domain. The red curve shows the corresponding continuous PDF p(x).
Figure E.2 Probability distribution (blue histogram) and probability density function (red curve) for a random variable.
Reproduced from http://glowingpython.blogspot.com.
Joint Probability Distribution
If p(x) dx is the probability that a random variable X takes a value between x and x + dx, and p(y) dy is the probability that variable Y takes a value between y and y + dy, the joint probability density function p(x, y) is defined as the probability that the values x and y be in the range x and x + dx, and y + dy, respectively. This joint PDF is normalized so that
Further, it is also normalized with respect to each variable
Conditional Probability Distribution
The probability that X takes a value between x and x + dx, given a known value y of Y, is expressed by the conditional probability p(x|y)
If random variables X and Y are independent,
p(x, y) = p(x) p(y)
and consequently
p(x| y) = p(x)
Mean of a Distribution
We consider a random variable X that is distributed according to a PDF p(x). The mean value m of a population, also called the expected value of x and denoted by E[x], is given by the first-order population moment:
The mean value mxy of the product of two independent random variables x and y is given by
Variance and Standard Deviation
The variance of a variable X whose distribution is expressed by the PDF p(x) relative to its mean value m is defined as the second population moment
The standard deviation σ is the square root of the variance.
Normal Density Function
A common PDF form is the Gaussian function (also called normal function)
where m and σ are the mean and standard deviation of the distribution, respectively. One standard deviation from the mean accounts for 68.2% of the population, two standard deviations for 95.4%, and three standard deviations for 99.7% (Figure E.3).
Figure E.3 Gaussian probability density function. Symbol σ denotes the standard deviation of the distribution. The fraction [percent] of the population included in different intervals is also shown.
Reproduced from Wikimedia Commons.
Sample Data
We now consider a sample of random data containing N points xi(i = 1, N) (e.g., N observations of variable x). The sample mean value (denoted by ) is given by
where pi is the weight given to point i with
For variables o
f equal weights, the sample mean is
The sum of the deviation of variables xi from their arithmetic mean is by definition equal to zero
For a scalar random variable x that takes N discrete values x1, x2, …, xN with the corresponding probabilities p1, p2, …, pN, the sample variance of x relative to its sample mean is given by
For equally probable values, we have
(10)
In many practical applications, one often examines a limited number of individual data points, and number N corresponds to a sample of the entire data population. One can show that the variance of the entire population is equal to the average of the variances derived for all possible samples if, in expression (10), the squared distance is divided by N – 1 rather than by N. In this case, the sample variance represents an unbiased estimate of the population variance. Therefore, it is often recommended to divide the squared distance by N – 1 when calculating the sample variance. The correction, however, is very small when the number of data in the sample becomes large.
Covariance
The covariance between two scalar-valued random variables x and y, each characterized by N sampled data points (x1, x2, …, xN) and (y1, y2, … yN), is defined by
or
where the overbar is again a representation of the sample mean. The variance is the covariance of two identical variables. One shows easily that
If z is a random vector (vector whose components are random numbers), the covariance matrix
represents a multivariate generalization of the variance and covariance defined above in the case of a scalar. If z1, z2, z3, … zk denote the K elements of vector z (assumed to be random variables), the K × K covariance matrix (also called variance–covariance matrix) has the following structure:
(11)
The inverse of this matrix is called the precision matrix or the concentration matrix. The diagonal elements represent variances of the individual elements of vector z, and the non-diagonal elements a measure of the correlation between the different elements of the vector. The covariance matrix is symmetric since the covariance operator is commutative (cov(zi,zj) = cov(zj,zi)).
The cross-covariance between the two random vectors x and y of dimension m and n, respectively, is a matrix of dimension m × n
Ordinary Regression Analysis
Relationships among different variables, specifically between a dependent variable y and an independent variable x, can be estimated by using a statistical process called regression analysis. We consider here the simple case in which we fit N given data points (x1, y1), (x2, y2), …, (xN, yN), of two correlated random variables x and y by a parameter-dependent regression function
We use the method of least squares to derive the parameters a, b, … that minimize the distance between the values of the data points and the corresponding values of the dependent variable y provided by function f. Thus, we minimize over a, b, … the function
and write
The first step is to discover the form of the relationship that exists between variables x and y. A scatterplot diagram for y versus x displaying the data points (xi, yi) may be useful. If the diagram suggests that a linear relationship is a suitable approximation, one expresses the regression curve as a linear statistical model (see Figure E.4)
Figure E.4 Scatterplot with a number of data points (xi, yi) and the corresponding regression line (statistical model).
More complex polynomial regressions may have to be considered if the scatterplot points to a nonlinear relationship.
In the linear ordinary regression method (ORM), coefficients a and b are derived from the minimization of the sum of the residuals between the data points yi and the corresponding predicted values for x = xi
This is expressed by
and results in linear equations in the parameters a and b called the normal equations of least-squares
The solution of these equations is for the intercept
and the slope
where
denote the averages of the sampled xi and yi values, and s(x) and s(y) denote the corresponding standard deviations. Figure E.4 shows an example of a regression line.
When quantity y depends on several independent variables x1, x2, x3, …, the simple regression model described here must be generalized and replaced by a multiple regression approach.
The strength and sign of the linear relationship between the two random variables x and y are expressed by the Pearson correlation coefficient r
or by the alternative formula
This coefficient is equal to the covariance of x and y divided by the standard deviation of the sampled values of variables x and y:
Its value ranges from –1 to +1. The sign of the correlation coefficient indicates whether the random variables x and y are positively or negatively correlated.
The coefficient of determination R2 provides information on the goodness of fit of a statistical model. It represents the degree by which a regression line or curve represents the data. It provides the proportion of the variance of variable y that is predictable from the value of the other variable (x). The coefficient is expressed by
where represents again the value of function y approximated by the regression model for x = xi, and is the mean value of all data yi. If R2 = 1, all data points perfectly fit the regression model. For linear least squares regression with an estimated intercept term, R2 equals the square r2 of the Pearson correlation coefficient.
In multilinear regression analyses, the value of R2 increases automatically when extra explanatory variables are added to the model. The adjusted R2 corrects for this spurious behavior by taking into account the number of explanatory variables p relative to the number of data points n. It is defined as
Reduced Major Axis Regression Analysis
The ordinary regression method described above minimizes the distance between the measured and predicted values of variable y (vertical distance, see Figure E.5a), but ignores errors in the measured values of variable x (horizontal distance). The uncertainties in both variables can be accounted for by applying the reduced major axis (RMA) regression method. In this case, the regression line is defined by minimizing the error in both Cartesian directions x and y, and specifically the area of the triangle shown in Figure E.5b.
Figure E.5 Geometric representation of the ordinary regression (a) and of the reduced major axis regression (b). Point P represents a data point (xi, yi). In the ordinary regression method, the sum (for all data points) of the squared distance (vertical red line) between the data points and the corresponding values of y on the regression line is minimized. In the reduced major axis regression method, the sum (for all data points) of the area of the red triangle is minimized.
The resulting value for the slope of the regression line is
where the sign is chosen to be the same as the sign of the correlation coefficient. Note that, in this approach, the slope is provided by the ratio between the standard deviations of the y and x variables, respectively. The slope bRMA from the RMA regression is related to the slope bORM from the ordinary regression method by bRMA = bORM/r(x, y), thus the RMA regression will produce a steeper slope.
Student’s t-Test
Introduced by the Irish chemist William Sealy Gosset under the pseudonym of Student, the t-test is used to determine whether two sets of data are significantly different from each other. We distinguish between the one-sample test, used to test whether the mean of the population from which a sample is drawn randomly differs significantly from the mean of a reference population, and the two-sample test, used to test whether two population means are significantly different from each other on the basis of randomly drawn samples. The test allows for some chance of error, and this is measured by the level of significance α, indicating a probability of error (p-value) not to be exceeded. Standard practice is to choose α = 0.05, meaning that we require the t-test to have less than a 5% chance of being in error. We then say that the p-value must be less than 0.05. Significance depends on the de
grees of freedom (df) in the data set, defined as the number of variables that are free to vary. Computation of the t-value for the test and comparison to a critical value tc(α, df) gives us the result. Critical values tc(α, df) are tabulated here.
One-sample t-test. Here, we compare the sample to a reference population with mean m. We denote by and s2 the mean and variance of the N independent data points in the sample, and compute t as
with degrees of freedom df = N – 1. We formulate the hypothesis that the difference between the mean m for the reference population and the mean of the population from which the sample was drawn is not significant (null hypothesis). If the value of |t| calculated from the above expression is larger than the critical value tc(α, df), then we reject that null hypothesis. The test can be one-tailed, in which we check for significant difference in only one direction (higher or lower), or two-tailed, in which case we allow for the possibility of significant difference at either end. Using α = 0.05, a one-tailed test checks whether the mean of the sample is within the lower 95th quantile of the PDF for it to be consistent with m, while a two-tailed test checks whether the mean of the sample is between the 2.5th and 97.5th quantiles of the PDF.