The statistics of cross-validation residuals
Here we derive the statistical expectation of the free residual at the convergence of a least-squares refinement.
The normal equations of least-squares refinement at convergence can be written
If the errors in the observations and the model are not too large, then a truncated Taylor expansion may be written about the expected values of the parameter vector and the observation vector.
The structure amplitudes and target distances corresponding to the excluded observations in a test set can be expressed in terms of the parameter estimates by the truncated Taylor expansion
Thus the column of residuals of the excluded observations is given by
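The steps above can be sketched in explicit notation. The symbols used here are assumptions chosen for illustration and may differ from those used elsewhere in this paper: f_o and f(x) denote the working-set observations and their calculated values, W the working-set weight matrix, g_o and g(x) the excluded observations and their calculated values, A and B the corresponding matrices of first derivatives with respect to the parameters, x-hat the refined parameters, and overbars the expected values.

```latex
% Assumed notation (illustrative only):
%   A = \partial f / \partial x,  B = \partial g / \partial x
\begin{align*}
  A^{T} W \,\bigl[f_o - f(\hat{x})\bigr] &= 0
    && \text{(normal equations at convergence)} \\
  f(\hat{x}) &\simeq \bar{f} + A\,(\hat{x} - \bar{x})
    && \text{(truncated Taylor expansion, working set)} \\
  \hat{x} - \bar{x} &\simeq (A^{T} W A)^{-1} A^{T} W \,(f_o - \bar{f})
    && \text{(parameter errors)} \\
  g(\hat{x}) &\simeq \bar{g} + B\,(\hat{x} - \bar{x})
    && \text{(truncated Taylor expansion, test set)} \\
  g_o - g(\hat{x}) &\simeq (g_o - \bar{g}) - B\,(\hat{x} - \bar{x})
    && \text{(excluded residuals)}
\end{align*}
```

The third line follows by substituting the second into the first and solving for the parameter errors, which requires the normal matrix to be non-singular.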
Assuming that the errors in the excluded observations and in the corresponding quantities calculated from the refined parameters are uncorrelated, the VCM of the residuals associated with the excluded observations is given by
Thus the VCM of the excluded residuals is equal to the sum of the VCM of the excluded observations and the VCM of the corresponding quantities calculated from the refined parameters. From equation (18) it follows that
where
is a unit matrix of order p. Using the fact
that
and noting that the trace of a product of matrices is invariant under a cyclic permutation of the order of matrix multiplication, we take the trace of both sides of equation (19) to give
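A sketch of this chain of results, which appears to correspond to equations (18) to (20), is given below. The notation is an assumption made for illustration: Delta is the column of excluded residuals, V_g the VCM of the excluded observations with W_g its inverse, B the matrix of derivatives of the excluded quantities with respect to the parameters, and H the normal matrix of the working set, whose inverse is taken as the VCM of the refined parameters (valid only when the working-set weights are the inverse of the observational VCM).

```latex
% Assumed notation (illustrative only): H = A^{T} W A,  W_g = V_g^{-1}
\begin{align*}
  V_{\Delta} &= V_g + B H^{-1} B^{T} \\
  W_g^{1/2} V_{\Delta} W_g^{1/2}
    &= I_p + W_g^{1/2} B H^{-1} B^{T} W_g^{1/2} \\
  \bigl\langle \Delta^{T} W_g \Delta \bigr\rangle
    = \operatorname{tr}\bigl(W_g V_{\Delta}\bigr)
    &= p + \operatorname{tr}\bigl(H^{-1} B^{T} W_g B\bigr)
\end{align*}
```

The last line uses the fact that the expectation of a quadratic form in zero-mean residuals equals the trace of the weight matrix times their VCM, together with the cyclic invariance of the trace.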
If the p excluded observations are structure amplitudes and we assume that the weight matrix is diagonal, then equation (20) can be written as
where the angle brackets denote statistical expectation,
is the expected value of the residual associated with
the given p excluded observations and
is the ith
row of B. The expected value of a single weighted excluded
residual in the summation is obtained by taking a single diagonal term
from equation (19) which gives
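The diagonal-weight result can be checked numerically. The following is a minimal sketch under stated assumptions, not the procedure used in this work: the model is taken to be exactly linear, the errors are independent and Gaussian, the weights are the reciprocal variances, and the matrices `A` and `B` and all function names are hypothetical. It compares the predicted expectation, p plus the sum of the weighted quadratic forms in the rows of B, with a Monte Carlo average of the weighted free residual.

```python
import numpy as np


def expected_free_residual(A, B, w_work, w_free):
    """Predicted expectation: p + sum_i w_i * b_i H^{-1} b_i^T (diagonal weights)."""
    H = A.T @ (w_work[:, None] * A)          # normal matrix A^T W A
    H_inv = np.linalg.inv(H)
    # b_i H^{-1} b_i^T for every row b_i of B
    quad = np.einsum("ij,jk,ik->i", B, H_inv, B)
    return len(w_free) + float(np.sum(w_free * quad))


def simulate_free_residual(A, B, sigma_work, sigma_free, trials, seed=0):
    """Monte Carlo mean of sum_i w_i * delta_i^2 over the excluded observations.

    For a linear model the residuals depend only on the errors, so the true
    parameter values never need to be specified.
    """
    rng = np.random.default_rng(seed)
    w_work = 1.0 / sigma_work**2
    w_free = 1.0 / sigma_free**2
    H = A.T @ (w_work[:, None] * A)
    e_work = rng.normal(size=(trials, len(sigma_work))) * sigma_work
    e_free = rng.normal(size=(trials, len(sigma_free))) * sigma_free
    # Parameter errors for each trial: dx = H^{-1} A^T W e_work
    dx = np.linalg.solve(H, A.T @ (w_work[:, None] * e_work.T)).T
    # Excluded residuals: observation error minus propagated parameter error
    delta = e_free - dx @ B.T
    return float(np.mean(np.sum(w_free * delta**2, axis=1)))
```

With randomly generated `A`, `B` and standard deviations, the two functions should agree to within the sampling error of the simulation, and both exceed p because every quadratic form is positive.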
Two notable assumptions have been made in the above analysis. First, it has been assumed that the test set residuals are uncorrelated with those in the working set. Equation (18) is invalid if this assumption is not true.
Second, it is assumed that the refinement has used a weight matrix which correctly reflects the experimental and model errors. Equation (21) is invalid if this assumption is not true. When a diagonal weight matrix is used, as is almost always the case in practice, correlated errors in reciprocal space will not be correctly represented. In Appendix III it is shown that the use of weight matrices which do not correctly account for the errors will lead to test set residuals with larger variance-covariance matrices.
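The effect of incorrect weights can be illustrated with a small numerical sketch. This is not the derivation of Appendix III: it assumes a linearised model with hypothetical matrices `A` and `B`, a known VCM `V` of correlated working-set errors, and it relies on the Gauss-Markov result that weights equal to the inverse of `V` give the parameter VCM of minimum size. The function names are illustrative only.

```python
import numpy as np


def parameter_vcm(A, W, V):
    """VCM of the linearised estimate when weights W are used but the true
    VCM of the working-set observations is V."""
    H = A.T @ W @ A                      # normal matrix for the weights used
    G = np.linalg.solve(H, A.T @ W)      # estimator matrix: dx = G @ errors
    return G @ V @ G.T


def excess_free_residual_vcm(A, B, V):
    """Free-residual VCM with diagonal weights minus that with full weights.

    The VCM of the excluded observations is common to both and cancels, so
    only the propagated parameter errors contribute to the difference.
    """
    W_full = np.linalg.inv(V)                # weights matching the true errors
    W_diag = np.diag(1.0 / np.diag(V))       # correlations ignored
    diff = parameter_vcm(A, W_diag, V) - parameter_vcm(A, W_full, V)
    return B @ diff @ B.T
```

For any positive-definite `V` the returned matrix should have no negative eigenvalues beyond rounding error, which is the sense in which the test set residuals have a larger VCM when correlations are ignored.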