previous section  contents page  next section  The statistics of cross-validation residuals


Appendix I

Statistical properties of free residuals

Derivation of the expected value of the free residual

Here we derive the statistical expectation of the free residual tex2html_wrap_inline1402, at the convergence of a least-squares refinement.

The normal equations of least-squares refinement at convergence can be written

If the errors in the observations and the model are not too large, then a truncated Taylor expansion may be written about the expected values of the parameter vector tex2html_wrap_inline1576 and the observation vector tex2html_wrap_inline1578:

displaymath571

The structure amplitudes and target distances corresponding to the excluded observations in a test set can be expressed in terms of the parameter estimates by the truncated Taylor expansion

eqnarray581

Thus the column of residuals of the excluded observations is given by

eqnarray597

Assuming that the errors in tex2html_wrap_inline1580 and tex2html_wrap_inline1358 are uncorrelated, the variance-covariance matrix (VCM) of the residuals associated with the excluded observations tex2html_wrap_inline1332 is given by

  eqnarray617

Thus tex2html_wrap_inline1326 is equal to the sum of the VCM of the excluded observations and the VCM of the corresponding quantities calculated from the refined parameters. From equation (18) it follows that

  equation643

where tex2html_wrap_inline1588 is a unit matrix of order p. Using the fact that

displaymath657

and noting that the trace of a product of matrices is invariant under a cyclic permutation of the order of matrix multiplication, we take the trace of both sides of equation (19) to give

  eqnarray664
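The two steps above can be checked numerically in a purely linear setting. The sketch below does not use the notation of the equations above. It assumes a hypothetical design matrix `A` for the working set, a hypothetical design matrix `B` for the p excluded observations, diagonal error VCMs `V` and `V_t`, and weights equal to the inverse VCMs. Under those assumptions the free residuals are linear in the errors, so their VCM can be propagated exactly and compared with the sum of the two terms, and the cyclic property of the trace can be verified on the weighted second term.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, p = 40, 5, 8  # working observations, parameters, excluded observations (illustrative sizes)

A = rng.normal(size=(n, m))                   # hypothetical working-set design matrix
B = rng.normal(size=(p, m))                   # hypothetical test-set design matrix
V = np.diag(rng.uniform(0.5, 2.0, size=n))    # assumed VCM of the working-set errors
V_t = np.diag(rng.uniform(0.5, 2.0, size=p))  # assumed VCM of the excluded observations
W = np.linalg.inv(V)                          # weights taken as the inverse VCM
W_t = np.linalg.inv(V_t)

H = A.T @ W @ A                               # normal matrix
H_inv = np.linalg.inv(H)

# In the linear model the parameter error is G @ e, with G = H^-1 A^T W,
# so the VCM of the calculated test-set quantities is B G V G^T B^T.
G = H_inv @ A.T @ W
vcm_calc = B @ G @ V @ G.T @ B.T

# Uncorrelated working-set and test-set errors: the two VCMs simply add.
vcm_free = V_t + vcm_calc
expected = V_t + B @ H_inv @ B.T

# Trace of the weighted VCM, using invariance under cyclic permutation.
lhs = np.trace(W_t @ vcm_free)
rhs = p + np.trace(H_inv @ B.T @ W_t @ B)

print(np.allclose(vcm_free, expected), np.isclose(lhs, rhs))
```

With correct weights the propagated term collapses to `B @ H_inv @ B.T`, which is why the weighted trace splits into p plus a term that depends only on the normal matrix and the test-set design matrix.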

If the p excluded observations are structure amplitudes and we assume that the weight matrix is diagonal, then equation (20) can be written as

  equation697

where the angle brackets denote statistical expectation, tex2html_wrap_inline1402 is the expected value of the residual associated with the given p excluded observations and tex2html_wrap_inline1302 is the ith row of B. The expected value of a single weighted excluded residual in the summation is obtained by taking a single diagonal term from equation (19), which gives

displaymath721
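As an illustration of the diagonal-weight result, the following sketch compares a simulated mean free residual with the value predicted from per-observation terms of the form 1 + w_i b_i H^-1 b_i^T. The linear model, the matrices `A` and `B`, the standard deviations and the helper `expected_free_terms` are all hypothetical and chosen only for the demonstration. The symbols are not those of the equations above.

```python
import numpy as np

def expected_free_terms(A, w, B, w_t):
    """Per-observation expectations 1 + w_i * b_i H^-1 b_i^T for diagonal weights."""
    H = A.T @ (w[:, None] * A)                         # normal matrix A^T W A
    H_inv = np.linalg.inv(H)
    leverage = np.einsum("ij,jk,ik->i", B, H_inv, B)   # b_i H^-1 b_i^T for each row of B
    return 1.0 + w_t * leverage

rng = np.random.default_rng(1)
n, m, p = 30, 4, 6                                     # illustrative sizes
A = rng.normal(size=(n, m))                            # hypothetical working-set design matrix
B = rng.normal(size=(p, m))                            # hypothetical test-set design matrix
sigma = rng.uniform(0.5, 2.0, size=n)                  # assumed working-set standard deviations
sigma_t = rng.uniform(0.5, 2.0, size=p)                # assumed test-set standard deviations
w, w_t = 1.0 / sigma**2, 1.0 / sigma_t**2              # weights that match the errors

terms = expected_free_terms(A, w, B, w_t)
predicted = terms.sum()                                # p plus the sum of weighted leverages

# Monte Carlo: repeat the weighted least-squares fit with fresh errors each time.
trials = 20000
x_true = rng.normal(size=m)
y = A @ x_true + rng.normal(size=(trials, n)) * sigma
y_t = B @ x_true + rng.normal(size=(trials, p)) * sigma_t
H = A.T @ (w[:, None] * A)
x_hat = np.linalg.solve(H, A.T @ (w[:, None] * y.T))   # shape (m, trials)
r = y_t - (B @ x_hat).T                                # free residuals, shape (trials, p)
simulated = (w_t * r**2).sum(axis=1).mean()

print(predicted, simulated)
```

Each term exceeds 1 because the inverse normal matrix is positive definite, so in this setting the expected free residual is always larger than the number of excluded observations.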

Two notable assumptions have been made in the above analysis. First, it has been assumed that the test set residuals are uncorrelated with those in the working set. Equation (18) is invalid if this assumption does not hold.

Second, it is assumed that the refinement has used a weight matrix tex2html_wrap_inline1450 which correctly reflects the experimental and model errors. Equation (21) is invalid if this assumption does not hold. When a diagonal weight matrix is used, as is almost always the case in practice, correlated errors in reciprocal space will not be correctly represented. In Appendix III it is shown that the use of weight matrices that do not correctly account for the errors will lead to test set residuals with larger variance-covariance matrices.
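The direction of this effect can be illustrated in a linear model. The sketch below is not a reproduction of the derivation in Appendix III. It assumes a hypothetical correlated error VCM `V` for the working set, an identity VCM for the test set, and hypothetical design matrices `A` and `B`. It compares refinement with the correct weights (the inverse of `V`) against refinement with diagonal weights that ignore the correlations, using standard error propagation for the parameter VCM.

```python
import numpy as np

rng = np.random.default_rng(2)
n, m, p = 30, 4, 6                        # illustrative sizes
A = rng.normal(size=(n, m))               # hypothetical working-set design matrix
B = rng.normal(size=(p, m))               # hypothetical test-set design matrix

L = 0.3 * rng.normal(size=(n, n))
V = L @ L.T + np.eye(n)                   # assumed true error VCM with correlations
V_t = np.eye(p)                           # assumed VCM of the excluded observations

def param_vcm(A, W, V):
    """VCM of least-squares parameters refined with weights W when the true error VCM is V."""
    H_inv = np.linalg.inv(A.T @ W @ A)
    G = H_inv @ A.T @ W                   # maps observation errors to parameter errors
    return G @ V @ G.T

W_correct = np.linalg.inv(V)              # weights that match the errors
W_diag = np.diag(1.0 / np.diag(V))        # diagonal weights that ignore the correlations

vcm_correct = V_t + B @ param_vcm(A, W_correct, V) @ B.T
vcm_diag = V_t + B @ param_vcm(A, W_diag, V) @ B.T

# The excess should be positive semi-definite: no direction has smaller variance.
excess = vcm_diag - vcm_correct
min_eig = np.linalg.eigvalsh(excess).min()

print(min_eig, np.trace(excess))
```

In this example the smallest eigenvalue of the excess is non-negative up to rounding, which is the sense in which the test set residuals from the misweighted refinement have a larger VCM.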

