

Introduction

One of the problems in macromolecular crystallography is that the crystallographer cannot always be sure that an apparently fully refined structure is free from large systematic errors. The agreement between the model of the molecular structure and the X-ray diffraction data from which it has been derived is measured by the crystallographic R-factor, but it is well known that structures with acceptable values of this parameter can have significant errors (Brändén & Jones, 1990; Kleywegt & Jones, 1995a). The R-factor is susceptible to manipulation, either by leaving out weak data or by overfitting the data with too many parameters, and so is not a completely reliable guide to accuracy. In small-molecule crystallography, where the number of X-ray intensity observations usually exceeds the number of parameters in the model by at least an order of magnitude, the R-factor is a more reliable guide to both accuracy and precision.

In 1992 Brünger introduced the idea of an $R_{\mathrm{free}}$ (Brünger, 1992, 1993), based on the standard statistical modelling technique of jack-knifing or cross-validatory residuals (McCullagh & Nelder, 1983). The $R_{\mathrm{free}}$ is the same as the conventional R-factor, but based on a test set consisting of a small percentage (usually about 5-10%) of reflections excluded from a structure refinement. The remaining reflections included in the refinement are known as the working set. The $R_{\mathrm{free}}$ value, unlike the R-factor, cannot be driven down by refining a false model because the reflections on which it is based are excluded from this process. $R_{\mathrm{free}}$ is expected to decrease only during the course of a successful refinement. Consequently, a high value of this statistic and a concomitant low value of R may indicate an inaccurate model. The procedure assumes that the reflections removed for the cross-validation test have been randomly selected and have errors uncorrelated with those that remain in the set used in the refinement. This assumption may be partly invalidated by the presence of non-crystallographic symmetry. Ideally, the refinement should be repeated several times, removing a non-overlapping set of reflections each time.
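The partition into working and test sets can be sketched in a few lines of Python. This is a minimal illustration only, not the procedure of any particular refinement program: the function names (`r_factor`, `split_reflections`, `non_overlapping_test_sets`, `r_work_and_free`) are hypothetical, the residual is assumed to take the usual form sum(||Fo| - |Fc||) / sum(|Fo|), and reflections are treated simply as indexed amplitudes with no account taken of symmetry or resolution shells.

```python
import random


def r_factor(f_obs, f_calc):
    """Residual sum(| |Fo| - |Fc| |) / sum(|Fo|) over the given reflections."""
    numerator = sum(abs(abs(o) - abs(c)) for o, c in zip(f_obs, f_calc))
    denominator = sum(abs(o) for o in f_obs)
    return numerator / denominator


def split_reflections(n_reflections, test_fraction=0.05, seed=0):
    """Randomly assign reflection indices to a working set and a test set."""
    rng = random.Random(seed)
    indices = list(range(n_reflections))
    rng.shuffle(indices)
    n_test = max(1, round(test_fraction * n_reflections))
    test = sorted(indices[:n_test])
    work = sorted(indices[n_test:])
    return work, test


def non_overlapping_test_sets(n_reflections, n_sets, seed=0):
    """Partition all reflection indices into n_sets disjoint test sets,
    one for each repeat of the refinement."""
    rng = random.Random(seed)
    indices = list(range(n_reflections))
    rng.shuffle(indices)
    return [sorted(indices[k::n_sets]) for k in range(n_sets)]


def r_work_and_free(f_obs, f_calc, work, test):
    """R over the working set and R_free over the test set.

    f_calc is assumed to come from a model refined against the
    working-set reflections only (hypothetical input, not computed here).
    """
    r_work = r_factor([f_obs[i] for i in work], [f_calc[i] for i in work])
    r_free = r_factor([f_obs[i] for i in test], [f_calc[i] for i in test])
    return r_work, r_free
```

The only difference between the two residuals is the subset of reflections summed over; the test-set reflections must never enter the refinement target, otherwise the statistic loses its cross-validatory meaning.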

The $R_{\mathrm{free}}$ is highly correlated with the phase accuracy of the atomic model (Brünger, 1992, 1993) and can detect various types of errors in the structure, including phase errors and partial mistracing of the structure. It has also been used in evaluating different refinement protocols, such as the optimization of the weights used during refinement. It is particularly useful in preventing the overfitting of data (Kleywegt & Brünger, 1996).

Kleywegt & Jones (1995a, b) have shown that with low-resolution data it is possible to completely mistrace a structure, deliberately tracing it backwards through the density, and still achieve an acceptable R-factor. The $R_{\mathrm{free}}$, on the other hand, could not be duped so easily, and remained at a high value, close to that expected for a random set of scatterers, throughout the refinement.

$R_{\mathrm{free}}$ is thus a valuable guide to the progress of refinement, particularly for low-resolution data, and its use and publication are widely encouraged. A recent review (Kleywegt & Brünger, 1996) indicated that the measure is becoming more widespread, being reported in 44% of articles describing macromolecular X-ray structures.

However, the usefulness of $R_{\mathrm{free}}$ is limited by the fact that what is an "acceptable" value is often not evident. One would expect $R_{\mathrm{free}}$ to always be higher than R even when there are no systematic errors in the model structure, but it is not clear how much higher it should be. At present we merely have a number of rules of thumb (Kleywegt & Brünger, 1996).

Cruickshank has estimated that the expected value of the free R-factor (EFRF) is given by

$$\mathrm{EFRF} = \left( \frac{N_{\mathrm{obs}}}{N_{\mathrm{obs}} - N_{\mathrm{par}}} \right)^{1/2} R$$

where $N_{\mathrm{obs}}$ is the number of observations, $N_{\mathrm{par}}$ is the number of parameters, and R is the conventional R-factor (Dodson, Kleywegt & Wilson, 1996). Bacchi, Lamzin & Wilson (1996) use this expression in an extension of the self-validation Hamilton test to assess the significance of any observed drop in $R_{\mathrm{free}}$ during refinement.
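As a numerical illustration, the sketch below evaluates this estimate under the assumption that it takes the commonly quoted form EFRF = R [N_obs / (N_obs - N_par)]^(1/2). The function name `expected_free_r` and its argument names are hypothetical, and the form of the expression is an assumption made for the example rather than a result established here.

```python
import math


def expected_free_r(r_work, n_obs, n_par):
    """Expected free R-factor under the ASSUMED form
    EFRF = R * sqrt(N_obs / (N_obs - N_par)).

    r_work : conventional R-factor
    n_obs  : number of observations
    n_par  : number of model parameters
    """
    if n_obs <= n_par:
        # The estimate is undefined when the model has as many
        # parameters as there are observations, or more.
        raise ValueError("n_obs must exceed n_par")
    return r_work * math.sqrt(n_obs / (n_obs - n_par))
```

Under this assumed form the estimate approaches R when observations greatly outnumber parameters, as in small-molecule work, and grows without bound as the number of parameters approaches the number of observations. For instance, with 10000 observations and 7500 parameters the multiplier is the square root of 4, so the estimate is twice R.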

The need for more understanding of the behaviour of $R_{\mathrm{free}}$ was highlighted by Dodson, Kleywegt & Wilson (1996). In spite of the enthusiasm for its use, actual applications of $R_{\mathrm{free}}$ have remained somewhat subjective without an understanding of its statistical basis. For example, if non-crystallographic symmetry (NCS) constraints are relaxed during a structure refinement, how much should $R_{\mathrm{free}}$ rise during subsequent refinement if the restrained model is correct? Without understanding how $R_{\mathrm{free}}$ varies as a function of the number of restraints and/or number of parameters it is only possible to make rather subjective judgements.

This paper begins to answer these questions by deriving the expected value of the free residual, from which estimates of both $R_{\mathrm{free}}$ and the ratio of $R_{\mathrm{free}}$ to R are calculated.

