The statistics of cross-validation residuals - Discussion
We have examined $R_{\mathrm{free}}$ ratios for crystal structures in the Protein Data Bank (Bernstein et al., 1977) as at 1 June 1997. Figure 1 shows a plot of the $R_{\mathrm{free}}$ ratio as a function of $N_a/f$, where $N_a$ is the number of atoms included in the refinement and f is the number of reflections used, for 725 macromolecular structures for which all these values are reported. The points are colour coded according to resolution range.
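We define the $R_{\mathrm{free}}$ ratio as $y = R_{\mathrm{free}}/R$ and $x = N_a/f$, the number of atoms per reflection. Values of y range from about 0.8 to 1.8. By substituting […] into equation (15), we have
$$y = \sqrt{\frac{1+ax}{1-ax}} \qquad (16)$$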
Figure 1 shows the curves corresponding to equation (16) for different values of a.
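As a numerical illustration, curves of the kind shown in Figure 1 can be tabulated in a few lines of Python. This is a minimal sketch rather than the authors' code: it takes equation (16) to have the form y = sqrt((1+ax)/(1-ax)), which is an assumption here, and the function name `expected_ratio` and the sample values of a and x are hypothetical.

```python
import math


def expected_ratio(x, a):
    """Expected R_free/R at x atoms per reflection for a given value of a.

    Assumes equation (16) reads y = sqrt((1 + a*x) / (1 - a*x)).
    """
    ax = a * x
    if not 0.0 <= ax < 1.0:
        # The assumed curve diverges as a*x approaches 1.
        raise ValueError("curve is only defined for 0 <= a*x < 1")
    return math.sqrt((1.0 + ax) / (1.0 - ax))


# Tabulate a few curves; these values of a are illustrative only.
for a in (1, 2, 4):
    row = [round(expected_ratio(x, a), 3) for x in (0.05, 0.10, 0.20)]
    print(a, row)
```

Under this form the curve starts at y = 1 for x = 0 and rises more steeply for larger a, which is why a larger number of parameters per atom corresponds to a higher curve.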
To make the comparison between experimental and theoretical values easier, a function of y was sought which is linear in x. By squaring and rearranging the terms in equation (16) we arrive at
$$z = ax$$
where
$$z = \frac{y^2 - 1}{y^2 + 1}.$$
Figure 2 shows a plot of z against x where the points are colour coded as in Figure 1. The coloured straight lines in Figure 2 are least-squares lines fitted to the data points in the particular resolution range represented by points of the same colour. For example, the pink triangular points represent data between 3 and 4 Å resolution and the pink line is the least-squares line through the pink triangular points. The pecked black lines emanating from the origin in Figure 2 are plots of z=ax for the same values of a as shown in Figure 1.
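The construction of Figure 2 can be sketched as follows. This is a hedged illustration, not the authors' procedure: it assumes z = (y^2 - 1)/(y^2 + 1), the transformation that turns the assumed form of equation (16) into z=ax, and it fits an ordinary least-squares line to synthetic points generated with a = 2 rather than to Protein Data Bank entries. The helper names `z_from_y` and `fit_line` are hypothetical.

```python
import math


def z_from_y(y):
    """Transform an R_free ratio y into z (assumed form, see lead-in)."""
    return (y * y - 1.0) / (y * y + 1.0)


def fit_line(xs, zs):
    """Ordinary least-squares fit of z = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_z = sum(zs) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxz = sum((x - mean_x) * (z - mean_z) for x, z in zip(xs, zs))
    slope = sxz / sxx
    return slope, mean_z - slope * mean_x


# Synthetic, noise-free points lying on the assumed curve for a = 2.
xs = [0.05, 0.10, 0.15, 0.20]
ys = [math.sqrt((1 + 2 * x) / (1 - 2 * x)) for x in xs]

slope, intercept = fit_line(xs, [z_from_y(y) for y in ys])
print(slope, intercept)
```

For noise-free input the fitted line recovers the slope a and passes through the origin; with real data one such fit would be made per resolution range.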
We requested information from some of the authors whose structures were outliers in Figure 2. It became apparent that very unusual $R_{\mathrm{free}}$ ratios are normally not the result of careful refinement protocols. The coloured lines were therefore plotted ignoring the points in the darker regions outside the sector bounded by the black lines of slopes 0.5 and 10. The choice of these slopes as cutoffs was somewhat arbitrary, but the removal of these outliers caused the coloured lines to pass nearer to the origin.
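The outlier rejection described above amounts to keeping only those points that lie between the lines z=0.5x and z=10x. A minimal sketch, assuming x > 0 and treating the boundaries as inclusive (the text does not say which convention was used); the function name `in_sector` and the sample points are invented for illustration.

```python
def in_sector(x, z, low=0.5, high=10.0):
    """True if (x, z) lies between the lines z = low*x and z = high*x."""
    return low * x <= z <= high * x


# Made-up (x, z) points, not taken from the Protein Data Bank.
points = [(0.10, 0.20), (0.10, 0.02), (0.02, 0.30), (0.15, 0.30)]

# Only the retained points would enter the least-squares fits.
kept = [p for p in points if in_sector(*p)]
print(kept)
```

Here the second point falls below the z=0.5x line and the third lies above the z=10x line, so both are excluded before fitting.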
The plots of z=ax represent refinement regimes with different numbers of parameters per atom. The gradients of the lines (a) increase with the number of parameters per atom. In the absence of relevant information in the Data Bank, it was assumed that in restrained refinements, […] and […]. These estimates ignore temperature factor restraints (if any) because our survey of the latter revealed widely different restraint protocols. Using these values, for restrained refinements […]
It can be seen that the z=2x line (isotropic temperature factors) passes through the constellations of orange crosses and pink triangular points, representing structures between 2.5 and 1.5 Å resolution, and is close to the green pecked line (2.5-2.0 Å data). Similarly, the z=x line (overall temperature factor) lies close to the pink line which is fitted to the 4 to 3 Å data. Even in the absence of details of restraint procedures, the z=ax lines can be seen to pass through areas of the plot where the particular refinement regime is most relevant. The large spread of values about the straight lines is unlikely to be solely a statistical effect and may well say something about the quality of the refinements.
Comparison of the lines z=2x and z=4x, which differ only in respect of restraints, shows how restraints lower the $R_{\mathrm{free}}$ ratio. Non-crystallographic symmetry (NCS, see Introduction) might give rise to lower than predicted $R_{\mathrm{free}}$ ratios. However, a check on structures in our plots which exhibit NCS did not reveal any obvious systematic effects.
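To see how a single structure would be placed relative to the z=ax lines, one can compute its implied slope z/x and compare it with the three lines discussed above. This is a sketch under stated assumptions: z is taken as (y^2 - 1)/(y^2 + 1) with y the ratio of $R_{\mathrm{free}}$ to R, the regime labels paraphrase the text, and the R factors, atom count and reflection count below are invented and do not describe any deposited structure. The names `implied_slope` and `nearest_regime` are hypothetical.

```python
# Slopes of the z = a*x lines discussed in the text, with paraphrased labels.
REGIMES = {
    1.0: "overall temperature factor",
    2.0: "isotropic temperature factors, restrained",
    4.0: "isotropic temperature factors, no restraints",
}


def implied_slope(r_work, r_free, n_atoms, n_reflections):
    """Slope z/x implied by one structure (assumed definition of z)."""
    y = r_free / r_work
    z = (y * y - 1.0) / (y * y + 1.0)
    x = n_atoms / n_reflections
    return z / x


def nearest_regime(a):
    """Return the tabulated slope closest to a, with its label."""
    key = min(REGIMES, key=lambda k: abs(k - a))
    return key, REGIMES[key]


# Invented example: R = 0.20, R_free = 0.26, 3000 atoms, 20000 reflections.
a = implied_slope(0.20, 0.26, 3000, 20000)
print(round(a, 2), nearest_regime(a))
```

Given the large spread of points about the lines noted above, such a comparison is only indicative and should not be read as a classification of the refinement protocol.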