PUBLICATIONS
(1) Validation of protein models derived from experiment. Laskowski RA, MacArthur MW, Thornton JM. Current Opinion in Structural Biology, 1998, vol.8, no.5, pp.631-639
(2) Who checks the checkers? Four validation tools applied to eight atomic resolution structures. Wilson KS, Butterworth S, Dauter Z, Lamzin VS, Walsh M, Wodak S, Pontius J, Richelle J, Vaguine A, Sander C, Hooft RWW, Vriend G, Thornton JM, Laskowski RA, MacArthur MW, Dodson EJ, Murshudov G, Oldfield TJ, Kaptein R, Rullmann JAC. Journal of Molecular Biology, 1998, vol.276, no.2, pp.417-436
(3) Assessment of comparative modeling in CASP2. Martin ACR, MacArthur MW, Thornton JM. Proteins: Structure Function and Genetics, 1997, no.s1, pp.14-28
(4) Structures of N termini of helices in proteins. Doig AJ, MacArthur MW, Stapley BJ, Thornton JM. Protein Science, 1997, vol.6, no.1, pp.147-155
(5) AQUA and PROCHECK NMR: Programs for checking the quality of protein structures solved by NMR. Laskowski RA, Rullmann JAC, MacArthur MW, Kaptein R, Thornton JM. Journal of Biomolecular NMR, 1996, vol.8, no.4, pp.477-486
(6) Deviations from planarity of the peptide bond in peptides and proteins. MacArthur MW, Thornton JM. Journal of Molecular Biology, 1996, vol.264, no.5, pp.1180-1195
(7) Analysis of main chain torsion angles in proteins: Prediction of NMR coupling constants for native and random coil conformations. Smith LJ, Bolin KA, Schwalbe H, MacArthur MW, Thornton JM, Dobson CM. Journal of Molecular Biology, 1996, vol.255, no.3, pp.494-506
(8) Intrinsic phi/psi propensities of amino acids, derived from the coil regions of known structures. Swindells MB, MacArthur MW, Thornton JM. Nature Structural Biology, 1995, vol.2, no.7, pp.596-603
(9) Protein folds: towards understanding folding from inspection of native structures. Thornton JM, Jones DT, MacArthur MW, Orengo CM, Swindells MB. Philosophical Transactions of the Royal Society of London Series B Biological Sciences, 1995, vol.348, no.1323, pp.71-79
(10) Knowledge based validation of protein structure coordinates derived by X-ray crystallography and NMR spectroscopy. MacArthur MW, Laskowski RA, Thornton JM. Current Opinion in Structural Biology, 1994, vol.4, no.5, pp.731-737
(11) NMR and crystallography: complementary approaches to structure determination. MacArthur MW, Driscoll PC, Thornton JM. Trends in Biotechnology, 1994, vol.12, no.5, pp.149-153
(12) Conformation analysis of protein structures derived from NMR data. MacArthur MW, Thornton JM. Proteins: Structure Function and Genetics, 1993, vol.17, no.3, pp.232-251
(13) Protein structures and complexes: what they reveal about the interactions that stabilize them. Thornton JM, MacArthur MW, McDonald IK, Jones DT, Mitchell JBO, Nandi CL, Price SL, Zvelebil MJJM. Philosophical Transactions of the Royal Society of London Series A Mathematical Physical and Engineering Sciences, 1993, vol.345, no.1674, pp.113-129
(14) PROCHECK: a program to check the stereochemical quality of protein structures. Laskowski RA, MacArthur MW, Moss DS, Thornton JM. Journal of Applied Crystallography, 1993, vol.26, no.pt2, pp.283-291
(15) Stereochemical quality of protein structure coordinates. Morris AL, MacArthur MW, Hutchinson EG, Thornton JM. Proteins: Structure Function and Genetics, 1992, vol.12, no.4, pp.345-364
(16) Influence of proline residues on protein conformation. MacArthur MW, Thornton JM. Journal of Molecular Biology, 1991, vol.218, no.2, pp.397-412
(17) Protein side chain conformation: A systematic variation of chi1 with resolution: a consequence of multiple rotameric states? MacArthur MW, Thornton JM. Acta Cryst, 1999, vol.D55, pp.994-1004
PROCHECK: Parameter set update
1. INTRODUCTION
It is now some seven years since the PROCHECK program for validating the geometry of protein structures (Laskowski et al., 1993) was developed. During that period a number of improvements and additions have been made to it. Among the major developments was the introduction of the G-factors quantifying the plausibility of certain geometrical parameters such as the mainchain and sidechain torsion angles. A modified version was developed to handle the special case of the ensembles of models resulting from structures solved by NMR. As part of a collaborative venture, this extension of the program, named PROCHECK-NMR, incorporated the output from Ton Rullmann's AQUA program (Laskowski et al., 1996; see also MacArthur & Thornton, 1993) for NMR restraints analysis, which had been developed in parallel. However, the structural parameters and target values which were used in the original version have remained unchanged from the time of the initial release. These underlying statistics were derived from data based on the October 1990 release of the Brookhaven Protein Data Bank (PDB). The work which led to the development of the PROCHECK suite of programs was described in the paper of Morris and co-workers in 1992. In the intervening period the number of protein structure coordinate sets in the PDB has multiplied several fold, from a total of 463 entries in October 1990 to well over 10,000 at present. In the light of this much larger body of structural information now at our disposal, which would provide us with a more robust set of statistics, it was deemed appropriate to conduct a re-appraisal of the basic assumptions. In addition, experience over time in the use of the program, together with feedback from users and additional related studies (MacArthur & Thornton, 1996) in our own laboratory, had indicated that such a re-examination would be timely.
For example, recent studies on protein structures determined to atomic resolution at the DESY synchrotron radiation facility at Hamburg (Wilson et al., 1998), together with analyses of the stereochemistry and peptide bond geometry of small peptides in the Cambridge Structural Database of small molecules, had hinted that some fine tuning might be in order. Observations from the enlarged dataset of structures now available at high resolution (<=1.5A) had further reinforced this feeling. The boundaries which defined the CORE regions in the PROCHECK Ramachandran map, for instance, appeared to call for some slight adjustments. In particular, repeated observations from high resolution structures had indicated that the bottom row of 10x10 degree pixels in the alpha region was rarely populated to any significant extent. Similar considerations applied to the other CORE regions, and it was felt that a re-definition of all boundaries should be undertaken.
It will be recalled that when the boundaries were defined originally (Morris et al., 1992), the entire set of phi/psi values (exclusive of those for PRO and GLY) from the 463 protein crystal structure coordinate sets was used. The structures ranged in resolution from 1.0A to 3.5A and all chains were taken. Some structures that were grossly in error would thus have been included and made their contribution to the final outcome. Also, inevitably, biases arising from the large number of near identical and homologous entries which had been determined using molecular replacement could not be discounted. Use of a non-redundant representative set would certainly have been preferable but would have resulted in much weakened statistics, especially in the less populated regions of phi/psi space. Even with the much larger volume of data that we have now, it is not always easy to decide where to draw the border which defines the disallowed region. In addition, at that time, programs were not available for automatic provision of non-redundant sets in a systematic way at a chosen level of identity. Using the full dataset in its entirety was therefore considered to be the lesser of two evils.
The decision as to where to draw the boundaries is by its nature an arbitrary one, and ultimately a subjective one determined by whether the boundaries sensibly conform to what experience has taught us to expect as reasonable. This does not matter any more than does the definition of any arbitrary unit of measure, since all structures thereafter are judged with reference to the given criterion. In Morris et al. (1992) the boundaries were defined according to the actual number of observations in the 10x10 degree pixels. For example, the CORE areas (alpha, beta, alpha-L) comprised all pixels containing 100 or more observations, and similarly for the other regions. This of course meant that the definitions were completely dataset-specific. Preferably, this should not be so.
We have always felt that the PROCHECK evaluation of non-bonded bad contacts was another area which needed improvement. The assessment curve used in the program was derived at a time when it was not possible to unambiguously distinguish between genuine steric clashes and cases where sidechain/sidechain or sidechain/mainchain hydrogen bonding may have played a part. With the program HBPLUS (McDonald & Thornton, 1994) all hydrogen bonds can be explicitly filtered out. The listing of bad contacts which up to now would almost certainly have contained some hydrogen bonded pairs will now report only genuine steric clashes if HBPLUS is incorporated. In practice of course, given a good electron density map, the crystallographer would in general have been able quite readily to pick out the hydrogen bonded pairs, and so eliminate them as the bad contacts spuriously claimed by PROCHECK.
In what follows, we describe the results of repeating the analysis of Morris et al. (1992) using the much enlarged dataset which is currently available. Any modifications or additions to the original procedures will be reported and the rationale explained. In addition, all new observations and insights will be recorded and the possible implications discussed.
2. DATA and METHODS
The protein structure coordinates were taken from the September 1996 release of the PDB, from which a representative dataset at the 95% identity level was drawn using the method of Orengo & Taylor (1996). Only crystal structures of resolution up to and including 3.0A were considered. This gave a final working dataset of 1128 chains. All derived data and statistics were obtained using in-house programs or commercially available statistical software packages. The results are reported under headings which correspond to the various criteria as presented in the PROCHECK graphical output, in what might be taken to be descending order of importance.
3. RESULTS
3.1. Ramachandran Plot Assessment
The distribution of phi/psi values as presented on the Ramachandran map is probably the most frequently used measure of the reliability of the experimentally derived model of a protein molecule. A model under test is assessed in PROCHECK by comparing the percentage of observations in the most highly favoured regions of phi/psi space (so-called CORE regions) for the model to a standard plot of percentage-in-CORE versus resolution derived from a dataset of previously solved protein structures (which is assumed to reflect the characteristics of all proteins in existence). Other regions representing bands of decreasing population density around the central CORE areas are also defined in the current version of the PROCHECK program. It had been suggested that a simplified map of perhaps two regions (CORE and OUTSIDE, say) would serve the purpose equally well, but after due consideration we decided to retain the existing layout. In any event, it would be quite easy to provide this option if a strong demand for it became apparent.
As was done before, the area representing phi/psi space was divided into 1296 (36x36) 10x10 degree pixels and the percentage population density in each one was calculated. A whole series of contour maps using different subsets (<=1.5A, <=2.0A, <=2.5A and <=3.0A) and different contour intervals were produced in order to provide a feel for the most sensible basis on which to construct the regional boundaries. Such a contour map for the 2A subset with intervals at 0.0125, 0.025, 0.05, 0.1, 0.2, 0.4, 0.8, 1.6, 3.2 and 6.4 is shown in figure 1 below.
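The pixel-based density calculation described above can be sketched as follows. This is a minimal illustration and not the in-house code used for the analysis: the function name `phi_psi_density` and the convention that pixel 0 starts at -180 degrees are assumptions made for the example.

```python
def phi_psi_density(angles):
    """Return a 36x36 grid of percentage population density.

    angles: iterable of (phi, psi) pairs in degrees, each in [-180, 180].
    grid[i][j] is the percentage of observations whose phi falls in
    pixel i and psi in pixel j (assumed convention: pixel 0 starts at -180).
    """
    n_bins = 36
    counts = [[0] * n_bins for _ in range(n_bins)]
    total = 0
    for phi, psi in angles:
        # Map [-180, 180) onto pixel indices 0..35; +180 wraps to pixel 0.
        i = int((phi + 180.0) // 10) % n_bins
        j = int((psi + 180.0) // 10) % n_bins
        counts[i][j] += 1
        total += 1
    if total == 0:
        raise ValueError("no phi/psi observations supplied")
    return [[100.0 * c / total for c in row] for row in counts]
```

Because the grid holds percentages rather than raw counts, contour levels chosen on it do not depend directly on the size of the dataset.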
This reflects very closely the outlines of the schematic map that was finally chosen. It is also reassuringly close to that illustrated in Kleywegt & Jones (1996 and see also
A desirable requirement of the evaluation curve used by PROCHECK is that it have a pronounced negative gradient, in order to provide a more strongly discriminating measure of quality. Ideally, one should choose the conditions which produce the largest possible difference between the %-in-CORE at 1.0A resolution and %-in-CORE at 3.0A resolution. The slope of the curve will depend on the shape and extent of the CORE regions. If too large an area is chosen, the resulting curve will tend to flatness; restricting the area excessively produces no further improvement and becomes unrealistic. At an early stage in the proceedings, when we were thinking of using a simplified map layout of only two zones (CORE and OUTSIDE), we considered adopting the CORE regions defined by Kleywegt & Jones (1996). However, when this map is used the evaluation curve produced is a rather flat one, with only about 11 percentage points separating the best from the worst (figure 2a, below).
This is due to the large size of the area defined as CORE. This compares with a separation of 17 percentage points for the CORE regions on the map that is used in the current version of PROCHECK (figure 2b).
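The "separation" quoted for each candidate map can be illustrated with a short sketch. It assumes, purely for illustration, that the evaluation curve is a straight least-squares line through (resolution, %-in-CORE) points; the text does not specify the functional form that was fitted, and the function names are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two paired observations")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all resolutions are identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def core_separation(resolutions, pct_in_core, high=1.0, low=3.0):
    """Fitted %-in-CORE at `high` resolution minus that at `low` resolution."""
    slope, intercept = fit_line(resolutions, pct_in_core)
    return (slope * high + intercept) - (slope * low + intercept)
```

A larger value returned by `core_separation` corresponds to a steeper, more discriminating curve for a given choice of CORE regions.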
After repeated experimentation we finally decided that the contour joining the pixels on the 1.5A map at 0.2% population density produced the best result. This gives a much steeper evaluation curve (figure 2c) with a separation of 27 percentage points between high and low resolution. Furthermore, the contour at this level describes a smooth continuous envelope around the CORE regions with neither breaks in the contour nor isolated peaks of high density in the ALLOWED regions nor holes of low density in the CORE areas. The corresponding contours on the <=2.0A, <=2.5A and <=3.0A maps define remarkably similar areas in both shape and size. We are therefore satisfied that this produces a result that is not dataset-dependent. In a similar fashion the other boundaries were also defined, but because of the paucity of observations outside the CORE regions in the 1.5A map the 2.0A map was used. The ALLOWED region was found to be best defined by the contour joining the pixels having 0.01% population density, where once again the envelope was clear-cut and unambiguous and similar to that obtained from the 2.5A and 3.0A plots. The boundaries of the DISALLOWED region presented more of a problem. While the most appropriate contour to choose was quite clearly the one joining pixels of 0.001% population density, it showed several discontinuities and had many small isolated islands of density within the DISALLOWED area. Many of these single pixel anomalies, representing perhaps just one observation, could justifiably be dismissed as genuine outliers or errors. Nevertheless, the decision as to which points to join for defining the boundary in some areas remained subjective. The problem persists even in the 2.5A and 3.0A maps despite the higher absolute number of observations because of the difficulty in distinguishing between genuine values and the increasing number of errors at the lower resolutions.
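As a rough sketch of how these contour levels translate into a pixel classification, the function below applies the three thresholds directly to a pixel's percentage density in the 1.5A and 2.0A maps. It is only an approximation of the procedure described: raw thresholding does not reproduce the hand-drawn smoothing of discontinuities and isolated islands, and the label GENEROUS for the band between the ALLOWED and DISALLOWED regions is a name chosen for this example.

```python
# Contour levels quoted in the text, in percent population density.
CORE_MIN = 0.2        # applied to the <=1.5A map
ALLOWED_MIN = 0.01    # applied to the <=2.0A map
GENEROUS_MIN = 0.001  # applied to the <=2.0A map


def classify_pixel(density_15, density_20):
    """Assign one 10x10 degree pixel to a region by simple thresholding.

    density_15: percentage density of the pixel in the <=1.5A map.
    density_20: percentage density of the same pixel in the <=2.0A map.
    """
    if density_15 >= CORE_MIN:
        return "CORE"
    if density_20 >= ALLOWED_MIN:
        return "ALLOWED"
    if density_20 >= GENEROUS_MIN:
        return "GENEROUS"
    return "DISALLOWED"
```

In practice the single-pixel anomalies mentioned above would still need to be reviewed by eye after such an automatic pass.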
It might be more appropriate to call the region DISFAVOURED rather than DISALLOWED, as it would seem that the region of phi/psi space which is ABSOLUTELY disallowed due to unavoidable interpenetrating of spatially neighbouring atoms is a much more restricted one. There might be a case for showing the DISALLOWED border as a fuzzy one using progressively varying shading between the two zones. Although PROCHECK lays emphasis on %-in-CORE for judging the quality, many crystallographers are frequently concerned with residues that fall in the DISALLOWED region. These outliers which have been confirmed as representing a genuine configuration after the most stringent process of validation, very often fall just outside the generously allowed zone.
The complete final revised map is shown in figure 3 below right with the old map (left) shown beside it for comparison.
The trimming of the CORE regions is reflected in their smaller % area compared to the old map (8.1% in the new as against 11.0% in the old). The similarity of the schematic pixel-based drawing to the contour map of figure 1 is apparent. When one examines the evaluation curve produced by the new map using the new dataset (figure 4 below) it is seen to be surprisingly similar to that which was produced from the old map with the old dataset of 463 proteins as used in the current version of the program. It will be noted that at the highest resolution, structures can still be expected to have over 90% of their residues within the CORE regions despite their reduced areas. This is because the trimmed areas were invariably very thinly populated in high resolution structures.
3.2. Bad Contacts
From the beginning we were not fully satisfied with the curve used in the PROCHECK assessment of bad contacts. In the early stages of developing the program, we still had not devised a means of explicitly filtering out with complete confidence all hydrogen bonds in a structure prior to its examination for implausibly short interatomic distances. Instead, the program relied on a blanket elimination of potential hydrogen bonded atom pairs based solely on a distance criterion. In addition, not all potential hydrogen bonding atoms were entered into the filtering process; ambiguous donor/acceptors such as the nitrogens of the histidine side chain, for example, were omitted. In most cases the existence of a hydrogen bond would be apparent to the crystallographer from an examination of the electron density map.
The solution to this problem was provided by the development of the program HBPLUS (McDonald & Thornton, 1994). When this program is incorporated, the PROCHECK evaluation curve has a much shallower gradient (figure 5 below) and is therefore, unfortunately, less discriminating in distinguishing between good and bad structures. Most protein structures being determined nowadays are generally expected to be free from seriously bad steric clashes, and since the new dataset must contain a relatively large number of recently solved structures this is possibly another (or perhaps the main) reason for the reduced slope. It may also have been influenced by the omission of structures beyond 3.0A resolution. Perhaps more serious than atomic overlap, which is readily detectable, is the presence of spurious voids and excessive looseness of packing. This is addressed by the program PROVE (Pontius et al., 1996).
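The filtering idea can be sketched as below: short non-bonded contacts are listed only if the pair has not already been identified as hydrogen bonded. Everything here is hypothetical scaffolding for illustration. The function name, the data layout and the default cutoff of 2.6A are assumptions, and the set of hydrogen bonded pairs is taken as given (for instance, prepared beforehand from HBPLUS output, whose parsing is not shown).

```python
import math


def steric_clashes(coords, candidate_pairs, hbond_pairs, cutoff=2.6):
    """List non-bonded atom pairs closer than `cutoff` that are not H-bonded.

    coords: dict mapping an atom identifier to its (x, y, z) position.
    candidate_pairs: iterable of (atom_a, atom_b) non-bonded pairs to test.
    hbond_pairs: set of frozenset({atom_a, atom_b}) already assigned as
        hydrogen bonds (assumed to be supplied by a separate program).
    cutoff: distance threshold in Angstroms (illustrative default only).
    """
    clashes = []
    for a, b in candidate_pairs:
        if frozenset((a, b)) in hbond_pairs:
            continue  # a genuine hydrogen bond, not a bad contact
        if math.dist(coords[a], coords[b]) <= cutoff:
            clashes.append((a, b))
    return clashes
```

Using `frozenset` keys means a hydrogen bond is recognised regardless of the order in which the two atoms are listed.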
3.3. Omega Angles
The application of artificial restraints to the peptide bond omega angle in the course of model refinement results in the absence of any correlation of its variance with resolution. What weights to give the applied restraints that define the possible range of omega variation has been an open question. Thinking on the subject has over the years been influenced by the earlier strongly held belief in the planarity of the peptide bond. It has however gradually become apparent that considerable variability is the norm (MacArthur & Thornton 1996). While the current version of PROCHECK suggests a target for the standard deviation of 4.7 degrees about a mean value of 179.6 degrees, the above work on the study of the peptide bond in small peptides would indicate that this range is too restrictive. The conclusion is further reinforced by the results from the detailed studies on structures determined to atomic resolution mentioned earlier. Both of these studies suggest that a standard deviation of up to 6 degrees might be more realistic.
Since the consensus among crystallographers continues to remain conservative, it is perhaps not surprising that the new enlarged dataset still produces a standard deviation not much different from the old (figure 6 below). For the high resolution subset of 69 chains <= 1.5A resolution it is 4.1 degrees about a mean value of 179.5 degrees. For individual proteins within the set the standard deviation ranges from 1.4 degrees minimum to 8.4 degrees maximum, and the mean from 178.3 degrees to 180.7 degrees. We feel, however, that greater recognition should be given to the evidence from the work on small peptides and atomic resolution structures where artificially applied restraints were minimal. It is proposed therefore that a standard deviation of 6.0 degrees be tolerated. The occurrence of the occasional outlier within the range 160-200 degrees would not be inconsistent with this value.
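One practical point when reproducing such statistics is that omega values for trans peptides straddle the +/-180 degree wrap-around, so angles reported as, say, -178 degrees must be shifted onto a 0-360 scale before averaging. A minimal sketch follows; the function name is an assumption, and cis peptides (omega near 0 degrees) are assumed to have been excluded beforehand.

```python
import math


def omega_stats(omegas):
    """Mean and sample standard deviation of trans omega angles in degrees.

    Angles are first mapped onto [0, 360) so that values either side of
    180 degrees (e.g. 178 and -178) are treated as neighbours.
    """
    shifted = [w % 360.0 for w in omegas]
    n = len(shifted)
    if n < 2:
        raise ValueError("need at least two omega angles")
    mean = sum(shifted) / n
    variance = sum((w - mean) ** 2 for w in shifted) / (n - 1)
    return mean, math.sqrt(variance)
```

The standard deviation returned for a structure can then be compared with whichever target is in use, for example the 6.0 degrees proposed here.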
3.4. Chi1 Angles
A comparison of the old and new evaluation curves for chi1 angles is shown in figures 7a-d. While they differ only slightly, it is noticeable that once again, consistently lower slopes are observed for all three rotamers in the regression curves of standard deviation versus resolution. This may be due partly to the omission of the lower resolution data from 3.0 to 3.5 Angstroms, or perhaps improved refinement techniques are producing a tighter clustering at all levels of resolution. One might therefore have expected a narrowing of the shaded bands that represent the standard deviation of the scatter of individual proteins about the regression line. This does occur, but is negligible, and is barely apparent from a visual examination.
3.4.1 Variation of chi1 mean values with resolution
In the course of our investigations of the chi1 angle attributes we made the surprising observation that for all three rotamers the mean value of the angle varied systematically with resolution in a highly correlated manner. This is illustrated in figure 8 below for the chi1 angles of the gauche minus rotamer.
3.5. Backbone Hydrogen Bonds
The regression line for the variation of the standard deviation in mainchain/mainchain hydrogen bond energies versus resolution has undergone the largest change in character with the new enlarged dataset (figure 9 below).
The slope is much shallower and one wonders whether it is worthwhile continuing to include it as an indicator for quality assessment, as it varies so little with change in resolution. Apart from the reasons advanced earlier as an explanation for the flattening of the other plots, it is not obvious why the difference here should be so much more pronounced. Of course we are looking at changes in energy which are tiny to begin with. The hydrogen bonds involved are almost entirely those associated with the standard secondary structure elements and reverse turns. If, as suggested above, the flattening is partly due to a general overall improvement in the final model, resulting from better data collection and refinement procedures, then this would be even more strongly apparent in the regions of regular secondary structure, which is consistent with the tighter clustering observed in the favoured regions of the Ramachandran plot.
3.6. Chi2 trans Angles
There is really not much to comment on here (figure 10 below) except to repeat the observation already made in connection with the regression lines of standard deviation versus resolution for the chi1 rotamers.
In the case of the chi2 angles (trans only and excluding those with planar geometry at the C-gamma) there is a reduction in the variance at all levels of resolution compared to the old dataset.
3.7. C-alpha Chirality
This is another case where the geometry is very much restrained during refinement. Thus, the scatter in the values of the zeta angle Cα-N-C′-Cβ is not expected to vary in any systematic way with resolution. And in fact it doesn't. With the new dataset the variance seems to be even less. The high resolution subset (<=1.5A, <=0.20 R-factor) shows an increase in the mean value and a decrease in the standard deviation compared to our original target values (34.3+/-2.6 for the new; 33.9+/-3.5 for the old; see Table 1 and the original Morris et al. paper). While the true value for a perfect tetrahedron is 35.3°, an examination of symmetrical molecules such as CF4, CCl4 and CBr4 in the Cambridge Structural Database (CSD) of small molecules (Allen et al., 1983) reveals a surprising degree of variation. For the 24 examples available the mean value is 35.4° with a standard deviation of 1.15° over a range of 33.2° to 38.1°. Of course in proteins, the tetrahedron defined by the chiral Cα is not a perfect one. It is instructive therefore to see how the values vary in crystal structures of the individual amino acids in the CSD. Of the 20 standard amino acids, only Ala, Leu, Tyr, Trp, Pro, Lys, His, Asp and Glu are represented by their L-isomers in crystals refined to below 10% R-factor. For an overall total of 20 observations the mean value is 34.22° with a standard deviation of 0.96°, ranging from 33.1° to 37.3°.
Table 2 gives the mean values by residue type in the dataset of 69 chains (<= 1.50Å). Although some individual angles even in this "high quality" subset are still quite clearly in error, the mean values and standard deviations are reasonably uniform throughout. Proline, perhaps not unexpectedly, is an exception. The strain in the five-membered pyrrolidine ring is evidently such as to produce a distortion at the Cα atom sufficient to produce a deviation from the mean zeta value of about 3°.
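The zeta angle is a virtual torsion angle defined by the four atoms Cα, N, C′ and Cβ taken in that order, and the value of 35.3° for a perfect tetrahedron can be checked numerically. The sketch below uses a standard torsion-angle formula with idealised coordinates; the helper names are chosen for this example, and the sign of the result depends on the handedness of the centre.

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def torsion(p0, p1, p2, p3):
    """Signed torsion angle in degrees defined by four points."""
    b1, b2, b3 = _sub(p1, p0), _sub(p2, p1), _sub(p3, p2)
    n1, n2 = _cross(b1, b2), _cross(b2, b3)
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))


def zeta(ca, n, c, cb):
    """Virtual torsion angle CA-N-C'-CB used as a chirality measure."""
    return torsion(ca, n, c, cb)


# Idealised tetrahedral centre: CA at the origin, substituents at
# alternate corners of a cube.
ideal = zeta((0, 0, 0), (1, 1, 1), (1, -1, -1), (-1, 1, -1))
```

For these idealised coordinates the magnitude of `ideal` is about 35.26°, in agreement with the tetrahedral value quoted above.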
3.8. Disulphide Bonds and Chi3 Angles
Disulphide bond lengths remain unchanged at almost exactly 2.0A. In the new dataset of 1128 chains, however, there are sufficient disulphide bonds to provide us with a more robust set of statistics, and when the standard deviations of Left Handed and Right Handed chi3 torsion angles are plotted against resolution we observe a systematic variation, as is seen in the other unrestrained attributes. Figure 11 below shows a plot of the pooled standard deviation against resolution for Left Handed and Right Handed angles. There is a hint of levelling off at high resolution.
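The text does not state how the Left Handed and Right Handed standard deviations were pooled. One common definition, shown here only as an assumption, weights each group's variance by its degrees of freedom:

```python
import math


def pooled_sd(groups):
    """Pooled standard deviation from (n, sd) pairs.

    Each group's variance is weighted by its degrees of freedom (n - 1).
    This is the textbook definition and is assumed, not confirmed, to
    match the pooling used for the chi3 plot.
    """
    weighted = 0.0
    dof = 0
    for n, sd in groups:
        if n < 2:
            raise ValueError("each group needs at least two observations")
        weighted += (n - 1) * sd ** 2
        dof += n - 1
    if dof == 0:
        raise ValueError("no groups supplied")
    return math.sqrt(weighted / dof)
```

For example, `pooled_sd([(n_left, sd_left), (n_right, sd_right)])` would combine the two chi3 conformations within one resolution band, where the four values are whatever was measured for that band.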
3.9. Target Values
These have been derived from the subset of 69 chains <=1.5A, <=0.20 R-factor, and include the most recent atomic resolution structures. Where the results from the analysis of the data from small peptides and atomic resolution structures appear to be at odds with the values from artificially restrained elements of geometry in proteins (for example, omega angles), we take the view that the latter might benefit from some relaxation. All target values are given in Table 1 (for comparison, see the original Morris et al. paper).
3.10 Conclusion
The dataset from which we derived the original parameters in 1990 consisted of only 463 entries in total. The entire dataset was used in the analysis, and all chains were included. The aim of the present work is to report the new set of empirical geometric and conformational parameters, utilising the greatly enlarged database of protein structures now at our disposal. When these newly derived parameter values are incorporated into PROCHECK we observe systematic changes of varying degree to all the criteria used to evaluate the different attributes. None of them, however, is so dramatic as to require any major overhaul of the underlying conceptual framework, and in every instance the changes have been in the expected direction. Overall, distributions tend to be tighter and regression lines tend to be flatter. This is likely to be a consequence of the very greatly reduced number, in relative terms, of unrefined and poorly determined structures. Models which are grossly in error are now routinely filtered out before they can be allowed entry into the Protein Data Bank. Improved techniques and greater vigilance in checking have contributed to an increasingly reliable resource.