Cross-validation: The Free $R$ Value

X-PLOR supports cross-validation in reciprocal space, as described by Brünger (1992, 1993).

The most common measure of the quality of a crystal structure is the $R$ value (Eq. 13.2). $R$ is closely related to the crystallographic residual (cf. Eq. 13.1)

\begin{displaymath}
R' = \sum_{h,k,l}
(\vert F_{obs}(h,k,l)\vert- k \vert F_{calc}(h,k,l)\vert)^2
\end{displaymath} (17.1)

which is a linear function of the negative logarithm of the likelihood of the atomic model, assuming that all observations are independent and normally distributed. $R$ can be made arbitrarily small by increasing the number of model parameters and subsequent refinement against $R'$; i.e., the diffraction data can be overfit without changing the information content of the atomic model.

Crystallographic diffraction data are redundant to some degree; e.g., a small portion of the data can be omitted without seriously affecting the result. Following the statistical concept of cross-validation, the observed reflections are partitioned into a test set $T$ and a working set $A$ (Brünger 1992); that is, $T$ and $A$ are disjoint, and their union is the full set of observed reflections. The value

\begin{displaymath}
R^{free}_T = \frac{\sum_{(h,k,l) \in T}
\vert\vert F_{obs}(h,k,l)\vert - k \vert F_{calc}(h,k,l)\vert\vert }
{\sum_{(h,k,l) \in T}\vert F_{obs}(h,k,l)\vert}
\end{displaymath} (17.2)

is referred to as the free $R$ value computed for the $T$ set of reflections. $T$ is omitted in the modeling process; e.g., in the case of crystallographic refinement, the residual to be minimized is given by
\begin{displaymath}
R'_A = \sum_{(h,k,l) \in A}
(\vert F_{obs}(h,k,l)\vert- k \vert F_{calc}(h,k,l)\vert)^2
\end{displaymath} (17.3)
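
As an illustration of Eqs. 17.2 and 17.3, the following Python sketch computes the working-set residual and the free $R$ value from lists of structure factor amplitudes. It is not X-PLOR code: the function names are hypothetical, the amplitudes are made-up numbers, and the scale factor $k$ is assumed here to be the least-squares scale determined from the working set alone, which may differ from what X-PLOR does internally.

```python
def scale_factor(f_obs, f_calc):
    """Least-squares k minimizing sum (|Fobs| - k|Fcalc|)^2 (assumed choice of k)."""
    num = sum(abs(o) * abs(c) for o, c in zip(f_obs, f_calc))
    den = sum(abs(c) ** 2 for c in f_calc)
    return num / den


def r_value(f_obs, f_calc, k):
    """sum | |Fobs| - k|Fcalc| | / sum |Fobs| over the given reflections."""
    num = sum(abs(abs(o) - k * abs(c)) for o, c in zip(f_obs, f_calc))
    den = sum(abs(o) for o in f_obs)
    return num / den


def residual(f_obs, f_calc, k):
    """sum (|Fobs| - k|Fcalc|)^2 over the given reflections (form of Eq. 17.3)."""
    return sum((abs(o) - k * abs(c)) ** 2 for o, c in zip(f_obs, f_calc))


def split_by_flag(values, test_flags):
    """Partition values into working set A (flag 0) and test set T (flag nonzero)."""
    working = [v for v, t in zip(values, test_flags) if t == 0]
    test = [v for v, t in zip(values, test_flags) if t != 0]
    return working, test


def cross_validate(f_obs, f_calc, test_flags):
    obs_a, obs_t = split_by_flag(f_obs, test_flags)
    calc_a, calc_t = split_by_flag(f_calc, test_flags)
    k = scale_factor(obs_a, calc_a)  # T plays no role in determining k (assumption)
    return {
        "k": k,
        "residual_work": residual(obs_a, calc_a, k),  # R'_A, Eq. 17.3
        "r_work": r_value(obs_a, calc_a, k),          # conventional R on A
        "r_free": r_value(obs_t, calc_t, k),          # R^free_T, Eq. 17.2
    }


# Made-up amplitudes; the last reflection is assigned to the test set.
f_obs = [10.0, 20.0, 30.0, 40.0]
f_calc = [5.0, 10.0, 15.0, 22.0]
test_flags = [0, 0, 0, 1]
result = cross_validate(f_obs, f_calc, test_flags)
print(result)
```

In this toy case the model reproduces the working set exactly, yet the free $R$ value is nonzero because the test reflection is not fitted as well.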

One would expect $R^{free}_T$ to be less prone to overfitting than $R$. This concept can be applied to the other statistical quantities available in X-PLOR, such as the standard linear correlation coefficient (Eq. 13.1). It can even be applied to crystal structures that have already been refined with all diffraction data included: refinement by simulated annealing with $T$ omitted will remove some of the model's memory of $T$.

Both $R^{free}_T$ and the rms difference between the model refined against the complete data set and the model refined against $A$ increase more or less monotonically as a function of the percentage of omitted data. This is to be expected of terms that monitor the validity of a model. The conventional $R$, by contrast, decreases as more data are omitted, which is paradoxical and misleading behavior for an indicator of the model's accuracy. As a compromise between avoiding fluctuations of $R^{free}_T$ and maintaining small rms differences between refined models, obtain $T$ from a random selection of 10% of the observed reflections.

The free $R$ value (or correlation coefficient) is printed along with the conventional $R$ value (correlation coefficient) during all refinement procedures in X-PLOR, including $PC$-refinement for molecular replacement. In addition, the data analysis can be carried out for both the test set $T$ and the working set $A$ when one is using the "PRINt R", "PRINt PHASe", and "PRINt COMPleteness" statements. The $R$ values or correlation coefficients are stored in the symbols $R, $TEST_R, $CORR, and $TEST_CORR whenever a computation of $E_{XREF}$ has been carried out, e.g., when a "PRINt TARGet" statement has been issued or an energy calculation has been carried out.

The following two example files show how to use the free $R$ value concept in X-PLOR. None of the example files described in the previous section has to be changed. The only requirement is to create a special reflection file that tells X-PLOR which reflections belong to the test set and which to the working set. This is indicated by the TEST array. The example file below randomly selects 10% of the data and sets the TEST array to 1 for the selected reflections. It then writes a new reflection file, "amy.cv", that should be used for all subsequent X-PLOR runs. X-PLOR automatically partitions the data into the working set and the test set whenever the TEST array contains nonzero elements. The reflections with TEST=1 are used for the free $R$ value (correlation) computation.

setup_free_r.inp
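
Independently of the X-PLOR input file, the selection logic itself can be sketched in a few lines of Python. This is only an illustration of the idea of a TEST flag array: the function name, the seed, and the choice to select exactly 10% of the reflections (rather than, say, testing each reflection with probability 0.1) are assumptions, not a description of X-PLOR's internals.

```python
import random


def assign_test_flags(n_reflections, fraction=0.10, seed=0):
    """Return a list of 0/1 flags; round(fraction * n) randomly chosen entries are 1."""
    rng = random.Random(seed)  # fixed seed so the partition is reproducible
    n_test = round(fraction * n_reflections)
    test_indices = set(rng.sample(range(n_reflections), n_test))
    # Flag 1 marks the test set T, flag 0 the working set A.
    return [1 if i in test_indices else 0 for i in range(n_reflections)]


flags = assign_test_flags(1000)
print(sum(flags), "of", len(flags), "reflections assigned to the test set")
```

Whatever the mechanism, the partition should be generated once and then reused for every subsequent run, which is the role the "amy.cv" file plays in the X-PLOR workflow.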

The example file below is a combination of the slow-cooling simulated annealing refinement cycle described in Section 14.1.3 and the restrained B-factor refinement described in Section 14.4. Note that no change was required in the input files except for using the "amy.cv" reflection file.

full_refinement.inp

As a consequence of the SA-refinement with the test set omitted, the free $R$ value deviates from the conventional $R$ value. However, the free $R$ value decreases during the course of the refinement, even though the test set of reflections has been omitted from the refinement process. This indicates that the information content and phase accuracy of the model increase during the refinement process. If at any stage in the refinement process (e.g., after refining additional water molecules) the free $R$ value increased, it would indicate that the phase accuracy of the model was worsened by the additional refinement. The free $R$ value can thus be used to guard against overfitting the diffraction data.
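
The monitoring rule described above can be expressed as a small check on a sequence of free $R$ values recorded after successive refinement stages. The sketch below is a hypothetical helper, not part of X-PLOR; the numbers are invented for illustration, and the optional tolerance is an assumption added to ignore small fluctuations.

```python
def free_r_increases(free_r_history, tolerance=0.0):
    """Return the indices of stages at which free R rose by more than tolerance."""
    return [
        i
        for i in range(1, len(free_r_history))
        if free_r_history[i] - free_r_history[i - 1] > tolerance
    ]


# Invented free R values after five successive refinement stages.
history = [0.45, 0.40, 0.36, 0.37, 0.35]
flagged = free_r_increases(history)
print("stages where free R increased:", flagged)
```

A flagged stage is a candidate for closer inspection, for example by reverting the most recently added model parameters and refining again.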

Figure 17.1 was produced by obtaining the free and conventional $R$ values using the UNIX grep facility from the X-PLOR output file (searching for "TEST=1" and "TEST=0"). The resulting lines were fed into a spreadsheet program.
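
For readers without grep at hand, the same filtering step can be sketched in Python. The helper name is hypothetical, and the sample lines are placeholders rather than real X-PLOR output: the sketch relies only on the substrings "TEST=1" and "TEST=0" named above and makes no assumption about the rest of the line format.

```python
def grep_lines(lines, pattern):
    """Return the lines that contain pattern, like a fixed-string grep."""
    return [line.rstrip("\n") for line in lines if pattern in line]


# Placeholder lines standing in for an output file; not real X-PLOR output.
sample = [
    "unrelated line\n",
    "<placeholder> TEST=1 <placeholder>\n",
    "<placeholder> TEST=0 <placeholder>\n",
]
free_lines = grep_lines(sample, "TEST=1")  # lines reporting test-set statistics
work_lines = grep_lines(sample, "TEST=0")  # lines reporting working-set statistics
print(free_lines)
print(work_lines)
```

To apply this to an actual run, pass an open file object for the X-PLOR output file in place of the sample list; the extracted lines can then be written out for plotting.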

Figure 17.1: Course of refinement.
\begin{figure}{\epsfxsize =300pt
\noindent\epsffile{xtal_free_r___free_r.eps}
}\end{figure}

Xplor-NIH 2023-11-10