It has been almost 50 years since York published an exact
and general solution for the best-fit straight line to independent points
with normally distributed errors in both

A common analytical task in the physical sciences is to find the true
straight-line relationship underlying independently measured points with
normally distributed measurement errors in both the ordinate

York was motivated by rubidium–strontium isochrons but his landmark solution
was universal. Unfortunately, while it became standard in geophysics, York's
solution has remained largely unknown to the broader scientific community:
York's paper has been cited almost 2000 times within the geophysical
literature and only about two dozen times in all the rest of the scientific
literature combined. Meanwhile, the number of papers and chapters expounding
on the subject as if the solution did not exist is staggering. Examples
include widely consulted books like

One scientific problem wanting for York's solution is the isotopic mixing
line. In 1958, Charles Keeling (Keeling, 1958) introduced the idea of using
an isotopic mixing line to determine the stable isotopic signature of a trace
gas source. If the isotopic composition

Here we introduce the reader to York's solution and its practical application, using the isotopic mixing line as a concrete example. We then use Monte Carlo simulations to precisely quantify the biases inherent to York's solution and to the popular special-case regression methods under various common (and some uncommon) conditions. In Appendix A, we provide a short, fast, and easily implemented algorithm for computing York's best-fit slope and intercept, as well as their associated uncertainties.

The goal of straight-line fitting is to retrieve the true straight-line
relationship between two variables,

If the points are independent of one another and their errors are normally distributed, then the problem can be treated by least-squares estimation (LSE), which is equivalent to maximum likelihood estimation (MLE) (Myung, 2003) in this situation. Most of the literature on straight-line fitting concerns LSE, as it is appropriate to the vast majority of straight-line fitting problems.
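
To make the equivalence concrete, the following sketch writes out the likelihood for this situation. The notation is assumed here for illustration and is not taken from the text above: $(X_i, Y_i)$ are the measured points, $(x_i, y_i)$ are the corresponding true points on the line $y_i = a + b x_i$, and $\sigma_{X_i}$ and $\sigma_{Y_i}$ are the known standard deviations of the measurement errors, which are taken to be uncorrelated for simplicity.

```latex
\begin{align}
\mathcal{L} &\propto \prod_{i=1}^{N}
  \exp\!\left[-\frac{(X_i - x_i)^2}{2\sigma_{X_i}^2}
              -\frac{(Y_i - y_i)^2}{2\sigma_{Y_i}^2}\right], \\
-2\ln\mathcal{L} &= \sum_{i=1}^{N}
  \left[\frac{(X_i - x_i)^2}{\sigma_{X_i}^2}
       +\frac{(Y_i - a - b\,x_i)^2}{\sigma_{Y_i}^2}\right] + \text{const.}
\end{align}
```

Because the logarithm is monotonic, maximizing $\mathcal{L}$ over $a$, $b$, and the $x_i$ is the same as minimizing the weighted sum of squared residuals on the right-hand side, which is the least-squares criterion.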

York's solution is the general LSE solution: his equations provide the best
possible, unbiased estimates of the true intercept

Before discussing OLS, ODR, and GMR further, we should note that each is
known by other names that add confusion to the literature (York, 1966; Hirsch
and Gilroy, 1984): OLS is called “the regression of

OLS is by far the most widely known fitting method. The OLS fit line is
unbiased only when there is negligible error in

The ODR fit line (Pearson, 1901) is unbiased only when the error variances
for the

Neither OLS, ODR, nor GMR uses estimates of the actual measurement uncertainty in its determination of the best-fit line.
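
This point can be seen directly from the textbook closed-form slopes, which depend only on the sample moments of the data. The sketch below is illustrative: the function name `moment_slopes` and the toy data are hypothetical, OLS is taken as the regression of y on x, and the sample covariance is assumed to be nonzero.

```python
import numpy as np


def moment_slopes(x, y):
    """Return {method: (intercept, slope)} for OLS (y on x), ODR, and GMR.

    Only the sample moments of x and y are used; no measurement
    uncertainties enter any of the three estimates.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx, dy = x - x.mean(), y - y.mean()
    sxx, syy, sxy = np.sum(dx * dx), np.sum(dy * dy), np.sum(dx * dy)
    slopes = {
        # Ordinary least squares: minimizes vertical distances.
        "OLS": sxy / sxx,
        # Orthogonal distance regression: minimizes perpendicular distances.
        "ODR": (syy - sxx + np.sqrt((syy - sxx) ** 2 + 4.0 * sxy**2)) / (2.0 * sxy),
        # Geometric mean regression: geometric mean of the y-on-x and
        # x-on-y slopes, with the sign of the covariance.
        "GMR": np.sign(sxy) * np.sqrt(syy / sxx),
    }
    # Every fit line passes through the centroid of the data.
    return {name: (y.mean() - b * x.mean(), b) for name, b in slopes.items()}


# Toy data: OLS gives slope 0.8, whereas ODR and GMR both give slope 1.0.
fits = moment_slopes([0, 1, 2, 3], [0, 2, 1, 3])
for name, (a, b) in fits.items():
    print(f"{name}: intercept = {a:.3f}, slope = {b:.3f}")
```

The function signature has no argument through which an error estimate could be supplied, so each method implicitly fixes the error structure it assumes.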

A superior fitting method, called

When the errors in any or all of the

If LSE must be rejected, MLE may or may not be tractable. MLE requires that
the correct error distributions be written analytically and that a useable
expression be derived for the likelihood function

York's method deals with the situation in which there is a linear
relationship between the true values

However, it is often the case that the natural variability is not
well characterized, or that it is not well described as additional
measurement error. In this case, we argue that one cannot proceed to
determine the best-fit line or even to define what “best-fit” means. In
general, one can view natural variability as variation in the true

If one is interested not in the

A Keeling plot is a scatterplot of the stable isotopic composition

An example Keeling plot from the set described in Sect. 3.3, showing
55 measurements made on the night of 25 May 2011, spanning a CO

Multiplying the Keeling fit line equation

In many ecosystems, the source–sink signatures of interest differ from one another by just 1 ‰ or less (Bowling et al., 2014), so that Keeling or Miller–Tans plot fit biases of 0.1 ‰ can be important.

The first studies to consider the effect of error in

An algorithm that solves the York equations (see Appendix A) requires, for each data point

The accuracy of the best-fit line will depend on the accuracy of the error
and correlation estimates, but almost any reasonable estimates will be better
than the estimates implicit in OLS, GMR, and ODR. (Good error estimates
are to be sought anyway, as a measurement is only meaningful to the extent
that its uncertainty has been quantified.) The Keeling plot is an interesting
application partly because, if we take the errors in

Regarding correlation between the errors in

The standard deviations of the errors in

For a Keeling fit, the correlation coefficient

It follows immediately from the above decompositions that

The approximations in Eqs. (1), (2), and (5) are usually excellent. Despite
the precision afforded our Monte Carlo simulations by using

On a modest 2009-model notebook computer, using the Igor Pro programming
language (WaveMetrics, Inc.), 5000 five-iteration York fits to a 5000-point
mixing line took 215 s (compare OLS at 14 s and

We tested the York, OLS, ODR, GMR, and

Real-world mixing line plots are not likely to contain 5000 points each, but using a large number of points per plot can be important when precisely quantifying fit bias in an ensemble of lines. A demonstration of this point is provided in Appendix B for the interested reader. Other readers need only know that using more points per plot does not add any new bias to the results, although it can make existing bias from an inappropriate fit model easier to see, which is the effect at work in the discussion under “Sample size effect” in Kayler et al. (2010).
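
The effect of the number of points per plot can be illustrated with a small simulation. This is a minimal sketch under assumed conditions, not the simulation used in this study: the true line has slope 1, the true x values are normally distributed, both variables carry normally distributed errors, and OLS is deliberately used as the inappropriate fit model. The function name and all parameter values are hypothetical.

```python
import numpy as np


def ols_slope_ensemble(n_points, n_plots=400, sx=1.0, sy=0.2, spread=5.0, seed=0):
    """OLS slopes for an ensemble of simulated plots with error in x and y."""
    rng = np.random.default_rng(seed)
    # True values lie on the line y = x (slope 1, intercept 0).
    t = rng.normal(0.0, spread, size=(n_plots, n_points))
    x = t + rng.normal(0.0, sx, size=t.shape)  # measured abscissa
    y = t + rng.normal(0.0, sy, size=t.shape)  # measured ordinate
    dx = x - x.mean(axis=1, keepdims=True)
    dy = y - y.mean(axis=1, keepdims=True)
    return (dx * dy).sum(axis=1) / (dx * dx).sum(axis=1)


# Slope expected from OLS when x has error (attenuation factor).
expected = 5.0**2 / (5.0**2 + 1.0**2)

small = ols_slope_ensemble(n_points=20)
large = ols_slope_ensemble(n_points=2000)
for label, slopes in (("20 points", small), ("2000 points", large)):
    print(f"{label}: mean slope = {slopes.mean():.4f}, "
          f"scatter between plots = {slopes.std(ddof=1):.4f}")
print(f"attenuated slope expected from OLS = {expected:.4f}")
```

In this sketch the ensemble-mean slope is about the same for both plot sizes, whereas the scatter between plots shrinks as points are added, so the same bias stands out more clearly against the reduced scatter.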

We performed tests for

Bias in retrieved isotopic source signatures from Keeling line fits
for

In addition to our Monte Carlo simulations, we analyzed 429 Keeling plots
composed of real measurements, specifically nighttime measurements of the
mixing ratio and

Isotopic source signatures retrieved from our simulated Keeling plots for

Bias in retrieved isotopic source signatures from Keeling and
Miller–Tans line fits for

Indeed, as reported by Zobitz et al. (2006), GMR produces highly biased fit
lines (that is, the retrieved source signature falls much more than
3 standard errors from

The York equations, however, produce unbiased Keeling and
Miller–Tans fit lines for all conditions in the table. Because the emergence
of high-frequency isotopic measurements is starting to raise the issue, we
show in detail how some OLS- and York-retrieved isotopic source signatures
compare at the lowest

Mean isotopic signatures retrieved from ensembles of 5000 simulated
Keeling plots (each containing 5000 points) using the York and OLS methods,
for CO

Isotopic source signatures retrieved from our simulated Keeling and
Miller–Tans plots for

We also tested

The intercepts from OLS, GMR, and York fits to our measured Keeling plots are
compared in Fig. 3, as a function of the CO

Monte Carlo (MC) and York estimates of the error in the isotopic
source signature retrieved from an individual Keeling plot, in units of
‰. The mean goodness of fit

York et al. (2004) provide compact equations not only for the slope and
intercept but for their standard errors as well. They further show that
these error estimates are algebraically identical to those of the more
general error formulation of MLE. Note, however, that while the York
equations for the slope and intercept are exact (when the measurement
uncertainties are normally distributed), the York–MLE estimates of the errors
in the slope and intercept are approximations (Titterington and Halliday,
1973). In Table 3, we compare York's error estimates against the standard
deviations of the Keeling plot intercepts retrieved from our Monte Carlo
simulations. York's error estimates agree extremely well with the Monte Carlo
results except when the measurement error variances are so large as to
approach the total variances in

Difference between the fit intercept obtained by OLS or GMR and that
obtained by York's equations, for 429 real measured Keeling plots with
measurement uncertainties of 0.05 ppm and 0.02 ‰ and with CO

The errors estimated by

The standard error is a measure of precision; it does not speak to how well
the straight-line model represents the data – a concept known as goodness of
fit. York et al. (2004) note that the goodness of fit of the York solution is
estimated by the quantity

The right-hand column of Table 3 gives the mean value of

When comparing measurements of the same quantity by two different instruments, it is common to plot the values obtained by one instrument against those obtained by the other, so that the relative bias between the instruments can be determined from a straight-line fit to the plot. Monte Carlo simulations similar to those used for our isotopic mixing line example confirm that OLS and GMR may incorrectly estimate that bias. For example, if an old, unbiased instrument is being replaced by a new, also unbiased instrument whose precision is 5 times better, and if the two instruments are compared over a trial period in which the measured quantity varies over a range that is 20 times greater than the precision of the old instrument, then OLS will incorrectly indicate that the new instrument is biased low by 4 % of the reading, and GMR will incorrectly indicate a low bias of 2 %. The York equations will correctly indicate no relative bias.
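
The instrument comparison can be reproduced approximately with the sketch below. It is not the code of Appendix A: `york_fit` is a hypothetical helper that follows the iterative scheme of York et al. (2004) as we understand it, and the simulation rests on assumptions that are not stated in the text, namely that the old instrument is plotted on the x axis and that the true values are normally distributed with a standard deviation 5 times the old instrument's precision (so that ±2 standard deviations span the stated range of 20).

```python
import numpy as np


def york_fit(x, y, sx, sy, r=0.0, n_iter=10):
    """Iterative York-type fit; returns (a, b, sigma_a, sigma_b, gof)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    wx = np.broadcast_to(1.0 / np.asarray(sx, dtype=float) ** 2, x.shape)
    wy = np.broadcast_to(1.0 / np.asarray(sy, dtype=float) ** 2, x.shape)
    r = np.broadcast_to(np.asarray(r, dtype=float), x.shape)
    alpha = np.sqrt(wx * wy)
    b = np.polyfit(x, y, 1)[0]  # OLS slope as the initial guess
    for i in range(n_iter + 1):
        w = wx * wy / (wx + b * b * wy - 2.0 * b * r * alpha)
        xbar, ybar = np.average(x, weights=w), np.average(y, weights=w)
        u, v = x - xbar, y - ybar
        beta = w * (u / wy + b * v / wx - (b * u + v) * r / alpha)
        if i < n_iter:  # last pass only refreshes w, xbar, ybar, beta
            b = np.sum(w * beta * v) / np.sum(w * beta * u)
    a = ybar - b * xbar
    x_adj = xbar + beta  # least-squares-adjusted x values
    x_adj_bar = np.average(x_adj, weights=w)
    sigma_b = np.sqrt(1.0 / np.sum(w * (x_adj - x_adj_bar) ** 2))
    sigma_a = np.sqrt(1.0 / np.sum(w) + x_adj_bar**2 * sigma_b**2)
    gof = np.sum(w * (y - b * x - a) ** 2) / (x.size - 2)
    return a, b, sigma_a, sigma_b, gof


rng = np.random.default_rng(1)
n = 20000
sigma_old, sigma_new = 1.0, 0.2                   # new precision is 5 times better
truth = rng.normal(100.0, 5.0 * sigma_old, n)     # assumed spread of true values
x_old = truth + rng.normal(0.0, sigma_old, n)     # old instrument on the x axis
y_new = truth + rng.normal(0.0, sigma_new, n)     # new instrument on the y axis

dx, dy = x_old - x_old.mean(), y_new - y_new.mean()
b_ols = np.sum(dx * dy) / np.sum(dx * dx)
b_gmr = np.sign(np.sum(dx * dy)) * np.sqrt(np.sum(dy * dy) / np.sum(dx * dx))
a_york, b_york, sa_york, sb_york, gof = york_fit(x_old, y_new, sigma_old, sigma_new)

print(f"OLS slope:  {b_ols:.4f}")
print(f"GMR slope:  {b_gmr:.4f}")
print(f"York slope: {b_york:.4f} +/- {sb_york:.4f} (goodness of fit {gof:.3f})")
```

Under these assumptions the OLS and GMR slopes come out a few percent below 1, which would be misread as a low bias in the new instrument, whereas the York slope is consistent with 1 and the goodness-of-fit statistic is close to 1.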

We have shown that the general least-squares solution for the best-fit straight line, published by Derek York in 1969, is the least biased estimator of the isotopic signature of a trace gas source from a Keeling or a Miller–Tans plot, regardless of the measurement conditions. In contrast, the popular regression methods considered in the literature are unbiased only under particular, often unrealistic conditions. The isotopic mixing line illustrates the virtue and convenience of York's solution, which is applicable to line-fitting problems in many scientific disciplines. We have provided a short, fast pseudo-code algorithm for computing York's solution and derived simple equations for the required measurement error and correlation parameters in the case of a Keeling or Miller–Tans plot. Being both accurate and convenient, York's solution supersedes all other least-squares straight-line fit methods.

No original measurements were used. The forest measurements can be accessed as described in the original study (Wehr et al., 2016). The simulations used the code described in Appendix A.

Here we provide an algorithm in pseudo-code for computing the slope and
intercept of the best-fit straight line according to Eqs. (13) of York et
al. (2004). The data consist of

Why simulate 5000 points per plot when real plots are more likely to contain
20 or 50 points? Consider the simple picture in Fig. B1. Here a true line with
slope 1 and

Translating the simple picture of Fig. B1 into practical reality, one finds
that when there is normally distributed measurement error only in

Fortunately, as the CO

Two lines, each defined by measurements of the true points (2,2) and
(6,6), that are in error by

One might be worried about increasing the number of points per plot because
it has been reported that fit bias appears to increase with the number of
points per plot (Kayler et al., 2010). Actually, increasing the number of
points per plot merely clarifies the bias associated with a poor fitting
method. Consider Fig. B2, where we plot distributions of

Distributions of the

Richard Wehr designed and performed the study and prepared the manuscript with contributions from Scott R. Saleska.

The authors declare that they have no conflict of interest.

This research was supported by the US Department of Energy, Office of Science, Terrestrial Ecosystem Science program (award no. DE-SC0006741), by the National Science Foundation Macrosystems Biology program (award no. 1241962), and by the Agnese Nelms Haury Program in Environment and Social Justice at the University of Arizona.

Edited by: J.-A. Subke
Reviewed by: J. Miller and one anonymous referee