Evaluation of biospheric components in Earth system models using modern and palaeo-observations: the state-of-the-art

Earth system models (ESMs) are increasing in complexity by incorporating more processes than their predecessors, making them potentially important tools for studying the evolution of climate and associated biogeochemical cycles. However, their coupled behaviour has only recently been examined in any detail, and has yielded a very wide range of outcomes. For example, coupled climate–carbon cycle models that represent land-use change simulate total land carbon stores at 2100 that vary by as much as 600 Pg C, given the same emissions scenario. This large uncertainty is associated with differences in how key processes are simulated in different models, and illustrates the necessity of determining which models are most realistic using rigorous methods of model evaluation. Here we assess the state of the art in evaluation of ESMs, with a particular emphasis on the simulation of the carbon cycle and associated biospheric processes. We examine some of the new advances and remaining uncertainties relating to (i) modern and palaeodata and (ii) metrics for evaluation. We note that the practice of averaging results from many models is unreliable and no substitute for proper evaluation of individual models. We discuss a range of strategies, such as the inclusion of pre-calibration, combined process- and system-level evaluation, and the use of emergent constraints, that can contribute to the development of more robust evaluation schemes. An increasingly data-rich environment offers more opportunities for model evaluation, but also presents a challenge. Improved knowledge of data uncertainties is still necessary to move the field of ESM evaluation away from a "beauty contest" towards the development of useful constraints on model outcomes.


Introduction
Earth system models (ESMs), which use sets of equations to represent atmospheric, oceanic, cryospheric, and biospheric processes and interactions (Claussen et al., 2002; Le Treut et al., 2007; Lohmann et al., 2008), are intended as tools for the study of the Earth system. The current generation of ESMs are substantially more complex than their predecessors in terms of land and ocean biogeochemistry, and can also account for land cover change, which is an important driver of the climate system through both biophysical and biogeochemical feedbacks. Yet their coupled behaviour has only recently begun to be explored.
The carbon cycle is a central feature of current ESMs, and the representation and quantification of climate–carbon cycle feedbacks involving the biosphere has been a primary goal of recent ESM development. ESM results submitted to the Coupled Model Intercomparison Project Phase 5 (CMIP5) simulate total land carbon stores in 2100 that vary by as much as 600 Pg C across models with the ability to represent land-use change, even when forced with the same anthropogenic emissions (Jones et al., 2013). This indicates that there are large uncertainties associated with how carbon cycle processes are represented in different models. In addition to these uncertainties in the biogeochemical climate–vegetation feedbacks, there are considerable uncertainties in the biogeophysical feedbacks (Willeit et al., 2013).
Robust evaluation of a model's ability to simulate key carbon cycle processes is therefore a critical component of efforts to model future climate–carbon cycle dynamics. Robust evaluation establishes the confidence which can be placed on a given model's projection of future behaviours and states of the system. However, evaluation is complicated by the fact that ESMs differ in their level of complexity. To take the example of land cover, while some models only account for biophysical effects (e.g. related to changes in surface albedo), some ESMs also account for biogeochemical effects (principally a change in carbon storage following land conversion). Another example is the representation of nutrient cycles, which not all ESMs include. Current model projections that do include the coupling between terrestrial carbon and nitrogen (and in some cases phosphorus) cycles suggest that taking nutrient limitations into account attenuates possible future carbon cycle responses. This is because soil nitrogen tends to limit the ability of plants to respond positively to increases in atmospheric CO2, reducing CO2 fertilisation, and, conversely, tends to limit ecosystem carbon losses with temperature increases, as these also increase rates of nitrogen mineralisation. The reduction in CO2 fertilisation is found to dominate, leading to a stronger accumulation of CO2 in the atmosphere by the end of the 21st century than is projected by carbon cycle models that do not include nutrient feedbacks (Sokolov et al., 2008; Thornton et al., 2009; Zaehle et al., 2010).
Evaluation studies in climate modelling have highlighted how the choice of methodology can significantly impact the conclusions reached concerning model skill (e.g. Radic and Clarke, 2011; Foley et al., 2013). Several studies have found that the mean of an ensemble of models outperforms all or most single models of that ensemble (e.g. Evans, 2008; Pincus et al., 2008). However, Schaller et al. (2011) demonstrated that although the multi-model mean outperforms individual models when the ability to reproduce global fields of climate variables is evaluated, it does not consistently outperform the individual models when the ability to simulate regional climatic features is evaluated. This highlights the need for robust assessments of model skill. Model evaluations which use inappropriate metrics or fail to consider key aspects of the system have the potential to lead to overconfidence in model projections. In particular, the averaging of results from different models is not an adequate substitute for proper evaluation of each model in turn.
Developing robust approaches to model evaluation, that is, approaches which reduce the data- and metric-dependency of statements about model skill, is challenging for reasons that are not exclusive to carbon cycle modelling but applicable across all aspects of Earth system modelling. Data sets may lack uncertainty estimates, significantly reducing their usefulness for model evaluation. Critical analysis may be required to reconcile differences between data sets intended to describe similar phenomena, such as temperature reconstructions based on different indicators (Mann et al., 2009). Furthermore, there are many metrics in use in model evaluation and often the rationale for applying a specific metric is unclear. This paper considers these issues, along with strategies for improvement. We first consider the role of data sets in evaluation, including mismatches between available data and what is required for evaluation, and the challenges of using data collected at a specific spatial or temporal scale to develop larger-scale tests of model behaviour.
Next, we consider metrics for model evaluation. Metrics are simple formulae or mathematical procedures that measure the similarity or difference between two data sets. Whether using classical metrics (such as root mean square error, correlation, or model efficiency) or advanced analytical techniques (such as artificial neural networks) to compare models with data and quantify model skill, it is necessary to be aware of the statistical properties of metrics, as well as the properties of the model variables under consideration and the limitations of the evaluation data sets. Otherwise, there is a strong potential to draw false conclusions concerning model skill. Recent attempts to provide a benchmarking framework for land surface model evaluation indicate a move toward setting community-accepted standards (Randerson et al., 2009; Luo et al., 2012; Kelley et al., 2012). However, different levels of complexity in ESMs, different parameterisation procedures and modelling approaches, the validity of data, and an unavoidable level of subjectivity complicate the task of identifying universally applicable procedures.
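To make the statistical properties of such classical metrics concrete, the three mentioned above can be sketched in a few lines of Python. This is a minimal illustration, not taken from any particular benchmarking package, and the function names are ours:

```python
import math

def rmse(model, obs):
    # Root mean square error: average magnitude of the model-data mismatch,
    # in the units of the variable itself.
    return math.sqrt(sum((m - o) ** 2 for m, o in zip(model, obs)) / len(obs))

def pearson_r(model, obs):
    # Pearson correlation: agreement in pattern, insensitive to a constant bias.
    n = len(obs)
    mm, mo = sum(model) / n, sum(obs) / n
    cov = sum((m - mm) * (o - mo) for m, o in zip(model, obs))
    sm = math.sqrt(sum((m - mm) ** 2 for m in model))
    so = math.sqrt(sum((o - mo) ** 2 for o in obs))
    return cov / (sm * so)

def model_efficiency(model, obs):
    # Nash-Sutcliffe model efficiency: 1 means a perfect fit, while values
    # <= 0 mean the model is no better than using the observed mean.
    mo = sum(obs) / len(obs)
    num = sum((o - m) ** 2 for o, m in zip(obs, model))
    den = sum((o - mo) ** 2 for o in obs)
    return 1.0 - num / den
```

The three measure different things: a biased but well-correlated model can score well on correlation yet poorly on RMSE and efficiency, which is one reason the choice of metric matters.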
Finally, recommendations for more robust evaluation are discussed. We note that evaluation can be process-based ("bottom-up") or system-level ("top-down") (Fig. 1). Evaluation can utilise pre-calibration, and/or emergent constraints across multiple models. A combination of approaches can increase our understanding of a model's ability to simulate processes across multiple temporal and spatial scales.
Consideration will also be given to how key questions arising in the paper could potentially be resolved through coordinated research activities.

The role of data sets in ESM evaluation
ESMs aim to simulate a highly complex system. Nonlinearities in the system imply that even a small change in one of the components might unexpectedly influence another component (Roe and Baker, 2007). As such, robust model evaluation is critical to assist in understanding the behaviour of ESMs and the limitations of what we can and cannot represent quantitatively. The development of such approaches to model evaluation requires consideration of many different data types.
Modern and palaeodata are both used for model evaluation, although each kind of data has advantages and limitations (Table 1). Experimental data provide benchmarks for a range of carbon cycle-relevant processes (e.g. physiologically based responses of ecosystems to warming and CO2 increase) that cannot be tested in other ways. However, for processes that are biome-specific, the limited geographical scope of the relatively few existing records is problematic. Data sets also exist with more global coverage, documenting changes in the recent past (the last 30-50 yr), but an inherent limitation of these data sets is that they sample the carbon cycle response to a limited range of variation in atmospheric CO2 concentration and climate.
Palaeoclimate evaluation is an important test of how well ESMs reproduce climate changes (e.g. Braconnot et al., 2012). The past does not provide direct analogues for the future, but does offer the opportunity to examine climate changes that are as large as those anticipated during the 21st century, and to evaluate climate system feedbacks with response times longer than the instrumental period (e.g. cryosphere, ocean circulation, some components of the carbon cycle).

Modern data sets
Evaluation analysis can benefit from modern data sets, to test and constrain components within ESMs in a hierarchical approach (Leffelaar, 1990; Wu and David, 2002). Recent initiatives in land and ocean model evaluation and benchmarking (land: Randerson et al., 2009; Luo et al., 2012; Kelley et al., 2012; Dalmonech and Zaehle, 2013; ocean: Najjar et al., 2007; Friedrichs et al., 2009; Stow et al., 2009; Bopp et al., 2013) give examples of suitable modern data sets for model evaluation and their use in diagnosing model inconsistencies with respect to behaviour of the carbon cycle. These include instrumental data, such as direct measurements of CO2 and CH4 spanning the last 30-50 yr, measurements from carbon flux monitoring networks, and satellite-based data of various kinds (Table 1).
[Table 1 (fragment): modern data sets lack full data–model independence where the data are themselves model-derived, and documentation of errors and uncertainties is inconsistent. Palaeodata include reconstructions based on the interpretation of biological or geochemical records, measurements of concentrations and isotopic ratios from ice cores, and tree-ring data sets; they test the ability to capture the behaviour of the system outside the modern range and offer a signal that is large compared to the noise, but records are site-specific (except for long-lived greenhouse gases), and synthesis is required to produce global estimates.]

Due to their detailed spatial coverage and high temporal resolution, satellite data sets offer the potential to explore the representation of processes in models in detail. Most satellite products are, however, retrievals rather than direct observations (Table 1), with some sort of model used to transform the direct measurements of the satellite into other parameters of interest. If, for example, a radiative transfer model is used to estimate the atmospheric or surface state from measured radiances, then there will likely be similarities between the functions used for the retrieval and those used in a climate model. This is not a major problem if the data are used in an informed way, and indeed it presents opportunities (e.g. the estimated surface variable can be compared with a modelled variable without the model radiative transfer functions being involved). Statistical and change detection retrievals rely not on physical models but on statistical links between variables or on a modulation of the satellite signal. These two types of retrievals sometimes use model data for calibration but are otherwise independent of models. Statistical models in particular are not only useful to evaluate specific parameters in a model, but can also be used to perform process-based evaluations. Uncertainty estimates are not always provided or propagated during the retrieval process. Nevertheless, modern data sets are a very rich data source with a number of useful applications. For example, robust spatial and temporal information emerging from data can be used to rule out unreasonable simulations and diagnose model weaknesses. Satellite-based data sets of vegetation activity depict ecosystem response to climate variability at seasonal and interannual time scales and return patterns of forced variability that can be useful for model evaluation (Beck et al., 2011; Dahlke et al., 2012), even if bias within the data set is greater than data–model differences (e.g. Fig. 2).
Ecosystem observations, such as eddy covariance measurements of CO2 and latent heat exchanges between the atmosphere and land, and ecosystem manipulation studies, such as drought treatments and free air CO2 enrichment (FACE) experiments, provide a unique source of information to evaluate process formulations in the land component of ESMs (Friend et al., 2007; Bonan et al., 2012; De Kauwe et al., 2013). Manipulation experiments (e.g. FACE experiments: Nowak et al., 2004; Ainsworth and Long, 2005; Norby and Zak, 2011) are a particularly powerful test of key processes in ESMs and their constituent components, as shown by Zaehle et al. (2010) in relation to C-N cycle coupling, and De Kauwe et al. (2013) for carbon-water cycling. It should be expected that models would be able to reproduce experimental results involving manipulations of global change drivers such as CO2, temperature, rainfall, and N addition.
The application of such data for the evaluation of ESMs is challenging because of the limited spatial representativeness of the observations, resulting from the lack of any coherent global strategy for the placement of flux towers or experimental sites, and the high costs of running these facilities. Upscaling monitoring data using data-mining techniques and ancillary data, such as remote sensing and climate data, provides one possible means to bridge the gap between the spatial scale of observation and ESMs (Jung et al., 2011). However, this can come at the cost of introducing model assumptions and uncertainties that are difficult to quantify. Furthermore, such upscaling is near impossible for ecosystem manipulation experiments, as they are so scarce, and rarely performed following a comparable protocol. More and better-coordinated manipulation studies are needed to better constrain ESM prediction (Batterman and Larsen, 2011; Vicca et al., 2012). Hickler et al. (2008), for example, showed that the LPJ-GUESS model produced quantitatively realistic net primary production (NPP) enhancement due to CO2 elevation in temperate forests, but also showed greatly different responses in boreal and tropical forests, for which no adequate manipulation studies exist. These predictions remain to be tested.
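The upscaling idea can be illustrated with a deliberately minimal sketch: fit a statistical relationship between a site-level quantity and an ancillary climate variable, then apply it to gridded data. Real applications such as Jung et al. (2011) use far richer predictor sets and machine-learning methods (e.g. model tree ensembles); the data values below are hypothetical.

```python
def fit_linear(x, y):
    # Ordinary least squares for a single predictor: y ~ a + b * x.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / \
        sum((xi - mx) ** 2 for xi in x)
    return my - b * mx, b

# Hypothetical flux-tower GPP (g C m-2 yr-1) vs. mean annual temperature (degC):
site_t   = [2.0, 8.0, 15.0, 25.0]
site_gpp = [400.0, 900.0, 1500.0, 2600.0]
a, b = fit_linear(site_t, site_gpp)

# "Upscale": predict GPP for every grid cell from a gridded climate field.
grid_t = [0.0, 5.0, 10.0, 20.0]
grid_gpp = [a + b * t for t in grid_t]
```

The weakness flagged in the text is visible even here: the fitted coefficients carry assumptions (linearity, representative sites) whose uncertainty is not propagated into the gridded product.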
The interpretation of experiments is not unambiguous because it is seldom that just one external variable can be altered at a time. To give just one recent example, Bauerle et al. (2012) showed that the widely observed decline of Rubisco capacity (Vc,max) in leaves during the transition from summer to autumn could be abated by supplementary lighting designed to maintain the summer photoperiod, and concluded that Vc,max is under photoperiodic control, asserting that models should include this effect. However, their treatment also inevitably increased total daily photosynthetically active radiation (PAR) in autumn. On the basis of the information given about the experimental protocol, these results could therefore also be interpreted as showing that seasonal variations in Vc,max are related to daily total PAR. This example draws attention to a key principle for the use of experimental results in model evaluation, namely that such comparisons are only valid if the models explicitly follow the experimental protocol. It is not sufficient for models to attempt to reproduce the stated general conclusions of experimental studies. The possibility of invalid comparisons can most easily be avoided through the inclusion of experimentalists from the outset in model evaluation projects. This was the case, for example, in the FACE data-model comparison study of De Kauwe et al. (2013).
This example also illustrates a general challenge for the modelling community. One response to new experimental studies is to increase model complexity by adding new processes based on the ostensible advances in knowledge. However, we advise a more critical and cautious approach, employing case-by-case comparisons of model results and experiments, rather than general interpretation of experiments, to reduce the potential for ambiguities and avoid unnecessary complexity in models. Such an approach would prevent the occurrence of overparameterisation, the implications of which have been explored by Crout et al. (2009).

Palaeodata
The key purpose of palaeo-evaluation is to establish whether the model has the correct sensitivity for large-scale processes. Models are typically developed using modern observations (i.e. under a limited range of climate conditions and behaviours), but we need to determine how well they simulate a large climate change, to assess whether they can capture the behaviour of the system outside the modern range. If our understanding of the physics and biology of the system is correct, models should be able to predict past changes as well as present behaviour.
Reconstructions of global temperature changes over the last 1500 yr (e.g. Mann et al., 2009) are primarily derived from tree-ring and isotopic records, while reconstructions of climates over the last deglaciation and the Holocene (e.g. Davis et al., 2003; Viau et al., 2008; Seppä et al., 2009) are primarily derived from pollen data, although other biotic assemblages and geochemical data have been used at individual sites (e.g. Larocque and Bigler, 2004; Hou et al., 2006; Millet et al., 2009). Marine sediment cores have been used extensively to generate sea-surface temperature reconstructions (e.g. Marcott et al., 2013), and to reconstruct other past climate variables related to ocean conditions (see review in Henderson, 2002). For example, δ13C is used in reconstructions of ocean circulation, marine productivity, and biosphere carbon storage (Oliver et al., 2010). However, the interpretation of these data is often not straightforward, since the measured indicators are frequently influenced by more than one climatic variable (e.g. the benthic δ18O measured in foraminiferal shells contains information on both global sea level and deep water temperature). Errors associated with the data and their interpretation also need to be stated: while analytical errors on the measurements are often small, errors in the calibrations used to obtain reconstructions tend to be much bigger. The incorporation of measured variables such as marine carbonate concentrations (e.g. Ridgwell et al., 2007), δ18O (e.g. Roche et al., 2004), or δ13C (e.g. Crucifix, 2005) as variables in models is therefore an important advance, because it allows comparison of model outputs directly with data, rather than relying on a potentially flawed comparison between modelled variables and the same variables reconstructed from chemical or isotopic measurements.
Ice cores provide a polar contribution to climate response reconstruction, as well as crucial information on a range of climate-relevant factors. For example, responses to forcings by solar variability (through 10Be), volcanism (through sulfate spikes), and changes in the atmospheric concentration of greenhouse gases (e.g. CO2, CH4, N2O) and mineral dust can be assessed. CH4 can be measured in both Greenland and Antarctic ice cores. CO2 measurements require Antarctic cores, due to the high concentrations of impurities in Greenland samples, which lead to the in situ production of CO2 (Tschumi and Stauffer, 2000; Stauffer et al., 2002). For the last few millennia, choosing sites with the highest snow accumulation rates yields decadal resolution. The highest resolution records to date are from Law Dome (MacFarling Meure et al., 2006), making these data more reliable, particularly for model evaluation (e.g. Frank et al., 2010); further work at high accumulation sites would provide reassurance on this point. Over longer time periods, sites with progressively lower snow accumulation rates, and therefore lower intrinsic time resolution, have to be used. Through the Holocene (the last ∼ 11 000 yr) (Elsig et al., 2009) and the last deglaciation, i.e. the transition out of the Last Glacial Maximum (LGM) into the Holocene (Lourantou et al., 2010; Schmidt et al., 2012), there are now high-quality 13C/12C of CO2 data available, as well as much improved information about the phasing between the change in Antarctic temperature and CO2 (Pedro et al., 2012; Parrenin et al., 2013), and between CO2 and the global mean temperature (Shakun et al., 2012).
Compared to the amount of effort spent on reconstructing past climates and atmospheric composition, relatively few data sets provide information on different components of the terrestrial carbon cycle. Nevertheless, there are data sets, synthesised from many individual published studies, that provide information on changes in vegetation distribution (e.g. Prentice et al., 2000; Bigelow et al., 2003; Harrison and Sanchez Goñi, 2010; Prentice et al., 2011a), biomass burning (Power et al., 2008; Daniau et al., 2012), and peat accumulation (e.g. Yu et al., 2010; Charman et al., 2013). These data sets are important because they can be used to test the response of individual components of ESMs to changes in forcing.
The major advantage of evaluating models using the palaeorecord is that it is possible to focus on times when the signal is large compared to the noise. The change in forcing at the LGM relative to the pre-industrial control is of comparable magnitude, though opposite in direction, to the change in forcing from quadrupling CO2 relative to that same control (Izumi et al., 2013). Thus, comparisons of palaeoclimatic simulations and observations since the LGM can provide a measure of individual model performance, discriminate between models, and allow diagnosis of the sources of model error for a range of climate states similar in scope to those expected in the future. For example, Harrison et al. (2013) evaluated mid-Holocene and LGM simulations from the CMIP5 archive, and from the second phase of the Palaeoclimate Modelling Intercomparison Project (PMIP2), against observational benchmarks, using goodness-of-fit and bias metrics. However, as is the case for many modern observational data sets (e.g. Kelley et al., 2012), not all published palaeoreconstructions provide adequate documentation of errors and uncertainties, and there is a lack of standardisation between data sets where such estimates are provided (e.g. Leduc et al., 2010; Bartlein et al., 2011). Reconstructions based on ice or sediment cores are intrinsically site-specific (except for the globally significant greenhouse gas records); therefore many records are required to synthesise regional or global distribution patterns and estimates (Fig. 3). Community efforts to provide high-quality compilations of already available data (e.g. Waelbroeck et al., 2009; Bartlein et al., 2010) make it possible to use palaeodata for model evaluation, but an increase in the coverage of palaeoreconstructions is still required to evaluate model behaviour at regional scales.
Unfortunately, most attempts to compare simulations and reconstructions using palaeodata have focused on purely qualitative agreement of simulated and observed spatial patterns (e.g. Otto-Bliesner et al., 2007; Miller et al., 2010). There has been surprisingly little use of metrics for palaeodata-model comparisons (for exceptions see e.g. Guiot et al., 1999; Paul and Schäfer-Neth, 2004; Harrison et al., 2013). This situation probably reflects problems in developing meaningful ways of taking uncertainties into account in these comparisons. Quantitative assessments have generally focused on individual large-scale features of the climate system, for example the magnitude of the insolation-induced increase in precipitation over northern Africa during the mid-Holocene (Joussaume et al., 1999; Jansen et al., 2007), zonal cooling in the tropics at the LGM (Otto-Bliesner et al., 2009), or the amplification of cooling over Antarctica relative to the tropical oceans at the LGM (Masson-Delmotte et al., 2006; Braconnot et al., 2012). Comparisons of simulated vegetation changes have been based on assessments of the number of matches to site-based observations from a region (e.g. Harrison and Prentice, 2003; Wohlfahrt et al., 2004, 2008). Observational uncertainty is represented visually in such comparisons, and only used explicitly to identify extreme behaviour amongst the models. Nevertheless, the recent trend is towards explicit incorporation of uncertainties and systematic model benchmarking (Harrison et al., 2013; Izumi et al., 2013).

Metrics for model evaluation

Many metrics have been proposed (Tables 2-4), and the choice of an appropriate metric in model evaluation is crucial because the use of inappropriate metrics can lead to overconfidence in model skill. The choice should be based on the properties of the data sets, the properties of the metric, and the specific objectives of the evaluation. Metric formalism, that is, the treatment of metrics as well-defined mathematical and statistical concepts, can help the interpretation of metrics, their analysis, or their combination into a "skill score" (Taylor, 2001) in an objective way.
The use of metrics draws on the mathematical concept of "distance", d(x, y), defined by three properties: separation (d(x, y) = 0 if and only if x = y), symmetry (d(x, y) = d(y, x)), and the triangle inequality (d(x, z) ≤ d(x, y) + d(y, z)). The two data sets could be two model outputs, where the metric is used to measure how similar the two models are, or one model output and one reference observation data set, where the metric is used to evaluate the model against real measurements. Three levels of metric complexity can be identified, relating to the state-space on which the distance is applied:

- Level 1 - "comparisons of raw biogeophysical variables". Here the distance generally reflects errors and provides assessment of model performance where there is a reasonable degree of similarity between the model and reference data set (such as climate variables in weather models).
- Level 2 - "comparisons of statistics on biogeophysical variables". Here the distance is measured on a statistical property of the data sets. This is particularly useful for models that are expected to characterise the statistical behaviour of a system (e.g. climate models). This level is appropriate for most of the biophysical variables simulated by ESMs.
- Level 3 - "comparisons of relationships among biogeophysical variables". Here the distance is diagnostic of relationships related to physical and/or biological processes, and this level of comparison is therefore useful for understanding the behaviour of two data sets.
At all levels of metric complexity, the metric needs to be synthetic enough to aid in understanding the similarities and differences between the two data sets, and understandable by non-specialists in order to facilitate its use by other communities. Next, the particular uses, advantages, and limitations of metrics at each level of metric complexity will be discussed.
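The three distance properties can be checked numerically for any candidate measure; a small illustrative sketch using the Euclidean distance on random points:

```python
import itertools
import math
import random

def euclid(x, y):
    # Euclidean distance between two equal-length vectors.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

random.seed(0)
pts = [tuple(random.uniform(-1.0, 1.0) for _ in range(3)) for _ in range(15)]

for x, y, z in itertools.product(pts, repeat=3):
    assert euclid(x, y) == euclid(y, x)                          # symmetry
    assert euclid(x, z) <= euclid(x, y) + euclid(y, z) + 1e-12   # triangle inequality
assert all((euclid(x, y) == 0.0) == (x == y) for x in pts for y in pts)  # separation
```

A measure that fails one of these checks is not a metric in the strict sense; as noted below, the asymmetric Chi-square distance is one such case.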

Metrics on raw biogeophysical variables
Level 1 metrics are the most widely used. The distance measures the discrepancies between two data sets of a key biogeophysical variable. Discrepancies can be measured at site level or at pixel level for gridded data sets, and thus such comparisons can be used for model evaluation against sparse data, such as site-based NPP data (e.g. Zaehle and Friend, 2010), eddy-covariance data (e.g. Blyth et al., 2011), or atmospheric CO2 concentration records at remote monitoring stations (e.g. Cadule et al., 2010; Dalmonech and Zaehle, 2013). Where there are sufficient data to make the calculation meaningful, comparisons can be made against spatial averages or global means of the biogeophysical variables. Comparisons can also be made in the time domain because climate change and climate variability act on Earth system components across a wide range of temporal scales. The distance can thus be measured on instantaneous variables or on time-averaged variables, such as annual means. Many distances, summarised in Table 2, can be considered to measure these discrepancies.
The Euclidean distance (Eq. 1) is the most commonly used distance. It is more sensitive to outliers than the Manhattan distance (Eq. 2). Both of these distances assume that direct comparisons of the data can be made. Some examples are reported in Jolliff et al. (2009), where the Euclidean distance is used to evaluate three ocean surface bio-optical fields.
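The differing outlier sensitivity is easy to demonstrate (a toy illustration with made-up numbers):

```python
def euclidean(x, y):
    # Eq. (1): square root of the summed squared differences.
    return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5

def manhattan(x, y):
    # Eq. (2): sum of the absolute differences.
    return sum(abs(a - b) for a, b in zip(x, y))

obs   = [10.0, 11.0, 12.0, 11.0]
clean = [10.2, 10.9, 12.1, 11.1]   # small errors everywhere
spike = [10.2, 10.9, 12.1, 15.0]   # one outlying value

# Squaring makes the single outlier dominate the Euclidean distance, so it
# grows faster than the Manhattan distance relative to the clean case.
```

This is why the choice between the two depends on whether occasional large mismatches should be penalised heavily or treated like any other error.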
In the case of the weighted Euclidean distance (Eq. 3), a weight is associated with each variable. This is useful for various reasons: (1) normalisation against a mean value provides a dimensionless metric and allows comparisons to be drawn between data sets with different orders of magnitude; (2) the weighting can take account of uncertainties in the reference data set (e.g. instrumental errors in an observational data set, or uncertainty in a model ensemble); and (3) this type of metric can be useful when the data have a different dynamical range. For example, in a time series of Northern Hemisphere monthly surface temperature, the variability is different for summer and winter, and it makes sense to normalise the differences by the variance.
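Reason (3) can be sketched by normalising differences with a per-month variance, so that the high-variability month does not dominate the score (the values below are hypothetical):

```python
import math

def weighted_euclidean(x, y, w):
    # Eq. (3): each squared difference is scaled by a weight w_i,
    # here 1/variance, giving a dimensionless distance.
    return math.sqrt(sum(wi * (a - b) ** 2 for a, b, wi in zip(x, y, w)))

model = [1.0, 15.0]   # winter, summer monthly temperature (made-up values)
obs   = [3.0, 14.0]
var   = [4.0, 1.0]    # winter variability is larger
w     = [1.0 / v for v in var]

d = weighted_euclidean(model, obs, w)
# After normalisation each month contributes one "standardised" unit,
# even though the raw winter error (2 degrees) is twice the summer error.
```

The same formula covers case (2) if the weights are instead derived from observational error estimates.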
The Chi-square "distance" (Eq. 4) is related to the Pearson Chi-square test of goodness-of-fit, and differs from the previous distances discussed here in that it measures the similarity between two probability density functions (PDFs), rather than between data points. It is particularly useful if the focus of the analysis is at the population level. Distances on PDFs are defined, in this paper, to be Level 2 metrics, but the Chi-square distance can be used when the geophysical variables are supposed to have a particular shape (e.g. an atmospheric profile of temperature). Equation (5) can also be used, in particular to satisfy the symmetry property of distances.
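The exact formulae are given in Table 2; the sketch below assumes the standard asymmetric and symmetrised forms of the Chi-square distance on discrete (binned) PDFs:

```python
def chi_square(p, q):
    # Assumed form of Eq. (4): per-bin squared difference scaled by the
    # reference probability. Note that it is not symmetric in p and q.
    return sum((pi - qi) ** 2 / qi for pi, qi in zip(p, q) if qi > 0)

def chi_square_sym(p, q):
    # Assumed symmetrised variant (Eq. 5): d(p, q) == d(q, p).
    return sum((pi - qi) ** 2 / (pi + qi) for pi, qi in zip(p, q) if pi + qi > 0)

p = [0.20, 0.50, 0.30]   # e.g. modelled histogram of a variable
q = [0.25, 0.45, 0.30]   # observed histogram of the same variable
```

Swapping the arguments of `chi_square` changes the result, which is exactly the asymmetry that the Eq. (5) variant removes.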
The Tchebychev distance (Eq. 6) can be used, for example, to identify the maximum annual discrepancy in a climatic run. It can be useful if the focus is on extreme events.
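A minimal sketch with hypothetical annual-mean values:

```python
def tchebychev(x, y):
    # Eq. (6): the largest absolute discrepancy across all samples;
    # the result is set entirely by the single worst mismatch.
    return max(abs(a - b) for a, b in zip(x, y))

model_annual = [14.1, 14.3, 14.0, 15.2]   # simulated annual means (made-up)
obs_annual   = [14.0, 14.2, 14.1, 14.3]

worst = tchebychev(model_annual, obs_annual)   # set by the last year alone
```

Because only the worst year matters, this distance says nothing about average performance, which is why it complements rather than replaces the Euclidean-type measures.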
The Mahalanobis distance (Eq. 7), where A is the covariance matrix of x or y, is particularly suitable if variables have very different units, as each one will be normalised by its variance, and/or if they are correlated with each other, since the distance takes these correlations into account. High correlation between two data sets has no impact on the distance computed, compared to two independent data sets. This distance is directly related to the quality criterion of the variational assimilation and Bayesian formalism that optimally combines weather forecast and real observations. This criterion needs to take into account the covariance matrices and the uncertainties of the state variables.

[Table 2 (fragment): the Manhattan distance (Eq. 2) implicitly supposes that x and y are comparable, so is not suited to mixed variables (e.g. variables with different units). The weighted Euclidean distance (Eq. 3) can account for uncertainty in the reference data set, such as instrumental errors in an observation data set, or model uncertainty in a model ensemble, e.g. using a model efficiency metric. The Chi-square distance (Eqs. 4-5) measures how similar two PDFs are, and is particularly useful if the focus of the analysis is at the population level; Eq. (5) facilitates the symmetry property of distances. The Tchebychev distance (Eq. 6) is useful for extreme events, or the maximum annual discrepancy in a climatic run. The Mahalanobis distance (Eq. 7), with A the covariance matrix of x or y, is particularly useful if x or y include coordinates with very different units (each one will be normalised by its variance), if they are correlated with each other, and for the combination of multiple sources of information. The normalised mean error (Eq. 8), where E is the total number of samples in D1 and D2, applies the distance over the entirety of two data sets D1 and D2.]
Interesting links can be established between these metrics and the operational developments of numerical weather prediction centres. The Mahalanobis distance is well suited to Gaussian distributions (meaning here that the data/model misfit distribution follows a Gaussian distribution with covariance matrix A; e.g. Min and Hense, 2007). The general Bayesian formalism can be used to generalise this distance to more complex distributions. The Mahalanobis distance and the more general Bayesian framework are particularly suitable for treating several evaluation issues at once, such as the quantification of multiple sources of error and uncertainty in models, or the combination of multiple sources of information (including the acquisition of new information). For instance, Rowlands et al. (2012) used a goodness-of-fit statistic similar to the Mahalanobis distance applied to surface temperature data sets.
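A minimal NumPy sketch of Eq. (7) follows; the variable values and the example covariance matrix are illustrative only:

```python
import numpy as np

def mahalanobis(x, y, A):
    # Eq. (7): the difference vector is whitened by the inverse covariance
    # matrix A^{-1}, so correlated, mixed-unit variables are treated consistently
    d = x - y
    return float(np.sqrt(d @ np.linalg.inv(A) @ d))

x = np.array([1.0, 2.0])
y = np.array([0.0, 0.0])
A_uncorr = np.eye(2)                          # independent, unit-variance variables
A_corr = np.array([[1.0, 0.9],
                   [0.9, 1.0]])               # strongly correlated variables
# mahalanobis(x, y, A_corr) differs from the uncorrelated case A_uncorr
```

With A equal to the identity matrix, the expression reduces to the plain Euclidean distance, which illustrates why high correlation between the variables changes the computed distance relative to two independent data sets.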
We have presented here distances between two points, possibly multivariate. Some metrics use these distances but are defined over the two whole data sets D1 and D2. For example, the Normalised Mean Error (NME) is a normalisation of the bias between the two data sets (Eq. 8). Several other distances exist in the literature that have been applied in different scientific fields and are not listed here (e.g. Deza and Deza, 2006); however, most of them are particular cases or extensions of the preceding distances.

Metrics on statistical properties
Level 2 metrics, summarised in Table 3, use statistical quantities estimated for two data sets D1 and D2. Some of the metrics presented in the previous section can then be applied to the selected statistics. For instance, the PDF can be estimated for both data sets and the Chi-square distance used to measure their discrepancy. For example, Anav et al. (2013) compared gross primary production (GPP) and leaf area index (LAI) from the CMIP5 model simulations with two selected data sets.

Table 3. Summary of Level 2 metrics (metrics on statistical properties).

Chi-square distance: for PDFs of two data sets (e.g. observed and modelled data).
Kullback-Leibler divergence (Eq. 9): for PDFs of two data sets, p and q.
Variance: depends on the application; suitable if a long observational record is available.
Other diagnostics: use the diagnostic that best suits the application.
The Kullback-Leibler divergence (Eq. 9) is based on information theory and can also be used to measure the similarity of two PDFs. The Kolmogorov-Smirnov distance can be used when it is of interest to measure the maximum difference between the cumulative distributions. Tchebychev or other distances acting on estimated seasonal cycles are also considered here to be Level 2 metrics, since the seasons are statistical quantities estimated on D1 and D2 (although very close to the Level 1 raw geophysical variables). Similarly, the distance can operate on variables derived from the original time series, such as signals decomposed in the frequency domain. Cadule et al. (2010), for example, analysed model performance in terms of representing the long-term trend and the seasonal signal of the atmospheric CO2 record.
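Both PDF-based measures can be sketched compactly in NumPy; the histogram-style inputs below are illustrative, and the small epsilon guarding the logarithm is a numerical convenience rather than part of the formal definition:

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    # Eq. (9)-style divergence D_KL(p || q) for two discrete PDFs;
    # asymmetric, non-negative, and zero only when p equals q
    p = p / p.sum()
    q = q / q.sum()
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

def ks_distance(a, b):
    # Kolmogorov-Smirnov distance: maximum difference between the
    # empirical cumulative distributions of two samples a and b
    grid = np.sort(np.concatenate([a, b]))
    cdf_a = np.searchsorted(np.sort(a), grid, side="right") / len(a)
    cdf_b = np.searchsorted(np.sort(b), grid, side="right") / len(b)
    return float(np.max(np.abs(cdf_a - cdf_b)))
```

The asymmetry of the Kullback-Leibler divergence (it is not a true mathematical distance) is one reason symmetric alternatives such as Eq. (5) are sometimes preferred.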
The variance of data and model is often used to formulate metrics for quantifying data-model similarity. In coupled systems, the use of a metric based on distance can become inadequate: the metric no longer supports definite conclusions on the model error, because it includes an unknown parameter in the form of the unforced variability. Furthermore, when applied to spatial fields, a global spatial variance can be misleading because variance is strongly location-dependent. Gleckler et al. (2008) proposed a more suitable model variability index which has been applied to climatic variables, but is also highly applicable to several of the biogeophysical and biogeochemical variables simulated by coupled land and ocean models, and thus relevant to the carbon cycle. The metric can also focus on extreme events, with the distance acting on percentiles, assuming that the length of the record is sufficient to characterise these extremes.

Metrics on relationships
Level 3 metrics focus on relationships. The aim here is to diagnose a physical or biophysical process that is particularly important, such as the link between two variables in the climate system. The various "relationship diagnostics" that have been used are summarised in Table 4.
The correlation between two variables is a very simple and widely used metric; it satisfies the need to compare the data-model phase correspondence of a particular biogeophysical variable. In this case, parametric statistics such as the Pearson correlation coefficient (Eq. 10), or non-parametric statistics such as the Spearman correlation coefficient, are directly used as a metric. This approach is particularly used to evaluate the correspondence of the mean seasonal cycle of several variables, from precipitation (Taylor, 2001) to LAI, GPP (Anav et al., 2013), and atmospheric CO2 (Dalmonech and Zaehle, 2013).
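Both coefficients can be written out directly; this is a minimal sketch (the simple rank transform below ignores ties, which a production implementation would handle):

```python
import numpy as np

def pearson(x, y):
    # Eq. (10)-style Pearson correlation: covariance normalised by
    # the product of the standard deviations
    x = x - x.mean()
    y = y - y.mean()
    return float((x @ y) / np.sqrt((x @ x) * (y @ y)))

def spearman(x, y):
    # Spearman correlation: Pearson correlation applied to the ranks,
    # so it responds to any monotonic relationship, not only linear ones
    rank = lambda v: np.argsort(np.argsort(v)).astype(float)
    return pearson(rank(x), rank(y))
```

A monotonic but nonlinear data-model relationship scores a perfect Spearman correlation while its Pearson correlation can be lower, which is why the non-parametric variant is sometimes preferred.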
The sensitivity of one variable to another can be estimated using techniques ranging from simple to very complex (Aires and Rossow, 2003). It can be obtained by dividing concomitant perturbations of the two variables using spatial or temporal differences (Eq. 11), or by perturbing a model and measuring the impact once equilibrium is reached. The first approach can be used, for example, with site-level manipulative experiments to estimate carbon sensitivity to soil temperature or nitrogen deposition in terrestrial ecosystem models (e.g. Luo et al., 2012).
From the linear regression of two variables, the slope or bias can be compared between D1 and D2 (Eq. 12). The slope is very close to the concept of sensitivity, but sensitivities depend strongly on the way they are measured. For example, the sensitivity of atmospheric CO2 to climatic fluctuations may depend on the timescales over which it is calculated (Cadule et al., 2010). An alternative, when more than two variables are involved in the physical or biophysical relationship under study, is multiple linear regression (Eq. 13), or any other linear or nonlinear regression model such as a neural network; see, for example, the results obtained at site level by Moffat et al. (2010).
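The slope comparison of Eq. (12) can be sketched with synthetic data; the "observed" and "modelled" sensitivities below (3.0 and 4.5) are invented purely to show how the diagnostic would flag a model that is too sensitive:

```python
import numpy as np

def slope(x, y):
    # least-squares regression slope of y against x (Eq. 12-style diagnostic)
    x = x - x.mean()
    y = y - y.mean()
    return float((x @ y) / (x @ x))

rng = np.random.default_rng(0)
t = rng.normal(size=200)                              # e.g. a temperature anomaly
co2_obs = 3.0 * t + rng.normal(scale=0.5, size=200)   # "observed" sensitivity ~3.0
co2_mod = 4.5 * t + rng.normal(scale=0.5, size=200)   # "modelled" sensitivity ~4.5

# comparing slope(t, co2_obs) with slope(t, co2_mod) quantifies the
# data-model discrepancy in the regression diagnostic
```

As noted above, the estimated slope depends on the timescale and sampling of the perturbations, so the same averaging must be applied to model and data before the two slopes are compared.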
Pattern-oriented approaches use graphs to identify particular patterns in the data set. These graphs aim at capturing relationships between more than two variables. For example, in Bony and Dufresne (2005), the tropical circulation is first decomposed into dynamical regimes using mid-tropospheric vertical velocity, and then the sensitivity of the cloud forcing to a change in local sea surface temperature (SST) is examined for each dynamical regime. Moise and Delage (2011) proposed a metric that assesses the similarity of the field structure of rainfall over the South Pacific Convergence Zone in terms of errors in replacement, rotation, volume, and pattern. The same metric could be applied to ocean Sea-viewing Wide Field-of-view Sensor (SeaWiFS) satellite-based fields in areas where particular spatial structures emerge. These powerful techniques could be more widely applied to evaluating ESM processes.

Table 4. Summary of Level 3 metrics (metrics on relationships).

Nonlinear regression (e.g. neural networks): a nonlinear model that provides access to thresholds, interactions, and saturation behaviours. The metric can then be defined as the percentage of variance of A explained by B in the data and in the model. Still not causal.
Pattern-oriented approaches: various methods; very process-oriented, but requires a good a priori understanding of what needs to be examined.
Clustering algorithms (e.g. K-means, self-organising maps): use a similarity distance, similar to a Level 1 metric; ideal for obtaining a limited set of prototypes describing as much of the variability of the data sets as possible.
Clustering algorithms have been used to obtain weather regimes based only on the samples of a data set. For example, Jakob and Tselioudis (2003) and Chéruy and Aires (2009) obtained cloud regimes based on cloud properties (optical thickness, cloud-top pressure). The same methodology can be applied to D1 and D2 and the two sets of regimes compared; alternatively, the regimes can be obtained from one data set and only the regime frequencies of the two data sets compared. Abramowitz and Gupta (2008) applied a distance metric to compare several density functions of modelled net ecosystem exchange (NEE) clustered using the "self-organising map" technique.
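A toy version of the regime-frequency comparison follows (a deliberately minimal K-means written out for transparency; real studies would use a library implementation, and the two-cluster data in the usage note are synthetic):

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    # minimal K-means: assign each sample to its nearest centre, then
    # move each centre to the mean of its assigned samples
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((X[:, None, :] - centres[None]) ** 2).sum(-1), axis=1)
        centres = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                            else centres[j] for j in range(k)])
    return centres, labels

def regime_frequencies(X, centres):
    # frequency of occurrence of each regime (cluster) in a data set;
    # the frequencies of D1 and D2 under the same centres can be compared
    labels = np.argmin(((X[:, None, :] - centres[None]) ** 2).sum(-1), axis=1)
    return np.bincount(labels, minlength=len(centres)) / len(X)
```

Having obtained regimes from one data set, `regime_frequencies` can be evaluated on both D1 and D2, and a Level 1 distance applied to the two frequency vectors.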
It is often difficult to use a true mathematical distance to measure the discrepancy between two "relationship diagnostics". Although very useful for understanding differences in physical behaviour, the simple comparison of two graphs (for D1 and D2) is not entirely satisfactory, since it does not allow the combination of multiple metrics or the definition of scoring systems. It is not possible to list here all the ways of defining a rigorous distance for each of the relationship diagnostics that have been presented: the Euclidean distance can be used on the regression parameters or the sensitivity coefficients, or two sets of weather regime frequencies can be compared using confusion matrices (e.g. Aires et al., 2011). The distance needs to be adapted to the relationship diagnostic. The most limiting factor for this type of approach in ESM evaluation is that the relationship obtained might not be robust enough (i.e. statistically significant), or not easily framed within a process-based context.

A framework for robust model evaluation
Robust model evaluation relies on a combination of approaches, each informed using appropriate data and metrics (Fig. 4). Calibration and, ideally, pre-calibration (Sect. 4.2.2) must first be employed to rule out implausible outcomes, using data independent of that which may subsequently be used in model evaluation. Evaluation approaches must then combine process-focussed and system-wide perspectives, to ensure that both the representation of processes and the balance between them are realistic in the model. Optionally, the results of different model evaluation tests can be combined into a single model score, perhaps for the purpose of weighting future projections. When employed as part of a multi-model ensemble, the simulation can also contribute to the calculation of emergent constraints, which can then be used in subsequent model development (Sect. 4.3.3).

Recommendations for improved data availability and usage
The increasingly data-rich environment is both an opportunity and a challenge: it offers more opportunities for model evaluation, but requires more knowledge about the generation of data sets and their uncertainties in order to determine the best data set for evaluating specific process representations. While improved documentation of data sets would go some way towards alleviating the latter problem, there is scope for improved collaboration between the modelling and observational communities to develop an appropriate benchmarking system that evolves to reflect new model developments (such as representing ecosystem-scale responses to combined environmental drivers) not addressed by existing benchmarks.

Coordinating data collection efforts
A key question for both the modelling and data communities to address together is how well model evaluation requirements and data availability are reconciled. There is an ongoing need for new and better data sets for model evaluation: data sets that are appropriately documented and for which useful information about errors and uncertainties is provided. The temporal and spatial coverage of data sets also needs to be sufficient to capture potential climatic perturbations, a point that is illustrated by the evaluation of marine productivity. Modelling studies offer conflicting evidence on the behaviour of this key variable in controlling marine carbon fluxes and exchanges of carbon with the atmosphere under a changing climate (e.g. Sarmiento et al., 2004; Steinacher et al., 2010; Taucher et al., 2012; Laufkötter et al., 2013), and model evaluation is therefore essential. Recent compilations of observations of marine-productivity proxies give us a reasonably well-documented picture of qualitative changes in productivity over the last glacial-interglacial transition (e.g. Kohfeld et al., 2005) and in response to Heinrich events (e.g. Mariotti et al., 2012). These data sets are being used to evaluate the same ESMs used to predict changes in NPP in response to climate change (e.g. Bopp et al., 2003; Mariotti et al., 2012), and these studies show reasonable agreement. On more recent timescales, remote sensing observations of ocean colour have been used to infer decadal changes in marine NPP. Studies using the SeaWiFS data show an increase in the extent of oligotrophic gyres over 1997-2008 (Polovina et al., 2008). However, on longer timescales, and using the Coastal Zone Color Scanner (CZCS) and SeaWiFS data sets, analyses yield contrasting results of an increase or decrease in NPP from 1979-1985 to 1998-2002 (Gregg et al., 2003; Antoine et al., 2005). Henson et al. (2010) have shown, based on a statistical analysis of biogeochemical model outputs, that an NPP time series of ∼40 yr is needed to detect any global-warming-induced changes in NPP, highlighting the need for continued, focused data collection efforts.

Maximising the usefulness of current data in modelling studies
Modelling studies should be designed in a manner that makes the best use of the available data. For example, equilibrium model simulations of the distant past require time-slice reconstructions for evaluating processes relating to the carbon cycle. These reconstructions rely on the synchronisation of records from ice cores, marine sediments, and terrestrial sequences, to take account of differences between forcings and responses in different archives; this is a significant effort even within a particular palaeo-archive, let alone across multiple archives. Yet the strength of palaeodata is precisely that it offers information about rates of change, and such information is discarded in a time-slice simulation. For that reason, the increasing use of transient model runs to simulate past climate and environmental changes is a particularly important development.
There is also an increasing need for forward modelling to simulate the quantities that are actually measured, such as isotopes in ice cores and pollen abundances. Ice-core gas concentration measurements are unusual in that what is measured is what we want to know, and is a variable that ESMs yield as a direct output. This is not generally the case, nor are all model setups easily able to simulate even the trace-gas isotopic data that are available from ice. A corollary is that we need to recognise the difficulty of trying to use palaeodata to reconstruct quantities that are essentially model constructs, for example inferring the strength of the meridional overturning circulation (MOC) from the 231Pa/230Th ratio in marine sediment cores (McManus et al., 2004). In the latter context, direct simulation of the 231Pa/230Th ratio is necessary to deconvolute the multiple competing processes (Siddall et al., 2005, 2007).

Using data availability to inform model development
Model development should also focus on incorporating processes that, at least collectively, are constrained by a wealth of data. Notable examples are the processes governing methane (CH4) emissions (e.g. from wetlands and permafrost) and the removal of methane from the atmosphere (e.g. via oxidation by the hydroxyl radical and atomic chlorine). There are four main observational constraints on the CH4 budget with which we can evaluate the performance of ESMs: the concentration, [CH4]; its isotopic composition with respect to carbon and deuterium, δ13CH4 and δD(CH4); and CH4 fluxes at measurement sites. We have no natural record of CH4 fluxes, so their use in ESM evaluation is limited to the relatively recent period in which they have been measured, though measurements of CH4 fluxes at specific sites can be used to verify spatial and seasonal distributions of CH4 emissions inferred from tall-tower and satellite measurements of [CH4] by inverse modelling. However, a range of [CH4], δ13CH4, and δD(CH4) records are available, spanning up to 800 000 yr in the case of polar ice cores, which can be used to evaluate the ability of ESMs to capture changes in the CH4 budget in response to past changes in climate. The variety of climatic changes we can probe, from large glacial-interglacial changes spanning thousands of years, to substantial changes over just a few tens of years at the beginning of Dansgaard-Oeschger events, and still more rapid, subtle changes following volcanic eruptions, enables us to evaluate the ability of ESMs to capture both the observed size and speed of changes known to have taken place. The complementary nature of the [CH4], δ13CH4, and δD(CH4) constraints is key to ESM evaluation: each CH4 source and sink affects these three constraints in different ways, so scenarios that explain only one set of observations can be eliminated. For instance, an increase in CH4 emissions from tropical wetlands, biomass burning, or methane hydrates could explain an increase in [CH4], but of these only an increase in biomass-burning emissions could explain an accompanying enrichment in δ13CH4. Of course, more than one factor can change at a time, but the key point is that the most rigorous test of ESM performance utilises all three constraints and, therefore, ESMs should track the influence of each source and sink.

Key principles of model calibration
Model evaluation is closely linked to model calibration.
ESMs contain a large number of (sometimes poorly constrained) parameters, resulting from incomplete knowledge of certain processes or from the simplification of complex processes, which can be calibrated in order to improve model behaviour. In general, model calibration should follow a number of fundamental guiding principles. The principles detailed here are largely based on the discussion in Petoukhov et al. (2000) for the CLIMBER-2 model. First, parameters that are well constrained from observations or from theory must not be used for model calibration. It would normally be physically inappropriate to modify the values of fundamental constants, for example, or to use a value for a parameter that differs from the accepted empirical measurement just to improve the performance of the model.
Second, whenever possible, parameterisations and submodules should be tuned separately against observations rather than in the coupled system.In the case of parameterisations, this ensures that they represent the physical behaviour of the process described rather than their effect on the coupled system.The same principle should be applied as far as possible to the individual sub-modules of any ESM to make sure that their behaviour is self-consistent and to facilitate calibration of the much more complex fully coupled system.
Third, parameters must describe physical processes rather than unexplained differences between geographic regions.It is preferable for the model to represent the physical behaviour of the system rather than apply hidden flux corrections.
Fourth, the number of tuning parameters must be smaller than the number of predicted degrees of freedom; for ESMs, the latter is usually large.
Finally, one of the key challenges relating to data used in ESM evaluation is the extent to which ESM development and evaluation data are independent. In principle, the same observational data should not be used for calibration and evaluation. This is difficult to enforce in practice, however. Even if the observational data are divided into two parts, with one part used for calibration and the other for evaluation, any mismatch in the evaluation will likely lead to a readjustment of model tuning parameters, making the evaluation not completely independent of the calibration procedure (Oreskes et al., 1994). Standard leave-one-out cross-validation techniques divide calibration data sets into multiple subsets, sequentially testing the calibration on each left-out subset (in the limit, each data point) in turn, but in Earth system modelling the subsets are unlikely to be fully independent.

Utilising pre-calibration to constrain implausible outcomes
The essence of pre-calibration is to apply weak constraints to model inputs in the initial ensemble design, and weak constraints on the model outputs, to rule out implausible regions of the input and output spaces (Edwards et al., 2011). The pre-calibration approach is based on relatively simple statistical modelling tools and robust scientific judgements, but avoids the formidable challenges of applying full Bayesian calibration to a complex model (Rougier, 2007). A large set of model experiments, sampling the variability in multiple input parameter values with the full simulator (here the ESM), is used to derive a statistical surrogate model or "emulator" of the dependence of key model outputs on uncertain model inputs. The choice of sampling points must be highly efficient to span the input space and is usually based on Latin hypercube designs. The resulting emulator is computationally many orders of magnitude faster than the original model and can therefore be used for extensive, multidimensional sensitivity analyses to understand the behaviour of the model. Holden et al. (2010, 2013a, b) demonstrated the approach in constraining glacial and future terrestrial carbon storage.
The process is usually iterative, in that a large proportion of the initial parameter space may be deemed implausible, but one or more subsequent simulated ensembles can be designed by rejection sampling from the emulator to locate the not-implausible region of parameter space. The resulting simulated ensembles are then used to refine the emulator and the definition of the implausible space. The final output is an emulator of model behaviour and an ensemble of simulations, corresponding to a subset of parameter space that is deemed "plausible" in the sense that simulations from the identified parameter region do not disagree with a set of observational metrics by more than is deemed reasonable for the given simulator. The level of agreement is therefore dependent on the model and represents an assessment of the expected magnitude of its structural error (i.e. error due to choices for how processes are represented and relate to one another). The plausible ensemble, however, is a general result for the model that can be applied to any relevant prediction problem, and embodies an estimate of the structural and parametric error inherent in the model predictions.
Ideally, pre-calibration is a first step in a full Bayesian calibration analysis. The advantage of the logistic mapping or pure rejection sampling approach used is that, because no weighting is applied, a subsequent Bayesian calibration can be applied to refine the evaluation without any need to unravel convolution effects or the multiple use of constraints. In practice, however, the pre-calibration step can be sufficient to extract all the information that is readily available from top-down constraints, given the magnitude of uncertainties in inputs and of structural errors in intermediate-complexity ESMs.
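The design-emulate-reject loop can be illustrated with a deliberately trivial one-parameter "simulator"; everything below (the quadratic toy model, the observational value, and the tolerance) is invented for illustration, and the one-dimensional stratified design stands in for a true Latin hypercube:

```python
import numpy as np

rng = np.random.default_rng(1)

def simulator(p):
    # stand-in for an expensive ESM: one input parameter -> one output metric
    return 2.0 * p ** 2 + 1.0

# 1. design: stratified (Latin-hypercube-style) sample of the input range [0, 1)
n = 20
design = (np.arange(n) + rng.random(n)) / n

# 2. run the "simulator" at the design points and fit a cheap quadratic emulator
runs = simulator(design)
emulator = np.poly1d(np.polyfit(design, runs, deg=2))

# 3. pre-calibration: rejection-sample the input space densely with the
#    emulator, keeping only inputs whose emulated output is not implausible
obs, tolerance = 1.5, 0.3           # observational constraint +/- tolerance
candidates = rng.random(100_000)
plausible = candidates[np.abs(emulator(candidates) - obs) <= tolerance]
```

The retained `plausible` sample delimits the not-implausible region of parameter space; in an iterative application, a new simulated ensemble would be drawn from this region and used to refine the emulator.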

Process-based (bottom-up) evaluation
Both bottom-up and top-down evaluation are required for evaluating ESMs: the first approach can give process-by-process information but not the balance between processes; the second gives the balance but not the individual terms. When bottom-up, process-based improvements can be shown to have top-down, system-level benefits, then we know our multi-pronged evaluation has worked. Bottom-up, process-based evaluation will often require combinations of data to create the appropriate metrics, as it is more likely to focus on the sensitivity of one output variable to changes in a single input. For example, to assess whether a model has the right sensitivity of NPP to precipitation, a test could be to compute the partial derivative of NPP with respect to precipitation at constant values of temperature, radiation, etc. for both the model and the observations (Randerson et al., 2009). This approach requires processing a data set of, in this case, NPP, and combining it with precipitation data to derive a relationship. The same NPP data could be combined with temperature data to derive a similar NPP(T) relationship. This is much more likely to isolate at least a small number of processes than simply comparing simulated NPP to an observational map or time series.
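One simple way to approximate such a partial derivative is to bin the data by the co-varying driver and estimate the slope within each bin; the sketch below uses synthetic NPP data with a known, prescribed precipitation sensitivity (all numbers are invented):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5000
precip = rng.uniform(0, 2000, n)      # hypothetical annual precipitation (mm/yr)
temp = rng.uniform(0, 30, n)          # hypothetical mean temperature (deg C)
# toy "observed" NPP with known sensitivities to precipitation and temperature
npp = 0.4 * precip + 20.0 * temp + rng.normal(scale=50.0, size=n)

def partial_sensitivity(x, other, y, bins=5):
    # slope of y vs. x estimated within quantile bins of the co-varying
    # driver, approximating dNPP/dP at (roughly) constant temperature
    edges = np.quantile(other, np.linspace(0, 1, bins + 1))
    slopes = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        m = (other >= lo) & (other <= hi)
        xc, yc = x[m] - x[m].mean(), y[m] - y[m].mean()
        slopes.append((xc @ yc) / (xc @ xc))
    return float(np.mean(slopes))
```

Applying `partial_sensitivity` to both the observational and the simulated data sets, and comparing the two estimates, gives the bottom-up metric described above.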
It is also common for model development to focus on specific features or aspects of the model in order to have faith in the model's ability to make projections.For example, climate modelling centres may focus on the ability of their GCMs to represent coupled phenomena such as ENSO, or the timing and intensity of monsoon systems.In this way, bottom-up evaluation pinpoints important model processes, and helps to confirm that the model is a sufficiently accurate representation of the real system, giving the right results for the right reasons.However, a key limitation of this approach is that the relevant observations needed to assess a particular process may not exist.
Process-based evaluation requires metrics based on process-based sensitivities, as described in Sect. 3.2. Sensitivity analysis (e.g. Saltelli et al., 2000; Zaehle et al., 2005) may be useful to determine the parameters and processes to focus on in a bottom-up evaluation. In this approach, a simple statistical model is used to represent the physical relationships in the reference data set. A similar model is calibrated on the model simulations, and the complex multivariate and nonlinear relationships can then be compared.
Measuring these sensitivities allows the important parameters to be prioritised for validation and processes that are not well simulated to be isolated. For example, Aires et al. (2013) used neural networks to develop a reliable statistical model for the analysis of land-atmosphere interactions over the continental US in the North American Regional Reanalysis (NARR) data set. Such sensitivity analyses enable the identification of key factors in the system; in this example, rainfall frequency and intensity were characterised according to three factors: cloud triggering potential, low-level humidity deficit, and evaporative fraction.

System-level (top-down) evaluation
Top-down constraints tend to focus on whole-system behaviour and are more likely to involve the evaluation of spatial or time-series data. Typical quantities used for top-down evaluation include maps of surface temperature, pressure, precipitation, and wind speed. Observational data sets exist for many of these quantities throughout the atmosphere, so zonal-mean or three-dimensional comparisons are also possible (Randall et al., 2007). Anav et al. (2013) extended this approach to assess new biogeochemical outputs of CMIP5 ESMs, such as the distribution and time evolution of carbon stores and fluxes.
The appropriate choice of metrics is important, as discussed in Sect. 3. A correlation coefficient might seem an obvious choice to assess the seasonal cycle of a given variable, but a model with the right phase of the seasonal cycle and a magnitude five times too large or too small would score a high correlation coefficient, while a model with the correct magnitude but lagged by just one month would score poorly. To overcome these limitations of correlation-based metrics, additional metrics such as the mean error should be included in the analysis to aid interpretation of the correlation, while lag errors could first be corrected so that the correlation gives a more meaningful result. Many studies have also attempted to overcome this issue by presenting summary statistical metrics for multiple components across multiple models.
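This limitation can be demonstrated directly with idealised sinusoidal seasonal cycles (not real data): the rescaled cycle scores a perfect correlation despite its large amplitude error, while the lagged cycle scores worse despite being much closer in absolute terms.

```python
import numpy as np

months = np.arange(12)
obs = np.sin(2 * np.pi * months / 12)             # reference seasonal cycle
scaled = 5.0 * obs                                # right phase, 5x the magnitude
lagged = np.sin(2 * np.pi * (months - 1) / 12)    # right magnitude, 1-month lag

def corr(x, y):
    # Pearson correlation: blind to amplitude errors, sensitive to lags
    x = x - x.mean()
    y = y - y.mean()
    return float((x @ y) / np.sqrt((x @ x) * (y @ y)))

def mean_abs_error(x, y):
    # a simple mean-error metric that complements the correlation
    return float(np.mean(np.abs(x - y)))
```

Reporting both metrics together, as suggested above, distinguishes the two failure modes that the correlation alone conflates.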
Taylor (2001) is one example in which a metric based on correlation and a distance metric have been developed as a skill score. Gleckler et al. (2008) used Taylor diagrams to compare the performance of models in terms of both the magnitude and phase of the seasonal cycle. Reichler and Kim (2008) normalised model error variance on a grid-point basis to produce a composite score and measure progress in model skill between generations of IPCC reports. Such scoring systems can be useful for synthesising the results of numerous metric comparisons, but should be used with caution as they can be hard to interpret: it is not always clear what model failing has led to a low score. The choice of which observations to use in the weighting is also subjective.
Model errors will inevitably evolve in time, affecting the reliability of simulations of future Earth system states. Measuring this type of uncertainty is an extremely difficult challenge. Presently, the best approach is to use expert judgement to estimate the growth of errors beyond the known forcing space, and this logic underpins the large, subjective choice of input ranges in the pre-calibration technique. Palaeoclimate analysis expands the space of forcings applied to the Earth system, such that possible future states might be more likely to occur inside the envelope of testable simulations. With sufficiently high-quality data, it would be possible to cross-validate predictions against extreme past states that stretched the envelope in the most appropriate way. The Paleocene-Eocene Thermal Maximum offers perhaps the best opportunity for this, owing to its large difference from current climate and atmospheric CO2 conditions. An advanced theoretical approach is the "reification" technique of Goldstein and Rougier (2009), which allows the error in a given model to be successively related to more and more accurate models, but its implementation is very much under development (see Williamson et al., 2012).

The role of emergent constraints in model evaluation
Emergent constraints (Table 5) can also provide valuable information for model evaluation, as they convert the extensive short-timescale information available for the contemporary period into longer-timescale constraints on the Earth system sensitivities that are most important for the 21st and 22nd centuries (e.g. climate sensitivity to CO2, or carbon cycle sensitivity to climate). Observational data on short timescales do not relate directly to these sensitivities, and analogue approaches, which evaluate ESM sensitivity against known changes in the past, are also limited by observational data, as the analogue events in Earth's past are not as well characterised as those in the contemporary period.
Emergent constraints relate some observable aspect of the contemporary Earth system to a key system sensitivity, using an ensemble of Earth system simulations (Collins et al., 2012). The archetypal example relates the magnitude of the snow-albedo feedback to the size of the seasonal cycle in snow cover in the Northern Hemisphere, across more than twenty GCMs (Hall and Qu, 2006). Since the seasonal cycle of snow cover can be estimated from observations, this model-derived relationship provides a means of converting observations into a constraint on the size of the snow-albedo feedback in the real climate system, for which there is no direct reliable measurement. A similar emergent constraint has been used to relate the sensitivity of the interannual variability in atmospheric CO2 to the loss of carbon from tropical land under climate change (Cox et al., 2013).
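In schematic form, the procedure is an across-ensemble regression followed by evaluation at the observed value of the observable; the "ensemble" below is synthetic, with an invented linear relationship (it is not the Hall and Qu result):

```python
import numpy as np

rng = np.random.default_rng(3)
n_models = 20
# hypothetical ensemble: each model provides an observable x (e.g. the size of
# a contemporary seasonal cycle) and a long-term sensitivity y (not observable)
x = rng.uniform(0.5, 1.5, n_models)
y = 2.0 * x + rng.normal(scale=0.1, size=n_models)  # emergent relationship

# 1. fit the emergent relationship across the model ensemble
slope, intercept = np.polyfit(x, y, 1)

# 2. convert an observational estimate of x into a constraint on y
x_obs = 1.0                                         # hypothetical observed value
y_constrained = slope * x_obs + intercept
```

The scatter of the ensemble about the fitted line, together with the observational uncertainty in `x_obs`, would set the width of the resulting constraint on the real-world sensitivity.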
In general terms, such emergent-constraint methods build on the realisation that analysis of short-term fluctuations in a system can assist in determining the sensitivity of that system to external forcing (Leith, 1975). Conversely, valuable information is unnecessarily lost when only long-term trends are considered and the shorter-timescale variations about these trends are ignored. Emergent constraints thus exploit the large differences amongst ESM projections to reduce uncertainties in the sensitivities of the real Earth system to anthropogenic forcing.
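As a concrete illustration, an emergent-constraint calculation of the Hall and Qu (2006) type can be sketched with a synthetic model ensemble: regress the long-term sensitivity on a contemporary observable across the ensemble, then apply the real-world observation to that regression. The ensemble values, observation, and uncertainties below are invented for demonstration only.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical ensemble: each "model" provides a contemporary observable
# (e.g. a seasonal-cycle amplitude) and a long-term system sensitivity.
n_models = 20
observable = rng.uniform(0.5, 1.5, size=n_models)
sensitivity = 2.0 * observable + rng.normal(0.0, 0.1, size=n_models)

# Emergent relationship: linear regression of sensitivity on the
# observable across the ensemble.
slope, intercept = np.polyfit(observable, sensitivity, 1)

# Apply the real-world observation to constrain the sensitivity.
obs_value = 1.1   # hypothetical observed value of the observable
obs_sigma = 0.05  # hypothetical observational uncertainty
constrained = slope * obs_value + intercept
constrained_sigma = abs(slope) * obs_sigma  # first-order error propagation

print(f"constrained sensitivity: {constrained:.2f} +/- {constrained_sigma:.2f}")
```

The constrained estimate inherits uncertainty both from the observation and from scatter about the regression line; only the former is propagated in this minimal sketch.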

Outlook
Although the current generation of ESMs encompass a wide range of processes, they are likely to become increasingly complex as processes that are currently being explored in, for example, dynamic global vegetation models, such as better representation of nutrient cycles (e.g. Gotangco Castillo et al., 2012), fire (e.g. Thonicke et al., 2010; Prentice et al., 2011b; Pfeiffer et al., 2013), permafrost (e.g. Lawrence et al., 2012; Schaphoff et al., 2013), wetland dynamics (e.g. Collins et al., 2011), and dust- and vegetation-climate interactions (e.g. Shannon and Lunt, 2010; Quillet et al., 2010; Bellouin et al., 2011), are incorporated. This growing complexity has the potential to mask model errors, making robust evaluation of the model and its components increasingly necessary.
Key challenges, common to any dynamical system under evaluation, include choosing the most important variables in the system, identifying the fundamental relationships, estimating non-linear and multivariate sensitivities, and analysing the interactions between processes. We have outlined how approaches such as pre-calibration and robust calibration, along with a combination of process- and system-level evaluation with relevant data, can be used to characterise model skill. We have also illustrated the usefulness of emergent constraints in further refining model outcomes.
A combination of approaches can greatly increase our understanding of a model's ability to realistically simulate processes across multiple temporal and spatial scales. For example, both locally and globally, the net terrestrial carbon flux is a small difference between large uptake (photosynthesis) and release (respiration) terms. Even if each process could be modelled with high precision, the net balance could still be poorly constrained. Hence, single, process-based tests are necessary but not sufficient. Conversely, observations of the seasonal cycle or interannual variability of the carbon balance constrain the overall terrestrial carbon balance, but do not provide detail about the processes contributing to it. It is theoretically possible to simulate the carbon balance with a number of different combinations of the components; therefore there is the potential to get the right answer for the wrong reasons. Different parameter combinations are potentially able to recreate the historical record of atmospheric CO2 concentration (Sitch et al., 2008; Booth et al., 2012). Furthermore, some of the most accurate features of climate simulations (such as the pattern of near-surface temperatures) are poor predictors of the sensitivity of the terrestrial carbon balance to increasing CO2. It is thus eminently possible to obtain a skilful simulation of the present through the cancellation of multiple errors. A combination of "bottom-up" constraints on the processes and "top-down" constraints on the balance between them is essential to give confidence that the model produces the right behaviour for the right reason.
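The "right answer for the wrong reasons" problem can be made concrete with a toy calculation: two hypothetical parameterisations of the terrestrial carbon balance reproduce exactly the same present-day net flux, yet diverge under warming because their component fluxes and temperature sensitivities differ. All flux values and Q10 factors below are illustrative, not taken from any real model.

```python
# Net ecosystem exchange (Pg C/yr) as respiration minus photosynthesis.
# For simplicity only respiration responds to temperature, via a Q10
# factor; all numbers are invented for illustration.
def nee(gpp_base, resp_base, q10, warming):
    return resp_base * q10 ** (warming / 10.0) - gpp_base

# Parameter set A: large component fluxes, strong respiration sensitivity.
# Parameter set B: small component fluxes, weak respiration sensitivity.
present_a = nee(gpp_base=120.0, resp_base=117.0, q10=2.0, warming=0.0)
present_b = nee(gpp_base=60.0, resp_base=57.0, q10=1.3, warming=0.0)

future_a = nee(gpp_base=120.0, resp_base=117.0, q10=2.0, warming=1.0)
future_b = nee(gpp_base=60.0, resp_base=57.0, q10=1.3, warming=1.0)

# Identical present-day balance, diverging response to 1 K of warming.
print(f"present-day NEE: A={present_a:.1f}, B={present_b:.1f}")
print(f"NEE after 1 K warming: A={future_a:.1f}, B={future_b:.1f}")
```

A system-level test on the present-day balance alone cannot distinguish the two; a process-level test on the respiration sensitivity can.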
A key limitation of current model evaluation approaches is that the widely used statistical measures of sensitivity are based on "coincident increments", such as correlations, rather than on causality. A very interesting extension would be to investigate causal links among the important parameters in the system. Some tentative studies have investigated measures such as Granger causality; see Notaro et al. (2006) for an application to vegetation patterns. However, a more complete framework needs to be used (Pearl, 2009). Due to the complexity of this type of work, close collaboration between climate-carbon cycle scientists and statisticians would be required.
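To make the Granger-causality idea concrete, a minimal lag-1 test can be written directly with ordinary least squares: lagged values of x are said to "Granger-cause" y if they improve the prediction of y beyond y's own history. This is a deliberately simplified sketch (one lag, one driver, synthetic data), not the full framework applied in Notaro et al. (2006).

```python
import numpy as np

def granger_lag1_fstat(x, y):
    """F-statistic testing whether lagged x helps predict y beyond
    y's own lag (a minimal lag-1 Granger-causality check)."""
    y_t, y_lag, x_lag = y[1:], y[:-1], x[:-1]
    n = y_t.size

    # Restricted model: y_t ~ const + y_{t-1}
    A_r = np.column_stack([np.ones(n), y_lag])
    rss_r = np.sum((y_t - A_r @ np.linalg.lstsq(A_r, y_t, rcond=None)[0]) ** 2)

    # Unrestricted model: y_t ~ const + y_{t-1} + x_{t-1}
    A_u = np.column_stack([np.ones(n), y_lag, x_lag])
    rss_u = np.sum((y_t - A_u @ np.linalg.lstsq(A_u, y_t, rcond=None)[0]) ** 2)

    # F-test for the single extra regressor.
    return (rss_r - rss_u) / (rss_u / (n - 3))

# Synthetic example: y is driven by lagged x plus small noise.
rng = np.random.default_rng(1)
x = rng.normal(size=500)
y = np.zeros(500)
for t in range(1, 500):
    y[t] = 0.8 * x[t - 1] + 0.1 * rng.normal()

print(f"x -> y: F = {granger_lag1_fstat(x, y):.1f}")  # large: x drives y
print(f"y -> x: F = {granger_lag1_fstat(y, x):.1f}")  # small: no influence
```

The asymmetry of the two F-statistics is the point: correlation alone would flag the x-y relationship without indicating its direction.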
Model complexity and structure have to be kept in mind when comparing skill with respect to any given metric across a range of ESMs. Comparing models of different complexity could create an artificially large model spread that does not reflect current process knowledge. However, comparing only models of similar complexity could lead to underestimation of the true uncertainty in model projections, due to structural similarities between models and restricted sample size.
Benchmarking models against a set of well-chosen observations (Sect. 2), and using appropriate metrics (Sect. 3), should be considered a vital step in any model evaluation. While individual metrics might each be easily interpreted, a combination of many different metrics can be a challenge to interpret, particularly when very different scores in metrics that measure different aspects of model performance need to be reconciled. Therefore, while it may be tempting to simply evaluate the performance of the model against every data set that can be found (and indeed a "perfect" model should be able to withstand such a test), if this comes at the expense of being able to interpret the results, then it may be more beneficial to focus on a smaller set of tests that target key model outputs. This level of discrimination is inevitably an expert judgement, but is necessary if the field of ESM evaluation is to move from "beauty contest" to constraint.

Fig. 1. Conceptual diagram of the hierarchical approach to model evaluation on different spatial and temporal scales.

Fig. 3. Examples of global data sets documenting environmental conditions during the mid-Holocene (ca. 6000 yr ago) that can be used for benchmarking ESM simulations. In general, these are expressed as anomalies, i.e. the difference between mid-Holocene and modern conditions: (a) pollen-based reconstructions of anomalies in mean annual temperature, (b) reconstructions of anomalies in sea-surface temperatures based on marine biological and chemical records, (c) pollen and plant macrofossil reconstructions of vegetation during the mid-Holocene, (d) charcoal records of anomalies in biomass burning, and (e) anomalies of changes in the hydrological cycle based on lake-level records of the balance between precipitation and evaporation (after Harrison and Bartlein, 2012). (Reprinted from Harrison, S. P. and Bartlein, P.: Records from the Past, Lessons for the Future, in: The Future of the World's Climate, edited by: A. Henderson-Sellars and K. J. McGuffie, 403-436, Copyright © 2012, with permission from Elsevier.)

Fig. 4. Schematic diagram of model evaluation approaches, with optional approaches indicated by dashed lines.

Table 1. Summary of key data types for evaluation.

Table 2. Summary of Level 1 metrics (x and y represent points, while D1 and D2 are data sets). The Euclidean distance is more sensitive to outliers than the Manhattan distance; like the Manhattan distance, it also supposes that direct comparisons of the variables can be made.
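The outlier sensitivity noted in the caption can be demonstrated with a toy comparison: two error patterns with the same total (Manhattan) mismatch, one spread evenly across points and one concentrated in a single outlier. The numbers are arbitrary.

```python
import numpy as np

reference = np.zeros(5)
spread_error = np.full(5, 2.0)                        # error spread evenly
outlier_error = np.array([10.0, 0.0, 0.0, 0.0, 0.0])  # same total, one outlier

def manhattan(a, b):
    """L1 distance: sum of absolute differences."""
    return np.sum(np.abs(a - b))

def euclidean(a, b):
    """L2 distance: root of summed squared differences."""
    return np.sqrt(np.sum((a - b) ** 2))

# Manhattan distance is identical for both error patterns...
print(manhattan(reference, spread_error), manhattan(reference, outlier_error))
# ...but the Euclidean distance penalises the concentrated outlier far
# more, because the differences are squared before summing.
print(euclidean(reference, spread_error), euclidean(reference, outlier_error))
```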

Table 3. Summary of Level 2 metrics.

Table 4. Summary of Level 3 metrics.

Table 5. Summary of evaluation methodologies.

Process-level, "bottom-up"
- Approach: looks at relationships between variables in a way that isolates a single process, or a small number of processes.
- Example: magnitude of the seasonal cycle of T_air vs. T_surf to evaluate insulation by the snow pack.
- Strengths: pinpoints important model processes; "right answer for the right reason"; easy to interpret, e.g. one can see whether a response is too big or too small for a given input.
- Weaknesses: only targets a small part of the model; relevant observations may not exist; even when each process representation is close to perfect, this does not ensure that the overall balance between processes is right.

System-level, "top-down"
- Approach: compares large-scale model outputs that emerge from interactions between many processes within the model with relevant observations.
- Examples: global patterns of temperature, precipitation, etc.; seasonal cycle of carbon fluxes.
- Strengths: evaluates the end result, i.e. the quantities that we actually want the model to predict; assesses the overall balance between many (possibly finely balanced) processes.
- Weaknesses: compensating errors ("right answer for the wrong reason"); hard to interpret, as it offers no indication of what is causing an error or how to fix it.

Emergent constraints
- Approach: relates an observable aspect of the contemporary Earth system to a key system sensitivity, using an ensemble of simulations.
- Examples: Hall and Qu (2006): seasonal cycle of snow albedo; Cox et al. (2013): interannual variability of tropical carbon fluxes.
- Strengths: no requirement for models to be right, since individual models might be wrong regarding the magnitude of the response while the relationship remains robust; guides where we want observational effort.
- Weaknesses: relies on "bad" models more than "good" ones to derive the regression; may give false confidence if models are systematically wrong (e.g. all lack long-term carbon release from permafrost).