A comprehensive benchmarking system for evaluating global vegetation models

. We present a benchmark system for global vegetation models. This system provides a quantitative evaluation of multiple simulated vegetation properties, including primary production; seasonal net ecosystem production; vegetation cover; composition and height; ﬁre regime; and runoff. The benchmarks are derived from remotely sensed gridded datasets and site-based observations. The datasets allow comparisons of annual average conditions and seasonal and inter-annual variability, and they allow the impact of spatial and temporal biases in means and variability to be assessed separately. Speciﬁcally designed metrics quantify model performance for each process, and are compared to scores based on the temporal or spatial mean value of the observations and a “random” model produced by bootstrap resampling of the observations. The benchmark system is applied to three models: a simple light-use efﬁciency and water-balance model (the Simple Diagnostic Biosphere Model: SDBM), the Lund-Potsdam-Jena (LPJ) and Land Processes and eXchanges (LPX) dynamic global vegetation models (DGVMs). In general, the SDBM performs better than either of the DGVMs. It reproduces independent measurements of net primary production (NPP) but underestimates the amplitude of the observed CO 2 seasonal cycle. The two DGVMs show little difference for most benchmarks (including the inter-annual variability in the growth rate and seasonal cycle of atmospheric CO 2 ) , but LPX represents burnt fraction demonstrably more accurately. Benchmarking also identiﬁed several weaknesses common to both DGVMs. The benchmarking system provides a quantitative approach for evaluating how adequately processes are represented in a model, identifying errors and biases, tracking improvements in performance through model development, and discriminating among models. Adoption of such a system would do much to improve conﬁdence in terrestrial model predictions of climate change impacts and feedbacks


Introduction
Dynamic global vegetation models (DGVMs) are widely used in the assessment of climate change impacts on ecosystems, and feedbacks through ecosystem processes (Cramer et al., 1999;Scholze et al., 2006;Sitch et al., 2008;Scheiter and Higgins, 2009).However, there are large differences in model projections of the vegetation response to scenarios of future changes in atmospheric CO 2 concentration and climate (Friedlingstein et al., 2006;Denman et al., 2007;Sitch et al., 2008).Assessing the uncertainty around vegetationmodel simulations would provide an indicator of confidence in model predictions under different climates.Such a system would serve several functions, including the following: comparing the performance of different models; identifying Published by Copernicus Publications on behalf of the European Geosciences Union.
processes in a particular model that need improvement; and checking that improvements in one part of a model do not compromise performance in another.
Benchmarking is a routine component in the assessment of climate-model performance, including investigation of parameter uncertainties (e.g.Murphy et al., 2004;Piani et al., 2005) and multi-model comparison (Randall et al., 2007;Reichler and Kim, 2008), and is used both to inform model development (e.g. Jackson et al., 2008) and to interpret the reliability of projections of future climate (e.g.Shukla et al., 2006: Hall andQu, 2006).In recent years, there has been considerable effort spent on the development of standard metrics for climate-model evaluation (Taylor, 2001;Gleckler et al., 2008: Lenderink, 2010;Moise and Delage, 2011;Yokoi et al., 2011).In comparison, there has been little quantitative assessment of DGVM performance under recent conditions.Although most studies describing vegetation-model development provide some assessment of the model's predictive ability by comparison with observational datasets (e.g.Sitch et al., 2003;Woodward and Lomas, 2004;Prentice et al., 2007), such comparisons often focus just on one aspect of the model where recent development has taken place (e.g.Gerten et al., 2004;Arora and Boer, 2005;Zeng et al., 2008;Thonicke et al., 2010;Prentice et al., 2011).It has not been standard practice to track improvements in (or degradation of) general model performance caused by new developments.
A benchmarking system should facilitate more comprehensive model evaluation, and help to make such tracking routine.The land modelling community has recently recognized the need for such a system (e.g. the International Land Model Benchmarking Project, ILAMB: http://www.ilamb.org/), and some recent studies have designed and applied benchmarking systems.Blyth et al. (2009Blyth et al. ( , 2011) ) compared results of the JULES land-surface model with site-based water and CO 2 flux measurements and satellite vegetation indices, quantifying the difference between model output and observations using root mean squared error (RMSE) as a metric.Beer et al. (2010) used a gridded dataset of gross primary productivity (GPP), derived from up-scaling GPP from the FLUXNET network of eddy covariance towers (Jung et al., 2009(Jung et al., , 2010) ) to assess and compare the Lund-Potsdam-Jena (LPJ), LPJmL, ORCHIDEE, CLM-CN and SDGVM models.Bonan et al. (2011) evaluated latent heat fluxes with the tower-derived gridded GPP dataset (Beer et al., 2010) to evaluate the calibration of the CLM4 model.Cadule et al. (2010) used the model-to-data deviation, normalised standard deviation and Pearson's correlation to quantify the "distance" between simulated and observed CO 2 concentration and applied these to compare three coupled climatevegetation models that incorporate two DGVMs: TRIFFID and ORCHIDEE.All of these studies focus on a very limited number of simulated processes, and use metrics that are difficult to interpret across processes and models.Randerson et al. (2009) introduced a more systematic framework to as-sess and compare the performance of two biogeochemical models (CLM-CN and CASA') against net primary production (NPP) and CO 2 concentration data, including the definition of comparison metrics tailored to the benchmark observations and a composite skill score that combined metric scores for each observation into an overall measure of model performance.The Randerson et al. (2009) composite score was a weighted combination of scores across different metrics, where the weights were based on a qualitative and necessarily somewhat subjective assessment of the "importance" and uncertainty of each process (Randerson et al., 2009).Luo et al. (2012) recommended the development of a working benchmarking system for vegetation models that incorporates some of the approaches used in these various studies including a set of standard target datasets for benchmarks, a scoring system; and a way of comparing across model processes in order to evaluate model strengths and weaknesses to guide model development.Luo et al. (2012) reject the idea of a single composite metric because of the subjectivity involved in choices of relative weightings.
Our purpose here is to demonstrate a benchmarking system including multiple observational datasets and transparent metrics of model performance with respect to individual processes.We have tested the system on three vegetation models to demonstrate the system's capabilities in comparing model performance, assigning a level of confidence to the models' predictions of key ecosystem properties, assessing the representation of different model processes and identifying deficiencies in each model.

Principles
The benchmarking system consists of a collection of datasets, selected to fulfil certain criteria and to allow systematic evaluation of a range of model processes, and metrics, designed with the characteristics of each benchmark dataset in mind.We selected site-based and remotely sensed observational datasets that, as far as possible, fulfil the following requirements: -They should be global in coverage or, for site-based data, they should sample reasonably well the different biomes on each continent.This criterion excludes "campaign mode" measurements, and datasets assembled only for one continent or region.
-They should be independent of any modelling approach that involves calculation of vegetation properties from the same driving variables as the vegetation models being tested.This criterion allows remotely sensed fraction of absorbed photosynthetically active radiation (fA-PAR) products but excludes the MODIS NPP product Biogeosciences, 10, 3313-3340, 2013 www.biogeosciences.net/10/3313/2013/used by Randerson et al. (2009), or remotely sensed evapotranspiration (e.g.Fisher et al., 2008Fisher et al., , 2011;;Mu et al., 2011).It allows use of flux measurements and CO 2 inversion products, but excludes, for example, the up-scaled GPP used by Beer et al. (2010).
-They should be available for multiple years and seasonal cycles to allow assessment of modelled seasonal and inter-annual variation, for variables that change on these time scales.
-Datasets should be freely available, so that different modelling groups can evaluate their models against the same benchmarks.
The selected datasets (Table 1) provide information for the following: fAPAR, the fractional coverage of different plant life and leaf forms, GPP and NPP, height of the canopy, fire, as burnt fraction; runoff, as river discharge, and seasonal and inter-annual variation in atmospheric CO 2 concentration (Fig. 1): -fAPAR is the fundamental link between primary production and available energy (Monteith, 1972).It measures the seasonal cycle, inter-annual variability and trends of vegetation cover.Of all ecosystem properties derived from spectral reflectance measurements, fAPAR is closest to the actual measurements.
-Fractional cover of different life forms and leaf forms provides basic information about vegetation structure and phenology.
-GPP and NPP are the two fundamental measures of primary production.
-Vegetation height is a key variable for characterising vegetation structure, function and biomass.
-Remotely sensed data on fire (as fractional burnt area) have been available for a few years (e.g.Carmona-Moreno et al., 2005;Giglio et al., 2006).The latest dataset (Giglio et al., 2010;van der Werf et al., 2010) is derived from active fire counts and involves empirical (biome-dependent) modelling to translate between active fire counts and burned area.Our criteria exclude the use of the accompanying fire CO 2 emissions product (van der Werf et al., 2010), however, as this depends strongly on the use of a particular biogeochemical model.
-Annual runoff is an indicator of ecosystem function, as it represents the spatial integration of the difference between precipitation and evapotranspiration -the latter primarily representing water use by vegetation.It is a sensitive indicator, because a small proportional error in modelled evapotranspiration translates into a larger proportional error in runoff (Raupach et al., 2009).Runoff is measured independently of meteorological data by gauges in rivers.
-Atmospheric CO 2 concentration is measured at high precision at a globally distributed set of stations in remote locations (distant from urban and transport centres of CO 2 emission).The pattern of the seasonal cycle of atmospheric CO 2 concentration at different locations provides information about the sources and sinks of CO 2 in the land biosphere (Heimann et al., 1998), while the inter-annual variability of the increase in CO 2 provides information about CO 2 uptake at the global scale.
Ocean impacts on the seasonal cycle are small (Nevison et al., 2008).For inter-annual variability we use inversion products which selectively remove the ocean contribution (about 20 % of the signal: Le Quéré et al., 2003).
All remotely sensed data were re-gridded to a 0.5 • resolution grid and masked to a land mask common to all three models.Data-model comparison metrics were designed to be easy to implement, intuitive to understand, and comparable across multiple benchmarked processes.Metric scores for comparison of models with these datasets were compared against scores from two null models: one corresponding to the observational mean and the other obtained by randomly resampling the observations.
To demonstrate whether the benchmark system fulfilled the functions of evaluating specific modelled processes and discriminating between models, we applied it to three global models: a simple light-use efficiency and water-balance model introduced by Knorr and Heimann (1995), known as the Simple Diagnostic Biosphere Model (SDBM : Heimann et al., 1998) and two DGVMs.The SDBM is driven by observed precipitation, temperature and remotely sensed observations of fAPAR.The model has two tunable global parameters representing light-use efficiency under well-watered conditions, and the shape of the exponential temperature dependence of heterotrophic respiration.The DGVMs are the Lund-Potsdam-Jena (LPJ) model (version 2.1: Sitch et al., 2003, as modified by Gerten et al., 2004) and the Land surface Processes and eXchanges (LPX) model (Prentice et al., 2011).LPX was developed from LPJ-SPITFIRE (Thonicke et al., 2010), and represents a further refinement of the fire module in LPJ-SPITFIRE.

fAPAR
fAPAR data (http://oceancolor.gsfc.nasa.gov/SeaWiFS/;Table 1) were derived from the SeaWiFS remotely sensed fA-PAR product (Gobron et al., 2006), providing monthly data for 1998-2005. fAPAR . fAPAR et al. (2000), Rödenbeck et al. (2003), Baker et al. (2006), Chevalier et al. (2010) obtained for times when the solar incidence angle is > 50 • .This limitation mostly affects cells at high latitudes, or with complex topography, during winter.Cells where fAPAR values could not be obtained for any month were excluded from all comparisons.Annual fAPAR, which is the ratio of total annual absorbed to total annual incident PAR, is not the same as the average of the monthly fAPAR.True annual fAPAR was obtained by averaging monthly values weighted by PAR.Monthly PAR values were calculated using Clime Research Unit (CRU) TS3.1 monthly fractional cloud cover (Jones and Harris, 2012) as described in Gallego-Sala et al. (2010).
Monthly and annual fAPAR values were used for annual average, inter-annual variability and seasonality comparisons.
The monthly fAPAR data are used as a driver for the SDBM, but as a benchmark for the DGVMs.

Vegetation cover
Fractional cover data (Table 1) were obtained from International Satellite Land-Surface Climatology Project (ISLSCP) II vegetation continuous field (VCF) remotely sensed product (Hall et al., 2006;DeFries and Hansen, 2009 and refer-ences therein).The VCF product provides separate information on life form, leaf type and leaf phenology at 0.5 • resolution for 1992-1993.There are three categories in the lifeform dataset: tree (woody vegetation > 5 m tall), herbaceous (grass/herbs and woody vegetation < 5 m), and bare ground cover.Leaf type (needleleaf or broadleaf) and phenology (deciduous or evergreen) is only given for cells that have some tree cover.Tree cover greater than 80 % is not well delineated due to saturation of the satellite signal, whereas tree cover of less than 20 % can be inaccurate due to the influence of soil and understorey on the spectral signature (DeFries et al., 2000).The 0.5 • dataset was derived from a higher resolution (1 km) dataset (DeFries et al., 1999).Evaluation of the 1 km dataset against ground observations shows it reproduces the distribution of the major vegetation types: the minimum correlation is for bare ground at high latitudes (r 2 = 0.79) whereas grasslands and forests have an r 2 of 0.93.

NPP
The NPP dataset (Table 1) was created by combining site data from the Luyssaert et al. (2007) and the Ecosystem Model/Data Intercomparison (EMDI: Olson et al., 2001) databases.We exclude sites from managed or disturbed environments; i.e. we do not use class B records from EMDI, and we exclude sites classified as "managed", "recently burnt", "recently cut clear", "fertilized" or "irrigated" in Luyssaert et al. (2007) .The Luyssaert et al. (2007) data used here are all from woody biomes, and all but two of the EMDI data used are from grasslands.The NPP estimates in Luyssaert et al. (2007)

Burnt fraction
Burnt fraction data (Table 1) were obtained for each month from 1997-2006 from the third version of the Global Fire Emissions Database (GFED3: Giglio et al., 2010).Burnt fraction was calculated from high-resolution, remotely sensed daily fire activity and vegetation production using statistical modelling.Quantitative uncertainties in the estimates of burnt fraction, provided for each grid cell, are a combination of errors in the higher resolution fire activity data and errors associated with the conversion of these maps to lowresolution burnt area.

River discharge
River discharge (Table 1) was obtained from monthly measurements at station gauges between 1950 and 2005 (Dai et al., 2009).Dai et al. (2009) use a model-based infilling procedure in their analyses, but the dataset used here is based only on the gauge measurements.The basin associated with gauges close to a river mouth was defined using information from the Global Runoff Data Centre (GRDC: http: //www.bafg.de/GRDC).Average runoff for the basin was obtained by dividing discharge by total basin area.Although individual gauge measurements may have measurement errors of the order of 10-20 %, the use of spatially integrated discharge values means that the uncertainties are considerably less than this (Dai et al., 2009).Annual average and interannual variability comparisons for runoff were made only for years in which there were 12 months of data, to avoid seasonal biases.

CO 2 concentration
CO 2 concentration (Table 1) data were taken from 26 Carbon Dioxide Information Analysis Center (CDIAC: cdiac.ornl.gov) stations (Fig. 3) for seasonal cycle comparisons.For inter-annual comparisons, we used several inversion products (Bousquet et al., 2000;Rödenbeck et al., 2003;Baker et al., 2006;Keeling, 2008;Chevalier et al., 2010), processed as in Prentice et al. (2011).The inversions are designed to isolate the component of variability in the CO 2 growth rate due to land-atmosphere exchanges.The differences between these inversions (maximum difference 3.8 ppm) give a measure of the associated uncertainty.

Metrics
Many measures with different properties are used in the geosciences literature to compare modelled and observed quantities.These typically fall into three categories: nonnormalised metrics; metrics normalised by observational uncertainty; and metrics normalised by observational variance.Non-normalised metrics, which include RMSE (used e.g. by Blyth et al., 2009Blyth et al., , 2011) ) and mean squared error (MSE), cannot be compared directly between different variables as they are in different units.Metrics normalised by observational uncertainty require uncertainty estimates to be given for each site/grid cell in a dataset.Most of the datasets used in this study do not have such estimates, ruling out the use of metrics normalised by observational uncertainty.We therefore use metrics normalised by observational variance, allowing metrics based on both mean deviations (modulus-based) and mean squared deviations as alternative "families".The mean, variance and standard deviation provide a basic measure of global agreement between model and observation.Our basic normalised metrics for taking the geographic patterning into account in data-model comparisons of annual averages or totals were the normalised mean error (NME) and the normalised mean squared error (NMSE) (for definitions, limits and applications, see Table 2): where y i is the modelled value of variable x in grid cell (or at site) i, x i the corresponding observed value, and x the mean observed value across all grid cells or sites.NMSE is equal to the one-complement of the Nash-Sutcliffe model efficiency metric (Nash and Sutcliffe, 1970).NMSE thus conveys the same information as the Nash-Sutcliffe metric.As NME and NMSE are normalised by the spatial variability of the observations, these scores provide a description of the spatial error of the model.NME differs from NMSE only in the use of mean deviations, which are less sensitive to extreme values than standard deviations.We prefer NME, but retain NMSE because of its direct relation to a metric established in the literature.Both metrics take the value zero when agreement is perfect, unity when agreement is equal to that expected when the mean value of all observations is substituted for the model, and values > 1 when the model's performance is worse than the null model.
Table 2. Summary description of the benchmark metrics.y i is the modelled and x i is the corresponding observed value in cell or site i, and x is the mean observed value across all grid cells or sites.ω i is the modelled phase, and ϕ i is the observed phase.q ij is the modelled and p i observed proportion of item j in cell or site i.

Metric Equation Limits
Use in this study Normalised mean error (NME) 1 -Model performs as well as observational mean 2 -complete disagreement for step 3 Infinity -complete disagreement for step 1 and 2 For burnt fraction and fAPAR: annual averages, phase concentration, interannual variability.
For runoff: annual averages, inter-annual variability For CO 2 : phase concentration For NPP, GPP and height: annual averages Assessing difference in seasonality for fAPAR, burnt fraction and CO 2 Manhattan metric (MM) Vegetation cover comparisons for life forms, tree, grassland, bare ground, evergreen vs. deciduous tree and broadleaf vs. needleleaf tree.
Table 3. Mean, absolute variance (as defined in Eq. 3) and standard deviation (SD) of the annual average values of observations.The variance for most variables is from the long-term mean of the gridded or site data, whereas CO 2 is the variance of the inter-annual differences.

Annual average
Annual average comparisons were made using the mean, mean deviation (Eq. 3) and standard deviation of simulated and observed values (Table 3).NME and NMSE comparisons were conducted in three stages: (1) x i and y i take modelled and observed values; (2) x i and y i become the difference between observed or modelled values and their respective means (x i → x i − x); and (3) x i and y i from step 2 are divided by either the mean deviation or standard deviation Stage 2 removes the influence of the mean, and stage 3 removes the influence of the variability, on the measure.The NMSE at stage 3 is related to the correlation coefficient (Barnston et al., 1992).Van Oijen et al. (2011) showed that MSE can be decomposed into three elements similar to stage 1, 2 and 3 here, but as MSE is not normalised the decomposition is not directly applicable for this study.

Inter-annual variability
Inter-annual variability comparisons were made by calculating global values for each year of the model output and observations, and comparing them using Eqs.( 1) and ( 2), but with y i now being the global sum of modelled values for year i, and x i the corresponding observed value.Only stage 2 and 3 comparisons were made, as the stage 1 provides no extra information from the annual-average comparisons.Stage 3 comparison measures whether a model has the correct timing or phasing of inter-annual peaks and troughs.For interannual CO 2 concentration, the observational data were detrended to remove the effect of anthropogenic emissions.

Seasonality
The seasonal expression of change can be characterised in terms of the length and timing of the season, as well as the magnitude of differentiation between seasons.For example, in simulating the fire regime at a particular place, the length of the fire season and the time that fires occur are as important as correctly predicting the area burnt.Seasonality comparisons were conducted in two parts: seasonal concentration (which is inversely related to season length) and phase (expressing the timing of the season).Each simulated or observed month was represented by a vector in the complex plane, the length of the vector corresponding to the magnitude of the variable for each month and the directions of the vector corresponding to the time of year: where θ t is the direction corresponding to month t, with month 1 (January) arbitrarily set to an angle of zero.A mean vector L was calculated by averaging the real and imaginary parts of the 12 vectors, x t .
The length of the mean vector divided by the annual value stands for seasonal concentration, C; its direction stands for phase, P : Thus, if the variable is concentrated all in one month, seasonal concentration is equal to 1 and the phase corresponds to that month.If the variable is evenly spread over all months, then concentration is equal to zero and phase is undefined.If either modelled or observed values have zero values for all months in a given cell or site, then that cell/site is not included in the comparisons.Concentration comparisons use Eqs.( 1) and ( 2) and steps 1, 2 and 3. Modelled and observed phase are compared using mean phase difference (MPD): where ω i is the modelled phase, and ϕ i is the observed phase.The measure can be interpreted as the average timing error, as a proportion of the maximum error (6 months).For seasonal CO 2 concentrations, where the data are monthly deviations from the mean CO 2 , we compared the seasonal amplitude instead of seasonal concentration by comparing the simulated and observational sum of the absolute CO 2 deviation for each month using Eqs.( 1) and (2).

Relative abundance
Relative abundance was compared using the Manhattan metric (MM) and squared chord distance (SCD) (Gavin et al., 2003;Cha, 2007): where q ij is the modelled abundance (proportion) of item j in grid cell i, p i the observed abundance of item j in grid cell i, and n the number of grid cells or sites.So in the case of comparing life forms, items j would be trees; herbaceous; and bare ground.The sum of items in each cell must be equal to one for these metrics to be meaningful.They both take the value of 0 for perfect agreement, and 2 for complete disagreement.

Null models
To facilitate interpretation of the scores, we compared each benchmark dataset to a dataset of the same size, filled with the mean of the observations (Table 4).We also compared each benchmark dataset with "randomized" datasets (Table 4).This was done using a bootstrapping procedure (Efron, 1979;Efron and Tibshirani, 1993), whereby we constructed a dataset of the same dimensions as the benchmark set, filled by randomly resampling the cells or sites in the original dataset with replacement.We created 1000 randomized datasets to estimate a probability density function of their scores (Fig. 2).Models are described as better/worse than randomized resampling if they were less/more than two standard deviations from the mean randomized score.
As NME and MM are the sum of the absolute spatial variation between the model and observations, the comparison of scores obtained by two different models shows the relative magnitude of their biases with respect to the observations, or how much "better" one model is than another.If a model has an NME score of 0.5, for example, its match to the observations is 50 % better than the mean of the data score of 1.0.Similarly, when this model is compared to a model with an NME score of 0.75, it can be described as 33 % better than the second model as its average spatial error is 0.5/0.75= 67 % the size.Conversely, the second model would need to reduce its errors/improve by 33 % in order to provide as good a match to observations as the first.

SDBM
The SDBM simulates NPP and heterotrophic respiration (R h ) as described in Knorr and Heimann (1995) while the embedded water-balance calculation models evapotranspiration and therefore implicitly runoff.NPP is obtained from a simple relationship: where ε is light-use efficiency, set at 1 g C MJ −1 ; Ipar is incident PAR; and α is the ratio of actual to equilibrium evapotranspiration, calculated as in Prentice et al. (1993) and Gallego-Sala et al. (2010).R h was calculated as a function of temperature and water availability and for each cell is assumed to be equal to NPP each year (i.e.assuming the respiring pool of soil carbon is in equilibrium): where Q 10 is the slope of the relationship between ln(R h ) and temperature (expressed in units of proportional increase per 10 K warming) and takes the value of 1.5; and T is temperature ( • C). β is calculated by equating annual R h and annual NPP: GPP was assumed to be twice simulated NPP (Poorter et al., 1990).Runoff was assumed to be the difference between observed precipitation and evapotranspiration.Groundwater exchanges are disregarded.The free parameters ε and Q 10 were assigned values of 1.0 and 1.5 respectively, following Knorr and Heimann (1995) who obtained these values by tuning to observed seasonal cycles of CO 2 .

LPJ
LPJ (version 2.1: Gerten et al., 2004) simulates the dynamics of terrestrial vegetation via a representation of biogeochemical processes, with different properties prescribed for a small set of plant function types (PFTs).Each PFT is described by its life form (trees or herbaceous), leaf type (needleleaf or broadleaf) and phenology (evergreen or deciduous).A minimal set of bioclimatic limits constrain the global distribution of the PFTs.Nested time steps allow different processes to be simulated at different temporal resolution: photosynthesis, respiration and water balance are calculated on a daily time step while carbon allocation and PFT composition are updated on an annual time step.A weather generator converts monthly data of precipitation and fractional rain days to a daily time series of precipitation amounts.Fire is calculated annually and is based upon a simple empirical model which calculates the probability of fire based on daily moisture content of the uppermost soil layer as a proxy for fuel moisture (Thonicke et al., 2001)  are calculated from the summed annual probability of fire, using a simple relationship.

LPX
LPX (Prentice et al., 2011), which is a development of LPJ-SPITFIRE (Thonicke et al., 2010), incorporates a processbased fire scheme, with ignition rates based on the seasonal distribution of lightning strikes and fuel moisture content and fire spread, intensity and residence time based on climate data and modelling the drying of different fuel types between rain days.Fire intensity influences fire mortality and carbon fluxes.The fire model runs on a daily time step.

Model protocol
All models were run on a 0.5 • global grid using the CRU TS3.0 land mask as in Prentice et al. (2011).Soil texture was prescribed using the FAO soil data (FAO, 1991).The spin-up and historical drivers for the DGVM simulations were exactly as used for LPX by Prentice et al. (2011).For comparability, the same climate data were used to drive the SDBM.In addition SDBM was driven by fAPAR values from SeaWiFS observations.For cells lacking fAPAR values, values were constructed for the missing months by fitting the following equation to available data for each year: where fAPAR(m) is the fAPAR for months m with data; U is the maximum fAPAR value in month m max ; and L is the minimum fAPAR value.As the maximum fAPAR value typically occurs in spring or summer (Prince, 1991) when Sea-WiFS data are generally available, and the minimum occurs in the winter when data may be unavailable, U is set to the highest fAPAR value, whilst L is tuned to fit the function to the data.
The SDBM was only run for 1998-2005, a limitation imposed by the availability of fAPAR data, and comparisons were confined to this period.For LPX and LPJ, outputs and therefore comparisons were possible from 1950-2006.Comparisons with NPP, GPP, annual average basin runoff, global inter-annual variability in runoff, and the seasonal cycle of CO 2 concentration were made for all three models.LPX and LPJ are compared across a wider range of benchmarks.
Comparisons of the seasonal CO 2 cycle were based on simulated monthly net ecosystem production (NEP: NPP − R h − fire carbon flux).NEP for the SDBM was taken as the difference between monthly NPP and R h .For LPJ, which simulates fire on an annual basis, monthly fire carbon flux was set to 1/12 the annual value.With LPX, it was possible to use monthly fire carbon flux.For each model, detrended monthly values of NEP for each grid cell were input into the atmospheric transport matrices derived from the TM2 transport model (Kaminski et al., 1996), which allowed us to derive the CO 2 seasonal cycle (Heimann, 1995;Knorr and Heimann, 1995) at the locations of the observation sites.
Average basin runoff was calculated by summing the runoff from all model grid cells within a GRDC-defined basin and dividing by the basin area.If a grid cell fell into more than one GRDC basin, the runoff was divided between basins in proportion to the fraction of the cell within each basin.Inter-annual changes in runoff were calculated by summing runoff over all cells in basins for which there were data for a given year.Seasonal cycles of runoff are dependent Biogeosciences, 10, 3313-3340, 2013 www.biogeosciences.net/10/3313/2013/on the dynamics of water transport in the river, which was not modelled.

fAPAR
LPJ scores 0.82 and LPX scores 0.86 using NME for annual average fAPAR (Table 5).This difference in score is equivalent to a negligible (i.e.< 5 %) change in the match to the observations.Both values are considerably better than values for the mean of the data (1.00) and random resampling (1.19 ± 0.004), with the match to observations being 15 % closer and 30 % closer respectively.The models also perform well for seasonal timing (Fig. 4), with scores of 0.19 (LPJ) and 0.18 (LPX) or the equivalent of an average of 34 days different from observations.For comparison, the seasonal timing of the mean of the data and random resampling is ca. 3 months different from observations.The models also perform well for inter-annual variability: LPJ scores 0.60 and LPX scores 0.50 using NME for inter-annual variability, compared to a mean score of 1.00 and a score of 1.21 ± 0.34 from random resampling.The DGVM scores represent, respectively, a 40 % and 50 % better match to observations than the mean of the data.LPJ scores 1.07 and LPX scores 1.14 using NME for seasonal concentration, compared to 1.00 for the mean and 1.41 ± 0.006 for random resampling.This means that the seasonal concentration of fapar in the DGVMs is, respectively, 7 % and 14 % worse than the mean of the data compared to observations.

Vegetation cover
LPJ scores 0.78 and LPX scores 0.76 using the MM for the prediction of life forms (Table 5), again a negligible difference in performance (< 3 %) compared to observations.Both values are better than obtained for the mean of the data (0.93) or by randomly resampling (0.88 ± 0.002).Both models were also better than mean and randomly resampling for bare ground.However, both models overestimate tree cover and underestimate herb cover by around a factor of 2 (Table 5).The scores for tree cover (LPJ: 0.62, LPX: 0.56) show, respectively, a 38 % and 24 % poorer match to observations than the mean of the data (0.45), and a 15 % and 4 % poorer match to observations than randomly resampling (0.54 ± 0.002).In the same way, the two DGVMs show a 40 % poorer match to observed grass cover than the mean of the data and a 6 % poorer match than randomly resampling.Both models are worse than mean and random resampling for phenology (Table 5).

NPP/GPP
The models have NME scores for NPP of 0.58 (SDBM), 0.83 (LPJ) and 0.81 (LPX) (Table 5) -better than values obtained for the mean of the data (1.00) and random resampling (1.35 ± 0.09).Removing the biases in mean and variance (Table 5) improves the performance of all three models.The SDBM simulates 1.13 times higher NPP than observed, but correctly predicts the spatial variability shown by the observations, whereas the two DGVMs overestimate NPP but underestimate the spatial variance in NPP.As a result, removing the biases in the mean produces a much larger improvement in the DGVMs.In LPJ, for example, the score goes from 0.83 to 0.69, equivalent to a 29 % better match to the observations.The improvement in the SDBM is equivalent to only a 9 % better match to observations.The two DGVMs perform worse for GPP than NPP.LPX has an NME score of 0.81 for NPP but 0.98 for GPP -this is equivalent to a 17 % better match to NPP observations than to GPP observations.The SDBM performs better for GPP than the DGVMs, obtaining an NME score of 0.71, which is better than the mean of the data (1.00) and randomly resampling (1.36 ± 0.22).

Canopy height
LPJ scores 1.00 and LPX scores 1.04 using NME for the prediction of height (Table 5).These values lie between the mean (1.00) and random resampling (1.33 ± 0.004) scores.This poor performance is due to modelled mean heights ca.60-65 % lower than observed and muted variance (Table 5, Fig. 6).Removing the mean bias improves the score for both DGVMs to 0.71 for LPJ and 0.73 for LPX, equivalent to a 29 % and 30 % improvement in the match to observations.Model performance is further improved by removing bias in the variance, to 0.64 for LPJ (ca.11 %) and 0.68 for LPX (ca.6 %).

Burnt fraction
There is a major difference between the two DGVMs for annual fractional burnt area (Fig. 7): LPJ scores 1.58, while LPX scores 0.85 for NME (Table 5).Thus, LPX produces a 46 % better match to the observations than LPJ.The LPJ score is worse than the mean (1.00) and random resampling (1.02 ± 0.008).The same is true for NME comparisons of inter-annual variability, with LPJ scoring 2.86, worse than the mean (1.00) and random resampling (1.35 ± 0.34), whereas the LPX score of 0.63 is better than both.LPX could also be benchmarked against the seasonality of burnt fraction.It scores 0.10 for MPD comparison of phase, much better than the mean (0.74) and random resampling (0.47 ± 0.001).However, it did not perform well for seasonal concentration, scoring 1.38 compared to the mean (1.00) and random resampling (1.33 ± 0.006).

River discharge
Comparing average runoff for 1950-2005, both DGVMs score 0.28 for NME, better than the mean (1.00) and random resampling (1.18 ± 0.48).The models perform much less well for inter-annual comparisons, with NME scores of 1.33 (LPJ) and 1.32 (LPX), worse than 1.00 for the mean and 1.45 ± 0.17 for random resampling.Agreement is slightly improved by removing variance bias (LPJ: 1.07, LPX: 1.11).Neither of the DGVMs examined here treat water-routing explicitly.Introducing a one-year lag for inter-annual comparisons (Fig. 8) produces a 21 % (LPJ) and 19 % (LPX) improvement in the match to observations, confirming that taking account of delays in water transport is important when assessing the inter-annual variation in runoff.All three models were compared for 1998-2005.For annual average comparisons, they all performed better than the mean and random resampling (Table 5).However, all models performed poorly for inter-annual variability, obtaining similar scores (1.64 to 2.38) compared to the mean (1.00) and random resampling (1.34 ± 0.34).Removing variability bias and introducing a one-year lag improved performance, with the SDBM scoring 1.37, LPJ 1.36 and LPX 1.35.

CO 2 concentration
The generalised form of the seasonal cycle in CO 2 concentrations at different sites can be compared for all three models.The SDBM scores 0.21 whereas LPJ scores 0.34 and LPX 0.34 in the MPD comparisons of seasonal timing, compared to the mean of the data (0.33) and random resampling (0.42 ± 0.051).Thus, the SDBM produces an estimate of peak timing that is 22 days closer to observations than the mean of the data, while the DGVMs produce estimates that are 6 days further away from the observations than the mean of the data (Fig. 3).The scores for NME comparison of seasonal concentration for the SDBM (0.68), LPJ (0.46) and LPX (0.58) are all better than the mean (1.00) and random resampling (1.38 ± 0.28).Thus, although the difference between the models is non-trivial (ca.30 %), all three models are ca.30-50 % closer to observations than the mean of the data.Only the DGVMs can be evaluated with respect to interannual variability in global CO 2 concentrations.Both models capture the inter-annual variability relatively well (Table 5).LPJ scores 0.89 and LPX scores 0.83 for the average of all inversion datasets, compared to the mean of the data (1.00) and random resampling (1.37 ± 0.05).

Incorporating data uncertainties
In calculating the performance metrics, we have disregarded observational uncertainty.Few land-based datasets provide quantitative information on the uncertainties associated with site or gridded values.However, the GFED burnt fraction (Giglio et al., 2010)  values at each site or grid cell to calculate the maximum and minimum potential distance between models and observations.
In the standard NME comparison for annual fractional burnt area, LPJ scores 1.58 while LPX scores 0.85.Taking into account the uncertainties produces minimum and maximum scores of 1.27 and 1.85 for LPJ, and 0.35 and 1.17 for LPX.Since these ranges are non-overlapping, the improvement in the match to observations shown by LPX compared to LPJ is demonstrably larger than observational uncertainty.This is not the case for the Luyssaert et al. (2007) only sitebased annual average NPP comparisons, where the ranges are 0.26-1.36(SDBM), 0.37-1.43(LPJ) and 0.39-1.50(LPX).Similarly, the apparent biases in mean annual NPP shown by all three models are within the observational uncertainty.Removing the slight high bias in mean annual NPP produced an improvement in the performance of the SDBM, with a change in the Luyssaert et al. (2007) only score from 0.72 to 0.59, equivalent to a 18 % better match to the observations.However, the range of scores obtained for the SDBM taking into account the observational uncertainties after removing the high bias is 0.21-1.25.As this overlaps with the scores obtained prior to removing these biases (0.26-1.36), the improvement gained from removing the influence of the mean in NPP in the SDBM is less than the observational uncertainty.
Another approach to estimating the influence of uncertainty is to use alternative realizations of the observations.This approach has been used by the climate-modelling community to evaluate performance against modern climate observations (e.g.Gleckler et al., 2008) and is used here for CO 2 inter-annual comparisons.The scores obtained in comparisons with individual inversion products range from 0.82 to 0.98 for LPJ, and from 0.70 to 0.95 for LPX.Thus, the conclusion that the two DGVMs capture the inter-annual variability equally well, based on the average scores of all inversion datasets, is supported when taking into account uncertainty expressed in the differences between the inversions.

The influence of choice of dataset
The use of alternative datasets for a given variable implies that there are no grounds for distinguishing which is more reliable.It also highlights the fact that there is an element of subjectivity in the choice of datasets and that this introduces another source of uncertainty into the process of benchmarking.We have explicitly excluded from the benchmarking procedure any datasets that involve manipulations of original measurements based on statistical or physical models that are driven by the same inputs as the vegetation models being tested (e.g.MODIS NPP, remotely sensed evapotranspiration, upscaled GPP).However, such products often provide  2010) GPP dataset is based on a much larger number of flux-tower measurements than are included in the Luyssaert et al. (2007) dataset, but uses both diagnostic models and statistical relationships with climate to scale up these measurements to provide global coverage.We compare the annual average GPP scores using Beer et al. ( 2010), calculated using all grid cells and considering only those grid cells which correspond to locations with site data in the Luyssaert et al. (2007) dataset.These comparisons (Table 6) show that LPX and SDBM perform better against the Beer et al. (2010) dataset than against the Luyssaert et al. (2007) at the site locations, while the results obtained for LPJ against the two datasets are roughly similar.There is a further improvement in performance when the models are compared against all the grid cells.The improvement in performance at the site locations presumably reflects the fact that the Beer et al. ( 2010) dataset smooths out idiosyncratic site characteristics; the additional improvement in performance in the global comparison reflects both the smoothing and the much larger number of flux sites included in the Beer et al. ( 2010) dataset.Nevertheless, the conclusion that the SDBM performs better than the DGVMs is not influenced by the choice of dataset.LPJ performs marginally better than LPX when the Luyssaert et al. (2007) dataset is used as the benchmark (0.8 versus 0.98), but worse than LPX when the Beer et al. (2010) dataset is used as a benchmark (0.6 versus 0.51).This indicates that the difference between the two DGVMs is less than the observational uncertainty.
The release of new, updated datasets poses problems for the implementation of a benchmarking system, but could be regarded as a special case of the use of alternative realizations of the observations.The GFED3 burnt fraction dataset, used here, is a comparatively recent update of an earlier burnt fraction dataset (GFED2: van der Werf et al., 2006) the annual average burnt fraction changes from 1.58 (against GFED3) to 1.91 (against GFED2) for LPJ and from 0.85 (GFED3) to 0.92 (GFED2) for LPX (i.e. both models produce a better match to GFED3 than to GFED2), but the difference between the two models is preserved (i.e.LPX, with its more explicitly process-based fire model, is more realistic than LPJ).

Benchmarking the sensitivity to parameter tuning
Benchmarking can be used to evaluate how much tuning of individual parameters improves model performance and to ensure that the simulations capture specific processes correctly.We examine how well the current system serves in this respect by running sensitivity experiments using the SDBM.
The SDBM underestimates the amplitude of CO2 seasonal cycle (Fig. 3).A better match to CO 2 observations can be achieved by tuning the light-use efficiency parameter (ε in Eq. 12).The best possible match to CO2 seasonal amplitude (0.18) is obtained when ε is equal to 1.73 g C MJ −1 , but this increases both the mean and the variance of NPP compared to observations: the overall performance of the SDBM is therefore worse (Table 7).The seasonal amplitude of CO 2 depends on simulating the correct balance between NPP and R h .Thus, given that the model produces a reasonable simulation of annual average NPP, improvement in CO 2 seasonality should come from changes in the simulation of R h .Removing the requirement that NPP and R h are in equilibrium, by setting total NPP to be 1.2 times R h , improves the CO 2 sea-sonal amplitude score to 0.51.In the baseline simulation, the Q 10 for R h is 1.5 (Eq.13).Changing this response by increasing Q 10 to 2 degrades the simulation of the seasonal amplitude and phase of CO 2 , while decreasing Q 10 to 1.3 improves the simulation of the seasonal amplitude and phase of CO 2 (Table 7).Removing the seasonal response of R h to moisture (i.e.removing α from Eq. 13) improves the score for seasonal amplitude (0.39) but does not change the score for the phase.However, this degrades its performance against annual average NPP from 0.58 to 0.82.We expect that R h should be sensitive to soil moisture changes, but this analysis suggests that the treatment of this dependency in the SDBM is unrealistic.

Discussion and conclusion
Model benchmarking serves multiple functions, including (a) showing whether processes are represented correctly in a model, (b) discriminating between models and determining which performs better for a specific process, and (c) comparing between the model scores and those obtained by comparing mean and random resampling of observations, thus helping to identify processes that need improvement.As found by Heimann et al. (1998), the SDBM produces a good simulation of the seasonal cycle of atmospheric CO 2 concentration.However, we show that the simulated amplitude of the seasonal cycle is too low (  The seasonal variation of R h can be altered by changing the response of R h to temperature (Q 10 ).Although many models (e.g.Potter et al., 1993;Cox et al., 2000) use Q 10 values of 2, benchmarking shows that the value of 1.5 used in the SDBM provides a better match to seasonal CO 2 observations.However, reducing the Q 10 to 1.3 improves the simulation still further.Mehecha et al. ( 2010), based on an analysis of FLUXNET data, have shown that Q 10 values are 1.4 ± 0.1 independent of temperature or vegetation type.Thus, both the initial and "improved" Q 10 values used here are consistent with observations, whereas values of 2 are not.Sensitivity analyses show that the SDBM can produce a seasonal cycle comparable to observations with respect to both amplitude and phase by removing the assumption that NPP and R h are in equilibrium, and the dependence of R h on seasonal changes in moisture availability.The idea that NPP and R h are not in equilibrium is realistic; the idea that moisture availability has no impact on R h is not.Thus, these analyses illustrate how benchmarking can be used to identify whether processes are represented correctly in a model, and pinpoint specific areas that should be targeted for investigation in further developments of the SDBM.
The benchmarking system can discriminate between models.LPJ and LPX are closely related models, differing primarily in the complexity of their treatment of fire and the feedbacks from fire disturbance to vegetation.The two DGVMs perform equally well against the benchmarks, e.g. for NPP (Fig. 9), inter-annual CO 2 concentrations (Fig. 10) and inter-annual and annual average runoff (Fig. 8).However, LPX performs better than LPJ with respect to all measures associated with fire (Fig. 7).
We were able to show several areas where both DGVMs perform poorly against the benchmarks, and use the comparisons to evaluate possible reasons.Deficiencies common to both models include a low bias in canopy height (Table 5; Fig. 6), poor simulation of the seasonal concentration of fA-PAR and of the balance of tree and grass cover (Table 5), and poor simulation of the inter-annual variability in runoff (Fig. 8).
Both DGVMs score poorly against the canopy height benchmark (Fig. 6), averaging around 1/3 of observed heights (Table 5).However, they capture the spatial pattern of the differences in height reasonably well.A good match to canopy height was not expected, because LPJ and LPX do not simulate a size-or age-structured tree population but rather represent the properties of an "average individual".In contrast, the canopy height dataset represents the mean top height of forests within the grid cell.However, the models should, and do, capture broad geographic patterns of variation in height.The canopy height benchmark could provide a rigorous test for models that explicitly simulate cohorts of different ages of trees, such as the Ecosystem Demography (ED) model (Moorcroft et al., 2001).For models adopting a similar strategy to the LPJ/LPX family, the key test is whether the spatial patterns are correctly simulated.In either case, the use of remotely sensed canopy height data represents a valuable addition to the benchmarking toolkit.
Poor performance in the simulation of seasonal concentration of fAPAR (Table 5) demonstrates that both DGVMs predict the length of the growing season inaccurately: the growing season is too long at low latitudes and too short at mid-latitudes.This poor performance indicates that the phenology of both evergreen and deciduous vegetation requires improvement.Both models overestimate the amount of tree cover and underestimate grass cover (Table 5).The oversharp boundaries between forests and grasslands suggest that the models have problems in simulating the coexistence of these life forms.This probably also affects, and is exacerbated by, the simulation of fire in the models (Fig. 7).
The DGVMs simulate annual average runoff reasonably well, but inter-annual variability in runoff is poorly simulated.In large basins, water can take many months to reach the river mouth (Ducharne et al., 2003) and this delay has a major impact on the timing of peaks in river discharge.

Biogeosciences
Neither LPX nor the version of LPJ evaluated here includes river routing; runoff is simulated as the instantaneous difference in the water balance.Thus, it is unsurprising that neither model produces a good match to observations of interannual variability.Murray et al. (2011) have pointed out that inclusion of a river routing scheme should improve the simulation of runoff in LPX, and this is supported by the fact that introducing a one-year lag improved model performance against the runoff benchmark (Fig. 8).There is already a version of LPJ (LPJmL v3.2: Rost et al., 2008) that incorporates a water storage and transport model (and also includes human extraction), and produces a more realistic simulation of inter-annual variability in runoff than the version examined here.
In this paper, we have emphasised the use of global metrics for benchmarking, although both the NME and MM metrics provide a measure of the impact of the correct simulation of geographical patterning on global performance.However, the metrics could also be used to evaluate model performance at smaller geographic scales (e.g. for specific latitudinal bands, or individual continents or biomes).For example, comparison of the mean annual burnt fraction scores for specific latitudinal bands shows that the two DGVMs simulate fire in tropical regions better than in extratropical regions or overall, with NME scores for the tropics of 1.27 (LPJ) and 0.82 (LPX) compared to the global scores of 1.58 (LPJ) and 0.85 (LPX).
Some variables, such as runoff and burnt fraction, display considerable inter-annual variability linked to climate (e.g.changes in ENSO: van der Werf et al. post-volcanic cooling events: Riaño et al., 2007), and valuable information is obtained by considering this variability.The vegetation cover and canopy height datasets used for benchmarking here are single-year "snapshots": this is entirely appropriate for variables that change only slowly.Nevertheless, given that vegetation is already responding to changes in climate (Parmesan, 2006;Hickling et al., 2006;Fischlin et al., 2007), additional "snapshots" of these variables would be useful adjuncts to a benchmarking system allowing evaluation of models' ability to reproduce decadalscale variability in vegetation properties.
In general, remote sensing data are most likely to provide the global coverage necessary for a benchmark dataset.Nevertheless, we have found considerable value in using sitebased datasets for river discharge, CO 2 , GPP and NPP.River discharge data are spatially integrated over basins that together cover much of the global land surface, while CO 2 station measurements intrinsically integrate land-atmosphere CO 2 fluxes over moderately large areas through atmospheric transport.The coverage of the site-based GPP and NPP datasets is more limited and currently does not represent the full range of biomes.We have shown that model performance against the Beer et al. ( 2010) gridded GPP dataset is better than performance against the site-specific estimates of GPP in the Luyssaert et al. (2007) dataset -a function of the much higher number of flux-tower measurements included in the newer dataset and the smoothing of individual measurements inherent in the interpolation of these measurements to produce a gridded dataset.We do not use the Beer et al. ( 2010) dataset as a standard benchmark, because it was derived, in part, using the same climate variables that are used for the simulation of GPP in the vegetation models.However, the apparent improvement in model performance against the Beer et al. ( 2010) dataset at the Luyssaert et al (2007) sites indicates the importance of making quality-controlled summaries of the primary flux-tower data available to the modelling community for benchmarking purposes.
GPP and NPP have also been derived from remotely sensed products (e.g.Running et al., 2004;Turner et al., 2006).This is not an optimal approach because the results are heavily influenced by the model used to translate the spectral vegetation indices, and the reliability of the product varies with spatial scale and for a given ecosystem type (Lu and Ji, 2006).
A more general issue with the development of benchmarking systems is the fact that target datasets are constantly being extended in time and upgraded in quality.This is potentially problematic if the benchmark system is to be used to evaluate improvements in model performance through time, since this requires the use of a fixed target against which to compare successive model versions, but this target may have been superseded in the interim.In the current system, for example, we use the Dai et al. (2009) dataset for runoff, which supersedes an earlier product (Dai and Trenberth, 2002) and improves upon this earlier product by including more and longer records.The use of an updated version of the same target dataset may change the numeric scores obtained for a given simulation, but our comparison of the GFED2 and GFED3 datasets suggests this is unlikely to change the interpretation of how well a model performs.Any benchmarking system will need to evolve as new data products become available.In practical terms, this may mean that data-model comparisons will have to be performed against both the old and new versions of the products in order to establish how different these products are from one another and to establish a new baseline comparison value for any given model.As with the datasets used in this study, any new datasets should be freely available to the scientific community, to allow different modelling groups to undertake comparable benchmarking exercises.
A major limitation of the benchmarking approach presented here is that it does not take into account observational uncertainties, because very few datasets provide a quantitative estimate of such uncertainties.We have shown that observational uncertainty is larger than differences in model performance with respect to site-based annual average NPP measurements, and these observational uncertainties are also greater than model biases in NPP.However, differences in the performance of LPJ and LPX with respect to annual average burnt fraction are considerably larger than observational un-certainties.Approaches such as the use of multiple datasets (e.g.our use of multiple CO 2 inversions) may be one way of assessing uncertainty where there are no grounds for selecting a particular dataset as being more accurate or realistic.However, the only comprehensive solution to the problem is for measurement uncertainties to be routinely assessed for each site/grid cell and included with all datasets.
We have not attempted to provide an overall assessment of model performance by combining the metric scores obtained from each of the benchmarks into a composite skill score, although this has been done in some previous analyses (e.g.Randerson et al., 2009), because this requires subjective decisions about how to weight the importance of each metric.Composite skill scores have been used in dataassimilation studies to obtain better estimates of model parameters (e.g.Trudinger et al., 2007).The choice of weights used in these multi-variable composite metrics alters the outcome of parameter optimization (Trudinger et al., 2007;Weng and Luo, 2011;Xu et al., 2006).Decisions about how to weight individual vegetation-model benchmarks may heavily influence model performance scores (Luo et al., 2012).
The community-wide adoption of a standard system of benchmarking, as first proposed by C-LAMP (Randerson et al., 2009) and by ILAMB (Luo et al., 2012), would help users to evaluate the uncertainties associated with specific vegetation-model simulations and help to determine which projections of the response of vegetation to future climate changes are likely to be more reliable.As such, it will help to enhance confidence in these tools.At the same time, as we have shown here, systematic benchmarking provides a good way to identify ways of improving the current models and should lead to better models in the future.

Fig. 2 .
Fig. 2. Results of bootstrap resampling of inter-annual variability in global burnt fraction(1997)(1998)(1999)(2000)(2001)(2002)(2003)(2004)(2005) from the GFED3 dataset.The asterisks labelled LPX and LPJ show the scores achieved by the LPX and LPJ models respectively.The limits for better than and worse than random resampling are set at two standard deviations away from the mean bootstrapping value (vertical lines).

Fig. 3 .
Fig. 3. Observed seasonal cycle of atmospheric CO 2 concentrations at 26 CO 2 stations over the period 1998-2005 (black line), taken from the CDIAC website (cdiac.ornl.gov)compared to the simulated seasonal cycle from the Simple Diagnostic Biosphere Model (SDBM) (green line); LPJ (red); and LPX (blue).The y-axis indicates variation in atmospheric CO 2 concentration about the mean.The x-axis is months from January through 18 months to June.

Fig. 4 .
Fig. 4. Comparison of observed and simulated seasonal phase and seasonal concentration of fraction of absorbed photosynthetically active radiation (fAPAR) averaged over the period 1998-2005 from (a) seasonal phase from SeaWiFS (Gobron et al., 2006) and as simulated by (b) LPJ and (c) LPX; seasonal concentration from (d) SeaWiFS, (e) LPJ and (f) LPX.Hashed area in (a) and (d) shows areas where no comparison is possible.

Fig. 5 .
Fig. 5. Comparisons of observed and simulated NPP and GPP in kg C m −2 .The NPP observations (x-axis) are from the dataset made by combining sites from the Luyssaert et al. (2007) dataset and the Ecosystem/Model Data Intercomparison dataset (Olson et al., 2001).The GPP observations are derived from the Luyssaert et al. (2007) dataset.The simulated values (y-axis) are annual averages for the period 1998-2005.The observations are compared with NPP (a) and GPP (b) from the Simple Diagnostic Biosphere Model (SDBM), NPP (c) and GPP (d) from LPJ and NPP (e) and GPP (f) from LPX.The solid line shows the 1 : 1 relationship.

Fig. 6 .
Fig. 6.Comparisons of observed and simulated height.(a) Observed canopy height (in 2005) from the Simard et al. (2011) dataset compared to (b) simulated height in the same year from LPX; (c) LPX-simulated height, multiplied by a factor of 2.67 so that the simulated global mean height is the same as the observations; (d) height from (c) with values reduced by a factor of 1.40 about the mean so that the simulations have the same global mean and variance as the observations.

Fig. 8 .
Fig. 8. Observed inter-annual runoff for 1950-2005 averaged over basins from the Dai et al. (2009) dataset (black line) compared to average simulated runoff over the same basins from LPJ (red line) and LPX (blue line).(a) shows inter-annual runoff, and (b) shows interannual variability in runoff where the simulated values are lagged by a year.

Fig. 9 .
Fig. 9. Comparison of observed and simulated annual average net primary production (NPP).Observed values are from the Luyssaert et al. (2007) and Ecosystem/Model Data Intercomparison datasets (Olson et al., 2001), and the simulated values are from (b) Simple Diagnostic Biosphere Model (SDBM), (c) LPJ and (d) LPX.The symbols in (b), (c) and (d) indicate the magnitude and direction of disagreement between simulation and observed values, where the upward and downward facing triangles represent over-and undersimulation respectively.Double triangles indicate a difference in NPP of > 400 g C m −2 , single filled triangles a difference between 200 and 400 g C m −2 , single empty triangles a difference 100 and 200 g C m −2 , and empty circles a difference of < 100 g C m −2

Table 1 .
Summary description of the benchmark datasets.

Table 4 .
Scores obtained using the mean of the data (Data mean), and the mean and standard deviation of the scores obtained from bootstrapping experiments (Bootstrap mean, Bootstrap SD).NME/NMSE denotes the normalised mean error/normalised mean squared error, MPD the mean phase difference and MM/SCD the Manhattan metric/squared chord distance metrics.

Table 5 .
Comparison metric scores for model simulations against observations.Mean and variance rows show mean and variance of simulation for annual average values, followed in brackets by the ratio of the mean/variance with observed mean or variance in Table3.Numbers in bold indicate the model with the best performance for that variable.Italic indicates model scores better than the mean of the data score listed in Table4.Asterisks indicate model scores that are significantly better than randomly resampling listed in Table4.NME/NMSE denotes the normalised mean error/normalised mean squared error, MPD the mean phase difference and MM/SCD the Manhattan metric/squared chord distance metrics.fAPAR is the fraction of absorbed photosynthetically active radiation, NPP net primary productivity, and GPP gross primary productivity.

Table 6 .
Mean annual gross primary production (GPP) normalised mean error (NME) comparison metrics using Luyssaert et al. (2007) and Beer et al. (2010) as alternative benchmarks.In the case of Beer et al. (2010), the comparisons are made for all grid cells (global) and also from the grid cells which contain sites in the Luyssaert et al. (2007) dataset (at sites).

Table 5 ;
Fig.3).The SDBM's performance depends on getting the right balance

Table 7 .
Comparison metric scores for simulations with the Simple Diagnostic Biosphere Model (SDBM) against observations of the seasonal cycle of atmospheric CO 2 concentration and annual average NPP.Numbers in bold indicate the model with the best performance for that variable.Italic indicates model scores better than the SDBM simulation tuned using CO 2 seasonal observations.NME/NMSE denotes the normalised mean error/normalised mean squared error and MPD the mean phase difference.The details of each experiment are explained in the text. .Improved simulation of CO 2 seasonal amplitude can be achieved through tuning the light-use efficiency using CO 2 station data, but this degrades the simulated NPP. h