Using atmospheric observations to evaluate the spatiotemporal variability of CO 2 fluxes simulated by terrestrial biospheric models

Terrestrial biospheric models (TBMs) are used to extrapolate local observations and process-level understanding of land-atmosphere carbon exchange to larger regions, and serve as predictive tools for examining carbon-climate interactions. Understanding the performance of TBMs is thus crucial to the carbon cycle and climate science communities. In this study, we present and assess an approach to evaluating the spatiotemporal patterns, rather than aggregated magnitudes, of net ecosystem exchange (NEE) simulated by TBMs using atmospheric CO2 measurements. The approach is based on statistical model selection implemented within a high-resolution atmospheric inverse model. Using synthetic data experiments, we find that current atmospheric observations are sensitive to the underlying spatiotemporal flux variability at sub-biome scales for a large portion of North America, and that atmospheric observations can therefore be used to evaluate simulated spatiotemporal flux patterns as well as to differentiate between multiple competing TBMs. Experiments using real atmospheric observations and four prototypical TBMs further confirm the applicability of the method, and demonstrate that the performance of TBMs in simulating the spatiotemporal patterns of NEE varies substantially across seasons, with best performance during the growing season and more limited skill during transition seasons. This result is consistent with previous work showing that the ability of TBMs to model flux magnitudes is also seasonally-dependent. Overall, the proposed approach provides a new avenue for evaluating TBM performance based on sub-biome-scale flux patterns, presenting an opportunity for assessing and informing model development using atmospheric observations.


Introduction
A key question in carbon cycle science is how terrestrial carbon sinks will evolve within the context of a rapidly changing climate.Such projections of future carbon-climate interactions largely depend on the accuracy of current terrestrial biospheric models (TBMs), the main tool used to simulate the processes controlling the biospheric carbon cycle.Thus, understanding and evaluating the performance of current TBMs is an essential step toward improving the state of carbon cycle research.
TBM predictions of carbon flux can be directly evaluated against eddy covariance tower measurements at various timescales ranging from hourly to interannual (Baker et al., 2003;Balzarolo et al., 2014;Keenan et al., 2012;Raczka et al., 2013;Richardson et al., 2012;Sasai et al., 2005;Schaefer et al., 2012;Schwalm et al., 2010), but the information provided by flux towers is only representative of small spatial scales (∼ 1 km 2 ) relative to the scales of interest for global analyses.On the other end of the spectrum, TBM predictions aggregated to large spatial and/or temporal scales (e.g., continental/monthly to global/annual) are routinely intercompared with flux estimates obtained from inverse modeling based on observed atmospheric CO 2 mixing ratios (Canadell et al., 2011;Gourdji et al., 2012;Hayes et al., 2012;McGuire et al., 2012;Turner et al., 2011), but such large-scale comparisons make it difficult to provide directly usable information regarding the processes driving carbon exchange.In addition, differences among TBMs exist across a full range of spatiotemporal scales, including interannual variability, the timing of phenology, and the spatiotemporal distribution of biospheric carbon fluxes within regions Published by Copernicus Publications on behalf of the European Geosciences Union.(Gourdji et al., 2012;Huntzinger et al., 2012;Keenan et al., 2012;Raczka et al., 2013;Richardson et al., 2012;Schaefer et al., 2012;Schwalm et al., 2010).These differences reflect the fact that processes controlling carbon-climate feedbacks are manifested differently across TBMs.
Assessing the spatial and/or temporal variability of carbon fluxes as a method for evaluating TBMs, therefore, offers the potential to examine the environmental processes driving carbon exchange, and hence provides an alternative path forward in the assessment of TBM predictions.For example, evaluating the timing of modeled phenology can highlight issues associated with a model's representation of Light Use Efficiency (LUE), temperature response, and GPP response under various conditions (Richardson et al., 2012;Schwalm et al., 2010).Examining the interannual variability of TBM output can identify problems with the representation of interannual variability in spring phenology, soil thaw, snowpack melt and lagged response to extreme climatic events (Keenan et al., 2012).
The majority of previous studies examining carbon flux variability are nevertheless based on spatially and/or temporally aggregated carbon fluxes, however.An evaluation of flux variability, or flux patterns, at the fine native spatiotemporal scales of TBM simulations would make it possible to target more directly the fine-scale spatiotemporal patterns of carbon fluxes that have been shown to relate to environmental/climatic factors directly, such as precipitation, radiation and nighttime temperature (Beer et al., 2010;Mueller et al., 2010;Yadav et al., 2010).Such evaluations could therefore inform model improvements at the process level.
Observations of atmospheric CO 2 can potentially be used to assess such fine-scale spatiotemporal flux patterns.On one hand, atmospheric CO 2 observations are sensitive to finescale net ecosystem exchange (NEE) spatial and temporal variability (Huntzinger et al., 2011).On the other hand, variations in atmospheric CO 2 measurements are routinely used in inverse modeling frameworks to infer upwind sources and sinks of CO 2 , and recent studies suggest that atmospheric observations contain information about flux patterns at spatial and temporal resolutions comparable to those of TBMs run for regional to continental domains (Broquet et al., 2013;Göckede et al., 2010;Gourdji et al., 2010Gourdji et al., , 2012)).Despite the uncertainties existing in regional inversions due to uncertainties in atmospheric transport, fossil fuel emissions, fire disturbance, and boundary conditions, these studies do point to the possibility of evaluating the spatiotemporal patterns of fluxes from biospheric models through the use of inverse models.
With this goal in mind, what is needed is an atmosphericinversion-based method that can use variations in atmospheric CO 2 to assess the spatiotemporal patterns of surface carbon fluxes simulated by TBMs.The purpose of this paper is to present, evaluate, and demonstrate the application of such an approach, applied here to the evaluation of the 1 • × 1 • and 3-hourly spatiotemporal variability of NEE sim- ulated by TBMs using atmospheric CO 2 measurements.This fine-scale variability is evaluated here within each month and biome over North America, thus providing a way to evaluate the seasonal and biome-specific differences in model performance.The distinguishing feature of the proposed approach is that it targets the evaluation of flux patterns at fine scales, rather than flux magnitudes at aggregated scales, thereby providing a closer link to process-based understanding of TBM performance.The approach is first evaluated with a series of synthetic data experiments where the underlying flux patterns affecting the atmospheric CO 2 signals are known.The application of this approach is further tested and demonstrated using actual atmospheric measurements and a prototypical small set of extensively studied TBM simulations from the North American Carbon Program (NACP) Regional Interim Synthesis (RIS) effort (Huntzinger et al., 2012).

Atmospheric CO 2 measurements
We use continuous, high-precision atmospheric CO 2 concentration measurements from 35 towers for the year 2008 to evaluate the simulated NEE spatiotemporal variability over North American land.Figure 1 shows the location of these towers along with the geographic coverage of seven North American biomes as modified from Olson et al. (2001).A majority of towers are located in Temperate Broadleaf and Mixed Forests, Temperate Grasslands, Savannas and Shrublands, Temperate Coniferous Forests and Boreal Forests and Taiga, while very few towers are located in the other biomes (Tundra, Desserts and Xeric Shrublands, and Tropical and Subtropical biomes).This distribution of towers is expected to affect the sensitivity of atmospheric CO 2 data to NEE within those biomes.The year 2008 is used as it includes the expansion of continuous measurement locations from the Mid-Continent Intensive (MCI) project (Miles et al., 2012;Ogle et al., 2006).Atmospheric CO 2 measurements are processed and averaged to 3-hourly intervals as described in Gourdji et al. (2012).Data from all hours of the day are used for tall towers with a height over 300 m, while afternoon data are used for most short towers (lower than 100 m), and nighttime data are used for sites with complex topography (e.g., Niwot Ridge -NWR), as detailed in Shiga et al. (2014).We further remove data that are strongly influenced by only a few 1 • ×1 • grid cells, in order to exclude data that are likely subject to systematic transport model errors (Göckede et al., 2010;Gourdji et al., 2012;Peters et al., 2007).The resulting total number of observations is n = 28 717.
To remove the effect of boundary conditions, we pre-subtract the GLOBALVIEW-CO 2 boundary condition (GLOBALVIEW-CO 2 , 2010) from atmospheric measurements as in Gourdji et al. (2012).We further remove the impact of fossil fuel emissions by pre-subtracting concentrations modeled based on the VULCAN-ODIAC fossil fuel emissions inventory (Shiga et al., 2014).

Sensitivity footprints from atmospheric transport model
The sensitivity of the available atmospheric observations to underlying CO 2 fluxes (in units of ppmv (µmol m −2 s −1 ) −1 ) is quantified as described in Gourdji et al. (2012).In brief, footprints are derived from the Stochastic Time-Inverted Lagrangian Transport (STILT) model (Lin et al., 2003), driven by meteorological fields from the Weather Research and Forecast (WRF) model (Skamarock and Klemp, 2008).The STILT transport model has been used and examined extensively at regional and continental scales (Chatterjee et al., 2012;Gourdji et al., 2010Gourdji et al., , 2012;;Huntzinger et al., 2011;Kort et al., 2008;McKain et al., 2012).Footprints are also used to generate synthetic observational time series based on TBM flux simulations.

Terrestrial Biospheric Models (TBMs)
We use simulations from four TBMs to evaluate the proposed approach, namely CASA-GFED (van der Werf et al., 2006), SiB3 (Baker et al., 2008), ORCHIDEE (Krinner et al., 2005) and VEGAS2 (Zeng et al., 2005), using the runs submitted to the NACP RIS activity.These four models were selected for analysis because of the availability of 3-hourly NEE flux output.While CASA-GFED and VEGAS2 have a coarser native temporal resolution, their NEE fluxes have been downscaled to a 3-hourly resolution as described in Huntzinger et al. (2011).Our evaluation is based on the overall NEE simulated by each TBM, although model definitions of NEE differ: CASA-GFED includes fire disturbance, while other models do not; ORCHIDEE excludes crop harvest, while others do not.A comparison and summary of these simu-lations can be found in Table S1 in the Supplement.Further details on the NACP RIS simulations can be found in Huntzinger et al. (2012).

Regression framework linking atmospheric CO 2 to NEE
The overall goal of the proposed approach is to evaluate the spatiotemporal variability of NEE as simulated by various TBMs using atmospheric CO 2 measurements.Such an approach must be based on an inverse model that can infer NEE from atmospheric CO 2 measurements.It must also include a statistical model selection component to evaluate the degree to which NEE patterns predicted by TBMs are useful in explaining the observed atmospheric CO 2 variability.Rather than quantifying the magnitude of NEE, the primary goal here is to evaluate the spatiotemporal NEE patterns (at a 1 • × 1 • and 3-hourly resolution) within specific biomes of North America and for specific months.The approach presented here builds on the geostatistical inverse modeling (GIM) framework (Gourdji et al., 2010(Gourdji et al., , 2012;;Michalak et al., 2004), but is presented here in the form of a regression analysis to simplify the presentation and to emphasize the model selection aspect of the proposed approach.
To this end, we first formulate a multi-linear regression framework that relates atmospheric observations to NEE spatiotemporal variability.Statistical model selection is then applied to determine whether, when, and where the spatiotemporal variability of simulated NEE is consistent with that evident from variability in atmospheric CO 2 .Here, the NEE spatiotemporal variability is defined at a 1 • × 1 • spatial and 3-hourly temporal resolution, and the TBMs are evaluated within specific biome-month combinations.Figure 2 shows the distribution of NEE in one specific biome-month combination (i.e., Boreal Forests and Taiga in July) as an example.
To link atmospheric measurement to surface fluxes, we first define the observed atmospheric CO 2 concentrations, with the influence of boundary conditions and fossil fuel emissions pre-subtracted, as where z is an n×1 vector of atmospheric CO 2 observations, s is an m×1 vector of the underlying NEE fluxes at 1 • ×1 • and 3-hourly resolution, H (n × m) are the sensitivity footprints, namely a Jacobian matrix representing the sensitivity of each observation to each underlying flux (i.e., ∂z i ∂s j ) as quantified using an atmospheric transport model (see Sect. 2.2), and ε (n × 1) is the model-data mismatch term that represents any discrepancies between observed (z) and modeled (Hs) CO 2 mixing ratios.The model-data mismatch term encompasses the influence of errors in the boundary conditions, errors in the fossil fuel inventory, representation errors, aggregation errors, transport model errors, and measurement errors.These errors are assumed to have zero mean and to be Y.Fang et al.: Using atmospheric observations to evaluate the spatiotemporal variability of CO 2 fluxes uncorrelated across measurements, with their variances represented by a diagonal covariance matrix R (n × n).The dimensions of the matrices and vectors are based on the total number of observations, n = 28 717, and the total number of fluxes at a 1 • ×1 • (2635 such grid cells within the domain used here) and 3-hourly resolution (366 × 8 = 2928) such periods within the span of the 1-year inversion), m = 2635 × 2928 = 7 715 280.
The spatiotemporal NEE distribution of s is represented as a linear model of NEE as predicted by various TBMs within specific biome-month combinations: where X is an m × p matrix with each column representing NEE 1 • ×1 • 3-hourly spatiotemporal variability within a specific biome-month combination from a specific TBM, such that a given column is populated by the modeled NEE from a given TBM for a given biome-month for those rows (i.e., elements of s) corresponding to that specific biomemonth combination, while the remainder of the column is filled with zeros.These individual columns of X are thus predictor variables for the dependent variable s.With 7 biomes (Fig. 1) and 12 months, there are a total of 84 possible predictor variables per TBM (i.e., p ≤ 84 for one TBM).The p × 1 vector β represents the drift coefficient describing the relationship between X and s, and Xβ together thus represents a statistical model of the trend of NEE.The m × 1 vector ξ represents the portion of the variability of s that cannot be explained by the predictor variables in X, and these deviations are modeled as having a mean of zero and a covariance matrix Q (m × m) that represents how the flux deviations from the model of the trend (i.e., s − Xβ) are correlated in time and space.
Combining these two equations, we represent the atmospheric observations z in terms of the NEE predictor variables X: where z is seen to have a spatiotemporally variable mean HXβ and, assuming independence between ξ and ε, a residual covariance of where "T" is the matrix transpose operation.From a statistical standpoint, our goal then becomes to select a subset of TBM biome-month combinations that captures a substantial portion of the CO 2 variability observed in z.This constitutes a classical statistical model selection problem, in which we examine which predictor variables (candidate columns in X) are useful in explaining the atmospheric CO 2 measurements (z).
A widely applied approach for statistical model selection is the Bayesian Information Criterion (BIC) (Schwarz, 1978).BIC takes into account both the goodness of fit, i.e., the residual sum of squares (RSS), and the number of auxiliary variables (k) in each candidate model, and can be used to compare non-nested candidate models.BIC has also been adapted for use with spatiotemporally autocorrelated residuals (Hoeting et al., 2006;Mueller et al., 2010) and within the context of atmospheric inversions where atmospheric observations are used to inform underlying surface fluxes (Gourdji et al., 2012), making it ideal for the application presented here.The standard expression for BIC is where RSS represents the residual sums of squares of a given candidate model X c , is the n × n covariance matrix of the residuals (Eq.4), | | denotes the matrix determinant, and k is the number of parameters in a particular candidate model.For the specific application presented here (Eq.1-4), and factoring out the unknown drift coefficients, β and RSS become as in Gourdji et al. (2012): The specific covariance parameters needed to define Q and R, which are needed to define , vary between experiments, and are obtained as described in the Supplement.
Model selection built on this framework aims to identify the "best" model of the trend based on a tradeoff between model size and the model's power in explaining the variations in observed atmospheric CO 2 .Here, the "best" model is specially defined as one with the minimum BIC value, providing an optimal balance between model complexity and model fit.To identify this model, BIC is compared across all possible combinations of predictor variables (i.e., 84 NEE biome-months per TBM).Due to the large number of candidate predictor variables considered here, we implement the branch-and-bound algorithm of Yadav et al. (2013) to increase computational efficiency.
The final selected subset of TBM biome-months represents those biomes and months within which a given TBM exhibits spatiotemporal variability that explains a substantial portion of the variability observed in the observations z (see Eq. 3).For a given TBM biome-month distribution to be "selected" as part of the "best" model of the trend, therefore, (1) the available atmospheric observations must be sensitive to the spatiotemporal variability of fluxes within that biomemonth (as represented by H); i.e., the information contained in atmospheric data sufficiently constrains the spatiotemporal variability within that biome-month, and (2) the variability within a particular biome-month as represented by a particular TBM must explain a sufficient portion of the variability in the atmospheric observations to offset the penalty term in Eq. ( 5), i.e., the reduction in RSS must outweigh the penalty term.On the contrary, if a given TBM biome-month distribution is "not selected", then either (1) or (2) as given above is not satisfied, i.e., either that atmospheric observations are not sensitive to the NEE variability within that biome-month, or that the NEE variability as represented in the model is inconsistent with atmospheric observations.In other words, selecting or not selecting a TBM biome-month combination directly reflects on the performance of the TBM in that biome and month, as long as we have fulfilled the requirement in (1) above.If the condition in ( 1) is not met, we are not able to use the model selection results to examine model performance, due to the insufficient coverage of the network.We henceforth refer to the TBM biome-month combinations included in the final selected subset as the "selected" combinations or elements, or alternately as the TBM biome-month combinations "identified" using the atmospheric data.

Synthetic data and real data experiments
In this section, we describe the design of a series of Synthetic Data (SD) experiments (Fig. 3) in which the underlying fluxes are prescribed, to assess the sensitivity of atmospheric CO 2 measurements to NEE flux spatiotemporal patterns within all biome-month combinations, and identify when and where results from the proposed approach reliably reflect model performance in simulating NEE spatiotemporal variability.We further introduce two Real Data (RD) experiments as a proof-of-concept demonstration of our approach.
In those RD experiments, we use actual atmospheric CO 2 measurements to evaluate the spatiotemporal variability of NEE as simulated by four prototypical TBMs (Sect.2.3).
In the SD experiments, synthetic atmospheric observations (z) are generated as described in Eq. (1) using fluxes (s TBM ) that include NEE as simulated by one of the TBMs and, in some cases, spatiotemporally-correlated flux residuals (ξ ) and model-data mismatch errors (ε), i.e., z = H (s TBM + ξ )+ ε.The superset of candidate ancillary variables (Fig. 3, X) includes NEE from one or more TBMs.TBMs included in s TBM and X are henceforth denoted as the "truth" and the "candidate(s)", respectively.
The first SD case study, SD-one-∅∅ (Fig. 3), is designed to investigate whether, when, and where the information contained in current atmospheric data enables the identification of the correct candidate TBM for a case where it is the only TBM considered in the model selection, where this TBM fully represents the variability in the synthetic atmospheric observations (ξ = 0), and where no model-data mismatch errors are included in the simulation (ε = 0).Given that, in this case, the candidate TBM explains all of the variability in the synthetic atmospheric observations, it should always be selected if the atmospheric data are sufficiently sensitive to NEE across all biome-months; hence, biome-months for which the TBM is not selected are ones to which the atmospheric CO 2 observations are not sufficiently sensitive to offset the penalty term in Eq. ( 5).
The second and third SD case studies, SD-one-∅ε and SD-one-ξ ε(Fig.3 represents a portion of the true underlying NEE variability.In these case studies, noise (ε) is added to observations, generated as a random vector of independent normally-distributed values with variances corresponding to the diagonal elements of R, which are inferred from the RD-all-ξ ε experiment (described below), and a mean of 0. In addition, for SD-oneξ ε, the flux signal from the TBMs is augmented with additional spatially-correlated fluxes (ξ ) generated as a random vector of normally distributed values with a covariance structure equal to that inferred from the RD-all-ξ ε experiment (described below).The details of the model-data mismatch errors and flux residuals are summarized in the Supplement.
The final SD case study, SD-all-ξ ε, builds on SD-one-ξ ε (Fig. 3), but is designed to test whether the correct TBM can be identified when all four TBMs are used as candidate variables.This case study therefore explores whether current atmospheric observations can be used to differentiate between candidate TBMs.No constraints are placed on the model selection, such that more than one TBM can be selected for the same biome-month, but only the dominant TBM (i.e., the one with the largest β, Eq. 6) is discussed in analyzing this case.
Finally, two RD case studies, RD-one-ξ ε and RD-all-ξ ε, are defined analogously to SD-one-ξ ε and SD-all-ξ ε, respectively, to test the applicability of our approach further by examining the actual performance of the four prototypical TBMs based on available atmospheric observations.The observations (z) here are the actual atmospheric measurements, which by definition include model-data mismatch errors, and the flux residuals are also inherently present, as no TBM is expected to reflect the true underlying fluxes perfectly.
In each RD-one-ξ ε experiment, one of the four prototypical TBMs is used as the candidate TBM in order to assess individual TBM performance.In RD-all-ξ ε, all four TBMs are included, analogously to SD-all-ξ ε, to identify the TBM (if any) that best represents the spatiotemporal variability of NEE within a given biome-month, based on the information provided by the atmospheric measurements.

Sensitivity of atmospheric observations to NEE spatiotemporal variability and evaluation of the proposed approach
The SD-one-∅∅ experiment examines the sensitivity of atmospheric observations to underlying flux variability, and evaluates the proposed approach under idealized conditions where the true flux field is perfectly represented by the candidate TBM model, and where no model-data mismatch errors are included in the synthetic atmospheric observations.Results indicate that the candidate TBM is selected for over 90 % of all biome-months (Fig. 4, top row), demonstrating that atmospheric observations are sensitive to NEE spatiotemporal variability, and that the proposed approach leverages this sensitivity to correctly identify the TBM model as being representative of the flux variability within the vast majority of biomes and months.The only notable exception is for the Tundra biome, for which, other than during the height of the growing season, the atmospheric data do not provide a sufficient constraint on the flux variability, due to the poor data coverage and the weak biospheric signal.Because this biome is expected to play an important role in the future global carbon cycle and climate (Belshe et al., 2013;Ping et al., 2008;Schuur et al., 2009;Tarnocai et al., 2009), and large uncertainties remain in quantifying its role in carbon cycling (McGuire et al., 2012), this result highlights the need for strategic placement of additional CO 2 monitoring stations in the vicinity of this biome to constrain its carbon flux distribution.
The SD-one-∅ε and SD-one-ξ ε case studies examine the degree to which the presence of model-data mismatch errors and additional flux variability not represented by the candidate TBM limit the information content of available observations, and the ability of the proposed approach to identify the consistency between the true underlying NEE patterns and those simulated by TBMs.
Results of SD-one-∅ε show that including realistic modeldata mismatch errors decreases the information content of atmospheric observations to the point where a TBM that in reality represents the full spatiotemporal flux variability is not selected for many months and TBMs within the Tropical and Subtropical biome, as well as the desert and xeric shrublands biome, in addition to the Tundra biome that was not well constrained even under idealized conditions (Fig. 4, middle row).The identification of a TBM as correctly representing the flux patterns also becomes more challenging during winter and spring within the Boreal Forests and Taiga biomes, and the Temperate Coniferous Forests biome (Fig. 4, middle row), especially when VEGAS2 is used as the true flux distribution.This result is related to the fact that the magnitude and the spatiotemporal variability of NEE simulated by VEGAS2 within those biome-months are much smaller than for other TBMs.For example, the standard deviation of NEE simulated by VEGAS2 is less than half of that of other TBMs.Overall, the inclusion of realistic modeldata mismatch, combined with the coverage of the monitoring network, makes the identification of TBMs that represent the spatiotemporal variability of fluxes within biomes unreliable for three of the seven biomes considered here, namely the Tundra, Tropical and Subtropical, and Desert and Xeric Shrublands biomes.Subsequent analyses therefore focus on the remaining four better-constrained biomes, namely the (i) Boreal Forests and Taiga, (ii) Temperate coniferous forests, (iii) Temperate Grasslands, Savannas, and Shrublands, and (iv) Temperate Broadleaf and Mixed Forests biomes.
SD-one-ξ ε is the most realistic single-TBM synthetic data experiment, as it includes not only model-data mismatch errors, but also variability in the spatiotemporal flux distribution that is not represented by the candidate TBM.Results for the better-constrained biomes indicate that the ability to identify a model as correctly representing a portion of the Average numbers of months within each season for which the candidate TBM is selected for the SD-one-∅∅, SD-one-∅ε and SD-one-ξ ε case studies (Fig. 3).Grey shading in SD-one-ξ ε represents biomes that were determined to be not sufficiently well constrained by available atmospheric data.DJF: December, January, February; MAM: March, April, May; JJA: June, July, August; SON: September, October, November.The criteria for grey areas include (1) no models being selected in one season, or (2) the overall model selection being less than 50 % in a year.true flux variability deteriorates in the winter months for the Boreal Forests and Taiga, but remains largely unchanged in the other biomes (Fig. 4, bottom row).For the winter in the Boreal Forests and Taiga biome, the TBM is only identified when the fluxes are based on SiB3, likely because this TBM has a stronger flux signal in this biome during the winter relative to the other TBMs, thereby overcoming the confounding impacts of model-data mismatch errors and additional flux variability unexplained by the TBM.
Results of SD-one-ξ ε indicate that under realistic conditions, the proposed approach is able to correctly identify a TBM that represents a portion of the true underlying flux variability within four of the seven biomes considered here, given the monitoring network used here.The magnitude of the model data mismatch used here was derived from the real-data experiments (RD-one-ξ ε), and includes the impact of errors in the transport model, boundary conditions, fossil fuel emissions, and fire emissions, as well as measurement and aggregation errors.Therefore, results suggest that conclusions over the four considered biomes are robust in spite of the influences of those uncertainties.We acknowledge that the errors applied here do not fully address the complexity of uncertainties in the real world, as we assume that errors to be independent and follow a Gaussian distribution.However, the results presented here, together with evidence from the literature (e.g., Gourdji et al., 2012;Pillai et al., 2012), support the ability to infer flux patterns despite the many sources of uncertainty in regional inversions.
The final SD case, SD-all-ξ ε, is designed to explore whether atmospheric observations can be used to differentiate among several competing TBMs to identify the TBM that best represents the underlying flux variability.Results indicate that across the majority of the examined biomes, months, and TBMs, the proposed approach combined with the available atmospheric data are able to discriminate among models for a similar fraction of TBM-biome-month combination (Fig. 5) as when only the "correct" TBM was offered as a candidate model (SD-one-ξ ε, Fig. 4, bottom row).One noticeable difference, however, occurs during the growing season in the Boreal Forests and Taiga, when VE-GAS2 or CASA-GFED is used to represent a substantial portion of the true flux variability.In these cases, the other of these two models is often identified in the model selection procedure.This is not surprising, because these two models yield fluxes that are highly spatiotemporally correlated to one another (Fig. 6), and because biospheric signals simulated by VEGAS2 are particularly weak (Huntzinger et al., 2011).Overall, therefore, for the four better-constrained biomes, the information content of the atmospheric data is sufficient to identify a TBM that represents a substantial portion of the true underlying variability using the proposed approach, even when multiple competing TBMs are available.In other words, atmospheric observations can be used to differentiate among competing TBMs.The exception, not surprisingly, is when the competing TBMs have fluxes that are highly correlated (R > 0.8), which, for the four TBMs .Average numbers of months within each season for which the candidate TBM is selected for the SD-all-ξ ε case study (Fig. 3).Grey shading represents biomes that were determined to be not sufficiently well constrained by available atmospheric data.DJF: December, January, February; MAM: March, April, May; JJA: June, July, August; SON: September, October, November.examined here, occurs most often over the Boreal Forests and Taiga and Temperate Coniferous Forests biomes (where biospheric signals are relatively weak and atmospheric data are less sensitive), for the VEGAS2 and CASA-GFED, as well as the SiB3 and ORCHIDEE model pairs (Fig. 6).

Demonstration of the proposed approach using atmospheric observations
The results presented in Sect. 5 confirm that, given the coverage of atmospheric data available in 2008, the proposed approach is able to identify TBMs representing a substantial portion of the underlying NEE spatiotemporal variability over four better-constrained biomes of North America throughout most of the year.In this section, by focusing on the RD experiment results, we demonstrate the application of the proposed approach using "real" data, by evaluating four prototypical TBMs participating in the NACP RIS.

Performance of TBMs in simulating the spatiotemporal variability of NEE
The RD-one-ξ ε case study includes four experiments, each evaluating one prototypical TBM.As a general indication of individual TBM performance across biomes and months, we sum the number of candidate TBMs selected across the four RD-one-ξ ε cases (Fig. 7).We find that the capability of TBMs to simulate the NEE spatiotemporal variability varies strongly across biomes and seasons.TBMs are most frequently identified over the Temperate Broadleaf and Mixed Forests biome (7 out of 12 months, with at least one TBM identified), and least frequently identified over the Boreal Forests and Taiga biome (3 out of 12 months, with at least one TBM identified).Across seasons, TBMs are most frequently identified during the growing season (May-September, 15 out of 20 biome-months, with at least one TBM identified).TBMs are least frequently identified during transition seasons (March-April and October-November, with 2 out of 16 biome-months, with at least one TBM identified), likely reflecting known challenges of TBMs in representing the seasonal cycle of phenology (Richardson et al., 2012;Schaefer et al., 2012;Schwalm et al., 2010).Specifically, during October-November, none of the TBMs is identified as representing the flux spatiotemporal variability in any of the biomes, in agreement with the finding in Gourdji et al. (2012) that carbon fluxes simulated by over 70 % of the NACP RIS TBMs are outside the 95 % confidence intervals of atmospheric inversion estimates in October.Of all 48 biome-months examined, none of the four TBMs are identified as substantially representing the spatiotemporal variability within 27 biome-months, and only one TBM is identified in 5 additional biome-months (Fig. 7).Multiple TBMs are identified as representing a portion of the spatiotemporal variability within the remaining 16 biomemonths (Fig. 7).Interestingly, SiB3 and ORCHIDEE are selected in almost all of these 16 biome-months, suggesting that they both have the potential to explain a substantial portion of the observed variability in atmospheric CO 2 .This is consistent with the similarity in NEE spatiotemporal series between SiB3 and ORCHIDEE shown in Fig. 6.
The RD-all-ξ ε case study identifies the TBM that best represents the underlying flux variability (Fig. 8).Out of 27 biome-months for which no individual TBM was selected in the RD-one-ξ ε experiments, 5 biome-months lead to models being selected when more than one model can be used in combination, with the dominant TBM being ORCHIDEE over the Temperate Coniferous Forests biome in April and May and the Temperate Broadleaf and Mixed Forests in February, SiB3 over the Boreal Forests and Taiga in August, and VEGAS2 over the Temperate Grasslands, Savannas and Shrublands in December.
Overall, SiB3 and ORCHIDEE are selected as the dominant TBM in explaining the flux variability as observed through the atmospheric CO 2 measurements, more often than VEGAS2 and CASA-GFED (Fig. 8).SiB3 appears most representative of flux patterns over boreal biomes, whereas ORCHIDEE is most representative over temperate biomes.Although SiB3 is selected most often (13 biome-months), followed by ORCHIDEE (10 biome-months), none of the TBMs is consistently better than the others across all biomes and seasons.

Evaluation of the TBMs and the proposed approach within the context of earlier studies
To further evaluate the performance of, and added value provided by, the proposed approach, we assess the RD-one-ξ ε results within the context of the existing literature to determine whether (1) results are consistent with the literature wherever they are comparable, and whether (2) the proposed approach can provide insights that go beyond those provided by other model evaluation strategies.Many of our findings are consistent with early work analyzing the examined TBMs within the framework of the NACP RIS.For example, we find distinctive seasonal differences in TBM performance in simulating NEE (Figs. 7  and 8), consistent with the previously noted model misrepresentation of phenology seasonality based on site-level measurements (Richardson et al., 2012;Schaefer et al., 2012;Schwalm et al., 2010).In addition, we find that models perform better for Temperate Broadleaf and Mixed Forests, and that SiB3 appears to be more consistent with observations than other models, both of which are consistent with the existing literature evaluating NACP RIS models (Raczka et al., 2013;Schwalm et al., 2010).The consistency between our results and existing literature further supports the performance of the proposed approach.It also implies that, although the approach proposed here is subject to many of the same uncertainties in fossil fuel emissions, fire disturbance, boundary conditions and transport models that affect all regional inversions, the main conclusions regarding TBM performance for the four major biomes examined here are quite robust.
The proposed approach also provides the opportunity to draw conclusions that go beyond the current literature.We present two examples here.
First, results indicate that model capability in simulating the spatiotemporal variability (i.e., patterns) of NEE varies strongly with seasons, with greater skill during the growing season than during the transition seasons.In other words, even within specific biomes and months, the variability of NEE is better represented during the growing season.This seasonal variability in model performance may be due to seasonal differences in the dominant environmental drivers controlling the spatiotemporal variability of NEE.For example, Mueller et al. (2010) found that the environmental drivers controlling NEE in a hardwood forest vary across seasons, with radiation, nighttime temperature and vegetative radia-tion indices (e.g., the fraction of photosynthetically active radiation fPAR) dominating during the growing, non-growing and leaf-out seasons, respectively.We hypothesize that the seasonal differences in model performance are likely related to the models' ability to represent the seasonally-varying influence of such environmental drivers.Because the NEE spatiotemporal variability is directly related to environmental processes and drivers (Beer et al., 2010;Mueller et al., 2010;Yadav et al., 2012;Gourdji et al., 2012), the proposed approach provides a close link between model performance and environmental processes.
Second, we find that SiB3 and ORCHIDEE are identified more often as representing the spatiotemporal flux variability than VEGAS2 and CASA-GFED.Given that the simulated NEE spatiotemporal variability is more similar between SiB3 and ORCHIDEE, and between VEGAS2 and CASA-GFED, relative to across these two model pairs (Fig. 6), this finding suggests that aspects of the model internal structure common within the pairs likely contribute to similarities in simulated flux patterns and associated performance.Such features include (1) SiB3 and ORCHIDEE using Enzyme Kinetic (EK) models, while CASA-GFED2 and VEGAS use Light Use Efficiency (LUE) models to formulate their photosynthesis processes, (2) the native model time step of SiB3 and ORCHIDEE being shorter than a day, while that of CASA-GFED and VEGAS2 varies from daily to monthly, and (3) SiB3 and ORCHIDEE having substantially more plant functional types than CASA-GFED and VEGAS2.Although it is not possible to draw definite conclusions about the links between model structure and model performance in simulating flux patterns due to the small number of TBMs examined here and the lack of a uniform simulation protocol, a future application of this approach to a larger ensemble of models following a uniform protocol would make it possible to explore these connection in more detail.

Concluding remarks
In this paper, we present, evaluate and demonstrate a statistical approach based on GIM and BIC to evaluate the spatiotemporal variability of NEE as simulated by TBMs against atmospheric CO 2 concentration measurements from 35 towers in North America in 2008.We demonstrate the applicability of this approach by evaluating four prototypical TBMs participating in the NACP RIS.
We first design a series of synthetic data experiments in which the underlying fluxes are prescribed, to test the proposed approach and examine whether, when, and where atmospheric measurements are sensitive to, and hence can constrain, the spatiotemporal variability simulated by different TBMs.We find that due to the poor data coverage and weaker biospheric signals, current atmospheric observations cannot be used to reliably assess the flux spatiotemporal variability in the Tundra, Desert and Xeric Shrublands, and Tropical and Subtropical biomes.The remaining four biomes (i.e., Temperate Broadleaf and Mixed Forests, Temperate Grasslands, Savannas and Shrublands, Boreal Forests and Taiga, and Temperate Coniferous Forest), however, are found to be sufficiently well constrained by atmospheric data.Over these four biomes, the synthetic data experiments suggest that the proposed model selection approach, combined with the available atmospheric data, is able to identify the TBMs that represent a substantial portion of the underlying flux variability, as well as to differentiate between multiple competing TBMs.
We further test and demonstrate the application of the approach by evaluating the performance of four prototypical TBMs that have been extensively assessed in the literature using actual atmospheric observations.We find that conclusions about model performance are consistent with the existing literature for cases where results are comparable, further supporting the applicability of our approach.Those results include (1) TBMs representing fluxes best during the growing season (May-September) and least consistently with atmospheric observations during the transition seasons, especially in October and November, and (2) TBMs appearing to perform best over the Temperate Broadleaf and Mixed Forests biome.The experiments performed here also lead to new conclusions about the examined TBMs.For example, results show that SiB3 and ORCHIDEE appear to represent the flux variability within individual biomes and months better relative to CASA-GFED and VEGAS2.In addition, this approach has the potential to link model performance with environmental processes, making it possible to test the hypothesis that seasonal differences in TBM performance reflect models' ability to represent the seasonal variability in the dominant environmental controls on fluxes.
The comparison conducted here only includes four TBMs, and is intended primarily as a demonstration of the proposed approach.Furthermore, these four TBMs were not run using a uniform experimental protocol (Huntzinger et al., 2012), precluding any conclusive results about linkages between model performance and model structure.Applying the approach presented here to a larger ensemble of models, ideally following a uniform simulation protocol, therefore represents a logical next step.
The Supplement related to this article is available online at doi:10.5194/bg-11-6985-2014-supplement.

Figure 1 .
Figure 1.North American biomes, modified from Olson (2001), as defined for the case studies; white triangles indicate the locations of atmospheric CO 2 measurement towers used in the analysis.

Figure 2 .
Figure 2. Illustration of the 1 • ×1 • and 3-hourly spatiotemporal variability of NEE simulated by CASA-GFED for Boreal Forests and Taiga in July.A vector including these 1 • ×1 • and 3-hourly fluxes corresponds to one ancillary variable (i.e., one column) in X.

Figure 3 .
Figure 3. Illustration of Synthetic Data (SD) case studies as described in Sect. 4.
), are analogous to SD-one-∅∅, but include model-data mismatch errors (ε = 0, denoted by ε) and/or spatially correlated flux residuals (ξ = 0, denoted by ξ ).These case studies are designed to test the degree to which current atmospheric observations can inform the spatiotemporal variability of NEE in cases with realistic modeldata mismatch errors and/or where the candidate TBM only www.biogeosciences.net/11etal.: Using atmospheric observations to evaluate the spatiotemporal variability of CO 2 fluxes

Figure 4 .
Figure 4. Average numbers of months within each season for which the candidate TBM is selected for the SD-one-∅∅, SD-one-∅ε and SD-one-ξ ε case studies (Fig.3).Grey shading in SD-one-ξ ε represents biomes that were determined to be not sufficiently well constrained by available atmospheric data.DJF: December, January, February; MAM: March, April, May; JJA: June, July, August; SON: September, October, November.The criteria for grey areas include (1) no models being selected in one season, or (2) the overall model selection being less than 50 % in a year.

Figure 5
Figure5.Average numbers of months within each season for which the candidate TBM is selected for the SD-all-ξ ε case study (Fig.3).Grey shading represents biomes that were determined to be not sufficiently well constrained by available atmospheric data.DJF: December, January, February; MAM: March, April, May; JJA: June, July, August; SON: September, October, November.

Figure 6 .
Figure 6.The correlation coefficient of NEE spatiotemporal series as simulated by different TBMs throughout 2008 for the four biomes better constrained by available atmospheric observations.TGSS: Temperate Grasslands, Savannas, Shrublands; Bore: Boreal Forests and Taiga; TCoF: Temperate Coniferous Forests; TBMF: Temperate Broadleaf and Mixed Forests.

Figure 7 .
Figure 7. Number of TBMs that are selected for each biome-month in the RD-one-ξ ε case study.Grey shading represents biomes that were determined to be not sufficiently well constrained by available atmospheric data.

Figure 8 .
Figure8.The TBM that explains the most variability in atmospheric measurements for a given biome-month, as identified by the RD-all-ξ ε experiment.Grey shading represents biomes that were determined to be not sufficiently well constrained by available atmospheric data.