Causes of variation in soil carbon simulations from CMIP5 Earth system models and comparison with observations

. Stocks of soil organic carbon represent a large component of the carbon cycle that may participate in climate change feedbacks, particularly on decadal and centen-nial timescales. For Earth system models (ESMs), the ability to accurately represent the global distribution of existing soil carbon stocks is a prerequisite for accurately predicting future carbon–climate feedbacks. We compared soil carbon simulations from 11 model centers to empirical data from the Harmonized World Soil Database (HWSD) and the Northern Circumpolar Soil Carbon Database (NCSCD). Model estimates of global soil carbon stocks ranged from 510 to 3040 Pg C, compared to an estimate of 1260 Pg C (with a 95 % conﬁdence interval of 890–1660 Pg C) from the HWSD. Model simulations for the high northern latitudes fell between 60 and 820 Pg C, compared to 500 Pg C (with a 95 % conﬁdence interval of 380–620 Pg C) for the NC-SCD and 290 Pg C for the HWSD. Global soil carbon varied 5.9 fold


Introduction
Soil organic carbon is the largest carbon pool in the terrestrial biosphere (Jobbagy and Jackson, 2000), and losses of soil carbon due to climate change could contribute to rising atmospheric CO 2 concentrations. Loss rates for soil carbon through heterotrophic respiration depend on temperature (Davidson and Janssens, 2006;Lloyd and Taylor, 1994), moisture (Orchard and Cook, 1983;Ryan and Law, 2005), and disturbance regimes such as land use change (Post and Kwon, 2000) and fire (Harden et al., 2000). The sensitivity of many of these drivers to climate change creates the potential for feedbacks that may accelerate or decelerate the buildup of greenhouse gases in the atmosphere (Young and Steffen, 2009).
Although field studies and atmospheric-ocean carbon measurements suggest that the terrestrial biosphere is currently a net sink for carbon dioxide (Houghton, 2007;Lund et al., 2009;Le Quéré et al., 2009), it is unclear if this sink will persist as climate changes. Projections from recent Earth system models (ESMs) suggest that the magnitude of the sink is likely to decline in response to climate change over the 21st century (Cramer et al., 2001;Friedlingstein et al., 2006;Koven et al., 2011). However, this response is highly uncertain (Friedlingstein et al., 2006) and depends, in part, on the strength of feedbacks from the nitrogen cycle (Thornton et al., 2009) and the impact of drought stress on net primary production (NPP), tree mortality, and fires (Goulden et al., 2011;Huntingford et al., 2008;Phillips et al., 2009).
In northern ecosystems, permafrost soils contain large stocks of carbon (Tarnocai et al., 2009) that are particularly vulnerable to loss with climate change (Schuur et al., 2008;Zimov et al., 2006), given the large temperature increases expected for the region (Giorgi, 2006). Models of permafrost soil carbon have only recently been integrated into ESMs (Koven et al., 2011) and further improvements in the representation of thermokarst dynamics, peat accumulation, and soil hydrology are needed to reduce uncertainties related to climate-carbon feedbacks in northern biomes.
Because future climate projections depend on the carbon cycle, ESMs must be capable of accurately representing the pools and fluxes of carbon in the biosphere, particularly in soils that store a large fraction of terrestrial organic carbon. However, there have been few quantitative assessments of ESM skill in predicting soil carbon stocks, contributing to uncertainty in model simulations. To help reduce this uncertainty, we analyzed simulated soil carbon from ESMs participating in the 5th Climate Model Intercomparison Project (CMIP5). If ESMs can accurately represent current soil carbon stocks, then we might have more confidence in their ability to predict future stocks under a changing climate (Luo et al., 2012).
Our analysis had three specific goals: (1) quantify the variation in ESM representation of soil carbon stocks, (2) understand the driving factors regulating soil carbon distribution in ESMs,and (3) compare the ESM soil carbon stocks to empirical data. We conducted these analyses at grid (1 • × 1 • ), biome, and global scales across models to assess spatial variability in the data and model simulations. We compared model outputs to the global Harmonized World Soil Database (FAO/IIASA/ISRIC/ISSCAS/JRC, 2012) and the Northern Circumpolar Soil Carbon Database (Tarnocai et al., 2009). We used an additional data set at high latitudes because these areas contain a large percentage of global soil carbon but are difficult to model and to measure empirically. We expected ESMs to represent high latitude soils poorly because many of the terrestrial decomposition models were developed for mineral soils, as opposed to the organic soils found in many high latitude ecosystems (Koven et al., 2011;Neff and Hooper, 2002;Ping et al., 2008). More generally, we expected that the global distribution of soil carbon in the ESMs would be primarily driven by NPP, soil temperature, and soil moisture. We also anticipated that ESMs with more soil carbon pools would be capable of representing a wider range of soil carbon dynamics, and thus would yield more accurate simulations when compared to observations.

Materials and methods
We examined soil carbon stocks in 16 ESMs (Tables 1 and S1 in Supplement) from the 5th Climate Model Intercomparison Project (CMIP5). The model simulations were compared with the Harmonized World Soil Database (HWSD) (FAO/IIASA/ISRIC/ISSCAS/JRC, 2012) and high latitude soil carbon stocks from the Northern Circumpolar Soil Carbon Database (NCSCD) (Tarnocai et al., 2009). We analyzed the underlying drivers of soil carbon variability with a set of reduced complexity models.

Earth system models
ESMs from CMIP5 use common simulation and output protocols, enabling direct comparisons between models. One of the goals of CMIP5 is to facilitate benchmarking of ESMs through the historical simulation protocol, which has a prescribed time series of atmospheric CO 2 mixing ratios and land use change (Taylor et al., 2011). ESMs were selected from the CMIP5 repository based on the availability of soil carbon and other key output variables for the historical simulation, as well as consultation with modeling centers.
of temperature (T ) relative to a baseline (T 0 ), such that f (T ) = Q (T −T 0 )/10 10 . In this equation the Q 10 value is often set to between 1.5 and 2.5 based on estimates inferred from ecosystem flux measurements (Mahecha et al., 2010;Raich and Schlesinger, 1992) or the annual cycle of atmospheric CO 2 (Kaminski et al., 2002;Randerson et al., 2002). In some of the models, the temperature sensitivity of decomposition follows neither a Q 10 nor Arrhenius relationship. For example, in BCC-CSM1.1 and GFDL-ESM2G, decomposition rate increases up to some optimal temperature and then de-creases (Ji et al., 2008;Parton et al., 1987;Shevliakova et al., 2009). For the GISS-E2 model, the soil respiration response to temperature is a linear fit to data from Del Grosso et al. (2005) up to 30 • C, with a plateau above 30 • C. In all of the models, decomposition either increases monotonically with increasing soil moisture or increases up to some optimum moisture level and then decreases. Three ESMs include nitrogen interactions with soil carbon: CCSM4, NorESM1, and BCC-CSM1.1. We downloaded soil organic carbon, litter carbon, annual NPP, 2 m air temperature, soil temperature, and total soil water from the historical simulation, where available, for each ESM (cSoil, cLitter, npp, tas, tsl, and mrso, respectively, from the CMIP5 variable list). The monthly means for all variables from 1995-2005 were averaged for each grid cell to generate an overall mean for comparison to the HWSD and to use as drivers for our reduced complexity models (see below). We combined litter and soil carbon for our analysis and refer to the sum as soil carbon. Coarse woody debris (cCwd from the variable list) was not included in the sum since there is no respiration from this pool in the two models that report it (CCSM4 and NorESM1). Global turnover times for soil carbon were calculated by dividing total soil carbon by total terrestrial NPP from each ESM. INM-CM4 did not report NPP directly, so we derived NPP from gross primary production and autotrophic respiration (gpp and ra from the variable list). Soil temperatures were reported for each soil layer, but only the top 10 cm mean was used in this analysis. Land area was calculated from the grid area modified by the land cover for each model (areacella and sftlf from the variable list, respectively). All ensemble members were averaged for each model; however, not all variables were available for each ensemble member. For example, GISS-E2-R reported cSoil but not tsl for ensemble member r1i1p1 at the time of download.
We performed a hierarchical cluster analysis and found that ESMs from the same climate center generated very similar distributions of soil carbon for 1995-2005 (see Supplement Fig. S1). Clusters were constructed using complete linkage of the Euclidian distances between the global 1 •gridded soil carbon distributions for each model. Models from the same climate center always showed more than 90 % relative similarity and included the following pairs: GISS-E2 H and R, HadGEM2 ES and CC, IPSL-CM5 (LR) A and B, MIROC-ESM and MIROC-ESM-CHEM, and finally NorESM1 ME and M. As a result, these model pairs were averaged. We were left with 11 independent simulations, one representing each modeling center, that were evaluated using the approaches described below.
ESMs do not report the depth of carbon in the soil profile to CMIP5, making direct comparison with empirical estimates of soil carbon difficult. Although many soil models were originally constructed to represent C dynamics at an approximate depth range of 0 to 20 cm (e.g., Kelly et al., 1997), we assumed that all simulated soil carbon was contained within the top 1 m to simplify comparison with data sets. We recommend that future model intercomparison projects request soil carbon output from model simulations with specific depth ranges (for example, soil carbon above 1 m and below 1 m) to allow for more accurate and direct comparison to survey data.

Soil carbon
The HWSD provided empirical estimates of global soil carbon stocks to validate ESM simulations. The HWSD is a product of the Food and Agriculture Organization of the United Nations and the Land Use Change and Agriculture Program of the International Institute for Applied Systems Analysis. The HWSD aggregates data from the European Soil Database (ESDB, 2004), the Soil Map of China (Shi et al., 2004), regional soil and terrain databases (Sombroek, 1984), and the FAO-UNESCO Soil Map of the World (FAO/UNESCO, 1981). Soil carbon stocks were calculated from bulk densities and organic carbon concentrations given in the HWSD for the top 1 m of soil at a 0.5 • × 0.5 • resolution (Fig. 1). Bulk density estimates in the HWSD were derived from soil texture; however, this approach is not appropriate for high carbon soils (FAO/IIASA/ISRIC/ISSCAS/JRC, 2012; Saxton et al., 1986). Therefore, we replaced Histosol and Andisol bulk densities with values from the World Inventory of Soil Emission Potentials (Batjes, 1996).
Because high latitude soils contain a large fraction of global soil carbon, we also validated ESM simulations of soil carbon in high latitudes with the NCSCD, which is an independent survey of soil carbon (Tarnocai et al., 2009). The NCSCD covers 18.8 × 10 6 km 2 , including areas with different geological histories and stages of soil development. We used the 1 • × 1 • soil carbon data product for the first meter of soil (Fig. 1). The spatial and soil data used to develop this database were collected during the last 60 yr and originated from a variety of sources. Quantitative uncertainty analyses for the HWSD and NC-SCD have not been performed and would be a challenge to construct because of the diverse data sources involved. However, some estimate of uncertainty was essential to enable quantitative comparisons with the CMIP5 models. To generate such a range for total soil carbon from both data sets, we constructed preliminary 95 % confidence intervals (CI 95 ) using a qualitative approach. These estimates must be interpreted with caution because they are not based on formal error propagation methods. Furthermore, these estimates only apply to database totals, and uncertainties for individual grid cells are likely to be larger.
For the HWSD, the major sources of error were related to analytical measurement of soil carbon, variation in carbon content within a soil type, and mapping of soil types. Analytical measurements of soil carbon concentrations are generally precise, but measurements of soil bulk density are more uncertain and may contribute to CI 95 values that are ±15 % of the mean carbon content for a given soil profile. Soil types in the HWSD are defined based on Food and Agriculture Organization soil taxonomic units that are assumed to experience similar histories of soil forming factors such as climate, vegetation, disturbance, topography, and parent material. Batjes (1997) reported quartiles of soil carbon content for 23 soil taxonomic units based on 18 to 1270 soil profiles per unit. These quartiles suggest that soil carbon content is approximately log-normally distributed, allowing for calculation of CI 95 values for each soil unit following log transformation. When back transformed, CI 95 ranged from 6 to 33 % below the median to 6 to 48 % above the median, with an average CI 95 of 14 % below to 17 % above the median across all 23 units.
Another major source of HWSD uncertainty was related to the mapping of soil units and scaling of soil maps to 0.5 • . Soil taxonomic units and associated carbon contents were spatially extrapolated using expert knowledge informed by topography, geology, and vegetation (usually based on aerial photography). Original soil maps were scaled up in the HWSD by classifying each 0.5 • grid cell according to its dominant soil unit. We assumed that the uncertainty associated with mapping and scaling was similar in magnitude to measurement error and spatial variation, with a CI 95 of approximately ±15 % of the mean. To estimate an overall CI 95 for the HWSD, we assumed that variation in soil carbon content within soil taxonomic units already included analytical error, and that median carbon content within a soil unit is extrapolated by multiplying by the area of the unit. Thus the CI 95 values representing variation in soil carbon content and mapping uncertainty were summed to yield an overall CI 95 of 29 % below the mean to 32 % above the mean, or a range of 890 to 1660 Pg C with a mean of 1260 Pg C. This estimate is broadly consistent with other empirical estimates of global soil carbon (Eswaran et al., 1993;Jobbagy and Jackson, 2000;Sombroek et al., 1993).
For the NCSCD, the uncertainties varied by geographic region. The North American portion of the data set was based on analysis of 1169 pedons producing a medium to high confidence rating (66-80 %). Thus we estimated the CI 95 for the North American portion of the NCSCD to be 165 ± 17 Pg C, corresponding to ±10 % of the mean. In Eurasia, soil carbon estimates were based on fewer pedons (591) plus 90 peat cores, producing a low to medium confidence rating (33-66 %). Therefore we estimated the CI 95 for the Eurasian region to be 331 ± 99 Pg C, or ± 30 % of the mean. Carbon in Yedoma deposits and river deltas was estimated independently using surveyed depth information where available. This deeper soil carbon had the lowest confidence rating but contributed only ∼ 1 % or 5 Pg of the database total; therefore we allowed for a CI 95 of 5 ± 5 Pg C on this estimate. Together, these uncertainty estimates yielded an overall CI 95 of roughly 500 ± 120 Pg C for the first meter of soil.

Net primary productivity and temperature
To assess model skill in simulating key driving variables that could affect soil carbon stocks, we compared ESM outputs to temperature data from the Climate Research Unit (CRU) and to NPP data from a literature synthesis and from the Moderate Resolution Imaging Spectrometer (MODIS). The CRU and MODIS data also were used in parameter estimation for the reduced complexity models (Eqs. 1-2) to explain the spatial variation in observed global soil carbon with observed temperature and NPP. We used a 0.5 • × 0.5 • gridded air temperature data set from the CRU, specifically the 1995-2005 mean of the tmp variable from CRU TS 3.10 . For NPP, we used the 0.008 • × 0.008 • gridded MODIS product MOD17A3 from 2000-2011 (Zhao and Running, 2010). We also compared ESM-simulated NPP to Ito's (2011) value of 54 ± 11 Pg C yr −1 (mean ± standard deviation). This estimate was based on empirical models that used environmental parameters to extrapolate field measurements of NPP to the global scale. We considered ESMsimulated NPP values to be consistent with empirical data if they fell within 2 standard deviations of Ito's (2011) estimate.

Biome map
To evaluate ESM soil carbon across biomes, we aggregated HWSD estimates and model simulations of soil carbon within biomes. The biome map was based on the MODIS land cover product MCD12C1 (Friedl et al., 2010;NASA LP DAAC, 2008) (Fig. S2). We assigned one of 16 land cover types to each 1 • × 1 • grid cell by taking the most common land cover from the original underlying 0.05 • × 0.05 • data. Each 1 • × 1 • grid cell was assigned to one of 9 biomes: tundra, boreal forest, tropical rainforest, temperate forest, desert and shrubland, grasslands and savannas, cropland and urban, snow and ice, or permanent wetland. Details for the biome construction can be found in Fig. S2 in the Supplement.

Regridding approach
All model outputs and data sets were regridded to 1 • × 1 • for biome-and grid-scale comparisons. Our regridding approach assumed conservation of mass and that a latitudinal degree is proportional to distance for close grid cells. Regridding the outputs to 1 • × 1 • downscaled the models while upscaling the data (Table 2).

Assessing driving variables for soil C stocks
We developed several reduced complexity models to evaluate the drivers of simulated soil carbon variability and facilitate comparisons between ESMs. These reduced complexity models consisted of a single pool of soil carbon at each grid cell driven by locally varying NPP, soil temperature, and soil moisture (Figs. S3, S4, S5 in Supplement) and globally uniform parameters including a decomposition rate constant, Q 10 value, and moisture coefficient. By applying the same simple model, we were able to compare parameters across ESMs and assess which variables had the strongest control over soil carbon. Driving variables for the reduced complexity models were taken from ESM annual means of NPP, soil temperature (T , top 10 cm mean), and total soil water content (W ) over the period 1995-2005. Our reduced complexity models assumed that the soil carbon pool (C) in grid cell i was at steady state, such that NPP inputs equaled outputs from heterotrophic respiration (R): Carbon pools were not expected to be exactly at steady state for 1995-2005, and mean grid differences between NPP and R across the ESMs ranged from 0.01 to 0.12 kg m −2 yr −1 , or between 1 % and 20 % of the mean grid NPP for this period. Thus the ESMs were close to steady state, and we assumed steady state to simplify our analysis. For the simplest reduced complexity model, we assumed that soil heterotrophic respiration was directly proportional to the soil carbon pool with a spatially uniform decomposition rate constant k (Olson, 1963;Parton et al., 1987): Combining the two above equations yielded the simplest reduced complexity model, Eq. (1), in which soil carbon was proportional to NPP and inversely proportional to a global decomposition rate (k): We formulated a second reduced complexity model, Eq. 2, in which soil respiration in each grid cell also depended on soil temperature (T ) according to a Q 10 function with a baseline temperature of 15 • C (Lloyd and Taylor, 1994): Biogeosciences, 10, 1717-1736, 2013 www.biogeosciences.net/10/1717/2013/ A third reduced complexity model, Eq. (3), included a moisture modifier that increased with total soil water content (W i ) normalized to maximal soil water content for each ESM (W x ) according to an exponential function, where b was a positive scaling exponent: The parameters k, Q 10 , and b in each reduced complexity model were optimized using ESM soil carbon and driving variables across all grid cells. We used a constrained Broyden-Fletcher-Goldfarb-Shanno optimization algorithm (Byrd et al., 1995), a quasi-Newtonian method, as implemented in R 2.13.1 (R Development Core Team, 2012). This algorithm was selected for parameter fitting because of its robust convergence and short run time. We ran the optimization with the following constraints: k ∈ 10 −4 , 10 , Q 10 ∈ (1, 4), and b ∈ (0, 3). The initial parameter estimates were k = 0.1, Q 10 = 1, and b = 0. We used the root mean square error (RMSE) as the measure function. The optimization was performed only on grid cells with non-zero soil carbon values.
We conducted an additional analysis to assess the causes of variation in simulated soil carbon across ESMs (Eqs. 4-7). For this analysis, we used a modified version of Eq. (2) to predict total global soil carbon (C) for each ESM: where the Q 10 and k parameters were derived from fitting Eq.
(2) to the spatial distribution of soil carbon (at 1 • resolution) from each ESM as described above. Grid-scale NPP outputs (NPP i ) and soil temperatures (T i ) from each ESM were used as drivers in Eq. (4) to calculate soil carbon in each grid cell i. Soil carbon was then summed across all grid cells in each ESM to calculate the global soil carbon pool (C). Thus Eq. (4) represents the contribution of both model parameterization (Q 10 and k) and soil carbon drivers (NPP i and T i ) to the global soil carbon pool. To isolate the effect of ESM parameterization on C, we substituted multimodel mean values for NPP (NPP i ) and temperature (T i ) into Eq. (4) for each grid cell i: To isolate the effect of ESM driving variables on C, we substituted multi-model mean values for Q 10 (Q 10 ) and k (k) into Eq. (4): Finally, we substituted only the multi-model mean temperature into Eq. (4) to isolate the effect of NPP on inter-model variation in C: Using regression analysis, we compared the predicted C from Eqs. (4)-(7) to the totals simulated by the ESMs. These regressions measure the contribution of parameterization (Eq. 5) versus driving variables (Eqs. 6 and 7) to variation in soil carbon totals across ESMs. We excluded GISS-E2 from this inter-model analysis because Eq. (2) could not be fit to this model, and therefore Q 10 and k were not available.

Statistical analyses
ESM simulations were compared to data sets using Pearson correlation coefficients, RMSE, and Taylor scores in R 2.13.1 (R Development Core Team, 2012). The Taylor score (T S ) combines the Pearson correlation coefficient (r) and standard deviation (σ ) of the model results (m) compared to the data (d): where r max is the maximum correlation attainable, assumed to be 1 in this case (Taylor, 2001). Biome-aggregated totals were compared to observations using linear regression.

Global soil carbon stocks and turnover times
The mean (±SD) global soil carbon reported across all ESMs was 1520 ±770 Pg, whereas the global soil carbon in the HWSD was 1260 Pg with a CI 95 of 890 to 1660 Pg (Table 2, Fig. 2). CCSM4 reported the lowest total at 510 Pg C and MPI-ESM-LR the highest at 3040 Pg C. Examining only the area shared by all ESMs and the HWSD reduces the global carbon totals but does not substantially change the rank order of the models (  Fig. S6). In addition, the rank order of ESM soil carbon totals in this region differed from the rank order based on global totals. CCSM4 and NorESM1 simulated just over 10 % of the total soil carbon  (Amundson, 2001;Raich and Schlesinger, 1992). For soil carbon and NPP, each global estimate is separated into individual biome components according to the legend shown in the top panel.
Variation in global soil carbon stocks simulated by ESMs could be driven by variation in modeled NPP, and we found that global terrestrial NPP varied by a factor of 2.6 across the models (Fig. 2). CCSM4, BCC-CSM1.1, CanESM2, INM-CM4, GISS-E2, and MIROC-ESM all predicted global NPP values within 2 standard deviations of the Ito (2011) estimate of 54 Pg C yr −1 , ranging from 46 to 73 Pg C yr −1 , whereas the remaining 5 models fell outside this range. NPP from MODIS was similar to Ito (2011) at 52 Pg C yr −1 . At high northern latitudes, NPP estimates from the ESMs were more variable (1.7 to 10.1 Pg C yr −1 ), compared to a MODIS estimate of 4.7 Pg C yr −1 (Fig. S6 in Supplement).
Turnover times for global soil carbon from the ESMs varied by a factor of 3.6, between 10.8 and 39.3 yr, using global Table 3. Goodness-of-fit measures by grid cell for each ESM including soil carbon versus the HWSD, soil carbon versus NCSCD, soil carbon versus HWSD without NCSCD grid cells (HWSD − NCSCD), NPP versus MODIS NPP, and land 2 m air surface temperature versus CRU 2 m air temperature. T S = Taylor score; r = Pearson correlation coefficient; RMSE = root mean square error. RMSE has units of kg C m −2 for the soil carbon comparisons, kg C m −2 yr −1 for NPP, and • C for temperature. stocks and NPP estimates from each model (Fig. 2). Using MODIS NPP, we calculated a turnover time of 24 yr for soil carbon in the HWSD. This estimate is consistent with the range of 18 to 32 yr reported in other studies (Amundson, 2001;Raich and Schlesinger, 1992). However, CanESM2 and INM-CM4 were the only two ESMs with turnover times that also fell within this range (Fig. 2). At high northern latitudes, 5 of the 11 models had turnover times that were considerably lower than the observations, whereas only 2 of the models had turnover times exceeding observational estimates ( Fig. S6 see Supplement). Turnover times for high northern latitudes were 101.2 yr for the NCSCD and 60.8 yr for the HWSD.

Spatial distribution of soil carbon
The spatial distribution of soil carbon stocks varied widely among the ESMs (Fig. 3 There was generally poor agreement between the ESMs and the HWSD soil carbon distribution at the 1 • scale (Table 3). Compared to the HWSD, ESMs had Pearson correlation coefficients between 0.06 and 0.39, RMSE between 9.4 and 20.7 kg C m −2 , and Taylor scores ranging from 0.21 to 0.68. Omitting the high latitude portion of the HWSD that overlapped with the NCSCD modestly improved these performance metrics for most but not all ESMs (Table 3). Model agreement with NCSCD soil carbon was poor with Pearson correlation coefficients between −0.10 and 0.19, RMSE between 18.0 and 27.7 kg C m −2 , and Taylor scores between 0.05 and 0.52. Agreement between the HWSD and NCSCD also was also low in the areas where the two data sets overlapped (Pearson correlation coefficient of 0.33, RMSE of 20.0 kg C m −2 , and Taylor score of 0.60), although better than the agreement between any individual ESM and the NC-SCD.
ESM agreement with the HWSD generally improved at the biome level (Fig. 4). BCC-CSM1.1 and CanESM2 stood out as being highly correlated with the HWSD (R 2 > 0.90, p < 0.01), though CanESM2 overestimated soil carbon in boreal forests and grasslands and savanna. Biome simulations from HadGEM2, IPSL-CM5, INM-CM4, and MIROC-ESM also had relatively high levels of agreement with the HWSD (0.90 > R 2 > 0.75, p < 0.01), but most regression slopes and intercepts diverged from 1.0 and zero, respectively (Fig. 4). HadGEM2 overestimated soil carbon in grasslands and savanna. IPSL-CM5 generally overestimated tundra but underestimated desert and shrublands. INM-CM4 overestimated grasslands and savanna, boreal forests, croplands, and urban. MIROC-ESM overestimated all biomes except wetlands. Both CCSM4 and NorESM1 were moderately Table 4. Coefficients of determination (R 2 ) and global-scale parameters (1/k and Q 10 ) from reduced complexity models of ESM soil carbon distributions. Eq. (1): dependence on NPP; Eq. (2): dependence on NPP and soil temperature; Eq. (3): dependence on NPP, soil temperature, and soil moisture. Parameters are shown from Eq. (2) with 1/k analogous to turnover time. HWSD-MODIS-CRU represents reduced complexity models based on observational data. The reduced complexity model for CanESM2 was the only one improved by including soil moisture and had a turnover time (1/k) of 8.20 yr, Q 10 of 1.48, and moisture exponent of 0.46 based on Eq. (3). All R 2 values were statistically significant (R 2 > 0.05, p < 0.01) unless otherwise indicated (NS). correlated with the HWSD (0.70 > R 2 > 0.65, p < 0.01), but consistently underestimated biome totals particularly in tundra, boreal forest, and desert and shrubland. Biome totals from MPI-ESM-LR were also moderately correlated with the HWSD (R 2 = 0.62, p < 0.01), but this model overestimated most biome totals, particularly grasslands and savanna. GFDL-ESM2G and GISS-E2 were weak to moderately correlated with the HWSD on the biome level (R 2 = 0.38 and R 2 = 0.51, respectively). GFDL-ESM2G overestimated biome totals from tundra and boreal forests while underestimating the other biomes. GISS-E2 overestimated biome totals in desert and shrublands, grasslands and savanna, tundra, and boreal forests while underestimating tropical rainforests.

Drivers of soil carbon distributions and global stocks
The spatial variability in 8 of the 11 ESMs was well explained by the reduced complexity model driven by NPP and soil temperature (Eq. 2) with R 2 values between 0.72 and 0.93 (Table 4). Consistent with our global-scale calculations (Fig. 2), the turnover times (1/k) for global soil carbon inferred from Eq. (2) (at a baseline temperature of 15 • C) varied from 11 to 37 yr, and Q 10 values ranged from 1.5 to 2.6 ( Table 4). The reduced complexity model for CanESM2 was the only one significantly improved by the addition of soil moisture (Eq. 3), with the R 2 value increasing from 0.56 to 0.73. Soil carbon outputs from GISS-E2 (R 2 < 0.01) and MPI-ESM-LR (R 2 = 0.32) were not well explained by any of the reduced complexity models.
Given the strong relationships between soil carbon, temperature, and NPP illustrated by our reduced complexity models, model skill in simulating driving variables could strongly influence simulated soil carbon stocks (Table 3). ESMs varied in their ability to capture the observed 1 • spatial distribution of NPP (Pearson correlation coefficients from 0.48 to 0.75, biome regression R 2 values from 0.86 to 0.99; Table 3, Fig. S7 in the Supplement). In contrast, models performed better at simulating surface air temperature observations (correlations from 0.93 to 0.96, biome regression R 2 values from 0.93 to 0.97; Fig. S8 in the Supplement). Although air temperature is not directly comparable to soil temperature, particularly in areas with thick organic soils, the biome level correlation between soil and air temperature was high across all ESMs (R 2 values higher than 0.97; Fig. S9). INM-CM4, GISS-E2, BCC-CSM1.1, CCSM4, and NorESM1 all showed warmer soil temperatures compared to air temperatures in northern biomes (Fig. S9). In contrast to the strong relationships we found between soil carbon, NPP, and temperature in the ESMs, the 1 • spatial distribution of soil carbon from the HWSD was not well explained by MODIS NPP and CRU surface air temperature data using the same reduced complexity model (R 2 value of 0.10 for Eq. (2); Table 4).
The reduced complexity models that explained withinmodel spatial variation of soil carbon also captured most of the variation in global soil carbon totals across ESMs ( Fig. 5). Reduced complexity models using ESM-specific values for k and Q 10 and ESM-derived driving variables were able to explain 98 % the variation in total soil carbon across ESMs (Fig. 5a). Most of the variation in soil carbon across ESMs could be attributed to differences in parameterization as represented by Eq. (5) (R 2 = 0.64, Fig. 5b). There was no significant cross-ESM variation due to differences in driving variables alone as represented by Eq. (6) (Fig. 5c). However, driving variables must interact with parameters, since the variances explained by drivers alone (13 %, not significant) and parameters alone (64 %) did not sum to 98 % (Fig. 5a-c). NPP was likely the main driving variable in this interaction, since using the multi-model mean temperature at each grid cell (Eq. 7) still allowed us to explain 93 % of the variation in global soil carbon across ESMs (Fig. 5d).

Discussion
Accurate models of the soil carbon cycle are essential for predicting carbon-climate feedbacks in the future because soil carbon stocks are sensitive to climate change and large relative to the atmospheric CO 2 reservoir. As far as we know, our analysis is the first to benchmark soil carbon outputs from ESMs against empirical data at the global scale, and the first to explore the possible factors contributing to differences among models. We found that although some models simulated reasonable global soil carbon totals, fewer were able to match biome totals, and none were able to reproduce grid-scale distributions of soil carbon. There are a number of factors that may have contributed to the divergence between ESM simulations and observational data. These factors include (1) uncertainties in the data, (2) incorrect representation of environmental drivers in the models (e.g., NPP, temperature, and soil moisture), and (3) incorrect model structure or parameterization of the decomposition response to driving variables. Better performance at the global and biome scales may be due to aggregation of environmental variation that was not captured by the models at finer spatial scales. For instance, topographic controls on soil texture, moisture, and anoxia regulate soil carbon accumulation in peatlands and other organic soil types (e.g., Fan et al., 2008) but are poorly represented in many ESMs. Addressing these issues will be essential for increasing confidence in ESM simulations of terrestrial carbon in the future.

Data uncertainties
Our ability to evaluate model performance relies on high quality empirical data with associated estimates of uncertainty. Whether model simulations diverge from the data is difficult to assess without a formal analysis of uncertainty in the data. Despite their comprehensiveness, the HWSD and NCSCD lack quantitative uncertainty estimates, thereby constraining our ability to use these data sets for benchmarking ESMs. Our preliminary analyses based on a qualitative assessment indicated that the uncertainty in empirical estimates of soil carbon stocks could exceed 770 Pg C at the global scale, an amount similar to the entire pool of atmospheric carbon. At high northern latitudes, there was substantial disagreement between the two data sets. NCSCD estimates of CI 95 were between 380 and 620 Pg C, whereas the corresponding HWSD estimate was only 290 Pg C. However, the HWSD did not include regional uncertainty information, meaning that the two estimates may agree once a formal uncertainty analysis has been performed. Such an analysis requires quantification of uncertainty in both measurement and scaling processes used to construct the spatial distribution of soil carbon. Uncertainty in the measurement of soil properties such as bulk density and carbon concentration must be integrated with errors involved in extrapolating data from individual soil profiles to the regional scale. Detailed analysis of the accuracy of soil maps will likely be essential for quantifying the uncertainty in this extrapolation process.

Driving variables
Accurate observational data are important but will not resolve the differences in simulated soil carbon that we observed among the ESMs. These differences must be due to differential model skill in simulating soil carbon drivers or in representing the response of soil carbon to drivers through model parameterization and structure. Our reduced complexity models showed that differences in NPP contributed significantly to differences in soil carbon across ESMs (Fig. 5). NPP was also a significant driver of soil carbon within ESMs (Table 3), suggesting it should be a focal variable for improving soil carbon estimates. Given the large range of global NPP across ESMs (Fig. 2) and the low Taylor scores observed for some models in comparison to MODIS NPP (Table 3), it may be possible to improve soil carbon simulations by revising photosynthesis and autotrophic respiration algorithms in some of the ESMs so that NPP is more consistent with contemporary observations.
In contrast to NPP, differences in soil temperature did not contribute significantly to differences in soil carbon stocks across ESMs (Fig. 5). However, our reduced complexity models indicated that soil temperature was important for explaining soil carbon variation within models (Table 4). Therefore, simulation of soil temperature could also be a focal area for model improvement, especially since other studies suggest that soil temperature is not consistently well represented in ESMs. For example, the physical coupling between surface air temperature and soil temperature at high latitudes differs considerably across ESMs, which influences the spatial distribution of permafrost (Koven et al., 2012;Slater and Lawrence, 2013). Even if ESMs can simulate average air temperatures consistent with observations (Table 3, Fig. S8), fine-scale differences in the ability to represent soil temperatures (Fig. S9) and permafrost could have consequences for soil carbon distributions.
Soil moisture did not play an important role as a driving variable for soil carbon in our reduced complexity models, indicating that for most models this variable did not strongly control spatial patterns in soil carbon stocks (Table 4) or differences among models (Fig. 5). This result was unexpected because soil moisture affects decomposition rates in all ESMs (Table 1). Furthermore, other studies have shown that soil carbon stocks depend on the response of heterotrophic respiration to soil moisture in global models, although NPP and soil temperature were also important drivers of soil carbon . It is possible that soil moisture influences soil carbon stocks in ESMs, but our reduced complexity model was unable to statistically distinguish the soil moisture effect from the NPP effect because these two drivers often covary.
Alternatively, the exponential form of the moisture function in our reduced complexity model might have been inappropriate if decomposition rates decline at high soil moisture. Based on empirical data, a substantial fraction of global soil carbon likely resides in areas where poor soil drainage impedes organic matter oxidation (Gorham, 1991). It is likely that the interaction of topographic controls and soil texture with soil moisture is not well represented in the current generation of ESMs. New approaches may be needed to determine which grid cells are poorly drained, and the rate at which organic soils form in these area (Ise et al., 2008). We also recommend that future CMIP archives require soil mois-ture information for different soil layers to facilitate benchmarking studies on the response of carbon to moisture in the soil profile.
Although our reduced complexity models indicated that soil carbon simulated by ESMs was driven primarily by NPP and temperature, this relationship was much weaker with observational data. According to the same reduced complexity model used for the ESMs, MODIS NPP and CRU surface air temperatures were only able to explain 10 % of the spatial variation in HWSD soil carbon stocks. Thus the variables that drive soil carbon stocks in ESMs may differ from those that determine observed carbon stocks. For example, soil temperatures in organic-rich soils may be weakly coupled to air temperatures (Koven et al., 2011;Slater and Lawrence, 2013), in contrast to the strong coupling predicted by most ESMs (Fig. S9). Alternatively, the mathematical form of our reduced complexity model may have been inappropriate for the observational data, even though it was appropriate for most ESM outputs. Regardless, our results imply that ESMs may need to incorporate a broader range of environmental drivers and processes to improve modeldata agreement. Even if simulations of NPP and soil temperature can be improved in ESMs, these drivers have a limited ability to explain spatial patterns in global soil carbon with current model structures.

Parameterization and model structure
We found that parameterization was a major source of variation in ESM soil carbon simulations (Fig. 5). In some of the ESMs such as CCSM4, soil carbon turnover may be too fast, whereas in other models such as MIROC-ESM, turnover may be too slow (Fig. 2). To address these issues, ESMs could use terrestrial radiocarbon observations (both the total inventory and vertical distribution of 14 C in different biomes) to help constrain rates of soil carbon turnover (Torn et al., 1997;Trumbore, 2009). Another avenue of improvement for ESM parameterization could focus on processes that operate on fine spatial scales. Differences in soil texture and topography may lead to non-linear effects on soil carbon storage that are not well described by the average characteristics of a grid cell. For instance, relatively small-scale topographic variations are associated with peatland formation, and it is unclear how to scale these effects globally (Gorham, 1991;Koven et al., 2011). A multi-scale approach is required to determine which processes are important at the global scale and how to represent them.
Improving empirical data sets, model driving variables, and model parameterization could substantially increase model-data agreement for present-day soil carbon stocks. However, matching current soil survey data is a necessary but not sufficient condition for validating the accuracy of Earth system models. In order to have confidence in future simulations, the models must correctly represent the mechanisms and drivers of soil carbon change, such that they are right for the right reasons. For example, models with incorrect mechanisms or drivers could be tuned to correctly simulate current soil carbon stocks, but might incorrectly simulate soil carbon stock changes in the future.
We initially hypothesized that models with more pools would have greater flexibility and capture more of the spatial variation in soil carbon. However, the structural features that we examined did not clearly relate to differences in ESM agreement with empirical data (Tables 1, 3). We saw no pattern in ESM-data agreement with respect to number of soil carbon pools, temperature and moisture sensitivity functions, or presence of a nitrogen component. Furthermore, our reduced complexity model (Eq. 3) explained most of the spatial variation within 9 of 11 models (0.62 < R 2 < 0.93). This result confirmed that, despite different simulated stocks of soil carbon, most of the models share a similar underlying structure. Such similarity means that the models likely make similar assumptions about the mechanisms regulating soil carbon cycling. If these underlying assumptions are incorrect or incomplete, the resulting errors will be present in all of the models.
CanESM2, MPI-ESM-LR, and GISS-E2 are three exceptions that were not well explained by our reduced complexity model (Eq. 2) driven by NPP and soil temperature (Table 4), and thus may be examples of models with structural differences. CanESM2 was the only model in which soil water content contributed to the explanatory power of the reduced complexity model (R 2 value improved from 0.57 to 0.74). This dependency on soil water content may be partly explained by the biome-specific turnover time in CanESM2. Since biomes are partially determined by precipitation and soil moisture status, biome-specific turnover times might have resulted in a tighter relationship between soil carbon and moisture in our reduced complexity model. Outputs from MPI-ESM-LR were only moderately explained by our reduced complexity models (R 2 value 0.32). We do not have a good explanation for the poor fit since there was no significant deviation in documented model structure, and driving variables were roughly in line with other model simulations. GISS-E2 outputs were not explained at all by the reduced complexity model (R 2 value less than 0.01). Unlike other models, GISS-E2 showed a unique disconnect between NPP and soil carbon which could be due to differences in the way plant biomass is allocated to litter in the model. However, we cannot offer a definitive explanation for the poor fit.
All ESMs may be missing key processes governing longterm carbon storage that affect model-data agreement. Decomposition models currently used in all ESMs are built on the assumption that carbon substrates have intrinsic chemical decomposition rates (Parton et al., 1993). However, there is an emerging consensus that key abiotic and biotic factors have a stronger governing role in decomposition than the carbon compounds themselves (Schmidt et al., 2011). These key governing components may include aggregate interactions (Six et al., 2000), microbial dynamics (Todd-Brown et al., 2012), cryoturbation (Koven et al., 2011), syngenetic soil formation (Fan et al., 2008;Shur et al., 2004), extracellular enzyme dynamics (German et al., 2011), and rare substrate formation (Allison, 2006). Representing these processes in the structure of soil carbon models remains a major challenge. However, smaller-scale decomposition models have begun to explore several of these mechanisms (Manzoni and Porporato, 2009).
Recent advances in the theory of microbial decomposition could provide a foundation for major changes in the structure of soil carbon models used in ESMs. Schimel and Weintraub (2003) proposed a model in which decomposition was mediated by soil enzymes and microbial biomass. Later models expanded this framework to include microbial functional groups that preferentially decompose specific substrate types (Moorhead and Sinsabaugh, 2006). In contrast to current substrate pool models used in ESMs, biomass-mediated decomposition models would likely include non-linear processes such as Monod uptake or Michaelis-Menten enzyme kinetics. These non-linear effects could produce very different behaviors at daily, annual, and centennial timescales. Compared to substrate pool models, models driven by microbial biomass predict smaller losses of soil carbon under warming due to declines in microbial growth efficiency with higher temperature (Allison et al., 2010).

Conclusions
Overall, we found that some ESMs simulated soil carbon stocks consistent with empirical estimates at the global and biome scales. However, all of the models had difficulty representing soil carbon at the 1 • scale. Despite similar overall structures, the models do not agree well among themselves or with empirical data on the global distribution of soil carbon. Differences across ESMs are primarily due to differences in the representation of NPP and the parameterization of soil decomposition sub-models, not due to differences in model structure. However, all model structures may have serious shortcomings since NPP and temperature strongly influenced soil carbon stocks in ESMs but not in observational data. Fully reconciling this disagreement will require a range of approaches, including better prediction of soil carbon drivers, more accurate model parameterization, and more comprehensive representation of critical biological and geochemical mechanisms in soil carbon sub-models. There is also a need for better quantification of the uncertainty in empirical estimates of soil carbon stocks that are used to benchmark ESMs. If this uncertainty is too high for rigorous model comparison, additional measurements of soil carbon stocks may be required in some regions of the world. Addressing these issues will improve our ability to predict the response of the carbon cycle to climate change and inform policymakers about the potential impacts of carbon emissions.