To make predictions about the carbon cycling consequences of rising global surface temperatures, Earth system scientists rely on mathematical soil biogeochemical models (SBMs). However, it is not clear which models have better predictive accuracy, and a rigorous quantitative approach for comparing and validating the predictions has yet to be established. In this study, we present a Bayesian approach to SBM comparison that can be incorporated into a statistical model selection framework. We compared the fits of linear and nonlinear SBMs to soil respiration data compiled in a recent meta-analysis of soil warming field experiments. Fit quality was quantified using Bayesian goodness-of-fit metrics, including the widely applicable information criterion (WAIC) and leave-one-out cross validation (LOO). We found that the linear model generally outperformed the nonlinear model at fitting the meta-analysis data set. Both WAIC and LOO computed higher overfitting risk and effective numbers of parameters for the nonlinear model compared to the linear model, conditional on the data set. Goodness of fit for both models generally improved when they were initialized with lower and more realistic steady-state soil organic carbon densities. Still, testing whether linear models offer definitively superior predictive performance over nonlinear models on a global scale will require comparisons with additional site-specific data sets of suitable size and dimensionality. Such comparisons can build upon the approach defined in this study to make more rigorous statistical determinations about model accuracy while leveraging emerging data sets, such as those from long-term ecological research experiments.

Coupled Earth system models (ESMs) and constituent soil biogeochemical
models (SBMs) are used to simulate global soil organic carbon (SOC) dynamics
and storage. As global climate changes, some ESM and SBM simulations suggest
that substantial SOC losses could occur, resulting in greater soil

Soil biogeochemical models vary greatly in structure (Manzoni and Porporato, 2009) but can be broadly partitioned into two categories: those that implicitly represent soil C dynamics as first-order linear decay processes and those that explicitly represent microbial control over C dynamics with nonlinear Michaelis–Menten functions (Wieder et al., 2015a). Explicit models typically include more parameters than linear models because multiple microbial parameters are needed for each decay process as opposed to a single rate parameter. The additional parameters allow explicit models to represent microbial mechanisms, but at the expense of greater model complexity.

Rigorous statistical approaches should be applied to investigate how
explicit representation of microbial processes affects predictive model
performance. ESM and SBM comparisons involving empirical soil C data
assimilations have been conducted previously (Allison et al., 2010; Li
et al., 2014) but few standardized statistical methods for ESM and SBM
benchmarking and comparison have been developed that would allow for
rigorous model selection. Prior model comparisons have involved graphical
qualitative comparisons or use of basic fit metrics such as the coefficient
of determination,

Encouragingly, a rich toolset to further inform quantitative model
evaluation and comparison can be drawn from Bayesian statistics (Hararuk
et al., 2014, 2018; Hararuk and Luo, 2014). These tools include information
criteria and approximate cross-validation, goodness-of-fit metrics designed
for the simultaneous comparison of multiple structurally diverse models.
Like

Examples of information criteria popularized by widely used R packages such as lme4 and rjags include the Akaike information criterion (AIC), Bayesian information criterion (BIC), and deviance information criterion (DIC) (Vehtari and Ojanen, 2012). However, these metrics have some limitations. AIC, BIC, and DIC do not use full sampled posterior distributions in their computational processes. AIC and BIC both rely on a pointwise maximum likelihood estimate that cannot be derived from nonuniform Bayesian prior distributions, including normal distributions. AIC and BIC (despite BIC's name) thereby have limited use in Bayesian statistics settings. DIC can accommodate nonuniform priors but is calculated from pointwise simplified posterior means. The compression of full posteriors into pointwise means can prompt DIC to compute an impossible negative effective model parameter count in select situations (Gelman et al., 2014). Consequently, the original forms of AIC, BIC, and DIC are no longer recommended for use in Bayesian model assessment by some statisticians in light of superseding alternatives (Gelman et al., 2014).

Three predictive goodness-of-fit metrics address the limitations and stability issues of AIC, BIC, and DIC by incorporating full, nonuniform posterior distributions in their calculations to better account for overfitting and model size (Christensen et al., 2010; Gelman et al., 2014). These metrics include the widely applicable information criterion (WAIC), log pseudomarginal likelihood (LPML), and Pareto-smoothed important sampling leave-one-out cross validation (PSIS-LOO and hereby referred to as LOO). WAIC, LPML, and LOO can estimate the ability of models to fit unobserved measurements outside of the set of measured data samples (Vehtari et al., 2017). Thus, WAIC, LPML, and LOO can be considered as superior barometers for model predictive accuracy compared to AIC, BIC, and DIC.

The overarching goal of this study was to develop a statistically rigorous
and mathematically consistent data assimilation framework for SBM comparison
that uses predictive Bayesian goodness-of-fit metrics. We pursued three
specific objectives as part of that goal. First, we compared the behaviors
of two different SBMs, a linear microbial-implicit model termed the
conventional model (CON) and a nonlinear microbial-explicit model called
the Allison–Wallenstein–Bradford model (AWB) (Fig. 1), following data
assimilation with soil respiration data sourced from a meta-analysis of soil
warming studies (Romero-Olivares et al.,
2017). Second, we characterized the parameter spaces of these models using
prior probability distributions of parameter values informed by previous
studies and expert judgment. Third, we compared specific Bayesian predictive
information criteria in WAIC, LPML, and LOO to the coefficient of
determination,

Diagrams of the pool structures of the

We compared two SBMs, the CON and AWB models (Allison et al., 2010). The models were selected for this study due to their relative equation simplicity, their tractable parameter count, and limited biological data input requirements (Sect. S1 in the Supplement). The CON system models three separate C pools as state variables including SOC, dissolved organic C (DOC), and microbial biomass C (MIC) pools, while AWB includes SOC, DOC, MIC, and extracellular enzyme biomass C (ENZ) pools (Fig. 1). Additionally, these models were chosen because they are C-only models without nitrogen (N) pools. The increased complexity of N-accounting SBMs will require future studies with coupled N data sets (Manzoni and Porporato, 2009).

The data set for model fitting was compiled from a recent meta-analysis of
27 soil warming studies that measured

Model-outputted response ratios were calculated by dividing simulated

CON and AWB ordinary differential equation systems were simulated using the CVODE backward differentiation method (Curtiss and Hirschfelder, 1952) from the SUNDIALS library of equation solvers (Hindmarsh et al., 2005). Differential equation models contain parameters that affect state variables, and model fitting through Markov chain algorithms involves iterating through parameter space one set of parameters at a time. We performed model fitting using a Markov chain algorithm called the Hamiltonian Monte Carlo (HMC), using version 2.18.1 of the RStan interface to the Stan statistical software (Carpenter et al., 2017; Stan Development Team, 2020) and version 3.4.1 of R (R Core Team, 2017). HMC is not a random-walk algorithm and uses Hamiltonian mechanics to determine exploration steps in parameter space. HMC has been theorized to offer more efficient exploration of high-dimensional parameter space than traditional random-walk Metropolis algorithms (Beskos et al., 2013).

Conditional on the meta-analysis data set, the HMC algorithm computed posterior and posterior predictive distributions, from which Bayesian statistical inferences on likely ranges of parameter values were then made. Posterior distributions are the distributions of more likely model parameter values conditional on the data. Posterior predictive distributions are the distributions of more likely values for unobserved data points from the data-generating process conditional on the observations. In the case of this study, the experiments constituting the meta-analysis would be the data-generating process.

For the sake of clarity, it is important to distinguish between the frequentist confidence intervals and Bayesian posterior predictive intervals and distributions we describe in our study. Confidence intervals are calculated from the sample means and standard errors at observed data points and indicate ranges of values that are likely to contain the true data values with repeated sample collections using the same methodology. Posterior predictive intervals and distributions are computed after estimation of the posterior parameter distributions and represent the likely distributions of unobserved data values conditional on observed data values. Bayesian credible intervals, which we will also discuss in this study, are ranges of values that parameters are likely to take with some probability that are conditional on the observed data. Credible areas indicate the probability densities of parameter values across credible intervals.

We ran four chains for 35 000 iterations each for our HMC simulations, with the first 10 000 iterations being discarded as burn-in for each chain. Hence, our posterior distributions consisted of 100 000 posterior samples per HMC run. In retrospect, because our credible areas displayed sufficient smoothness (Fig. S2 in the Supplement) and Bayesian diagnostics indicated adequate posterior sampling (Table S5 in the Supplement), we could have reduced simulation time without impairing posterior computation by running shorter chains that consisted of 20 000 to 30 000 iterations. To minimize the presence of divergent energy transitions, which indicate issues with exploring the geometry of the parameter space specified by the prior distributions, we set the adaptation delta to 0.95, the initial step size to 0.1, and maximum tree depth to 12. Those parameters determine how the HMC algorithm proposes new sets of parameters at each step and were set so that the HMC would begin with smaller exploration steps. The algorithm varies the step size from its initial value throughout posterior sampling to maintain a desired acceptance rate; the tuning sensitivity of the step size is governed by the adaptation delta value, with higher values indicating reduced sensitivity.

We further constrained our HMC runs to characterize parameter regimes corresponding to higher biological realism. Normal informative priors were used to initiate the runs, and the prior distribution parameters were chosen based on expert opinion and previous empirical observations (Allison et al., 2010; Li et al., 2014). Prior distributions had noninfinite supports; supports were truncated to prevent the HMC from exploring parameter space that was unrealistic (Table S2).

Because we were mainly interested in testing model predictions of soil warming response, the models were initiated at steady-state prior to the introduction of warming perturbation to isolate model warming responses from steady-state attraction. We fixed preperturbation steady-state soil C densities to prevent HMC runs from exploring parameter regimes corresponding to biologically unrealistic C pool densities and mass ratios.

To set prewarming steady-state soil C densities, we first analytically derived steady-state solutions of the ordinary differential equations of the models. Then, with the assistance of Mathematica version 12, we rearranged the equations by moving the steady-state pool sizes to the left-hand side (Sect. S2), such that we could determine the value of parameters dependent on pool sizes while allowing the rest of the parameters to vary for the HMC. Consequently, we could constrain the prewarming pool sizes from reaching unrealistic values in the simulations.

Sensitivity analyses examine how the distributions of model input values influence the distributions of model outputs. In our study, we considered prewarming C pool densities as a model input. We performed a sensitivity analysis to observe how the choice of prewarming C pool densities and C pool ratios would affect the model fits and posterior predictive distribution of C pool ratios.

We compared the model outputs and postwarming response behavior of AWB and
CON at equivalent C pool densities and ratios. The ratio of soil microbe
biomass C (MIC) density to SOC density has been observed to vary
approximately from 0.01 to 0.04 (Anderson
and Domsch, 1989; Sparling, 1992), so we used those numbers as guidelines
for establishing the ranges of the C pool densities and density ratios
explored in our simulations. One portion of the analysis involved running
HMC simulations in which we set the prewarming MIC density at 2 mg C g

For some combinations of the prior distributions and prewarming steady-state C pool densities (Table S2), AWB HMC runs wandered into
unstable parameter regimes that would prevent the algorithm from reliably
running to completion. Consequently, we do not compare simulation results
for AWB and CON with prewarming SOC densities below 50 mg C g

In addition to

Algorithmic differences between WAIC and LPML and LOO render them appropriate for different statistical modeling goals and make them complementary metrics. WAIC is suitable for estimating the relative quality of model fits to hypothetical repeated samples collected at existing experimental time points, whereas LOO and LPML are suitable for estimating the quality of fits to hypothetical measurements taken between observed time points (Vehtari et al., 2017).

We used version 2.0.0 of the “loo” package available for R to calculate our
WAIC and LOO values (Vehtari et al., 2019). A lower WAIC and
LOO and a higher LPML indicate a more likely model for a given data set.
LPML can be multiplied by a factor of

We obtained distributions of posterior predictive fits to the univariate response ratio data for both AWB and CON across different prewarming MIC-to-SOC ratios. Posterior samples totaled 100 000 for each simulation. Sampler diagnostics for the HMC runs indicated that the statistical models were valid at all prewarming steady-state values observed (Table S6), that model parameter values converged across the four Markov chains (Fig. S7), and that the posterior parameter space was effectively sampled and explored (Fig. S5) to generate enough independent posterior samples for inference (Fig. S6). The ratios of effective posterior parameter samples to total samples for parameters were generally satisfactory; across observed MIC-to-SOC ratios, they were all greater than 0.25 and mostly greater than 0.5 (Table S5).

We also tracked divergent transitions, which mark points in chains at which
the HMC algorithm was inhibited in its exploration and posterior sampling,
potentially due to the parameter space becoming geometrically confined and
difficult to navigate. Divergent transitions occurred in the AWB HMC runs
(Fig. S9), though the ratios of divergent transitions to sampled
iterations were relatively low for all runs. The highest divergent transition
ratio observed was 0.0217, corresponding to the simulation initiated with
prewarming SOC

The CON curve monotonically decreases in response ratio over time, whereas the AWB curve displays changes in slope sign (Fig. 2). The difference in curve shape (Fig. 3a, b) is in line with CON's linear status and AWB's nonlinear formulation with more parameters (Allison et al., 2010). By 50 years after warming, mean fit curves for AWB and CON return to 1.0 after their initial increase (Fig. 3c, d), consistent with prior observations and expectations at steady state (van Gestel et al., 2018; Romero-Olivares et al., 2017).

Distribution of fits of

Intramodel comparisons of mean posterior predictive response
ratio fits for AWB and CON across different MIC-to-SOC ratios. Open circles
show the meta-analysis data points for reference. The blue, black, and red
lines indicate model mean fits corresponding to different
prewarming steady-state SOC values of 50,
100, and 200 mg C g

From a cursory visual evaluation, neither of the models clearly outperforms
the other across all prewarming steady states. The 95 % confidence
interval of the first data point at

For both AWB and CON, increasing the prewarming SOC to higher densities
from SOC

Changing the prewarming MIC-to-SOC steady-state pool size ratio by
increasing prewarming MIC from 1 to 8 mg C g

In addition to response ratio fits, we observed the influence of prewarming MIC-to-SOC ratios on model SOC stock response ratios in AWB and CON simulations following warming. Similar to the model flux response ratios, SOC response ratios were calculated by dividing evolved postwarming SOC densities by prewarming densities. The SOC response ratios at 12.5 years for CON and AWB increased as prewarming SOC was raised (and, hence, the MIC-to-SOC ratio decreased) with other prewarming C densities held constant, indicating reduced proportional SOC loss when SOC stocks were initiated at higher prewarming densities (Fig. S3a). For CON, SOC loss decreased from 27.1 % at SOC50 to 9.2 % at SOC200. In a similar trend for AWB, SOC loss decreased from 17.2 % at SOC50 to 8.1 % at SOC200. In contrast, raising prewarming MIC densities (and, hence, increasing the MIC-to-SOC ratio) with other prewarming C densities held constant did not produce a shared trend for CON and AWB (Fig. S3b). CON SOC loss decreased from 18.8 % at MIC1 to 17.4 % at MIC8, while AWB SOC loss increased from 11.3 % at MIC1 to 16.3 % at MIC8.

Truncation of prior supports, or distribution domains, generally did not
prevent posterior densities from retaining normal distribution shapes.
Deformation away from Gaussian shapes for the densities of Ea

The 95 % probability density credible areas for model parameters
corresponding to prewarming steady-state SOC

For both CON and AWB, LOO, WAIC, LPML, and

Goodness-of-fit metrics plotted against initial steady-state SOC
for AWB and CON models for

Varying prewarming steady-state MIC from MIC1 to MIC8 modestly impaired
goodness of fit across the various metrics (Fig. S4). CON's LOO
and WAIC values increased respectively from

Our study develops a quantitative, data-driven framework for model
comparison that could be applied across different research questions,
ecosystems, and scales. We demonstrated the novel deployment of WAIC and
LOO, two more recently developed Bayesian goodness-of-fit metrics that
estimate model predictive accuracy, to evaluate SBMs using data from
longitudinal soil warming experiments. WAIC and LOO improve upon older and
more frequently used metrics, such as AIC and DIC, by accounting for model
complexity and overfitting of data in a more comprehensive, stable, and
accurate fashion. The quantitative agreement between WAIC, LOO, and LPML
reinforces the reliability and validity of information criteria and
cross-validation metrics to complement use of frequentist

We constrained the fitting of AWB and CON to biologically reasonable parameter space by fixing prewarming steady-state C pool densities and establishing prior distributions informed by expert judgment (Table S2). We observed that, despite the qualitative difference in the shapes of their mean posterior predictive fit curves, CON and AWB could both potentially account for the soil warming response in the meta-analysis data set. For both models, posterior predictive fit distributions overlapped with the confidence intervals of the data points (Fig. 2). However, with respect to the Bayesian goodness-of-fit metrics, CON quantitatively outperformed AWB across all prewarming SOC and MIC densities observed (Figs. 5 and S4) because the Bayesian metrics adjusted for AWB's larger model size and consistently higher effective parameter count (Fig. S8). For both models, lower prewarming SOC densities corresponded to better warming response fits (Fig. 5).

After fitting, the response ratio curves of CON and AWB both trended toward
the prewarming steady-state response ratio of 1.0 following the soil
warming perturbation (Fig. 3). The settling of the curves to the prewarming
model steady states aligns with previous literature which demonstrated that
the magnitude of

This is not to say that CON was clearly superior from a qualitative
standpoint. CON's mean posterior predictive curves were not able to match a
subsequent local data maximum in the meta-analysis data set at

In contrast, AWB displays damped oscillations in its response ratios
following warming due to its nonlinear dynamics (Figs. 2 and 3). AWB
was able to match the points after

We performed a goodness-of-fit sensitivity analysis to check whether the
response ratio trends stayed consistent, biologically realistic, and
interpretable across a range of prewarming, steady-state soil C densities
and pool-to-pool density ratios. For instance, we imposed constraints to
reflect that MIC-to-SOC density ratios range between 0.01 and 0.04 across
various soil types (Anderson
and Domsch, 1989; Sparling, 1992). CON and AWB response ratio curves
exhibited realistic values and qualitatively consistent shapes across all
prewarming SOC and MIC steady-state densities, even at less realistic SOC
densities above 100 mg C g

We were unable to initiate our prewarming SOC steady-state density below
SOC50 with the priors and MIC-to-SOC ratios used for AWB. Under SOC50, AWB
HMC runs would not reliably run to conclusion and would terminate due to ODE
instabilities. Even at SOC50, we saw a reduction in independent and
effective samples for certain parameters, namely Ea

The information criteria and cross-validation fit metrics generally indicated higher relative probability and predictive performance at lower prewarming SOC values for AWB and CON (Fig. 5). The fit results suggest that SOC density of the soil at the sites included in the meta-analysis was likely closer to the lower end of the SOC density ranges examined in our sensitivity analysis. A less pronounced trend toward better fits was observed as prewarming MIC density was decreased while prewarming SOC density was held constant (Fig. S4). No clear relationship was observed between MIC-to-SOC ratio and goodness of fit in the AWB and CON models.

The worsening IC and CV results at higher SOC densities support the notion
that prewarming steady-state SOC densities should not be initialized over
SOC100 in AWB and CON when fitting to this meta-analysis data set.
Prewarming SOC density was not observed to exceed 50 mg SOC g

Overall, the Bayesian metrics from the goodness-of-fit sensitivity analysis suggest that CON is superior to AWB at explaining the meta-analysis data set when accounting for model parsimony, particularly when the models are initiated in more realistic ranges of prewarming SOC densities under SOC100. However, we caution against using these results to conclude that CON is a comprehensively superior predictive model over AWB without comparisons involving other longitudinal soil warming data sets. And other data aside, we observe that AWB has a useful advantage over CON conditional on the meta-analysis data set alone: AWB was more tolerant of changes in prewarming conditions, displaying less IC and CV than CON as prewarming SOC is increased (Fig. 5a–c). AWB's compensatory ability stemming from its larger model size could be more quantitatively rewarding in goodness-of-fit sensitivity analyses conducted on data assimilations with larger data sets.

For an additional check on the biological realism and plausibility of our simulations, we conducted a sensitivity analysis examining changes in model SOC stocks following warming. The response ratios of postwarming SOC stocks after 12.5 years, evaluated as the ratio of postwarming to prewarming SOC densities, was computed from observed CON and AWB simulations at the posterior parameter means. SOC losses indicated by the response ratios ranged from 8.13 % to 27.1 % across both models (Fig. S3). These results aligned with a recent comprehensive meta-analysis of 143 soil warming studies (Fig. S10). The largest loss of 27.1 %, occurring in CON at SOC50, is sizable, but the meta-analysis included seven studies measuring losses greater than 20 %, with the maximum loss observed at 54.4 % (van Gestel et al., 2018).

Raising prewarming SOC reduced SOC loss after 12.5 years of warming for both models (Fig. S3a). For CON, SOC loss decreased from 27.1 % at SOC50 to 9.2 % at SOC200. For AWB, SOC loss decreased from 17.2 % at SOC50 to 8.13 % at SOC200. Varying prewarming MIC affected the SOC response ratio more substantially for AWB than CON (Fig. S3b). For AWB, SOC loss increased from 11.4 % at MIC1 to 16.3 % at MIC8, while SOC loss decreased from 18.8 % at MIC1 to 17.4 % at MIC8 for CON. The larger effect of increasing MIC on the SOC response ratio in AWB is likely due to MIC influence on SOC-to-DOC turnover, which is not a feedback accounted for in the equations of the CON model (Sect. S1a).

The posterior means for the Arrhenius activation energy parameters Ea of
CON and AWB returned by the HMC simulations across the observed prewarming
C densities (Table S3) differed somewhat from the parameter
values used in Allison et al. (2010) and Li et al. (2014), which were in
turn tuned based on activation energies estimated in a prior empirical
analysis of enzyme-catalyzed soil organic matter decomposition processes
(Trasar-Cepeda et al., 2007).
In Allison et al. (2010), CON parameters Ea

Our HMC simulations arrived at higher Ea values, with the posterior means
of Ea

We found it less useful to compare the posterior means of other fitted
parameters including the C pool transfer coefficients, C use efficiency

Truncating prior and posterior parameter distributions proved useful for
establishing biological constraints and only modestly deformed posterior
densities for AWB and CON. From SOC100 to SOC200, CON and AWB posterior
densities showed little or no deformation from typical normal distribution
shapes. Moderate posterior density deformation was observed for some
parameters in both models at SOC50 and SOC75, namely Ea

A small frequency of divergent transitions was detected in the AWB HMC simulations. Divergent transitions can be thought of as algorithm trajectory errors arising during the HMC's exploration of a convoluted region of parameter space; a more thorough description of the theory, computation, and implications of divergent transitions can be found in literature focusing on the Hamiltonian Monte Carlo algorithm (Betancourt, 2016, 2017). The number of divergent transitions generally increased as the prewarming MIC-to-SOC steady-state ratio was reduced (Fig. S9). Prior truncation and the fixing of select parameters to constrain the prewarming steady-state mass values for biological realism could have played a combined role in generating the Markov chain divergences by hindering the smooth exploration of parameter space. We were unable to eliminate divergent transitions by adjusting HMC parameter proposal step size, suggesting that other methods, such as modification of the HMC algorithm itself or introduction of auxiliary parameters to AWB that reduce correlation between existing model parameters, may be more applicable in reducing divergent transitions in our case (Betancourt and Girolami, 2015). Additionally, the interaction between the ranges of values used for the prior distributions and the limited number of observations in the data set could have contributed to the shaping of geometric inefficiencies (Betancourt, 2017).

It is possible that the instability that prevented consistent solving and
HMC exploration of AWB under SOC50 could be traced to the forward
Michaelis–Menten formulation of decomposition and uptake kinetics used in
the present version of the AWB model (Sect. S1, Eqs. S7,
S8). We initialized the system with a small DOC density lower than that of
MIC at 0.1 mg C g

Some suggestions for the reparameterization of AWB to improve model stability have been proposed that could reduce or even eliminate divergent transitions by facilitating a smoother and steadier parameter space conducive for HMC exploration. One intermediate possibility would be to modify AWB to use reverse Michaelis–Menten kinetics, which would replace the DOC term in the denominators of the decomposition and uptake expressions with the larger MIC term. The use of reverse instead of Michaelis–Menten dynamics has been used to stabilize and constrain other SBMs (Sulman et al., 2014; Wieder et al., 2015b). A more extensive reformulation involves the replacement of Michaelis–Menten expressions with equilibrium chemistry approximation (ECA) kinetics, which would increase the number of denominator terms in decomposition expressions for further stability. ECA equations have been shown to be more consistent in behavior and robust to parameter regime variation than their Michaelis–Menten counterparts, and they thus have been encouraged as a wholesale replacement for Michaelis–Menten formulations (Tang, 2015; Wang and Allison, 2019). These reparameterizations should be implemented and examined in future work that involves sampling and computation of AWB posteriors.

Recent SBM comparisons have been unable to demonstrate the superiority of one model over another because the uncertainty boundaries of the data were not sufficient for distinguishing model outcomes (Sulman et al., 2018; Wieder et al., 2014, 2015b, 2018). Similar to these previous studies, our results indicate that more data is needed to constrain and differentiate between model posterior predictive distributions. Conditional on the meta-analysis data set, CON demonstrates superior quantitative goodness of fit over AWB, but we are not confident that the relative model parsimony of CON and other linear first-order models makes them universally more suitable for predictive use.

Consequently, future SBM comparisons would benefit from additional data collection efforts sourced from long-term ecological research experiments to globally verify the strengths and limitations of linear versus nonlinear SBMs, including CON and AWB, in Earth system modeling. The limited number of longitudinal soil warming studies presents a challenge for facilitating site-specific model comparisons. We addressed this issue by using meta-analysis data to aggregate warming responses across sites, but this approach does not provide site-specific parameters. Additional data from ongoing and future field warming studies in the vein of the Harvard Forest and Tropical Responses to Altered Climate experiments that demonstrate more varied flux dynamics over time than the meta-analysis data set will be of critical importance for model testing (Melillo et al., 2017; Wood et al., 2019). Model parameters could also be better constrained through the use of multivariate data sets, for example microbial biomass dynamics in addition to soil respiration.

Our approach can be expanded to compare the predictive accuracies of linear microbial-implicit models to those of recently developed nonlinear microbial-explicit SBMs that are much larger than AWB, such as CORPSE (Sulman et al., 2014) and MIMICS (Wieder et al., 2014). Such comparisons will help broadly determine if inclusion of more detailed microbial dynamics in models offers predictive advantages that can overcome the overfitting burdens associated with an increase in parameter count. With the appropriate data sets, our approach can also be applied to consider the predictive performance of SBMs that describe the cycling of nitrogen (N), phosphorus (P), and other limiting nutrients in addition to C dynamics. Models that represent N and P mineralization have yet to see extensive head-to-head statistical benchmarking against C-only models with respect to predictive use (Manzoni and Porporato, 2009). With models growing ever larger in size and specificity, there is a need to verify whether detailed representation of microbial processes and the cycling of limiting nutrients are worth the increase in variable, parameter, and equation counts. After all, “the tendency of more recent models towards more sophisticated (and generally more mathematically complex) approaches is not always paralleled by improved model performance or ability to interpret observed patterns” (Manzoni and Porporato, 2009).

The data assimilation and posterior sampling of complex models in future work comes with computing performance challenges. Markov chain Monte Carlo algorithms are effective for exploring multidimensional parameter space but are limited by temporal and computational expense, particularly when it comes to fitting nonlinear differential equation models (Calderhead et al., 2009; Nemeth and Fearnhead, 2019). Time per Markov chain iteration drastically increases with number of parameters and data points. In fact, the present speed limitations of the family of HMC algorithms make it necessary to use a hybrid approach utilizing Monte Carlo and deep learning algorithms for parameter estimation at a global scale; Monte Carlo fitting is used to constrain parameter estimates at a site-based scale before those estimates are tuned globally by deep learning using spatial information derived from satellite maps (Tao et al., 2020). However, Monte Carlo algorithms are still the optimal methods for posterior computation (Duan et al., 2018) and are necessary for Bayesian model comparisons conditional on site-based data. Consequently, recent Monte Carlo algorithm innovations and developments that offer theoretical speed improvements by trading thorough posterior sampling for numerical efficiency have been encouraging and are ripe to be tested in future SBM comparisons involving more complex models and larger data sets. These developments include stochastic gradient Monte Carlo sampling methods, a class of techniques in which a posterior is approximated by fitting to a small subset of data at each iteration rather than estimated through exhaustive sampling (Ma et al., 2015), and Gaussian process acceleration, in which a smooth distribution of likely solutions for a differential equation system is specified and sampled in place of explicitly solving for the state variables during every Markov chain iteration (Calderhead et al., 2009; Dondelinger et al., 2013; Wang and Barber, 2014).

Alongside advances in Monte Carlo algorithms, additional developments in
Bayesian cross-validation and information criteria measures are also
available for practical trialing in soil biogeochemical data assimilation.
Gelman et al. (2019) have proposed a stable Bayesian counterpart of frequentist

The development of our formalized, statistically rigorous approach for model comparison and evaluation is a critical step toward the goal of projecting global SOC levels and soil emissions throughout the 21st century. Our initial results indicate promise in continued refinement and expansion of our approach to evaluate the predictive performance of linear and nonlinear SBMs. The future integration of updated Markov chain algorithms and Bayesian predictive accuracy metrics into our framework will expand the ability to efficiently and thoroughly compare differential equation models, even if they vary widely in structure and complexity.

The R scripts, Stan code, and respiration data set used for HMC model
fitting along with the original soil respiration meta-analysis data set
(Romero-Olivares et al., 2017) are
available from the directory located at

The supplement related to this article is available online at:

SDA and HWX designed the study with assistance from MG. HWX and ALRO performed the data cleaning and analysis. HWX wrote the necessary code for the study with assistance from SDA. SDA and HWX prepared the paper with suggestions from MG.

The authors declare that they have no conflict of interest.

We would like to thank Stan development team members Aki Vehtari (Aalto University), Michael Betancourt (Symplectomorphic, LLC), Bob Carpenter (Flatiron Institute), Ben Bales (Columbia University), Charles Margossian (Columbia University), and Sebastian Weber (Novartis) for their patient help with Stan code implementation and troubleshooting. We would also like to thank both anonymous reviewers for their valuable and constructive comments, which not only aided in the revision of the manuscript but also provided valuable insights to guide future work.

This research has been supported by the National Science Foundation (grant no. DEB-1900885), the U.S. Department of Energy Office of Science BER-TES program under grant DESC0014374, and the National Institutes of Health T32 Training Program (National Institute of Biomedical Imaging and Bioengineering, grant no. EB009418).

This paper was edited by Trevor Keenan and reviewed by two anonymous referees.