Evaluation of denitrification from three biogeochemical models using laboratory measurements of N2, N2O and CO2

Biogeochemical models are useful for the prediction of nitrogen (N) cycling processes, but accurate description of the denitrification and decomposition sub-modules is critical. Current models were developed before suitable soil N2 flux data were available; new measurement techniques have enabled the collection of improved N2 data. We use measured data 15 from two laboratory incubations to test the denitrification sub-modules of existing biogeochemical models. Two arable soils – a silt-loam and a sand – were incubated for 34 and 58 days, respectively. Fluxes of N2, N2O and CO2 were quantified using gas chromatography and isotope-ratio mass spectrometry (IRMS). For the loamy soil, seven moisture and three NO3 contents were included, with temperature changing during the incubation. The sandy soil was incubated with and without incorporation of litter (ryegrass), with temperature, water content and NO3 content changing during the incubation. Three 20 common biogeochemical models (Coup, DNDC and DeNi) were tested using the data. No systematic calibration of the model parameters was conducted since our intention was to evaluate the general model structure or ‘default’ model runs. As compared with measured fluxes, the average N2+N2O fluxes of the default runs for loamy soil were approximately 3 times higher for Deni, 105 times smaller for DNDC and 22 times smaller for Coup. For the sandy soils, default runs were 3 times higher for DeNi, 7 times smaller for DNDC and 12 times smaller for Coup. While measured fluxes were overestimated by 25 DeNi and underestimated by DNDC and Coup, the temporal patterns of the measured and the modeled emissions were similar for the different treatments. None of the models was able to determine litter-induced decomposition correctly. The reason for the differences between the measured and modeled values can be traced back to model structure uncertainty and/or parameter uncertainty. Given the aim of our work to assess existing model processes for further development and/or to identify missing processes within the models these results provide valuable insights into avenues for future research. We 30 conclude that the predicting power of the models could be improved through future experiments that collect data on denitrification activity with a concurrent focus on control parameter determination. 10 Swedish University of Agricultural Sciences, Department of Forest Ecology and Management, 901 83 Umeå, Sweden https://doi.org/10.5194/bg-2021-77 Preprint. Discussion started: 13 April 2021 c © Author(s) 2021. CC BY 4.0 License.

surface soil (0 to 10 cm depth) was collected in large, plastic boxes. Soil was returned to the lab, where it was sieved to 10 mm, homogenized, subsamples analyzed for physical and chemical properties (Table 1), and remaining soil stored at 4 o C until use. 100 whether previously observed NO3 --N losses during winter could be explained by denitrification (Ziemer, 2006). Gas samples were collected manually once a day and analyzed by gas chromatography (GC) (Well et al., 2009) to determine N2O and CO2 fluxes, and by isotope ratio mass spectrometry (IRMS) to determine the flux of N2+N2O originating from the 15 Nlabeled NO3 - (Well et al., 1998;Lewicka-Szczebak et al., 2013). Soil samples were collected at the beginning and end of the 125 incubations and analyzed for NO3 -, NH4 + and water content as described in Buchen et al. (2016).

Sand laboratory incubation 135
To exclude the phase of intensive respiration and mineralization typically following rewetting, soils were pre-incubated at 50% of maximum water holding capacity for 3 weeks (at room temperature). After pre-incubation, 15 N-labelled KNO3 solution (50 mg N kg -1 dry soil) was added to the soil, and thoroughly mixed ( Table 2 and Table 3). After addition of NO3, the soil was divided, and in half of it, ground ryegrass (sieved with 1 mm mesh; added at a rate of 2.2 g kg -1 dry soil) was also homogenously incorporated. The ryegrass had a C-to-N ratio of 25, and nitrogen, carbon and sulphur content of: 1.3%, 140 32.2% and 0.4%, respectively. Four replicates of soil from each of the two treatments (with and without ryegrass) were then packed into plexiglass cylinders (14.4 cm inner diameter) at typical field bulk density (1.5 g cm -3 ) and a soil depth of 10 cm ( Table 2). The soil cylinders were incubated for 58 days, during which the headspace was continuously flushed with an artificial gas mixture (2% N2 and 20% O2 in He) at a flow rate of 20 ml min -1 . The low N2 concentration was established to increase the sensitivity of N2 flux detection (Lewicka-Szczebak et al., 2017). 145 The cylinders were incubated using an automated incubation system, including gas analysis by GC, suction plates at the bottom of the cylinders to control water potential and collect leachate, and an irrigation device to mimic precipitation and/or fertilization. Using the GC, N2O, CH4, N2 and CO2 (Säurich et al., 2019) were continuously measured throughout the incubation. Gas samples were also collected manually for IRMS analysis, to determine fluxes of N2 and N2O originating from the 15 N labeled NO3 - (Well et al., 1998;Lewicka-Szczebak et al., 2013). The pressure head at the suction plates was 150 controlled by connecting the bottles for seepage collection to a gas reservoir, which was maintained at the target pressure (±0.5 kPa). Water potential in the soil column resulted from the difference in the pressure head between the soil cylinder headspace and suction plate. Headspace pressure was positive due to the continuous headspace flow and flow restriction in the exhaust line of the gas sampling system. Instability in the headspace pressure (values between 1 and 3 kPa) occurred near the end of the experiment, due to partial clogging of the hypodermic needles that were used to lead the exhaust gas through 155 sampling vials (Well et al., 2006). Therefore, pressure head in the soil columns was associated with an uncertainty of about 2.5 kPa.
The water content of the soil was initially set to 0.231 g g -1 (equivalent to 80 % WFPS) and was subsequently changed by establishing defined water potential at the suction plates (Table 4) and by adding water and/or KNO3 solution from the top of the columns as irrigation/fertilization events. Phases with defined temperature were set as shown (Fig. S.3 and Table 4). Soil 160 samples were collected at the beginning and end of the incubations and analyzed for NO3 -, NO2 -, NH4 + , DOC, pH and water content as described in Buchen et al. (2016).

Model choice and description
Using the denitrification data collected in the incubations described above, we tested the denitrification sub-modules of three biogeochemical models: Coup (Jansson and Moon, 2001), DNDC (Li et al., 1992) and DeNi (based on the approach of the NGAS and DailyDayCent; Parton et al., 1996 andDel Grosso et al., 2000). Our first criterion of denitrification sub-modules evaluation was the agreement of measured and modeled results with respect to directional changes of N2 and N2O (i.e. fluxes 170 increasing or decreasing) in response to the relevant control factors. Comparing the magnitude of measured and modeled fluxes was not considered as a criterion. We note that individual control factors were tested only to a limited extent, and were otherwise affected by interactions with other control factors (see description of experimental design in 2.12.1).
Nevertheless, comparing the temporal dynamics of fluxes measured in the experiments and given by the models, reveals the suitability and deficiencies of both, which can pave the way for planning better model evaluation experiments and for 175 developing model routines to fill the gaps of the current approaches.

Coup
Coup (coupled heat and mass transfer model for soil-plant-atmosphere systems) is a complex, adjustable process-oriented model that uses a modified approach of PnET-N-DNDC to simulate nitrification and denitrification (Norman et al., 2008). It is a developed version of the SOIL and SOILN models (Jansson and Moon, 2001). The main model structure is a vertical 180 layered, 1-D soil profile. Coup includes all main heat and water flow processes in the soil profile as well as exchange with the atmosphere.
A first order kinetics approach for two pools (litter and humus) governed by response functions of soil moisture and temperature is used to simulate soil organic carbon dynamics. In Coup, soil litter represents the rapidly decomposable organic material (e.g. fresh plant litter) and the humus pool represents the more resistant fraction. Fluxes of NO,N2O and N2 185 are modeled via nitrification and denitrification, which in turn are obtained from modeled parameters including respiration, mineral N, and dissolved organic C (DOC). The soil anaerobic fraction is defined by the approach of the anaerobic balloon concept of DNDC (Norman et al., 2008). https://doi.org/10.5194/bg-2021-77 Preprint. Discussion started: 13 April 2021 c Author(s) 2021. CC BY 4.0 License.
The simulation of nitrification is calculated by the response functions of soil temperature and moisture, pH and NH3 concentration (Norman et al., 2008) Denitrification processes are simulated by soil temperature, pH and the N concentration 190 of the microbial pool and the anaerobic fraction of the soil (Jansson and Karlberg, 2011) (see Table S.6). The model can simulate C, N and water fluxes in hourly resolution. The complex modular structure gives flexibility to users for planning a step-by-step increase in the complexity of simulations. This option is ideal for the simulation of laboratory experiments.
Users can freely define the thickness and the number of soil layers and the setup of the initial conditions of each layer. The model can also simulate changes of the parameters between soil layers. 195

DeNi
DeNi was programmed based on the nitrification and denitrification approach of the NGAS model (an early stage of the DayCent model) (Parton et al., 1996) (see Table S.6). The approach of the DailyDayCent (and therefore DeNi) model for the description of denitrification is a hybrid between detailed process-oriented models and simpler nutrient cycling models (Parton et al., 1996). It allows users to separately test the nitrification and denitrification sub-modules. The model runs on 200 daily time steps. The main difference between DailyDayCent and Coup is that Coup, like other more complex processoriented models, explicitly models denitrifier dynamics. In contrast, the DailyDaycent/NGAS model is a relatively simple, semi-empirical model to simulate the N2+N2O production without directly considering microbes. It uses empirical parameters and functions that have no direct physical, chemical or biological explanation and were developed from experimental observations. Therefore, it is the combination of a simplistic nutrient cycle model and a more detailed process-205 based model (Parton et al., 1996).
The N2O flux from nitrification is modeled using: soil pH, soil temperature, soil moisture, soil NH4 + concentration (available NH4 + is then computed as a function of NH4 + concentration), and the N turnover coefficient, which is a soil-specific parameter.
The denitrification sub-module calculates the fluxes of N2 and N2O. The soil heterotrophic respiration rate (depending on the 210 available carbon), soil NO3concentration and soil moisture (WFPS) control total denitrification. The N2/N2O ratio is calculated as a function (F(NO3/CO2)) of electron donor (NO3 -) to substrate and soil water content (Del Grosso et al., 2000).

DNDC
The Denitrification-Decomposition model (DNDC) is a complex, widely used process-based model of C and N biogeochemistry in agricultural ecosystems (e.g. Li et al. 1994). It has been extensively tested globally and has shown 215 reasonable agreements between measured and modeled N2O emissions for many different ecosystems (e.g. Li, 2007;Kurbatova et al., 2009;Giltrap et al., 2010;Khalil et al., 2016;. Several modifications/versions have been developed to fit different ecosystems and those provide variable estimates depending on the model versions used. DNDC contains six sub-modules: soil climate, crop growth, decomposition, denitrification (see Table S.6), nitrification and fermentation. It additionally includes subroutines for cropping practices (fertilization, irrigation, tillage, crop rotation and manure addition). The model joins denitrification and decomposition processes together to predict emissions of C and N from agricultural soils, based on various soil, climate and environmental factors. It considers the soil as a series of discrete horizontal layers with uniform soil properties within each layer, except for some soil physical properties that are anticipated as being constant across all layers. However, time-dependent variations in soil moisture, temperature, pH, C and N pools are considered for a reliable estimate of C and N fluxes by calculating them for each soil layer for each time step. Like in Coup, 225 denitrifiers are explicitly modeled.

Model initialization
Selected experimental data for model evaluation included denitrification (N2 and N2O fluxes produced from soil NO3 -) and "proximal" and "distal" controls (according to the definition by Groffman et al., 1988). Proximal controls were temperature, 230 NO3 -, pH and organic C. Distal controls were soil moisture, texture, NH4 + -N, bulk density and respiration (as a proxy for O2 consumption).
Our first criterion of model evaluation was the agreement of measured and modeled results with respect to directional changes of N2 and N2O (i.e. fluxes increasing or decreasing) in response to the relevant control factors. Comparing the 235 magnitude of measured and modeled fluxes was considered a secondary criterion. In the two experimental setups, individual control factors were only tested to a limited extent, while the remaining measurements reflected the interaction of multiple control factors (see 2.1.3 and 2.1.4). Those interactions presented additional complexity, which would not classically be used for model evaluation, yet provided valuable data on the temporal dynamics of measured vs modeled fluxes (see discussion for details). 240 Models were set up according to the initial experimental setups of the two incubations (i.e. 7 initial model set-ups for siltloam and 2 set-ups for sand; Table 2). For the silt-loam soil, only soil temperature was changed during the experiment, while for the sand soil temperature, soil water status (change of the water potential and irrigation) and NO3content (by irrigation with KNO3 solution) were changed. We first compare to which extent the models fit the magnitude of fluxes in general, and 245 subsequently, whether the models reflect the observed differences between the experimental treatments.

Coup
Coup gives users the option to choose between different algorithms, each representing the functionality of a sub-module, with each sub-module addressing a different aspect of the soil-atmosphere-vegetation system (Senapati et al., 2016;He et al., 250 2016;Norman et al., 2008;Nylinder et al., 2011;Conrad and Fohrer, 2009). This feature was used to adapt the model structure to the experimental setup and the available data ( In the model, soil columns of sand were divided into 5 layers (we are assuming equilibrium, and it was calculated based on the water retention curve and layer depth) with layer extents of 2 cm. The water retention curve was not available for the siltloam soil. The soil columns were thus modeled as a 25 cm unified, single soil layer. Daily water content and soil temperature 255 were set up in the model as dynamic input parameters coming from water balances and measurements, respectively. The initial contents of organic carbon, total N, NO3 --N and NH4 + -N of the silt-loam and sand were set up in the model (Table   S.8). The initial amount of SOC allocated into the labile pool was based on default SOC allocation fractions. For sand treatments with application of ryegrass, the C and N of ryegrass were exclusively added to the labile pool. Since the basic settings resulted in overestimation of CO2 production, first order decomposition rate coefficients for litter and humus were 260 changed to modify decomposition and mineralization to fit measured respiration rates.
From the two available algorithms to describe denitrification, the algorithm with explicit consideration of denitrifiers was chosen (Table S.7), because we wanted to test a model which includes the microbial approach for the denitrification submodel. The structure and the complexity of Coup made it necessary to modify some model parameters and settings to improve the fit between modeled and measured N2O and N2 fluxes. The applied settings and parameters are in Tables S.6, 265 S.7 and S.8. Parameters were adjusted separately for each experiment (silt-loam and sand) but were identical between treatments.

DeNi
Parameter adjustment and data input were accomplished using the DeNi source code. Measured soil texture, bulk density, 270 initial NO3 -, NH4 + and SOC were used to initialize the model. We ran the model calculated with one soil layer for the siltloam soil and with five, 2 cm thick soil layers for the sand soil, with differing water contents. We used the measured daily temperature and the theoretical (calculated) water content of each of the 5 layers (see 2.3.1). Irrigation, seepage and fertilization events were included, and the model was modified with calculated changes in NO3 --N and water content, which were calculated based on the irrigation, seepage and fertilization events. The theoretical NH4 + and NO3concentrations (Table  275 S.2) were changed (modeled production and consumption) by mineralization, nitrification, denitrification, leaching and the added fertilizer (Table S.1) during the simulations. For the calculation of missing soil physical parameters (e.g the soil gas diffusion coefficients) the respective pedotransfer functions were applied (Saxton and Rawls, 2000). Besides N2O and N2, the model also calculated the soil fluxes of CO2.

DNDC
The latest version of DNDC (DNDC95) was used to simulate N2O, N2 and soil CO2 emissions. The model was originally designed for field and regional scales. Therefore, certain adjustments had to be made to establish suitable model inputs to https://doi.org/10.5194/bg-2021-77 Preprint. Discussion started: 13 April 2021 c Author(s) 2021. CC BY 4.0 License.
represent the conditions of the laboratory incubations. Based on the experimental setup for the sand soil, the irrigation with KNO3 solution had to be simulated as rainfall containing NO3and the atmospheric background of NH3 and CO2 was 285 considered zero and negligible, respectively, since the incubation was in an artificial atmosphere. Minimum and maximum temperatures were set according to the actual experimental values. Measured soil inputs were included with microbial activity index (factor to modify the denitrification process) of 1. The mixing of the experimental soil prior to incubation was applied as litter-burying till with no crop and coupled with water and NO3fertilizer addition. Nitrate fertilizer was added twice with ryegrass residue as straw either mixed or omitted. Water was added once in the beginning and twice in the middle 290 of the experiment as per treatments in the form of irrigation following comparative tests with rainfall as well as rainfall and irrigation options.
To run the model using inputs from the silt-loam incubation, the microbial activity index, temperature setting and mixing of soil with water as irrigation and fertilizer were simulated as in the sand incubation but irrigation and fertilization were assumed to occur only once in the beginning and rainfall was considered zero. 295

Statistics
Statistical calculations were done using the Python 3 (Van Rossum and Drake, 2009) and the R (R Core Team, 2013) programming languages and GNUPlot (Williams and Kelley, 2011) interactive plotting program. A multiple comparison of means (Tukey HSD, p<0.05) was performed on the N2+N2O and CO2 data of the silt-loam soil. The N2+N2O data of the sand 300 soil was not normally distributed. Therefore, the Wilcoxon signed-rank test was used for these data to test the effect of the ryegrass application.

Silt-loam soil 305
Results of the seven treatments are shown in Table 5. CO2 fluxes were positively correlated with temperature ( Fig. S.3).
Cumulative CO2 fluxes were generally highest in the treatments with low WFPS and lowest in the treatments with high WFPS and bulk density (

Sand soil
Fluxes are shown for individual replicates of both treatments ( Fig. 2 and Fig. 3 WFPS was equivalent to a water potential of -3 kPa according to the water retention curve of this soil ( Fig. S.2). Leaching dynamics were also highly variable between replicates (Table S.1). 330 Table 6: Measured cumulative fluxes (N2, N2O, N2+N2O: g N m -2 day -1 ; CO2: g C m -2 day -1 ) and product ratios over a 58 days laboratory incubation of sandy soil from Fuhrberg, Germany. Shown are averages and standard deviation of 4 replicate cores with (C1-4) and without (C5-8) added ryegrass. Comparing the cumulative CO2 fluxes of the two treatments, ryegrass-amended columns were (2-4 times) higher than those without ryegrass ( Table 6). The CO2 fluxes reached a maximum after 8-13 days and then slightly decreased until the Day 32 The cumulative N2+N2O fluxes were almost 8 times higher in ryegrass compared to the control treatment. N2+N2O fluxes were initially high in both treatments (Figs. 2a and 3a) but decreased rapidly following the initial drainage period (Table   S.2). During the remainder of the experiment, fluxes remained low and were only to a minor extent affected by the 345 experimental manipulations. Initially, the ryegrass treated cores had high N2+N2O fluxes which rapidly decreased during the incubation. Between the first and the second (09/02 and 14/02) water content manipulation events, cores 2 and 3 responded with smaller N2 and N2O (core 3 only) peaks.
The N2O/(N2+N2O) ratio of fluxes (Table 6) shows that N2O dominated the N fluxes. The N2O/(N2+N2O) ratio was similar for both treatments. During the irrigation-fertilization period at day 31, the N2 production increased in both treatments (

Modeled results of silt-loam soil 370
DeNi and Coup overestimated CO2 production, with predicted CO2 fluxes 3 to 10 times higher than the measured values, whereas DNDC mostly underestimated the measured fluxes (Table 8). The variability of the model calculations is quite low, and the fluctuation of the values does not always follow the changes of the measured values. The time series of the CO2 flux calculation of DeNi followed the fluctuation of the temperature settings whereas the other models mostly predicted only decreasing trends over time as shown for treatment VI20N_80%_1.4 (Fig. 4). 375 On average, DeNi calculated ~4 times higher N2+N2O fluxes than measured. In contrast to this, N2+N2O fluxes obtained from Coup were about 9 times lower than the measured values, despite the fact that the N2O estimation of Coup was quite close to the measured values. In DNDC, it is notable that N2 fluxes were always zero and it therefore underestimated  After the initial 10 days and until day 29, the pattern of Coup is more or less similar to the measured values, but the 390 magnitude of the modeled values is approximately 2-3 times smaller (Fig. 4). There were a few similarities between measured and modeled fluxes when comparing the cumulative N2+N2O fluxes (Fig. 5) and the ratio of the N2+N2O fluxes (Table 7) (Table 7).
Normalized treatment effects of model results and measurements are shown in Table 7. The ratio between relative treatment differences of measured and modeled values is 1, if the measured and the modeled values changed with the same magnitude in the same direction. If the ratio is bigger than 1, the direction of measured and modeled values is the same, but the magnitude of the response is bigger in the model than was seen in the measured values. If the value is between 0 and 1, the 410 direction is the same, but the magnitude of the response is smaller in the model than was seen in the measured values. If the ratio is negative, the direction of the response is opposite in the model as compared to the measurements. For ratios of 0, there was no model response to differences between treatments.
For Coup, ratios showed that modeled treatment differences were either absent (10 of 21), lower than (4 of 21) or opposite (7 of 21) to measured differences. For DeNi, the model always responded to treatments (i.e. no 0 ratios), with most (14 of 21) 415 cases showing a model response in the same direction as measured values, and two cases where the model had significantly higher ratios than the measured values. For DNDC, with two exceptions, ratios indicated either lower (11 of 21) or opposite (5 of 21) response of the model as compared to measured values, with 3 instances where the model did not respond (i.e. ratio of 0). 420 Table 8: Average measured (average of the 5 measurement events for 34 days) and modeled (Coup, DeNi and DNDC models) N2, N2O (g N m -2 day -1 ) and CO2 (g C m -2 day -1 ) fluxes of 7 incubation treatments for a silt-loam, arable soil from Hattorf, Germany. Treatments include different levels of NO3addition (10, 20 and 40 mg N kg -1 ), WFPS (73-90%) and soil bulk density (1.4-1.52 g cm -3 ).

Modeled results of sand soil 430
For the ryegrass-treated sand, the Coup-estimated CO2 fluxes fitted the measured emission pattern quite well (Fig. 6e).
Except for an initial peak, the pattern and the magnitude of measured and modeled fluxes were almost identical. Coup overestimated the soil respiration for the control treatment (Fig. 7e), but the temporal pattern of the modelingespecially for the temperature manipulationfitted the measured values.
DeNi overestimated the CO2 fluxes for both treatments and did not respond to the labile organic C of the ryegrass treatment, 435 since modeled CO2 fluxes of both treatments were almost identical ( Fig. 6d and Fig. 7d). While the pattern of the modeled fluxes followed the changes of temperature and soil water content, the magnitude of the response to these changes was too large.
DNDC calculated the smallest CO2 fluxes among the three models. The modeled estimates did not reflect a litter effect and underestimated the measured values for the ryegrass-treated soil (Fig. 6f). The model provided much better estimation for 440 https://doi.org/10.5194/bg-2021-77 Preprint. Discussion started: 13 April 2021 c Author(s) 2021. CC BY 4.0 License. the magnitude of CO2 fluxes of the control treatment (Fig. 7f). While there was not an ideal agreement in the temporal pattern, some of the changes of the environmental conditions are clearly reflected.
Similar to the silt-loam experiment (Fig. 4b), the pattern of the estimated N2+N2O fluxes by Coup was opposite to the trend of the measured fluxes, exhibiting a constant initial increase in both treatments (Fig. 6b, 7b). The subsequent rapid decrease 445 of CO2 and N2+N2O fluxes resulted from the temperature manipulation. The modeled patterns of DeNi and DNDC (Figs. 7a and c) are closer to the measured fluxes and both clearly reflect the wetting phase, which caused an increase in measured N2+N2O fluxes of the treatment without litter but only elevated N2 fluxes in the ryegrass treatment.
Comparing the order of magnitude of cumulative modeled and measured N2+N2O fluxes (Table 9), DeNi showed agreement in the ryegrass treatment, but overestimated fluxes of the control treatment by one order. Conversely, DNDC and Coup 450 showed close agreement in the treatment without ryegrass but underestimated fluxes with ryegrass by one to two orders.
The N2O/(N2+N2O) ratio of cumulative fluxes modeled by DeNi and Coup was between 0.3 and 0.45 in both treatments (Table 9) and thus much lower than the measured ratios (>0.9, Fig. S.7, Table 9). The modeled N2O/(N2+N2O) ratio of DNDC was close to 1 because the N2 flux estimation of DNDC was almost zero, i.e. five orders of magnitude lower than measured fluxes. 455 The response of modeled N2+N2O fluxes to increasing soil moisture following irrigation differed among models, with DeNi and DNDC predicting immediate responses (Fig. 6a and c). The response for the soil moisture manipulation of Coup was not observed during the initial growth of denitrification (Fig. 6b).  and kg C ha -1 ) and product ratios (dimension less) for sand, arable soil from Fuhrberg, Germany. C1-4 means the first 4 465 parallel columns for the ryegrass treatment. The C5-8 means the 4 parallel columns of the control/non ryegrass treatment.

Silt-loam soil
The highest cumulative CO2 fluxes were measured at low WFPS/bulk density and the lowest fluxes at high WFPS/bulk 475 density (Table 5). Respiration thus reflected the typical responses to temperature and aeration (Davidson et al., 2000). Figure   1a shows that the total denitrification was controlled by several interacting factors, where decreasing nitrification can be explained by the combination of substrate exhaustion and temperature (Müller & Clough, 2014). The increasing denitrification in the wettest treatment (VII20N_90%_1.4) could be due to ongoing O2 depletion resulting from respiration at low diffusivity during the early phase of the incubation (Well et al., 2019). The low N2O/(N2+N2O) product ratio (between 0.088 and 0.264, Table 5) indicated that N2O was effectively reduced to N2, so that total fluxes were dominated by N2. Since high NO3contents and low pH are known to inhibit N2O reduction (Müller & Clough, 2014), the low N2O/(N2+N2O) ratios might explained by near-neutral pH values or low NO3contents, below the reported threshold for N2O reduction inhibition (45 mg N kg -1 ; Senbayram et al., 2019). The relevance of NO3content for controlling the product ratio is supported by the fact that the lowest N2O/(N2+N2O) ratio was observed in the treatment with 485 lowest NO3 --N concentration (II10N_80%_1.4), whereas the highest values were obtained at the highest NO3content (III40N_80%_1.4). However, it is notable that the highest NO3in this study (40 mg N kg -1 ) was still below the 45 mg N kg -1 threshold.

Sand soil 490
The dramatic differences between measured fluxes of control and ryegrass soils (2-4 orders of magnitude for CO2 and almost 8 for N2+N2O; Table 6) can be explained by the effects of labile carbon from ryegrass on microbial respiration and enhancement of denitrification due to increased O2 consumption and supply of reductants for denitrifiers (e.g. Senbayram et al., 2018). In contrast, the control soils not only had no ryegrass amendment but were also pre-incubated (decreasing the what labile carbon was present) to avoid an initial peak in CO2 fluxes after the re-wetting of the dry soil (see methods). The 495 CO2 fluxes of the ryegrass treated cores (cores 1-4) between days 4 and 12 show a rapid increase (Figs. 2d, 3d). The large response of respiration to the ryegrass treatment almost hides the smaller effects resulting from the changing water and NO3content, while these effects were clearly visible in the control treatment. However, small effects with a similar pattern to that seen in the control soils were also evident in the ryegrass treatments (Figs. 2d, 3d, S.4 day 25-35 increasing trend all cores expect core 2). Other notable responses in Figs. 2d, 3d are the higher peaks of CO2 on day 7 and a big decrease in the CO2 500 flux values for both treatments on day 38. On day 7, the water content of the soil cores was decreased (Fig. S.4) and it resulted the higher CO2 emission. On day 38, a simultaneous increase in water content and decrease in temperature (Fig. S.3 and S.4), which presumably caused lower CO2 flux.
The time course in N2+N2O fluxes (Figs. 2a and 3a) can be explained by the combination of easily available carbon, the effect of soil water content and changes in the soil NO3content. The control treatmentwithout organic matter amendment 505 was almost one magnitude smaller than the ryegrass treated soil cores, but the initial high water and nitrate content (80% WFPS, 66 mg N kg -1 dry soil, Table 2 and 3) resulted in higher N2+N2O fluxes in the first 4 days of the experiment for both treatments. The water potential at the bottom of the cores was changed at day 4 and the water and NO3content decreased in the soil cores (Table S.1 and S.2). The increase of the water (Fig. S.4) and NO3content (Table S.1 at 08.03.2017) between the days 24 and 27 led to increasing N2+N2O fluxes in both treatments. The NO3content and the seepage of leachate show 510 some variability between replicates (Table S.1 and S.2) which we attribute to the fact that initial water content (80% WFPS) was located in the steep sloping section of the water retention curve (Fig. S.2), where small changes in water potential would be related to large change in water content. The variable leaching is thus probably due to the limited precision of water https://doi.org/10.5194/bg-2021-77 Preprint. Discussion started: 13 April 2021 c Author(s) 2021. CC BY 4.0 License. potential control (Table S.1). At 80% WFPS, our estimated uncertainty in pressure head control of 20 mbar would lead to an uncertainty in soil water contents equivalent to 0.023 g g -1 or 8.1% WFPS. Presumably, the possible uncertainty of the 515 manual compaction of the soil columns may also have resulted in minor variability in water retention properties among the soil columns.
Seepage of the cores not only lowered water contents but also caused loss of NO3 - (Table S.1 and S.2). The high and variability in water and NO3content might explain some of the measured variability in gaseous N fluxes (initially high fluxes in both treatments, but decreasing quickly (Figs. 2a and 3a)). While the organic matter amendment clearly enhanced 520 denitrification in the initial phase with high water content, this was not the case during the later phases when fluxes of both treatments were similarly low, likely since anoxic micro-sites disappeared due to improved aeration.
The product ratio of fluxes (Table 6) shows that mostly N2O was emitted, which we attribute to the high NO3 --N level and the low pH (Table 1) (Müller & Clough, 2014). The product ratio was similar with and without litter amendment. This might indicate that the combined inhibitory effect on N2O reduction by low pH and high NO3was more effective than the potential 525 enhancement in N2O reduction in presence of labile C in the ryegrass treatment (Müller & Clough, 2014).

Possible explanations for the deviations between measurement and modeling
The goal of this work was to test and evaluate the denitrification sub-modules of the models and not to harmonize the measured and modeled values by calibration or to rate the performance of the different models. Clearly, after calibration, the 530 models can simulate results of the same magnitude as the measured values. Our aim, however, was to find the missing processes and limitations of the sub-modules for further model development.
Overall, there were large differences between the measured and modeled results. Modeled N2+N2O fluxes were between 10 and 580% of measured fluxes in the silt-loam incubation and between 1 and 9060% in the sand incubation (Table 8 and 9). DNDC, originally developed to accommodate field conditions, calculated almost zero N2 emissions for both treatments. The 535 structure of the model with a simple soil water management sub-module, rather than the option to manually set up the daily soil water content, may not be a good fit for laboratory experiments. The model provided a higher amount of leachate in the first days of the simulations. This could be the reason for the lower N2O and the almost zero N2 production.
In theory, there should be a certain lag time between rainfall or irrigation and the occurrence of denitrification in the soil (Tiedje 1978;Smith and Tiedje 1979). DNDC ignores this lag time ( Fig. 6c and 7c, day 25), as shown by the modeled N2 540 and N2O fluxes, which occurred almost immediately after the rewetting of the soil. In contrast, because Coup assumes growth of denitrifiers as a prerequisite of denitrification, there are no abrupt changes in the modeled denitrification, as any possible response was masked by the ongoing growth of the denitrifier community (refer to 4.2.3).
There was also disagreement in the N2O/(N2+N2O) ratio and the temporal dynamics of the modeled fluxes, which did not always fit well with the measurements (Fig. 4, 6, 7 (a, b, c), S.6 and S.7). Models were used with the default settings of 545 coefficients because it was not the objective of this study to calibrate them using the measured data. It was therefore not https://doi.org/10.5194/bg-2021-77 Preprint. Discussion started: 13 April 2021 c Author(s) 2021. CC BY 4.0 License. expected that the modeled data would fit the magnitude of measured fluxes. However, the poor fit in temporal dynamics and the N2O/(N2+N2O) ratio shows that some of the model routines were not adequate to obtain correct responses to the denitrification control parameters established in the experiments. It is notable, though, that some model parameters were not assessed in the experiments (e.g. labile C content, denitrifier biomass, anaerobicity of the soil) and also that the temporal and 550 spatial resolution in the measurement of control factors such as mineral N and soil moisture was limited; including these may have improved model estimates

Complexity of models
The agreement of measured and modeled results depends not only on the experimental set-up, and to which extent model 555 parameters are represented by measurements, but also on the model complexity. DNDC and Coup are complex, with more parameters and more elaborate descriptions of denitrification and decomposition than DeNi. However, using this detailed approach may allow some factors to dominate the denitrification calculations and give biased results (Metzger et al., 2016).
Laboratory mesocosm experiments simplify 'real' field conditions, and the simplicity of DeNi could be one reason why it had reasonably good success modeling the incubation experiment. The pure nitrification and denitrification approach of 560 DeNi minimizes the influence of sub-modules that represent more complex processes, which are present in Coup and DNDC. For example, rather than using a water management sub-model, we were able to input measured daily water and soil NO3content into DeNi, which may have contributed to the better predictions. In contrast, Coup and DNDC use sub-models to predict changes in soil water and NO3content. Coup has an option to overwrite the calculated daily water content data, which we used, but this option was not available for DNDC. In fact, Coup provides numerous options to turn on or off 565 different sub-modules and use constant values instead of dynamically changed parameters (Table S.6 and S.7). Simplifying by turning off sub-modules decreases the complexity of the model and, in 'simplified' experiments, such as ours, may actually improve the final results.

Labile organic carbon (litter) 570
While treatments without litter amendment were relatively low in labile C content, the ryegrass treatment was established to mimic incorporation of crop residues and thus contained large amounts of labile organic C. Coup and DNDC provide options to modify the labile C and N pools, and in running these models, the C and N content of the ryegrass was added to the respective labile pools. However, the results of Coup and DNDC didn't reflect the extremely fast decomposition that was observed in the experimental results (Fig. 6f). Although the measured results showed that soils with ryegrass amendment had 575 345% higher CO2 than control soils (Table 9.), Coup calculated similar CO2 fluxes for both treatments (Fig. 6e and 7e) as well as only a 10% difference in the modeled N2 and the N2O fluxes between the treatments (Fig. 6b and 7b). DNDC actually calculated 40% higher CO2 fluxes for the control treatment as compared to the ryegrass-amended soils (Fig. 6f and https://doi.org/10.5194/bg-2021-77 Preprint. Discussion started: 13 April 2021 c Author(s) 2021. CC BY 4.0 License. 7f). The decomposition rate of the ryegrass in the models needed to be much higher than the decomposition rate that is currently provided for the labile pools. Because DeNi has a simple, one C pool approach for calculating soil respiration, it 580 was also unable to handle the extra ryegrass as rapidly decomposable carbon. Similar to Coup, DeNi calculated similar soil respiration and N2+N2O fluxes for both treatments of the sand soil (Fig. 6a,d and 7a,d).
In these models, decomposition processes are assumed to be driven by soil water content and temperature (Table S.6), thus the microbial response to treatments (e.g. NO3addition, pH), although they are known to influence microbial carbon use (Manzoni et al., 2012), cannot be explicitly simulated. It should also be noted that decomposition of the labile and 585 recalcitrant pools in these models are calculated independently. However, field and empirical data (Kuzyakov, 2012) suggest adding labile C could also enhance the decomposition of resistant pool, e.g. priming effects, which none of these models account for. Our results also suggest the importance of simulating microbial dynamics of decomposition explicitly to better account for the drivers of decomposition, because these ultimately influence the denitrification flux estimations. It means that the direct application of these models with first order kinetics for decomposition to simulate the effects of fertilization or 590 changing N deposition on denitrification fluxes could be largely biased. Future research should aim to quantify more appropriate decomposition rates for models to better take into account labile pools.

Denitrifiers
In Coup, the biomass of denitrifiers directly limits the maximum denitrification rate. We assume that the slow increase of 595 fluxes obtained from Coup (Fig. 6b, 7b) was due to the modeled growth of denitrifiers, since the default setting assumed a low abundance of denitrifiers, hence the denitrifiers had to first grow before reaching maximum denitrification rates (denitrifier growth was observed in the model output although this data was not shown). It can be concluded that when modeling denitrification during incubation experiments, the model initialization must include inducement of denitrifier growth. 600 Another reason for the slow increase of the denitrifier biomass at sandy soil modelling could be that the modeled anaerobic soil volume fraction (ansvf) is orders of magnitude smaller than the measured ansvf (Rohe et al., 2020) and the small ansvf was not ideal for the growth of denitrifiers. This may have led to a non-realistic, too small denitrifier community, and therefore low N2O and N2 fluxes.

Anaerobic soil volume fraction
DNDC and Coup use a similar calculation of the ansvf and both models use it for the calculation of the denitrification processes. While the ansvf estimations of DNDC were not available as an output, the Coup results were obtained showing that ansvf was almost constant. This is not plausible since the parameters affecting ansvf, i.e. diffusivity and O2 consumption, must have been highly variable since soil moisture and respiration exhibited large differences between treatments and experimental phases. The underestimation of N2+N2O fluxes by Coup could therefore result from the inappropriate calculation of ansvf in the model (see in section 4.2.3).
One of the main goals of this study was to test the ability of the existing biogeochemical models to predict the temporal dynamics of N2 and N2O fluxes and identify where the models could be improved. Ensuring correct ansvf calculations could significantly improve the efficiency of denitrification sub-modules, and thus further work on these algorithms within Coup is 615 one area for future research that we would strongly recommend. Similarly, it would be beneficial to test the ansvf calculations of DNDC, which was not possible in our study, as the source code was not available and the ansvf is not included in output data.

Determination of control factors in the experiments 620
Within the sand incubation, another reason for the underestimations of denitrification products by Coup and DNDC could be properties of the soil itself. The soil had a low pH, which has a direct influence on denitrification processes (Leffelaar and Wessel, 1988). However, while the denitrification sub-module of DeNi is sensitive to changes in soil temperature, moisture, NO3and SOC content, the pH of the soil only influences nitrification processes. Therefore, the low pH may have had less effect on the N2O flux estimation of DeNi, as compared to Coup and DNDC. Another reason for the smaller denitrification 625 fluxes of Coup and DNDC could be the soil texture. Texture influences the hydrology, the anaerobic soil volume fraction and the diffusion of the gasses, which altogether control denitrification processes. According to the water retention curve, the range of water contents in the incubation were located in a section of the curve where small changes in water potential could lead to large changes in WFPS (Fig. S.4). In Coup and DNDC, WFPS has multiple effects on denitrification through respiration and diffusion processes. The challenge for these models is to describe these direct and indirect effects correctly to 630 match the observed response of denitrification. Because DeNi does not use a fully process-based approach, the effects of environmental factorslike WFPSare considered with various empirical functions. We suspect that the use of empirical functions (functions derived from experimental lab data to describe WFPS) was more successful in modeling WFPS effects on denitrification than the fully process-based approaches.

Suggestions for future model evaluation experiments and model improvement
This study has demonstrated advantages and shortcomings of modelling denitrification processes using current models. We suggest the following to improve model algorithms and parameters by targeted experimental studies: (1) design experiments to specifically evaluate sensitive input variables (e.g. decomposition of labile organic carbon), which can then be used to improve current model algorithms; (2) take more frequent measurements in future studies (ideally daily) to allow better 640 descriptions and evaluations of temporal dynamics (3) use updated techniques to take measurements. To the best of our knowledge, all previous model evaluation studies (NGAS and DailyDayCent) using measured denitrification data were https://doi.org/10.5194/bg-2021-77 Preprint. Discussion started: 13 April 2021 c Author(s) 2021. CC BY 4.0 License. based on the outdated acetylene inhibition technique (Bollmann and Conrad, 1997;Nadeem et al., 2013;Sgouridis et al., 2016). Future studies should (as was done in this study) be based on He/O2 or 15 N gas flux methods; (4) take measurements to evaluate unknown/hypothetical parameters in the model equations (e.g. static growth and death rate of the denitrifiers, rate 645 coefficients for the different denitrification processes, etc.) (5) adapt models so that the parameters better represent measurements from real soils (e.g. measured SOC fractions v.s. SOC pools in the models); (6) continue to re-evaluate how processes are describe in models. These models were developed decades ago, and new technical solutions appear constantly.
There are several missing or poorly described processes in the models. Strong simplification of some process descriptions (e.g. no or inadequate or poorly calibrated microbial dynamics (see section 4.2.2)) have to be overcome or their implications 650 have to be estimated. Further experiments are thus necessary to describe more precisely the effect of temperature, moisture and substrate manipulations on the microbial/denitrifier community and therefore on N2 and trace gas fluxes. These kinds of datasets will help to (i) identify inadequate process descriptions, (ii) calibrate the sub-modules separately from the other parts of the model and finally, (iii) develop new, better approaches for the description of the processes.

Conclusions
The goal of this study was to assess the ability of the denitrification sub-modules in three biogeochemical models to predict the N2 and N2O fluxes of incubated soils in response to different initial soil conditions and changing environmental factors.
The results show that the models did not calculate fluxes of the same magnitude as the experimental results; measurements were overestimated by DeNi and underestimated by DNDC and Coup. However, with only a few exceptions, the temporal 660 patterns of the measured and modeled emissions were quite similar for the sandy soil. For the silt-loam soil, Coup and DNDC showed no response in 47% and 14% of cases, respectively, and responded in the same direction in 19% and 52% of cases, respectively. For DeNi, the model responded in the same direction in 67% of cases, with 33% in the opposite direction. Treatment responses of the models suggest that in addition to calibration, improvement of the model functions is needed to better predict N2O and N2 fluxes from denitrification. While none of the models was able to determine litter-665 induced decomposition dynamics correctly, the complex models Coup and DNDC were apparently further hampered by their limited ability to give realistic estimates of soil moisture, anaerobic soil volume and denitrifier biomass. The simple structure of the DeNi model, using more empirical functions, thus can be more accurate for some experiments. This suggests that the potential advantage of Coup and DNDC to include more control factors is only useful when the control factors have been more thoroughly researched and respective functions are more reliable. Developing reliable functions for complex control 670 factors requires experimental data with more detail in temporal resolution and parameter determination. Further development of the models to overcome the identified limitations based on experiments with enhanced denitrification activity and control parameter determination can largely improve the predicting power of the models.