Local scale evaluation of the simulated interactions between energy, water and vegetation in land surface models

water and vegetation in land surface models. Jan De Pue1, José Miguel Barrios1, Liyang Liu2, Philippe Ciais2, Alirio Arboleda1, Rafiq Hamdi1, Manuela Balzarolo3, Fabienne Maignan2, and Françoise Gellens-Meulenberghs1 1Department of Meteorological and Climatological Research, Royal Meteorological Institute, Belgium 2Laboratoire des Sciences du Climat et de l’Environnement, LSCE/IPSL, CEA-CNRS-UVSQ, Université Paris-Saclay, Gif-sur-Yvette, France 3Department of Biology, University of Antwerp, Belgium Correspondence: Jan De Pue (jan.depue@meteo.be)

To reduce the uncertainties associated with soil moisture and LAI, data assimilation methods have been developed (MacBean et al., 2015;Albergel et al., 2018). Alternatively, diagnostic LSM have been found to be capable of producing robust estimates of surface fluxes. The surface fluxes in diagnostic LSM are derived from observed state variables, such as remote sensed soil  (Pastorello et al., 2020) and the ICOS '2018 drought initiative' dataset (Drought 2018 Team and ICOS Ecosystem Thematic Centre, 2019), TRENDY (Sitch et al., 2015, https://sites.exeter.ac.uk/trendy), CGLS: Copernicus Global Land Service (Camacho et al., 2013), ECMWF: soil texture used in the ECMWF Integrated Forecast System (https://apps.ecmwf.int/codes/grib/param-db?id=43), HWSD: harmonized world soil database (Nachtergaele et al., 2010), FAO/USDA: USDA texture map based on FAO digital Soil Map of the World (Reynolds et al., 2000) in coherent surface fluxes.
Contrary to ISBA and ORCHIDEE, the calculations for LE and GPP in diagnostic model do not share parameters like stomatal resistance. Instead, the GPP calculations are coupled to LE by using the actual evapotranspiration as an input variable. is the component dedicated to modeling the exchange of water, energy and carbon fluxes between the soil-vegetation-snow continuum and the atmosphere (Masson et al., 2013;Le Moigne et al., 2018). In this case, a configuration of ISBA with interactive carbon cycling is used, i.e. ISBA-CC (Gibelin et al., 2008;Delire et al., 2020). The fluxes of water and carbon from the vegetation are coupled through the stomatal resistance. This shared parameter is calculated through the A-gs surface 125 scheme, and largely depends on soil moisture stress and air temperature (Calvet et al., 2004). The parametrization for this scheme is based on plant traits derived from the TRY-database (Kattge et al., 2011;Delire et al., 2020).

ISBA
The assimilation of carbon results in the evolution of LAI through a biomass allocation scheme. The growth and senescence 5 https://doi.org/10.5194/bg-2021-355 Preprint. Discussion started: 20 January 2022 c Author(s) 2022. CC BY 4.0 License. of leaves is purely photosynthesis-driven. The biomass reservoirs are coupled to a soil organic matter module to calculate the respiration terms. 130 The simulations with ISBA were performed on the Surfex v8.1 platform 2 . The soil profile was discretized in 14 layers (up to 12 m depth), using a diffusion scheme for soil heat and water transfer and an exponential decrease of hydraulic conductivity through the profile. The nitrogen dilution scheme (Calvet and Soussana, 2001) and canopy radiation transfer scheme (Carrer et al., 2013) were enabled. In the forest patches, the energy fluxes were calculated with the recently developed multi-energy balance scheme (MEB, Boone et al., 2017). Contrary to the standard soil-vegetation composite version of ISBA (which was 135 used for the non-forest patches), MEB explicitly solves the transfer of mass and energy between the soil surface, the snowpack, the canopy and the atmosphere. At the time of this study, the combination of MEB and prognostic LAI modelling is still considered experimental (Le Moigne et al., 2018). A spin-up period of 3 years was sufficient to eliminate effects from the initial model state on the surface fluxes (respiration is not analysed in this study). ISBA was not coupled to a hydrological model (e.g. CTRIP Decharme et al., 2019). Consequently, there was no lateral ground water flow or a water table, only free 140 drainage at the bottom of the soil profile.

ORCHIDEE
ORCHIDEE is the land surface model of the Institut Pierre Simon Laplace (IPSL) earth system model, and was initially described in Krinner et al. (2005). We used the version prepared for the 6th Coupled Model Inter-comparison Project (CMIP6) (Boucher et al., 2020;Cheruy et al., 2020). 145 The LAI is prognostic, and the phenology models used for the various plant functional types (PFT) are described in Botta et al. (2000) and MacBean et al. (2015). The canopy is discretized in layers of increasing thickness from the top to the bottom of the canopy. The incoming light is attenuated through the canopy following a Beer-Lambert extinction law. The photosynthesis is modelled at the leaf level following Farquhar et al. (1980) for C3 species and Collatz et al. (1992) for C4 species. The maximum carboxylation rate at 25°C is a PFT-dependent parameter. The maximum carboxylation rate varies with the temperature 150 following Medlyn et al. (2002) and Kattge and Knorr (2007). A water stress function depending on soil moisture and root profile (Rosnay and Polcher, 1998) is applied to the maximum carboxylation rate, the stomatal and the mesophyll conductances. An analytical solution to the three equations linking CO 2 assimilation, stomatal conductance and CO 2 leaf intracellular concentration is computed following Yin and Struik (2009). The assimilation is then upscaled over the layers to calculate the GPP.

155
A single-layer energy balance is computed per grid cell. LE is the weighted average of the snow sublimation, the soil evaporation, the canopy transpiration and the evaporation of foliage water; all these terms were initially computed following Ducoudré et al. (1993). The soil is now discretized over 2 meters into 11 layers of increasing thickness, and the hydrology scheme follows Richard's equation (De Rosnay et al., 2002;d'Orgeval et al., 2008). There is free drainage at the bottom. The soil thermodynamics are described in Wang et al. (2016), and the snow scheme is detailed in Wang et al. (2013). Not all sites are equipped with soil moisture sensors, nor is there a standardized setup or post-processing for soil moisture in the FLUXNET or ICOS framework. Consequently, the validation of the simulated soil moisture and the sensitivity analysis were only performed for the sites with sensors. Furthermore, some sites were equipped with multiple sensors in the soil profile.
Here, only the median score of the sensors was used in the statistics (i.e. one score per site). For the validation, all sensors up 185 to 2 m depth were used, whereas only the sensors up to 0.5 m depth (i.e. the shallow root zone) were used in the sensitivity analysis (though the impact on the results was minimal).

Validation
The simulated H and LE were validated with the observed daily mean fluxes from flux towers. The non-closure of the energy 190 balance is a well-known issue in the eddy covariance observations (Cui and Chui, 2019). The turbulent fluxes in the FLUXNET and ICOS dataset were corrected for this, under the assumption that the measured Bowen ratio is correct (Pastorello et al., 2020). Due to missing observations of the ground heat flux, this correction was not possible for all sites. The validation of H and LE was only performed for the sites where all fluxes were available. The mean correction of LE of each site is listed in Table 2. 195 Similarly, the simulated GPP was validated with the FLUXNET/ICOS GPP data. The net ecosystem exchange (NEE) observed at the fluxtower was partitioned into its ecosystem respiration (RECO) and GPP components using the daytime fluxes and constant friction velocity (USTAR) threshold method (Pastorello et al., 2020). Only data with a quality flag indicating good quality (1) or better was used in this analysis.
An important key to the feedback mechanism between the surface fluxes is the LAI. The simulated LAI from ISBA-CC and

200
ORCHIDEE was validated using the remote sensed LAI from the European Copernicus Global Land Service 3 . The LAI data product used here is derived from SPOT-VGT and PROBA-V satellite data, it has a spatial resolution of 1 km and a temporal resolution of 10 days (Camacho et al., 2013). The sites were selected to be fairly homogeneous within the footprint area, and the observed LAI is assumed to be representative for the direct surroundings of the eddy covariance stations.
The simulated soil moisture profiles of ISBA and ORCHIDEE, and the ERA5 soil moisture (used in DiagMod), were validated 205 where possible. To reduce biases caused by different soil physical properties of the soil profiles or differences in scale between models and observations, the observed and simulated volumetric soil moisture (θ) was converted to the effective saturation (S e ) as follows: where θ min and θ max were assumed to be the 5th and 95th percentile of the observed soil moisture in a site for the observations, 210 or the residual and saturated water content for the simulations.
For H, LE, GPP, LAI and S e , the classical validation indices are calculated: mean error (ME), root mean square error (RMSE), Pearson correlation (r) and Nash-Sutcliffe model efficiency (NS). They were calculated as in Equations 2 -5, in which y * and y o are the predicted and observed values, y the mean of y and n o the number of observations: Taylor diagrams were constructed using the Pearson correlation (r) and standard deviation (σ) of the observed and simulated variables. The validation was performed using the daily totals/averages.

220
Furthermore, the same analysis was also performed on the anomalies to the mean annual cycles, to isolate the capability of the models to capture seasonal variability. The validation indices of the seasonal anomalies have the subscript ANOM , e.g. NS ANOM .
Significant differences between the models were evaluated with the Wilcoxon signed-rank test, and the significance of the PFT and HCB to classify the model performances was evaluated with the Kruskal-Wallis H-test. To assess the sensitivity of the fluxes to the state variables (S e and LAI), the slope of the seasonal anomalies of the fluxes against the anomalies of the state variables was determined. This analysis was performed for the observations and the simulations, and compared. Note that the linear slope was used here, though a linear response is not necessarily expected (e.g. the response to soil moisture anomalies depend on a wet/dry regime). The goal of this analysis was to investigate whether LSM 230 are capable of reproducing a similar relationship as found in the observations. Significant differences between the models were evaluated with the Wilcoxon signed-rank test.
To evaluate whether errors in the state variables result in errors in the surface fluxes (or vice versa), the spearman rank correlation between both was calculated. Since Copernicus LAI was the reference LAI, this analysis was not possible for LAI in DiagMod.

Phenology
The detection of the start, maximum and end of the seasonal cycle (SOS, MOS and EOS) was achieved by applying a smoothing operation (20 day rolling mean), followed by a threshold procedure (Maleki et al., 2020). In this threshold procedure, the minima and maxima were used to delineate the growing and senescent phase of the season. MOS was defined as the date when 240 the maximum of the season is reached, SOS and EOS were defined at the date where the growing or senescent phase crosses the threshold value T . T was calculated for each growing or senescent phase as T = P 5 + 0.2(P 95 − P 5 ), where P 5 and P 95 are the 5th and 95th percentile. This procedure was performed for the simulated and observed LE, GPP and LAI.

Model dynamics
To compare the model dynamics, the simulated LE flux partitioning, water balance and water use efficiency were evaluated as 245 well. Direct observations of the LE flux partitioning were not available, but it is possible to extract the Transpiration component from the total LE flux, using the underlying Water Use Efficiency (uWUE) method (Zhou et al., 2016;Nelson et al., 2020).

Validation surface fluxes: LE and GPP
The bias (ME) and accuracy (RMSE) of the simulated LE and GPP are shown in Fig. 2, together with Taylor diagrams of the 250 simulated fluxes and their seasonal anomalies. It was evident that the inter-site variability of the model performance is much larger than the inter-model variability. In terms of bias and accuracy, the differences between the models were relatively limited.
All models suffered a substantial underestimation of LE, whereas the overall bias in GPP was relatively small. Significant differences (Wilcoxon p<0.05) were found in the bias of GPP between DiagMod (overestimation) and ISBA (underestimation), and the simulated LE was significantly more accurate in ISBA, compared to ORCHIDEE.

255
Notably, no substantial bias was found in the simulated H of any model to compensate for the consistent bias in LE (results shown in supplementary material). In this study, the corrected fluxes from the FLUXNET/ICOS dataset were used as a reference. They were corrected to close the energy balance of the observations with the assumption that observed Bowen ratio was accurate. If the non-corrected fluxes were used instead, the bias in LE was reduced, but the simulated H was overestimated (not shown here). This points at the significant uncertainty associated with the observed fluxes from eddy-covariance measure-260 ments. The estimated observation uncertainty of the turbulent fluxes (associated with random measurement errors and energy balance correction) had the same order of magnitude as the model errors.
The Taylor diagrams in Fig. 2 show that the average variability of the simulated LE and GPP was in fair agreement with the observations. After removal of the mean seasonal cycle, the performance of the models decreased (r ANOM ,NS ANOM ), but the mean variability of the anomalies is reasonably accurate. In terms of r and r ANOM of LE and GPP, ORCHIDEE was signif-DiagMod. The impact of the land cover type of the test site on the model performance is illustrated in Fig. 3 Table 3. Nash-Sutcliffe model efficiency coefficient of LE, GPP, Se and LAI, and their seasonal anomalies. Median scores given for all sites, and grouped per dominant land cover type. The scores for the DiagMod Se are computed using the ERA5 Se.

Validation state variables: Soil moisture and LAI
The validation results of S e and LAI are shown in Fig. 4. The soil moisture used in DiagMod was taken from ERA5, and tended 285 to be overestimated compared to in situ observations, whereas an overall negative bias was found in ISBA and ORCHIDEE. No significant differences were found in r between all models. The simulated variability of soil moisture was too low in all models, in particular for ORCHIDEE. Notably, ERA5 outperformed ISBA (p> 0.05) and ORCHIDEE (p< 0.05) in terms of accuracy, despite their use of in situ meteorological forcings (e.g. precipitation). ORCHIDEE performed significantly worse than the other two models for all validation metrics (Wilcoxon p < 0.05). The highest correlation in the anomalies was simulated by 290 ISBA.
Compared to the surface fluxes, the accuracy of the simulated soil moisture was substantially lower. The validation scores of S e are given in Table 3, separated per dominant land cover type. In all models, the simulated S e was significantly better for herbaceous sites, compared to forest sites. The herbaceous sites are generally found in a water-driven dryland climate, with strong precipitation-driven anomalies.

295
Similarly, the prognostic LAI was also of poorer quality than the simulated surface fluxes. ISBA had a significantly better ME and RMSE than ORCHIDEE, but both models overestimated LAI and strongly underestimated its variability. In particular, the variability of LAI in the evergreen needleleaf forests was strongly underestimated in both models, as well as the variability of LAI in evergreen broadleaf forests in ORCHIDEE. Furthermore, both models obtained only a poor correlation, and achieved a very poor correlation of the seasonal anomalies.

300
In both models, the simulated LAI for forest sites was better than for the herbaceous sites, though not significantly (p>0.05).
The simulated anomalies were modelled significantly better (p<0.05) in forest sites than herbaceous sites (Table 3). The median performance is shown with the opaque markers.

Sensitivity of surface fluxes to soil moisture and LAI
The sensitivity of the fluxes to the soil moisture was strongly dependent on the land cover type, both in the observations and in the models (Fig. 5). A stronger response was found in the herbaceous sites, compared to the forest sites. ISBA and ORCHIDEE 305 have a too high sensitivity to soil moisture, whereas the response in the diagnostic model was closer to that in the observations.
Fig. 6, the same data is plotted, but classified per drought class. This illustrates the oversensitivity of ISBA and ORCHIDEE to drought stress. Despite their differences in implementation and parametrisation, a striking similarity in their sensitivity to drought was found, both for LE and GPP. The observations did not show an increase in sensitivity of GPP to S e in dryer sites.
In the forest sites, the response of GPP to S e anomalies is counter-intuitively negative. This might indicate that soil moisture 310 anomalies in forest sites were more dominated by wet anomalies, associated with rainfall events. These events coincide with a reduction in solar radiation, hence resulting in a negative GPP response. In herbaceous sites soils were generally drier, so the positive impact of the reduced drought stress after the rainfall event was more dominant, resulting in a positive response. This behaviour was mimicked well in the models.
The sensitivity of LE and GPP to LAI was generally higher in the herbaceous sites. Here, the models tended to underestimate 315 the sensitivity to LAI. In the forest sites, the sensitivity was lower according to the observations. The modelled sensitivity of LE to LAI was reasonably accurate, whereas the sensitivity of GPP to LAI was too strong. error correlation to LE and GPP, compared to S e . Grouped per dominant land cover type (Fig. 8), both models agree that the error correlation between LAI and GPP was higher in the herbaceous sites, compared to the forest sites. Notably, this was not the case for LAI-LE.
Furthermore, the errors in S e were strongest correlated to those in LE for all models. The highest error correlation was found in DiagMod, where this was most pronounced for the herbaceous sites. In these sites, the S e -GPP error correlation was also 325 the strongest for DiagMod, whereas no strong S e -GPP error correlation was found in the other models.

Phenology
The timing of the start, maximum and end of the seasonal cycle was validated for LE, GPP and LAI. Fig. 9 shows the boxplots of the mean error in all sites. In all models, the bias and accuracy of the seasonality of LE and GPP was comparable, whereas 330 the leaf phenology (i.e. LAI) was poorer. The simulated phenology of LAI was delayed substantially, in particular in ISBA.
This bias was most pronounced by the MOS, and to a lesser extent in EOS.
ISBA performed significantly worse than ORCHIDEE (Wilcoxon p < 0.05) for ME of MOS GPP and MOS LAI, and RMSE of MOS LAI. The prognostic LAI in both models tended to peak towards the end of the growing season, whereas the maximum LAI was reached in the beginning of the season according to the observations. This is illustrated in Fig. 10 which turned out to be challenging to capture for ISBA and ORCHIDEE. An example is shown for the Savanna sites in Fig.   11. DiagMod relied on the remote sensed LAI and was significantly more accurate than the prognostic models to capture EOS of LAI in herbaceous sites.
As the models were configured to run without dedicated management practices for the crop sites, EOS was estimated too late due to the harvest practice (Fig. 11). Even in DiagMod, EOS of GPP was delayed.

Water balance, WUE & LE partitioning
The water balance partitioning in ISBA and ORCHIDEE is shown in Fig. 12. In both models, the evapotranspiration fraction across PFT was similar, but substantial differences were found in the drainage and runoff in both models. Whereas nearly no water was lost through runoff in the ISBA simulations, a substantial amount of runoff was simulated with ORCHIDEE. On the 350 other hand, the drainage in ISBA was consistently larger than in ORCHIDEE. DiagMod does not compute a water balance, so could not be included in this comparison.   When the observed average water use efficiency is plotted versus the average LE flux, a pattern emerges in which the sites are grouped per PFT (Fig. 14). A similar pattern was found in the ISBA simulations, but not in the ORCHIDEE simulations.
The range in WUE across the test sites was much smaller in ORCHIDEE than in the observations. The difference in water use efficiency can be attributed to differences in the modelled plant physiology, or the amount 360 of drought stress experienced by the vegetation. As mentioned above, the rootzone soil moisture dropped significantly more frequent below field capacity in ISBA, compared to ORCHIDEE.

Model performance
The validation metrics of the three models were generally in agreement with previously performed local scale evaluations.

365
Similar simulations with the diagnostic model were done in the validations reports of both the LSA SAF evapotranspiration and surface fluxes products (Ghilain et al., 2018) in one hand and the LSA SAF GPP product (Martínez et al., 2020) in the other hand. The accuracy and Pearson correlation obtained here were better than the ones previously reported. This can be attributed to the use of local forcings in this study, which are not used in the LSA SAF products. The weaker performance of the algorithm for the sensible heat flux was also identified by Ghilain et al. (2018).

370
The GPP product is a recent addition to the ensemble of LSA SAF MSG products. It was demonstrated to outperform similar products which also rely on the Montheith light-use efficiency method (Martínez et al., 2020). Here, it was found perform consistently well for forest and herbaceous sites, and achieve a comparable model performance as ISBA.
In previous intercomparison studies at local scale (Balzarolo et al., 2014) or global scale (Friedlingstein et al., 2021), GPP was simulated more accurately with ORCHIDEE than with ISBA, but this was not confirmed here. Since these studies, substan-375 tial improvements have been made to ISBA: introduction of the MEB scheme, parametrization update, diffuse multilayer soil scheme, etc. Delire et al., 2020). The introduction of the MEB scheme for forests on the energy fluxes was evaluated in-depth by Napoly et al. (2017) at local scale (though prognostic LAI was not included in that study). Substantial improvements to G and H were reported, thanks to the addition of an insulating litter layer. The introduction of the MEB scheme improved the mechanistic representation of the canopy, and issues due to a shared roughness length of the vegetation 380 and bare soil in the composite scheme were circumvented. Our findings agree with that outcome, but the bias we found for LE, is not in agreement with previous findings.

Observation uncertainties
With the emergence of freely available data from eddy covariance network, the use of local datasets is an increasingly standard-385 ized approach to evaluate the performance of land surface models (Balzarolo et al., 2014;Napoly et al., 2017;Williams et al., 2020;Chen et al., 2018;Joetzjer et al., 2015). However, the eddy covariance observations notoriously suffer from substantial biases and non-closure of the energy balance (Foken, 2008;Mauder et al., 2020). The non-closure of the energy balance is attributed to 1) large advective fluxes caused by surface, 2) systematic measurement errors due to mismatch in observation footprint or inadequate sample rate, 3) thermal processes, such as heat storage or vegetation metabolism (Mauder et al., 2020;390 Chu et al., 2021;Liu et al., 2021). The test sites in this study were selected to have a relatively homogeneous landcover. Regardless, the resulting uncertainty in the observations was of the same order of magnitude as the model errors. The turbulent fluxes are typically underestimated, as is the GPP (Massman and Lee, 2002;Gao et al., 2019). Note that GPP is not corrected for this possible bias in the ONEFLUX processing pipeline (Pastorello et al., 2020). Furthermore, some studies have indicated that the eddy covariance observations are closer to lysimeter data if the energy balance is closed by correcting LE only (Gebler  et al., 2015). Considering this, the negative bias of the simulated LE (and GPP) in this study could be even underestimated.

Forest vs Herbaceous
Generally, the differences between the accuracy of the simulated surface fluxes was most distinct in the sites dominated by herbaceous vegetation. These sites have the most pronounced inter-annual variability, and seasonal anomalies are strongly 400 driven by precipitation events (Weber et al., 2009). This can be largely attributed to their natural occurrence in dryer climates and shallower root system, compared to forests. The seasonal cycle of LAI in the herbaceous sites, and its variability, was simulated poorly with the prognostic models. The error correlation analysis indicated that these errors were strongly related to errors in the surface fluxes.
Contrary to the prognostic models, the diagnostic model performed consistently well for all types of land cover. Only the 405 seasonal variability of GPP in forest sites was simulated less accurately than with the prognostic models. Whereas the remotesensed observations adequately captured this variability for the herbaceous sites, they seemed to fall short for the forest sites.

Interactions
LAI and soil moisture are two key variables in the interaction between water, energy and vegetation. Though our understanding of the involved processes at leaf-level scale is advanced, it remains challenging to scale these relations to the canopy level. This 410 was illustrated by erroneous sensitivity of the models to LAI and soil moisture. As in previous studies, the sensitivity of LE and GPP to soil moisture was generally overestimated Huang et al., 2016) in ISBA and ORCHIDEE, whereas the diagnostic model represented the observed sensitivity relatively well.
The interplay between LE and LAI was analysed in detail by Forzieri et al. (2018) and (Forzieri et al., 2020). The estimated global sensitivity of LE to LAI (3.66 ± 0.45 W m −2 / m 2 m −2 , according to Forzieri et al., 2020) is lower than the one reported 415 here, but the applied methodology was not the same. Contrary to our study, anomalies due to climatic drivers (i.e. precipitation, temperature etc.) were factored out, resulting in a different response. The oversensitivity of LE to LAI in ORCHIDEE was also not confirmed in our study. Still, in accordance to these studies, a stronger response between LE and LAI was found for herbaceous/soil moisture supply-driven sites, compared to forest/demand-driven sites.
Despite the differences in their architecture and parametrization, ISBA and ORCHIDEE demonstrated similar behaviour in the 420 interaction between water, energy and vegetation. Comparable sensitivities and error correlations were found in both models, indicating that they share common weaknesses in their implementation.

LAI
The errors in the surface fluxes were strongly correlated to errors in LAI for both prognostic models, even though their sen-425 sitivity to LAI reflects the observed sensitivity reasonably well (compared to the sensitivity to soil moisture). This seemed to indicate that the source of the errors in the fluxes lies in the feedback mechanism between GPP and LAI (i.e. biomass allocation and phenology), rather than in the forward link between GPP and LAI (i.e. photosynthesis and leaf to canopy upscaling).
The prognostic simulation of LAI in ISBA was introduced by Gibelin et al. (2006) and uses a fairly simple scheme. The latest update was the revision of plant trait parameters according to the TRY database (Kattge et al., 2011;Delire et al., 2020). It 430 has frequently been reported that the seasonal cycle of the simulated LAI in ISBA is delayed by a month or more (Lafont et al., 2012;Gibelin et al., 2008;Joetzjer et al., 2015). Delire et al. (2020) attributes this to the leaf longlivety parameter, and (Szczypta et al., 2014) mentions the vegetation undergrowth dynamics as a possible cause for the mismatch between remote sensed LAI and the prognostic LAI in LSM. However, the issue seems to be related to the architecture of the biomass allocation scheme as well. The assimilated carbon is attributed to the leaf biomass pool first, from where it trickles down to the other 435 pools. The consequence is that the simulated LAI in ISBA starts slow and builds up LAI until late in the second half of the season, whereas the observed seasonal LAI cycles reach a maximum in the first half of the growing season.
Improvements to prognostic LAI are required to increase the skill of the LSM to simulate surface fluxes. It is also demonstrated by the diagnostic model that a fairly simple model is capable of simulating the surface fluxes accurately, given accurate observations of LAI.

440
In that context, processes from ORCHIDEE and other LSM could be adopted to improve the fairly simple biomass allocations scheme in ISBA. The importance of non-structural carbohydrates to capture the leaf phenology in LSM is well-known, though rarely implemented (Asaadi et al., 2018). Fatichi et al. (2019) indicates that a full-grown canopy of a deciduous broadleaf forest contains approximately 30% of the total yearly assimilated carbon, yet it is grown in 1 month (1/5 -1/6 of the growing season). This rough simplification illustrates that reserve dynamics are essential to simulate the seasonal cycle of the vegeta-445 tion accurately. Such dynamics are implemented in ORCHIDEE: once certain phenological criteria are fulfilled, the carbon in a reserve pool is allocated to leaf biomass to kickstart the phenological cycle. Still, despite the dedicated phenology module, non-structural carbohydrates reserve dynamics, and a more advanced leaf demography, simulating LAI remained challenging in ORCHIDEE. The timing of the phenological cycle was more accurate in ORCHIDEE, though the accuracy of the simulated LAI was significantly poorer than ISBA. This was the case in particular for herbaceous vegetation. This tendency towards 450 delayed phenology (and in particular a delayed leaf senescence) is found in most earth system models in CMIP5 and CMIP6 (Park and Jeong, 2021).
The discrepancy in complexity between the modelling of photosynthesis and that of the biomass allocation has been highlighted by several authors (Fatichi et al., 2016;Friend et al., 2019), though the main challenge lies in the parametrization of those processes. The allocation of carbon in terrestrial vegetation is an important knowledge gap, hindering the advancement 455 of earth system models.
Finally, there are several important differences between the remote sensed vegetation and the idealized vegetation in the models which need to be recognized when comparing both. Firstly, the role of the understory has a well-known impact on the remote sensed LAI (Camacho et al., 2013), whereas the LSM do not consider the separate evolution of an understory. This can result in substantial differences in the seasonal cycle of LAI. This was illustrated by the differences in the simulations and observations 460 of the LAI cycle at ENF sites. Continuous in situ LAI observations with hemispherical photography in ENF sites are rare, but Rautiainen et al. (2012) reported that the effective canopy LAI (including non-green foliage) in FI-Hyy (boreal ENF site) remained constant from June till mid-September. This in agreement with the flat LAI cycle for ENF in ORCHIDEE, but is in contrast with the remote sensed LAI and the prognostic LAI in ISBA. In an empirical model based on in situ observations for the FR-LBr site, LAI demonstrated a seasonal cycle. The understory was responsible for most of the seasonal variation, and 465 30% of the LAI was attributed to the understory during the summer (Rivalland, 2003). The seasonal cycle in the remote sensed LAI seems exaggerated (ranging between 1 m 2 m −2 in winter and 4 m 2 m −2 in summer).However, considering the understory and seasonal variation in needleleaf greenness (Seyednasrollah et al., 2021), assuming a flat LAI does not seem accurate nether, in the context of simulating GPP.
Which brings up a second issue: the remote sensed LAI is the 'green' LAI, i.e. photosynthetically active leaves (Camacho 470 et al., 2013). Whereas LAI in LSM is a key variable which wears many hats. A single LAI variable is used to represent the role of leaves in several processes (photosynthesis, interception, canopy radiation transfer, surface roughness, etc.), in which the greenness of the canopy is not always important. These discrepancies contribute to the mismatch between LAI in the observations and the models. Addressing them might further advance the representation of vegetation in LSM.

475
A significant difference between ISBA and ORCHIDEE is found in the simulated soil moisture dynamics, the water partitioning and the water use efficiency. The simulated WUE in ISBA was in fair agreement with what is deduced from the eddy covariance observations. In contrast, the WUE in ORCHIDEE had a much narrower range. The comparison of the LE partitioning learns also that a larger fraction of the water was transpired in ORCHIDEE, compared to ISBA. The differences in WUE and flux partitioning could be attributed to differences in the simulated plant physiology, or to the quality of the simulated soil 480 water content. The variability of the simulated water content in ORCHIDEE was strongly underestimated, and the vegetation experienced significantly less drought stress in ORCHIDEE. It is likely that this translated to a low variability in WUE as well.
Furthermore, a substantial part of the precipitation was lost as surface runoff, compared to ISBA. Though we did not have validation data to evaluate the water partitioning, it seems that the simulations of ORCHIDEE could be improved significantly by addressing the soil moisture dynamics.

485
Overall, the accurate simulation of soil moisture and water infiltration is a challenge (perhaps one of the main challenges) in land surface models (Vereecken et al., 2019). The poor quality of the simulated soil moisture is also evident in this study, despite the use of the multi-layer diffuse water transport scheme. In this study, the superior simulation of soil moisture in ISBA contributes to the good performance in simulating the surface fluxes, in particular for sites with herbaceous vegetation and water-driven climate. The soil physical parameters are determined using a global pedotransfer model, using only texture as 490 input. New, advanced pedotransfer functions have emerged in recent years, using not only texture, but also climatology and land use as predictors (Gupta et al., 2021). As soil moisture is at the basis of many processes in LSM, incorporating these PTF seems the logical new step forward in LSM (Fatichi et al., 2020).
The local scale simulations in this study were not coupled to a hydrological model, thus ground water dynamics were lacking.
Though only a limited effect of capillary rise was found in studies with a coupled groundwater hydrology, the impact can be non-negligible for forest ecosystems with a deep root system (Decharme et al., 2019;MacBean et al., 2020). The further development of ground water dynamics in LSM is indispensable for the accurate coupling of energy, water and carbon in forest vegetation and its response to severe drought events.
Several efforts have already explored the potential of improving soil moisture dynamics in LSM. Substantial improvements to soil moisture have indeed been obtained by calibrating the pedotransfer functions, or soil physical parameters. Yet, the impact 500 thereof on the surface fluxes has been found to be relatively limited (Pinnington et al., 2021), or in some cases even negative (Raoult et al., 2021). Though many parameters in ISBA and ORCHIDEE are derived from databases (Delire et al., 2020), the LSM have been calibrated to produce accurate surface fluxes using (amongst others) eddy covariance observations. The limited accuracy of the soil moisture dynamics might have been overcompensated in the resulting parametrization (Raoult et al., 2021).
The oversensitivity to drought stress in ISBA and ORCHIDEE is possibly an illustration of this. Improvements to the intricate 505 network of gears under the hood of LSM are a delicate matter. Addressing the soil moisture dynamics should go hand in hand with corrections to the oversensitivity to drought stress.

Conclusions
Three land surface models were compared at local scale, using identical meteorological forcing and prescribed land cover. The goal was to evaluate their skill to simulate surface fluxes (LE and GPP), as well as their simulated interaction between water, 510 energy and vegetation. It was found that the diagnostic model (based on LSA SAF algorithms) performed consistently well for all land covers. The prognostic models (ISBA and ORCHIDEE) performed similarly well for the forest sites, but the simulations for herbaceous sites revealed some important shortcomings. The sensitivity analysis demonstrated that both models overestimate the sensitivity to drought stress, which was occurring most frequently in herbaceous sites. On the other hand, the error analysis showed that errors in the prognostic LAI (and not soil moisture) were the dominant source of errors for LE 515 and GPP in ISBA and ORCHIDEE. Given the acceptable sensitivity to LAI, the source of these errors is likely found in the feedback mechanism between GPP and LAI. Compared to observations, the simulated phenological cycle in both models was delayed and failed to capture the observed seasonal variability. Improvements in the leaf phenology and biomass allocation scheme are required to improve the simulated surface fluxes.
The analysis here demonstrated key strengths and weaknesses of each LSM. Most notably, we showed that ISBA and OR-520 CHIDEE shared key deficiencies concerning the coupling of the water, energy and vegetation, despite their differences in architecture and parametrization. Improving the feedback between GPP and LAI, the soil moisture dynamics, and the oversensitivity to drought might advance the performance of these LSM significantly.
Code and data availability. The scripts and datasets used in this study are freely available upon request to the authors.

Test sites
The sites were classified according to plant functional type (PFT) and Hydro-climatic biome (HCB;Papagiannopoulou et al., 2018). The distribution of the sites according to this classification is shown in Fig. A1.