Plant phenology evaluation of CRESCENDO land surface models. Part I: start and end of growing season

Plant phenology plays a fundamental role in land-atmosphere interactions, and its variability and variations are an indicator of climate and environmental changes. For this reason, current land surface models include phenology parameterizations and related biophysical and biogeochemical processes. In this work, the climatology of beginning and end of the growing season, simulated by seven state-of-the-art European land surface models, is evaluated globally against satellite observations. The assessment is performed using the vegetation metric leaf area index and a recently-developed approach, named four grow5 ing season types. On average, the land surface models show a 0.6-month delay in the growing season start, while they are about 0.5 months earlier in the growing season end. Difference with observation tends to be higher in the Southern Hemisphere compared to the Northern Hemisphere. High agreement between land surface models and observations is exhibited in areas dominated by broad-leaf deciduous trees, while high variability is noted in regions dominated by broad-leaf deciduous shrubs. Generally, the timing of the growing season end is accurately simulated in about 25% of global land grid points versus 10 16% in the timing of growing season start. The refinement of phenology parameterization can lead to better representation of vegetation-related energy, water, and carbon cycles in land surface models, but plant phenology is also affected by plant physiology and soil hydrology processes. Consequently, phenology representation and, in general, vegetation modelling is a complex task, which still needs further improvement, evaluation, and multi-model comparison. 1 https://doi.org/10.5194/bg-2020-319 Preprint. Discussion started: 4 September 2020 c © Author(s) 2020. CC BY 4.0 License.


Introduction
Plant phenology and its variability have a substantial influence on the terrestrial ecosystem (e.g. Churkina et al., 2005;Kucharik et al., 2006;Berdanier and Klein, 2011) and land-atmosphere interactions (e.g. Cleland et al., 2007;Richardson et al., 2013;Keenan et al., 2014) making phenology variability one of the indicators of climate change (e.g. Schwartz et al., 2006;Soudani et al., 2008;Jeong et al., 2011) since modifications in both spring and autumn phenology are reported in recent decades under global warming (e.g. Menzel et al., 2006;Richardson et al., 2013;Zhu et al., 2016;Chen et al., 2020;Zhang et al., 2020). 20 Given the influence of plant phenology on vegetation productivity, and since green leaves are the primary interface for the exchange of energy, mass (e.g., water, nutrient, and CO 2 ), and momentum between the terrestrial surface and the planetary boundary layer (Richardson et al., 2012), land surface models (LSMs) need to accurately simulate plant growing season cycles.
Limitations may result in biases and uncertainties in representing vegetation productivity and carbon cycle (e.g. Churkina et al., 2005;Kucharik et al., 2006;Berdanier and Klein, 2011;Richardson et al., 2012;Friedlingstein et al., 2014;Savoy and Mackay, 25 2015; Buermann et al., 2018). For example, Kucharik et al. (2006) show an overestimated April-May net ecosystem production triggered by biases in plant budburst. Berdanier and Klein (2011) describe a link between above ground net primary production, growing season length, and soil moisture in high-elevation meadows. They show that the potential impact of changes in active growing season length on biomass production accounts for about 3-4 g m −2 d −1 . The work by Richardson et al. (2012) is an example of a systematic evaluation of LSMs' phenology representation. They evaluate fourteen models participating in the 30 North American Carbon Program Site Synthesis against ten forested sites, within the AmeriFlux and Fluxnet-Canada networks.
Their assessment reveals a typical bias of about two weeks in LSMs representation of the beginning and end of the growing season. They also show a low skill in LSMs' reproduction of the observed inter-annual phenology variability. These biases lead to an overestimation of about 235 gC m −2 yr −1 in the gross ecosystem photosynthesis of deciduous forest sites. However, uncertainties in simulated maximum production partially balance this overestimation. The work by Buermann et al. (2018) is 35 another example of a multi-LSMs evaluation. They observe widespread lagged plant productivity responses across northern ecosystems associated with warmer and earlier springs, which is weakly captured by ten evaluated TRENDYv6 current LSMs.
Consequently, current LSMs still present biases in simulating timings and the magnitude of the vegetation active season.
The latest generation of LSMs have started including a more detailed description of land biophysical and biogeochemical processes, and they have become able to explicitly represent carbon and nitrogen land-cycles, as well as plant phenology and 40 related water and energy cycling on a global scale (e.g. Oleson et al., 2013;Lawrence et al., 2018). In particular, current LSMs link Leaf Area Index (LAI) and plant phenology to changes in temperature, precipitation, soil moisture and light availability (e.g. Oleson et al., 2013;Lawrence et al., 2018), as displayed in observations (e.g. Caldararu et al., 2012;Zeng et al., 2013;Tang and Dubayah, 2017).
In this framework, the European CRESCENDO project (https://www.crescendoproject.eu/) fostered the development of a 45 new generation of LSMs to be used as the land component of the Earth System Models (e.g. Smith et al., 2014;Olin et al., 2015;Cherchi et al., 2019;Mauritsen et al., 2019;Sellar et al., 2020;Seland et al., 2020;Yool et al., 2020;Boucher et al., 2020) employed in the Coupled Model Intercomparison Project Phase 6 (CMIP6, Eyring et al., 2016). In particular, seven novel In this paper, we present a brief description of the methods, LSMs and satellite data used (section 2). Next, we present the main results of the satellite data comparison and evaluation of LSMs against observations (section 3). Finally, we discuss the methodology, data, and results (section 4), followed by concluding remarks (section 5).
2 Method, models and data 2.1 Satellite observation 70 To perform a comprehensive global phenology evaluation, a satellite-based observational dataset is required. LAI satellite observations present uncertainties and limitations related to the assumptions and algorithms applied in the LAI calculation (Section 4.2 e.g. Fang et al., 2013;Jiang et al., 2017). For this reason, three satellite observational products are considered in this work, which differ in acquisition sensor (i.e. AVHRR for LAI3g, MODIS for MODIS LAI, and SPOT/PROBA VEGETA-TION for SENTINEL; Piao et al., 2020) and methodologies for data-acquisition: LAI3g , MODIS version 6 75 (https://lpdaac.usgs.gov/ ; Myneni, 2015a, b), and SENTINEL version 2 (https://www.copernicus.eu/ ; Drusch et al., 2012). The full time series of LAI3g data is generated by an artificial neural network (ANN) algorithm that is trained with the overlapping data of NDVI3g and Terra Moderate Resolution Imaging Spectroradiometer (MODIS) LAI products (see Zhu et al., 2013). It covers the 1982-2011 period with a 15-day temporal frequency and a 1/12 • spatial resolution. The MODIS LAI algorithm is based on a three-dimensional radiative transfer equation that links surface spectral bi-directional reflectance factors to vegeta-80 tion canopy structural parameters (see Yan et al., 2016a). It covers the 2000-2017 period with a 4-or 8-day temporal frequency and a 500 m spatial resolution. SENTINEL LAI dataset is obtained through a neural network applied on top-of-atmosphere input reflectances in red and near-infrared bands derived from PROBA-V. The instantaneous LAI estimates obtained in this way go through a temporal smoothing and small gap filling, which discriminate between evergreen broadleaf forest and noevergreen broadleaf forest pixels (see Verger et al., 2019). It covers the 1999-2019 period with a 10-day temporal frequency 85 and a 1km spatial resolution. Note that SENTINEL has a reduced latitudinal cover compared to MODIS and LAI3g since it covers up to 75 • N versus the 90 • N of the other two products.
The 2000-2011 period is common to the three satellite datasets and it is used in the present analysis. The satellite observations are aggregated into monthly values and regridded, by means of a first order conservative remapping scheme (Jones, 1999), to a regular 0.5 • x 0.5 • grid for consistency with the LSMs' output.

Land surface models
Seven European LSMs, which are part of the CRESCENDO project, are evaluated in this study. Further details on each of these 100 LSMs are provided in the following sections, and briefly summarized in Table 1.

110
The evergreen plant phenology is characterized by a background litterfall, that is a continuous leaf fall and fine roots turnover distributed along the year. A PFT-specific leaf longevity parameter drives this mechanism (Oleson et al., 2013).
The seasonal-deciduous plant phenology derives from the model Biome-BGC version 4.1.2 (Thornton et al., 2002) and parameterizations by White et al. (1997). The leaf onset starts when the soil temperature of accumulated growing-degree-day (Oleson et al., 2013). Summer and winter solstices also act as time-limiting factors.
Finally, the stress-deciduous plant phenology involves grass and trees that respond to both cold and drought stresses. This parameterization is based on the White et al. (1997) grass phenology model. The leaf onset is soil moisture-driven in areas characterized by year-round warm conditions, while both soil moisture and soil temperature drive the leaf onset in region characterized by a cold season. Sustained period of dry soil or cold temperature, or day-length shorter than six hours trigger 120 the end of the growing season. Further details can be found in Oleson et al. (2013). The irrigation area is based on crop type and region, and the irrigation triggers for crop phenology are newly updated from the CLM4.5. New phenology-related features in CLM5.0 include additional antecedent rain requirement trigger for stress deciduous phenology to reduce the occurrence of anomalous green-up during the dry season driven by upwards water movement from 130 wet to dry soil layers (Dahlin et al., 2015). In addition, several major changes have been made in the CLM5.0. One of the major physiological changes include maximum stomatal conductance, which now uses the Medlyn conductance model (Medlyn et al., 2011), rather than the previously used Ball-Berry stomatal conductance model. In CLM5.0, the Jackson et al. (1996) rooting profiles are used for both water and carbon, where the rooting depths were increased for broadleaf evergreen and broadleaf deciduous tropical tree PFTs. As soil moisture is an important control on plant phenology, the modified rooting profiles would 135 affect the stress deciduous trigger and other water-related processes. Other modifications include nutrient dynamics, hydrological and snow parameterizations, plant hydraulic functions, revised nitrogen cycling with flexible leaf stoichiometry, leaf N optimization for photosynthesis, and carbon costs for plant nitrogen uptake (Lawrence et al., 2019).

JULES-ES
JULES-ES is the Earth System configuration of the Joint-UK Land Environment Simulator (JULES). JULES-ES is the ter-140 restrial component of the new UK community ESM, UKESM1 (Sellar et al., 2020). It is based on the core physical land configuration of JULES (JULES-GL7) as described in Wiltshire et al. (2020). JULES-GL7 simulates the exchange of heat, water and momentum between the land and atmosphere using prescribed land-cover including LAI. JULES-ES simulates the exchange of carbon and the change in surface properties that control the physical interaction between the land and atmosphere including LAI and landcover. JULES-ES includes a full carbon and nitrogen cycle with dynamic vegetation, 13 Plant Func-145 tional Types with trait based physiology (Harper et al., 2016), and a representation of crop harvest and landuse change. In JULES-ES, the allometrically defined maximum LAI varies with the carbon status (Clark et al., 2011) and extent of the under-lying vegetation. In the case of natural grasses LAI can vary rapidly sub-seasonally whereas tree PFTs have a smaller variation.
Phenology operates on top of this variation for Deciduous Broadleaf and Needleleaf PFTs based on an accumulated thermal time model.

150
The simulations described here used a near-final configuration of JULES-ES prior to the final tuning performed as part of UKESM1 (Yool et al., 2020). JULES-ES is run offline forced by global historic meteorological data as described in section 2.3. to the common CRESCENDO protocol as described in section 2.3. Simulations were conducted without natural changes in the land cover, instead a static map of natural land cover based on Pongratz et al. (2008) was used. Anthropogenic land cover changes were applied using land-use transitions (see Reick et al., 2013) derived from the LUH2 forcing, whereby rangelands were treated as natural vegetation (see also Mauritsen et al., 2019). To avoid a cold start problem when applying land-use transitions, simulations were already started in 1700. JSBACH3.2 contains a multilayer hydrology model (Hagemann and Stacke,160 2015) and a representation of the terrestrial nitrogen cycle (Goll et al., 2017). (1) For the evergreen phenology, two subsequent phases are distinguished: a "growth phase" (spring), characterized by the absence of leaf shedding and positive growth, and a "rest phase" (all other seasons), with non-zero leaf shedding and no growth. (2) For the summergreen phenology a third phase is added in between those two phases: a "vegetative phase" (summer) with no growth as in the rest phase but with 170 a small shedding rate. The phases of both, the evergreen and the summergreen phenologies, are determined by temperature thresholds calculated by the alternating model of Murray et al. (1989) from heat sums, chill days and critical soil temperatures.

JSBACH
(3) The raingreen phenology does not have distinct phases, and the growth rate is non-zero whenever the soil moisture exceeds the wilting point and the net primary productivity (NPP) is positive. The shedding rate depends on the relative soil water content. (4) The grass phenology resembles the raingreen phenology but further requires the air temperature and soil moisture 175 to exceed a critical value for a non-zero growth rate. Because grass roots are less deep than tree roots, soil moisture is taken from the upper soil layer in the grass phenology, while for the raingreen phenology the water content of the whole soil column is considered. (5) In contrast to the other phenology types, the growth rate of the crop phenology is modeled as a function of NPP. JSBACH distinguishes tropical and extra-tropical crops in order to reflect different farming practices in dependence of the prevailing climatic conditions: tropical crops have no distinct phase switches, and require sufficient temperature, an 180 upper layer soil moisture above wilting point and a positive NPP. Extra-tropical crops have distinguished phases for summer and winter crops which strongly depend on temperature (heat sum) determining the growth and the shedding rate (emulating harvest events).
The vegetation in the conducted JSBACH3.2 simulations was represented by 12 PFTs, each of which is linked to one of the phenology types: one forest type with evergreen phenology, one forest and one shrub type with summergreen phenology, two 185 forest and one shrub type with raingreen phenology, C3 and C4 grasses as well as C3 and C4 pastures with grass phenology, and C3 and C4 crops with extra-tropical and tropical crop phenology, respectively.

LPJ-GUESS
The to the classes found in LUH2, namely both annual and perennial C3 and C4 crops, and C3 N fixers, and two herbaceous cover crops (C3 and C4) that are grown in-between the main agricultural growing seasons. The interactive carbon-nitrogen biogeochemical cycles in LPJ-GUESS induces nutrient limitation on natural vegetation and crop growth, and decomposition rate of soil organic matter that will influence soil biogeochemistry, CO2 fluxes and N trace gas emissions (Smith et al., 2014;Olin et al., 2015).

200
Plant phenology is described by means of three specific parameterization: (1) evergreen plant phenology; (2) seasonaldeciduous plant phenology; (3) stress-deciduous plant phenology (Smith et al., 2014). An explicit phenological cycle is simulated only for leaves and fine roots in seasonal-deciduous and stress-deciduous PFTs, whereas evergreen PFTs have a prescribed background litterfall for leaves, fine roots and sapwood. Seasonal-deciduous plant phenology is based on a PFT-dependent accumulated GDD sum threshold for leaf onset, with leaf area rising from 0 to the pre-determined annual maximum leaf area 205 linearly with an additional 200 (100 for herbaceous and needleleaved tree PFTs) degree days above a threshold of 5 • C. For seasonal-deciduous PFTs, growing season length is fixed, all leaves being shed after the equivalent of 210 days with full leaf cover. Stress-deciduous plant phenology PFTs shed their leaves whenever the water stress scalar ω drops below a threshold, ω min , signifying the onset of a drought period or dry season. New leaves are produced, after a prescribed minimum dormancy period, when ω again rises above ω min (Smith et al., 2014). Crop PFT sowing and harvest decisions are modelled based on 210 climate variability (Waha et al., 2011;Lindeskog et al., 2013) and climatic thresholds (Bondeau et al., 2007).

ORCHIDEE
The ORCHIDEE model used for the CRESCENDO simulations is the land component of the IPSL (Institut Pierre Simon Laplace) ESM and the version corresponds to the one used for the CMIP6 simulations (Boucher et al., 2020). The model calculates primarily the fluxes (and stocks) of water, energy, and carbon between the different soil and plant reservoirs and the fluxes with the atmosphere. Since its first description in Krinner et al. (2005), the model has substantially evolved; we describe below only the main features relevant for this study.
The surface heterogeneity is described with 15 different PFTs that can be mixed in each grid cell. The annual evolution of the PFT distribution is derived from the LUH2 database as described in more details in Lurton et al. (2019). All PFTs share the same equations but with different parameters, except for the leaf phenology. They are grouped into three soil tiles, 220 with independent hydrological budget, according to their physiological behavior: high vegetation (forests) with eight PFTs, low vegetation (grasses and crops) with 6 PFTs, and bare soil with one PFT. Only one energy budget and one snow budget is calculated for the whole grid cell.
For the carbon cycle, photosynthesis is parameterized based on Farquhar et al. (1980) and Collatz et al. (1992) for C3 and C4 plants, respectively, using the implementation proposed by Yin et al. (2009). Once the carbon is fixed by photosynthesis, 225 growth and maintenance respirations are calculated and the remaining carbon is then allocated into 8 plant compartments (below and above ground sapwood and heartwood; leaves; fruit; roots; reserves), following the so-called resource limitations approach (Friedlingstein et al., 1999). Each compartment has a specific turnover and the dead biomass enters to the soil via four litter pools, which then feeds three soil organic carbon pools, following the CENTURY model (Parton et al., 1987).
A Phenology module describes leaf onset and leaf senescence for deciduous PFTs. In temperate and boreal regions, leaf 230 onset is driven by an accumulation of warm temperature in spring, following the concept of Growing Degree Days (GDD).
In addition a minimum period of cold temperature, based on a Number of Chilling Days (NCD), is used to avoid buds dying with late frosts. Both criteria are combined, with PFT-specific GDD and NCD thresholds to be met, before leaves can start growing. For the dry tropics and semi-arid ecosystems, a moisture availability criteria is used based on water accumulated in soil. Both temperature and moisture criteria are combined for grasses and crops and the different parameters of the leaf 235 onset parameterisation have been calibrated with satellite data (Botta et al., 2000). Leaves are then further separated into four age classes with different photosynthetic efficiency. Leaf fall is controlled by different turnover processes. The first one is a simple aging process and a second senescence process based on climatic conditions (either based on air temperature or on soil moisture conditions) is applied.

240
ISBA-CTRIP is the land surface model of CNRM-ESM2-1 (http://www.umr-cnrm.fr/spip.php?article1092). It is used within the SURFEX version 8 modeling platform representing SURFace EXchanges between ocean, lakes, and land. It solves the energy, carbon and water budgets at the land surface and was recently thoroughly upgraded. soil. It includes a module for wild fires, land cover changes, and carbon leaching through the soil and transport of dissolved organic carbon to the ocean. Leaf photosynthesis is represented by the semi-empirical model proposed by Goudriaan et al. (1985). Canopy level assimilation is calculated using a 10-layer radiative transfer scheme including direct and diffuse radiation. Vegetation in ISBA-CTRIP is represented by 4 carbon pools for grasses and crops (leaves, stem, roots and a non-structural carbohydrate storage pool) with 2 additional pools for trees (aboveground wood and coarse roots). Leaf phenology results 250 directly from the daily carbon balance of the leaves. Leaf turnover time is dependent on potential leaf longevity reduced when 10-day assimilation rates start to decrease. Leaf area index is diagnosed from leaf biomass and specific leaf area index, which varies as a function of leaf nitrogen concentration and plant functional type. To allow for leaf growth after dormancy there is an imposed minimum leaf biomass. The model distinguishes 16 vegetation types (9 tree and 1 shrub types, 3 grass types and 3 crop types) alongside desert, rocks and permanent snow. Crops have the same phenology as grasses. A detailed description of 255 the terrestrial carbon cycle can be found in Delire et al. (2020).

Experimental setup
In this study, the S3 CRESCENDO simulations were used, characterized by transient CO 2 , climate, and land-use forcing.
Each model spin-up is obtained by recycling climate mean and variability from the period 1901-1920, with the pre-industrial (1860) atmospheric CO2 concentration until carbon pools and fluxes reach a steady state. The 1861-1900 period is simulated 260 using the same climate forcing as the spin-up, but with time-varying atmospheric CO 2 and land-use forcing. Finally, the LSMs are forced over the 1901-2014 period with changing CO2, climate, and land-use forcing. All LSMs are commonly driven by the atmospheric forcing reanalysis CRUNCEP version 7 (Viovy, 2018), and the land-use data is taken from the Land Use Harmonization version 2 (Hurtt et al., 2020). Note that the use of LUH2 land cover transitions differs across the models (see model description). CRUNCEPv7 provides for 2m air temperature, precipitation, wind, surface pressure, shortwave radiation, 265 long-wave radiation, and air humidity.
Each LSM is run on different spatial resolutions (Table 1), but the outputs of these simulations are provided on a regular 0.5 • x 0.5 • grid, over which simulations and observations are compared. The LAI monthly mean output from these simulations are used in the present analysis. with summer dormancy (SGS-D); (4) two growing seasons (TGS). The EVG type is identified when relative changes in LAI annual cycle are smaller than 25% of local LAI mean value. Note that GSS and GSE timings are not detected in EVG areas.

Growing season analysis
The other three types are identified based on LAI annual cycle shapes, LAI slopes and transition timings, and critical threshold, as illustrated and summarized in Supplementary Figure 1. When one single growing season is identified, SGS-S and SGS-D are discerned based on the peak-month (i.e. in the Northern Hemisphere (NH) SGS-S is detected when LAI peak occurs between when two growing seasons at least three-month-long are detected. Further details can be found in Peano et al. (2019). Note that in this analysis, the timings of the TGS GSS correspond to the GSS timings of the first growing season cycle, while the GSE are the second GSE timings, as described in Peano et al. (2019). The 4GST method is applied on monthly LAI data in this work, instead of 15-day time-scale used in Peano et al. (2019). Given the monthly time window used in this analysis, difference up 285 to one month can descend from the monthly time-frequency, which limits a more detailed bias assessment between LSMs and observations.

Satellite data comparison
We inspect the main differences between LAI3g, MODIS and SENTINEL by plotting the spatial distribution of the four 290 growing season types, GSS, and GSE ( Figure 1).
The three products show a high consistency in the distribution of growing season types (agreement of about 80%, Table   2), with the main differences occurring in tropical regions, such as in Amazon and Congo basins, and in semi-arid areas, such as central Australia (Figures 1a,d,g). Compared to MODIS, LAI3g differs mainly in EVG regions (Table 2) due to an underestimation of EVG areas in the Tropics (Supplementary Figure 2). These regions are characterized by high canopy 295 density, which saturates to high LAI in the satellite data (e.g. Myneni et al., 2002), resulting in limited seasonal variability.
In addition, the AVHRR sensor used to derive LAI3g is less responsive to changes in vegetation compared to MODIS and SPOT/PROBA data (Piao et al., 2020). Both LAI3g and SENTINEL differ from MODIS in areas featured by the TGS type (Table 2). The Horn of Africa is the only region where all three satellite products place a TGS type (Figure 1).
Larger differences among satellite products are found in the assessment of GSS and GSE (Figure 1), especially in the NH 300 where LAI3g and SENTINEL clearly anticipate GSS (Figures 1e,h) with respect to MODIS. The three satellite products present a consistency similar to the one reached by the growing season type distribution (about 75%) when a one-month tolerance is considered (Table 3), since time-resolution of the products has been homogeneizied to one month (see Section 2.4).
Keeping these differences in mind, the MODIS data are in the following sections as the main observation reference given its ability to capture seasonality of the different biomes (Yan et al., 2016b). Comparisons with the other satellite products are 305 presented in the supplementary material.

Growing season types distribution
The 4GST allows estimating the ability of each LSM in capturing the observed spatial distribution of the four growing season types ( Figure 2).
In general, all the LSMs capture the single growing season that peaks in summer (SGS-S type) reasonably well, especially 310 in the NH mid-and high-latitude regions. The majority of LSMs are also able to correctly represent the two growing seasons (TGS) in the Horn of Africa region (Figure 2). Most LSMs are unable to reproduce the observed growing-season-type distri-bution in the SH, except for the evergreen (EVG) tropical areas. A partial exception is LPJ-GUESS, which shows large SGS-S type areas in South America, Southern Africa and Northern Australia, in agreement with MODIS.
LSMs used in this study are primarily able to capture the observed EVG and SGS-S regions with agreement between 36.0% 315 and 95.4%, and between 44.3% and 79.5%, respectively (Table 2). In contrast, the TGS regions are seldom reproduced by LSMs, and the agreement rate with MODIS ranges between 0.4% and 19.1%, (Table 2). Overall, the CRESCENDO Multi-Model Ensemble (MME) mean reproduce the same MODIS growing season type distribution over about 69.5% of global land-surface area, with a 45.4% to 74.0% range among models (Figure 2 and Table 2). It is noteworthy that the evergreen type is correctly detected in the broad-leaf evergreen tropical areas in both satellite observations and LSMs (Figure 2). On the 320 contrary, the high-latitude needle-leaf evergreen regions are partially represented in LSMs, while satellite data do not catch these areas due to satellite limitations resulting from the impact of cloud and snow cover on light availability during the winter season (see Section 4.2).
This initial evaluation highlights that LSMs have difficulties in accurately representing SH phenology. The correct location of the less common types, i.e. single growing season with summer dormancy (SGS-D) and TGS, is as well hardly captured by the

Variability of growing season start and end
4GST is then applied to evaluate the ability of LSMs to represent the GSS and GSE timing in vegetated areas not classified as EVG-type (white regions in Figures 3 and 4 correspond to not-vegetated and EVG-type domains).

330
On average at the global scale, LSMs approximately exhibit a disagreement of 0.6 months and 0.5 months in GSS and GSE, respectively, with LSMs simulating a later GSS and an earlier GSE, practically shortening the growing season by one month (Table 4). This bias is not evenly distributed around the globe. LSMs reproduce the correct growing season length in about 17% of the global land grid-cell, but sometimes growing season is affected by a shift in seasonality, as in the case of JULES-ES (Table 3). 335 Generally, the GSE simulated by the LSMs show a better agreement with MODIS (about 25% agreement in vegetated grid-cell, ranging from 4.9% to 26.4%, Table 3) compared to GSS timings (15.8% agreement in vegetated land grid-cell, ranging from 2.7% to 19.1%, Table 3). Considering a one-month-tolerance to account for the downgraded time-resolution, the agreement between LSMs and MODIS increases to ∼45% and ∼31%, respectively (Table 3).
LSMs exhibit larger agreement with MODIS GSS and GSE timings in the NH compare to the SH (Figures 3, 4 and Tables 3,  (Table 3). In particular, LPJ-GUESS shows good skill (agreement with observation larger than 15%) in capturing both GSS and GSE timings in both hemispheres (Figures 2f,   3f, and 4f).
LPJ-GUESS is the model that shows the highest agreement with MODIS (Table 3) and the lowest bias in average GSS and GSE timings (0.4 and 0.1 months, respectively, Table 4). JULES-ES shows the lowest agreement with MODIS (Table 3) and 345 the highest bias in the average GSS and GSE timings (1.2 and -2.3, respectively, Table 4). This result may be associated with the representation of PFTs in the two models used to describe global vegetation. LPJ-GUESS, indeed, is the model featuring the largest number of PFTs, while JULES-ES uses the least (Table 1). Moreover, JULES-ES and LPJ-GUESS differ also on the details of the phenology parameterization. LPJ-GUESS features a more elaborate phenology description compared to JULES (Sections 2.2.3, 2.2.5 and Table 1).

Latitudinal variability
The MME zonal average shown in Figure 5 highlights the LSMs' abilities and limitations in simulating the observed GSS and GSE timings at different latitudes.

365
The GSS bias ranges between -1.8 months (earlier GSS) just south of the Equator and +2.0 months (delayed GSS) south of 50 • S (Figure 5a). The GSE bias ranges between -3.0 months in the 0-10 • N latitudinal band and +1.3 months in the southern sub-tropics. The CRESCENDO LSMs correctly simulate the GSE timings north of 60 • N. The Spearman correlation of the GSS and GSE latitudinal distributions is 0.67±0.07 and 0.51±0.11, respectively. These values are significant at the 95% level based on a Monte Carlo approach.

370
In the NH mid-and high-latitude, the LSMs' GSS exhibit an average delay of up to 1.6 months, especially in North America (Supplementary Figure 10a). Note that differences among satellite data occur in the NH mid-and high-latitude, highlighting potential differences among these three products (see Section 4.2). Large LSM biases in NH tropical region GSE timings and southern sub-tropical GSS timings are driven by premature values in Africa (Supplementary Figures 10c,d). These discrepancies may derive from difficulties in the LSM's ability to simulate the observed phenology type in Africa (Figure 2 Figure 5a). Similar trends are reproduced by the LSMs, but with a higher magnitude ( Figure 5). In the NH, the difference between simulated and observed trends may be driven by an overestimated influence of radiation and temperature on leaf senescence in LSMs. In the SH, the discrepancies between observed and modeled trends may be related to relatively large phenology variability in the SH associated with the small vegetated land area in this hemisphere.

Regional variability
To assess sources of biases in the LSMs, different biomes derived from the ESA CCI land cover map ( In general, LSMs show a higher agreement in representing GSS timings compared to GSE timings. Consequently, the 400 different approaches used to describe the start of the growing season are relatively consistent among LSMs. In comparison, the representation of the end of vegetative season requires further investigation and development.

Land surface models
The plant phenology growing season start and end are mainly triggered by changes in solar radiation, temperature, and soil 405 moisture conditions (e.g. Caldararu et al., 2012;Zeng et al., 2013;Tang and Dubayah, 2017). State-of-the-art LSMs represent the phenological transitions using different parameterizations based on the climate conditions (Section 2.2). Many of these parameterizations (see Section 2.2) are based on values derived from localized observations (e.g. White et al., 1997;Thornton et al., 2002;Savoy and Mackay, 2015). Consequently, the phenology parameters are calibrated on specific regions of the globe, which may be one reason for the large spread of values seen in the present analysis. Generally, phenology calibration areas are located in the NH, where LSMs exhibit better results and larger coherence compared to the SH. Among the LSMs evaluated here, LPJ-GUESS, CLM4.5, and ORCHIDEE show good skill (agreement with observation larger than 15%) in the SH (Table 3). On the other hand, CLM5.0 and JULES-ES do not reach such agreement in the NH (Table 3). High skill (agreement with observation larger than 20% for at least one timing) in the NH are obtained by CLM4.5, ORCHIDEE, and ISBA-CTRIP (Table 3). The different performance between models can occur from differences in 415 phenology parameterization as well as different vegetation cover types (Plant and Crop Functional types), soil characterization, and initial spatial resolution (Table 1).
Among the LSMs evaluated here, JULES-ES shows relatively lower skill in simulating GSS and GSE timings compared to the other LSMs (Table 3). This result may be ascribed to the smaller number of PFTs (see Table 1) and details of the phenology parameterization that characterize this LSM (Section 2.2.3 and Table 1). JSBACH accounts for a similar number of PFTs 420 (Table 1), but features a more complex phenology scheme (Section 2.2.4 and Table 1), and its skill in reproducing GSS and GSE timings is higher than that of JULES-ES (Table 3). Similar to JSBACH, ORCHIDEE feature a PFT-oriented phenology scheme (Section 2.2.6 and Table 1), which contributes to the high skill noted for ORCHIDEE. In general, this comparison highlights the complexity of vegetation phenology modelling and the strong inter-linkages be-440 tween climate, hydrology, soil, and plants.

Satellite data
Satellite-based LAI datasets have been used in this work as a benchmark for the evaluation of the LSMs' phenology performance globally. However, satellite observations present some caveats and uncertainties (e.g. Myneni et al., 2002;Fang et al., (namely AVHRR for LAI3g, MODIS for MODIS LAI, and SPOT/PROBA VEGETATION for SENTINEL) have been used in this study. The comparison between these datasets shows major issues associated with LAI products derived from satellite reflectance observations. For example, large differences between LAI3g, MODIS, and SENTINEL occur at high latitudes and in tropical regions (Figure 5), where thick clouds and snow cover can affect the data reconstruction (e.g. Delbart et al., 2006;Kandasamy et al., 2013;Yan et al., 2016b). LAI satellite data are also affected by the applied gap-filling algorithms, 450 which could create spurious seasonal cycles as well as smooth the observed phenology season (e.g. Kandasamy et al., 2013;Chen et al., 2017). In addition, the observed reflectance saturates in regions characterized by dense canopies, reaching the LAI upper limits (e.g. Myneni et al., 2002;Maignan et al., 2011) of approximately 7.0 m 2 /m 2 in MODIS and LAI3g (Myneni et al., 2002). This issue can affect the identification of growing season cycles in thickly forested areas leading to possible overestimation of evergreen type detection. on the global scale. In particular, it takes into account SGS-D and TGS types that were neglected in previous analyses (Murray-Tortarolo et al., 2013) due to their reduced coverage (Table 2).

460
Regions with multiple growing seasons per year (TGS) are difficult to capture on a global scale, despite their important influence on climate (e.g. Zhang et al., 2003;Dalmonech and Zaehle, 2013;Peano et al., 2019). The state-of-the-art LSMs, indeed, exhibit a low skill in reproducing this specific growing season type (Figure 2 and Table 2). Two growing seasons usually occur in regions characterized by two separate rain seasons, in semi-arid areas or in cropland regions (e.g. Zhang et al., 2003Zhang et al., , 2005Martiny et al., 2006). In this analysis, observations and LSMs, except for JULES-ES and ORCHIDEE, agree on a 465 TGS type only in the Horn of Africa. This region features two distinct precipitation seasons (e.g. Liebmann et al., 2012;Peano et al., 2019), which trigger the TGS phenology type.
Cropland areas can present multi growing season behavior because of irrigation and crop rotation, such as in South Asia, and China (e.g. Wu et al., 2010;Gumma et al., 2016). Unlike the Horn of Africa, these regions are not captured as TGS in the present analysis due to assumptions and limitations within LSMs, for example CLM4.5 represents all annual crops by a 470 generic C3 PFT (Section 2.2.1, and Table 1), and 4GST assumptions. 4GST TGS type detection adopts a minimum length of three months to detect a growing season. This assumption derives from the need to avoid the detection of small oscillations within the same growing seasons . Consequently, this assumption affects the model recognition of multiple growing seasons, especially in cropland areas. South Asia, for instance, is characterized by different timing and phenology intensity for each crop growing season (Gumma et al., 2016). Therefore, only some specific crops can be detected based on 475 the 4GST assumptions and growing season signature (Gumma et al., 2016). Wu et al. (2010) distinguish multiple growing seasons in China using local maximum and detection threshold. This method improves the multi growing season identification, especially for the crops identified by a strong phenology cycle but may exclude crop characterized by a weak phenological cycle. For these reasons, more specific analyses of semi-arid and crop regions based on higher spatial and temporal data are needed and will be the focus of a future study.

480
Besides, LSMs' crop phenology parameterizations require further development to improve the description of each specific crop.
LSMs evaluate phenology at PFT level, but the final LAI values are returned at the grid-cell level. For this reason, a more detailed evaluation of the parameterization would require using PFT-level values. However, global coverage of observed PFT level phenology values are missing, making this analysis limited to specific biomes, such as through PhenoCAM data (Richardson 485 et al., 2018). This analysis will be the focus of a future work.

Conclusions
This study evaluates the ability of seven European Land Surface Models (LSMs) to reproduce the timings of start and end of the plant growing season at the global level. The assessment is performed based on the novel four growing season types methodology, and uses a set of three satellite observation products as a benchmark to account for some of the uncertainty in Despite a lower ability of LSMs to represent SH phenology, LPJ-GUESS, CLM4.5, and ORCHIDEE show reasonably good outcomes in these regions. In the NH, high skill is achieved by CLM4.5, ORCHIDEE, and ISBA-CTRIP. Uncertainties and spread among LSMs remain, which might affect our understanding of present-day and future impact of land and vegetation 505 interactions with the climate and carbon cycle. Therefore, further improvements in LSMs will be necessary. Improvements in the phenology parameterization can lead to better representation of vegetation in the LSMs. However, phenology in LSMs is influenced by vegetation and hydrological parameterizations and land surface boundary conditions (e.g. PFT distribution), as shown by the CLM4.5 and CLM5.0 phenology differences.
This study highlights the complexity of vegetation phenology modeling and the strong inter-linkages between climate, hydrology, soil, and plants, which need further details and generalization inside the LSMs code.   27 https://doi.org/10.5194/bg-2020-319 Preprint. Discussion started: 4 September 2020 c Author(s) 2020. CC BY 4.0 License.