Multi-scale assessment of a grassland productivity model

. Grasslands provide many important ecosystem services globally, and projecting grassland productivity in the coming decades will provide valuable information to land managers. Productivity models can be well calibrated at local scales but generally have some maximum spatial scale in which they perform well. Here we evaluate a grassland productivity model to ﬁnd the optimal spatial scale for parameterization and thus for subsequently applying it in future productivity projections for North America. We also evaluated the model on new vegetation types to ascertain its potential generality. We ﬁnd the model most suitable when incorporating only grasslands, as opposed to also including agriculture and shrublands, and only in the Great Plains and eastern temperate forest ecoregions of North America. The model was not well suited to grasslands in North American deserts or northwest forest ecoregions. It also performed poorly in agriculture vegetation, likely due to management activities, and shrubland vegetation, likely because the model lacks representation of deep water pools. This work allows us to perform long-term projections in areas where model performance has been veriﬁed, with gaps ﬁlled in by future modeling efforts.


Introduction
Grassland systems span nearly 30 % of the global land surface (Adams et al., 1990) and play a prominent role in terrestrial carbon cycles (Parton et al., 2012). Grasslands in North America provide a large proportion of food and fiber agricultural products for the region. Annual productivity of grasslands in central and western North America is driven in large part by precipitation (Sala et al., 2012). Future changes in the amount, intensity, and timing of precipitation will be heterogeneous across North America (Easterling et al., 2017), resulting in heterogeneous changes to grassland productivity. For example, even with consistent shifts in climate, different locations can experience different changes in productivity due to local-scale responses (Zhang et al., 2011;Sala et al., 2012;Knapp et al., 2017). This highlights the need for models which can be resolved at small spatial and temporal scales, thus making long-term grassland productivity projections as informative as possible.
A promising method for this is low-dimensional models, which are process-based models with some simplified components (Choler et al., 2010(Choler et al., , 2011. For example, a low-dimensional model might approximate transpiration as a function of potential evapotranspiration, soil available water, and live vegetation cover along with a single parameter. As opposed to a high-dimensional model with multiple functions accounting for leaf area index, stomatal conductance, rooting depth and surface area, etc. (Caylor et al., 2009;Asbjornsen et al., 2011), the low-dimensional model is advantageous since it can generalize across broad regions with relatively few inputs. Yet they are still susceptible to over-fitting to local conditions since parameters or model structure can be tied to specific locations or plant functional groups (Fisher and Koven, 2020). Thus parameterizing lowdimensional models must be done with care such that they are applicable to a broad area while maintaining an acceptable level of accuracy.
Here we evaluate a low-dimensional model with the intention of it driving productivity projections. The PhenoGrass model developed by Hufkens et al. (2016) is a pulse-response productivity model with temperature and precipitation as the primary drivers. The model is parameterized using observations from the PhenoCam network which have a small spatial resolution (footprints of < 1 ha), sub-daily sampling, and sites across all major biomes. These attributes make the PhenoGrass model potentially widely applicable. We expand on the evaluation of the original study by using 84 Pheno-Cam sites, totalling 89 distinct time series with 463 site years of data. Using the original methods and input datasets we test the model's performance with this expanded dataset. We also evaluate the model across different combinations of North American ecoregions and vegetation types to find an optimal spatial scale in which to parameterize and apply the model. Finally, we address where the model performs poorly and how productivity projections for these areas could be implemented or improved.

PhenoGrass model
The PhenoGrass model is an ecohydrology model which has interacting state variables for soil water, plant available water, and plant fractional cover (Hufkens et al., 2016). Model inputs are daily precipitation, temperature, potential evapotranspiration (derived from the Hargreaves equation; Hargreaves and Samani, 1985), and solar radiation. The primary output is fractional vegetation cover (fCover). The original model form, derived in Choler et al. (2010Choler et al. ( , 2011, used only temperature and potential evapotranspiration and was parameterized using satellite-derived normalized difference vegetation index (NDVI) data. Hufkens et al. (2016) expanded on the original Choler model by incorporating growth and senescence restraints from temperature and solar radiation, and also included a scaling factor to convert PhenoCam G cc (green chromatic coordinate) data to a fractional cover estimate (see equations in the Supplement). Hufkens et al. (2016) evaluated the PhenoGrass model using 14 grassland PhenoCam sites across western North America with a total of 34 site years. They found the modeled fractional cover correlated well with annual productivity at both daily and annual timescales. Despite its name the PhenoGrass model can theoretically apply to any vegetation type with a distinct growth signal in response to precipitation, as hypothesized in the original threshold-delay model (Ogle and Reynolds, 2004). Here we evaluate two additional vegetation types, shrubs and agricultural plots, to test how applicable it is beyond grasslands.

PhenoCam data
The PhenoCam network is a global network of fixed, nearsurface cameras capturing true-color images of vegetation throughout the day (Richardson et al., 2018a). Using a ratio of the three RGB bands a greenness metric (green chromatic coordinate, G cc ) is calculated from each image, resulting in a daily-scale time series of canopy greenness. G cc is a unitless metric which is highly correlated with satellite-derived NDVI (Richardson et al., 2018b) and flux-tower-derived primary productivity Toomey et al., 2015).
Each PhenoCam image is subset to one of several different plant vegetation types based on the field of view. These regions of interest (ROIs) serve as the basis for the G cc calculation and subsequent post-processing (Seyednasrollah et al., 2019).
We downloaded all PhenoCam data with ROIs of the grassland (GR), shrubland (SH), and agricultural (AG) vegetation types for the years 2012 to 2018, totalling 89 distinct time series and 463 site years ( Fig. 1; Table S2 in the Supplement). As input to the PhenoGrass model, we used the 3 d smoothed G cc scaled, for each ROI, from 0-1. In the model parameterization each ROI time series is further transformed to fractional vegetation cover (fCover; see Supplement equations) using the local mean annual precipitation (MAP) combined with a scaling factor (Hufkens et al., 2016;Donohue et al., 2013).

Environmental data
For historic precipitation and temperature, we used the 1 km resolution Daymet dataset (Thornton et al., 2018), extracting daily time series for the pixel at each PhenoCam tower location. Daily mean temperature was calculated as the average between the Daymet daily minimum and maximum temperatures and smoothed with a 15 d moving average. Potential evapotranspiration was calculated using the Hargreaves equation (Hargreaves and Samani, 1985). Soil wilting point and field capacity were extracted at each PhenoCam location from a global dataset (Global Soil Data Task Group, 2000). Despite other options being available we chose to use the same climate and soil datasets as those used in Hufkens et al. (2016) so that any differences in results can be attributed to the expanded PhenoCam dataset used here.

Model evaluation
To find the most appropriate spatial scale, we evaluated the model using three different combinations of ecoregion and vegetation type with 11 total model parameterizations ( Fig. 2). Here we use the term "spatial scale" to refer to the combination of ecoregion(s) and vegetation type(s) used within each model. This includes using all sites of one vegetation type within an ecoregion or all sites of a specific vegetation type from several ecoregions (Fig. 2). The largest spatial scale used all PhenoCam locations described above (89 sites). Next were all sites, respectively, within the three vegetation types indicated by the ROI (grasslands, shrublands, and agricultural). Finally, we parameterized models for each vegetation type within each level 1 North American ecoregion (e.g., all grassland sites within the Great Plains ecoregion). All sets of parameterized models were limited to have at least five sites.
We evaluated each of the 11 models using the Nash-Sutcliffe coefficient of efficiency (NSE; Eq. 2), as well as the mean coefficient of variation of the mean absolute er-  ror (F , Eq. 3) of the daily fractional cover estimates. NSE was calculated for each site and then averaged across all sites within the respective spatial scale (NSE; Eq. 1). There was no cross-validation using out-of-sample data in the initial fitting as it would have been computationally expensive. Rather, error metrics from these in-sample tests were treated as a best case scenario of what each model parameterization can achieve. From these results we used a threshold to select which models to evaluate further using cross-validation. The threshold value was an NSE of 0.65, which is viewed as "acceptable" for time-series models (Ritter and Muñoz-Carpena, 2013). Parameterization was done using differential evolution, a global optimization algorithm, to minimize F .
N is the number of sites within the spatial scale evaluated, n is the number of daily values at each site, fCover i,obs and fCover i,pred are observed and predicted values, respectively, and fCover obs is the average observed fCover at each site. Models which exceeded the threshold were subject to further evaluation. For each model we performed a leave-one-out cross-validation, in which the model was re-fit with one PhenoCam site not included in the training data and then evaluated against this left-out site. In this step a scaling coefficient to link mean annual precipitation with PhenoCam G cc was held constant at the value obtained in the first fitting. The resulting NSE and F are from all modeled sites using their respective out-of-sample test.

Results
At the largest spatial scale in which the PhenoGrass model was parameterized using all 89 sites, the model performed poorly with an NSE value of 0.31 (Table 1; Fig. 3). Models built using all sites of a respective vegetation type performed poorly as well, though they were slightly better than the all site model (Fig. 3). The best model performance was achieved when models were built using a single vegetation type subset for a single ecoregion. Grasslands within the Great Plains and eastern temperate forests ecoregions were the only instances where NSE exceeded the 0.65 threshold, though grasslands within northwestern forests came close (NSE = 0.64).
In all 11 models the PhenoGrass model tended to underestimate the highest fCover values and to a lesser degree overpredict the lowest values (Figs. 3 and 4). The best performing models (grasslands in the Great Plains and eastern temperate forests) minimized this effect (Fig. 4). The worst performing models, grasslands in North American deserts, had little variation in predicted fCover values, resulting in the lowest NSE overall.
The grassland vegetation type, subset to specific ecoregions, predominantly outperformed models built with other spatial scales (Table 1). Models built using grasslands within the eastern temperate forest and Great Plains ecoregions had the highest NSE values of 0.82 and 0.69, respectively. Using leave-one-out cross-validation on these two grassland models resulted in similar errors of 0.79 and 0.67 for the eastern temperate forest and Great Plains, respectively. Though northwestern forest grasslands had an in-sample NSE just be- Figure 3. Observed and predicted daily fCover values of the all site model and the three vegetation type models, each using all available sites and years with the respective spatial scale. NSE is the average Nash-Sutcliffe coefficient of efficiency among sites, and F is the mean coefficient of variation of the mean absolute error. The correlation line (blue) represents the overall trend in predicted versus observed values, while the 1 : 1 line (red) represents a perfect fit. low the 0.65 threshold, the cross-validation was well below it (0.52). Grasslands in the North American deserts were not modeled well at any spatial scale and had the lowest NSE values in the entire analysis. The observed greenness patterns of these desert grasslands had extremely high variability in their magnitude and timing with short distinct peaks in greenness and numerous off-peak fluctuations. The fitted model, which minimized F among the five sites, was not able to reproduce this high variability and instead produced fCover values that were severely constrained to a narrow range (Fig. S1 in the Supplement).
Agriculture and shrubland sites were poorly modeled at all spatial scales. Performance of agriculture within the eastern temperate forest ecoregion (NSE = 0.33) improved over the all agriculture model (NSE = 0.18) but decreased in the Great Plains (NSE from 0.24 to 0.18). There was only a single ecoregion with a minimum of five shrubland sites, North American deserts, and it performed only slightly better than the all shrubland model. Shrublands in North American deserts did not have the high variability seen in desert grasslands.

Discussion
We performed an extensive evaluation of the PhenoGrass model across ecoregions and vegetation types to determine the best spatial scale at which to parameterize and apply the model. We found the model most suitable to grassland vegetation when constrained to the ecoregion level, though it did not perform well in grasslands in the North American desert ecoregion. Shrublands and agriculture were not well represented by the model regardless of the spatial scale. Results from this study will facilitate long-term projections of grassland productivity constrained to an appropriate vegetation type and scale. The PhenoGrass model performed best in grassland sites embedded within ecoregions. Studies using earlier forms of the model applied it exclusively to grasslands (Choler et al., 2010(Choler et al., , 2011Hufkens et al., 2016), and results here confirm that it performs well in grassland vegetation with two exceptions. The model did not work in the desert grasslands, nor did it generalize well when built using all North American grasslands simultaneously. Grasslands in the North American desert biome coexist with shrubs, resulting in complex water use dynamics described in more detail below. The pulse-response design of PhenoGrass, which makes it well suited in areas with a high cover of perennial grass, is likely not applicable when grasses are interspersed with woody plants.
Shrublands were not well modeled at any spatial scale. Dryland shrubs, representing 8 of the 15 shrubland Pheno-Cams analyzed here, coexist with grasses by accessing different pools of soil water (Weltzin and McPherson, 2000;Muldavin et al., 2008) and thus have different responses to precipitation and resulting greenness patterns (Browning et al., 2017;Yan et al., 2019). A prior form of the PhenoGrass model was designed to work with dryland shrubs by using two soil water pools (Ogle and Reynolds, 2004), yet here PhenoGrass, with a single soil water pool, was less effective for shrubland vegetation. The single pool of the PhenoGrass model is coupled with fluxes from precipitation and evapotranspiration and thus is not well suited for representing the deeper water pools that shrubs can routinely access (Schenk and Jackson, 2002;Ward et al., 2013). Potential improvements would likely need to incorporate a deep soil water pool, in addition to the shallow pool, which are each utilized by the respective plant functional groups. This has already been implemented in highly parameterized ecohydrology models (Scanlon et al., 2005;Lauenroth et al., 2014) and could potentially be used here to make a more generalized PhenoShrub model to apply across large scales. This approach could also help in modeling North American desert grasslands which coexist among shrubs.
Agriculture areas performed poorly with the PhenoGrass model. Management practices of crops artificially increase productivity beyond what would naturally occur, and planting and harvest result in abrupt changes in greenness metrics (Bégué et al., 2018). While the results were not necessarily surprising, to our knowledge this is the first attempt to use near-surface images to drive a productivity model for . Observed and predicted daily fCover values for models from seven spatial scales, for which only specific vegetation types within a single ecoregion were used in model fitting. Each uses all available sites and years with the respective spatial scale. NSE is the average Nash-Sutcliffe coefficient of efficiency among sites, and F is the mean coefficient of variation of the mean absolute error. The correlation line (blue) represents the overall trend in predicted versus observed values, while the 1 : 1 line (red) represents a perfect fit. agricultural vegetation. We have shown that the PhenoGrass model, designed for natural systems, does not generalize to actively managed agricultural systems. Future work in using PhenoCam data to model agricultural productivity would likely need to incorporate crop-specific parameters and management activity which other cropland modeling systems use (Fritz et al., 2019). The integration of the PhenoCam network within the Long-Term Agroecosystem Research (LTAR) network will likely be beneficial for this as the timing and intensity of management activities or experimental treatments can be incorporated into modeling efforts. Hufkens et al. (2016) originally evaluated the PhenoGrass model using 14 grassland sites distributed among seven North American ecoregions. In their evaluation they had an average NSE of 0.71, while here the model performed poorly when using more than one ecoregion. It is likely that the original 14 grassland sites were ideal locations for the PhenoGrass model since on average they have a single greenup season every year in the spring or summer (Fig. 5a). The additional 24 grassland sites used in the current study have high seasonal variability and elongated growing seasons (Figs. 5b, S1) and were thus more difficult to represent in a single continental-scale grassland model. This highlights the need for longer time series in evaluating low-dimensional models as it may take many years for a single location to experience the full range of variability. As the PhenoCam data archive grows, temporally out-of-sample validation can be done to better evaluate performance in novel conditions. Further work could use the expanded PhenoCam time series to compare with flux-tower-derived productivity, which Hufkens et al. (2016) found was well correlated at the five original locations with available flux data.

Conclusions
Replication is an important step in the scientific process, especially given newly available data. Here we have validated prior modeling work and highlighted its limitations. Newer vegetation models can be validated in the same framework and applied to areas where PhenoGrass performs poorly. This can result in a spatial ensemble where the output for any one location and vegetation type is represented by the most appropriate model. Our current work will allow for long-term projections of grassland productivity for a large fraction of North America.
Author contributions. SDT designed the study and executed the modeling and analysis. SDT wrote the manuscript with input and suggestions from DMB throughout the writing process.
Competing interests. The authors declare that they have no conflict of interest.