Estimating dry biomass and plant nitrogen concentration in pre-Alpine grasslands with low-cost UAS-borne multispectral data – a comparison of sensors, algorithms, and predictor sets

. Grasslands are an important part of pre-Alpine and Alpine landscapes. Despite the economic value and the signiﬁcant role of grasslands in carbon and nitrogen (N) cycling, spatially explicit information on grassland biomass and quality is rarely available. Remotely sensed data from unmanned aircraft systems (UASs) and satellites might be an option to overcome this gap. Our study aims to investigate the potential of low-cost UAS-based multispectral sensors for estimating above-ground biomass (dry matter, DM) and plant N concentration. In our analysis, we compared two different sensors

Abstract. Grasslands are an important part of pre-Alpine and Alpine landscapes. Despite the economic value and the significant role of grasslands in carbon and nitrogen (N) cycling, spatially explicit information on grassland biomass and quality is rarely available. Remotely sensed data from unmanned aircraft systems (UASs) and satellites might be an option to overcome this gap. Our study aims to investigate the potential of low-cost UAS-based multispectral sensors for estimating above-ground biomass (dry matter, DM) and plant N concentration. In our analysis, we compared two different sensors (Parrot Sequoia, SEQ; MicaSense RedEdge-M, REM), three statistical models (linear model; random forests, RFs; gradient-boosting machines, GBMs), and six predictor sets (i.e. different combinations of raw reflectance, vegetation indices, and canopy height). Canopy height information can be derived from UAS sensors but was not available in our study. Therefore, we tested the added value of this structural information with in situ measured bulk canopy height data. A combined field sampling and flight campaign was conducted in April 2018 at different grassland sites in southern Germany to obtain in situ and the corresponding spectral data. The hyper-parameters of the two machine learning (ML) approaches (RF, GBM) were optimized, and all model setups were run with a 6-fold cross-validation. Linear models were characterized by very low statistical performance measures, thus were not suitable to estimate DM and plant N concentration using UAS data. The non-linear ML algorithms showed an acceptable regression performance for all sensor-predictor set combinations with average (avg; crossvalidated, cv) R 2 cv of 0.48, RMSE cv,avg of 53.0 g m 2 , and rRMSE cv,avg (relative) of 15.9 % for DM and with R 2 cv,avg of 0.40, RMSE cv,avg of 0.48 wt %, and rRMSE cv, avg of 15.2 % for plant N concentration estimation. The optimal combination of sensors, ML algorithms, and predictor sets notably improved the model performance. The best model performance for the estimation of DM (R 2 cv = 0.67, RMSE cv = 41.9 g m 2 , rRMSE cv = 12.6 %) was achieved with an RF model that utilizes all possible predictors and REM sensor data. The best model for plant N concentration was a combination of an RF model with all predictors and SEQ sensor data (R 2 cv = 0.47, RMSE cv = 0.45 wt %, rRMSE cv = 14.2 %). DM models with the spectral input of REM performed significantly better than those with SEQ data, while for N concentration models, it was the other way round. The choice of predictors was most influential on model performance, while the effect of the chosen ML algorithm was generally lower. The addition of canopy height to the spectral data in the predictor set significantly improved the DM models. In our study, calibrating the ML algorithm improved the model performance substantially, which shows the importance of this step.

Introduction
Grasslands are import ecosystems covering about 40 % of the global land area (excluding Antarctica and Greenland) (White et al., 2000). In pre-Alpine (i.e. the hilly Alpine foreland) and Alpine landscapes (i.e. the core Alps), grasslands are a dominant element. (Pre-)Alpine grassland ecosystems provide a variety of goods and services (Egarter Vigl et al., 2016) such as food and forage for livestock production, leading to a high economic value (Egarter Vigl et al., 2018;Gibson, 2009;White et al., 2000). At the same time, grassland plants and soils play a significant role in carbon (C) and nitrogen (N) cycling (Gibson, 2009;Wiesmeier et al., 2013) and are improving water purification and soil stability (Lamarque et al., 2011). Furthermore, mountain grasslands are among the most species-rich ecosystems in Europe and high in endemism (Ewald et al., 2018;Väre et al., 2003;Veen et al., 2009;White et al., 2000). With the agricultural intensification in the lowlands, mountain grasslands act increasingly as a sanctuary for species that were common throughout Europe (European Environmental Agency, 2010). Therefore, grasslands in mountain areas have important environmental and biological as well as aesthetic functions (Fontana et al., 2014).
Besides changing climatic conditions, human intervention proved to be an equally important driver of changing ecosystem functioning in managed (pre-)Alpine grasslands (Rossi et al., 2020;Schirpke et al., 2017;Spiegelberger et al., 2006;Walter et al., 2012). The knowledge about grassland yields (biomass) and fodder quality is critical for the management of grasslands and livestock, e.g. with regard to harvest time and frequency, stocking rates, or timing and amount of fertilizer application (Capolupo et al., 2015;Primi et al., 2016). Grassland quality with respect to the nutritive value of forage is assessed by key chemical parameters including crude protein or N, fibre, organic matter digestibility (OMD), and metabolizable energy (ME) (Pullanagari et al., 2016(Pullanagari et al., , 2013. On the field scale, information needs of farmers are closely related to different national implementations of the European Nitrates Directive (Council Directive 91/676/EEC of 12 December 1991), influencing management practices and economic revenues. On the regional scale, ecosystem characteristics such as the N balance and associated losses of greenhouse gases and N leaching needs to be assessed by authorities.
N uptake by plants is the highest N flux in pre-Alpine grasslands (Schlingmann et al., 2020;Zistl-Schlingmann et al., 2020). Thus, N uptake in relation to fertilization rates represents an important measure for optimizing grassland management on a farm and regional scale, as decision-making is getting more and more complex due to legislation and climate change (e.g. drought effects). Hence, a thorough mapping, monitoring, and assessment of grassland traits such as above-ground biomass (dry matter, DM) and chemical composition parameters (e.g. plant N concentration) are required to ensure the preservation of grassland ecosystems and their sustainable use. However, spatially explicit and accurate information on grassland biomass and quality on a field and regional scale is lacking. Robust and reliable methods and applications for grassland monitoring are needed, which ideally scale well and are cost-effective.
Considering the diversity and the large area covered by grasslands, traditional techniques based on field sampling or proximal sensing (e.g. field spectrometers) reach their limits when aiming for a regional assessment of grassland traits (Wachendorf et al., 2017). Here, remotely sensed data from satellites are increasingly established as promising data sources for a continuous and comprehensive mapping of vegetation parameters. Green vegetation can be monitored continuously using its spectral-reflectance properties acquired by optical sensors (Atzberger, 2013;Baret and Buis, 2008). The utilization of satellite information is of high value in particular when large and/or remote areas need to be studied. Also the fast data collection and processing, the relatively low costs of many remote sensing data products (Wachendorf et al., 2017), and time series of well-calibrated satellite sensors are advantageous.
However, while emerging services such as the Copernicus Land Monitoring Service provide land cover information at an unprecedented spatial and temporal resolution, these products still do not provide the necessary spatially detailed information in specific areas such as mountain regions. Mountains are often characterized by small and heterogeneous grassland patches, a high overall cloud occurrence, and frequent cloud formation at specific locations. Furthermore, steep terrain leads to shadows, often permanently affecting the same areas given the constant acquisition time of most satellites. Even outside permanently shadowed areas, bidirectional reflectance distribution function (BRDF) effects result from the highly variable sun-sensor-terrain geometries (Richter, 1998). Together, these factors limit the reliability of spaceborne observations in mountainous areas. Airborne remote sensing data have occasionally been used in the past to match the required spatial scale and to explore the increased radiometric resolution of hyperspectral sensors (Atzberger et al., 2015;Burai et al., 2015;. But airborne data are still affected by the abovementioned weather-and topography-related challenges. Furthermore, they are associated with higher costs for the users if there are no data available for the study region from other flight campaigns. Remotely sensed data from unmanned aircraft systems (UASs) are a promising possibility to overcoming satelliteand airborne-specific issues due to their high flexibility in flight planning, the very high spatial resolution (lower centimetre range, depending on flight height), and the availability of some low-cost multispectral systems. Vegetation traits can be mapped under challenging conditions on the field scale applying UASs (Maes and Steppe, 2019). BRDF information can be derived from UAS sensors -similar to tra-ditional airborne campaigns -as data are usually flown with high overlap, providing additional information (Koukal and Atzberger, 2012). However, besides their advantages, data acquisition with UASs also has some limitations. Most UASs cannot be operated under moist and windy conditions, and legal restrictions of the country and study regions need to be considered. Changing illumination (e.g. through clouds and variations in solar angle) affects the quality of imagery, making a sound radiometric calibration an essential processing step. Accordingly, the standardization and comparability of sensors and workflows are an issue, especially when accounting for the quality of low-cost sensors (Aasen et al., 2018;Assmann et al., 2018;Olsson et al., 2021;Poncet et al., 2019;Salamí et al., 2014).
Previous studies using UAS data have looked into the mapping of biophysical parameters such as leaf area index (LAI) (Verger et al., 2014;Yao et al., 2017), chlorophyll (Jay et al., 2017), biomass Viljanen et al., 2018), plant density (Jin et al., 2017), and canopy height (Song and Wang, 2019;Ziliani et al., 2018) as well as combinations of these parameters (Jay et al., 2019). However, most UAS studies investigate the mapping of plant traits in monocultural crop stands, while multispecies systems such as natural or cultivated permanent grassland ecosystems like in pre-Alpine regions have been studied less often. Notable exceptions are Bareth and Schellberg (2018), Grüner et al. (2019), Lussem et al. (2019), Wang et al. (2017), and Zhang et al. (2018). Even fewer studies investigate the potential of UAS-borne sensor data for the estimation of grassland quality. Capolupo et al. (2015) estimated various biochemical plant traits (crude protein, crude ash, crude fibre, sodium and potassium concentration, metabolic energy) from UASacquired hyperspectral images (400-950 nm) of experimental grassland plots in Germany. The authors compared the use of linear regression with narrowband vegetation indices (VIs) and partial least squares regressions (PLSRs), concluding that PLSRs yielded better results for biochemical parameters (R 2 ranging from 0.21 for sodium to 0.80 for metabolic energy). Wijesingha et al. (2020) investigated crude protein and acid detergent fibre of eight grassland sites in Hesse (Germany) using a hyperspectral sensor (450-998 nm). Five predictive regression algorithms were tested, of which the support vector regression achieved the best result for crude protein estimation (normalized RMSE = 10.6 %) and a cubist regression model proved best for acid detergent fibre estimation (normalized RMSE = 13.4 %). Although these studies achieved promising results for forage quality estimation, they rely on hyperspectral data.
There are far fewer studies available utilizing cheaper UAS-borne multispectral data to estimate grassland quality parameters. Caturegli et al. (2016) utilized the normalized difference vegetation index (NDVI) calculated from multispectral sensor (Tetracam ADC Micro) data in a linear regression to estimate the N status of three turfgrass species. Depending on the species, R 2 varied between 0.66 and 0.86.
Hence, the potential of low-cost multispectral UAS-borne data for field-scale mapping and assessment of multispecies grasslands is not yet fully tested and exploited.
Thus, the objective of this study is to evaluate the potential of low-cost UAS data for estimating DM and plant community N concentration of managed pre-Alpine grasslands. The multispectral Parrot Sequoia sensor (SEQ) has been applied in several vegetation mapping/monitoring studies in the agricultural context (e.g. Grüner et al., 2020;Guan et al., 2019;Handique et al., 2017;Matsumura, 2020;Moncayo-Cevallos et al., 2018;Stroppiana et al., 2018). However, some associated quality issues have been reported (Olsson et al., 2021;Poncet et al., 2019). Therefore, we want to compare the performance of the SEQ sensor with another low-cost multispectral sensor, namely the MicaSense RedEdge-M (REM). We used statistical learning algorithms to build regression models and estimated DM and N over the whole of the UAS scenes. We utilized the multispectral data of the two UAS sensors together with in situ data of DM, N concentration, and bulk canopy height (CH) from a test campaign in April 2018 on sites in southern Germany. Additionally to the multispectral data, we evaluated the importance of canopy height as a predictor, primarily to see if it could improve the predictive performance of the models for our study region. In our study, we addressed the following research questions: -Is the spectral information of the UAS sensors sufficient to estimate and map the spatial pattern of DM and N concentration on managed pre-Alpine grasslands?
-How important is the calibration of hyper-parameters of the tested machine learning algorithms for the model performance?
-What are the effects of different sensors, statistical modelling approaches, and predictor sets on the predictive capabilities of the models?
2 Material and methods

Study area, sampling design, and measurements of grassland traits
The study area is located in southern Germany (Fig. 1), within the German Terrestrial Environmental Observatories (TERENO) Pre-Alpine Observatory (Kiese et al., 2018;Zacharias et al., 2011). The region is characterized by a warm temperate climate, i.e. Cfb climate zone according to the Köppen-Geiger climate classification (Rubel et al., 2017). For the period 1981-2010 the mean annual air temperature at the study sites was between 8.0 and 8.6 • C (DWD Climate Data Center, 2019b), and mean annual precipitation was between 1008 and 1419 mm (DWD Climate Data Center, 2019a). Field data were acquired at 10 plots on managed grasslands (Table 1). The plots are situated on the three sites "Fendt" (FE, 600 m a.s.l.), "Rottenbuch" (RB, 700 m a.s.l.), and "Eschenlohe" (EL, 630 m a.s.l.). Care was taken to include different grassland types and management practices in order to render robust and transferable models also for our single campaign. The plots represent a variety of management intensities ranging from very extensively managed grasslands with no fertilizer application and just one cut per year to very intensively managed grasslands with five cuts and five slurry applications per year. A species inventory in June 2020 characterized 9 out of 10 plots as Arrhenatheretum elatioris, while 1 was classified as Caricion davallianae grasslands (Table 1). Figure 2 provides an overview of the workflow of this study. Details on the different working steps are presented in the following paragraphs and sections. The field campaign with UAS flights and vegetation sampling took place from 24-25 April 2018. The phenological stage of the plots ranged from principal growth stage 1 (leaf development) to 4 (development of harvestable vegetative plant parts) ( Table 1). After the UAS flights, at each site (FE, RB, EL) up to four 30 m × 30 m plots (FE1, FE2, F3, FE4, RB1, RB2, RB3, EL1, EL2, EL3) were sampled at 9 to 12 georeferenced subplots of 0.25 m × 0.25 m. Bulk canopy height (CH, in cm) was measured with a rising plate meter. The vegetation within the subplot was clipped down to stubble height (3 cm). In the lab, the vegetation samples were sorted into the plant functional types of non-green vegetation, legumes, non-leguminous forbs, and graminoids. After the samples were dried in an oven at 65 • C until constant weight was achieved, the dry weight was determined, and the dry biomass per area was calculated (dry matter, DM, in g m −2 ). For the determination of mean plant community nitrogen concentration (plant N concentration, mass-based, in wt %), the dried vegetation samples were milled and analysed with an elemental analyser (vario Max cube, Elementar Analysensysteme GmbH, Germany). The reader is referred to the corresponding data paper (Schucknecht et al., 2020a) for more detailed information on the sampling, sample processing, and analysis.
From the collected in situ data we used the information from the single subplots to develop the models (see Sect. 2.3). Canopy height (CH) was used as a predictor variable, and DM and plant N concentration were response variables (Fig. 2).

UAS flights
Two different multispectral sensors were tested for this experiment: the four-band Parrot Sequoia (SEQ; Parrot Drones SAS, Paris, France) and the five-band MicaSense RedEdge-M (REM; MicaSense Inc., Seattle, USA) ( Table 2). For measuring the incoming solar radiation, both sensors were accompanied by irradiance sensors ("sunshine sensors") that were attached at the top of the drones. This information was used for image calibration during data processing. Before each flight, data from sensor-specific calibration targets were taken for radiometric calibration of the multispectral images during the processing. The UAS flights over the FE and RB sites took place on 24 April 2018 between 09:50 and 16:30, and the ones over the EL site (EL-North and EL-South) were on 25 April 2018 between 09:00 and 10:50. The SEQ was operated on a fixed-wing UAS (eBee, senseFly, Cheseaux-sur-Lausanne, Switzerland) with automated flight control. The flight height was set to 80 m, leading to a ground sample distance of 8.7-12.9 cm (depending on the terrain relief). The eBee was flown with a regular grid flight pattern with an image overlap of 75 %.
The REM was operated on a multicopter UAS (DJI Matrice 200, SZ DJI Technology Co., Ltd., Shenzhen, China) by an external company (Globe Flight GmbH, Germany). Due to logistical reasons only the FE and RB sites could be covered. The multicopter was flown manually at a flight height of about 70 m following a regular grid with an overlap of the single images of > 80 %. The ground sample distance of the different REM flights was between 7.7-8.8 cm.
For all flights with the different sensors, up to 10 ground control points (GCPs) were distributed in the overflight area of the UAS for georeferencing. The exact coordinates of the GCPs' centres were obtained with a global navigation satellite system (GNSS) receiver (Viva GNSS GS 10, Leica Geosystems AG, Switzerland) run in static mode for 10 min which resulted in an accuracy of 0.3 cm in the horizontal direction and 0.5 cm in the vertical direction in the postprocessing mode (Datasheet of Leica Viva GNSS GS10 receiver, 2020).

Processing of UAS images
The processing of the UAS images was done with the PIX4Dmapper Pro software (Pix4D S.A., Prilly, Switzerland) and consisted of three steps. The photogrammetric processing was based on a structure from motion (SfM) approach. First, keypoints of the images were extracted and matched, and the internal (e.g. focal length) and external (e.g. orientation) parameters of the camera were calibrated. Georeferencing was done with the integration of the measured GCPs and their identification on several input pictures. The root mean square error (RMSE) of the georeferencing varied between 1.9 and 4.7 cm according to the Pix4D processing reports. As a result of the first step, georeferenced automated tie points were created. In the second step, the point cloud densification was done corresponding to the Pix4D standard template for agricultural applications. The final step included the mosaicking of the adjusted and calibrated single images to the orthomosaics of each single band. The final spatial resolution of the multispectral images was 9.6 cm for FE, 10.2 cm for RB-North, 10.0 cm for RB-South, 8.7 cm for EL-North, and 12.9 cm for EL-South for the SEQ data Major towns are indicated for reference (pink diamonds). Background: true-colour composite of Sentinel-2B images from 27 April 2018 (contains modified Copernicus Sentinel data from 2018, processed by ESA). Coordinate reference system used is EPSG:25832. and 7.7 cm for FE, 8.8 cm for RB-North, and 8.2 cm for RB-South for REM data. The radiometric correction of the input images was done using the data of the irradiance sensor and the reflectance panels.
Additional flights of the fixed-wing UAS equipped with an RGB (red-green-blue) camera (Sony Cyber-shot WX 220, Sony Corp., Minato, Japan) were performed on all sites to retrieve higher-resolution orthophotos (spatial resolution: 0.030 to 0.043 m) for the different sites of the study area. The georeferenced high-resolution orthophotos were used to manually extract the coordinates of the centre points of the subplots (Schucknecht et al., 2020a). Afterwards, the reflectance values of the georeferenced multispectral images from SEQ and REM were extracted and averaged for each subplot using a 3 pixel × 3 pixel window around the centre point (Fig. 2, grey box). The 3 pixel × 3 pixel window approximately corresponds to the size of the subplot. Due to the high horizontal accuracy of the GNSS measurements (0.3 cm) and the low RMSE of georeferencing (max 4.7 cm), we expect just minor location errors.
Note that we could just use spectral information from the obtained UAS images as predictors in the model development. Theoretically, it is also possible to derive canopy height information from high-resolution UAS-derived RGB data by creating a digital surface model and subtracting the digital terrain model (DTM) from it as e.g. shown by Grüner et al. (2019) and Wijesingha et al. (2019). Poley and McDermid (2020) emphasized the importance of a highquality DTM for deriving reliable vegetation structure estimates from UAS imagery. Unfortunately, we did not have such a high-quality DTM for our study sites and hence could not derive UAS-based canopy height information. Therefore, we used the in situ bulk CH as a substitute to build models with CH as a predictor variable.

Vegetation indices
A set of different vegetation indices (VIs) was calculated from the spectral bands (Supplement Table S1). The various ratio (number of indices used n = 6), orthogonal (n = 1), hybrid (n = 5), red-edge (n = 4), and modified chlorophyll Table 1. Site and plot characteristics partly taken from Schucknecht et al. (2020a). Mean annual climate parameters (MAP: mean annual precipitation height, MAT: mean annual temperature) were derived from the DWD Climate Data Center (DWD Climate Data Center, 2019a, b) and correspond to the period 1981-2010. Grassland type and species richness (SR; i.e. number of vascular plant species) were obtained by a species inventory in 2020 (Schuchardt and Jentsch, 2020). The phenological stage was determined by inspecting the photos of the plots with respect to the dominant species (species abbreviated as LM for Lolium multiflorum, TR for Trifolium repens, LP for Lolium perenne, PP for Poa pratensis, KP for Koeleria pyramidata, and FP for Festuca pratensis). Provided is the number of the principle growth stage (1: leaf development (main shoot), 2: formation of side shoots/tillering, 3: stem elongation or rosette growth/shoot development (main shoot), 4: development of harvestable vegetative plant parts or vegetatively propagated organs/booting (main shoot)) according to the BBCH (Biologische Bundesanstalt für Land-und Forstwirtschaft, Bundessortenamt und Chemische Industrie) classification (Meier, 2018 (n = 4) indices were selected from the overview presented in Asam (2014). In addition, hyperspectral indices dedicated to chlorophyll (n = 6) were selected from the summary of Ollinger (2011) and adapted to the multispectral data. In total, 26 VIs were calculated for REM data, and 18 were for SEQ data (due to the missing blue band).

Model selection
Regression models were built to estimate DM and plant N concentration based on multispectral UAS data and in situ bulk canopy height information (Fig. 2, brown box corresponding to model building, validation, and application).
Combinations of several regression algorithms and predic- tor sets (PSs) were compared to see how different modelling schemes affect the model performance. Two machine learning (ML) algorithms, namely gradient-boosting machines (GBMs; Friedman, 2002Friedman, , 2001 and random forests (RFs; Breiman, 2001), were used in this study. They have been confirmed to be comparable to the other state-of-the-art (classic) machine learning methods for remote sensing applications (Caruana and Niculescu-Mizil, 2006;Fernández-Delgado et al., 2019, 2014Orzechowski et al., 2018). The two selected algorithms are ensemble-based and have a relatively small number of hyper-parameters (Bernard et al., 2009;Friedman, 2001;Probst et al., 2019). These ensemble-based ML algorithms are known to be able to deal with a large number of highly correlated features (e.g. spectral data and derived vegetation indices) and non-linear relationships without excessive data pre-processing (Hengl et al., 2018). In addition to them, a linear regression model (LM) was built to serve as a baseline statistical learning model in the model performance comparison. GBM (Friedman, 2002(Friedman, , 2001 is an ensemble of models based on the idea that weak learners can form a strong learner. The algorithm is adding weak models using a gradient descent process. Gradient boosting can take various forms, i.e. different loss functions and optimization schemes. In this study, we took the standard implementation from Friedman (2001Friedman ( , 2002 following Greenwell et al. (2020). GBM has a few tunable parameters, with the major parameters including number of trees (N tree ), learning rate, and interaction depth (see Table 3), which are supposed to be calibrated using domain data to avoid under-and overfitting (Greenwell et al., 2020).
RF is a decision-tree-based ensemble algorithm that uses bootstrap aggregation (i.e. bagging) and the random subspace method (Breiman, 2001). For each decision tree a new bootstrap sample of the training data is created, and the tree is fitted to the data. RF has three hyper-parameters, namely the number of trees (N tree ), the number of randomly selected predictors in each split of the decision tree (m try ), and the minimum number of samples in terminal nodes (node size). It is suggested that for a good model performance the number of trees need to be large enough but should not yield to overfitting (Strobl et al., 2009). The parameter m try should be carefully calibrated, in particular when predictors are strongly correlated (e.g. Bernard et al., 2009;Kuhn and Johnson, 2013;Probst et al., 2019;Strobl et al., 2009). The node size determines how many samples a tree needs to grow without being pruned.

Hyper-parameter calibration
We parameterized the machine learning algorithms using the nested cross-validation scheme (Arlot and Celisse, 2010;Vabalas et al., 2019;Varma and Simon, 2006) (Table 3). In the nested design, the optimizer in the calibration routine does not see the information included in the hold-out fold. The calibration is done for each of the 10 iterations for randomly split 6 cross-validation folds. For each training fold, parameter searching was done in an internal 5-fold cross-validation using the root mean square error (RMSE) as a penalty function.
To minimize the computing time, we used an efficient parameter space searching algorithm named (sequential) model-based optimization (MBO) (Bischl et al., 2014;Martinez-Cantin et al., 2007;Shahriari et al., 2016). In this algorithm, an optimizer traverses the parameter space guided by a naive Bayesian parameter proposal function, which identifies a candidate region that is likely to include the optimal parameter combinations. In its iterative process, a new parameter proposal is made based on an acquisition function, or "infill", which is supposed to offer the best improvement in the next step. We used the "confidence bound" as infill for GBM and RF. This infill proposes a parameter combi-nation to minimize uncertainty around the good parameter estimates (Bischl et al., 2014). It tries to evaluate the parameter region with large uncertainty with low errors (i.e. good performance), thus expects to reach a large improvement if searched in the next iteration. In this study, parameter values are proposed and evaluated in 500 iterations sequentially, and the final values are selected by the lowest error. The impact of calibration is quantified by the difference between the initial error (i.e. based on the random combination of the parameters sampled from the prescribed ranges) and the best error, which is defined by the lowest error achieved (Malkomes et al., 2016;Swersky et al., 2013). Note that the calibration was done for the 10 iterations individually; in each iteration the nested six-folds share the calibrated parameters. Calibrated values and their summaries are presented in Table S2 and Figs. S2 and S3.

Predictor set definition
Six different sets of predictor combinations were used in the models. The number of predictors differs for models using SEQ and REM data and is provided in brackets below: -PS1 -raw reflectance bands, using only raw reflectance data from SEQ (n = 4) and REM (n = 5), baseline scenario -PS2 -vegetation indices (VIs), using just VI but not raw reflectance bands (n SEQ = 18, n REM = 26) -PS3 -raw reflectance bands and vegetation indices (VIs) (n SEQ = 22, n REM = 31) -PS4 -bulk canopy height (CH, from field measurements), testing the sole use of CH as a reference for structural information (n = 1) -PS5 -raw reflectance bands and bulk canopy height (CH, from field measurements), using spectral and structural information (CH) (n SEQ = 5, n REM = 6) -PS6 -raw reflectance bands, CH, and VI, all available spectral and structural input data (n SEQ = 23, n REM = 32).
Bulk CH was selected as a predictor because we wanted to test the effect of adding structural information; i.e. can the addition of UAS-derived structural information to the spectral information improve the estimation of DM and N concentration in pre-Alpine grasslands? Due to the missing digital CH model for our sites, we used the in situ bulk CH as a substitute. With the in situ bulk CH data we can test the effect of CH on the model results but cannot provide spatial predictions in the form of maps. Hence, models using CH (PS4-PS6) were excluded from spatial predictions. Table 3. Range of the hyper-parameters used in the calibration for gradient-boosting machines (GBMs) and random forests (RFs). The calibration routine searches for the optimal parameter values within the prescribed ranges. Typical default values for the GBM from Greenwell et al. (2020) and RF from Probst et al. (2019). The final calibrated hyper-parameters are presented in Table S2.

Input data for model development
We used data from FE and RB plots to train and test (internally validate) the regression models (n = 82 for DM; n = 81 for N). As REM data were not acquired at the EL site, field data from the EL plots (n = 32) were excluded from the model training. However, the field data from the EL plots were used as an additional external validation of the models utilizing data from the SEQ sensor (brown part of Fig. 2; see Sect. 2.3.5).

Model evaluation procedure
To derive robust statistics, the regression models were built using a 6-fold cross-validation and repeated 10 times with random data splits. Each repetition is connoted as an "iteration" throughout the paper. For each iteration, the data are again randomly split into six folds: five folds to train a model and the hold-out fold to test the model. The corresponding cross-validated evaluation metrics are denoted with the subscript "cv". The model evaluation metrics used in the study are the averages from the test folds of the 10 iterations. Ground observations from the EL site were used to validate the models based on SEQ data without further site-specific training -for this site REM data were unavailable ( Fig. 2; corresponding evaluation metrics indexed with the subscript "val"). Evaluation metrics used are the coefficient of determination of the validation (R 2 ), root mean square error (RMSE), relative RMSE (rRMSE), and bias (Bias) (Eqs. 1-4). All metrics were averaged over the 10 iterations.
where y is an observed value,ŷ is a prediction, and n is the number of samples. Relative RMSE is normalized by the observed data range and used to compare regression models with unequal data input following Richter et al. (2012).

Model implementation
We used GNU R 3.6 (R Core Team, 2021) for model implementation. GBM was built using the R package "gbm3" (Greenwell et al., 2020), and RF was built via the R package "randomForest" (Breiman, 2001). Linear regression models were built using all available predictors (LM full ) and the best subset of predictors (LM best ) using variable selection. The variable selection was done by an exhaustive search, i.e. evaluate the Akaike information criterion (AIC) (Akaike, 1973) of all possible combinations via the "regsubsets" function in the R package "leaps" (Lumley, 2020). Interactions among the predictors were considered in the ML models but not explicitly in the linear models using interaction terms. We did not include interaction terms in the LMs, as the linear models with (first-and second-order) interaction yielded very large prediction errors in the cross-validation scheme (results from the preliminary analysis, not shown here).

Variable importance
In ML, measuring variable importance (Strobl, 2008) is a standard way to evaluate an overall impact of a specific predictor, often among a large number of highly correlated predictors. In this study, we evaluated variable importance to see how the different predictors overall contribute to the model performance. It is our interest to identify if there is a small number of dominant predictors or rather a combination of many predictors that contain the crucial information. We investigated the variable importance (VarImp) of the predictors used in the ML regression models in each data and model combination.
For each model, we collected variable-importance measures from each 6-fold cross-validation and averaged them; this was repeated for the 10 iterations. As each iteration yielded unequal model performance, the importance metrics of each iteration was normalized by R 2 of each iteration before averaging, which resulted in the mean VarImp and its uncertainty range. Note that variable-importance measures are based on reduction in mean squared error (MSE) but calculated differently for each ML algorithm. For GBM, we used the relative influence measure suggested in Friedman (2001) and, for RF, permutation-based out-of-bag importance of Breiman (2001). Various R packages were used to calculate variable importance, depending on the algorithm (Greenwell et al., 2020;Lumley, 2020;Meinshausen, 2017Meinshausen, , 2006; R Core Team, 2021).

Mappings: spatial predictions of DM and N concentration
Spatial predictions were calculated for models that do not need CH data (i.e. PS1-PS3). We used the models to predict DM and N values for the whole of the UAS scenes of the three sites. The models are with 10 iterations, thus predicted 10 times, and averages and the coefficient of variations are reported. Note that spatial plant trait estimates may be only valid for pixels of unshaded and vegetated grassland.

Statistical tests for the marginal model performance
Model performance metrics were averaged over sensors, model algorithms, and predictor sets to derive marginal performance with respect to each component. We used nonparametric statistical methods to test the differences in R 2 and RMSE. For sensors and algorithms (n treat = 2), we used the non-parametric Wilcoxon signed rank test (Wilcoxon, 1945). For predictor sets (n treat = 4), we used the nonparametric Kruskal-Wallis rank sum test (Kruskal and Wallis, 1952) to test overall effect and Dunn's rank sum test (Dunn, 1964) to carry out post hoc tests between treatments. We used R packages "stats" and "dunn.test" (Dinno, 2017; R Core Team, 2021).

Variable interdependencies
Correlations between variables measured in the field can affect the modelling of DM and N concentration or can even be exploited to improve the modelling. We created scatter-plots of selected variables (Fig. 3) and calculated the Spearman correlation coefficient of canopy height and DM or N concentration, respectively (Fig. 3a, b). Canopy height values varied between 3 and 21 cm (median = 10 cm, n = 116) and were significantly correlated with DM (r = 0.69, p value < 0.01) but not with N concentration (r = 0.02, p value > 0.1). We also found no statistically significant correlation between DM and N concentration (r = 0.12, p value > 0.1; Fig. 3c). Based on these results, we would expect that canopy height could improve the modelling of DM but not of N concentration and that any successful modelling of N concentration does not simply reflect a correlation of the spectral data with DM.

Biophysical and spectral characteristics of field samples
The spectral discrimination of grassland samples with different levels of DM or N concentration is a prerequisite for the estimation of DM and N concentration with multispectral data. In our study, the DM values of the measured subplots varied between 7 and 340 g m −2 (median = 113 g m −2 , n = 114), and for plant N concentration it was between 1.2 wt % and 4.4 wt % (median = 2.9 wt %, n = 113). Despite the fact that we targeted homogenous grassland plots, there was a distinct spatial within-plot variability (Schucknecht et al., 2020a), which however can be reflected by the spatial resolution of the UAS-based multispectral data. In general, the spectral profiles of selected subplots (Fig. 4) follow the expected shape of vegetated surfaces with low reflectance values in the visible range of the spectrum and higher reflectance values in the NIR region. Subplots with different DM and N concentration values show slightly different spectral profiles, with the highest standard deviations of the reflectance values in the red-edge and NIR band. However, the spectral profiles of subplots with different DM or N concentration values do not follow a clear pattern, e.g. with monotonically increasing reflectance of the NIR band with increasing DM. There is a positive linear relationship between NIR reflectance and DM, but this is not very strong (Fig. 5). Additionally, these spectral profiles have altered patterns for the two sensors (Fig. 4), with the SEQ sensor generally showing higher reflectance values (Fig. S1).

Hyper-parameter calibration
During the calibration process, the model performance increased, as improved parameter sets are used in the course of the iteration procedure (Fig. S2). Compared to initial model performance, which is based on 12 randomly sampled parameter sets from the given ranges (i.e. first 12 time steps in calibration; Bischl et al., 2014), the magnitude of improvement on average was 11 %. The difference between the lowest error achieved and the initial error is 19.4 % (GBM) and 5.5 % (RF) in DM estimation and 16.1 % (GBM) and 2.9 %  (RF) in N concentration estimation, averaged for the REM and SEQ sensor. Thereby the best error (i.e. the smallest error achieved by the ith iteration) did not substantially decrease as the optimizer approaches the global optimum after 400 iterations in most of the cases (Fig. S2). However, in some cases better parameter values were still discovered at the very end of the iteration procedure (e.g. PS1 in Fig. S1a). RF parameters changed less than GBM parameters along the iterations ( Fig. S1b and d). Furthermore, parameter proposals are less fluctuating for RF than for GBM as shown in the distance between consecutive parameter proposals (Fig. S3).

Model results
Our results indicate that the ML algorithms performed substantially better than the linear models in estimating DM and plant N concentration (Tables 4, 5, S3). ML algorithms yield an average regression performance of 0.44 for R 2 cv . Throughout the sensor-predictor set combinations, average (avg) R 2 cv was 0.48 for DM (rRMSE cv,avg = 15.9 %; Table 4) and 0.40 for plant N concentration (rRMSEcv,avg = 15.2 %; Table 5). In contrast, the "baseline" linear models are very low in R 2 cv , seemingly unsuitable to estimate DM and plant N concentration (all models R 2 cv ≤ 0.1; Table S3). Therefore, we focus in the following on the detailed results from the ML models (Figs. 5, 6; Tables 4, 5) and further investigate their characteristics.
The optimal combination of sensors, predictor sets, and ML algorithms leads to a notable increase in model performance compared to the average performance of all ML models -for both DM and plant N concentration (Tables 4, 5; Fig. 5). The best model for the estimation of DM (R 2 cv = 0.67, rRMSE cv = 12.6 %) is an RF model that utilizes all possible predictors (PS6) with REM sensor data (Fig. 6a). The best model for plant N concentration (R 2 cv = 0.47, rRMSE cv = 14.2 %) is achieved by the combination of RF, the PS6 predictor set, and SEQ input data (Fig. 6d).
The bias of our tested ML models varies between −2.5 and 2.2 g m −2 for DM (Bias cv,avg = 0.1 g m −2 ) and between −0.04 wt % and 0.01 wt % for N concentration (Bias cv,avg = 0.00 wt %). Although overall biases are low (Bias cv,avg =< 1 % of the mean DM and mean N observation), the models tend to underestimate high DM and high N plant concentration (Figs. 6, S7, and S8).  , and predictor sets (PS1: raw reflectance data, PS2: VI, PS3: raw reflectance data + VI, PS4: canopy height, PS5: raw reflectance data + canopy height, PS6: raw reflectance data + VI + canopy height). The bars show the mean of the cross-validated coefficient of determination (R 2 ), and the error bars represent ± standard deviation of the 10 iterations per model.

Effect of different sensors
The effect of the multispectral sensor on the model performance varied by the different grassland parameters. For DM, the models that built on REM input data perform better (R 2 cv,avg = 0.52, RMSE cv = 50.6 g m −2 ) than models that built on SEQ data (R 2 cv,avg = 0.39, RMSE cv = 57.5 g m −2 ) averaged over algorithms and predictor sets (Table 4). This difference is statistically significant for R 2 cv and RMSE cv (p value < 0.01; Fig. S4a, c). Additionally, we tested the models built on REM data without the blue band to see if the performance gap is due to the additional blue band of REM. The results show generally better performance of the REM-without-blue-band-based models (R 2 cv,avg = 0.54) than SEQ-based models (Table S3), suggesting that the better performance of models using REM data is not (entirely) related to the additional blue band.
Considering the estimation of plant N concentration (Table 5), models utilizing SEQ input data perform generally better (R 2 cv,avg = 0.43, RMSE cv = 0.50 wt %) than those using REM input data (R 2 cv,avg = 0.32, RMSE cv = 0.52 wt %), and these differences are statistically signifi-cant for R 2 cv but not for RMSE cv (p value < 0.01 for R 2 cv , p value = 0.092 for RMSE cv ; Fig. S4b, d).

Effect of predictor sets and variable importance
The selection of the subsets of predictors clearly influences the performance of DM models (Figs. 5, S6, S7; Tables 4, A1). For REM-based models, all predictor sets that only use spectral data (i.e. PS1: raw reflectance; PS2: VI; PS3: raw reflectance + VI) show a similar and slightly higher performance than the predictor set using only canopy height (PS4). For SEQ-based models, the use of VI (PS2, PS3) or canopy height only (PS4) improves the model performance compared to the baseline scenario just using raw reflectance data (PS1). The best model results for REM-and SEQ-based models are obtained by combining spectral data with canopy height information (PS5, PS6). These predictor sets show significantly higher R 2 cv and lower RMSE cv than predictor sets with spectral data (PS1, PS2, PS3) or canopy height information only (PS4).
For plant N concentration (Figs. 6, S6, S8; Table A1), the models using spectral predictors (PS1, PS2, PS3) show a sim- Table 4. Overview of the DM models and cross-validation evaluation metrics for all combinations of sensors (REM, SEQ), predictor sets (PS1: raw reflectance data, PS2: VI, PS3: raw reflectance data + VI, PS4: canopy height, PS5: raw reflectance data + canopy height, PS6: raw reflectance data + VI + canopy height), and ML algorithms (GBM, RF). The unit of RMSE cv and absolute Bias cv is grams per square metre. All metric values of single-sensor-predictor set-algorithm combinations are averages of the 10 iterations. Best results per sensor in bold. The first 11 rows per parameter show aggregated median results (e.g. median of all DM models). N obs = 82.  ilar model performance (R 2 cv,avg = 0.39-0.42) that is insignificantly lower than for models using all available predictors (PS6; R 2 cv,avg = 0.43). Models using just canopy height as the predictor are not capable of predicting N concentration (R 2 cv,avg = 0.04) and perform notably worse than all other predictor sets (p value < 0.05 for RMSE cv ; p value < 0.05 for R 2 cv except for the comparison of PS1 and PS4).

Variable importance
The analysis of variable importance shows some general observations (see Table S4 for detailed results). The added value of the inclusion of structural information in the predictor set for the estimation of DM is further substantiated by canopy height having the highest relative variable importance in all predictor sets where it is included (PS5, PS6), independent of the used sensor data and ML algorithm. For example, canopy height accounts on average for 39 % of error reduction, measured by relative importance for all models using PS5, and 28 % for the all models using PS6. Besides, the NIR band shows the highest variable importance for DM estimation (Table S4) over all sensor-algorithm combinations in the baseline scenario (PS1). For predictor set PS5 (raw reflectance + CH), the NIR band has the second highest variable importance after canopy height, again over all sensor-algorithm combinations. If VI and raw reflectance were included in the predictor set (PS3, PS6), one (PS3 with the SEQ-GBM combination) or a few VIs (all other sensoralgorithm combinations) are ranked higher than the raw reflectance band with the highest variable importance. Vegetation indices that have a variable importance of at least 5 % in all sensor-algorithm combinations of predictor sets includ-ing VI (PS2, PS3, PS4) are the Datt's index (Datt, 1999), NDVI re , and RR1, which all include the NIR band in their formula (Table S1).
In contrast to DM, there is in general no clear order or dominance of a certain predictor recognizable over all sensor-algorithm combinations for the estimation of N concentration. Canopy height shows a much lower variable importance compared to the DM models, with average relative importance of 15 % for PS5 and 8 % for PS6.

Effect of modelling algorithm
Overall, the two tested ML algorithms show a smaller difference in model performance than the two sensors and the predictor sets (Tables 4, 5; Fig. S5) for DM and plant N concentration. RF usually performs better (DM: R 2 cv,avg = 0.52; N: R 2 cv,avg = 0.42) than GBM (DM: R 2 cv,avg = 0.47; N: R 2 cv,avg = 0.36), but neither the difference in R 2 cv and RMSE cv between GBM and RF for DM nor the difference in RMSE cv for N concentration is significant (p value > 0.1). Models using REM data generally show a higher difference in model performance between GBM and RF, both when considering DM and N.
Noticeable is the distribution of relative importance of the predictors between the two ML algorithms (Table S4). GBM is often characterized by one dominating variable (especially for DM), which has substantially higher relative importance than other variables. In contrast, RF models show a more gradual decrease in variable importance for the subsequent ranked predictors.

External model validation with data from Eschenlohe (EL)
The models based on SEQ sensor data were additionally validated against the ground observations from the EL site that were not used for model training. Considering DM models (  Fig. S9).
All N concentration models show very low R 2 values (R 2 val ≤ 0.03) for the external validation site ( Table 6). The models for the external validation site predict levels of N concentration > 2.2 wt % but do not sufficiently capture the variations between 2.2 wt % and 4.1 wt % (Figs. 8b, S10).
The GBM model using PS4 (canopy height) predicts a single value for all N observations (Fig. 8b), implying that the model is not sensitive in this range. The levels of rRMSE (rRMSE val,avg = 28.0 %) are also higher than those of the cross-validated results (rRMSEcv,avg = 16.3 %).

Spatial predictions
The spatial DM prediction with the best performing spatial model (RF model with REM data and predictor set PS3 -raw reflectance data + VI) for the RB-North site shows withinfield variability (Fig. 9a). Furthermore, the extensively managed field around plot RB3 is characterized by very low DM values, which corresponds to field observations. The individual iterations of this model combination show very similar DM predictions that just differ slightly in areas with low DM as indicated by the coefficient of variation (CV) of the 10 iterations (Fig. 9b). Compared to this, the difference in spatial DM prediction between different DM model combinations (Fig. 9c) is much higher with the highest differences (CV > 30 %) occurring at places with very low DM. The main spatial pattern between different model combinations is similar, but less pronounced spatial patterns may differ depending on the used combination of sensor, ML algorithm, and predictor set.
The spatial prediction of plant N concentration for the RB-North site (Fig. 9e) also shows a certain within-field variability, and the extensively managed field around plot RB3 stands out with very low N concentrations. Most of the grassland pixels outside the extensive field are characterized by N concentrations between 2.5 wt % and 3.5 wt %. The differences between the 10 iterations of one model combination (Fig. 9e) and between different model combinations (Fig. 9g) are generally lower than for the DM models.
The spatial prediction maps for the other sites (Figs. S11 to S14) also indicate within-field and between-field variability in DM and plant N concentration as well as the highest differences between models at grassland areas with low values of DM and N concentration.

Discussion
In this study, we analysed the potential to estimate DM and plant N concentration with low-cost UAS-based data in pre-Alpine managed grasslands. We tested two multispectral sensors, three statistical models, and six different predictor sets and evaluated marginal effects of them. The models were trained and validated with in situ data. An emphasis was put on the calibration of the two ML algorithms GBM and RF.

Suitability of multispectral data to estimate DM and plant N concentration
The spectral differences between samples of different DM and plant N concentration levels (Fig. 4) indicate that an  estimation of these two grasslands parameters could be obtained by multispectral UAS data. However, the link between spectral pattern and the level of DM or plant N concentration, respectively, does not seem to be straightforward as e.g. demonstrated by the weak linear correlation between DM and NIR reflectance and the unsuitability of linear models to estimate DM and plant N concentration. Potential reasons for the rather weak bivariate relationship could be the different species compositions of the subplots, differences during the acquisition (time during the day, clouds), and the radiometric correction of the multispectral sensors (see Sect. 4.7 for a more detailed discussion of UAS acquisition issues), which can be challenging and a source of uncertainty (Olsson et al., 2021). Accordingly, our model results confirmed that the estimation of DM and plant N concentration is feasible when applying machine learning algorithms but with a noticeable error range. The best models using all available multispectral data (i.e. raw reflectance and VI) plus bulk canopy height information achieved an R 2 cv of 0.67 (rRMSE cv = 12.6 %) for DM and an R 2 cv of 0.47 (rRMSE cv = 14.2 %) for N concentration. These findings are in line with other studies which also confirmed the suitability of ML algorithms for grassland parameter estimation based on UAS data. The multitemporal study of Grüner et al. (2020) of an experimental farm with legumegrass mixtures also applied an RF model to the spectralreflectance data and VI of an SEQ sensor and achieved an R 2 cv of 0.62 and rRMSE cv of 17 % for DM estimation. The authors showed that the modelling performance was clearly improved by adding texture parameters to the predictor set (R 2 cv = 0.87 and rRMSE cv of 11 %). The analyses of Wijesingha et al. (2020) addressed the prediction of forage quality in grasslands with multitemporal hyperspectral UAS data in the wavelength range between 450-998 mm. They compared different regression algorithms and found that support vector regression worked best for the prediction of crude protein (R 2 cv = 0.81, cross-validated normalized RMSE cv = 9.6 %), but RF yielded similarly good results.
In our study, plant N concentrations models did not perform as well as the DM models, generally achieving much lower accuracies. The same decrease in accuracy was also found for the external validation site, where the N models marked much lower R 2 and higher RMSE values compared to the DM models. Furthermore, the N concentration models benefitted to a much lesser extent from the addition of the canopy height information to the spectral predictors. One reason for the less good performance of N concentration models in our study is certainly the result of the lower value range of plant N concentration (coefficient of variation in all samples, CV N,all = 19.8 %) compared with DM (CV DM,all = 57.3 %). Most of the N concentration in leaves is related to pigments like chlorophyll and proteins involved in photosynthesis with the most important being Rubisco (Ollinger, 2011, and references therein). While pigments are the dominant absorbers in the visible range of the electromagnetic spectrum, non-pigment compounds mainly have absorption features at longer wavelengths (Ollinger, 2011, and references therein). In his review, Ollinger (2011) summarizes several hyperspectral vegetation indices that are used for chlorophyll detection. However, the author emphasized that the effects of plant N concentration on leaf spectra are still unclear, e.g. if spectral reflectance is mainly driven by direct effects of N-containing compounds or indirect effects of related traits. In a recent publication, Berger et al. (2020) question the use of the commonly used chlorophyll-nitrogen relationship, as it is not maintained after the vegetative growth stage, and propose to quantify instead leaf protein concentration. The authors recommend the use of hyperspectral sensors for N quantification, as the spectral signatures related to proteins are subtle and mainly located in the short-wave infrared (SWIR) region. This should be further explored together with ML models trained on radiative transfer models Overall CV of N concentration for all PS1 and PS3 models. Predictor set PS3: raw reflectance data + VI. Estimation of DM and N concentration represents the mean of the 10 iterations for the selected model. Note that spatial estimates are only valid for pixels of unshaded and vegetated grassland. (Berger et al., 2020) or other modelling approaches like crop, plant growth, or biogeochemical models assimilating remote sensing data.

Importance of machine learning algorithms and their calibration
Linear models generally failed to capture variability both in DM and plant N concentration estimation as expressed in the cross-validated predictive performance. In all cases, the predictive performance was substantially better for ML algorithms. Linear models do not cope well with a large number of highly correlated predictors as well as with nonlinearity (Marchese Robinson et al., 2017). Explorative modelling techniques such as manual feature engineering in linear models, including advanced models such as generalized linear models, may help to achieve better model performance to the level of ML algorithms, as those methods can cover some weaknesses of LM. As shown in the results, hyper-parameter calibration of ML algorithm was confirmed to be crucial, leading to 11 % improvement in model performance. The lowest error was not revealed in the early iterations, and the parameter searching was discovering lower error values almost until the end of the iterations, suggesting that there is a risk of drawing inference based on the sub-optimal result when "default" parameter values are used. Note that the impact of calibration was more pronounced for GBM; in contrast the calibration of RF was seemingly more efficient (i.e. discovered optimal parameter values in a smaller number of iterations) in line with previous studies (Bernard et al., 2009;Probst et al., 2019). The MBO algorithm used in the calibration is more efficient in exploring a high-dimensional parameter space than naive searching algorithms such as grid searching, random sampling, or Latin hypercube sampling (Bischl et al., 2014). MBO does not need binning or discretization and can also stop earlier than scheduled, unlike grid searching, when it reaches the prescribed goal or yields no improvement. Those features enable MBO to explore parameter domains more comprehensively and effectively. A caveat of the adaptive algorithm is that it needs prescribed stopping rules (e.g. number of iterations or percent of improvement), though such stopping rules do not assure the optimal performance. Objective stopping rules should be further investigated in future applications such as metrics based on convergence, e.g. the Gelman-Rubin diagnostic commonly used in Markov chain Monte Carlo modelling (Brooks and Gelman, 1998;Gelman and Rubin, 1992).
In this study, a nested cross-validation scheme is applied ensuring that the calibration routine does not see the holdout data. Otherwise, the calibration could rather lead to loss of predictive power (Vabalas et al., 2019). We should note that the DM and plant N concentration values of the validation site (EL) are within the range of observations made in the training sites (Schucknecht et al., 2020a). Training data span-ning a wide range of observed DM and N concentration values and maybe originating from different types of grassland at different times in the growing season would be desired to build a generally applicable model. Overall, the two tested ML algorithms yielded comparable model performance after calibration; RF performed slightly better (higher R 2 cv and lower RMSE cv ) but without statistical significance.

Impact of different sensors
The two tested multispectral sensors affected the model performance in different ways depending on the considered grassland parameter. While REM-based models outperformed SEQ-based models in the estimation of DM, SEQbased models yielded significantly better results for plant N concentration estimation in terms of R 2 cv and RMSE cv . The two sensors have a different spectral setup with slightly different central wavelengths. While SEQ has a wider green and red band, REM has an additional blue band. Furthermore, the red-edge bands of the two sensors are not overlapping as well as the NIR bands. Considering some reflectance spectra of grass (e.g. USGS spectral characteristics viewer, https://landsat.usgs.gov/spectral-characteristics-viewer (last access: 12 May 2022); Rossi et al., 2020;Rossini et al., 2012), we would assume that the difference in the red-edge band position could lead to significant reflectance differences between SEQ and REM for one sample (due to the steep increase in reflectance in the red-edge region). The effect of the different central wavelength in the NIR band might be less pronounced (as the NIR plateau is reached), and the different band width of the green and red bands are expected to have a negligible effect on the difference in reflectance value between the two sensors. These and other constructional differences in the sensors might partly explain differences in the spectral profiles of certain subplots (Fig. 4) and thus differences in model performance.
A possible reason for the SEQ sensor showing generally higher reflectance values than the REM sensor with sometimes even implausibly high reflectance values in the NIR band (> 0.7; Figs. 5, S1) might be calibration issues of the sensor. The radiometric correction of the multispectral sensors, which is needed to convert digital numbers to surface reflectance, can be challenging and a source of uncertainty (Olsson et al., 2021), which might be even more important than the spectral specification of the sensors. Poncet et al. (2019) compared different radiometric correction methods for the Parrot Sequoia sensor including the manufacturer method using a one-point calibration plus a sunshine sensor like in our study. The authors found no method allowing for maximizing data accuracy for all bands and different flight conditions. The manufacturer-recommended method that includes the sunshine sensor yielded comparable data accuracy as the best empirical method but could be improved by the combination with an empirical calibration (Poncet et al., 2019). In their study, Olsson et al. (2021) evaluated the accu-racy of the Parrot Sequoia camera and sunshine sensor, highlighting the influence of the camera temperature on the sensitivity of the camera, the influence of the atmosphere on the images, and the influence of the orientation of the sunshine sensor on raw irradiance data. We were not aware that the Sequoia sensor needs to be sufficiently warm before reaching a stable sensitivity (Olsson et al., 2021). Since we took images of the sensor-specific calibration targets before each flight, this might have negatively influenced our radiometric calibration and introduced uncertainty in the reflectance data. The above-mentioned difference between the reflectance levels measured by the two sensors is not likely to have a considerable effect on the results of the estimation of canopy traits, as only data from one of the sensors are used in model fitting. However, this means that the results from the model fit will only be valid for the reflectance data from the sensor used, and parameters from different model fits based on data from the two sensors are thus not directly comparable. Proper calibration of the two sensors would therefore be advised in future studies to make the results more generally applicable and comparable.
Besides constructional and radiometric correction aspects, changing illumination conditions may have contributed to differences in the reflectance values of a certain subplot in the comparison of the two sensors. Data acquisition at the Fendt site was partly affected by passing clouds, which is also visible in the data of the irradiance sensor of the SEQ (not shown). In general, it would be beneficial to have in situ reflectance data of the subplots (e.g. from a field spectroradiometer) to validate the reflectance values of the two UAS sensors, but these are not available.
In summary, we could not conclusively clarify the exact reasons that led to the differences in spectral signatures of the two sensors and their model performance. This would require additional laboratory and field tests that are out of the scope of this study. However, our study indicates that the REM sensor might be preferred for applications targeting biomass estimation in pre-Alpine grasslands.

Impact of different predictor sets
The models solely dependent on UAS data (PS1, PS2, PS3) were moderately good, both for DM and plant N concentration estimation. Adapting vegetation indices is straightforward and feasible under any condition. Therefore, it is important to evaluate its added value. With regard to the cross-validation, models using PS2 (VI) and PS3 (raw reflectance + VI) showed generally higher R 2 cv and lower RMSE cv values than that of PS1 (raw reflectance), but the difference was not significant neither for DM nor for N concentration. The addition of VI seemed to be more important for the external validation, as it significantly increased the predictive performance of DM for the validation site EL compared to the baseline scenario PS1 (Fig. S7). The addi-tion of VI in N models for the validation on the EL did not improve the model performance.
This finding is in line with other studies, which also pointed out that VI and other arithmetic band combinations may help to improve the prediction accuracy for vegetationrelated quantitative and thematic variables (Maschler et al., 2018;Seo et al., 2016). The benefits are observed despite the fact that VIs do not really add "new" information, which is not yet contained in the spectral signatures (Baret and Guyot, 1991;Atzberger et al., 2011). The empirically observed benefits are most probably linked to the reduction in shadowrelated brightness effects.
Our results highlight the importance of combining spectral information with canopy height for the estimation of DM. Models using spectral and CH information (PS5, PS6) had significantly higher R 2 cv and lower RMSE cv values than those using just spectral information (PS1, PS2, PS3) or just CH information (PS4). The effect of combining canopy height with spectral data in the predictor set is larger than the effect of the used ML algorithm or sensor. It may suggest that canopy height is playing a crucial role, as it reflects seasonal growth and canopy structure independent from the spectral information.
Canopy height as a sole predictor was not suitable to estimate plant N concentration. The effect of adding CH to spectral data in the predictor set was slightly positive but not significant. This low relevance of CH in the estimation of plant N concentration could be expected from the missing correlation between N concentration and canopy height (Fig. 4).
High-resolution UAS-based RGB data can be used to derive canopy height models that can then be integrated in the spatial DM estimation. Some studies already utilized canopy height information in the estimation of grassland yield (Grüner et al., 2019;Lussem et al., 2020Lussem et al., , 2019Viljanen et al., 2018). However, it needs to be kept in mind that a precise and high-resolution DTM is required to derive reliable vegetation structure estimates from UAS imagery (Poley and McDermid, 2020). The generation of such high-quality DTMs can be challenging in areas with a dense vegetation canopy as is the case for our pre-Alpine grasslands. In their review, Poley and McDermid (2020) reported different methods for DTM generation that have been applied when ground points were not well visible: active sensors like lidar and terrestrial or aerial laser scanning as well as terrain interpolation based on high-accuracy GPS data collected on the ground (see references in Poley and McDermid, 2020). A low-cost alternative to the active sensors might be a UAS-based digital surface model of the freshly cut grassland as used for example in Lussem et al. (2019).
In addition to CH, texture and the spatial variation in the image elements were shown to correlate with vegetation structure and heterogeneity (Gallardo-Cruz et al., 2012) and can vary with the phenological stage of the vegetation (Culbert et al., 2009). Grüner et al. (2020) demonstrated that the modelling performance of DM in legume-grass mixtures was improved by the addition of texture parameters in the predictor set.

Spatial predictions
Spatial pattern in DM and N concentration can differ depending on the used combination of sensor, ML algorithm, and predictor set of the model, especially with respect to less strong patterns. The magnitude of these differences in terms of the coefficient of variation in all used models is larger for DM than for N concentration with the highest CV in areas of low DM or N concentration values. However, without additional spatial information (on soil properties, soil moisture, species composition, etc.) it is hard to interpret these differences in spatial pattern and assess the quality in spatial prediction of the single models. We plan to adapt the developed ML models to multiple campaign applications and already collected field data of several plots at different dates during the growing season 2019 and 2020. In this context, it would be an interesting research question if the spatial pattern of single models are persistent in time.

Transferability of model results
A limitation of this study is that the model is trained using data of a single flight campaign at each site, which may raise the question about the transferability of the developed models; i.e. do the relationships apply also for data from other sites and dates (across different growth stages)? The training data were collected from several sampling sites differing in management, species composition, current canopy height, and phenological stage to increase the general validity of the results of a single campaign. The spatial transferability of the developed models was (partly) tested with the external validation with data from the Eschenlohe site that were not used in model building. The results indicate that DM models work moderately well at other sites that are within the value range of the training sites (Fig. 8a). With respect to the estimation of N concentration, the models failed to predict the variability in N at the validation site (Fig. 8b). However, the model is fairly good at capturing the mean N concentration values (e.g. small bias in Fig. 8b), implying it needs to learn more about the low and high N domains. Therefore, we expect that both DM and N models would benefit from an increased training database that captures a wider range of values and originates from different grasslands.
The applicability of the developed models across different phenological stages is partly accounted for by the use of training data originating from different phenological phases (Table 1). However, the full range of phenological phases is not covered. Including training data from multiple sampling campaigns over the growing season would be desired for further studies, as Rossini et al. (2012) showed a change in the reflectance spectra of grasslands for different times in the year.
Another aspect of transferability is whether the trained model is reusable in other problem domains. The models we used (GBM and RF) are not directly reusable in other problem domains, meaning the trained weights are not usable if there is any change in model setup (e.g. addition or removal of predictors, changing response variables). This is due to the fact that the two algorithms belong to the family of "shallow" learning algorithms, in contrast to "deep" algorithms. They learn features in a small number of layers (i.e. shallow), compared to deep algorithms often comprised of several layers. However, the shallow algorithms are still useful to other studies in the sense that model diagnosis metrics (i.e. variable importance) and optimal model structure (i.e. calibrated parameter values) are informative to similar research questions. These metrics are useful to understand the processes and help design field campaigns and build new models in another domain, space, and time. In contrast to deep algorithms, they are relatively straightforward to build and train while being moderately good and robust at capturing variations.

Acquisition of UAS data
The acquisition of UAS data has some advantages over satellite imagery like the flexibility in flight conduction (no fixed overflight day and time) and the possibility of acquiring data during cloud cover. Having the full control of flight scheduling also allows for deciding whether the acquisition conditions are sufficient for the specific application and the corresponding data quality requirements or whether the campaign needs to be postponed. On the other hand, the UAS flight campaigns depend on good weather conditions (no precipitation, little wind), and it is sometimes not easy to find a suitable date when the weather is suitable and all people from the field campaign team have time. In general, stable illumination conditions during the flights are desirable to avoid negative effects on the data quality (e.g. Assmann et al., 2018). In practice, one often has to face the trade-off between data availability and data quality. From our experience, it can sometimes be difficult to find an optimal date for conducting a UAS campaign in the desired phase of the grassland development stage. Therefore, we also have to accept changing illumination conditions (e.g. due to passing clouds, different sun angle) in order to have at least an acquisition even if the data quality might not be optimal. The present study represents such a real-world case where we were searching for a window of good weather and where we had to coordinate quite a lot of people from different institutions. Finally, we had sub-optimal illumination conditions at one of our sampling sites due to passing clouds.
In addition, the duration of UAS data acquisition on the first sampling day was quite long, as we flew at two sites and the field team was not yet practised. Here, we identified a clear potential for optimization. Measuring the position of the GCPs and the centres of the subplots with the GNSS in "topo point mode" (a measurement just takes a few seconds, but it requires mobile-phone coverage) instead of static mode would save a lot of time. Furthermore, the workflows in the field could be improved to require less time. With these optimizations the duration of the flights could be shortened and conducted around solar noon. However, the trade-off between the number of flights (i.e. number of different sites covered) and data quality aspects due to the varying sun angle partly remains.

Data quality
Issues with the quality of low-cost UAS sensors and radiometric calibration have been reported in the literature (e.g. Aasen et al., 2018;Assmann et al., 2018;Olsson et al., 2021;Poncet et al., 2019). In our study, the difference in the spectral profiles of subplots between the two used multispectral sensors raises questions related to the quality of the obtained data but could not finally be addressed. The placement of spectral-reflectance targets in the overflight area (Aasen et al., 2018;Assmann et al., 2018) and the simultaneous collection of field spectrometer measurement would allow for a better assessment of the quality of the obtained UAS data. The practical feasibility of the latter option might be a constraint, especially in view of the required field personal and duration of fieldwork. The lack of a standard quality control information layer provided by the data processing workflow of Pix4D as compared to certain satellite data products is a drawback for the user.
In summary, we think that there are quite a few measures to improve the quality of UAS data, but not all can be considered at all times in practical applications. At the end the user of UAS data needs to accept that the quality cannot be as good as for satellite imagery and should consider this aspect in the interpretation of the derived products. However, extending campaign periods and increasing the replication and the sampling frequency could actually lead to a reduction in uncertainty especially under a limited budget (Kim et al., 2022). Thus we may well advocate the use of low-cost sensors in a range of applications which require high-spatial resolution and flexible application options.
To date, using low-cost UAS data is the only affordable way to acquire individual-level spatial information for a specific location and time. In precision farming, such finegrained spatial information supports the optimization of fertilizer application, weed and disease management, harvest, and irrigation (e.g. Tsouros et al., 2019). For such applications, the value of low-cost sensors is rather high even if their spectral quality is not at the level of satellite or highprecision sensors. Spatial patterns acquired from a low-cost sensor product can be directly used to derive spatial gradients and as a complement to satellite products.

Conclusions
Spatially explicit information on grassland biomass and quality could improve local farm management and support regional-scale assessments, e.g. on nitrogen cycling. This study aimed to develop, assess, and apply models to estimate DM and plant N concentration of pre-Alpine grasslands on the field scale with UAS-based multispectral data and canopy height information. We tested two different sensors, three statistical modelling approaches, and six input data sets with respect to their effect on model performance using in situ data from 10 permanent grasslands. Our results indicate that ML algorithms are able to estimate DM and plant N concentration, whereby DM models showed better performance in terms of R 2 and RMSE. The combined use of spectral and canopy height information in the predictor set significantly improved the prediction for DM but not plant N concentration. Including VI was also beneficial for DM prediction but to a lesser extent. Data from the REM sensor yielded significantly better model performance results for DM estimation, while SEQ data were significantly better for plant N concentration estimation. Overall, machine learning algorithms utilizing UAS-based multispectral data and canopy height information proved to be a promising tool for the estimation of DM and plant N concentration in pre-Alpine grasslands. Further research should address the transferability of approaches (e.g. by extending the calibration and validation database), the improvement of the models (e.g. by the incorporation of texture parameters), and the spatial upscaling through the utilization of satellite data.
Appendix A   Table A1. Results of the non-parametric statistical tests between parameter pairs on R 2 cv and RMSE cv . Three different tests were carried out: Wilcoxon test for sensors and algorithms (N treat = 2), Kruskal-Wallis test for predictor sets overall, and Dunn's test between predictor sets (see details in Sect. 2.3.5). Note that all the tests were done for paired samples. Code availability. The code used in the preparation of this paper is available upon request from the authors.
Data availability. The field data set used in this study is available in the PANGAEA repository at https://doi.org/10.1594/PANGAEA.920600 (Schucknecht et al., 2020b).
Author contributions. AS conceptualized the project and its methodology; performed data curation, data visualization, and the formal analysis; conducted the investigation; administered the project; and wrote the original draft of the paper. BS performed data curation and the formal analysis, conducted the investigation, designed the project's methodology, implemented software for the investigation, validated the results, visualized the data, and wrote the original draft of the paper. AK performed the formal analysis, acquired funding, conducted the investigation, provisioned resources, and reviewed and edited the paper. SA designed the project's methodology and wrote the original draft of the paper. CA reviewed and edited the paper. RK acquired funding, conducted the investigation, administered the project, and reviewed and edited the paper.
Competing interests. The contact author has declared that neither they nor their co-authors have any competing interests.
Disclaimer. Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Acknowledgements. We thank our colleagues from IMK-IFU, KIT, for their great support during the field and laboratory work and the IT department for supporting data processing and model running; our partners from the Technical University of Munich for the N analysis; Kerstin Grant from Agricultural Centre Baden-Württemberg (LAZBW) for the determination of the phenological stages of the plots; SAPOS for providing correction data; and the local farmers for their cooperation. We also gratefully acknowledge the computational and data resources provided by the Leibniz Supercomputing Centre (https://www.lrz.de, last access: 19 May 2022; project no. pn69tu). Furthermore, we thank Andreas Ibrom for handling the paper as an editor and Francesco Fava and one anonymous referee for their valuable comments.
Financial support. This research has been supported by the German Federal Ministry of Education and Research (grant nos. 031B0027A, 031B0516A, 031B0027E, 031B0516E, and 031B0516F) via the SUSALPS project (https://www.susalps.de/, last access: 19 May 2022). BS has been supported by the Seventh framework programme of the European Community (grant no. 603416) and the Helmholtz Association.
The article processing charges for this open-access publication were covered by the Karlsruhe Institute of Technology (KIT).