Articles | Volume 23, issue 7
https://doi.org/10.5194/bg-23-2235-2026
© Author(s) 2026. This work is distributed under the Creative Commons Attribution 4.0 License.
Uncertainty Assessment in Deep Learning-based Plant Trait Retrievals from Hyperspectral Data
Download
- Final revised paper (published on 08 Apr 2026)
- Supplement to the final revised paper
- Preprint (discussion started on 09 Apr 2025)
- Supplement to the preprint
Interactive discussion
Status: closed
Comment types: AC – author | RC – referee | CC – community | EC – editor | CEC – chief editor
- RC1: 'Comment on egusphere-2025-1284', Anonymous Referee #1, 22 Jul 2025
  - AC1: 'Reply on RC1', Eya Cherif, 27 Aug 2025
- RC2: 'Comment on egusphere-2025-1284', Anonymous Referee #2, 26 Jul 2025
  - AC2: 'Reply on RC2', Eya Cherif, 27 Aug 2025
Peer review completion
AR – Author's response | RR – Referee report | ED – Editor decision | EF – Editorial file upload
ED: Reconsider after major revisions (02 Sep 2025) by Mirco Migliavacca
AR by Eya Cherif on behalf of the Authors (07 Oct 2025)
Author's response
Author's tracked changes
Manuscript
ED: Referee Nomination & Report Request started (08 Oct 2025) by Mirco Migliavacca
RR by Anonymous Referee #1 (23 Oct 2025)
RR by Anonymous Referee #2 (17 Nov 2025)
ED: Publish subject to minor revisions (review by editor) (18 Nov 2025) by Mirco Migliavacca
AR by Eya Cherif on behalf of the Authors (16 Dec 2025)
Author's response
Author's tracked changes
Manuscript
ED: Reconsider after major revisions (23 Dec 2025) by Mirco Migliavacca
ED: Publish subject to technical corrections (24 Jan 2026) by Mirco Migliavacca
AR by Eya Cherif on behalf of the Authors (31 Jan 2026)
Author's response
Manuscript
Post-review adjustments
AA – Author's adjustment | EA – Editor approval
AA by Eya Cherif on behalf of the Authors (17 Mar 2026)
Author's adjustment
Manuscript
EA: Adjustments approved (30 Mar 2026) by Mirco Migliavacca
GENERAL COMMENTS
This is an interesting work that deals with the capability of inferring the uncertainty of deep learning models' predictions from the dissimilarity between the data seen during training and the model's input data. The proposed methodology is tested on a comprehensive dataset of observations, and the results suggest that, particularly for unseen data, the uncertainty estimates are more accurate, or at least more conservative, than those provided by other methods, which appear to underestimate uncertainty systematically. The results depend on the biophysical variable predicted by the model, with large differences in performance or behaviour in some cases. The manuscript and the results are well presented, and the discussion is consistent with them; still, some relevant questions remain to be clarified. Overall, the work is relevant to the domains of remote sensing and machine learning, and the proposed method appears to improve upon state-of-the-art alternatives.
The methodology could be more clearly presented (e.g., including equations and a flowchart). Additionally, some results, particularly those regarding LAI, require a deeper inspection to support the hypotheses the authors put forward to explain their findings. Some points made in the discussion should be reviewed or linked to the methodological choices made by the authors.
SPECIFIC COMMENTS
Lines 162-166: Clarify to the reader that the spectral data correspond to proximal-sensing and airborne canopy reflectance factors, so that it is not necessary to consult Table S3 to learn this detail. Also, specify whether these datasets were gathered from open-access repositories or were privately provided by the producers for this study.
Lines 170-172: While the approach is generally accepted (e.g., the ASD field spectroradiometers interpolate to 1 nm steps in their output), I wonder how the authors weighed this choice. Interpolation cannot add information to the datasets with the coarsest spectral resolution. What impact do the authors expect from this imbalance in the information carried by each dataset? Do they expect that the information contained in the narrow spectral features, which survive only in the datasets with the highest spectral performance, will be learned by the model even though it is not present in all the datasets, or would this mixture simply be a source of confusion and uncertainty during training? In the second case, would it not be more robust to downgrade the spectral resolution to the lowest among the datasets?
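To make the trade-off concrete, here is a minimal, purely illustrative sketch (not the authors' code; wavelength grids and the toy spectrum are assumptions) of why interpolating a coarse spectrum up to a 1 nm grid cannot recover narrow spectral features:

```python
import numpy as np

def resample_spectrum(wl_in, refl_in, wl_out):
    """Linearly interpolate a reflectance spectrum onto a target wavelength grid (nm)."""
    return np.interp(wl_out, wl_in, refl_in)

# Illustrative spectra at two native resolutions
wl_fine = np.arange(400, 2501, 1)      # 1 nm field spectroradiometer grid
wl_coarse = np.arange(400, 2501, 10)   # 10 nm airborne imager grid

refl_coarse = 0.3 + 0.1 * np.sin(wl_coarse / 300.0)  # smooth toy spectrum

# Upsampling cannot create information: the 1 nm version of a 10 nm
# spectrum is just a straight-line interpolation between coarse samples,
# so any absorption feature narrower than 10 nm is irretrievably lost.
refl_up = resample_spectrum(wl_coarse, refl_coarse, wl_fine)
```

The alternative raised above (downgrading everything to the coarsest resolution) is the reverse operation: averaging the fine spectrum into the coarse bands, which equalizes the information content at the cost of discarding the narrow features.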
Lines 172-174: Spectra with different noise levels are smoothed with the same window width. The field spectroradiometers are likely less noisy than the airborne imagers; thus, there is a risk of over-smoothing already smooth data. Overall, the aim should be to achieve a comparable noise level in each dataset. Could processing the data according to their noise levels improve the learning process? Considering this further, I understand that the uncertainty levels are unknown, but perhaps different degrees of reliability could be assigned to field spectroscopy and airborne imagery. Would giving different weights to each type of data improve the results?
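A hypothetical sketch of noise-adaptive smoothing, using a simple boxcar filter and illustrative window widths and noise levels (the authors' actual filter and settings may differ):

```python
import numpy as np

def moving_average(spectrum, window):
    """Boxcar smoothing; the window width should scale with the sensor's noise level."""
    kernel = np.ones(window) / window
    return np.convolve(spectrum, kernel, mode="same")

rng = np.random.default_rng(0)
wl = np.arange(400, 901)
clean = 0.4 + 0.2 * np.sin(wl / 50.0)              # toy noise-free spectrum

field = clean + rng.normal(0, 0.002, wl.size)       # low-noise field spectrum
airborne = clean + rng.normal(0, 0.02, wl.size)     # noisier airborne spectrum

# A narrow window preserves fine features in the already-smooth field data;
# the wide window is reserved for the noisier airborne data.
field_s = moving_average(field, 5)
airborne_s = moving_average(airborne, 21)
```

Applying the 21-sample window to the field data as well would suppress residual noise it barely has, while blunting exactly the narrow features that make the high-resolution data valuable.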
Section 2.1.2: This section’s clarity could benefit from some equations. Additionally, a flowchart summarizing the methodology would provide a clearer view for the reader at an early stage of the manuscript.
Lines 205-207, Section 4.4: The absolute value of the errors is taken. Do the authors expect the errors to be symmetric and centered on zero, or was this checked? Would any knowledge be gained if the signed error and 2.5 % and 97.5 % quantile regressions were used instead (e.g., revealing biases in specific directions for specific vegetation types)? In the discussion (Section 4.4), they precisely raise the issue of assuming or forcing symmetric error distributions, yet their analyses start from absolute error values. Could they comment on the potential impact of this choice in the context of symmetric distributions, and perhaps outline future lines of research?
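A toy illustration of the point, assuming a right-skewed error distribution (purely illustrative, not the authors' data): the symmetric band derived from |error| quantiles misrepresents both tails, whereas signed 2.5 %/97.5 % quantiles capture the asymmetry:

```python
import numpy as np

rng = np.random.default_rng(1)
# A right-skewed signed-error distribution (e.g., a directional bias
# for a specific vegetation type)
errors = rng.gamma(shape=2.0, scale=0.5, size=100_000) - 0.5

# Signed 2.5 % / 97.5 % quantiles expose the asymmetry directly
q025, q975 = np.quantile(errors, [0.025, 0.975])

# The symmetric band built from |error| hides it
abs_q95 = np.quantile(np.abs(errors), 0.95)
symmetric_band = (-abs_q95, abs_q95)

print(f"signed quantiles: ({q025:.2f}, {q975:.2f})")
print(f"symmetric band:   ({symmetric_band[0]:.2f}, {symmetric_band[1]:.2f})")
```

In this toy case the signed interval is strongly lopsided, while the symmetric band simultaneously over-covers the left tail and under-covers the right one.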
Dis_UN model and training, and lines 440-445: The performance of this model's training (and testing) is not presented; therefore, the reader cannot know whether the predictions were expected to be accurate or precise when applied. Despite being more "conservative" than the other methods, Dis_UN predictions are also uncertain. Fig. 3 compares the absolute error with the 95 % quantile of its distribution predicted by Dis_UN, but with one standard deviation (about 68 % coverage) for the other methods, which may not be the most appropriate comparison. The statement in lines 440-445 raises the question of whether this comparison is fair, or whether, if the same uncertainty coverages were compared, the differences between the uncertainty predictions would become smaller. Perhaps the 95 % interval should be calculated and compared for all methods (e.g., multiplying the Ens_UN and MCdrop_UN estimates by 1.96), or the 68 % quantile regression (one standard deviation) used for Dis_UN.
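The coverage mismatch can be illustrated with Gaussian toy residuals: a method reporting one standard deviation covers roughly 68 % of the errors, and only rescaling by 1.96 puts it on the same 95 % footing as Dis_UN (a sketch under a normality assumption, not the paper's data):

```python
import numpy as np

rng = np.random.default_rng(2)
true_err = rng.normal(0.0, 1.0, 50_000)      # toy residuals
sigma_pred = np.ones_like(true_err)          # a method reporting 1 standard deviation

# Fraction of residuals inside ±1σ vs ±1.96σ
cover_1sigma = np.mean(np.abs(true_err) <= sigma_pred)         # ~0.68
cover_95 = np.mean(np.abs(true_err) <= 1.96 * sigma_pred)      # ~0.95

print(f"±1σ coverage:    {cover_1sigma:.3f}")
print(f"±1.96σ coverage: {cover_95:.3f}")
```

Comparing a 95 % interval from one method against a 68 % interval from another thus builds in a factor of roughly two before any real difference in calibration is measured.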
Lines 431-433: Perhaps "correlation" is not the most representative term for the problem; for example, ens_UN might feature higher Pearson correlation coefficients than the other methods (Fig. 3). The coefficient of determination might not represent the achievement the authors report either; thus, another term should be used instead.
Lines 456-457: If the embedded space misses some spectral information that might be different between vegetated and non-vegetated surfaces, do these differences matter when the traits are predicted?
Lines 460-466: Grasslands are not so simple; they can include non-green elements, such as standing senescent material (e.g., Pacheco-Labrador et al., 2021) and flowers (Perrone et al., 2024), and pixels usually mix numerous species (Darvishzadeh et al., 2008), which hampers the relationships between spectral and biophysical properties. Phenology during sampling is not reported in the manuscript, but if all the datasets correspond to the green-peak period, grasses will cover most of the soil and, unlike in the shrublands, the background contribution will be minimized. Lower uncertainties might therefore result from a bias in the sampling time of the grasslands towards the green peak, if this is indeed the case (which could be confirmed). Grasslands also exhibit a significantly lower geometric BRDF component than forests and shrublands, which may explain the differences between cover types. Forests, and to some extent shrublands, present a more complex vertical profile with a distinct vegetation understory. There are arguments to justify the findings, but grasslands should not be regarded as "simple". The issue of the phenology bias is, in fact, commented on in the discussion (Section 4.3).
References:
Darvishzadeh, R., Skidmore, A., Schlerf, M., and Atzberger, C.: Inversion of a radiative transfer model for estimating vegetation LAI and chlorophyll in a heterogeneous grassland, Remote Sensing of Environment, 112, 2592-2604, https://doi.org/10.1016/j.rse.2007.12.003, 2008.
Pacheco-Labrador, J., El-Madany, T. S., van der Tol, C., Martin, M. P., Gonzalez-Cascon, R., Perez-Priego, O., Guan, J., Moreno, G., Carrara, A., Reichstein, M., and Migliavacca, M.: senSCOPE: Modeling mixed canopies combining green and brown senesced leaves. Evaluation in a Mediterranean Grassland, Remote Sensing of Environment, 257, 112352, https://doi.org/10.1016/j.rse.2021.112352, 2021.
Perrone, M., Conti, L., Galland, T., Komárek, J., Lagner, O., Torresani, M., Rossi, C., Carmona, C. P., de Bello, F., Rocchini, D., Moudrý, V., Šímová, P., Bagella, S., and Malavasi, M.: “Flower power”: How flowering affects spectral diversity metrics and their relationship with plant diversity, Ecological Informatics, 81, 102589, https://doi.org/10.1016/j.ecoinf.2024.102589, 2024.
LAI and saturation: The authors argue that the different performance of Dis_UN for LAI is due to saturation; however, in the training datasets, the maximum LAI barely reaches 6. I think this hypothesis should be more robustly explored, which may have been done but is not presented to the reader. I am not sure saturation can justify the negative coefficients presented in Table S6. Usually, LAI and other canopy-scale vegetation variables are easier to retrieve from remote sensing than leaf-level measurements, which should raise some flags.
The authors could start by checking how well the DL model performs in predicting LAI compared to the other variables; i.e., the training and test statistics of the model could be presented in the supplementary material. I would expect LAI to be more predictable than foliar traits.
If there is saturation, where does it happen? The authors could plot, for example, NDVI (or another index, e.g., NIRv) against LAI in their training datasets and explore above which LAI level the dataset saturates. Then, could they check whether Dis_UN's problems in predicting uncertainty occur above that threshold?
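Such a check might look like the following sketch, with a toy saturating NDVI–LAI curve standing in for the real training data and an illustrative sensitivity cutoff (all parameters here are assumptions):

```python
import numpy as np

# Toy saturating relationship: NDVI approaches an asymptote as LAI grows
lai = np.linspace(0, 6, 200)
ndvi = 0.9 * (1.0 - np.exp(-0.6 * lai))

# Local sensitivity dNDVI/dLAI; declare "saturation" where it falls
# below an (illustrative) cutoff
sensitivity = np.gradient(ndvi, lai)
cutoff = 0.05
lai_sat = lai[np.argmax(sensitivity < cutoff)]

print(f"NDVI becomes insensitive to LAI above ~{lai_sat:.1f}")
```

With the real data, the analogous step would be to bin the training samples by LAI, estimate the local index-vs-LAI slope per bin, and then test whether Dis_UN's uncertainty errors concentrate in the bins above the estimated saturation level.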
In the case of non-vegetated surfaces, the DL model might have learnt to predict low LAI for clouds, buildings, or soils even without having seen them. Under this hypothesis, it would also be worth checking whether the estimation uncertainty is low because the LAI prediction is, in fact, accurate. Therefore, the authors may also want to inspect the maps of predicted variables to confirm whether the predicted values are reasonable for the OOD pixels (e.g., LAI close to 0).
Uncertainty modeling: The variable-dependent capability of Dis_UN to predict uncertainty might be improved by computing the dissimilarity separately in different spectral regions (e.g., visible, red-edge, NIR, and SWIR). While I do not ask the authors to apply this approach, they might want to consider it in the Outlook section (4.5).
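As a purely illustrative sketch of this suggestion (the region boundaries and the Euclidean metric are my assumptions, not the authors' choices):

```python
import numpy as np

REGIONS = {                 # illustrative wavelength windows in nm
    "VIS": (400, 700),
    "Red-Edge": (700, 750),
    "NIR": (750, 1300),
    "SWIR": (1300, 2500),
}

def region_dissimilarities(wl, spectrum, reference):
    """Euclidean distance to a reference spectrum, computed per spectral region."""
    out = {}
    for name, (lo, hi) in REGIONS.items():
        mask = (wl >= lo) & (wl < hi)
        out[name] = float(np.linalg.norm(spectrum[mask] - reference[mask]))
    return out

wl = np.arange(400, 2500)
ref = np.full(wl.size, 0.3)
spec = ref + 0.05 * (wl >= 1300)   # deviation confined to the SWIR

d = region_dissimilarities(wl, spec, ref)
```

A single whole-spectrum distance would dilute the SWIR deviation across all bands, whereas the per-region distances isolate it, which is why region-wise dissimilarities could help for traits whose spectral signal is confined to one region.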
TECHNICAL CORRECTIONS
Supplementary materials' citations: I am unsure whether the journal requires them to be presented in the order of appearance in the main text, but doing so might make more sense and would facilitate the reader's search.
Line 245: For less familiar readers, define what a "dropout rate of 0.5" means and perhaps justify why this rate was chosen.
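For context, a dropout rate of 0.5 means each hidden unit is independently zeroed with probability 0.5 at each training step; a minimal inverted-dropout sketch (illustrative only, not the authors' implementation):

```python
import numpy as np

def dropout(activations, rate=0.5, rng=None):
    """Inverted dropout: zero each unit with probability `rate` and rescale
    the survivors by 1/(1 - rate) so the expected activation is unchanged."""
    if rng is None:
        rng = np.random.default_rng()
    keep = rng.random(activations.shape) >= rate
    return activations * keep / (1.0 - rate)

rng = np.random.default_rng(3)
h = np.ones(100_000)                     # toy layer of unit activations
h_drop = dropout(h, rate=0.5, rng=rng)

# Roughly half the units are zeroed; the mean is preserved in expectation.
print(f"fraction zeroed: {np.mean(h_drop == 0):.2f}")
print(f"mean activation: {np.mean(h_drop):.2f}")
```

Keeping dropout active at inference time and averaging repeated stochastic forward passes is what underlies the MCdrop_UN-style uncertainty estimates discussed in the manuscript.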
Line 296: Perhaps better: "For cloud delineation, we used…"
Table S2, and overall, figures and tables: Provide the units of the variables presented.
Table S4: Indicate that the ratio is expressed in percentage.
Supplementary material: Enhance the presentation and ensure that units and symbols are properly introduced.
Terminology: Review and homogenize terminology in the main text, tables, and figures. For example, in the paper and figures, the terms “Ens_UN” and “ens_UN” can be found.