Towards an ensemble-based evaluation of land surface models in light of uncertain forcings and observations
Vivek K. Arora
Christian Seiler
Libo Wang
Sian Kou-Giesbrecht
- Final revised paper (published on 06 Apr 2023)
- Preprint (discussion started on 29 Jul 2022)
Interactive discussion
Status: closed
RC1: 'Comment on egusphere-2022-641', Anonymous Referee #1, 26 Aug 2022
Review of egusphere-2022-641:
Towards an ensemble-based evaluation of land surface models in light of uncertain forcings and observations
Summary
The authors present an evaluation of their updated land model, CLASSIC. The primary update from a previous model is a new nitrogen cycle, although new land cover reference data have been implemented also. Notably, they evaluate the model using two different meteorological driving data sets, and also comparing new vs old land cover data and with/without the new nitrogen cycle. Their evaluation system effectively compares model scores to benchmark scores that are based on observational uncertainty. They conclude that their new model is reasonable compared to both observations and other models, and also that the present-day land-atmosphere CO2 flux is independent of the initial land carbon state, with respect to the variations included in this experiment.
Overall review
I appreciate the authors’ more comprehensive approach to evaluating their updated land model. The evaluation system provides clear results. However, in the end it isn’t clear that the model is better, but it appears that the GSWP3 forcing gives better results than the CRU forcing. I also think that the conclusions regarding the independence of flux from initial land state are overstated, mainly because this is a highly constrained case where the land start and end points are shifted by a similar amount and the land change trajectory is nearly identical (but shifted) between the two cases. I recommend the following main revisions (see below for additional details):
1) The different land cover cases need to be redefined. They are not different reconstructions. They just reflect an update in the present-day reference land cover data that are used to anchor the model’s land trajectory. While this is a reasonable update, it does not represent the uncertainty of land use/cover change.
2) Qualify your conclusions regarding the robustness of model fluxes under different initial carbon states. This is a very specific, highly constrained comparison and there are many factors and uncertainties, particularly in the land space, that are not considered here but have substantial impacts on carbon flux and storage estimates.
3) Complete your background and comparisons with literature on land data uncertainties. Some suggestions are below, but there is more out there showing the complexity of this problem.
4) Swap the more useful appendix figures for the unreadable paper figures. Try to make the figures more readable.
Specific comments/suggestions
Abstract
line 6:
I would not call these two land cover sets different historical reconstructions. You just replaced your current-day reference land cover with newer data. There are many other factors that affect the reconstruction, most notably the land use data and the assumptions used to apply land use to land cover.
line 12:
Awkward sentence transition. Probably do not need the beginning of this sentence; just start with “Simulated area burned…”
Introduction
lines 90-94:
There are other studies on this topic. The most relevant one is probably the following, because it addresses uncertainty in land cover in conjunction with the effects of CO2, nitrogen deposition, and climate:
A.V. Di Vittorio, J. Mao, X. Shi, L. Chini, G. Hurtt, and W.D. Collins, “Quantifying the effects of historical land cover uncertainty on global carbon and climate estimates”, Geophysical Research Letters. doi: 10.1002/2017GL075124.
This one looks at land change emissions across several land cover representations:
Peng, S., P. Ciais, F. Maignan, W. Li, J. Chang, T. Wang, and C. Yue (2017), Sensitivity of land use change emission estimates to historical land use and land cover mapping, Global Biogeochem. Cycles, 31, 626–643, doi:10.1002/2015GB005360.
But this one is also relevant:
A.V. Di Vittorio, X. Shi, B. Bond-Lamberty, K. Calvin, A. Jones, 2020, “Initial land use/cover distribution substantially affects global carbon and local temperature projections in the integrated Earth system model”, Global Biogeochemical Cycles. doi: 10.1029/2019GB006383.
CLASSIC modeling framework
lines 178-180:
Some folks may disagree here. Different types of trees have different leaf/canopy shapes, orientations, and colors that may affect interception and also radiative processes.
Driving data
line 230:
This may be true in some cases, but the later step of creating the reconstruction by applying the land use trajectory to this static cover map can generate greater uncertainty, not to mention the additional uncertainty in the land use data. See the papers above. See comment below. And also these:
Di Vittorio, A.V., L.P. Chini, B. Bond-Lamberty, J. Mao, X. Shi, J. Truesdale, A. Craig, K. Calvin, A. Jones, W.D. Collins, J. Edmonds, G.C. Hurtt, P. Thornton, and A. Thomson (2014). From land use to land cover: restoring the afforestation signal in a coupled integrated assessment - earth system model and the implications for CMIP5 RCP simulations, Biogeosciences, 11:6435-6450, 2014, doi: 10.5194/bg-11-6435-2014.
Meiyappan, P., and A. K. Jain (2012), Three distinct global estimates of historical land-cover change and land-use conversions for over 200 years, Front. Earth Sci., 6(2), 122–139, doi:10.1007/s11707-012-0314-2.
lines 266-286:
Figure 1 indicates that changing your reference cover map does not generate the greatest uncertainty. The range of vegetation across the other models is much greater than the difference you show for your data. What is the nominal year for your data sets? What about for the other models? What is actually driving the variability in these data across TRENDY? Are they all using different reference remote sensing data? Or are other factors contributing?
lines 358-387:
This section is unclear, particularly with respect to how the benchmark scores are calculated (the ones comparing the obs). You show the benchmark scores in figure 10, but it isn’t clear how these are calculated. I do like this benchmarking system, though.
line 393:
Figures A2-A16 are much more useful than Figures 3-9, which are unreadable.
Results
lines 558-559:
It is unclear how the benchmark scores are determined.
Conclusion
lines 649-651:
The key word here is “present-day flux.” The cumulative emissions over time are dependent on the land cover change trajectory, which you do not alter in these scenarios. This is also why your previous statement regarding model response being independent of initial land state makes sense here; you do not have a different transient land path that would change the outcome. Your two land covers are both estimates of “present-day” cover, and as such are not that different from each other. And because you use the same land cover backcasting, your initial state changes by a similar amount. So the flux is also constrained by similarly adjusted endpoints.
Note that there are grammatical typos throughout.
Figures and Tables
Figures 3-5:
I cannot determine which simulations are where. A couple of colors are clear, but the groups of lines are muddled together and I cannot tell which simulations are in which group. If you colored them by output group it would be easier to tell which sims have similar results. Using a temporal average may also help (without the annual values shown, which make it messy).
Figures 6-9:
These are difficult to read. Since output groupings are less apparent here, I suggest selecting colors that reflect the experiment groupings. Temporal averages may also help here (without the annual values shown).
Figures A2-A16:
Make sure the axis scales match across all panels in each figure.
Citation: https://doi.org/10.5194/egusphere-2022-641-RC1
AC1: 'Reply to Referee #1', Vivek Arora, 07 Sep 2022
We thank Referee #1 for their helpful comments. Our replies to their comments are shown in bold below.
Summary
The authors present an evaluation of their updated land model, CLASSIC. The primary update from a previous model is a new nitrogen cycle, although new land cover reference data have been implemented also. Notably, they evaluate the model using two different meteorological driving data sets, and also comparing new vs old land cover data and with/without the new nitrogen cycle. Their evaluation system effectively compares model scores to benchmark scores that are based on observational uncertainty. They conclude that their new model is reasonable compared to both observations and other models, and also that the present-day land-atmosphere CO2 flux is independent of the initial land carbon state, with respect to the variations included in this experiment.
Overall review
I appreciate the authors’ more comprehensive approach to evaluating their updated land model. The evaluation system provides clear results. However, in the end it isn’t clear that the model is better, but it appears that the GSWP3 forcing gives better results than the CRU forcing. I also think that the conclusions regarding the independence of flux from initial land state are overstated, mainly because this is a highly constrained case where the land start and end points are shifted by a similar amount and the land change trajectory is nearly identical (but shifted) between the two cases. I recommend the following main revisions (see below for additional details):
Thank you for your overall positive review of our manuscript.
1) The different land cover cases need to be redefined. They are not different reconstructions. They just reflect an update in the present-day reference land cover data that are used to anchor the model’s land trajectory. While this is a reasonable update, it does not represent the uncertainty of land use/cover change.
Thank you for noting this. Yes, it is correct that the change in crop area over the historical period is the same in both land cover data sets in our study. In this sense, we agree that it is not entirely correct to call these land cover data sets two reconstructions. The distinction here is between land use change (LUC) and land cover. We treat LUC similarly despite the differences in the two land cover cases. If we are given the opportunity to revise our manuscript, we will clarify this distinction and redefine the two land cover cases.
2) Qualify your conclusions regarding the robustness of model fluxes under different initial carbon states. This is a very specific, highly constrained comparison and there are many factors and uncertainties, particularly in the land space, that are not considered here but have substantial impacts on carbon flux and storage estimates.
The response of the terrestrial biosphere over the historical period is driven by four primary global change drivers – increasing CO2, changing climate, land use change (LUC), and N deposition. We agree that in our current framework we haven’t taken into account the uncertainty associated with LUC, and yes, there are several other uncertainties as well. However, our simulations do allow us to evaluate how the response to the three other global change drivers depends on two meteorological driving data sets, two land cover cases, and two model variations (with and without an interactive N cycle). We will clarify this point when revising our manuscript.
Please also note that our statement about little dependence on the initial land carbon state is in the context of the NET atmosphere-land CO2 flux. The reason why this happens is that the model is first spun up to equilibrium conditions and then forced with time-variant forcings. So while the absolute fluxes (gross primary productivity and respiratory fluxes) are different, the NET atmosphere-land CO2 flux is similar across simulations in that the net flux from all simulations lies within the uncertainty range from the Global Carbon Project.
3) Complete your background and comparisons with literature on land data uncertainties. Some suggestions are below, but there is more out there showing the complexity of this problem.
Thank you for pointing to the additional references that highlight the uncertainty associated with LUC emissions. These additional references will help us highlight and clarify that our framework does not account for uncertainty associated with LUC.
4) Swap the more useful appendix figures for the unreadable paper figures. Try to make the figures more readable.
We were not sure whether the figures showing the spread across the simulations or those showing the effects of land cover, meteorological data, and the inclusion or absence of the N cycle separately were more useful. We will swap the figures between the appendix and the main text.
Specific comments/suggestions
Abstract
line 6: I would not call these two land cover sets different historical reconstructions. You just replaced your current-day reference land cover with newer data. There are many other factors that affect the reconstruction, most notably the land use data and the assumptions used to apply land use to land cover.
Yes, we agree as mentioned above and we will redefine the two land cover cases.
line 12: Awkward sentence transition. Probably do not need the beginning of this sentence; just start with “Simulated area burned…”
Thanks for your suggestion.
Introduction
lines 90-94: There are other studies on this topic. The most relevant one is probably the following, because it addresses uncertainty in land cover in conjunction with the effects of CO2, nitrogen deposition, and climate:
A.V. Di Vittorio, J. Mao, X. Shi, L. Chini, G. Hurtt, and W.D. Collins, “Quantifying the effects of historical land cover uncertainty on global carbon and climate estimates”, Geophysical Research Letters. doi: 10.1002/2017GL075124.
This one looks at land change emissions across several land cover representations:
Peng, S., P. Ciais, F. Maignan, W. Li, J. Chang, T. Wang, and C. Yue (2017), Sensitivity of land use change emission estimates to historical land use and land cover mapping, Global Biogeochem. Cycles, 31, 626–643, doi:10.1002/2015GB005360.
but this one is also relevant:
A.V. Di Vittorio, X. Shi, B. Bond-Lamberty, K. Calvin, A. Jones, 2020, “Initial land use/cover distribution substantially affects global carbon and local temperature projections in the integrated Earth system model”, Global Biogeochemical Cycles. doi: 10.1029/2019GB006383.
Thank you for mentioning these references that will help us highlight and clarify that our framework does not account for uncertainty associated with LUC.
CLASSIC modeling framework
lines 178-180: Some folks may disagree here. Different types of trees have different leaf/canopy shapes, orientations, and colors that may affect interception and also radiative processes.
This statement was made in the context of current formulations used in land surface models which typically only use leaf area index (or plant area index) and a PFT-dependent parameter to calculate the storage capacity of leaves for calculating how much precipitation is intercepted. Such an approach is used in CLASSIC. In this context, the PFT-dependent parameter accounts for leaf shape and orientation but not the underlying deciduous or evergreen phenology of the leaves. We will reword our statement to clarify this.
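Schematically, this class of interception scheme takes the generic form

    W_cap = c_PFT * LAI,

where W_cap is the canopy storage capacity for intercepted water (kg m-2), LAI is the leaf (or plant) area index, and c_PFT is the PFT-dependent parameter, commonly of order 0.2 kg m-2 per unit leaf area in land surface models. This is an illustration of the general formulation, not necessarily CLASSIC's exact expression; interception during a time step is then limited by the difference between W_cap and the water already stored on the canopy.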
Driving data
line 230: This may be true in some cases, but the later step of creating the reconstruction by applying the land use trajectory to this static cover map can generate greater uncertainty, not to mention the additional uncertainty in the land use data. See the papers above. See comment below. And also these:
Di Vittorio, A.V., L.P. Chini, B. Bond-Lamberty, J. Mao, X. Shi, J. Truesdale, A. Craig, K. Calvin, A. Jones, W.D. Collins, J. Edmonds, G.C. Hurtt, P. Thornton, and A. Thomson (2014). From land use to land cover: restoring the afforestation signal in a coupled integrated assessment - earth system model and the implications for CMIP5 RCP simulations, Biogeosciences, 11:6435-6450, 2014, doi: 10.5194/bg-11-6435-2014.
Meiyappan, P., and A. K. Jain (2012), Three distinct global estimates of historical land-cover change and land-use conversions for over 200 years, Front. Earth Sci., 6(2), 122–139, doi:10.1007/s11707-012-0314-2.
We take your point. In fact, there are two sets of uncertainties here. The first is converting 20-40 land cover classes to a much smaller set of plant functional types (PFTs) that a model simulates, and as we showed in our manuscript this affects the pre-industrial state of vegetation and soil carbon (899 vs 1171 Pg C in our case) but also the magnitude of the current terrestrial sink. The second set of uncertainties is related to incorporating LUC data into a model’s land cover over the historical period (as you highlighted) which is what leads to uncertainties in LUC emissions and therefore also affects the terrestrial sink. These two sets of uncertainties affect model behaviour differently. This latter set of uncertainties is not taken into account in our framework and we will revise our manuscript to clarify this.
lines 266-286: Figure 1 indicates that changing your reference cover map does not generate the greatest uncertainty. The range of vegetation across the other models is much greater than the difference you show for your data. What is the nominal year for your data sets? What about for the other models? What is actually driving the variability in these data across TRENDY? Are they all using different reference remote sensing data? Or are other factors contributing?
Yes, while the GLC2000- and ESA-CCI-based land covers are not that different in terms of total vegetated area, the difference is large for the area of grasses, and this leads to the 899 vs 1171 Pg C soil carbon difference. This is noted in the manuscript. The data are averaged over the period 1992-2018 for CLASSIC and all TRENDY models. The reason for the variability in vegetated area, and in the areas of trees and grasses, across models is the subjectiveness of the process of mapping/reclassifying the 20-40 land cover classes in land cover products to a selected number of PFTs in land models, and the fact that land modelling groups use different land cover products. We will clarify this when revising our manuscript.
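To make this subjectiveness concrete: a reclassification crosswalk is essentially a weighted mapping from land cover classes to PFT fractions, and mosaic or sparse classes force a judgment call on how their area is split. The short Python sketch below is purely illustrative; the class names, PFT names, and split fractions are hypothetical and are not the actual GLC2000 or ESA-CCI tables used for CLASSIC.

    # Hypothetical crosswalk from land cover classes to model PFT fractions.
    # Names and split fractions are illustrative only, not the actual
    # GLC2000 or ESA-CCI reclassification tables used for CLASSIC.
    crosswalk = {
        "broadleaf_evergreen_forest": {"broadleaf_evergreen_tree": 1.0},
        "mosaic_tree_grass": {"broadleaf_deciduous_tree": 0.4, "c3_grass": 0.6},
        "sparse_vegetation": {"c3_grass": 0.3, "bare_ground": 0.7},
    }

    def to_pft_fractions(class_fractions):
        """Aggregate a grid cell's land cover class fractions into PFT fractions."""
        pfts = {}
        for lc_class, frac in class_fractions.items():
            for pft, weight in crosswalk[lc_class].items():
                pfts[pft] = pfts.get(pft, 0.0) + frac * weight
        return pfts

    # A cell that is half forest and half mosaic: the simulated tree vs grass
    # area depends directly on the subjective 0.4/0.6 mosaic split above.
    print(to_pft_fractions({"broadleaf_evergreen_forest": 0.5,
                            "mosaic_tree_grass": 0.5}))

Changing the mosaic split in such a table directly changes the simulated tree vs grass area, and hence the pre-industrial carbon state, without any change to the underlying satellite product.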
lines 358-387: This section is unclear, particularly with respect to how the benchmark scores are calculated (the ones comparing the obs). You show the benchmark scores in figure 10, but it isn’t clear how these are calculated. I do like this benchmarking system, though.
The benchmarking scores and how they are calculated are explained in the following paper.
Seiler, C., Melton, J. R., Arora, V., Sitch, S., Friedlingstein, P., Arneth, A., Goll, D. S., Jain, A., Joetzjer, E., Lienert, S., Lombardozzi, D., Luyssaert, S., Nabel, J. E. M. S., Tian, H., Vuichard, N., Walker, A. P., Yuan, W., and Zaehle, S. 2022. Are terrestrial biosphere models fit for simulating the global land carbon sink? Journal of Advances in Modeling Earth Systems, p.e2021MS002946. https://doi.org/10.1029/2021MS002946.
Short of including the whole description in the main text, we will include the details in our appendix when revising our manuscript so that a reader doesn’t have to refer to the above paper.
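In the meantime, for orientation: AMBER follows the ILAMB approach of Collier et al. (2018), in which each statistic is converted to a dimensionless score on (0, 1] through an exponential mapping of a nondimensionalized error, schematically

    S = exp(-epsilon),   e.g.  S_bias = exp(-epsilon_bias),

with epsilon_bias the absolute bias scaled by the reference data. A benchmark score is obtained by applying the same formula with one observation-based data set playing the role of the model and another playing the role of the reference, so that model scores can be judged against the level of agreement among the observations themselves. This is a schematic description; the precise definitions are those of Seiler et al. (2022).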
line 393: Figures A2-A16 are much more useful than Figures 3-9, which are unreadable.
We will swap the figures between the main text and the appendix when revising our manuscript.
Results
lines 558-559: It is unclear how the benchmark scores are determined.
An additional section in the appendix of the revised manuscript will explain the benchmarking process in more detail as mentioned above.
Conclusion
lines 649-651: The key word here is “present-day flux.” The cumulative emissions over time are dependent on the land cover change trajectory, which you do not alter in these scenarios. This is also why your previous statement regarding model response being independent of initial land state makes sense here; you do not have a different transient land path that would change the outcome. Your two land covers are both estimates of “present-day” cover, and as such are not that different from each other. And using the same land cover backcasting your initial state changes by a similar amount. So the flux is also constrained by similarly adjusted endpoints.
Since we do not take into account different LUC trajectories, it is also implied that had we not taken LUC into account at all (i.e. had the simulations been driven with only increasing CO2, changing climate, and increasing N deposition), even then the net atmosphere-land CO2 flux would have been similar across the different simulations. This suggests that the present-day net atmosphere-land CO2 flux is indeed largely independent of the pre-industrial land carbon state insofar as the response to the three other global drivers is concerned (CO2, climate, and N deposition). As mentioned earlier, we agree that the caveat related to LUC emissions has to be made clearer in our manuscript, and we note your feedback that we have overstated our conclusion related to the independence of the present-day net atmosphere-land CO2 flux. We will reword this conclusion and tone down the message.
If we were to plot the cumulative emissions from 1960 onwards so that they can be compared to estimates from the Global Carbon Project (GCP), similar to Figure 9a in the manuscript, even then all eight simulations lie within the uncertainty of the estimates from the GCP, as shown below.
Note that there are grammatical typos throughout.
Thank you for noting these. We will address all these when revising our manuscript.
Figures and Tables
Figures 3-5:
I cannot determine which simulations are where. A couple of colors are clear, but the groups of lines are muddled together and I cannot tell which simulations are in which group. If you colored them by output group it would be easier to tell which sims have similar results. Using a temporal average may also help (without the annual values shown, which make it messy).
Figures 6-9:
These are difficult to read. Since output groupings are less apparent here, I suggest selecting colors that reflect the experiment groupings. Temporal averages may also help here (without the annual values shown).
The purpose of these figures is to give the reader an idea of the spread across the eight simulations. Even if the colours are chosen wisely, some lines will overlap others for some variables. As you suggested, we will move these figures to the appendix and the appendix figures to the main text.
Figures A2-A16:
make sure the axis scales match across all panels in each figure.
Thank you for noting this. We will make the y-axis scale similar for all sub-panels of a figure.
Citation: https://doi.org/10.5194/egusphere-2022-641-AC1
RC2: 'Comment on egusphere-2022-641', Anonymous Referee #2, 02 Sep 2022
General comments:
Arora et al. evaluate land model uncertainty using an ensemble of simulations with different model structure, forcings, and observations. This type of model study is useful for understanding quantities like the land carbon sink in the context of these uncertainties. The results show that biogeophysical variables like runoff and sensible heat flux are most impacted by meteorological forcing, while biogeochemical variables like vegetation biomass are most impacted by having an interactive nitrogen cycle. This is not necessarily surprising, but useful to have summarized here. The results on net atmosphere-land CO2 flux being independent of land carbon state are interesting and could be highlighted more visually. The benchmarking is also useful, and hopefully the AMBER tool can be shared more widely.
The premise that each of the 8 simulations is “equally probable” (abstract) or “equally likely” (introduction) needs more explanation. For example, is the model simulation without carbon-nitrogen coupling as likely as the simulation with this coupling? Are the different datasets equally plausible representations given the details discussed in sections 3.1-3.2? Some discussion on these points would be helpful. While the NBP results (Figure 9) show that the simulations are all within the historical uncertainty range, it is unclear how you would know that a priori to determine which structure, forcings, and observations to sample in an ensemble like this.
There are also a large number of figures, and condensing or selecting the most salient results as main text figures would help with length and clarity. For example, do the full timeseries plots of all variables need to be included in the main text? Especially since there are clear groupings of variables that show more sensitivity to, for example, meteorological forcing vs. N cycle. I also felt that some of the ensemble mean figures (i.e., Figures A2-A16) were more interesting than the timeseries plots with all 8 simulations (i.e., Figures 3-9) because they nicely summarize the strongest effects for different variables. In general, the figure organization, number of figures, and placement of figures in the main text vs. the appendix could be improved to highlight the most interesting results.
In general, the presentation quality needs improvement to better communicate the results before I can recommend this paper for publication. I have highlighted some areas in the specific comments below.
Specific comments:
Line 33 and following: Some useful references to include here would be:
- Fisher and Koven 2020, https://doi.org/10.1029/2018MS001453
- Kyker-Snowman et al. 2022, https://doi.org/10.1111/gcb.15894
- Bonan and Doney 2018, https://doi.org/10.1126/science.aam8328
Lines 110-111: Is AMBER available to the community? Here it is listed as “open-source”, but following the link in Seiler et al. (2021b) leads to a dead end: https://cran.r-project.org/web/packages/amber/index.html. Suggest adding an updated link for AMBER to the Code/data availability section in this manuscript.
Lines 177-180: There are other physical processes that could benefit from using a larger number of PFTs, for example, sensible and latent heat flux calculations.
Section 3.1: Here I started to get a little confused with specific land cover datasets. Line 219 states “two observation-based data sets are used”, a remotely-sensed product (assuming that is ESA CCI) and the “LUH product as part of TRENDY”. The next paragraph describes the process of generating land cover data with the “older” GLC 2000 product, with some information from LUH. Figure 1 compares these three datasets and a fourth one which is based on ESA CCI. Then line 337 (section 3.4) states that the two land cover reconstructions used in the model simulations are GLC 2000 and ESA CCI. Please clarify in the text which datasets are used for land cover in this study and how they are used.
Line 255: What land cover data is used for years prior to 1992 in the case of the simulations with ESA CCI?
Lines 349-351: Do the different end dates for the simulations with different meteorological forcings affect the analysis?
Lines 367-368: Some justification for the doubled weighting of S_rmse would be helpful here, even if a brief sentence/reference. One could argue the other scores also have “importance”.
Lines 450-452: This sentence should be rephrased/expanded on since it doesn’t add much on its own. Or it could be removed, as Table 3 summarizes the differences in cv values.
Line 529: Curious why the simulations show a land carbon source in the 1930s? Is that realistic?
Line 571-573: Is there a reason to include the SG250m dataset here since the model compares better with HWSD?
Line 574 and following: More discussion of Figure 11 is needed – there are a lot of model/data comparisons here that are summarized very briefly as “the model is overall able to capture the latitudinal distribution of most land surface quantities”. For example, the aboveground biomass observations are very different from each other, and different from the model spread.
Line 599: The fact that the interactive N cycle degrades model performance for certain variables is an interesting result that merits some discussion. Some readers may be surprised that something that is essentially a model improvement for more realistic process representation doesn’t necessarily improve performance.
Lines 601-602: Thanks for including the full AMBER results. This sentence and link should probably be moved (or repeated) in the Code/data availability section.
Conclusions section: The first two paragraphs of this section could benefit from linking back to specific results in this study with the results placed in the context of other studies (as is done in the third paragraph). Especially for the second paragraph, since model tuning was not covered in detail in the introduction.
Line 642 and following: Curious why the effect of the interactive N cycle is discussed here but the other factors are not?
Code/data availability: There are no references to the code/data used in this manuscript (e.g., the simulation output, how to access the observational datasets, or the code used to generate the analysis and figures.)
Table 2: Suggest also grouping by variables, so it is easy for the reader to see which variables have multiple globally gridded and/or in situ sources. This relates to the calculation of benchmark scores in Lines 383-385 where “at least two sets of observation-based data for a given quantity” are needed.
Table 3: Suggest adding the dominant source(s) of spread for each variable (e.g., met forcing, land cover data, N cycle) to summarize that information across variables. Figure 12 does some of this, but it could be improved for presentation quality as noted below. In addition, there are 14 variables listed in this table, while Figure 12 includes 16 variables and the text mentions 19 variables used for benchmarking and calculating scores. Is there a reason for these differences?
Figure 3 (and following analogous figures): Suggest specifying the exact years shown in these timeseries plots. I believe the end years are different for the different meteorological forcings (e.g., 2016 vs. 2019) but it is difficult to see because the lines are very small. Also, please describe in the figure caption what the numbers are in the upper right part of the plot. What is the difference between the bold colored lines and the lighter/less bold lines?
Figures 3-5 (and A2-A5): The data in these figures for 1701-1900 is very repetitive, given the fact that these years use the meteorological forcing from 1901-1925 repeated. The timeseries plots could be shortened to show only 1900 onwards to focus on the most interesting parts of the historical timeseries.
Figure 9: Please add something about the TRENDY models / grey boxes in panel a) to the caption here.
Figure 10: Some additional explanation (in figure caption or text) on how the horizontal and vertical whiskers were calculated would be helpful.
Figure 11: The colors are confusing here. The caption says the model mean is in “dark purple” but it looks more like magenta/purple-red and the dark purple line looks like it is showing an observational dataset (e.g., GEOCARBON in panel a)). Line 577 lists additional colors in regard to this figure. Suggest adding dashes to the observational lines to better distinguish from the model and avoid relying on interpreting specific color choices. Please also explain the box plots in the figure caption.
Figure 12: This figure is a helpful summary of the results, but it needs improvement for presentation quality. For example, the “GLC2000-ESACCI”, etc. does not need to be repeated down the column and may not even need to be included since the orange shading denotes that section as “Effect of land cover”. The average scores could be incorporated visually to “provide context” (which was somewhat unclear from the figure caption). Please explain the error bars in the figure caption.
Technical corrections:
Line 87: Community Land Model should be capitalized.
Line 93: Tian et al. (2004) used CLM2 coupled to CAM2, whereas Lawrence and Chase (2007) used CLM3 within CCSM3. Suggest rephrasing this to clarify.
Line 107: Should be “simulated” instead of “simulate”.
Lines 353-354: Missing units for grid size (km?).
Line 407: Here the supplemental figures started being referred to as Figure SX instead of Figure AX – please adjust to match.
Lines 518-520: Move “or the net atmosphere-land CO2 flux” up to the first mention of NBP/Figure 9 since Figure 9 uses the latter term and not NBP.
Line 546: Should be “differences” instead of “difference”.
Lines 622-624: Check figure references here, I believe both are incorrect.
Line 624: Should be “20 years” instead of “20-year”.
Citation: https://doi.org/10.5194/egusphere-2022-641-RC2
AC2: 'Reply to Referee #2', Vivek Arora, 14 Sep 2022
Reviewer #2
We thank Referee #2 for their helpful comments. Our replies to their comments are shown in bold below.
Arora et al. evaluate land model uncertainty using an ensemble of simulations with different model structure, forcings, and observations. This type of model study is useful for understanding quantities like the land carbon sink in the context of these uncertainties. The results show that biogeophysical variables like runoff and sensible heat flux are most impacted by meteorological forcing, while biogeochemical variables like vegetation biomass are most impacted by having an interactive nitrogen cycle. This is not necessarily surprising, but useful to have summarized here. The results on net atmosphere-land CO2 flux being independent of land carbon state are interesting and could be highlighted more visually. The benchmarking is also useful, and hopefully the AMBER tool can be shared more widely.
Thank you for your overall positive feedback. Referee #1, in contrast, suggested that the conclusion that the net atmosphere-land CO2 flux is largely independent of the land carbon state is overstated. While our framework samples land cover uncertainty, it does not take into account land use change (LUC) uncertainty (as noted by referee #1). We would therefore like to clarify this caveat to our conclusions and, erring on the side of caution, not highlight this conclusion more visually.
The premise that each of the 8 simulations is “equally probable” (abstract) or “equally likely” (introduction) needs more explanation. For example, is the model simulation without carbon-nitrogen coupling as likely as the simulation with this coupling? Are the different datasets equally plausible representations given the details discussed in sections 3.1-3.2? Some discussion on these points would be helpful. While the NBP results (Figure 9) show that the simulations are all within the historical uncertainty range, it is unclear how you would know that a priori to determine which structure, forcings, and observations to sample in an ensemble like this.
If given the opportunity to revise our manuscript, we will expand on this. This is indeed the conundrum: it is not known a priori which model structure, forcing, and observations to sample. It is difficult to conclude which meteorological driving data are more reliable, which land cover is more realistic (as the large spread in the vegetated, tree, and grass areas from the TRENDY models illustrates), and which model version is indeed better. Hence the conclusion that an ensemble-based approach allows a more robust evaluation of a model.
In our case, some general conclusions can be made. For example, we have more confidence in the ESA-CCI based land cover than in the GLC2000 based land cover because the reclassification of the 37 ESA-CCI land cover classes to CLASSIC’s nine plant functional types has been more thoroughly vetted against high-resolution land cover data than that of the GLC2000. We will, however, remove the phrase “equally probable”, since it is difficult to defend, and include discussion around this when revising our manuscript.
There are also a large number of figures, and condensing or selecting the most salient results as main text figures would help with length and clarity. For example, do the full time-series plots of all variables need to be included in the main text? Especially since there are clear groupings of variables that show more sensitivity to, for example, meteorological forcing vs. N cycle. I also felt that some of the ensemble mean figures (i.e., Figures A2-A16) were more interesting than the time-series plots with all 8 simulations (i.e., Figures 3-9) because they nicely summarize the strongest effects for different variables. In general, the figure organization, number of figures, and placement of figures in the main text vs. the appendix could be improved to highlight the most interesting results.
Swapping figures in the appendix with those in the main text was also suggested by referee #1. We will swap these figures, condense the total number of figures by combining zonal and time series plots for a given variable, and drop some less interesting variables.
In general, the presentation quality needs improvement to better communicate the results before I can recommend this paper for publication. I have highlighted some areas in the specific comments below.
Specific comments:
Line 33 and following: Some useful references to include here would be:
Fisher and Koven 2020, https://doi.org/10.1029/2018MS001453
Kyker-Snowman et al. 2022, https://doi.org/10.1111/gcb.15894
Bonan and Doney 2018, https://doi.org/10.1126/science.aam8328
Thank you for pointing out these additional references, which we will include.
Lines 110-111: Is AMBER available to the community? Here it is listed as “open-source”, but following the link in Seiler et al. (2021b) leads to a dead end: https://cran.r-project.org/web/packages/amber/index.html. Suggest adding an updated link for AMBER to the Code/data availability section in this manuscript.
We have now replaced the CRAN link with the following link using Zenodo and will include this when revising our manuscript.
https://doi.org/10.5281/zenodo.5670387
The link provides the source code as well as the scripts required for reproducing the computational environment, which takes care of all dependencies with other R-packages.
Lines 177-180: There are other physical processes that could benefit from using a larger number of PFTs, for example, sensible and latent heat flux calculations.
Yes, in theory, a larger number of PFTs should allow the modelling of PFT-dependent processes more realistically. However, for CLASSIC (and for other LSMs too) latent heat flux is primarily a function of available energy and precipitation. In CLASSIC, large changes in leaf area index (LAI) do not change the total latent heat flux considerably, since the partitioning of evapotranspiration into its sub-components (transpiration, soil evaporation, and evaporation/sublimation of intercepted rain/snow) changes. For example, a decrease in transpiration and in evaporation of intercepted precipitation, due to a decrease in LAI, is compensated by an increase in soil evaporation. As such, biogeochemical processes benefit more than physical processes, in terms of realism, when the number of PFTs is increased. We will add this clarification around these sentences when revising our manuscript.
Section 3.1: Here I started to get a little confused with specific land cover datasets. Line 219 states “two observation-based data sets are used”, a remotely-sensed product (assuming that is ESA CCI) and the “LUH product as part of TRENDY”. The next paragraph describes the process of generating land cover data with the “older” GLC 2000 product, with some information from LUH. Figure 1 compares these three datasets and a fourth one which is based on ESA CCI. Then line 337 (section 3.4) states that the two land cover reconstructions used in the model simulations are GLC 2000 and ESA CCI. Please clarify in the text which datasets are used for land cover in this study and how they are used.
The methodology for generating land cover from 1700 to the present day requires a remotely sensed snapshot of present-day land cover, e.g. the GLC 2000 (which represents the year 2000) or ESA CCI (1992-2018) land cover product, whose 20-40 land cover classes need to be remapped/reclassified to the model’s nine PFTs, and an estimate of the change in crop area over the historical period (1700-2018) from LUH. These are the three data sets used in our study. Figure 1 in the manuscript shows the model’s total vegetated, tree, and grass areas when using the GLC 2000 (blue line) and ESA CCI (red line) land cover products, compared against Li et al. (2018) (dotted black line) and other TRENDY models (grey lines). We can see how this can be confusing and will clarify this when revising our manuscript.
Line 255: What land cover data is used for years prior to 1992 in the case of the simulations with ESA CCI?
As mentioned above, the 1992-2018 ESA CCI land cover provides a snapshot of the present-day land cover. Although there is some interannual variability between these years, overall the total vegetated area does not change much. We chose the year 1992 from these data, reclassified its 37 land cover classes to CLASSIC’s nine PFTs, replaced the crop PFTs for the present day with those from the LUH product, adjusted the natural PFTs accordingly, and then went back in time to 1700 using crop area from the LUH data. This yields a reconstruction of the historical land cover with CLASSIC’s nine PFTs, based on the ESA CCI land cover, with crop area changing over the historical period according to the LUH product. We will clarify this when revising our manuscript.
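A minimal sketch of this backcasting logic, in Python, is given below. It assumes, for illustration only (the actual CLASSIC procedure may differ in detail), that crop area released when going back in time is returned to the natural PFTs in proportion to their shares:

    import numpy as np

    def backcast_land_cover(pft_frac, crop_idx, luh_crop_by_year):
        """Walk a grid cell's PFT fractions back in time, following LUH crop area.

        pft_frac: present-day PFT fractions (1D array summing to 1).
        crop_idx: indices of the crop PFTs.
        luh_crop_by_year: {year: total crop fraction}, ordered from the
        present day back to 1700.
        """
        natural_idx = np.array([i for i in range(pft_frac.size)
                                if i not in crop_idx])
        frac = pft_frac.copy()
        history = {}
        for year, crop_target in luh_crop_by_year.items():
            released = frac[crop_idx].sum() - crop_target
            # shrink crop PFTs to the LUH target for this year ...
            if frac[crop_idx].sum() > 0:
                frac[crop_idx] *= crop_target / frac[crop_idx].sum()
            # ... and hand the released area back to the natural PFTs
            weights = frac[natural_idx] / frac[natural_idx].sum()
            frac[natural_idx] += released * weights
            history[year] = frac.copy()
        return history

Run from the present day back to 1700, this conserves total area, makes crop area follow the LUH trajectory, and keeps the natural PFT mix anchored to the chosen present-day snapshot.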
Lines 349-351: Do the different end dates for the simulations with different meteorological forcings affect the analysis?
No, they do not. However, we will make the time period the same for a consistent comparison.
Lines 367-368: Some justification for the doubled weighting of S_rmse would be helpful here, even if a brief sentence/reference. One could argue the other scores also have “importance”.
We agree that the decision to give twice as much weight to S_rmse is somewhat subjective. This follows Collier et al. (2018), but we will make a note of the subjectivity of this decision.
Collier, N. et al. 2018. “The International Land Model Benchmarking (ILAMB) System: Design, Theory, and Implementation.” Journal of Advances in Modeling Earth Systems 10 (11): 2731–54.
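For reference, the overall score in Collier et al. (2018) combines the individual scores as a weighted mean,

    S_overall = (S_bias + 2 S_rmse + S_phase + S_iav + S_dist) / (1 + 2 + 1 + 1 + 1),

so the doubled weight on S_rmse is a convention inherited from ILAMB rather than a derived result, and we will note this explicitly.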
Lines 450-452: This sentence should be rephrased/expanded on since it doesn’t add much on its own. Or it could be removed, as Table 3 summarizes the differences in cv values.
We will remove this sentence when revising our manuscript.
Line 529: Curious why the simulations show a land carbon source in the 1930s? Is that realistic?
In Figure 9a, before about 1970 the model simulates both a land carbon sink and source in response to interannual variability in the meteorological data.
Line 571-573: Is there a reason to include the SG250m dataset here since the model compares better with HWSD?
There are multiple observations available for certain variables. There are times when it is obvious which observation-based data set is better or more appropriate, but there are times when it is not. In the case of soil carbon, since CLASSIC and most land models do not include processes to represent peatland and permafrost carbon at high latitudes, it is clear that the HWSD data set is more appropriate for comparison with the model output. The idea behind using both data sets is to illustrate this concept, and we will clarify this.
Line 574 and following: More discussion of Figure 11 is needed – there are a lot of model/data comparisons here that are summarized very briefly as “the model is overall able to capture the latitudinal distribution of most land surface quantities”. For example, the aboveground biomass observations are very different from each other, and different from the model spread.
Thank you for pointing this out. We will discuss all panels of Figure 11 in the revised manuscript. In the context of aboveground biomass, the GEOCARBON data set uses two products, one for the extratropics and the other for the tropics to create a global aboveground biomass product. The Zhang product is based on 10 biomass maps. Both products are described in detail in section 2.3.3. of the following paper. We will include this information when revising our manuscript.
Seiler, C., et al. (2022) Are terrestrial biosphere models fit for simulating the global land carbon sink? Journal of Advances in Modeling Earth Systems, p.e2021MS002946. https://doi.org/10.1029/2021MS002946.
Line 599: The fact that the interactive N cycle degrades model performance for certain variables is an interesting result that merits some discussion. Some readers may be surprised that something that is essentially a model improvement for more realistic process representation doesn’t necessarily improve performance.
Yes, we will include a discussion about why the inclusion of the N cycle degrades the model performance for some variables. The inclusion of the N cycle changes the maximum photosynthetic rate (Vcmax) to a prognostic variable for each PFT, as opposed to being specified based on observations. This is analogous to running an atmospheric model with specified sea surface temperatures (SSTs) and sea ice concentrations (SICs) as opposed to using a full 3D ocean. Using a dynamic ocean allows future projections (since future SSTs and SICs are not known) but invariably degrades a model’s performance for the present day, since simulated SSTs and SICs have their own biases. Similarly, using an interactive N cycle allows future changes in Vcmax to be projected (based on changes in N availability) but also degrades CLASSIC’s performance for the present day, since simulated Vcmax has its own biases.
Lines 601-602: Thanks for including the full AMBER results. This sentence and link should probably be moved (or repeated) in the Code/data availability section.
We will move the AMBER link to the Code/data availability section.
Conclusions section: The first two paragraphs of this section could benefit from linking back to specific results in this study with the results placed in the context of other studies (as is done in the third paragraph). Especially for the second paragraph, since model tuning was not covered in detail in the introduction.
Thank you for this suggestion. We will cover model tuning early on in the manuscript and link it to the second paragraph of the conclusions.
Line 642 and following: Curious why the effect of the interactive N cycle is discussed here but the other factors are not?
The effect of the interactive N cycle is discussed explicitly to outline model limitations and room for improvement when the interactive N cycle is switched on. We will also include a discussion of meteorological forcings and land cover when revising our manuscript.
Code/data availability: There are no references to the code/data used in this manuscript (e.g., the simulation output, how to access the observational datasets, or the code used to generate the analysis and figures.)
The model code and documentation are available at https://cccma.gitlab.io/classic_pages/ and this is mentioned in the manuscript. Observation-based data sets are available from their respective sources. If providing model output and the scripts used for analysis is a requirement for Biogeosciences, we will upload these to Zenodo and provide a DOI.
Table 2: Suggest also grouping by variables, so it is easy for the reader to see which variables have multiple globally gridded and/or in situ sources. This relates to the calculation of benchmark scores in Lines 383-385 where “at least two sets of observation-based data for a given quantity” are needed.
Thanks for your suggestion. We will group by variable and reorganize Table 2 according to whether the data are globally gridded and/or in situ, or make note of this in an additional column.Table 3: Suggest adding the dominant source(s) of spread for each variable (e.g., met forcing, land cover data, N cycle) to summarize that information across variables. Figure 12 does some of this, but it could be improved for presentation quality as noted below. In addition, there are 14 variables listed in this table, while Figure 12 includes 16 variables and the text mentions 19 variables used for benchmarking and calculating scores. Is there a reason for these differences?
Indicating the dominant source of spread for each variable is a good suggestion. The reason for the different numbers of variables in Table 2 and Figure 12 is that the variables are bundled slightly differently. In Table 2 heterotrophic and autotrophic respiration are evaluated separately but in Figure 12 they are bundled in ecosystem respiration for comparison with observations. In addition, some variables are repeated in Figure 12 (e.g. ecosystem respiration and net ecosystem exchange) because their scores are statistically different when evaluating the effect of land cover, meteorological forcing, and an interactive N cycle. We will make note of this when revising our manuscript. The full list of 19 variables also includes the albedo and leaf area index. We did not include albedo since it shows very little variability across the eight simulations (cv=0.007) and the leaf area index is somewhat similar to vegetation biomass. We will add these two variables to Table 2 for clarity and completeness. In the current manuscript Figures A18 and 11 do compare zonally-averaged values of albedo and leaf area index, respectively, with observations. We will clarify this when revising our manuscript and when discussing Figure 11 in more detail.
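(For clarity: here and in Table 3, cv denotes the coefficient of variation across the eight simulations, cv = σ/μ, i.e. the ensemble standard deviation divided by the ensemble mean; cv = 0.007 for albedo therefore corresponds to a spread of less than 1% of the mean value.)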
Figure 3 (and following analogous figures): Suggest specifying the exact years shown in these timeseries plots. I believe the end years are different for the different metrological forcings (e.g., 2016 vs. 2019) but it is difficult to see because the lines are very small. Also, please describe in the figure caption what the numbers are in the upper right part of the plot. What is the difference between the bold colored lines and the lighter/less bold lines?
We will make the time periods consistent for comparison for GSWP3 and CRU-JRA driven runs, and make it explicitly clear that the GSWP3 and CRU-JRA end in different years in the figure caption. We will also mention in the figure caption that the thin lines are individual years and the thick line is their 10-year running mean. We will also clarify the numbers in the upper right part of the plot (which are the mean from 1700-1720, the mean over the last 20 years of the historical period, and the difference between these values).
Figures 3-5 (and A2-A5): The data in these figures for 1701-1900 is very repetitive, given the fact that these years use the meteorological forcing from 1901-1925 repeated. The timeseries plots could be shortened to show only 1900 onwards to focus on the most interesting parts of the historical timeseries.
Yes, we agree to make this change.
Figure 9: Please add something about the TRENDY models / grey boxes in panel a) to the caption here.
We will clarify in the figure caption that the grey boxes are the estimates based on the Global Carbon Project.
Figure 10: Some additional explanation (in figure caption or text) on how the horizontal and vertical whiskers were calculated would be helpful.
The vertical whiskers show the range of the eight model scores when a given variable from all eight model simulations is compared to an observation-based data set. The horizontal whiskers show the range when three or more observation-based data sets are compared to each other. When only two observation-based data sets are compared to each other there is only one benchmark score, and therefore no range. We will clarify this when revising our manuscript.
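As a concrete illustration of the counting involved: assuming each unordered pair of observation-based data sets yields one benchmark score, n data sets yield n(n-1)/2 scores, so two data sets give a single score and no range, while three data sets give three scores whose minimum and maximum define the horizontal whisker.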
Figure 11: The colors are confusing here. The caption says the model mean is in “dark purple” but it looks more like magenta/purple-red and the dark purple line looks like it is showing an observational dataset (e.g., GEOCARBON in panel a)). Line 577 lists additional colors in regard to this figure. Suggest adding dashes to the observational lines to better distinguish from the model and avoid relying on interpreting specific color choices. Please also explain the box plots in the figure caption.
We will modify these figures to make the colours more obvious and/or use lines with different patterns.
Figure 12: This figure is a helpful summary of the results, but it needs improvement for presentation quality. For example, the “GLC2000-ESACCI”, etc. does not need to be repeated down the column and may not even need to be included since the orange shading denotes that section as “Effect of land cover”. The average scores could be incorporated visually to “provide context” (which was somewhat unclear from the figure caption). Please explain the error bars in the figure caption.
We will modify Figure 12 to remove “GLC2000-ESACCI” and other similar wordings. We will also think about how best to incorporate the average scores visually in the figure. The error bars denote the 95% confidence interval; we will add this information to the figure caption.
Technical corrections:
Line 87: Community Land Model should be capitalized.
Line 93: Tian et al. (2004) used CLM2 coupled to CAM2, whereas Lawrence and Chase (2007) used CLM3 within CCSM3. Suggest rephrasing this to clarify.
Line 107: Should be “simulated” instead of “simulate”.
Lines 353-354: Missing units for grid size (km?).
Line 407: Here the supplemental figures started being referred to as Figure SX instead of Figure AX – please adjust to match.
Lines 518-520: Move “or the net atmosphere-land CO2 flux” up to the first mention of NBP/Figure 9 since Figure 9 uses the latter term and not NBP.
Line 546: Should be “differences” instead of “difference”.
Lines 622-624: Check figure references here, I believe both are incorrect.
Line 624: Should be “20 years” instead of “20-year”.
Thank you for noting these minor corrections. We will incorporate these when revising our manuscript.
Citation: https://doi.org/10.5194/egusphere-2022-641-AC2