Evaluation of gradient boosting and random forest methods to model subdaily variability of the atmosphere–forest CO2 exchange
- 1Weather and Climate Change Impact Research, Finnish Meteorological Institute, Helsinki, Finland
- 2Institute for Atmospheric and Earth System Research/Forest Sciences, Faculty of Agriculture and Forestry, University of Helsinki, Helsinki, Finland
- 3Institute for Atmospheric and Earth System Research/Physics, Faculty of Science, University of Helsinki, Helsinki, Finland
- 4Climate System Research, Finnish Meteorological Institute, Helsinki, Finland
Abstract. Accurate estimates of the net ecosystem CO2 exchange (NEE) would improve the understanding of natural carbon sources and sinks and their role in the regulation of global atmospheric carbon. In this work, we use and compare the random forest (RF) and gradient boosting (GB) machine learning (ML) methods for predicting year-round 6-hourly NEE over 1996–2018 in a pine-dominated boreal forest in southern Finland, and we analyze the predictability of NEE. Additionally, the predictions were aggregated to weekly NEE values to assess the longer-term behavior of the methods. Meteorological ERA5 reanalysis variables were used as predictors. Spatial and temporal neighborhood (predictor lagging) was used to provide the models with more data to learn from, which improved their accuracy compared to using only the nearest grid cell and time step. Both ML methods can explain the temporal variability of NEE at the observational site of this study with the meteorological predictors, but the GB method was more accurate. It separated the important predictors from the unimportant ones more effectively and showed no signs of overfitting despite many redundant variables. The accuracy of GB (RF), here measured mainly with the cross-validated Pearson correlation coefficient between the modeled and observed NEE, was high (good), reaching a best-estimate value of 0.96 (0.94) with a root mean square error of 1.18 µmol m⁻² s⁻¹ (1.35 µmol m⁻² s⁻¹). Due to its better performance, we recommend using GB instead of RF for modeling ecosystem CO2 fluxes.
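For illustration, a minimal sketch of the RF-versus-GB comparison workflow described above, using synthetic stand-in data (scikit-learn and xgboost assumed; the data, features, and hyperparameters here are placeholders, and the study itself cross-validates over years rather than shuffled folds):

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_predict
from xgboost import XGBRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 10))                                 # stand-in for lagged ERA5 predictors
y = 2 * X[:, 0] + np.sin(X[:, 1]) + rng.normal(0.0, 0.5, 2000)  # stand-in for 6-hourly NEE

cv = KFold(n_splits=5, shuffle=True, random_state=0)
for name, model in [("RF", RandomForestRegressor(n_estimators=300, random_state=0)),
                    ("GB", XGBRegressor(n_estimators=300, learning_rate=0.05))]:
    pred = cross_val_predict(model, X, y, cv=cv)     # out-of-fold predictions
    print(f"{name}: r = {pearsonr(y, pred)[0]:.3f}, "
          f"RMSE = {np.sqrt(np.mean((y - pred) ** 2)):.3f}")
```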
Matti Kämäräinen et al.
Status: final response (author comments only)
RC1: 'Comment on bg-2022-108', Anonymous Referee #1, 02 Jun 2022
This study by Kämäräinen et al. compares two different machine learning algorithms for the prediction of CO2 net ecosystem exchange of a boreal forest using ERA5 reanalysis data. Getting accurate estimates of CO2 exchange outside of spatiotemporal domains covered by eddy covariance measurements is an important task and hence a relevant topic for Biogeosciences. The analysis of spatial and temporal neighborhood predictors and the emulation of less complete time series are interesting concepts and the study is overall well-written and structured. However, I have some general and specific concerns and questions listed below that should be addressed in a round of major revisions before publication.
General comments:
Introduction/Discussion: I think a more complete consideration of state-of-the-art literature could better present the novelty of this study, which is not clear in the current form, and that literature should also be considered more when discussing the results. For example, please show what improvements can be expected from gradient boosting in view of previous research, e.g. Tramontana et al. 2016 (https://doi.org/10.5194/bg-13-4291-2016), who already compared various ML-algorithms for NEE and GPP prediction. Please also cite literature regarding spatiotemporal neighborhood predictors, if there is any.
As this study seems to be conducted in view of the general goal of estimating carbon fluxes for points in time and especially in space without direct flux observations, it would be more interesting to test the models also with spatially independent data, i.e., for a comparable boreal EC station (from Fluxnet for example) fully excluded from model training. Otherwise, it should be pointed out more clearly that the predictive error likely is much higher when the models are applied to new locations, see e.g., Roberts et al., 2017 https://doi.org/10.1111/ecog.02881
While I see that ML-models do not require causal relations, I’m still not convinced by the inclusion of negatively lagged, i.e. future, meteo-variables. Did the exclusion of negatively lagged variables actually deteriorate model accuracy or are they just redundant with spurious correlations? In any case, the explanation regarding advection requires references as a theoretical basis supporting it. To me, it makes sense only for grid cells downwind from cells representing the station well (e.g. the central cell). However, I still don’t see what additional information can be gained as all the “advected” information already is contained in the non-time-lagged data of the more representative grid cells. Furthermore, it makes no sense for all meteo-variables, e.g., radiation and soil temperature, two of the most important variables, certainly are not advected directly. Hence, I think this explanation requires a more profound basis, i.e., by analyzing the importance of negatively lagged variables by grid cell in relation to wind direction, and by meteo variable. Otherwise, negatively lagged variables should rather be excluded from the analysis in my opinion.
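For reference, a minimal sketch of how such lagged copies of predictors are typically constructed (hypothetical helper, pandas assumed); negative lags are the future values the referee questions:

```python
import pandas as pd

def add_lags(df, columns, lags=(-2, -1, 0, 1, 2)):
    """Shifted copies of each predictor: a positive lag places a past value
    at time t, a negative lag a future value (the "negatively lagged" inputs
    discussed above)."""
    out = {}
    for col in columns:
        for lag in lags:
            out[f"{col}_lag{lag:+d}"] = df[col].shift(lag)
    return pd.DataFrame(out, index=df.index)
```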
The Pearson correlation coefficient is insensitive to magnitude, so it does not indicate how accurate the predicted values are and hence is not very meaningful for model evaluation. I recommend focusing on R² instead.
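The referee's point can be made concrete with a two-line example: a prediction that is perfectly correlated with the truth but wrong in magnitude and offset still scores r = 1 (numpy, scipy, and scikit-learn assumed):

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import r2_score

y_true = np.arange(10, dtype=float)
y_pred = 2 * y_true + 5             # right temporal pattern, wrong magnitude and offset
print(pearsonr(y_true, y_pred)[0])  # 1.0 -- a "perfect" correlation
print(r2_score(y_true, y_pred))     # about -10.9 -- a very poor fit
```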
Specific comments:
L24-25: This recommendation is too general, as the models have been evaluated just for one specific ecosystem.
L45-48: Please add a reference here.
L65: Is the reference to kaggle really necessary? This rather comes across as an advertisement for a company, so please remove it.
L80-81: Please make clear that EddyUH (and REddyProc?) processing was not done within this study but before data was acquired.
L81-83: Rather explain NEE when the term is first introduced in Section 1.
L85: What constitutes a missing value? Were flux data filtered according to a certain quality control strategy, e.g. a test on stationarity, well-developed turbulence, footprint etc.?
L90 (Fig. 1): Why is 1998 written below Jan? The title seems superfluous, rather add NEE to the y-axis.
L95-96: Were missing values gap-filled or just omitted for the calculation of the weekly means? The latter would likely introduce a bias towards more negative NEE values as likely more nighttime data are missing compared to daytime data. This could at least be mentioned.
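A toy numerical example of the bias described here, with made-up uptake and respiration values: dropping half of the nighttime (positive-NEE) values pulls the weekly mean toward stronger apparent uptake.

```python
import numpy as np

# One synthetic "week": equal amounts of daytime uptake and nighttime respiration.
day = np.full(14, -5.0)    # daytime NEE (uptake), µmol m-2 s-1
night = np.full(14, 3.0)   # nighttime NEE (respiration)

complete = np.concatenate([day, night]).mean()   # -1.0 with all data present
gappy = np.concatenate([day, night[:7]]).mean()  # about -2.33 with half the nights missing
print(complete, gappy)
```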
L107: Are you sure 1° is the spatial resolution? I think it’s 0.1°, otherwise it would be really coarse.
L110: Some of the abbreviations appear quite bulky; rather use more common ones like H, LE, and rH. Also, is diffuse or total shortwave radiation not available in the ERA5 product?
L123-127 (Fig. 2): in the caption, temporal lags from -2 to +2 are stated, though in the figure lags from +3 to -1 are visualized. Please also make the numbering uniform between Fig. 2 and A2, i.e., the central cell should be number 13 in both figures, and so on. As some of the most important grid cells are outside the inner circle (10, 11, 21), I think it is necessary and less confusing to show them all.
L129-135: Were all 23752 operations carried out in the end (as PCA was not used) despite technically being too laborious? Please clarify.
L165-168: Please add a reference.
L171-172: How many data points were included in each of the 10³ bootstrap samples?
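For context, in the standard non-parametric bootstrap each resample is conventionally the same size as the original data, drawn with replacement (whether the authors did so is what the question asks them to clarify); a generic sketch with a hypothetical helper:

```python
import numpy as np

def bootstrap_ci(metric, y_true, y_pred, n_boot=1000, seed=0):
    """95% CI for an accuracy metric; each resample has the same size as
    the original data and is drawn with replacement."""
    rng = np.random.default_rng(seed)
    n = len(y_true)
    stats = [metric(y_true[idx], y_pred[idx])
             for idx in (rng.integers(0, n, size=n) for _ in range(n_boot))]
    return np.percentile(stats, [2.5, 97.5])

rmse = lambda a, b: np.sqrt(np.mean((a - b) ** 2))
y_true = np.random.default_rng(1).normal(size=500)               # stand-in observations
y_pred = y_true + np.random.default_rng(2).normal(0, 0.5, 500)   # stand-in predictions
print(bootstrap_ci(rmse, y_true, y_pred))
```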
L201-202: Are 00 and 06 UTC the start or the end of the averaging period? (also relevant for Fig. 4)
L242: Rather write “cope better” as this is not a yes-no question. To evaluate which one copes better, wouldn’t it also be necessary to compare the decrease in prediction skill of each model to its own 100% CORR and RMSE values?
L258: Are highly correlated variables an issue for gain? I know they induce a bias for permutation importance, so can this be excluded for gain?
L266: direct or total SW radiation?
L269 (& L312-313): I think this could be worth some more detailed analysis. To what percentage was (pine) forest the dominant land cover in each grid cell? Does this correlate with the importance statistics? Is there any spatial pattern? (Figure A2 could be visualized as a map for this).
L275 (Fig. 6): The figure would be more readable if the importance results were averaged over the five models with a measure of variation. Alternatively, swap the figure with one or all figures of the Appendix, as they are more easily recognizable and hence more valuable to the reader. Also, an x-axis label is missing.
L276-279: The same results for grid cells and time lags would also be interesting.
L302: Are there any papers investigating the model accuracy for out-of-range data?
L309-310: “more of a proxy-like” sounds clumsy. Suggestion: “represented by proxy”.
L339: I think the quoted FLUXCOM approach by Jung et al. already is a global flux model for this very purpose. Hence, the formulation “could act as a first step” sounds rather misleading to me.
Technical comments:
Articles are sometimes used excessively for generic nouns, e.g., L12-16, L78, L242-243
Write CO2 with a subscript 2
- AC1: 'Reply on RC1', Matti Kämäräinen, 18 Aug 2022
RC2: 'Comment on bg-2022-108', Anonymous Referee #2, 03 Jun 2022
This paper (bg-2022-108) evaluated the predictive skill of two machine learning models for estimating sub-daily net ecosystem exchange (NEE) at a long-term boreal forest site. Although using machine learning to model NEE is not a new topic, this study provides informative results on model choice (XGBoost vs. the commonly used random forest), the use of climatic data alone to estimate NEE, and the benefits of incorporating spatially and temporally autocorrelated information. These results are potentially helpful to carbon flux modeling with machine learning. I have several outstanding questions and suggestions that I hope the authors will consider.
Major comments:
1. The introduction should provide more background on the use of machine learning to model eddy-covariance-measured NEE and identify the knowledge gap that this paper tries to fill. Many studies have employed machine learning models to upscale eddy covariance NEE, and global products such as FLUXCOM are available. Therefore, what makes this study significant or informative when it models NEE with machine learning at a single site? This paper looks at novel aspects that were not discussed in the introduction, such as comparing GB vs. RF and incorporating spatial and temporal information.

2. A more rigorous model evaluation procedure would help improve the robustness of the model comparison results. This could include 1) using different types of goodness-of-fit metrics (e.g., NSE and bias), and 2) estimating uncertainties of model performance from repeated cross-validation with random splitting and model initialization. Please see my specific comments.
3. It would be interesting to look at how incorporating neighboring temporal and spatial information affects the predictability of NEE by the machine learning models since previous studies usually only focus on concurrent and collocated measurements/inputs. While the feature importance analysis shed light on the benefits of spatiotemporal information, the importance metrics are difficult to interpret for tree-based models, given that many features are highly correlated. A direct comparison between models with and without spatial/temporally neighboring information would be appreciated.
4. Global feature importance metrics are sometimes unstable and difficult to interpret for tree ensemble methods, especially when features are highly correlated. I suggest evaluating feature importance using SHAP as an additional metric to get a more rigorous quantification of importance. See some discussions about feature importance here (Yasodhara et al., 2021, https://link.springer.com/chapter/10.1007/978-3-030-84060-0_19#Sec), here (https://towardsdatascience.com/interpretable-machine-learning-with-xgboost-9ec80d148d27), and an example using SHAP here (Green et al., 2022, https://onlinelibrary.wiley.com/doi/abs/10.1111/gcb.16139).
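A minimal sketch of the suggested SHAP workflow, assuming the shap and xgboost packages; the model and data below are synthetic stand-ins, not the authors' pipeline:

```python
import numpy as np
import shap
from xgboost import XGBRegressor

X = np.random.default_rng(0).normal(size=(500, 8))   # stand-in predictor matrix
y = X[:, 0] + 0.5 * X[:, 1] ** 2                     # stand-in target
model = XGBRegressor(n_estimators=100).fit(X, y)

explainer = shap.TreeExplainer(model)     # exact TreeSHAP for tree ensembles
shap_values = explainer.shap_values(X)    # per-sample, per-feature attributions
shap.summary_plot(shap_values, X)         # global importance with direction of effect
```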
5. Data-driven models of carbon fluxes often use satellite-observed structural vegetation information as a major input. Therefore, it is interesting to see in this paper that climate variables (from ERA5) could explain 95% of the temporal dynamics of NEE at a site. Moreover, the level of accuracy in this paper is considerably higher than that of similar studies, both at single sites and in spatial upscaling over multiple sites. Could you please provide more discussion of the model performance and feature selection of this study in the context of previous results from the literature?
Specific comments:
Abstract
L18-19: This is an informative finding. But the manuscript doesn’t have an experiment that directly compares a model with spatial and temporal information to a model without such features.
L20-22: Both GB and RF rely on the same theoretical approach to identify features that are important to minimize the loss function since they are both tree-based algorithms. The fact that GB is more accurate than RF demonstrates the effectiveness of the “boosting” technique, but there is no direct evidence that GB identifies “more important features” than RF or is more resistant to overfitting.
Introduction
L50-56: Background on the reanalysis is informative, but is this necessary for this paper, given that most readers may already have general knowledge of it?
Methods
L134-135: Does this result apply to both RF and GB? This is an interesting finding to me and could be highlighted in the results/discussion/conclusion.

L145: Could you please elaborate on the benefits of transforming the target variable to Gaussian?
Figure 2: Showing all 25 grid cells would be helpful (maybe remove the "X" notations so the plot is more compact).
L170: I suggest adding bias and the Nash-Sutcliffe model efficiency (NSE) (https://en.wikipedia.org/wiki/Nash–Sutcliffe_model_efficiency_coefficient) (or R2 score, coefficient of determination, common in machine learning applications) to the evaluation metrics, so it is easy to compare the results in this study with other papers.
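For reference, the suggested metrics are a few lines each; note that NSE computed against the observed mean is algebraically identical to the coefficient of determination as implemented in scikit-learn's r2_score (sketch, numpy assumed):

```python
import numpy as np

def bias(y_true, y_pred):
    """Mean error; positive means overestimation on average."""
    return np.mean(y_pred - y_true)

def nse(y_true, y_pred):
    """Nash-Sutcliffe model efficiency; same formula as sklearn's r2_score."""
    return 1.0 - np.sum((y_true - y_pred) ** 2) / np.sum((y_true - y_true.mean()) ** 2)
```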
L173: Use 1000 instead of 10³ for easy reading.
L175: Hyperparameter tuning through a grid search or other techniques is a common procedure to obtain the optimal accuracy of a machine learning model. It is an essential step to create a fair game when benchmarking different models. Often hyperparameters are determined for each cross-validation fold (see Tramontana et al., 2016 for an example). Although it might be true that significant improvement in the model performance is not likely, it is important to include sufficient justification about your tuning process. For example, what was the search space of parameters? How many sets of parameters were evaluated?
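As an illustration of the fold-wise tuning described here, a sketch with scikit-learn's GridSearchCV; the grid values are invented for the example, not the authors' settings, and X_train/y_train denote the training portion of an outer fold:

```python
from sklearn.model_selection import GridSearchCV, KFold
from xgboost import XGBRegressor

# Hypothetical search space -- illustrative values only.
param_grid = {"max_depth": [3, 6, 9],
              "learning_rate": [0.01, 0.05, 0.1],
              "n_estimators": [200, 500]}

search = GridSearchCV(XGBRegressor(), param_grid,
                      cv=KFold(n_splits=3, shuffle=True, random_state=0),
                      scoring="neg_root_mean_squared_error")
# search.fit(X_train, y_train)  # run inside each outer cross-validation fold
```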
L180: Another suggestion is to perform repetition experiments (e.g. 30 or 50 repeated experiments for each algorithm, each with a different random split, and random state in the models) to estimate uncertainties from randomness in the cross-validation split and model initializations. See Besnard et al. (2019) for an example. In this way, the model comparison is robust to algorithm and splitting randomness. Confidence intervals of RMSE/R2 can also be derived this way, instead of bootstrapping within the samples.
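A sketch of the repeated-experiment protocol suggested here, with synthetic stand-in data; each repetition uses a fresh split and model seed, so the spread of scores reflects both sources of randomness:

```python
import numpy as np
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                  # stand-in predictors
y = X[:, 0] + rng.normal(0, 0.3, 1000)          # stand-in target

scores = []
for seed in range(30):                          # 30 repetitions: new split + new model seed
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=seed)
    model = XGBRegressor(n_estimators=200, random_state=seed).fit(X_tr, y_tr)
    scores.append(r2_score(y_te, model.predict(X_te)))
print(f"R2: {np.mean(scores):.3f} +/- {np.std(scores):.3f}")
```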
Results
L189: Do you mean 1,000 samples?

Figure 4: 1,000 bootstrap samples?
L222-224 (Figure 4.): The variation of accuracy between years can also be related to the random split of years during cross-validation. For a test year, if years with similar climate conditions are in the training set, the testing accuracy is likely higher than otherwise. To this end, the repeated model runs would help eliminate this effect.
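For comparison, deterministic leave-one-year-out splitting removes the splitting randomness entirely; a sketch with a synthetic 6-hourly index and scikit-learn's LeaveOneGroupOut (not necessarily the authors' scheme):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import LeaveOneGroupOut

idx = pd.date_range("1996-01-01", "2018-12-31", freq="6H")  # synthetic 6-hourly index
X = pd.DataFrame(np.random.default_rng(0).normal(size=(len(idx), 3)), index=idx)
y = X[0].to_numpy()                                         # stand-in target

for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=idx.year):
    pass  # fit on all other years, evaluate on the held-out year
```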
L232: Do you mean "sub-sampling" here?
L230-240: The description of methods should be in Section 2, and here you may present the results.
L265: I suggest placing Figures A1-3 in the main text, with Figure 6 presented in the Appendix. Figures A1-3 summarize the importance of individual ERA5 variables, different spatial grid cells, and information from different temporal windows, respectively. They are easier to interpret and provide a clearer comparison than Figure 6.
L269: It is interesting but somewhat surprising that the nearest grid cell is not the most important in the model. Further investigation and explanation would be needed here. What is the size of the tower footprint? How heterogeneous is this area? Is the tower close to cell 9, which may have a similar plant composition as the tower footprint? Is this related to lateral flows? What is the dominant wind direction?
L282-283: It is interesting that sensible heat and soil temperature alone could explain 90% of the variance in NEE. Is this for the 6-hourly or weekly model? This could be because diurnal and seasonal cycles dominate the temporal dynamics of NEE. Could you please provide more information on this analysis? For example, provide a figure like the heatmaps in Figure 4 to show if the accuracy of interannual variabilities drops when using only two variables.
Discussion and conclusions
L324: By “exclude”, do you mean that the redundant variables have low feature importance? It might be misleading to say the model excludes a variable.
- AC2: 'Reply on RC2', Matti Kämäräinen, 18 Aug 2022
Model code and software
Gradient boosting and random forest tools for modeling the NEE Kämäräinen, M., Lintunen, A., Kulmala, M., Tuovinen, J., Mammarella, I., Aalto, J., Vekuri, H., and Lohila, A. https://zenodo.org/badge/latestdoi/368864113