This work is distributed under the Creative Commons Attribution 4.0 License.
Using automated machine learning for the upscaling of gross primary productivity
Abstract. Estimating gross primary productivity (GPP) over space and time is fundamental for understanding the response of the terrestrial biosphere to climate change. Eddy-covariance flux towers provide in situ estimates of GPP at the ecosystem scale, but their sparse geographical distribution limits inference at larger scales. Machine learning (ML) techniques have been used to address this problem by extrapolating local GPP measurements over space using satellite remote sensing data. However, the accuracy of the regression model can be affected by uncertainties introduced by model selection, parametrization, and choice of predictor features. Recent advances in automated ML (AutoML) provide a novel automated way to select and synthesize different ML models. In this work, we explore the potential of AutoML by training three major AutoML frameworks on eddy-covariance measurements of GPP at 243 globally distributed sites. We compared their ability to predict GPP and its spatial and temporal variability based on different sets of remote sensing predictor variables. Predictor variables from only MODIS surface reflectance data and photosynthetically active radiation explained over 70 % of the monthly variability in GPP, while satellite-derived proxies for land surface temperature, evapotranspiration, soil moisture and plant functional types, and climate variables from reanalysis (ERA5-Land) further improved the frameworks' predictive ability. We found that the AutoML framework AutoSklearn consistently outperformed other AutoML frameworks as well as a classical Random Forest regressor in predicting GPP, reaching an overall r2 of 0.75. In addition, we deployed AutoSklearn to generate global wall-to-wall maps highlighting GPP patterns in good agreement with satellite-derived reference data. This research benchmarks the application of AutoML in GPP estimation and assesses its potential and limitations in quantifying global photosynthetic activity.
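For readers unfamiliar with the AutoML workflow the abstract describes, the following is a minimal, illustrative sketch of fitting an AutoSklearn regressor to tabular site-month data. The feature matrix, time budgets, and random data are placeholders, not the authors' actual configuration.

```python
# Minimal, illustrative sketch of the AutoML workflow described above,
# using the auto-sklearn library. The data below are random placeholders
# standing in for site-month predictors (e.g., MODIS reflectance, PAR)
# and monthly GPP; budgets and settings are NOT the authors' configuration.
import numpy as np
from sklearn.model_selection import train_test_split
from autosklearn.regression import AutoSklearnRegressor

rng = np.random.default_rng(42)
X = rng.random((500, 10))        # placeholder predictor matrix
y = rng.random(500) * 12.0       # placeholder monthly GPP (gC m-2 d-1)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

automl = AutoSklearnRegressor(
    time_left_for_this_task=600,  # total search budget (seconds)
    per_run_time_limit=60,        # budget per candidate pipeline
    seed=42,
)
# fit() runs the CASH search: it selects algorithms, tunes their
# hyperparameters, and ensembles the best-performing pipelines.
automl.fit(X_train, y_train)
print(automl.leaderboard())                  # ranked candidate pipelines
print("r2:", automl.score(X_test, y_test))   # r2 on held-out data
```

The appeal, as the abstract notes, is that algorithm selection, parametrization, and pipeline construction all happen inside fit() rather than by hand.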
Status: closed
RC1: 'Comment on bg-2023-141', Anonymous Referee #1, 29 Sep 2023
Max Gaber and colleagues investigate the effect of several technical choices in the process of predicting GPP from eddy-covariance measurements and satellite(-derived) data sets using machine learning. The focus is on novel methods in the field of automated machine learning applied to predict monthly GPP at site level from different sets of predictor variables, as well as on the effect of their spatial resolution. The authors demonstrate the applicability of AutoML, and of AutoSklearn in particular, and show that in the global upscaled product, spatiotemporal patterns compare reasonably to other products. They also illustrate the importance of adequate spatial resolution of the predictor variables by increased model performance at site level when part of the predictor variables are fed into the machine learning model at 500 m instead of at 0.05 deg resolution.
Given the growing number of research studies that implement such data-driven approaches at global and regional scales (a large part of which are cited in the paper) and the still unquantified importance of several technical choices in the set-up, this study is timely and definitely of relevance. The fact that the overall R2 at site level is similar to or slightly higher than that from a plain random forest or the results in comparable upscaling exercises at monthly scale (Jung et al. 2011) is interesting, and highlights that tuning the machine-learning set-up may not be the most promising way forward to improving the performance of data-driven models, but rather more informative predictor variables (at least Fig. 5 may be interpreted in this way). I find this a very valuable finding, which may also deserve to be communicated and highlighted more clearly (like you did, for example, in l.384-389, but not in the abstract or elsewhere). At the moment, the differences in significance are stressed more than the very similar magnitude of performance between the different AutoML methods. Also, your finding that the AutoML does not help to reproduce interannual changes (l.267) is important, because it is a common problem in data-driven upscaling and a very relevant question in the carbon cycle community, and therefore in my opinion deserves to be stressed more.
I suggest publication of the paper after addressing the following major questions/ comments:
1. What is the reason for doing this analysis at a monthly temporal scale, where structural vegetation changes dominate, rather than at a finer temporal resolution? I would expect higher gains from AutoML and also more differentiated contributions between predictor variables (especially meteorological features) at higher temporal resolution. This is also the time scale that is more relevant for properly representing seasonal and anomalous trajectories. I would expect large potential from automated model tuning especially for short extreme events, which are relevant for carbon uptake and hard to represent in a data-driven model set-up, but clearly smeared out at a monthly time step. Much of the discussion in section 4.2 neglects the coarse time step, when for example LUE changes are not expected to play a major role.
2. Data sets:
- A number of predictor variables are model outputs themselves, relying on input data and model assumptions. This is not discussed at all.
- What is the reason for ingesting both SIF and instantaneous SIF, or both PAR and RSDN?
- How is the temporal aggregation done?
- How do you handle data gaps?
- Handling of bad data quality is only mentioned for the site-level fluxes; what about the explanatory variables?
- Specify more clearly the data sources, e.g. for the CCI soil moisture: which version did you use? Presumably, FluxCom v6 refers to the FluxCom set-up with RS only (only satellite-based predictors using MODIS Collection 6), which is 8-daily and at high spatial resolution?
3. Spatial resolution: Why not also ingest tower meteorology instead of the coarser ERA5-Land? The scale mismatch could be further discussed, especially between a 0.05deg pixel and the tower footprint. The way the authors approach the analysis suggests using the 0.05deg pixel is the generally accepted default, which is not the case.
4. In parts, the manuscript uses very technical language and describes key concepts only very briefly. I suggest rephrasing certain passages to make the manuscript more accessible to a wider audience that may not be familiar with the newest developments in the machine-learning world, or at least expanding more in the supporting information. Examples of very technical sentences, in my opinion, are l.160-161, l.165-166, l.170-172, l.177-181, and l.243-246.
5. I am afraid I cannot follow the meaning of Fig. 6.
Minor comments for clarification:
- Throughout the manuscript: The analysis is not done on climatological time scales, so VPD, precipitation, and temperature are meteorological variables, not climate data.
- l.22: I suggest stressing the small differences between the AutoML frameworks already in the abstract, e.g. by writing '...AutoSklearn consistently but marginally outperformed other AutoML frameworks...'.
- l.49 and later in the manuscript: In the literature, the term 'variable importance' is used with very different meanings. Please state clearly that in your work, importance refers to the contribution of a variable to model accuracy.
- l.49-56: I am not convinced that the conclusions of the different cited papers are strictly comparable, because the analyses have been done at different temporal scales, from daily to monthly, and using different feature sets. Although it is the machine-learning results that are analysed, which do not necessarily need to obey conceptual understanding, the contributions of different features are expected to differ between time scales.
- l.66 (and later as well, e.g. l.146, 149, 319, 325): Could you clarify or give examples of what is meant by 'pipeline creation' and 'data processing steps'? The legend of Fig. A2 is hardly understandable for the non-expert without further context or information.
- l.81: 'Predictive contribution' to what? To prediction accuracy?
- l.202: Is there a reason for leaving out the VIs?
- l.232: So you compute a linear trend also for time series of just 2 years?
- l.241: What value does the critical difference take? (A sketch of the standard formula follows this list.)
- Section 3.4: So the main take-away is that the patterns from AutoML in general make sense when compared to other upscaling products? Or do you want to convey another message?
- l.465: Deforestation is mentioned for the first time here, and I cannot follow what is meant.
- l.519-525: This last part may be slightly overstating; I do not see very clear indications of more robust and accurate GPP predictions yet.
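On the critical-difference question above (l.241): assuming the manuscript uses the common Friedman/Nemenyi post-hoc test for comparing multiple models over multiple evaluation sets (an assumption, not something the comment confirms), the critical difference is

$$\mathrm{CD} = q_\alpha \sqrt{\frac{k(k+1)}{6N}},$$

where $k$ is the number of models compared, $N$ is the number of evaluation sets, and $q_\alpha$ is the Studentized range statistic divided by $\sqrt{2}$ (Demšar, 2006). Two average ranks differing by more than CD are considered significantly different.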
Citation: https://doi.org/10.5194/bg-2023-141-RC1
AC1: 'Reply on RC1', Max Gaber, 12 Nov 2023
CC1: 'Comment on bg-2023-141', Jiangong Liu, 01 Oct 2023
Gaber et al. present a comprehensive evaluation of using automated ML (AutoML) to estimate and upscale ecosystem GPP using four sets of remote sensing and reanalysis products. The comparative analysis of three AutoML frameworks reveals that AutoSklearn consistently outperforms the other frameworks and a baseline Random Forest model in reproducing spatial patterns, temporal variability, and trends in the observed GPP. Notably, the use of higher-resolution remote sensing products further enhances model performance, attributed to footprint matching. Additionally, the authors have produced a global wall-to-wall map of GPP (monthly, 0.05 deg) using AutoSklearn and a suite of remote sensing predictors, which agrees well with two other ML-based global GPP products.
The study highlights the potential of AutoML in quantifying global GPP, capturing its temporal and spatial variability and trend, and provides insights into feature selection for monthly GPP estimation. This topic matches the interests of the readers of Biogeosciences. While the manuscript is exceptionally well-written and the implementation of ML models is robust, several notable concerns, particularly regarding model interpretability, feature selection, and sources of uncertainty, warrant additional exploration and discussion.
Major comments:
- When comparing estimations derived from "RS" and "RS + meteo", and observing no substantial improvement in model performance with additional meteorological predictors, the assertion that this is because the meteorological data contain no additional information, or because the reanalysis data quality is not good, might need further exploration (Lines 435-440). Given that several predictors from "RS + meteo" might contain overlapping information on a monthly scale (e.g., VIs, LAI, SIF, ET, and meteorological data), it might be premature to conclude that the inclusion of meteorological data yields only marginal enhancement in modeling monthly GPP.
- I am puzzled by the decision to leave out radiation (BESS_Rad) in the 'RS meteo' (Figure 3) and curious about the thinking behind splitting data sources into remote sensing and reanalysis, instead of classifying them into physical (BESS_Rad, ESA CCI, MODIS LST, and ERA5-Land) and biological (MODIS VI/LAI, CSIF, and ALEXI ET) controls. Also, I think it would be worthwhile to discuss whether SIF should be included as a predictor since it is commonly used as a GPP proxy.
- While the Discussion does touch on various potential sources of uncertainties (e.g., section 4.2), it seems to overlook the potential for bias inherent in the eddy covariance GPP. The authors used night-time partitioned GPP, relying quite a bit on a temperature dependency function of night-time NEE. But there is still some debate about whether this dependency is exponential (Chen et al., 2023), whether it can be extrapolated to the daytime (Keenan et al., 2019), and whether it should be referenced to air or soil temperature (Wohlfahrt & Galvagno, 2017). Given that AutoML isn't the easiest to interpret (Line 330), I am wondering if its top-notch performance is partly because it is picking up on some error structures during NEE partitioning (the partitioning logic is sketched after these comments).
- I am excited about a new global GPP product. Would the authors like to give it an official name, and give the name a spotlight in the Title or Abstract? Additionally, it is recommended that the authors articulate both the interannual variability and the annual magnitude of GPP relative to the new product, as such information would likely be invaluable to the flux community. I am also curious about why the authors did not use the high-resolution RS data (500 m) for the product, considering it seems to yield better performance.
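For context on the partitioning concern raised above: in the standard night-time approach (Reichstein et al., 2005), night-time NEE, where photosynthesis is absent, is used to fit a temperature response of ecosystem respiration, commonly the Lloyd and Taylor (1994) form

$$R_{\mathrm{eco}}(T) = R_{\mathrm{ref}} \exp\!\left[E_0\left(\frac{1}{T_{\mathrm{ref}} - T_0} - \frac{1}{T - T_0}\right)\right],$$

which is then extrapolated to daytime so that $\mathrm{GPP} = R_{\mathrm{eco}} - \mathrm{NEE}$ (with NEE negative for net uptake). Any bias in the fitted temperature response therefore propagates directly into the GPP training targets.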
Minor comments:
- Line 90: Since negative outliers are in a unit of "gC m-2 d-1", did the authors aggregate daily values to monthly for both fluxes and their predictors? More details should be provided for the quality control.
- Line 100: Add the source/reference for IGBP here, and also in Fig 2.
- Line 115: It is a very minor point, but I think the terminology of explanatory variable/predictor (e.g., Table 1)/feature (e.g., line 40) is used somewhat randomly in the manuscript. Though they share the same meaning, readers might get confused.
- Line 130-140: It might be worthwhile to relocate this paragraph concerning the challenges with CASH to the Introduction to serve as an additional motivation statement. In the current Introduction, the authors highlighted the advantages of using AutoML, which are "... to overcome the challenges of algorithm selection, hyperparameter tuning, and pipeline creation through an automated approach". They introduced the existing problem of feature selection well. However, the knowledge gaps in the existing ML-based flux products regarding algorithm selection and hyperparameter tuning should also be clarified.
- Line 255: Offering details about the calculation of trends, seasonality, across-site variability, and anomalies in the Methodology section, prior to Figure 10, might enhance comprehension. I am also unsure what R2 values mean for trend comparison, as trends are the fitted slopes.
- Figure 7: What do R2 values smaller than -1 mean? (The definition behind negative values is sketched after this list.)
- Line 490: While the models also underestimate large GPP values (Line 305), further discussion on this aspect may provide additional insight.
- Line 520: I appreciate the authors raising this point about the cautious use of AutoML. The inherently 'black-box' nature of AutoML, which presents challenges in interpretability as indicated (Line 330), is a notable issue.
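Regarding the Figure 7 question above: with the usual definition

$$R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2},$$

$R^2$ is not bounded below by 0 on evaluation data. It turns negative whenever the model's squared error exceeds the variance of the observations around their mean, and values below $-1$ mean that squared error is more than twice that variance, i.e., the model performs far worse than simply predicting the mean.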
Citation: https://doi.org/10.5194/bg-2023-141-CC1
CC2: 'Reply on CC1', Jiangong Liu, 01 Oct 2023
Here are the references:
Chen, W., Wang, S., Wang, J., Xia, J., Luo, Y., Yu, G., & Niu, S. (2023). Evidence for widespread thermal optimality of ecosystem respiration. Nature Ecology & Evolution, 7(9), 1379–1387. https://doi.org/10.1038/s41559-023-02121-w
Keenan, T. F., Migliavacca, M., Papale, D., Baldocchi, D., Reichstein, M., Torn, M., & Wutzler, T. (2019). Widespread inhibition of daytime ecosystem respiration. Nature Ecology and Evolution, 3(3), 407–415. https://doi.org/10.1038/s41559-019-0809-2
Wohlfahrt, G., & Galvagno, M. (2017). Revisiting the choice of the driving temperature for eddy covariance CO2 flux partitioning. Agricultural and Forest Meteorology, 237–238, 135–142. https://doi.org/10.1016/j.agrformet.2017.02.012
Citation: https://doi.org/10.5194/bg-2023-141-CC2
AC3: 'Reply on CC1', Max Gaber, 12 Nov 2023
RC2: 'Comment on bg-2023-141', Anonymous Referee #2, 23 Oct 2023
Gaber et al. test the ability of multiple automated machine learning (AutoML) approaches, each based on multiple individual machine learning methods, to upscale gross primary production (GPP) with remote sensing. They specifically test three different AutoML methods (as well as a random forest model as a baseline) with different subsets of remote sensing and meteorological data, finding that they provide very similar performance, with r^2 ranging from ~0.7-0.75 at monthly scale. They also find similar abilities to capture trends, spatial variation, and seasonality across most approaches, but that none of them is particularly effective at capturing monthly GPP anomalies. The best models were typically based on a combination of MODIS surface reflectance with additional remote sensing-based estimates of LAI/FPAR, land surface temperature, soil moisture, evapotranspiration, and solar-induced fluorescence (SIF); adding meteorological reanalyses of precipitation, temperature, and vapor pressure deficit did not notably improve model performance.
Overall, the manuscript presents an interesting comparison of some cutting-edge approaches to automated machine learning and adds a new dimension to ongoing discussions of flux upscaling. It's also a well-written and well-constructed study. The fact that the approaches achieve similar results to each other and to other upscaled products is itself interesting and perhaps suggests that further improvement in upscaled GPP estimates may come from avenues aside from just algorithmic optimization (e.g., better and more extensive ground data, improved remotely sensed data streams). I have a few suggestions for improved presentation and additional analysis, but overall I think this is likely to be a high-quality contribution.
General/major comments:
1) My main suggestion for the analysis would be to provide, if possible, a more refined and specific assessment of the importance of individual variables. The analysis of the different subsets is interesting, but I think the impact of the study could be enhanced by assessing specifically which variables within those subsets are giving the most "bang for the buck." I know random forests, for example, provide variable importance metrics, and perhaps those are doable from the AutoML approaches as well (a model-agnostic sketch follows these comments)? I'm curious, for example, in the RS subsets, which variables added the most predictive skill beyond what was achieved with RSmin? How important were LST and soil moisture? Did the ET and SIF data, which are themselves modeled from remote sensing data, add any additional independent information? The CSIF product, for example, is itself an upscaled SIF product based on machine learning of MODIS NBAR data, so it seems like it wouldn't necessarily add anything beyond what the methods were able to get directly from the NBAR data.
2) I think the Discussion could use a little improvement in places. I think it would be especially helpful to improve how the findings are contextualized in light of previous literature. I’ll provide more specific suggestions below.
3) I find Fig. 6 very difficult to interpret. Is it possible to present those results in a more intuitive form?
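On general comment 1): one model-agnostic way to obtain the requested variable importance from any of the approaches, AutoML ensembles included, is permutation importance. A minimal sketch follows, using synthetic data and illustrative feature names that are not the manuscript's actual predictor set.

```python
# Model-agnostic variable importance via permutation, as suggested in
# comment 1). Works for any fitted regressor exposing predict(),
# including AutoML ensembles. Data and feature names are placeholders.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
X = rng.random((400, 5))
y = 3.0 * X[:, 0] + X[:, 1] + 0.1 * rng.standard_normal(400)
names = ["NBAR_red", "NBAR_nir", "LST", "soil_moisture", "PAR"]  # hypothetical

model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)

# Shuffling one feature at a time and measuring the drop in r2 scores
# how much the model relies on that feature.
result = permutation_importance(model, X, y, n_repeats=10, random_state=0)
for name, imp in sorted(zip(names, result.importances_mean),
                        key=lambda t: -t[1]):
    print(f"{name}: {imp:.3f}")
```

Because it only needs predictions, this would also reveal whether ET or CSIF carry information beyond what the model already extracts from the NBAR bands.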
Specific comments:
-L12: should that be “scale” instead of “scales”?
-L14: parameterization is misspelled (missing an “e”)
-Fig. 2: Just to clarify, this is showing number of sites, not site-years, correct? If so, I wonder if it would be more relevant to show site-years since that’s a better representation of how much training data is available in each biome?
-L122: I think it would be worth expanding more on these different sources, including references. Especially since some of these (ET and SIF) are themselves modeled based on remote sensing. Given that, what would you expect them to add beyond what would be coming from the NBAR data itself? Would they actually be providing independent information?
-L274-275: These may be “statistically different,” but to me, it seems like an r^2 of say 0.74 is not particularly different from an r^2 of 0.75 in any meaningful sense. The authors do a good job stating this later in the paper, but I do think it’s worth not overinterpreting small differences even if they are “statistically significant.” Any difference, however small, could be “significant” given a large enough sample size, but that doesn’t necessarily make it a meaningful difference.
-L286-297 (but also in other places throughout the results): There are places here that could use references to specific figures or panels within figures. Sometimes it’s hard to tell where the results as described are shown in the figures.
-L304-305: The overestimation of low values and underestimation of high values is interesting and consistent (I think) with some of the early studies of MODIS GPP (perhaps from David Turner and/or Faith Ann Heinsch, if I’m remembering correctly?). Some reference to those earlier works here would provide valuable context. The fact that we’re still trying to solve long-standing problems is itself interesting!
-L390-399: This paragraph (about differences among approaches) seems to slightly contradict the previous one (about how there aren’t really major differences). I’m not suggesting that the authors do a complete rewrite of the paragraph or anything, but I do think it might be worth making sure that they are sending a consistent message: that the differences are generally pretty slight.
-L401-407: It could also be that the quality of the eddy covariance data itself is a limiting factor. EC GPP is used as the ground truth in this case, but it’s not a perfect representation of GPP: EC data has sources of noise and EC GPP is a modeled quantity from the more directly measured NEE. I imagine there may therefore be upper limits to the performance metrics that we can expect when upscaling EC GPP just because of uncertainties in what we’re using as “truth.”
-Section 4.2: I think this section would definitely benefit from a more thorough dive into the variable importance, as suggested in general comments. Also, I don’t think there’s any mention of SIF in this section while other variables composing the RS subset are discussed?
-L433-439: The authors mention this at the end of the paragraph, but I think it could be more up front: reanalysis data (especially for precip) can be very flawed. So maybe temperature and VPD do matter (precipitation probably less so since soil moisture is already included in the model and ultimately it’s soil moisture, not precipitation, that gets directly used by plants) but the reanalysis data just doesn’t do a good job capturing it. Could also be worth a citation to previous literature that has assessed reanalysis data.
-L444: I’d suggest rephrasing “It is to be explored.” That’s somewhat awkward, passive phrasing.
-L463-466: This paragraph is kind of light on citations and the final sentence feels out of place and incomplete, like there’s something more that should be coming that connects the first part of the paragraph to this final thought.
-L477-484: This paragraph is also pretty light on citations. A couple suggestions: Smith et al. 2019 (Remote Sensing of Environment) on challenges specifically in dry regions and the early MODIS papers by Turner that assessed biome differences in MODIS GPP performance. It'd be interesting to see the results here contextualized with the challenges that have faced remote sensing of productivity for a long time!
- L481: It’s unclear what’s meant by “high proportion of biomass” or how that would affect productivity estimation. To me, it seems like it’s not high biomass that would lead to good performance but rather high seasonal variation in leaf area (which both DBF and MF have).
-L484: A little unclear what’s meant by “complex biophysical and environmental characteristics.” I think it’d be worth expanding on this and being more specific.
-L487: I think “It is to further research to…” is also somewhat awkward and passive phrasing and would suggest rewording.
-L490: This is another good place to cite Smith et al. 2019, which also shows that drylands are underrepresented in flux networks relative to their global proportion. Haughton et al. 2018 (Biogeosciences) could be a good one too since they showed that drylands are more “unique” (meaning less easy to apply a globally-trained model to an unseen site) than most other systems, which may be partly why the underrepresentation of dryland sites in flux networks can be such a problem for upscaling in those regions.
-L504: For the Conclusions section, it might be worth expanding on what’s meant by “RS” here. That’s referring to a specific subset of the variables but for readers who are skimming and skip to the conclusions section, they might miss what that subset refers to.
-L519-520: Maybe to some extent, but it’s interesting to note that RF (not automated and with, I think, some amount of subjectivity in choices) performed nearly as well as the AutoML methods.
Citation: https://doi.org/10.5194/bg-2023-141-RC2
AC2: 'Reply on RC2', Max Gaber, 12 Nov 2023
Model code and software
AutoML for GPP upscaling v1.0 Max Gaber https://doi.org/10.5281/zenodo.8262618