Technical note: Flagging inconsistencies in flux tower data
Martin Jung
Jacob Nelson
Mirco Migliavacca
Tarek El-Madany
Dario Papale
Markus Reichstein
Sophia Walther
Thomas Wutzler
Download
- Final revised paper (published on 15 Apr 2024)
- Supplement to the final revised paper
- Preprint (discussion started on 16 Aug 2023)
Interactive discussion
Status: closed
-
RC1: 'Comment on bg-2023-110', Dennis Baldocchi, 10 Sep 2023
Technical Note: Flagging inconsistencies in flux tower data
This is definitely a technical note, and as long as the editors of this journal are willing to consider it, I will give it due diligence as a referee. Otherwise, this document could be used as a white paper or grey literature to supplement processing on the ICOS or Fluxnet web sites.
I appreciate the utility of having a set of well-defined and accepted flags for data by the community, as data quality varies with time and place. Sadly, we are aware that many data users who are not involved in the details and rigors of the measurements, processing, and interpretation often ignore these flags. But there is no harm in producing and providing them.
What is unique and distinct here is the production of a complementary set of consistency flags (C2F) for flux tower data, which rely on multiple indications of inconsistency among variables, along with a methodology to detect discontinuities in time series. I am a fan of multiple constraints, so I am willing to read the case before me.
As I read this work I have some disagreements with conditions which they may flag.
For example, it is stated that the most frequent flags were associated with photosynthesis and respiration during rain pulses under dry and hot conditions. I have spent a good part of my career studying such pulses. They are real and sustained following rain events. To remove them is faulty and will cause biases in sums. Yes, I concur that during the rain event itself data should be flagged when sensors are wet. But following the rain, huge amounts of respiration can and will occur.
2 Materials and methods
As I read this paper back and forth, I wonder about the wisdom of its organization. In the Methods section there are 8 figures or so. This seems more like a Results and Discussion. I also suspect much of the excess material could be in an Appendix or Supplemental material. For a Technical Note, this paper is really long and excessive.
Section 2.2.1. It is important to benchmark and monitor the relation between PAR and Rg. It is our experience that quantum sensors tend to drift over time, if not frequently calibrated. PAR is used to upscale fluxes with remote sensing and if those relationships are built on faulty values of PAR, the derived products in time, space and trends will be in error. Hence, looking at these flags can have important implications. Too often this issue has been overlooked, so it would be important to know the consequences of improving these data.
The relation between Rn and Rg is important to examine, but realize it will change with season as albedo and surface temperature change. So be careful and do not make your flags by using one annual dataset for the site.
Comparing GPP and Reco with day vs night partitioning methods may be interesting, but I am not sure which is right. We know there is down-regulation of dark respiration during the day, and it is hard to measure reliable CO2 fluxes at night under stable conditions and with tall vegetation, appreciable storage, and/or sloping terrain. These can be points of reference, and maybe the daily sum is better than hour-by-hour measurements, as some errors cancel.
I have learned from Dario the value of plotting CO2 flux vs u* and developed a matlab subroutine to do so. The threshold can be uncertain as u* has some autocorrelation with the flux. Of course we don’t want to set high thresholds as they are based on a diminishing number of data points as high u* values are rare compared to low ones.
I see one of the constraints is LE + H vs Rn. What about G or storage in the water column? I think these tests are only instructional. We know that there are many differences in the sampling areas and representativeness of radiation and fluxes. It is dangerous to indict one or the other. And with wetlands it is really hard to measure water storage. We have a data set with nearly closed energy balance; then we flooded the system and it all went to hell. Same sensors, same processing, ideal fetch and site. It is just that water moves heat in and out, and it is hard as hell to sample it well enough.
I must admit I am having a problem coming up with a salient point of this paper and how it will help me do better. I am at the point where an outlier score is proposed. It seems ok, but it is a lot like the college ratings, that depend upon an arbitrary set of metrics and scores.
I often advise using the set of sites that help you ask and answer the questions you are asking, relating to climate, function, and structure. Just because these sites and data are in the FLUXNET database does not mean we have to use them all. Maybe this should be the point of this paper.
Figure 1. It is a comparison between machine learning and flux data. Not sure what I am to learn and extract here. Which is right or wrong? Machine learning ultimately is a fancy least-squares fit to a bunch of transfer functions and nodes.
Here is a set of data comparing annual carbon fluxes with machine learning methods from my sites. They are almost indistinguishable from the direct flux measurements they are derived from. In this case we know our site and develop the machine learning model with the most appropriate and representative biophysical forcings. In the figure given in this paper, I have no idea how appropriate the machine learning model may be for this situation, as the answer is based on independent variables they chose to use or omit.
Regarding the comparison of radiometers, I know during some seasons our guy wires may shade the quantum sensor for certain angles of the sun. Surely those data are not fit for use, and I hope such a method may help detect these biases and errors.
Figure 2 seems to be a nice case study to show the attributes of your ideas. Maybe start with that one first. It is clear and more understandable, as we know PAR and Rg are closely related. So when there are differences it can help us think about why and which is more plausible and better.
Fig 3. Maybe I am just tired, or thick, but I don’t follow the logic and rationale of the flag for light used for GPP. It would only give me pause on the accuracy of the machine learning calculations, but not the eddy fluxes.
Fig 5. I am trying to get my head around the issue of the comparison of the daytime vs nighttime methods. Again, I would argue one is better than the other. Personally, I like the idea of multiple constraints and see if the two methods are converging for confidence, more than anything. Not sure what you all are doing, but in early days working with Eva Falge, we estimated respiration during the day by the extrapolation of the CO2 flux vs light response curve. Now one of the limits is basing a regression and extrapolation on only a few points when the response function is linear, and the fact that during the sunrise sunset period steady state conditions don’t hold. It is these reasons why I argue against one being better or worse, but if they both converge at least we may assume the fluxes may be good enough.
The reality is that pulses due to rain or insects passing through the path of the IRGA or sonic are problematic. Or those from electrical noise (a rarity today). We also see problems with CO2 fluxes over open water as there is a covariance with w and RSSI of the sensor that yields fluxes in the wrong direction and that are not physical. Those should be filtered. But I don’t hear about that here.
Fig 6 seems to align with my suggestions that some sites may not be the best for some analyses and just toss them. Nothing lost as we oversample in many situations.
Fig 7. Curious as to why there is a systematic jump in LE. Eddy covariance should be immune from just a jump as we are doing mean removal. So even if sensors change and they are properly calibrated we should not expect such a marked difference. This is not like comparing two separate sensors, that can have offsets.
Fig 8. Illustration of the outlier score. This is needed to support the method described here. Has taken a long time to get to this point. Line 350!
Results
Fig9 demonstrates the point of this method. As expected met variable values tend to have few outliers.
Fig 10 provides a needed diagnostic as to when data may be rejected
Fig 11. Would think this would be a function of open vs closed path sensors
Fig 13. The jumps in NEE seem to be associated with site management. So Know Thy Site. Just don't blindly process long-term data. This is why we have phenocams at our tower, to look at the vegetation when things are 'weird'.
Jumps in sensors can, will, and do happen. This is why we make big efforts to write notes and log our sensor systems. Users have to remember Caveat Emptor and use the data wisely, and when there are jumps, look for reasons and do not misinterpret the data. We data providers can't hand-hold all users. They must do due diligence when using data too. Getting back to my point, one should not use all the data. Use what is best and most fitting.
Fig14. Interesting
Discussion
Factors for potential false positive and false negative flagging
Glad to see something on this. But it leaves begging the point I make that respiration pulses are real.
Detection and interpretation of discontinuities in the time series
As I have mentioned, these are expected with long-term sites as management can make changes. The site history needs to be considered too.
4.3.1 Flagged data points
I have already made my point about the danger of flagging rain pulses that are real. We have studied this with eddy fluxes, chambers, soil probes and they are consistent.
4.3.2 Flagged discontinuities in time series
It is reasonable to flag discontinuities, but aren’t they flagged already?
Concluding points
I find this paper on the opaque side. It is a slog to read through, very engineering in spirit, style and narrative.
I must confess given the energy and time to write any paper, this is one I would not have spent writing.
I am missing the ‘so what’ message and being convinced I need to apply another set of flags to what I am already doing or what is being done in fluxnet, especially something that is automated and may not be applicable for the sites I may need in my synthesis.
The scoring method seems on the arbitrary side and reminds me of the scoring system for the ‘best’ world universities. Each scoring system yields a different ranking and group. I suspect this would apply to the application of this method, too.
I want to know how often this automated method suffers from type 2 errors, calling an error when there really isn’t one.
This concern also revolves around my complaints about flagging real respiration rain pulses. These pulses are real and sustained and should not be flagged (except for the period when the sensors are wet).
At this point I really feel it is up to the editor whether or not they are interested in publishing such a paper. My suspicion is that it may not be cited much, but again I may be wrong. I look at the data from a different perspective, being a data generator and knowing what to believe and accept as reasonable.
-
AC1: 'Reply on RC1', Martin Jung, 22 Sep 2023
We would like to thank Dennis Baldocchi for his critical assessment of our manuscript, which helps to improve clarity in the revised version. There are three main issues that have emerged from his review that we would like to clarify:
- Relevance, purpose, and clarity
Thankfully, FLUXNET has become a data stream to validate and calibrate process-based land surface models, machine learning models, and retrievals from remote sensing. These synergistic global data streams feed global assessments of ecosystems and climate, and the associated scientific progress. Scientists leveraging flux tower data from the full network are often not eddy covariance experts and take the data for granted, partly because the available data quality flags cover only a few variables and are limited in scope.
The purpose of our work is to provide a methodological framework and tool to assess inconsistencies in flux tower data according to a set of well-defined criteria. The resulting complementary consistency flags can additionally be used in network-wide synthesis studies by non-EC experts, e.g. for analyzing the robustness of results to the inclusion of questionable data points. The tool can assist in and accelerate data quality checking done individually by PIs or by centralized processing – it cannot replace current practices, but it can catch issues that occasionally slip through and will eventually help to save time. We have demonstrated in the manuscript that the obtained flagging patterns point to issues in the data that are not really surprising to PIs, but which can be highly relevant for important scientific questions if those data issues are not flagged or are ignored. Since no such flags have been available so far, we think that our framework and tool to assess inconsistencies in flux tower data is a useful contribution to the community.
We apologize for the apparent lack of clarity in framing the paper and we aim at improving this aspect in the revision. The presented material is very technical by nature. We tried to make it accessible by including educational examples and many figures in the methods section, while bundling details in the last subsection to keep the clarity of the overall framework. It is clearly a delicate balance that we are willing to improve according to reviewer suggestions in the revised version. We thank the reviewer for his comments which help us to improve the framing and clarity of the revised paper.
- Respiration rain pulse response
We apologize for an apparent misunderstanding that we clarify in the revised version of the manuscript. Clearly, rain induced respiration pulses are real, interesting, and relevant phenomena evident in the measured NEE. However, flux partitioning methods struggle to work accurately under these conditions of abrupt change in ecosystem sensitivity to moisture, temperature, and light such that derived GPP and Reco estimates can be inaccurate. We hope that flagging GPP and Reco, when they appear to be inconsistent, helps for a better interpretation and usage of GPP and Reco data, motivates and helps in the development of improved flux partitioning methods, and helps in better understanding respiration rain pulses.
- Methodological design
We apologize for the apparent lack of clarity of the methodological rationale and advantages, which we aim to improve in the revised version of the manuscript. The method follows a clearly defined logic for identifying inconsistencies in flux tower data based on a set of criteria (constraints) that are conceptually and empirically justified (see Table 2 in the manuscript), along with a careful conceptual distinction between soft and hard constraints. The strictness parameter relates to the boxplot rule and thus has an interpretable basis that is familiar to everyone with a basic understanding of statistics. The methodology is fully automated and there is no data screening based on subjective visual inspection. The methodology is set up to be flexible in terms of modifying the criteria (constraints) and strictness – a key advantage for tailoring the methodology to specific questions and applications, since “there is no free lunch”.
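For readers who would like to see the boxplot rule in action, a minimal Python sketch of a residual-based outlier score with an nIQR strictness parameter is given below (function and variable names are ours, for illustration only, not the tool's exact implementation):

```python
import numpy as np

def outlier_score(residuals):
    """Distance of each residual beyond the interquartile range of all
    residuals, expressed in multiples of the IQR (0 inside the box)."""
    q1, q3 = np.nanpercentile(residuals, [25, 75])
    iqr = q3 - q1
    above = np.maximum(residuals - q3, 0.0)
    below = np.maximum(q1 - residuals, 0.0)
    return (above + below) / iqr

def boxplot_flag(residuals, niqr=3.0):
    """Boxplot rule: flag residuals lying more than niqr interquartile
    ranges outside [q1, q3]; niqr is the strictness parameter."""
    return outlier_score(residuals) > niqr
```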
Below we reply point by point to the comments
This is definitely a technical note, and as long as the editors of this journal are willing to consider it, I will give it due diligence as a referee. Otherwise, this document could be used as a white paper or grey literature to supplement processing on the ICOS or Fluxnet web sites.
See major point 1).
I appreciate the utility of having a set of well-defined and accepted flags for data by the community, as data quality varies with time and place. Sadly, we are aware that many data users who are not involved in the details and rigors of the measurements, processing, and interpretation often ignore these flags. But there is no harm in producing and providing them.
We appreciate the positive comment on the general usefulness of flags for eddy covariance data users. We hope that a clear and peer-reviewed documentation raises the awareness of the users and can facilitate a more appropriate usage of the data.
What is unique and distinct here is the production of a complementary set of consistency flags (C2F) for flux tower data, which rely on multiple indications of inconsistency among variables, along with a methodology to detect discontinuities in time series. I am a fan of multiple constraints, so I am willing to read the case before me.
As I read this work I have some disagreements with conditions which they may flag.
See below for responses on specific conditions.
For example, it is stated that the most frequent flags were associated with photosynthesis and respiration during rain pulses under dry and hot conditions. I have spent a good part of my career studying such pulses. They are real and sustained following rain events. To remove them is faulty and will cause biases in sums. Yes, I concur that during the rain event itself data should be flagged when sensors are wet. But following the rain, huge amounts of respiration can and will occur.
See major point 2).
2 Materials and methods
As I read this paper back and forth, I wonder about the wisdom of its organization. In the Methods section there are 8 figures or so. This seems more like a Results and Discussion. I also suspect much of the excess material could be in an Appendix or Supplemental material. For a Technical Note, this paper is really long and excessive.
See major point 3). We can agree to move e.g. methodological details to supplementary material based on reviewers’ and editor’s advice. Our initial decision to keep the details in the manuscript was because we considered this appropriate for a ‘Technical Note’.
Section 2.2.1. It is important to benchmark and monitor the relation between PAR and Rg. It is our experience that quantum sensors tend to drift over time, if not frequently calibrated. PAR is used to upscale fluxes with remote sensing and if those relationships are built on faulty values of PAR, the derived products in time, space and trends will be in error. Hence, looking at these flags can have important implications. Too often this issue has been overlooked, so it would be important to know the consequences of improving these data.
We agree with the reviewer that issues in PAR data could easily propagate into or deteriorate downstream analyses, including the flux partitioning. We hope that our C2F flags will contribute to a better usage of PAR data from FLUXNET and could also be implemented in the first steps of the data processing.
The relation between Rn and Rg is important to examine, but realize it will change with season as albedo and surface temperature change. So be careful and do not make your flags by using one annual dataset for the site.
We agree with the reviewer on these conceptual issues of the Rn-Rg relationship, which we had mentioned explicitly in Table 2. For these reasons, the relationship was classified as a soft constraint, meaning that additional constraints need to indicate outliers for the same data points to cause flagging. The median correlation between daily Rn and Rg is about 0.95, which provides an empirical justification for considering this constraint for deriving inconsistency flags for radiation variables.
Comparing GPP and Reco with day vs night partitioning methods may be interesting, but I am not sure which is right. We know there is down-regulation of dark respiration during the day, and it is hard to measure reliable CO2 fluxes at night under stable conditions and with tall vegetation, appreciable storage, and/or sloping terrain. These can be points of reference, and maybe the daily sum is better than hour-by-hour measurements, as some errors cancel.
We agree with the reviewer on these points and will mention these factors as potential sources of inconsistencies between flux partitioning methods in the revised version of the manuscript. As explained above, the aim of these flags is not to identify which method is correct but to identify data where there are multiple indications of inconsistency that suggest issues, such that the user is able to choose whether or not to include these data, e.g. for modeling activities. Inconsistency between results from partitioning methods is one of these cases.
I have learned from Dario the value of plotting CO2 flux vs u* and developed a matlab subroutine to do so. The threshold can be uncertain as u* has some autocorrelation with the flux. Of course we don’t want to set high thresholds as they are based on a diminishing number of data points as high u* values are rare compared to low ones.
For the reasons outlined by the reviewer, we used the u* uncertainty quantified according to Pastorello et al. 2020 (which is an extension of Papale et al. 2006 and more robust to noise) as a soft constraint for assessing NEE.
I see one of the constraints is LE + H vs Rn. What about G or storage in the water column? I think these tests are only instructional. We know that there are many differences in the sampling areas and representativeness of radiation and fluxes. It is dangerous to indict one or the other. And with wetlands it is really hard to measure water storage. We have a data set with nearly closed energy balance; then we flooded the system and it all went to hell. Same sensors, same processing, ideal fetch and site. It is just that water moves heat in and out, and it is hard as hell to sample it well enough.
We agree with the reviewer that accounting for storage is important for assessing the energy balance constraint on diurnal data, and that flooding and the associated lateral transport of energy would violate the assumption behind the LE+H vs Rn constraint.
Our approach is based on daily integrals, where for most sites and conditions the ground heat storage change, G, is quantitatively very small compared to the magnitude of daily Rn or LE+H and their uncertainties. Furthermore, daily mean G tends to scale with daily Rn, which further reduces the effect of neglecting G on the performance of the linear fit. This is illustrated by the median correlation for this constraint being 0.95, which provides an empirical justification. Furthermore, at sites where non-trivial variations of G contribute to a weaker correlation between LE+H and Rn (e.g. wetland sites), the outlier score should not be affected, because outliers are identified relative to the variance of the residuals (which would be larger). For better comparability among sites we had decided not to account for G because it is comparatively frequently missing. While we hope we could make a convincing case for keeping this energy balance constraint, we agree with the conceptual concerns raised by Dennis and will consider it as a soft constraint rather than a hard constraint in the revised version. Thank you again for the constructive comment that helps improve the methodology.
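For illustration only, a minimal sketch of how such a daily energy balance screening could look (the linear fit and the variable names are our assumptions for this sketch, not necessarily the exact implementation in the tool):

```python
import numpy as np

def energy_balance_outliers(rn, le, h, niqr=3.0):
    """Regress daily LE+H against daily Rn and flag days whose residual
    falls far outside the interquartile range of all residuals. Because
    the threshold scales with the residual spread, sites with a noisier
    LE+H vs Rn relation (e.g. wetlands) are not over-flagged."""
    y = le + h
    ok = np.isfinite(rn) & np.isfinite(y)
    slope, intercept = np.polyfit(rn[ok], y[ok], 1)
    resid = y - (slope * rn + intercept)
    q1, q3 = np.nanpercentile(resid[ok], [25, 75])
    iqr = q3 - q1
    return (resid < q1 - niqr * iqr) | (resid > q3 + niqr * iqr)
```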
I must admit I am having a problem coming up with a salient point of this paper and how it will help me do better. I am at the point where an outlier score is proposed. It seems ok, but it is a lot like the college ratings, that depend upon an arbitrary set of metrics and scores.
See major point 1) and 3). Essentially the objectives of the paper are to present the methodology/tool and to show the usefulness of the flags by pointing to data issues in FLUXNET2015 that are relevant to consider for important scientific questions. We will make this more clear in the revised version of the manuscript.
I often advise using the set of sites that help you ask and answer the questions you are asking, relating to climate, function, and structure. Just because these sites and data are in the FLUXNET database does not mean we have to use them all. Maybe this should be the point of this paper.
Indeed, we hope we can contribute to a better selection of data for specific FLUXNET based analysis for example involving modeling and empirical upscaling. We will try to make that more clear in the revised version of the manuscript. See also major point 1).
Figure 1. It is a comparison between machine learning and flux data. Not sure what I am to learn and extract here. Which is right or wrong? Machine learning ultimately is a fancy least-squares fit to a bunch of transfer functions and nodes.
With Figure 1 we aimed at illustrating in an educational way how the outlier score works and behaves, in particular in the context of heteroscedasticity. We will improve the clarity of this part in the revised manuscript. See also major point 3).
Here is a set of data comparing annual carbon fluxes with machine learning methods from my sites. They are almost indistinguishable from the direct flux measurements they are derived from. In this case we know our site and develop the machine learning model with the most appropriate and representative biophysical forcings. In the figure given in this paper, I have no idea how appropriate the machine learning model may be for this situation, as the answer is based on independent variables they chose to use or omit.
We thank the reviewer for this important point and for sharing his results with us. The median correlation between machine learning based predictions (cross-validated within site) varies between 0.93 and 0.99 depending on the target variable (Table 2), which suggests that the chosen predictor variables listed in section 2.4.2 are appropriate. We had invested energy into designing and calculating a set of water availability metrics for improved modelling of water stress effects that are typically more difficult to get (section xx). The cross-validated model predictions as well as performance metrics are by default stored and accessible by the interested user in the tool and data. We would like to emphasize again that the outlier score is always calculated with reference to the spread of residuals which means that a bad model and thus large absolute residuals would not cause elevated outlier scores (discussed in section 2.4.2). Furthermore, the machine learning constraints are considered as soft constraints, i.e. multiple indications of inconsistencies are needed to cause flagging. We will improve the clarity on these aspects in the revised version of the manuscript.
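As a purely illustrative sketch of a within-site, cross-validated machine learning constraint (the model choice, predictors, and function names here are our assumptions for this sketch; the tool's actual setup is described in section 2.4.2):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_predict

def ml_constraint_outliers(X, y, niqr=3.0):
    """Cross-validated within-site prediction of a target variable from
    meteorological predictors. Outliers are residuals far outside the
    interquartile range of all residuals, so a generally weaker model
    (large but uniform residuals) does not inflate the outlier score."""
    model = RandomForestRegressor(n_estimators=200, random_state=0)
    y_hat = cross_val_predict(model, X, y, cv=5)
    resid = y - y_hat
    q1, q3 = np.percentile(resid, [25, 75])
    iqr = q3 - q1
    return (resid < q1 - niqr * iqr) | (resid > q3 + niqr * iqr)
```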
Regarding the comparison of radiometers, I know during some seasons our guy wires may shade the quantum sensor for certain angles of the sun. Surely those data are not fit for use, and I hope such a method may help detect these biases and errors.
Good point. We will add this to the discussion.
Figure 2 seems to be a nice case study to show the attributes of your ideas. Maybe start with that one first. It is clear and more understandable, as we know PAR and Rg are closely related. So when there are differences it can help us think about why and which is more plausible and better.
Thank you for your suggestion! We will consider this carefully for the revised version of the manuscript.
Fig 3. Maybe I am just tired, or thick, but I don’t follow the logic and rationale of the flag for light used for GPP. It would only give me pause on the accuracy of the machine learning calculations, but not the eddy fluxes.
SW_IN is an input to the daytime-based flux partitioning method, which fits a light-response curve. Therefore, GPP_DT is affected by errors in SW_IN. For this reason GPP_DT gets flagged when SW_IN is flagged (see also Fig. 4). We apologize if that was not clear enough and will improve this aspect in the revisions.
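A minimal sketch of this kind of flag propagation (the column names and the set of affected variables are illustrative assumptions; the actual propagation rules are summarized in Fig. 4 of the manuscript):

```python
import pandas as pd

def propagate_sw_in_flags(flags: pd.DataFrame) -> pd.DataFrame:
    """If SW_IN is flagged for a given day, also flag products derived
    from it by the daytime partitioning (here GPP_DT and RECO_DT, since
    the light-response fit uses SW_IN as input)."""
    out = flags.copy()
    for derived in ("GPP_DT", "RECO_DT"):
        if derived in out.columns:
            out[derived] = out[derived] | out["SW_IN"]
    return out
```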
Fig 5. I am trying to get my head around the issue of the comparison of the daytime vs nighttime methods. Again, I would argue one is better than the other. Personally, I like the idea of multiple constraints and see if the two methods are converging for confidence, more than anything. Not sure what you all are doing, but in early days working with Eva Falge, we estimated respiration during the day by the extrapolation of the CO2 flux vs light response curve. Now one of the limits is basing a regression and extrapolation on only a few points when the response function is linear, and the fact that during the sunrise sunset period steady state conditions don’t hold. It is these reasons why I argue against one being better or worse, but if they both converge at least we may assume the fluxes may be good enough.
The main intention of Figure 5 was to illustrate how the flagging works and that we can diagnose which constraints have contributed to or caused flagging. In addition, we wanted to illustrate how respiration pulses evident in NEE (and not flagged) are associated with inconsistencies between the GPP and Reco partitioning methods, such as systematically negative GPP_NT. If that figure adds more confusion than clarity we would consider moving it to the supplementary material. See also major point 3).
The reality is that pulses due to rain or insects passing through the path of the IRGA or sonic are problematic. Or those from electrical noise (a rarity today). We also see problems with CO2 fluxes over open water as there is a covariance with w and RSSI of the sensor that yields fluxes in the wrong direction and that are not physical. Those should be filtered. But I don’t hear about that here.
Thank you for these points. We will fold those into the discussion.
Fig 6 seems to align with my suggestions that some sites may not be the best for some analyses and just toss them. Nothing lost as we oversample in many situations.
Agree.
Fig 7. Curious as to why there is a systematic jump in LE. Eddy covariance should be immune from just a jump as we are doing mean removal. So even if sensors change and they are properly calibrated we should not expect such a marked difference. This is not like comparing two separate sensors, that can have offsets.
According to the BADM, the break coincides with a major change in instrumentation, including a change of the gas analyzer, the sonic anemometer, and the measurement height (see also Fig. 13). It is an example of how these tests could also help the PI identify at least a missing communication of metadata about the setup.
Fig 8. Illustration of the outlier score. This is needed to support the method described here. Has taken a long time to get to this point. Line 350!
We will consider moving it up in the revised manuscript. The trade-off between methodological details and overall clarity seems very delicate. See also major point 3).
Results
Fig9 demonstrates the point of this method. As expected met variable values tend to have few outliers.
This is mostly correct, although radiation variables show comparatively frequent flagging too.
Fig 10 provides a needed diagnostic as to when data may be rejected
It was intended to illustrate issues of currently existing flux-partitioning methods in particular for dry and rain pulse conditions.
Fig 11. Would think this would be a function of open vs closed path sensors
Interesting hypothesis! We’ll look into this for the revised version.
Fig 13. The jumps in NEE seem to be associated with site management. So Know Thy Site. Just don't blindly process long-term data. This is why we have phenocams at our tower, to look at the vegetation when things are 'weird'.
Agree. Management is clearly an important factor that can cause flags for temporal discontinuities as shown in figure 13 and discussed in section 4.1.2. Changes in instrumentation or processing by the PI seem to be another important, and apparently the more important cause of temporal discontinuities (see Figure 13 and its caption and section 4.1.2). As discussed in section 4.2.1 the purpose of these flags is to make users aware of temporal discontinuities and then the decision of how to interpret or deal with them is left to the user according to the requirements.
Jumps in sensors can, will, and do happen. This is why we make big efforts to write notes and log our sensor systems. Users have to remember Caveat Emptor and use the data wisely, and when there are jumps, look for reasons and do not misinterpret the data. We data providers can't hand-hold all users. They must do due diligence when using data too. Getting back to my point, one should not use all the data. Use what is best and most fitting.
While we agree in principle, we think that we should try our best to facilitate an appropriate usage of FLUXNET data by non-EC experts (in particular when large numbers of sites are used in an automatic ingestion system) to ultimately facilitate a wide and solid use of the data. Since we are addressing this problem explicitly, we hope our flags will be useful for the community of data users. A conceptual advantage of our system compared to visual expert judgement is that the approach is standardized and automated, avoids subjectivity, and avoids potential confirmation bias in selecting or filtering out data.
Fig14. Interesting
Thank you.
Discussion
Factors for potential false positive and false negative flagging
Glad to see something on this. But it leaves begging the point I make that respiration pulses are real.
See major point 2).
Detection and interpretation of discontinuities in the time series
As I have mentioned, these are expected with long-term sites as management can make changes. The site history needs to be considered too.
Agreed. The flags can help point to management effects on fluxes, as discussed in sections 4.1.2 and 4.2.1, and the user can consult the BADMs and the PI to find out what happened specifically. Likewise, changes in instrumentation unfortunately seem to play another major role. This demonstrates once more the importance of the management and instrumentation change data that are often not shared. If systematically available, an attribution of the temporal discontinuities could be facilitated in a future version of C2F. We will clarify this in the new version.
4.3.1 Flagged data points
I have already made my point about the danger of flagging rain pulses that are real. We have studied this with eddy fluxes, chambers, soil probes and they are consistent.
See major point 2).
4.3.2 Flagged discontinuities in time series
It is reasonable to flag discontinuities, but aren’t they flagged already?
No, they are not. Therefore we think our methodology and tool will be useful.
Concluding points
I find this paper on the opaque side. It is a slog to read through, very engineering in spirit, style and narrative.
See major point 1) and 3). We’ll try to improve.
I must confess given the energy and time to write any paper, this is one I would not have spent writing.
See major point 1).
I am missing the ‘so what’ message and being convinced I need to apply another set of flags to what I am already doing or what is being done in fluxnet, especially something that is automated and may not be applicable for the sites I may need in my synthesis.
See major point 1).
The scoring method seems on the arbitrary side and reminds me of the scoring system for the ‘best’ world universities. Each scoring system yields a different ranking and group. I suspect this would apply to the application of this method, too.
Since we aim at flagging inconsistencies in flux tower data based on a clear rationale and a set of conceptually and empirically justified criteria we think the analogy to college ratings does not apply here. See also major point 3).
I want to know how often this automated method suffers from type 2 errors, calling an error when there really isn’t one.
We fully agree with the reviewer that this would be relevant to know. To facilitate such an analysis we would need either labels for the real data or synthetic data. We would love to have them but do not. The rate is clearly a function of the chosen niqr threshold (see discussion in section 4.1.1) – when increasing niqr (allowing for looser consistency), type 2 errors will decrease. We encourage future efforts to establish such a benchmarking data set so that we can objectively evaluate type 1 and type 2 errors and ultimately improve the method (conclusion section, lines 772-774).
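As an illustration of how this trade-off can be explored by varying the strictness parameter, a small sketch (this diagnostic is ours for illustration, not part of the tool):

```python
import numpy as np

def flag_fraction_vs_niqr(residuals, niqr_values=(1.5, 3.0, 5.0)):
    """Fraction of data points flagged by the boxplot rule for different
    strictness values: larger niqr means looser consistency requirements,
    fewer flags, and hence fewer potential false positives (at the cost
    of more false negatives)."""
    q1, q3 = np.nanpercentile(residuals, [25, 75])
    iqr = q3 - q1
    return {k: float(np.mean((residuals < q1 - k * iqr) |
                             (residuals > q3 + k * iqr)))
            for k in niqr_values}
```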
This concern also revolves around my complaints about flagging real respiration rain pulses. These pulses are real and sustained and should not be flagged (except for the period when the sensors are wet).
See major point 2).
At this point I really feel it is up to the editor whether or not they are interested in publishing such a paper. My suspicion is that it may not be cited much, but again I may be wrong. I look at the data from a different perspective, being a data generator and knowing what to believe and accept as reasonable.
See major point 1). We hope we could convince the reviewer and the editor of the value of our contribution for the scientific community. Classic papers on eddy covariance data processing like Papale et al. 2006, Reichstein et al. 2005, Desai et al. 2008, Lasslop et al. 2010, Wutzler et al. 2018, and Pastorello et al. 2020 each count hundreds to thousands of citations so far, implying that advances in this domain are likely to have impact.
Citation: https://doi.org/10.5194/bg-2023-110-AC1
-
RC2: 'Comment on bg-2023-110', Anonymous Referee #2, 02 Oct 2023
Title: Technical Note: Flagging inconsistencies in flux tower data
Manuscript number: bg-2023-110
SUGGESTION: major revisions
GENERAL COMMENTS
In this manuscript a new algorithm for quality flagging eddy covariance (EC) flux data is proposed. Long EC flux time series are already available from several measurement sites and the FLUXNET dataset consists of observations from hundreds of sites. Such datasets are an invaluable source of information for studies focusing on land-atmosphere interactions. However, there can be spurious differences between sites (e.g. due to instrumentation or different data processing pipelines) and spurious temporal discontinuities in long time series which complicate the usage of these data. Such problems have been minimized in research infrastructures such as ICOS and NEON, where instrumentation and data processing have been standardized and sufficient metadata are available. However, for older data and data stemming from outside these standardized infrastructures these problems may still persist. This manuscript tries to find a solution to these problems with additional quality flagging of EC flux time series.
The manuscript is within the scope of BG (although it might fit better in AMT or GI due to its technical nature) and presents novel ideas for solving an existing problem (which would, however, not exist if all the needed metadata were available). The scientific quality of the work is good, but to a large extent the presentation quality is not. My main criticism is directed towards how the new algorithm is presented, in particular towards Sect. 2 in the manuscript. The section is very difficult to follow, and the reader needs to constantly jump back and forth between subsections when reading the text. For instance, when I reached Sect. 2.4 I realized that I had read the whole section already, since I needed to read it simultaneously with Sections 2.2 and 2.3 in order to understand the text. It took me quite a long time to understand how the whole algorithm works. Hence, I strongly suggest rethinking the structure of the text in Sect. 2.
I suggest accepting this manuscript after major revisions, mainly due to the way the algorithm is presented in the manuscript.
As a sidenote, this is one of those manuscripts where it would be really helpful for the reviewer if the underlying code and a small example dataset were available already during the review. However, currently this does not seem to be a prerequisite for manuscript submission in this journal, and hence I do not expect the authors to make such material available at this stage.
SPECIFIC COMMENTS
Row 40: ”spectral corrections” plus other processing steps, e.g. coordinate rotation.
Rows 75-78: I suggest that you add references for these research questions. They are clearly related to prior work. Now it reads as if the reader should already know that, e.g., interannual variability of sensible heat flux can be predicted better than interannual variability of latent heat flux, or that the reader knows what “the issue to model drought effects in GPP” is.
Row 86: FLUXCOM not introduced anywhere, you need to briefly tell what it is.
Rows 102-102: I have not used FLUXNET2015 data, is it originally with daily time step or did you average it to daily values? Please mention in the text.
Rows 102-103: What is fqcOK, a variable in FLUXNET2015 dataset? Consider removing it from the text and just write that you removed those days from the analysis for which more than 20 % of data were not measured or gapfilled with high confidence.
Row 112: “expected relationship”, this can be dangerous as you are enforcing a certain dependence between variables. By doing this you may inadvertently quality flag (and screen out) scientifically interesting periods. This ought to be discussed in the text.
Row 159: I suggest adding a site ID in parentheses after “United States”.
Row 161: what is F15? Please clarify
Row 319: “This made the outlier score too sensitive to very small residuals even.” What does this mean? Please clarify
Row 349: Does the machine learning (ML) model predictive performance have an impact on these ML constraints? Do you assume that the model residuals are only related to noise in the measurements, i.e. the model is perfect? I suggest discussing this briefly in the text, e.g. here, in Sect. 2.4.8 or other suitable location. ML models typically perform worse for NEE than e.g. for GPP (see e.g. Tramontana et al., 2016) and hence this constraint might not work similarly for all variables.
Row 357: you need to introduce scikit-learn
Rows 373-374: “Six variants of tCWDt C with different C values of 15,50,100,150,200,250 mm were calculated.” Already mentioned above on rows 368-369. Please remove
Row 377 (Table 5): These variable names most likely follow FLUXNET2015, but you need to tell the reader what these variables are. Currently, they are not all introduced in the text.
Row 604: I would argue that false negative is not as bad as false positive (conservative approach).
TECHNICAL CORRECTIONS
Row 52: extra “)” after “Drought2018”
Row 77: replace ”.” with “?”
Row 87 and maybe elsewhere: you use both ONEFLUX and ONEFlux. Use only one of these two, check Pastorello et al. (2020).
Row 95: replace “FLUXNET” with “FLUXNET2015”
Row 102: replace “table” with “Table”.
Row 115: should “2.2.2” be replaced with “2.2.1”?
Row 232: replace “Turing” with “Turning”
Row 592 (Figure 14): Colorbar label is incomplete in the right plot
REFERENCES
Pastorello, G., Trotta, C., Canfora, E., Chu, H., Christianson, D., Cheah, Y.-W., Poindexter, C., Chen, J., Elbashandy, A., Humphrey, M., Isaac, P., Polidori, D., Reichstein, M., Ribeca, A., van Ingen, C., Vuichard, N., Zhang, L., Amiro, B., Ammann, C., Arain, M. A., Ardö, J., Arkebauer, T., Arndt, S. K., Arriga, N., Aubinet, M., Aurela, M., Baldocchi, D., Barr, A., Beamesderfer, E., Marchesini, L. B., Bergeron, O., Beringer, J., Bernhofer, C., Berveiller, D., Billesbach, D., Black, T. A., Blanken, P. D., Bohrer, G., Boike, J., Bolstad, P. V., Bonal, D., Bonnefond, J.-M., Bowling, D. R., Bracho, R., Brodeur, J., Brümmer, C., Buchmann, N., Burban, B., Burns, S. P., Buysse, P., Cale, P., Cavagna, M., Cellier, P., Chen, S., Chini, I., Christensen, T. R., Cleverly, J., Collalti, A., Consalvo, C., Cook, B. D., Cook, D., Coursolle, C., Cremonese, E., Curtis, P. S., D’Andrea, E., da Rocha, H., Dai, X., Davis, K. J., Cinti, B. D., de Grandcourt, A., Ligne, A. D., De Oliveira, R. C., Delpierre, N., Desai, A. R., Di Bella, C. M., di Tommasi, P., Dolman, H., Domingo, F., Dong, G., Dore, S., Duce, P., Dufrêne, E., Dunn, A., Dušek, J., Eamus, D., Eichelmann, U., ElKhidir, H. A. M., Eugster, W., Ewenz, C. M., Ewers, B., Famulari, D., Fares, S., Feigenwinter, I., Feitz, A., Fensholt, R., Filippa, G., Fischer, M., Frank, J., Galvagno, M., et al.: The FLUXNET2015 dataset and the ONEFlux processing pipeline for eddy covariance data, Scientific Data, 7, 225, https://doi.org/10.1038/s41597-020-0534-3, 2020.
Tramontana, G., Jung, M., Schwalm, C. R., Ichii, K., Camps-Valls, G., Ráduly, B., Reichstein, M., Arain, M. A., Cescatti, A., Kiely, G., Merbold, L., Serrano-Ortiz, P., Sickert, S., Wolf, S., and Papale, D.: Predicting carbon dioxide and energy fluxes across global FLUXNET sites with regression algorithms, Biogeosciences, 13, 4291–4313, https://doi.org/10.5194/bg-13-4291-2016, 2016.
Citation: https://doi.org/10.5194/bg-2023-110-RC2
-
AC2: 'Reply on RC2', Martin Jung, 25 Oct 2023
GENERAL COMMENTS
In this manuscript a new algorithm for quality flagging eddy covariance (EC) flux data is proposed. Long EC flux time series are already available from several measurement sites and the FLUXNET dataset consists of observations from hundreds of sites. Such datasets are an invaluable source of information for studies focusing on land-atmosphere interactions. However, there can be spurious differences between sites (e.g. due to instrumentation or different data processing pipelines) and spurious temporal discontinuities in long time series which complicate the usage of these data. Such problems have been minimized in research infrastructures such as ICOS and NEON, where instrumentation and data processing have been standardized and sufficient metadata are available. However, for older data and data stemming from outside these standardized infrastructures these problems may still persist. This manuscript tries to find a solution to these problems with additional quality flagging of EC flux time series.
The manuscript is within the scope of BG (although it might fit better in AMT or GI due to its technical nature) and presents novel ideas for solving an existing problem (which would, however, not exist if all the needed metadata were available). The scientific quality of the work is good, but to a large extent the presentation quality is not. My main criticism is directed towards how the new algorithm is presented, in particular towards Sect. 2 in the manuscript. The section is very difficult to follow, and the reader needs to constantly jump back and forth between subsections when reading the text. For instance, when I reached Sect. 2.4 I realized that I had read the whole section already, since I needed to read it simultaneously with Sections 2.2 and 2.3 in order to understand the text. It took me quite a long time to understand how the whole algorithm works. Hence, I strongly suggest rethinking the structure of the text in Sect. 2.
We thank the reviewer for this assessment, which helps improve the quality of the manuscript. We agree with the reviewer that the presentation of the algorithm should be improved for better clarity. The current structure of bundling methodological details in section 2.4 was the result of internal iterations of the manuscript among co-authors, where it seemed that describing the many technical details within the earlier sections made it difficult to follow the general principle and approach. We will carefully revisit section 2 and incorporate important points from section 2.4 into the previous sections where necessary, along with improving clarity overall. We think a division of the methodological part into a first, more generic section that gives an overview of the algorithms and a more detailed section for readers interested in the details improves clarity, while we agree with the reviewer that it is important to find an adequate balance here.
I suggest accepting this manuscript after major revisions, mainly due to the way the algorithm is presented in the manuscript.
As a sidenote, this is one of those manuscripts where it would be really helpful for the reviewer if the underlying code and a small example dataset were available already during the review. However, currently this does not seem to be a prerequisite for manuscript submission in this journal, and hence I do not expect the authors to make such material available at this stage.
The code together with example data will be made available with the revised version.
SPECIFIC COMMENTS
Row 40: ”spectral corrections” plus other processing steps, e.g. coordinate rotation.
Agree. We will accommodate this accordingly.
Rows 75-78: I suggest that you add references for these research questions. They are clearly related to prior work. Now it reads as if the reader should already know that, e.g., interannual variability of sensible heat flux can be predicted better than interannual variability of latent heat flux, or that the reader knows what “the issue to model drought effects in GPP” is.
Agree. We will add context and references here.
Row 86: FLUXCOM not introduced anywhere, you need to briefly tell what it is.
Agree. We will follow the suggestion.
Rows 102-102: I have not used FLUXNET2015 data, is it originally with daily time step or did you average it to daily values? Please mention in the text.
Sorry for this missing information. The original data are (typically) half-hourly – we will clarify that.
Rows 102-103: What is fqcOK, a variable in FLUXNET2015 dataset? Consider removing it from the text and just write that you removed those days from the analysis for which more than 20 % of data were not measured or gapfilled with high confidence.
Agree. We will follow this suggestion.
Row 112: “expected relationship”, this can be dangerous as you are enforcing a certain dependence between variables. By doing this you may inadvertently quality flag (and screen out) scientifically interesting periods. This ought to be discussed in the text.
We apologize that our formulation was a bit misleading because the detection of an outlier from an expected relationship is only one indication for inconsistency, while the inconsistency score and the flagging consider multiple independent indications of inconsistency. We will clarify this in the revised version. As the reviewer suggests, we have discussed this rationale extensively in section 2.2 and 4.1.
Row 159: I suggest adding a site ID in parentheses after “United States”.
OK
Row 161: what is F15? Please clarify
F15 stands for the FLUXNET 2015 data set. We will avoid the abbreviation in the revised version.
Row 319: “This made the outlier score too sensitive to very small residuals even.” What does this mean? Please clarify
We thank the reviewer for pointing out the insufficient clarity. Basically, the problem is that during the calculation of the outlier score we divide by the interquartile range of the residuals, which can be (close to) zero in some occasional cases. In these rare cases of an extremely small interquartile range of residuals, the outlier score could become very large even for a tiny absolute residual. We will clarify this accordingly in the revised version.
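For illustration, one possible safeguard against a (near-)zero interquartile range in the denominator (the exact remedy used in the tool may differ; the floor choice below is our assumption):

```python
import numpy as np

def robust_outlier_score(residuals, floor_quantile=0.05):
    """Outlier score in IQR units, with a floor on the denominator so that
    an (almost) zero interquartile range of residuals cannot inflate the
    score for tiny absolute residuals."""
    q1, q3 = np.nanpercentile(residuals, [25, 75])
    iqr = q3 - q1
    floor = np.nanquantile(np.abs(residuals), floor_quantile)
    denom = max(iqr, floor, np.finfo(float).tiny)
    above = np.maximum(residuals - q3, 0.0)
    below = np.maximum(q1 - residuals, 0.0)
    return (above + below) / denom
```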
Row 349: Does the machine learning (ML) model predictive performance have an impact on these ML constraints? Do you assume that the model residuals are only related to noise in the measurements, i.e. the model is perfect? I suggest discussing this briefly in the text, e.g. here, in Sect. 2.4.8 or other suitable location. ML models typically perform worse for NEE than e.g. for GPP (see e.g. Tramontana et al., 2016) and hence this constraint might not work similarly for all variables.
These are all important considerations that were incorporated in the design of the method, while it seems that we did not communicate these well. We explicitly assume ML models to be imperfect models with underlying assumptions – for this reason the ML models are classified as soft constraints as mentioned in Table 2. We will clarify this further during the revisions.
The performance of a constraint, i.e. how well a variable can be predicted, essentially determines how meaningful or useful this constraint is for detecting inconsistencies. Let's consider two cases where the correlation is a) very high, implying a small variability of residuals, or b) very low, implying a large variability of residuals. The outlier score is calculated relative to the interquartile range of the residuals, which means that overall the magnitude of the outlier score is comparable for both cases and insensitive to performance. However, if we have a poor correlation for a constraint it means that we are very limited in detecting inconsistencies, implying a higher risk of false negatives (i.e. that ‘bad’ data may eventually not get flagged because they could not be identified as an ‘inconsistent’ outlier). Thus the higher the performance for a constraint, the lower the expected false negative rate. The false positive rate should not be affected by performance. For these reasons we have reported the median correlation for each chosen constraint in Table 2; these are all decent (>0.84) and are high for all machine learning constraints (>0.92). While the reviewer is correct that it is generally more difficult to model NEE compared to GPP, we are training models here separately for each site, where the performance is a lot better compared to predicting unseen sites as in Tramontana et al. (2016). Furthermore, it is important to stress once more that the machine learning constraints are defined as soft constraints, i.e. they provide only one indication of inconsistency, such that another, independent constraint needs to provide an additional indication of inconsistency to cause flagging.
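To illustrate the soft vs hard logic, a minimal sketch under our reading of Table 2 (assuming a single hard-constraint indication suffices, while soft constraints require at least two independent indications for the same data point):

```python
import numpy as np

def combine_constraints(hard_indications, soft_indications):
    """Combine per-constraint boolean outlier indications (non-empty lists
    of equal-length boolean arrays, one array per constraint) into a single
    flag: any hard-constraint indication flags a data point, whereas at
    least two independent soft-constraint indications are needed."""
    hard = np.asarray(hard_indications, dtype=bool)
    soft = np.asarray(soft_indications, dtype=bool)
    return hard.any(axis=0) | (soft.sum(axis=0) >= 2)
```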
We thank the reviewer again for these points that help us to improve clarity by thoroughly revising and restructuring the description of the methodology and its assumptions in section 2 and the discussion in section 4.
Row 357: you need to introduce scikit-learn
Ok, will do.
Rows 373-374: “Six variants of tCWDt C with different C values of 15,50,100,150,200,250 mm were calculated.” Already mentioned above on rows 368-369. Please remove
Ok.
Row 377 (Table 5): These variable names most likely follow FLUXNET2015, but you need to tell the reader what these variables are. Currently, they are not all introduced in the text.
Sorry, we will introduce the acronyms in the revised version.
Row 604: I would argue that false negative is not as bad as false positive (conservative approach).
In general, we agree with the reviewer, and in fact the design of the methodology tries to minimize false positives by introducing soft vs hard constraints, by requiring 2 or more indications of inconsistency from independent constraints, by accounting for heteroscedasticity, and by choosing a default value of nIQR=3. However, there are certainly also potential applications where a very strict data filtering may be desired, and we provide this flexibility by allowing the nIQR strictness parameter to be varied. We will expand on this and improve clarity accordingly in the revised version.
TECHNICAL CORRECTIONS
We thank the reviewer for the thorough check of our manuscript and for providing these technical corrections which we will implement all in the revised version of the manuscript.
Row 52: extra “)” after “Drought2018”
Row 77: replace ”.” with “?”
Row 87 and maybe elsewhere: you use both ONEFLUX and ONEFlux. Use only one of these two, check Pastorello et al. (2020).
Row 95: replace “FLUXNET” with “FLUXNET2015”
Row 102: replace “table” with “Table”.
Row 115: should “2.2.2” be replaced with “2.2.1”?
Row 232: replace “Turing” with “Turning”
Row 592 (Figure 14): Colorbar label is incomplete in the right plot
Citation: https://doi.org/10.5194/bg-2023-110-AC2