A robust data cleaning procedure for eddy covariance flux measurements

Integration of long-term eddy covariance (EC) flux datasets over regional and global scales requires a high degree of comparability of flux data measured at different stations, which entails not only similar-performing instrumentation and its appropriate deployment, but also standardized and reproducible data processing and quality control (QC) procedures. This work focuses on the latter topic and, in particular, on the development of a robust data cleaning procedure. The proposed strategy includes a set of tests aimed at detecting the presence of specific sources of systematic error in the data, as well as an outlier detection procedure aimed at identifying aberrant flux values. Results from tests and outlier detection are integrated in such a way as to leave a large degree of flexibility in the choice of tests and of test threshold values without losing efficacy and, at the same time, to avoid the use of subjective criteria in the decision rule that specifies whether to retain or reject flux data of dubious quality. Test development was rooted in advanced time series analysis techniques that consider the stochastic properties of both raw, high-frequency EC data and flux time series, such as complex dynamics, high persistence and the possible presence of stochastic trends. The performance of each proposed test is evaluated by means of Monte Carlo simulations on synthetic datasets, whereas their impact on observed time series was evaluated on a selection of EC datasets distributed by the ICOS research infrastructure. Simulation results showed that the proposed tests perform better than alternative existing QC routines, with lower false positive and false negative error rates. The application of the proposed tests to real datasets led to an effective cleaning of EC flux data while retaining the maximum number of good quality data.
Although there is still room for improvement, in particular with the development of new QC tests, we think that the proposed data cleaning procedure can serve as a basis towards a unified QC strategy for EC datasets which i) includes only completely data-driven routines and is therefore suitable for automatic and centralized data processing pipelines, ii) guarantees results reproducibility and iii) is flexible and scalable to accommodate new and additional tests, which also makes the approach suitable for other greenhouse gases.

https://doi.org/10.5194/bg-2019-270 — Preprint. Discussion started: 25 July 2019. © Author(s) 2019. CC BY 4.0 License.


Introduction
In the last decades, the number of Eddy Covariance (EC) stations for measuring biosphere-atmosphere exchanges of energy and greenhouse gases (mainly CO2 and H2O, followed by CH4 and N2O) increased worldwide, contributing to the expansion of regional (e.g. ICOS, AmeriFlux, NEON, TERN) and global (e.g. FLUXNET) monitoring networks. Integration of long-term flux datasets over regional and global scales enables the evaluation of climate-ecosystem feedbacks and the study of the complex interactions between terrestrial ecosystems and the atmosphere. The use of the EC technique involves a set of complex choices. Selection of the measurement site and of the instrumentation, design of the data acquisition strategy, deployment and maintenance of the EC system as well as design of the data processing pathway are only some examples of such choices.
Over time, the EC community has developed guidelines and best practices aimed at "standardizing" the methodology, with the overarching goal of increasing comparability and integrability of flux datasets across different stations, thereby improving the robustness and accuracy of the resulting syntheses, analyses and models. Examples of efforts in this direction can be found in the EC handbooks by Lee et al. (2005) and by Aubinet et al. (2012), in publications describing standardized EC systems for entire networks (e.g. Franz et al., 2018) and in software intercomparisons aimed at reconciling the complex EC processing chain and at explaining/quantifying any discrepancies (e.g. Mauder et al., 2006; Fratini and Mauder, 2014; Metzger et al., 2017). As part of this effort, large regional networks have recently invested significant resources in the definition of protocols and measurement methods (e.g. Sabbatini and Papale, 2017; Rebmann et al., 2018; Nicolini et al., 2018) and in the development of centralized data processing pipelines (e.g. Sabbatini et al., 2018).
An integral part of the EC method is the definition of Quality Assurance (QA) and Quality Control (QC) procedures. Quality assurance refers to the set of measures aimed at preventing errors and therefore concerns the design of the experimental setup, selection of the site, choice of instrumentation and its deployment, and maintenance scheduling. Quality control, instead, refers to the ensemble of procedures aimed at: (1) identifying and eliminating errors in resulting datasets (i.e. data cleaning) and (2) characterizing the uncertainty associated with flux measurements. This paper is concerned with the definition of QC procedures for EC datasets, in particular for error identification and data cleaning.
In the context of EC, a thorough QC scheme should aim at detecting errors caused by instrumental issues as well as by violations of the assumptions underlying the method (see Foken et al., 2012b; Richardson et al., 2012), which inevitably occur during the measuring campaign. Surface heterogeneities and anisotropies, occurrence of poorly developed turbulence conditions, advection fluxes as well as instrumental malfunctions (leading to e.g. spikes or discontinuities in the raw data), miscalibrations or operation below the detection limit of the instruments are common sources of systematic errors introducing biases in resulting fluxes. Systematic errors can also occur during the data processing stage, due to inappropriate choices of the processing options, poor parameterization of corrections (e.g. a poor estimation of spectral attenuations in an EC system) or to edge cases that lead algorithms to diverge (e.g. when the denominator of a ratio tends to zero). An effective QC procedure should in principle be able to identify all such occurrences and hence allow the elimination of the corresponding flux values.
It is worth noting that EC measurements, as any measurement process, are also subject to a number of sources of unavoidable random errors causing noise in flux data, due for example to sampling a 3D stochastic process at a single point, or to other stochastic disturbances such as instrumental noise. Formally, an observed flux value can be written as

y_t^obs = y_t + e_t,

where e_t is an error term that can be further specified as

e_t = β_t + ε_t,  with  β_t = Σ_i β_i,t,

where β_t represents the total systematic error given by the sum of all individual systematic error components, β_i,t, biasing the variable, and ε_t is the random measurement error. The standard deviation characterizing the distribution of random errors, σ_ε, provides a measure of the random uncertainty.
With this formalization in mind, QA procedures are aimed at preventing all sources of systematic errors (β_i → 0 ∀i) and at making sure σ_ε is as small as possible, while QC procedures are aimed at detecting and rejecting any data points y_t^obs for which the effect of systematic errors on the flux estimate is not negligible (i.e. β_t ≠ 0).
To avoid confusion, it is worth stressing the difference between random and systematic errors affecting EC flux measurements. By definition, systematic errors in an individual flux measurement are those due to sources that are avoidable, at least in principle. As an example, avoidable instrumental errors can lead to spikes in the raw, high-frequency time-series. Such spikes, regardless of their distribution and amplitude, lead to a biased flux and are therefore to be regarded as systematic errors. On the contrary, random errors are by definition unavoidable and undetectable. It follows from those definitions that long-term biases in flux time-series (say, a systematic underestimation) can only be caused by systematic errors. However, systematic errors do not necessarily lead to long-term biases, because they can vary with time, both in absolute value and relative to the true flux values. Random errors, instead, never lead to long-term biases (Moncrieff et al., 1996; Richardson et al., 2012).
QC procedures developed to identify EC fluxes affected by significant errors can be broadly classified in partially and completely data-driven (or automatic) approaches. The former entail at least some degree of subjective evaluation in the decision process. They rely on the ability of the analyst to make a final call on whether a data point should be retained or rejected, allowing the researcher to exploit the accumulated knowledge about the site and the dataset, in order to discern what is physically or ecologically implausible. Such a call is usually made on the basis of some preliminary error detection algorithm and, in practice, is typically performed via visual inspection. As an example, Vickers and Mahrt (1997) proposed a suite of tests to detect problems with high-frequency raw data. There, each test results in a label for the time-series, such as retain or potentially reject, and the investigator is required to make the final call on the latter. The drawback of such procedures is that subjective evaluation unavoidably introduces individual biases, which weaken the robustness and the objectivity of results. In addition, the fact that subjective evaluation is usually performed via visual inspection strongly affects traceability of the data processing history, severely hindering reproducibility of resulting datasets. Completely data-driven procedures, on the contrary, sacrifice the benefit of first-hand knowledge of the dataset to gain high levels of reproducibility and objectivity, by means of fully automated QC algorithms and decision trees. An important advantage of such procedures is that they strongly reduce the possibility of introducing selective subjective biases when cleaning datasets across multiple stations, thereby contributing to the standardization we referred to at the beginning.
Moreover, fully automated algorithms are preferable when the processing involves the ingestion of massive amounts of data and visual inspection becomes prohibitive because it is extremely time-consuming. However, the construction of a completely data-driven procedure that accounts for and exploits all available knowledge of the system is complex and requires care in test development to prevent high false positive rates.

Usually, a QC procedure is comprised of multiple routines, each of which evaluates data with respect to a specific source of systematic error. In the case of EC, we therefore have routines to identify instrumental issues, severe violations of the method assumptions or issues with the data processing pipeline. Obviously, there may be several routines for each category, e.g. routines to look at specific types of instrumental issues (e.g. spikes, reached detection limit, implausible discontinuities in the time series, etc.). Once a set of QC tests has been selected, results of these tests must be combined, in order to derive an actionable label to reject or retain individual flux data. One of the most popular QC classification schemes was proposed by Mauder and Foken (2004) and was more recently adopted, with some modifications, by Mauder et al. (2013). This classification scheme establishes a data quality flag on the basis of a combination of results from two tests, proposed by Foken and Wichura (1996), referred to as the Instationarity and Integral Turbulence Characteristics tests. Results of the two tests are combined based on a predefined policy to derive the final classification, which assigns the overall flag 0 to high-quality data, 1 to intermediate quality data, and 2 to low-quality data. Recommendations are that: (a) low-quality data should be rejected; (b) the use of intermediate quality data should be limited to specific applications, e.g. long-term budget calculation, while it should be avoided in more fundamental process studies; and (c) high-quality data can be used for any purpose. Other quality classification schemes of this sort have been proposed and used (e.g. Kruijt et al., 2004; Göckede et al., 2008; Thomas et al., 2009).
Two aspects of this approach to QC classification are of concern. Firstly, the combination of the results of individual tests into an overall flag appears to be somewhat arbitrary, and no directives are provided (nor are they easily imagined) as to how to extend the combination to integrate results of additional or alternative tests. More fundamentally, with the methods above the final aim of the process is to attach a quality statement to the data via a flag, as opposed to identifying data points affected by severe errors, therefore confounding the two processes, which we deem distinct, of cleaning the dataset and of characterizing its quality. We suggest, instead, that a data cleaning procedure for EC datasets should exclusively aim at identifying flux values affected by errors large enough to warrant their rejection. It should, therefore, lead to a binary classification, such as retain or reject. As for assigning a quality level to the retained data, we propose that the random uncertainty is the appropriate metric. In fact, as a general principle, the larger the random uncertainty, the larger the amount of measurement error and, consequently, the lower the quality of the data (for an in-depth discussion of the random uncertainty in EC, see e.g. Richardson et al. (2012), and references therein). The link between data quality and random uncertainty has already been investigated by Mauder et al. (2013). Using the turbulence sampling error proposed by Finkelstein and Sims (2001) as a measure of the flux random uncertainty, they found that highest-quality data are typically associated with relative random uncertainties of less than 20%, whereas intermediate quality data are typically associated with random uncertainties between 15% and 50%. However, explicitly avoiding the binning into intermediate quality and high-quality data allows us to avoid uncertain recommendations as to what application a given intermediate quality datum should be used for.
Rather, we recommend taking into account the random uncertainty throughout the data analysis and synthesis pipeline and letting the application itself (e.g. data assimilation into modelling frameworks) reveal whether the level of uncertainty of each data point in the dataset matters or not.
Bearing this in mind, the aim of this paper is thus to present a robust data cleaning procedure which i) includes only completely data-driven routines and is therefore suitable for automatic and centralized data processing pipelines; ii) guarantees results reproducibility; iii) is flexible and scalable to accommodate addition or removal of routines from the proposed set; iv) results in a binary label such as retain or reject for each data point, decoupling the data cleaning procedure from its quality evaluation.

The proposed data cleaning strategy
Since quantifying the bias affecting half-hourly EC fluxes is not possible in the absence of reference values, the only option is to ascertain the occurrence of specific sources of systematic error which, in turn, are assumed to introduce biases in flux estimates. Therefore, the proposed data cleaning procedure includes i) a set of tests, each aimed at detecting the presence of a specific source of systematic error, and ii) an outlier detection procedure aimed at identifying aberrant flux values. As described below, results from tests and outlier detection are integrated in such a way as to leave a large degree of flexibility in the choice of tests and of test threshold values without losing efficacy, while striving to avoid the use of subjective criteria in the decision rule that specifies whether to ultimately retain or reject flux data of dubious quality.
For each test, two threshold values are defined, designed to minimize false positive and false negative error rates. Comparing the test statistic with such thresholds, each test returns one of three possible statements:
• SevEr: if the test provides strong evidence of the presence of a specific source of systematic error.
• ModEr: if the test provides only moderate evidence of the presence of a specific source of systematic error.
• NoEr: if the test provides no evidence of the presence of a specific source of systematic error.
The following subsections detail the set of tests used in the proposed data cleaning procedure and summarized in Table 1.
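The two-threshold logic can be sketched as follows. This is a minimal illustration: the threshold values below are placeholders, and the assumption that larger statistics mean stronger evidence of error does not hold for every test in Table 1.

```python
# Sketch of the three-statement scheme: each test compares its statistic
# against two thresholds (a ModEr limit and a SevEr limit). The values
# used in the example calls are placeholders, not prescribed thresholds.

def assign_statement(stat, moder_limit, sever_limit):
    """Map a test statistic to one of the three QC statements.

    Assumes larger statistic values indicate stronger evidence of error.
    """
    if stat > sever_limit:
        return "SevEr"   # strong evidence of a systematic error
    if stat > moder_limit:
        return "ModEr"   # moderate evidence of a systematic error
    return "NoEr"        # no evidence of a systematic error

print(assign_statement(120.0, 30.0, 100.0))  # → SevEr
print(assign_statement(50.0, 30.0, 100.0))   # → ModEr
print(assign_statement(10.0, 30.0, 100.0))   # → NoEr
```

Keeping each test behind this common three-statement interface is what makes the decision stage below independent of the number and nature of the tests.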
Depending on their specification, tests are applied either to individual high-frequency raw time series, to pairs of variables, to their statistics over the flux averaging interval, or to derived variables (see Sec. 2.2 for details). For example, instrumental problems are typically searched for on a per-variable basis, while stationarity conditions are assessed by looking at the covariance between two variables. The test result is then inherited by all fluxes to which the variable or variable pair is relevant. As an example, the sensible heat flux H inherits the test results assigned to (at least) the vertical wind speed w, the sonic temperature T_s and the pair w/T_s.
The tests proposed and detailed in the following subsections have been designed and optimized for CO2/H2O and energy EC fluxes over natural ecosystems based on current knowledge. In different conditions (gas species, ecosystem, instrumentation, etc.), however, new tests and further optimization of the ones proposed here may be needed. As mentioned above, the data cleaning procedure is designed to naturally accommodate any number of tests that generate a statistic that can be compared against threshold values.
Test results are used as inputs to the data cleaning procedure, which includes two stages (see Fig. 1). In the first stage, fluxes that inherited at least one SevEr statement are rejected, while fluxes that inherited no SevEr statements and any number of ModEr statements are retained. Importantly, the statements resulting from different tests (or from the same test as applied to different variables) are not combined in any way. The rationale is that a single SevEr statement is sufficient to establish the rejectability of a data point. Conversely, no matter how many ModEr statements are inherited by a flux value, those statements do not accrue to the status of SevEr and hence do not lead per se to the rejection of the measurement. This is for two reasons: i. test independence cannot in general be guaranteed, i.e. the same source of error can affect the statistics of more than one test; and ii. nothing can be said about how different systematic errors combine, e.g. whether they tend to cumulate or cancel out. This policy has the convenient side effect of making the application and interpretation of tests completely independent of each other, which makes the overall procedure extremely flexible and scalable to accommodate new tests beyond what we propose later in this paper (e.g. tests for less investigated gas species): all that is required is for the new test to return statements such as SevEr, ModEr (if applicable) and NoEr.
In the second stage, flux data that inherited no SevEr statements are subjected to an outlier detection procedure, and only flux data that are both detected as outliers and inherited at least one ModEr statement are conclusively rejected. This implies that data points that inherited any number of ModEr statements but were not detected as outliers, as well as outliers which showed no evidence of systematic errors, are retained in the dataset and can be used for any analysis or modeling purpose. In other words, only data points that provide strong evidence of error (either because of a SevEr statement, or because of a ModEr statement combined with detection as an outlier) are rejected, while peculiar data points, which would look suspicious under visual inspection and are possibly identified as outliers, but only inherited NoEr statements, are retained.
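The two-stage decision rule can be sketched as follows. This is a minimal illustration: `statements` and `is_outlier` are hypothetical inputs standing for the statements a flux inherited from all tests and for the result of the (separate) outlier detection step.

```python
# Minimal sketch of the two-stage decision rule described in the text.

def clean_flux(statements, is_outlier):
    """Return 'reject' or 'retain' for a single flux value."""
    # Stage 1: a single SevEr statement is sufficient for rejection;
    # ModEr statements are never combined or accumulated.
    if "SevEr" in statements:
        return "reject"
    # Stage 2: reject only if the flux is BOTH an outlier and carries
    # at least one ModEr statement.
    if is_outlier and "ModEr" in statements:
        return "reject"
    return "retain"

print(clean_flux(["NoEr", "SevEr"], False))           # → reject
print(clean_flux(["ModEr", "ModEr", "NoEr"], False))  # → retain
print(clean_flux(["ModEr"], True))                    # → reject
print(clean_flux(["NoEr"], True))                     # → retain
```

The last call shows the deliberate design choice discussed above: an outlier that carries no evidence of systematic error is retained.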

Detection of sources of systematic error affecting EC datasets
The possible sources of systematic errors are divided into three categories: 1) instrumental issues, 2) poorly developed turbulence regimes and 3) non-stationary conditions.

Detection of instrumental issues
Modern EC instruments can detect and report malfunctions occurring during the measurement process via diagnostic variables.
However, there are situations where a measurement is affected by an error but is still valid from the instrument's perspective, and for this reason it is not flagged by the instrument diagnostics. As an example, a physical quantity may have variations that are too small to detect, given the settings or the specifications of the instrument. This is the case of a time series of temperature that varies very little during a calm, stable night; the measurements may be affected by a low-resolution problem, where the quantization of the measurement due to the intrinsic resolution of the instrument becomes evident and leads to a reduced variability of the underlying signal. In this case, from the measurement perspective there is nothing wrong with the measured quantity and diagnostics would not indicate any issues. In addition, with some instrumentation, especially older models, or often when collecting data in analog format, diagnostic information is simply not available. It is therefore useful to devise tests to detect instrument-related situations that are likely to generate systematic errors in resulting fluxes. In the following, we describe a series of routines to identify the most frequent errors caused by instrumental problems detectable on the raw, high-frequency time series (most of which were already discussed in Vickers and Mahrt, 1997, hereafter VM97).

[Table 1, fragment: Homogeneity test of differenced data (e.g. sudden shift in mean); 4) Poorly developed turbulence regimes — Integral Turbulence Characteristics (%) by Foken and Wichura (1996), thresholds ≤ 30 / ≤ 100 / > 100; 5) Non-stationary conditions — Non-stationary ratio by Mahrt (1998).]
Detection of unreliable records caused by system malfunction and disturbances

EC fluxes are calculated starting from the covariance between the vertical wind speed, w, and the scalar of interest, computed over a specified averaging interval, typically 30 or 60 minutes. For the calculated covariance to be representative of the entire interval, the actual number of available raw data records should be close enough to the expected number (e.g. 18000 records for fluxes over 30 minutes from raw data sampled at 10 Hz). If the number of available records is too small, the corresponding flux estimate may be significantly biased.

Data records can be unavailable for covariance computation for a variety of reasons. First, records may simply be missing because of problems during data acquisition. In addition, specific values may be eliminated if:
• Instrumental diagnostics flag a problem with the measurement system.
• Individual high-frequency data points are outside their physically plausible range or are identified as spikes (in this work we used the despiking algorithm proposed by VM97).
• Data were recorded during periods when the wind was blowing from directions known to significantly affect the turbulent flow reaching the sonic anemometer sampling volume, e.g. due to the interference of the anemometer structure itself (C-clamp models) or to the presence of other obstacles.
• The angle-of-attack of individual wind measurements is beyond the calibration range specified by the sonic anemometer manufacturer.

Note that the criteria above may apply to single variables, groups of variables (e.g. anemometric variables) or to entire records.
Although covariances can also be computed on time series with gaps, some of the procedures involved in the typical data processing do require continuous time series (e.g. the Fast Fourier Transform to compute power spectra and cospectra, or the stationarity tests that compute statistics on sub-periods as short as 50 seconds, Mahrt, 1998). The performance of such procedures may depend on the technique used to impute missing data (i.e. fill the gaps). It is therefore useful to establish criteria for the appropriate use of imputation procedures.
Typically, gaps in raw time series are filled using linear interpolation. While this algorithm provides obvious computational and implementation advantages, it should be noted that it only performs satisfactorily when the time series is dominated by low-frequency components, while it can introduce biases in time series characterized by significant high-frequency components, as is the case with EC data. Its application should thus be limited to very short gaps. The pattern of missing data plays a role too: when gaps occur simultaneously across all variables, linear interpolation can lead to biases in resulting covariances even for short gaps. Such biases are linearly proportional to the amount of missing data, and relatively larger for smaller fluxes. These considerations also apply to other interpolation methods such as splines and Last Observation Carried Forward.
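The limitation of linear interpolation described above can be illustrated on a synthetic 10 Hz series; the sine-plus-noise signal and the 3 s gap are arbitrary choices for the illustration.

```python
import numpy as np

# Illustration of linear gap-filling in a synthetic 10 Hz raw series.
# np.interp fills the gap by connecting the values at its edges, which
# suppresses the high-frequency content inside the gap.
rng = np.random.default_rng(0)
t = np.arange(18000) / 10.0                      # 30 min at 10 Hz
y = np.sin(2 * np.pi * t / 60.0) + rng.normal(0, 0.3, t.size)

y_gappy = y.copy()
y_gappy[5000:5030] = np.nan                      # a 3 s gap

missing = np.isnan(y_gappy)
y_filled = y_gappy.copy()
y_filled[missing] = np.interp(t[missing], t[~missing], y_gappy[~missing])

# The filled segment is a straight line: its variance is far below that
# of the original noisy signal over the same span.
print(np.var(y_filled[5000:5030]) < np.var(y[5000:5030]))  # → True
```

In a flux calculation, this locally "flattened" signal is what biases the covariance when gaps are long or aligned across variables.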
With this in mind, indicating with FMR the fraction of missing records and with LGD the longest gap duration in the time series involved in the covariance estimation, we suggest the following classification criteria:
• SevEr: if FMR > 15% or LGD > 180 sec.
• ModEr: if 5% < FMR ≤ 15% or 90 sec < LGD ≤ 180 sec.
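A sketch of the FMR/LGD screening follows. The ModEr band is the one given above; the SevEr bounds (beyond 15% and 180 s) are an assumption, taken as the complements of that band.

```python
import numpy as np

# Sketch of the FMR/LGD screening for a 30 min, 10 Hz series, assuming a
# boolean mask `missing` that marks unavailable records. The SevEr bounds
# are assumed to be the complements of the ModEr band in the text.

def fmr_lgd_statement(missing, fs=10.0):
    missing = np.asarray(missing, dtype=bool)
    fmr = missing.mean() * 100.0          # fraction of missing records, %
    longest = run = 0                     # longest run of missing records
    for m in missing:
        run = run + 1 if m else 0
        longest = max(longest, run)
    lgd = longest / fs                    # longest gap duration, seconds
    if fmr > 15.0 or lgd > 180.0:
        return "SevEr"
    if fmr > 5.0 or lgd > 90.0:
        return "ModEr"
    return "NoEr"

mask = np.zeros(18000, dtype=bool)
mask[:1200] = True                        # one 120 s gap (FMR ~ 6.7%)
print(fmr_lgd_statement(mask))            # → ModEr
```

Both criteria are deliberately evaluated together, since a series can have a small FMR but a single long gap, or many short gaps adding up to a large FMR.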

Detection of low signal resolution problems
High-frequency EC data can be affected by low signal resolution problems (see VM97 for a detailed description). [...] To disentangle these two situations, we evaluate the discrepancy between the w-scalar cross-correlation functions (CCF) of [...]. The criteria used to assign a statement are as follows: [...]

Detection of aberrant structural changes in time series dynamics

EC time series can be subject to regime changes such as sudden shifts in the mean and/or in the variance. In some circumstances, these are attributable to natural causes, as in cases of intermittent turbulence (Sandborn, 1959) or of wind pattern changes over a heterogeneous or anisotropic footprint. In other cases, such changes are artefacts generated by instrumental malfunctions. Aberrant structural changes in mean and variance due to either environmental causes or measurement artefacts have similar effects on time series dynamics (which makes them difficult to disentangle) and lead to violation of the assumption of stationarity underlying the EC method. They should therefore be treated as sources of systematic errors. Here we propose three new tests to identify such situations, whose rejection rules are based on objective criteria but which, notably, do not discern natural changes from measurement artefacts.
The first test considers the homogeneity of the distribution of fluctuations (y′_t = y_t − ȳ, where ȳ is the mean level estimated according to the averaging rule adopted in the covariance calculation, e.g. block average or linear detrending) and consists in evaluating the percentage of data in the tails with respect to the bounds imposed by Chebyshev's inequality. Irrespective of the PDF generating the data, Chebyshev's inequality provides an upper bound to the probability that absolute deviations of a random variable from its mean exceed a given threshold value. As an example, it imposes that a minimum of 75% of values must lie within two standard deviations of the mean, and 89% within three standard deviations.
When the inequality is violated, it is a symptom of data inhomogeneity due to structural changes in the time series dynamics (e.g. an abrupt change in the mean level, or a sudden upward or downward trend, which may introduce significant bias in the estimation of the mean values and, consequently, of the covariances). In the following we indicate the test statistic related to the homogeneity of the distribution of fluctuations as HF_B, where the subscript B indicates the sigma-rule adopted to define the boundary region (e.g. ±5σ, see later).
The second test makes use of the same rule, but evaluates the homogeneity of the distribution of first-order differenced data, ∆y_t = y_t − y_{t−1} (the corresponding statistic is denoted HD_B). Besides highlighting other useful properties, differencing a variable eliminates any trends present in it, whether deterministic (e.g. linear) or stochastic (Box et al., 2015), and the resulting time series is always stationary. Differencing acts as a signal filtering procedure, and the transformed data can better highlight the characteristics of the measurement error process, whose variance, under second-order stationary conditions, should be constant over time. Therefore, when the inequality is violated, it is a symptom of data inhomogeneity mainly due to changes in variance (heteroskedasticity). Based on the upper bounds imposed by Chebyshev's inequality for 5σ and 10σ, the first two tests are summarized in the following rule:
• SevEr: if HF_5 > 4% or HD_5 > 4% or HF_10 > 1% or HD_10 > 1% (i.e., more than 4% of fluctuations or differenced values are beyond the ±5σ limits, or more than 1% of them are beyond the ±10σ limits).
As a robust estimate of σ, we used the Rousseeuw and Croux (1993) Q_n estimator, corresponding approximately to the first quartile of the sorted pairwise absolute differences. Compared to other robust scale estimators, such as the median absolute deviation about the median (MAD) and the interquartile distance (IQD), it is a location-free estimator, i.e. it does not implicitly rely on a symmetric noise distribution. Similar to MAD, its breakdown point is 50%, i.e. it becomes biased when 50% or more of the data are large outliers; however, its efficiency is larger, especially when identical measurements occur, e.g. due to low signal resolution problems.
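The two Chebyshev-based tests and the Q_n scale can be sketched together as follows. This is a simplified O(n²) version of Q_n without the finite-sample correction factors of the published estimator, and the series is kept short so the quadratic pairwise computation stays cheap.

```python
import numpy as np

def qn_scale(x):
    """Simplified Rousseeuw-Croux Qn: 2.2219 times the k-th smallest
    pairwise absolute difference, with k = C(h, 2), h = n//2 + 1
    (finite-sample correction factors omitted)."""
    x = np.asarray(x, dtype=float)
    n = x.size
    diffs = np.abs(x[:, None] - x[None, :])[np.triu_indices(n, k=1)]
    h = n // 2 + 1
    k = h * (h - 1) // 2
    return 2.2219 * np.partition(diffs, k - 1)[k - 1]

def chebyshev_statement(y):
    """HF/HD homogeneity tests of the text, using the Qn scale."""
    y = np.asarray(y, dtype=float)
    pct = {}
    for name, x in (("HF", y - y.mean()), ("HD", np.diff(y))):
        s = qn_scale(x)
        dev = np.abs(x - np.median(x))
        pct[name + "5"] = 100.0 * np.mean(dev > 5 * s)
        pct[name + "10"] = 100.0 * np.mean(dev > 10 * s)
    severe = (pct["HF5"] > 4 or pct["HD5"] > 4
              or pct["HF10"] > 1 or pct["HD10"] > 1)
    return "SevEr" if severe else "NoEr"

rng = np.random.default_rng(2)
y = rng.normal(0, 1, 600)              # well-behaved series
print(chebyshev_statement(y))          # → NoEr

y_shift = y.copy()
y_shift[300:] += 25.0                  # abrupt level shift mid-interval
print(chebyshev_statement(y_shift))    # → SevEr
```

The shifted case illustrates why a robust scale matters: the ordinary standard deviation would be inflated by the shift itself, masking the inhomogeneity that Q_n exposes.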
The third test consists in evaluating the kurtosis index of the differenced data ∆y_t (hereafter denoted the KID test). The kurtosis index is defined as the standardized fourth moment of the variable about the mean. Because variations about the mean are raised to the fourth power, the kurtosis index is sensitive to tail values of the distribution and can therefore be used to characterize them (Westfall, 2014). Since the tails of a distribution represent events outside the "normal" range, a higher kurtosis means that more of the variance is contributed by infrequent extreme observations (i.e. anomalies) as opposed to frequent, modestly sized deviations. The sensitivity to tail values and the application to differenced data make the KID a very useful tool to detect a range of instrumental problems, which include abnormal changes in mean and variance, but also the presence of undetected residual spikes and drop-out events (in these situations the differenced time series shows a spike at each change-point).
To eliminate the sensitivity of the KID to the presence of repeated values (which become zeros in the differenced variable), such values were not included in the KID estimation. Bearing in mind that the kurtosis index for a Gaussian and a Laplace PDF is equal to 3 and 6, respectively, we chose reasonably large threshold values to make sure we select only time series characterized by heavy-tailed distributions, as is the case for data contaminated by extreme events representative of the aforementioned problems. Namely, we suggest that the following criterion be applied:
• SevEr: if KID > 50.
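A sketch of the KID test as described above; only the SevEr threshold given in the text is applied, and the synthetic spike amplitude is an arbitrary choice for the illustration.

```python
import numpy as np

# Sketch of the KID test: kurtosis index of the first-differenced series,
# with zeros (from repeated values) excluded as described in the text.

def kid_statement(y):
    d = np.diff(np.asarray(y, dtype=float))
    d = d[d != 0.0]                            # drop repeated-value zeros
    d = d - d.mean()
    kid = np.mean(d**4) / np.mean(d**2) ** 2   # standardized 4th moment
    return ("SevEr" if kid > 50 else "NoEr"), kid

rng = np.random.default_rng(3)
y = rng.normal(0, 1, 18000)
print(kid_statement(y)[0])       # Gaussian noise: KID near 3 → NoEr

y_spiky = y.copy()
y_spiky[::1000] += 200.0         # residual spikes every 100 s
print(kid_statement(y_spiky)[0]) # heavy-tailed differences → SevEr
```

Each spike produces two large values in the differenced series, which is why even a handful of residual spikes drives the KID far above the threshold.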

Detection of poorly developed turbulence regimes
One of the assumptions underlying the EC method is the occurrence of well-developed turbulence conditions. A widely used test to assess the quality of the turbulence regime is the Integral Turbulence Characteristics (ITC) test proposed by Foken and Wichura (1996, hereafter FW96). Based on the flux-variance similarity theory, the test consists in quantifying the relative change of the ratio between the standard deviation of a turbulent parameter and its corresponding turbulent flux (the "integral characteristic") with respect to the same ratio estimated by a suitable parameterization. The ITC can be calculated for each wind component and any scalar (temperature and gas concentration), but for the purpose of data cleaning, the ITC for the vertical wind speed is used (see Foken et al., 2012b). This is defined as

ITC_w = |[(σ_w/u_*)_model − (σ_w/u_*)_measured] / (σ_w/u_*)_model| · 100,

where u_* is the friction velocity and (σ_w/u_*)_model is obtained from a stability-dependent parameterization. The criteria adopted to assign SevEr, ModEr and NoEr statements are based on the threshold values proposed by Mauder and Foken (2004, Table 16). In particular:
• NoEr: if ITC_w ≤ 30%.
• ModEr: if 30% < ITC_w ≤ 100%.
• SevEr: if ITC_w > 100%.
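The ITC_w comparison can be sketched as follows. The constant 1.3 used for the parameterized (σ_w/u_*) is only a placeholder for the near-neutral case, not the full stability-dependent FW96 parameterization; the 30%/100% thresholds follow the Table 1 fragment above.

```python
# Sketch of the ITC_w test: relative deviation (%) of the measured
# sigma_w/u* from a parameterized model value. The model ratio 1.3 is a
# placeholder for the near-neutral case only.

def itc_w(sigma_w, ustar, model_ratio=1.3):
    measured = sigma_w / ustar
    return 100.0 * abs((model_ratio - measured) / model_ratio)

def itc_statement(sigma_w, ustar):
    itc = itc_w(sigma_w, ustar)
    if itc > 100.0:
        return "SevEr"
    if itc > 30.0:
        return "ModEr"
    return "NoEr"

print(itc_statement(0.39, 0.30))  # ratio 1.30, ITC = 0%  → NoEr
print(itc_statement(0.20, 0.30))  # ratio 0.67, ITC ≈ 49% → ModEr
print(itc_statement(1.00, 0.30))  # ratio 3.33, ITC ≈ 156% → SevEr
```

In practice the model ratio must be computed per averaging interval from the stability parameter, so the statement depends on both the turbulence statistics and the meteorological conditions.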

Detection of non-stationary conditions
The working equation of turbulent fluxes as the (appropriately scaled) w-scalar covariance is based on a simplification of the underlying conservation equations that holds only under stationary conditions; when this assumption is violated, covariance estimates are biased. To avoid such biases, a possible approach is to preliminarily remove the source of non-stationarity before calculating covariances.
This way, the amount of data rejected for non-stationary conditions can be limited. To this end, procedures based on linear detrending or running-mean filtering are often used during the raw data processing stage. However, their application can be ineffective, for example when linear detrending is applied to nonlinear trends (see the Supplementary Material (SM), Appendix B), and can even introduce further biases when the data are truly stationary or when the nonstationarity is of a more complex nature (see Rannik and Vesala, 1999; Donateo et al., 2017, for a comparison of detrending methods). For this reason, an alternative approach is to remove periods identified as non-stationary.
A few tests have been proposed for EC data, of which we discuss two popular ones. A widely used statistic, proposed by FW96, compares the mean of the covariances computed over I non-overlapping 5-min intervals with the covariance computed over the whole 30-min period: S_FW96 = |(mean_i cov_i − cov_30min) / cov_30min| · 100 (Eq. 4). A major problem with Eq. 4 is that when the covariance at the denominator is close to zero, the test statistic approaches infinity, making the ratio unstable. As a consequence, the FW96 test is disproportionately sensitive to fluxes of small magnitude (see also Pilegaard et al., 2011).

An alternative test was proposed by Mahrt (1998, hereafter M98). This test measures nonstationarity by comparing the effects of coherent behavior with the inherent random variability in the turbulent signals. As with FW96, the time series is divided into I non-overlapping intervals of 5 minutes; in this case, however, each interval is further divided into J = 6 non-overlapping sub-intervals of 50 seconds. The test statistic is defined as S_M98 = σ_B / (σ_W / √J), where σ_B is a measure of the variability of the covariance between different intervals, calculated as the standard deviation of the I interval covariances, σ_B = sqrt[(1/(I−1)) Σ_i (cov_i − mean(cov))²], and σ_W is a measure of the average variability within each interval i, given by the mean of the standard deviations σ_i of the J sub-interval covariances, σ_W = (1/I) Σ_i σ_i. Compared to the FW96 test, the non-stationarity ratio by M98 is always well-behaved, its denominator being strictly positive (an average of standard deviations). For this reason, the test proposed by M98 was selected for this work. Based on a performance evaluation described later in this paper (see Sect. 3), we suggest the following flagging criteria: • ModEr: if 2 < S_M98 ≤ 3; • SevEr: if S_M98 > 3.
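The two statistics can be sketched as follows; the √J normalization of S_M98 is our reading of Mahrt (1998), and the synthetic 30-min record (with the w–scalar correlation vanishing halfway through) is invented for illustration:

```python
import numpy as np

def fw96(w, c, n_int=6):
    """FW96 stationarity statistic (%): relative deviation of the mean of
    sub-interval covariances from the whole-record covariance."""
    cov_all = np.cov(w, c)[0, 1]
    subs = [np.cov(wi, ci)[0, 1] for wi, ci in
            zip(np.array_split(w, n_int), np.array_split(c, n_int))]
    return abs((np.mean(subs) - cov_all) / cov_all) * 100

def m98(w, c, n_int=6, n_sub=6):
    """M98 nonstationarity ratio: between-interval std of the covariances
    over the scaled average within-interval variability."""
    cov_i, sig_i = [], []
    for wi, ci in zip(np.array_split(w, n_int), np.array_split(c, n_int)):
        sub = [np.cov(wj, cj)[0, 1] for wj, cj in
               zip(np.array_split(wi, n_sub), np.array_split(ci, n_sub))]
        cov_i.append(np.cov(wi, ci)[0, 1])
        sig_i.append(np.std(sub, ddof=1))
    return np.std(cov_i, ddof=1) / (np.mean(sig_i) / np.sqrt(n_sub))

rng = np.random.default_rng(0)
n = 18000                                   # 30 min of 10 Hz data
z = rng.normal(size=n)
w = z + rng.normal(size=n)                  # stationary, correlated pair
c = 0.5 * z + rng.normal(size=n)
c2 = c.copy()                               # correlation vanishes halfway
c2[n // 2:] = rng.normal(scale=c.std(), size=n - n // 2)
```

On the (w, c2) pair the 5-min covariances average out to the 30-min value, so S_FW96 stays small while S_M98 flags the change: a toy illustration of the "necessary but not sufficient" behavior of FW96 discussed in Sect. 3.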
2.3 Outlier detection for half-hourly EC flux time series
An anomalous residual can arise because: i) the model used for signal extraction is misspecified and the residual is due to a wrong estimate of the true flux; ii) the residual is due to the presence of a non-negligible systematic error (i.e. |β_t| > 0 in Eq. 2); or iii) the residual is indeed a tail instance of the PDF of e_t. Condition i) can be assessed by examining the degree of serial correlation in the residual time series: when a model is correctly specified, residuals should not show any serial correlation structure and should resemble a white noise process. Distinguishing between conditions ii) and iii) is not trivial and would require an in-depth analysis of the causes generating the anomalies. We propose to consider the presence of at least one ModEr statement as a symptom of condition ii), and thus to reject flux data identified as outliers and flagged with at least one ModEr statement. Otherwise, if all tests return NoEr statements, the outlying data are retained irrespective of the magnitude of the anomaly. Details about the modeling framework, and about how the heteroskedastic behavior of the residual component is taken into account, are provided in the following.

Signal extraction
For the purpose of signal extraction, we considered a multiplicative model (Eq. 8) in which D_t represents the diurnal cycle, LRT_t the long-run (e.g. annual) trend, SRT_t the short-run (e.g. intraday) temporal dynamics, and I_t the irregular or residual component, considered as an estimate ê_t of e_t in Eq. (1). The multiplicative formulation allows the amplitude of the diurnal cycle to vary with the level (e.g. the long-run trend) of the time series (see Hyndman and Athanasopoulos, 2018, Ch. 6). Such a dynamic is typical of ecosystems whose diurnal cycles of fluxes are more pronounced during growing than during dormant seasons.
The estimation of each component in Eq. (8) can be performed in a variety of ways. In this work we used the STL (Seasonal-Trend decomposition based on LOESS, locally estimated scatterplot smoothing) algorithm developed by Cleveland et al. (1990) and implemented in the stlplus R software package (Hafen, 2010, 2019). We chose the STL algorithm because of its robustness and flexibility, and assessed the quality of the signal extraction by means of the Lomb-Scargle periodogram (Lomb, 1976; Scargle, 1982). This method is particularly suited to detect periodic components in time series that are unequally sampled, as is the case at this stage of the data cleaning procedure owing to the presence of missing data.

Outlying flux data
Previous studies focusing on flux random uncertainty quantification have evidenced a heteroskedastic behavior of the random error (see Richardson et al., 2012, and references therein). Furthermore, it has been demonstrated that the random error distribution can be better approximated by a Laplace than by a Gaussian distribution (Vitale et al., 2019). To take these features into account, residuals are first grouped into 10 clusters of equal size (defined by percentiles) based on the estimated flux values.
Then, for each of the 10 clusters, outlier detection is performed assuming a Laplace distribution and taking into account the (1 − α)% confidence interval with α = 0.01, given by μ ± |log(α)|·σ, where μ (location) is estimated by the median and the scale parameter σ is estimated through the Q_n estimator of Rousseeuw and Croux (1993), for the same reasons introduced in Section 2.2.1.
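A sketch of this cluster-wise rule follows. The Q_n implementation uses the Gaussian consistency constant 2.2219 from Rousseeuw and Croux (1993); the exact scaling adopted in the paper for the Laplace case is an assumption here, and the heteroskedastic test data are synthetic:

```python
import numpy as np

def qn_scale(x):
    """Q_n robust scale estimator (Rousseeuw & Croux, 1993): rescaled
    first quartile of the pairwise absolute differences."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    diffs = np.abs(x[:, None] - x[None, :])[np.triu_indices(n, k=1)]
    h = n // 2 + 1
    k = h * (h - 1) // 2
    return 2.2219 * np.partition(diffs, k - 1)[k - 1]

def laplace_outliers(resid, flux, n_clusters=10, alpha=0.01):
    """Flag residuals outside mu +/- |log(alpha)|*sigma within each of
    n_clusters equal-size groups defined by flux-magnitude percentiles."""
    order = np.argsort(flux)
    flags = np.zeros(len(resid), dtype=bool)
    for idx in np.array_split(order, n_clusters):
        r = resid[idx]
        mu, sigma = np.median(r), qn_scale(r)
        flags[idx] = np.abs(r - mu) > -np.log(alpha) * sigma
    return flags

rng = np.random.default_rng(3)
flux = rng.uniform(0, 20, size=2000)
resid = rng.laplace(scale=0.1 + 0.05 * flux)   # error scales with flux
resid[:5] += 30.0                              # inject gross outliers
flags = laplace_outliers(resid, flux)
```

Because each cluster spans a narrow range of flux magnitudes, the local scale estimate adapts to the heteroskedasticity: the injected outliers are flagged while the overall rejection rate stays near the nominal α.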
It is important to note that a significant limitation of the proposed outlier detection method is its inability to detect systematic errors whose sources act constantly across a flux time series, over long periods of time. For example, a miscalibration of the instruments that persists for days would most likely not show up as outlying fluxes. Such sources of systematic error should be prevented via appropriate QA actions or, at least, specific QC tests able to mark those time series with SevEr statements should be devised.

Monte Carlo simulations
This work has made extensive use of Monte Carlo experiments based on simulated time series mimicking raw EC data. The main purpose of the simulations is to create pairs of reference time series with known covariance structure that, after being contaminated with a specific source of systematic error, allow a quantitative and objective evaluation of (i) the biasing effect each source of systematic error has and (ii) the ability of the proposed tests to correctly detect it. We note that, for these purposes, it is not strictly required to simulate medium- or long-term time series with realistic joint probability distributions from which to generate half-hourly fluxes with typical diurnal and seasonal cycles. In fact, it is reasonable to assume that QC test results do not depend on such features. The reference series were then contaminated with specific sources of systematic error, details of which are provided in Section 3.

Eddy-covariance datasets and flux calculations
In this study, data from ten EC sites belonging to the ICOS network (www.icos-ri.eu) were used, including BE-Lon (Lonzee, Belgium, cropland; Moureaux et al., 2006). The turbulent fluxes of CO2 (FC, µmol CO2 m−2 s−1), sensible heat (H, W m−2) and latent heat (LE, W m−2) were calculated from the covariances between the respective scalar and the vertical wind speed (w, m s−1) following the standard EC calculation method (see, e.g., Sabbatini et al., 2018). The Net Ecosystem Exchange of CO2 (NEE, µmol CO2 m−2 s−1) was estimated by integrating FC with the concurrent storage flux (SC, µmol CO2 m−2 s−1). The EddyPro® software (LI-COR Biosciences, 2019) was used to this aim, employing the double coordinate rotation for tilt correction, 30-min block averaging, the maximum cross-covariance method for time-lag determination, and the in-situ spectral corrections proposed by Fratini et al. (2012).
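The maximum cross-covariance method for time-lag determination can be sketched as follows. This is a minimal version under simplified assumptions; EddyPro's actual implementation adds plausibility windows and defaults we do not reproduce, and the delayed scalar series below is synthetic:

```python
import numpy as np

def lagged_cov(w, c, k):
    """Covariance between w(t) and c(t + k)."""
    n = len(w)
    if k >= 0:
        return np.cov(w[:n - k], c[k:])[0, 1]
    return np.cov(w[-k:], c[:n + k])[0, 1]

def max_crosscov_lag(w, c, max_lag=50):
    """Time lag maximizing the absolute w-scalar cross-covariance; the
    scalar series is shifted by this lag before the flux is computed."""
    return max(range(-max_lag, max_lag + 1),
               key=lambda k: abs(lagged_cov(w, c, k)))

rng = np.random.default_rng(5)
w = rng.normal(size=6000)
c = np.roll(w, 7) + 0.5 * rng.normal(size=6000)   # scalar delayed by 7 samples
```

For this synthetic pair, the cross-covariance peaks at the true tube-delay of 7 samples, which is the lag the method recovers.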

Results and discussion
3.1 Performance evaluation of QC tests

A simulation study was carried out with the twofold aim of quantifying the bias caused by low signal resolution problems and evaluating the performance of the LSR test described in Section 2.2.1. To this end, error-free AR processes of length n = 18000 were first simulated and subsequently contaminated with artificially generated errors. The correlation between time series was set to 0.1, simulating medium-to-low fluxes, while the autoregressive coefficient (φ) was allowed to take a value in the set [0.90, 0.95, 0.99] to represent different degrees of serial dependence (i.e. autocorrelation) as observed in EC data (SM-Appendix A). Low resolution problems were simulated i) by rounding the simulated values from 2 to 0 digits, and ii) by sampling at random an increasing percentage of data (15, 30, 45, 60, and 75% of the sample size) and replacing it via the Last Observation Carried Forward (LOCF) imputation technique. The latter creates the typical ramp structure commonly encountered in raw, high-frequency time series. Each of the 6 scenarios (3 values of φ × 2 types of error, abbreviated S_R,i, i = 1,…,6) was run 199 times to obtain robust statistics. Three realizations of contaminated AR processes are depicted in Figure 2.
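The LOCF contamination and its effect on correlation estimates can be reproduced with a small sketch (fewer replicates than in the study, and an illustrative contamination fraction of 60%):

```python
import numpy as np
from scipy.signal import lfilter

def simulate_ar1(n, phi, rng):
    """AR(1) process with unit-variance innovations."""
    return lfilter([1.0], [1.0, -phi], rng.normal(size=n))

def contaminate_locf(y, frac, rng):
    """Replace a random fraction of points with the last observed value,
    reproducing the 'ramp' structure of low-resolution raw data."""
    y = y.copy()
    hit = rng.random(len(y)) < frac
    hit[0] = False
    for t in np.flatnonzero(hit):
        y[t] = y[t - 1]                    # last observation carried forward
    return y

def rel_bias(phi, frac=0.60, n=18000, reps=20, rng=np.random.default_rng(7)):
    """Mean absolute percentage change of the correlation estimate
    caused by LOCF contamination of one series of a correlated pair."""
    out = []
    for _ in range(reps):
        x = simulate_ar1(n, phi, rng)
        e = rng.normal(size=n) * x.std() * np.sqrt(1 / 0.1**2 - 1)
        y = x + e                          # corr(x, y) ~ 0.1 by construction
        yc = contaminate_locf(y, frac, rng)
        r, rc = np.corrcoef(x, y)[0, 1], np.corrcoef(x, yc)[0, 1]
        out.append(abs(r - rc) / abs(r) * 100)
    return np.mean(out)
```

Consistent with the results reported below, the bias is much larger for weakly persistent series (φ = 0.90) than for strongly persistent ones (φ = 0.99), where the carried-forward value is close to the true one.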

The bias affecting correlation estimates was quantified as the absolute percentage difference between the correlation estimated on error-free (ρ_EF) and on contaminated (ρ_C) data. Results for each scenario are shown in Figure 3. When low resolution problems were simulated by rounding values (S_R,1-3), the bias was less than 2%. Similar results were reported by VM97 and more recently by Foken et al. (2019). On the contrary, when low resolution problems were simulated by replacing values via LOCF, a significant bias was introduced in the correlation estimates, whose amount depends not only on the number of contaminated data but also on the degree of serial dependence. As expected, for an equal percentage of data affected by error, the lower the degree of serial correlation, the higher the amount of bias. In S_R,4, where the AR processes were simulated with φ = 0.90, a bias greater than 5% was observed as soon as more than 30% of the data were contaminated by error. On the contrary, when φ = 0.99, as in S_R,6, the bias was less than 5% even when 75% of the data were contaminated by error. This is because, when a series has strong autocorrelation, subsequent values are expected to be close to the current value; replacing them with the current value therefore does not change the time dynamics significantly and, as a consequence, the correlation estimate remains essentially unbiased. The ability of the LSR test to disentangle these situations can be appreciated by looking at the distribution of the R² values in the 6 scenarios, as shown in Figure 3 (middle panels). To aid comparison, we also added the results of the amplitude resolution test by VM97 (hereafter A-R, right panels of Figure 3). Using threshold values of 0.99 and 0.995 to assign SevEr and ModEr statements, the LSR test showed a good performance, with markedly lower false positive and false negative error rates than the A-R test.
For the latter, low-resolution problems were identified only in scenarios S_R,1-3, and limited to cases in which the number of digits was reduced to zero; the amount of bias in such cases was, however, negligible. On the contrary, the sensitivity of the LSR test was such that the higher the bias affecting the correlation estimates, the lower the R² values, irrespective of the low-resolution simulation method.
We applied both the LSR and A-R testing procedures to the selected field datasets described in Section 2.5 (results are reported in SM-Appendix D). For LE and NEE, similar amounts of data were flagged by the two tests. On the contrary, for the sensible heat flux the LSR test flagged far fewer data than the A-R test. In particular, for 5 sites the percentage of H data hard-flagged by the A-R test exceeded 40%, with a maximum of 68.3% at SE-Htm. The hard-flag was often assigned by the A-R test to the sonic temperature time series during periods of low variability, which leads to the typical step-ladder appearance in the data. In most of these cases, however, this does not necessarily imply biased covariance estimates.

To aid comparison, some illustrative examples are shown in Figure 4. Panel a) shows a sonic temperature time series for which both tests provide negligible evidence of error. Notice that in this case the R² is close to unity, although 19.8% of the data consisted of repeated values. The case shown in panel b) is representative of contrasting results: the A-R test assigned a hard-flag, while the LSR test returned a NoEr statement. By visual inspection, it would seem that despite the step-ladder appearance of the data, the time series dynamics are mostly preserved; in these occurrences, the bias affecting covariance estimates is expected to be limited.
• replacing 15% of the data with the original values multiplied by 5, to simulate changes in variance.
Although these scenarios cover only a fraction of the problems encountered in real observations, the experiment aims at evaluating the sensitivity of the tests in the presence of aberrant structural changes which, in most cases, can only be attributed to malfunctioning of the measurement system. Note that scenarios S_SC,1-2 do not aim at simulating time series affected by structural changes; rather, their purpose is to evaluate the propensity of the tests to false positive errors.

Illustrative examples of simulated time series are shown in Figure 5. Each scenario was run 199 times. For each simulation, the statistics of the HF5, HD5 and KID tests were calculated. As a reference, we also applied the "discontinuity" and "kurtosis index" tests proposed by VM97 (their "dropouts" and "skewness" tests proved insensitive to the simulated scenarios and will therefore not be discussed further). Results are summarized in Figure 6.
We observe that all tests exhibited a low false positive error rate (cf. scenarios S_SC,1 and S_SC,2). The only exception was the VM97 test based on the Haar transform for detecting discontinuities in the mean level, which showed a low performance when the time series was contaminated by a stochastic trend component. The sudden shift in mean level simulated in S_SC,3 was correctly identified by the KID test in most simulations. The Haar transform for detecting discontinuities in the mean level also identified the simulated structural changes, but the hard-flag was assigned in less than 50% of cases. Good performances at detecting the remaining structural changes were achieved only by the KID and HD5 tests, respectively.
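Our reading of the Haar-based discontinuity statistic can be sketched as follows; the window length and the normalization by the overall standard deviation are simplifying assumptions, and the shifted test series is synthetic:

```python
import numpy as np

def haar_stat(y, win=1000):
    """Largest normalized Haar difference over sliding windows: the
    absolute difference between half-window means, scaled by the std."""
    y = np.asarray(y, dtype=float)
    half = win // 2
    stats = [abs(y[i + half:i + win].mean() - y[i:i + half].mean())
             for i in range(0, len(y) - win + 1, half)]
    return max(stats) / y.std()

rng = np.random.default_rng(9)
clean = rng.normal(size=18000)
shifted = clean.copy()
shifted[9000:] += 3.0              # abrupt change in the mean level
```

A sudden mean shift produces a Haar difference of the order of the shift itself, well above the half-window sampling noise of the clean series.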
The application of the testing procedures to actual EC time series showed a higher sensitivity of the VM97 tests compared to the newly proposed ones, with a tendency to assign hard-flags even in cases where no evidence of instrumental error was supported by visual inspection. On the contrary, the proposed tests proved more selective at identifying data affected by structural changes, although in some cases such structural changes were not necessarily attributable to instrument malfunction.

In most of these occurrences, however, the structural changes are indicative of non-stationary conditions, which are another source of systematic error introducing bias in covariance estimates (see Section 3.1.3). This is an example of two tests that are not fully independent and can identify the same issue, which is the reason why ModEr statements are not combined.
Illustrative examples of the application of the testing procedures to raw data are shown in Figure 7.
• S_S,8: as in S_S,2, but with both x_t and y_t contaminated by stochastic trend components.
Notice that S_S,1-4 simulate fluxes measured under stationary conditions (by definition, since |φ_x|, |φ_y| < 1), while the remaining scenarios S_S,5-8 simulate fluxes measured under non-stationary conditions. To mimic trend dynamics of magnitude similar to those observed in real cases, the slope of the deterministic linear trend function f(t) = β · t was fixed at 0.0004 and 0.004 for x_t and y_t, respectively, whereas the stochastic trend components were generated as Σ_{t=1}^{n} ε_t, where ε is drawn from a Normal distribution with mean 0 and standard deviation equal to 0.025 and 0.25 for x_t and y_t, respectively. Exemplary realizations of the simulated AR processes and their CCFs, estimated on either a single run or averaged over multiple runs (i.e. the ensemble CCF), are illustrated in Figure 8.
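The scenario construction just described can be reproduced directly; the slopes and noise scales are those given in the text, while φ = 0.95 is an illustrative choice within the range considered:

```python
import numpy as np
from scipy.signal import lfilter

rng = np.random.default_rng(11)
n = 18000
t = np.arange(n)

def ar1(phi):
    """AR(1) process with unit-variance innovations."""
    return lfilter([1.0], [1.0, -phi], rng.normal(size=n))

x, y = ar1(0.95), ar1(0.95)                    # stationary pair

# Deterministic linear trends (slopes as in the scenarios)
x_det = x + 0.0004 * t
y_det = y + 0.004 * t

# Stochastic trends: cumulative sums of white noise
x_sto = x + np.cumsum(rng.normal(scale=0.025, size=n))
y_sto = y + np.cumsum(rng.normal(scale=0.25, size=n))
```

Both contaminations leave the underlying turbulent signal untouched but inflate the low-frequency content, which is what the stationarity tests must detect.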

Results of applying the FW96 and M98 stationarity tests to 100 simulation runs for each of the 8 scenarios considered are shown in Figure 9. Test performances were evaluated through a statistical sensitivity analysis given by the percentage of correctly identified cases. The threshold values used to assign ModEr and SevEr statements were 30% and 100% for the FW96, and 2 and 3 for the M98, respectively.
In scenarios S_S,1 and S_S,2, where the simulated time series are stationary and characterized by a lower degree of serial correlation, both tests exhibited good performances. The fraction of cases in which the statistics exceeded the threshold values that would, wrongly, lead to the rejection of data (i.e. 100% for FW96 and 3 for M98) was less than 5%, with a slightly better score for M98. In S_S,3 and S_S,4, with stationary time series characterized by a higher degree of serial correlation, the performance of the FW96 test appeared discordant: unlike the excellent performance obtained in S_S,4, in S_S,3 the test statistic was higher than 30% in 44% of cases, and greater than 100% in another 10%. The cause of this low performance is the sensitivity of the FW96 test to cases in which its denominator (i.e. the covariance over the entire period) approaches zero, making the ratio diverge independently of the quantity at the numerator. The M98 test showed instead a lower sensitivity to low correlations between variables, but a greater sensitivity to the degree of serial correlation: in both S_S,3 and S_S,4, the percentage of cases in which the test statistic was greater than 2 was in fact higher than 20%. At the same time, however, in only 1% of cases did the statistic exceed 3, the threshold value adopted to assign a SevEr statement. This result indicates that the use of 3 instead of 2 as a threshold value for the M98 test is preferable, since it reduces the false positive error rate.
In both S_S,5 and S_S,6, the M98 test showed an excellent ability to detect non-stationarity caused by the presence of deterministic linear trend components: the statistic was higher than 3 in 95% of cases. For the FW96 test, a similar performance would require the use of the 30% threshold value; however, according to the recommendations by Foken et al. (2004), when the value of the statistic lies between 30% and 100% and well-developed turbulence conditions are satisfied, the data would not be rejected but classified as data of intermediate quality. In the presence of stochastic trend components, as simulated in scenarios S_S,7 and S_S,8, the M98 test performed better than the FW96 test. However, the false negative error rate (i.e. data erroneously considered stationary) remained higher than 20% when using 3 as a threshold value: only 79% and 67% of cases in S_S,7 and S_S,8, respectively, received a SevEr statement.
Figure 9. Performance evaluation of the stationarity tests by Foken and Wichura (1996, FW96) and by Mahrt (1998, M98).
The performance considerations described above are confirmed when the tests are applied to observed data. Some examples are shown in Figure 10. The time series represented in panels a-c refer to cases in which both tests provided strong evidence of non-stationarity (S_FW96 > 100%, S_M98 > 3). In these cases, the difference between the average of the 5-min covariances and those estimated over 30 min is indeed significantly different from zero (right-hand panels). Panels d-f represent 3 situations in which the M98 test provided evidence of non-stationarity (S_M98 > 3) while the FW96 test returned high- or intermediate-quality assessments, the difference between the average of the 5-min covariances and those estimated over 30 min being negligible.

However, in such cases the 5-min covariances are clearly affected by considerable variability and/or trends, and only happen to provide a mean value close to the 30-min covariance. Therefore, we suggest that the FW96 test provides a necessary, but not sufficient, condition for stationarity; the M98 test, conversely, provides a satisfying performance also in such cases. The examples in panels g-i depict 3 situations in which the FW96 test provided strong evidence of non-stationarity (S_FW96 > 100%) while the M98 test provided, at most, weak evidence (NoEr or ModEr). As mentioned earlier, such disagreement often occurred only on account of the 30-min covariance being close to zero. For these selected cases, in fact, not only are the differences between the average of the 5-min covariances and those estimated over the whole 30-min period negligible, but the degree of dispersion of the individual 5-min covariances also remains at low levels, as expected under stationary conditions.
Figure 10. Application of the stationarity tests to a selection of EC raw data collected at the FI-Hyy site. From left to right: vertical wind speed (w, m s−1); CO2 time series (µmol mol−1, after mean removal); cross-covariance function (CCF); comparison of 5-min covariances (black points), their average (cyan lines) and the 30-min covariance (blue lines). FW96 and M98 test statistics are reported on the right-hand side.

Application of the data cleaning procedure
In this Section we report the results of the data cleaning procedure based on the workflow depicted in Figure 1 and including the application of the QC tests described in Section 2. An illustrative example for the NEE time series collected at the FI-Sii site during a period of 3 months is shown in Figure 11 (for the other flux variables and all other sites, we refer the reader to the SM-Appendix D).
Figure 11. Illustrative example of the sequential data cleaning procedure applied to NEE fluxes at the FI-Sii site.
The original time series had a missing-data rate of 12.3%. The data cleaning procedure eliminated
another 24.6% of the data, for a total of 36.9% of resulting data gaps. In particular, 1.6% of the data was rejected by the removal of unreliable data caused by system malfunction and disturbances (panel b); 3.6% was removed due to low signal resolution problems (panel c); 5.4% was eliminated due to evidence of aberrant structural changes (panel d); the ITC test did not remove any data (panel e); an additional 10.6% was removed because of non-stationary conditions (M98 test, panel f); finally, 1.6% of fluxes was rejected because identified as outlying and inheriting at least one ModEr statement (panel g). Severe low signal resolution issues were identified only sporadically (in no more than 2.4% of flux values), while a ModEr statement from the HD_B test was inherited by 18% of H values, due also to a moderate heteroskedasticity often observed in differenced sonic temperature time series. For all 3 flux variables, the percentage of outlying data was around 1.5%; approximately half of them received at least one ModEr statement from one of the QC tests and were therefore removed.
Figure 13 shows the STL decomposition (Cleveland et al., 1990) of the NEE time series at the FI-Sii site, using the set of parameters described in Sect. 2.3. The ability of the modelling approach to separate the signal from the noise was assessed by evaluating the spectral characteristics of the irregular component by means of the Lomb-Scargle (LS) periodogram (Lomb, 1976; Scargle, 1982). In general, the periodogram did not show significant peaks, as exemplified for NEE in Figure 14. However, because of the complex correlation structure present in EC flux data, the irregular component was often heteroskedastic. This means that even if independent (since most of the serial correlation structure is removed), the irregular component is not identically distributed, i.e. its PDF changes over time.
Such a property strongly limits the ability to use global threshold values above and below which data are identified as outliers.
Previous studies (e.g. Richardson et al., 2012; Vitale et al., 2019) have highlighted that the random error scales with the flux magnitude and that a Laplace distribution can better approximate the PDF of the random error. With this in mind, we grouped the values of the irregular component into 10 clusters of equal size, defined by percentiles of the signal estimated by the STL. This way, each cluster should contain values of the irregular component that are more likely to be identically distributed. Assuming a Laplace distribution and a (1 − α)% confidence interval with α = 0.01, any value exceeding ±4.6 · σ was flagged as an outlier (where σ was estimated using the Q_n estimator of Rousseeuw and Croux, 1993).
Figure 14. Dashed red lines indicate the α-outlier regions (α = 0.01) assuming a Laplace distribution. Red points indicate the detected outliers.
An illustrative example of this procedure is depicted in Figure 14, where for simplicity the detail of only 3 clusters is reported.

Conclusions
Quality control of eddy covariance flux datasets is challenging. The sources of systematic error responsible for introducing significant biases in the flux computation are manifold, and their correct identification is often made difficult by the masking effect induced both by the intrinsic stochastic properties of the data (e.g. a high degree of serial dependence, heteroskedasticity) and by the high level of noise characterizing raw data.
To take these features into account, new tests have been developed and included in a robust data cleaning procedure in which data rejection is articulated in two stages: the first stage involves the removal of any flux data for which at least one of the QC tests returned strong evidence of the presence of a specific source of systematic error (SevEr); the second stage consists in the removal of outlying fluxes, provided that at least one of the QC tests returned weak evidence of the presence of a source of systematic error (ModEr) for the same flux value. Any flux data for which all QC tests provided only negligible evidence of systematic error (NoEr) is never removed, even if it is later identified as an outlier in the flux time series.

Compared to existing classification schemes, the proposed approach does not aim at assigning a quality flag to flux data by combining the results of different QC tests. Rather, it aims at ensuring scalability, in order to facilitate the inclusion of new tests beyond those proposed in this paper.
Although there is a strict relationship between the value of a test statistic and the amount of bias affecting individual flux data, the SevEr, ModEr and NoEr statements are interpreted in probabilistic terms, as the chance that the presence of a source of systematic error is responsible for introducing bias in the flux estimate. Consequently, the threshold values used to assign the SevEr, ModEr and NoEr statements are to be interpreted as indicative of the margin of error associated with the result of a statistical test.
In this study, the performance evaluation of each proposed test was carried out mainly by means of Monte Carlo experiments.
Although the proposed scenarios are representative of only a part of the innumerable cases that can be encountered in real applications, simulations offer undoubted advantages because they allow a quantitative and objective evaluation of the biasing effect each source of systematic error has on flux estimation, and of the ability of the proposed tests to correctly detect them. The design of increasingly complex scenarios and the building of a reference EC dataset are targets of ongoing work. As for assigning a quality level to the retained data, we maintain that the estimate of random uncertainty is the appropriate metric: as a general principle, the larger the random uncertainty, the larger the amount of measurement error and, consequently, the lower the quality of the data. Assuming that flux data affected by systematic error have been avoided or removed via appropriate QA/QC procedures, the use of random uncertainty estimates as a quality indicator 1) would not constrain QC test development; 2) would not preclude a classification of data quality, if needed; and 3) would meet the requirements of advanced methods of analysis where interval estimates are more important than individual point estimates, such as in studies based on data assimilation techniques.

Although there is still room for improvement, particularly in the development of better-performing QC tests aimed at detecting violations of the main assumptions underlying the EC technique, we believe that the proposed data cleaning procedure can serve as a basis toward a unified QC strategy suitable for centralized data processing pipelines, where the use of completely data-driven procedures that guarantee objectivity and reproducibility of the results constitutes an essential prerequisite.
Author contributions. DV and DP conceived the study. DV organized the structure, selected, proposed and implemented the methodologies and performed all the simulations, discussing the results with the coauthors. DV and GF wrote all the sections with the supervision of DP and with the contribution of all the coauthors. All authors reviewed the final manuscript, approved it, and agreed to the submission.