Causal networks of biosphere–atmosphere interactions

Local meteorological conditions and biospheric activity are tightly coupled. Understanding these links is an essential prerequisite for predicting the Earth system under climate change conditions. However, many empirical studies on the interaction between the biosphere and the atmosphere are based on correlative approaches that are not able to deduce causal paths, and only very few studies apply causal discovery methods. Here, we use a recently proposed causal graph discovery algorithm, which aims to reconstruct the causal dependency structure underlying a set of time series. We explore the potential of this 5 method to infer temporal dependencies in biosphere-atmosphere interactions. Specifically we address the following questions: How do periodicity and heteroscedasticity influence causal detection rates, i.e. the detection of existing and non-existing links? How consistent are results for noise-contaminated data? Do results exhibit an increased information content that justifies the use of this causal-inference method? We explore the first question using artificial time series with well known dependencies that mimic real-world biosphere-atmosphere interactions. The two remaining questions are addressed jointly in two case studies uti10 lizing observational data. Firstly, we analyse three replicated eddy covariance datasets from a Mediterranean ecosystem at half hourly time resolution allowing us to understand the impact of measurement uncertainties. Secondly, we analyse global NDVI time series (GIMMS 3g) along with gridded climate data to study large-scale climatic drivers of vegetation greenness. Overall, the results confirm the capacity of the causal discovery method to extract time-lagged linear dependencies under realistic settings. The violation of the method’s assumptions increases the likelihood to detect false links. Nevertheless, we consistently 15 identify interaction patterns in observational data. Our findings suggest that estimating a directed biosphere-atmosphere network at the ecosystem level can offer novel possibilities to unravel complex multi-directional interactions. Other than classical correlative approaches, our findings are constrained to a few meaningful set of relations which can be powerful insights for the evaluation of terrestrial ecosystem models. 2 https://doi.org/10.5194/bg-2019-297 Preprint. Discussion started: 13 August 2019 c © Author(s) 2019. CC BY 4.0 License.


Introduction
Understanding biosphere-atmosphere interactions is a prerequisite to accurately quantify feedbacks in the Earth system (Bonan, 2015). On the one hand, the biosphere responds to the atmospheric drivers such as radiation intensity, temperature, vapour pressure deficit, and composition of trace gases. On the other hand, the biosphere influences the atmosphere via partitioning the incoming net radiation into sensible, latent, and ground heat fluxes as well as via controlling the exchange of trace gases 5 and volatile organic compounds. Over the past decades, many of these processes have been identified and their physical, chemical and biological effects have been investigated (see e.g. Monson and Baldocchi, 2014;McPherson, 2007, for overviews).
However, the synergistic effects of these processes and the specific cause-effect interactions underlying these effects are still not fully understood (Baldocchi et al., 2016;Miralles et al., 2018).
Today, there is a manifold of monitoring systems operating at various spatial and temporal scales. Networks of eddy co- 10 variance towers offer high temporal resolution to study the dependence between carbon, water, and energy fluxes and the atmosphere under a variety of normal and stress conditions or in response to disturbances (Baldocchi, 2014). Satellite remote sensing data complement this picture. They typically only monitor vegetation states at multi-day resolutions and some products offer nearly complete global coverages (Justice et al., 2002;Woodcock et al., 2008). The actual and future satellite missions are leading to rapid development in the field with ever higher spatial, temporal, and spectral measurements (Malenovský et al.,15 2012; Guanter et al., 2015;Qi and Dubayah, 2016).
The study of biosphere-atmosphere interactions using observations typically relies on correlative approaches, or is based on model-data i.e. requires a-priori knowledge. Only few attempts have been made to learn directional dependencies in a data-driven manner in biogeosciences. For instance, Ruddell and Kumar (2009) used transfer entropy, a bivariate information theoretic measure, to estimate networks of information flow. These networks constructed for a corn-soybean ecosystem under 20 drought and normal conditions showed substantial differences in connectivity. The decoupling during drought between variables describing both land and atmospheric conditions was attributed to changes in the feedback patterns for the two conditions.
In recent years, a new branch in statistics aiming for causal inference from empirical data has experienced substantial progress.
The idea of causal inference emerged already in the early 20th century (Wright, 1921). Later, Granger suggested one of the first applicable formalisms (Granger, 1969); since then, several efforts in ecology and climate science have concentrated on the 25 bivariate form of Granger causality (Elsner, 2006(Elsner, , 2007Kodra et al., 2011;Attanasio, 2012;Attanasio et al., 2012).
Aiming to mitigate some of the limitations of the traditional Granger causality, Detto et al. (2012) used a conditional spectral Granger causality framework that allows to disentangle system inherent periodic couplings from external forcing. The disentanglement is enabled via decomposition into the frequency domai using wavelet theory. This method enabled the finding that soil respiration in a pine and hardwood forested ecosystem in winter is not influenced by canopy assimilation but only by 30 temperature, a result that would not be detectable via lagged correlation or bivariate Granger causality. Green et al. (2017) used a similar approach to investigate biosphere-atmosphere feedback loops. It is found that they can explain up to 30% of variance in radiation and precipitation in certain regions. Recently, Papagiannopoulou et al. (2017a) applied a non-linear multivariate conditional Granger causality framework to study climatic drivers of vegetation at the global scale. This approach revealed artificial tests (Runge et al., 2018) and climatological case studies (Runge et al., 2014;Kretschmer et al., 2016). Hence, this method could be potentially of very high relevance for learning the casual dependency structure in the complex biosphereatmosphere system.
In this study, we explore the potential of PCMCI for disentangling and quantifying interactions and feedbacks between terrestrial biosphere state and fluxes and meteorological variables. The study is structured as follows: Firstly, we describe the 15 performance of the method on artificial time series data with well-known dependencies that mimic some basic properties of observed land surface fluxes such as heteroscedasticity (Sect.3.1). Secondly, we explore three replicated eddy covariance measurement towers in a Mediterranean ecosystem and explore how the identified interdependencies of carbon and energy fluxes and micro-meteorological observations vary over time (Sect.3.2). Thirdly, we use global satellite data of vegetation greenness to understand the lagged dependency of ecosystems with respect to climatic drivers (Sect.3.3). Based on these results, we dis- 20 cuss the potentials and limitations of PCMCI for other applications in land-atmosphere studies and give recommendations for further methodological developments (Sect.4).

Causal Discovery via PCMCI
PCMCI assesses the causal structure of a multivariate dataset or process X by estimating its time series graph. In short, the nodes of a time series graph G represent single variables X 1 , X 2 , ... ∈ X at specific (lagged) time steps t, t − 1, .., t − τ max . A lag is the number of time steps τ between two nodes X i t 1 and X j t 2 , i.e. τ = t 1 − t 2 for t 1 > t 2 . To generate a causal graph from 5 observational data through statistical dependencies, a series of assumptions must be adopted. Here, we assume: time order, the causal Markov condition, faithfulness, causal sufficiency, and no contemporaneous causal effects (Runge et al., 2018).
PCMCI is applied in combination with the ParCorr linear independence test based on partial correlations (cf. Sect. 2.1.1). This application additionally requires stationarity in mean and variance and linear dependencies. In the following, we briefly discuss these assumptions (further details in Runge, 2018a) and refer to a variable X i that is causally affecting a variable X j as 'parent' 10 or 'driver' and the latter as 'receiver' or 'target'.
The time order within the time series allows to orient directed links which are only pointing forward in time. This accounts for causal information propagating forward in time only, i.e. the cause shall precede the effect. Therefore, a directed causal link X i t 1 → X j t 2 can only exist between two nodes X i t 1 , X j t 2 if t 1 < t 2 . When a contemporaneous link is found, i.e. t 1 = t 2 , it is considered to be undirected. The connection between the graph G and its process X is given by the causal Markov condition 15 together with the faithfulness. The Markov condition says that if there is no link in the graph, there is conditional independence in the distribution, and if there is no conditional independence, there is connectedness in the graph; the reverse relations, if there is conditional independence, there is also separation in the graph and if there is connectedness in the graph, there is no conditional independence, are given by the faithfulness assumption. With these assumptions the graph is a visualisation of the conditional dependence and independence relationships among the variables including their lags. Finally, the causal sufficiency 20 assumption implies that, every common cause of two or more variables X i ∈ X is included in X. If this is not the case, detected links may be indirect or due to an unobserved common driver. However, the absence of a link in the detected graph still implies that no direct link is present (as this only requires the assumption of faithfulness).
Several causality methods suffer from high dimensionality, i.e. multivariate transfer entropy (Runge et al., 2012). They ..) as a conditioning set S in the estimation of the test statistic PCMCI addresses this issue by reducing the set of conditions S. PCMCI first estimates S for each X j t . The estimation removes all variables that are either independent or dependent due to indirect paths or common drivers only and leaves only relevant conditions converging to the true causal parents in the limit of infinite sample size. This is done using a variant of the PC algorithm (Spirtes et al., 2001). The conditioning set is subsequently used in the MCI step to test every possible link; MCI also attributes a link strength. These two steps as well as the concept of independence tests, which 30 are at the core of both PC and MCI algorithms, are explained in more detail in the following.

Independence Test
At the core of PCMCI, there are conditional independence tests CI(X i t−τ , X j t , S) testing for X i t−τ X j t | S. Within the PCMCI software package Tigramite (Runge, 2018b), several independence tests are implemented. Here, we focus on the linear independence test called ParCorr. The ParCorr conditional independence test is based on partial correlations and a t-test. This assumes the model with coefficients β and Gaussian noise . This leads to the residuals with estimatedβ. ParCorr removes the influence of S on X and Y via ordinary least squares regression and tests for independence of the residuals using the Pearson correlation with a t-test. The independence test returns a p-value and test statistic value 10 I, i.e. the correlation coefficient in case of ParCorr.

PC algorithm
The PC algorithm estimates a set of parents for each variable of the process X j t ∈ X t which are used as low-dimensional conditions in the MCI algorithm. A comprehensive pseudo code of this procedure is given in Runge et al. (2018). The algorithm starts with a fully connected graph and iteratively removes links if conditional independence is found. At first, a preliminary set 15 of parents P for each X j t equal to the whole past of the process X t is initialised: The conditional independence test is embedded into two loops. The outer loop iteratively increases the cardinality p of S starting from 0, the empty set. Thus it is first tested, whether X i and X j are independent. Further, only the p strongest conditions are selected. The inner loop iterates through all X i t−τ ∈ P(X j t ). The cardinality is increased until |S| = | P(X j t )\ X i t−τ | − 1. At the end of each iteration 20 in the outer loop the non-significant parents are removed from P(X j t ). Every conditional independence test is evaluated at a significance threshold called α pc , which is usually set to a liberal value between 0.1 and 0.4. Alternatively, one can let α pc unspecified. PCMCI then evaluates the best choice of α pc ∈ {0.1, 0.2, 0.3, 0.4} based on the Akaike information criterion which is further explained in (Runge et al., 2018). 25 MCI is the actual causal discovery step that ascribes a p-value and strength to each possible link. MCI iterates through all pairs (X i t−τ , X j t ) : i = 1, ..., N, τ = 0, ..., τ max and calculates CI(X i t−τ , X j t , S). S consists of the two potentially multivariate sets of parents P(X j t ) and P(X i t−τ ) obtained in the PC step. P(X i t−τ ) is constructed by shifting the time series of P(X i t ) by τ. In case X i t−τ ∈ P(X j t ), X i t−τ has to be removed from P(X j t ). If τ = 0, conditional dependence is estimated for contemporaneous nodes X j t and X i t . Due to missing time order, a dependence would be left undirected. Further, as the parents P(X i t ) and P(X j t ) used in each conditional dependence test are defined to lie in the past of X i t and X j t , links, both contemporaneous and lagged, can be spurious due to contemporaneous common drivers or contemporaneous indirect paths. The absence of a link, though, means that a physical (contemporaneous) link is unlikely (assuming faithfulness, cf. Runge et al. (2018)).

MCI tests
The link strength is given by the effect size of the conditional independence test statistic measure CI used in combination with MCI. In case of ParCorr, the effect size is given by the partial correlation value, which is between -1 and 1. This value is 5 shown in Runge et al. (2014) to directly depend on the receiver's and driver's variance as well as the coupling coefficient. Even non-linear links can be detected with ParCorr as they can often be linearly approximated. In case the linear part is even stronger than the non-linear part, ParCorr can have better detection rate than a non-linear independence test (Runge et al., 2018).

Artificial Dataset -Test Model
We tested the algorithm on artificial datasets prior to its application to real world data. Using time series of measured global radiation (Rg) we created three artificial time series that conceptually represent temperature (T ), gross primary production (GPP) and ecosystem respiration (Reco). Note that this test model is not intended to accurately represent observed land-5 atmosphere fluxes, but only serves to test the procedure. The model incorporates one linear auto dependence T t−τ 1 → T t , one linear additive cross-dependence Rg t−τ 2 → T t and two non-linear dependencies, multiplicative Rg t−τ 3 • T t−τ 4 → GPP t , and multiplicative exponential GPP t−τ 5 • c T t−τ 6 → Reco t (cf. Fig. 1) according to the equations: The parameters c 1 , c 2 , ..., c 5 are referred to as coupling coefficients, and the time lags are noted as τ 1 , τ 2 ,..., τ 5 . The subscripts mo and obs abbreviate model and observation, respectively. T re f is set to 15 • C. The term ξ, termed "intrinsic" or "dynamical noise", here represents values from uncorrelated, normally distributed noise. Having dynamic noise is essential for a method 15 utilizing conditional independence tests. It is based on the assumption that a process or state is never fully described by its deterministic part because there are unresolved intrinsic processes, summarized as ξ.
The model was fitted to real observational data (using radiation, temperature and land-atmosphere fluxes) of daily time resolution, measured by eddy-covariance method (Baldocchi et al., 1988;Baldocchi, 2003) from FLUXNET, by minimizing the sum of squared residuals using the gradient descent implemented in the Optim.jl package (Mogensen and Riseth, 2018). 20 We fitted the model to 72 sites listed in From each of the 72 sets of parameters we generated four sets of time series each having a length of 500 years. The time series generation was initiated using two types of data: first, uncorrelated, normally distributed noise, and second, unprocessed radiation data as used during the fitting (the available radiation data was repeated to 500 years). The resulting datasets are called baseline dataset and seasonality dataset, respectively. In both cases, the model was run twice, once with homoscedastic (constant variance of ξ), once with heteroscedastic dynamical noise ξ. To induce heteroscedasticity, ξ was multiplied with a total precipitation of 765mm. Most precipitation fell between October and April. 10 We expect the causal imprints in the data to vary between seasons and during the course of the day. To satisfy causal stationarity, we estimate networks separately for each month and consider only samples for which the potential radiation was above 4 5 of the potential daily maximum, which corresponds to midday samples. We used a mask type that limits only the receiver variable to the respective month and day time values (cf. Table A1 for PCMCI parameter settings). This setting causes time series lengths ranging from 239 datapoints in December to 372 datapoints in July. Minimal and maximal lags were set to 15 0 and 8, respectively. Due to the limited number of years, we left the data unprocessed, i.e. we did not subtract a seasonal mean for anomalisation. Constraining the samples to separate month and midday values reduces the effect of seasonality as a common driver that would lead to spurious links. Furthermore, to correct for multiple testing we applied the Benjamini-Hochberg false discovery rate correction (Benjamini and Hochberg, 1995). Thereby, the p-values for the whole graph obtained from the MCI step are adjusted to control the number of false discoveries (Runge et al., 2018). We chose a two-sided significance level of 20 0.01.

Gridded global data set
The second observational case study was performed on a global data set. We used data with 0.5 • spatial and monthly temporal resolution from 1982 to 2008. The dataset is composed of three climatic variables, global radiation (Rg), temperature (T) and precipitation (P), and one vegetation state index, the Normalized Difference Vegetation Index (NDVI). Both temperature and 25 precipitation datasets were taken from the Climate Research Unit (CRU), version TS3. 10 (Harris et al., 2014). The radiation data stems from the Climate Research Unit and National Centers for Environmental Prediction dataset (CRUNCEP, Viovy, 2016). The used NDVI data stems from the Global Inventory Monitoring and Modeling Systems (GIMMS) in version 3g_v1 (Pinzon and Tucker, 2014).
To examine the influence of radiation, temperature, and precipitation on NDVI by means of PCMCI we used the following 30 settings. We compute the anomalies by subtracting a smoothed seasonal mean. A maximal time lag of three months was chosen based on the largest lag with significant partial correlation among all pairs of variables, partialling out only the autocorrelation of each variable. The receiver variable was limited to the growing season defined by T>0 and NDVI>0.2, which allows good  Figure 1. The artificial datasets are generated with a prescribed interaction structure (True Network), which is obtained by fitting the test model to the FLUXNET sites. Here we show for four time series length the process graphs estimated via both lagged correlation and PCMCI.
The data used stems from the homoscedastic realisation of the seasonality dataset of the Hainich site. The significance level was set to 0.01.
The number of time lag labels were limited to five in the correlation networks. But for the longest time series typically the whole range of lags (0-25) was significant.

Test Model
As motivating example, in Fig. 1 we show PCMCI and lagged correlation networks in the form of process graphs. It is clearly visible, that many more spurious links pass the significance threshold of 0.01 using lagged correlation as compared to using PCMCI. Those spurious links can complicate the analysis or lead to false assumptions and misleading hypotheses. We ex- Additionally, the distributions are split to show the impact of heteroscedastic noise (orange) compared to normal distributed noise (blue).
The significance level of 0.01 is given by a blue horizontal line. ticity or seasonality is present. The effect on the FPR is an increase above 0.01 for time series length of 1 and 5 years with a much stronger increase due to seasonality (factor of 4 and 8, respectively) than for heteroscedasticity (factor of 2).
The effect of non-stationarities on the TPR differs among the links. The detection of linear links (Rg → T and T → T ) is not affected by seasonality and slightly improves for heteroscedastic dynamical noise. The detection of non-linear links is improved by seasonality with the strongest effects in the link T → GPP. The link T → Reco has a stonger non-linearity and 5 therefore the detection rate shows a weaker effect on seasonality. Furthermore, the coupling coefficient c 5 , the base of T, can be close to or be exactly one (cf. Fig. B1). This would actually cause on the one hand the effect of T → Reco to vanish, rendering a detection impossible and on the other hand result in a linear dependence of GPP on Reco which improves its detection.
Heteroscedasticity seems to have a slight negative effect on non-linear links. In general, the TPRs in the seasonality dataset are quite high, even for non-linear links, and predominantly above 80% and often reaching 100%. 10 Comparing the TPRs of the non-linear links, shows some disparity. The links T → GPP and T → Reco experience zero detection in the baseline dataset but partially considerable rates in the seasonality dataset with a strong dependence on the time series length. On the contrary, the median of the TPR of the links Rg → GPP and GPP → Reco is above 95% in the seasonality dataset, even for time series length as short as 91 days, but remains high in the baseline dataset. The removal of the seasonal cycle keeps the TPRs largely unaffected, but reduces the FPR. Nevertheless, it still remains above the significance 15 level by a factor of four and two for five and one year time series length, respectively.   Additionally the distributions are split to show the impact of heteroscedastic noise (orange) compared to normally distributed noise (blue).
In summary, the seasonality dataset exhibits high TPR even for non-linear links. Compared to stationary time series, the detection of non-linear links actually benefits from seasonality. The high detection, though, comes at the cost of a high false positive rate for time series length of and above one year. To a certain degree, the increase in FPR can be counteracted by anomalization.

Majadas de Tiètar dataset 5
At first we look at the link consistency by comparing networks that were obtained for each tower within a month. The comparison is done for two months with strongly differing climate conditions: April and August. In Fig. 4    nantly at zero lag. Approximately half of the links with lag one are auto-dependent links (a link from the past of a variable to its present). Comparing the links between the month April and August, distinct differences can be noticed. First, August has slightly fewer significant links compared to April. Second, the only links remaining that are significant in two or three towers are between atmospheric variables. Third, the remaining link strengths tend to be weaker in August than in April.
The difference among the seasons is further investigated in Fig. 5 which shows process graphs for each month of the year. 5 We combine the networks of the three towers to one process graph by plotting every link that is significant in at least one tower.
The process graphs in Fig in September. Here, the onset of precipitation events (cf. Fig. G1) occurred that lead to strong respiration peaks (Ma et al., 2012). Creating such a network via lagged correlation would result in much more significant links (causing the network to be 15 not interpretable as opposed to PCMCI) and NEE does not decouple from the atmosphere in August (cf. Fig. F1).
The above results demonstrate that PCMCI is sensitive enough to capture seasonal differences and certain physiological reasonable biosphere behaviour. Moreover, PCMCI yields a better interpretable network structure than pure correlation approaches.

20
Subject of inspection were the significant lags and MCI values of each climatic variable on NDVI. In Fig. 6 the maximal MCI value and the corresponding lag are plotted for the links X t−τ → NDVI : τ ∈ {0, 1, 2, 3} with X being one of the climatic drivers radiation, temperature, or precipitation. The chosen significance threshold was set to 0.05. Fig. 7 shows the climatic driver with largest MCI per grid point. PCMCI detects a regionally varying influence of climatic drivers. As expected, the boreal regions are strongly driven by temperature instantaneously, while (semi-) arid regions, which correspond predominantly to 25 grass or prairie dominated areas, respond strongest to precipitation at a time lag of one month. Radiation is found to have a comparatively low spatial effect with hot spots in south and east China, central Russia and east Canada.
The dominant lags are found to be zero and one. Just a very small fraction of the total area shows a maximal MCI value at a higher lag of two or three months. The lags are also not equally distributed among the climatic drivers. Radiation and temperature are predominantly strongest at lag zero, while precipitation has a much larger fraction of area showing the strongest 30 response at lag one. Regions where the impact of Rg on NDVI is strongest at lag 1 tend to respond negatively to Rg but positively to precipitation at lag one. On the other hand, a large part of regions with the strongest impact of precipitation at lag zero respond negatively to it but positively to radiation.    In summary, PCMCI estimates coherent interaction patterns which match well with anticipated behaviour based on vegetation type and prevailing climatic conditions.
Causal discovery methods promise an improved understanding and can help to come up with new hypotheses about the interaction between biosphere and atmosphere (Christiansen and Peters, 2018;Runge et al., 2019). But the underlying assumptions need to be properly taken into account. The coupled biosphere-atmosphere system possesses several challenges that potentially violate the underlying assumptions of causal discovery in general and the employed method's assumptions in particular. Here, 5 we investigate the effect of a violation of assumptions on PCMCI network estimates.
With regard to expected non-linearities in biosphere-atmosphere interactions, using a linear independence test within the PCMCI framework may not be adequate. We motivate our choice with the following arguments: first, non-linearities are often approximated linearly. Second, a linear regression based test has a much higher power for detecting linear links than a non parametric test (Runge, 2018a) and can, hence, detect links already at smaller sample sizes. Third, linear partial correlation

Lessons learned from the test model
The probability to detect a link with PCMCI depends strongly on a link's MCI effect size, which is larger for strong variance in the driver and a low variance in the receiver (Runge et al., 2018). Several results can be explained by this observation. First, the variance of three out of five drivers of cross dependencies in the test model are either directly or indirectly (via GPP) influenced by Rg, which has the highest variance of all variables. Consequently, the detection power of the three links is large, 20 almost 100%. In comparison, the other variables' variances are weaker, since they are influenced by T , which results in a lower detection power. This is the origin of the disparity in detection rates of the non-linear links. Second, also the partially strong increase in TPR of non-linear links (influenced in a multiplicative way by T ) from the baseline dataset to the seasonality dataset can be explained by this increase in variance. A multiplicative link is actually not generally expected to be found by ParCorr (Runge, 2018a), but the value of the multiplicative factor is dominated by the seasonal value, and not the dynamical 25 noise, which might cause rather a scaling of the dynamical noise terms rather than a random distortion. Third, the dependence on the variance ratio can also explain the difference in TPR between homoscedastic (equal error variance) and heteroscedastic (error variance changing over time) time series, i.e., the variance of Rg and GPP exhibits a strong seasonality with its peak in summer, while the variance of T is rather constant. This explains, for example, the strong decrease in TPR for the link T → GPP at 91 days time series length when comparing homoscedasticity to heteroscedasticity. The decrease in TPR is less 30 pronounced when another season, implying a different variance, is chosen for this comparison. As links with weak driver variance and strong response variance are more likely to be missed, one may ask which effect this will have on the detection of feedback loops where one variable has low and the other high variance. Here lies a limitation of the test model where no feedback loops were implemented.
Seasonality and heteroscedasticity constitute violations of the stationarity assumption underlying the independence test ParCorr. As shown in Runge (2018a), including the cause of the non-stationarity as an exogenous driver in the analysis allows PCMCI to regress out its influence on the other variables. However, for ParCorr this is only valid if the dependence on the 5 non-stationary driver is linear. Therefore, the regression on Rg fails for GPP and Reco in the test model. With this ill-posed setting, the probability to detect false links increases with increasing time series length or when more periods are included.
Stationarity in mean is obviously also not fully guaranteed when subtracting the seasonal mean. Here we observe that the FPR stays above the significance level for the anomalised seasonality dataset. One can ask whether the FPR stays above the significance threshold because subtracting the seasonal mean does not remove the heteroscedasticity. However, we attribute this 10 high FPR to a not fully removed seasonality since the FPR of both homoscedastic and heteroscedastic time series decreases by roughly the same amount in the anomalised seasonality dataset and the effect of heteroscedasticity is rather weak in the baseline dataset. Much of the influence of heteroscedasticity is also removed when limiting the analysis to a specific time period, i.e.
season, which makes the data causally stationarity (cf. Sect. 2.1). For example, the link from radiation to GPP vanishes in winter as there is mostly no active plant material left. One can argue, as it is done in Peters et al. (2017), that the causality non-stationarity on the causal network structure needs to be investigated.

Causal interpretation of estimated networks from observational data
In both the half-hourly time resolved eddy covariance data and the monthly global dataset the predominant type of dependence found is contemporaneous. PCMCI leaves these undirected since no time order indicating the flow of causal information is available. Further, as discussed in Sect. 2.1, contemporaneous common drivers or mediators are not accounted for. The 30 consequence is that both spurious contemporaneous and spurious lagged links can appear, if they are due to contemporaneous variables. For interactions that are contemporaneous in nature since they occur on considerably shorter time scales than the time resolution, therefore, PCMCI is not the optimal choice regarding a causal interpretation and other methods should be applied in conjunction (Runge et al., 2019). Further, we faced a trade off between fulfilling causal assumptions and detection power.
In practice, accounting for causal stationarity (by limiting the analysis to certain periods of the dataset) means decreasing the number of available data points while accounting for causal sufficiency leads to an increase in dimensionality by adding variables and increasing the maximal lag. Both will lead to a decrease in detection power, which can affect the network structure. PCMCI alleviates the curse of dimensionality by applying a condition selection step, but still one cannot indefinetly add more variables. Another important factor that affects detection power and dimensionality is the time resolution. There 5 are several points in favour and against increasing time resolution. On the one hand, increasing time resolution can resolve contemporaneous links and potentially increases the detection power due to an increased number of datapoints. On the other hand, the dimensionality increases if the maximal lag is adapted. Further, causal information might be split apart and distributed over more lags, rendering the links at each individual lag less detectable. This can cause links to disappear, but links can also appear if new processes are resolved at a higher time scale. At last, observational noise (measurement errors) might be larger in 10 higher resolution data than in aggregated data, as it is averaged out in the latter and thus affects link detection less. Consequently, when comparing network structures based on different settings, i.e. maximal lag, included variables, time resolution, and considered time period, the (dis-)appearance of single links among specific variables can stem from several factors, i.e. a change in detection power, a changed (conditional) dependency, or due to a common driver. These factors together with a non-zero false positive detection rate are challenging for a causal interpretation. Therefore, detected links should be interpreted 15 with care and can give rise to new hypotheses and analyses involving further variables. Generally, a causal interpretation is more robust regarding the absence of links (cf. Sect.2.1). In particular it does not require that all common drivers are observed.
Nevertheless, robust patterns were identified in our studies that are also consistent with other studies. Furthermore, a causal analysis has the advantage of an enhanced interpretability compared to correlative approaches. First of all, we could show that the networks' estimated link strengths are consistent for observational data, even though measurement error affects the 20 data. The dataset used was suitable for this analysis, as the measurement stations are located in a reasonably homogeneous ecosystem that shows only little spatial variation (El-Madany et al., 2018). Thus, also the interaction between biosphere and atmosphere is expected to change only marginally across space. Second, the gradual changes in plant activity that are taking place in the ecosystem of Majadas throughout the year do very well emerge in the coupling strength of daytime NEE to the atmospheric variables. The observed decoupling during the dry season is in accordance with the one of a soybean field during the PC condition selection step which is especially valuable in high dimensional study cases and improves detection power and computation time (Runge, 2018a).
The global study of climatic drivers of vegetation shows a general pattern of lags and dependence strengths of vegetation on climatic variables that is easily-interpretable. The boreal regions appear energy limited and especially driven by temperature (cf . Fig 6c), while the strongest dependence of (semi-)arid regions on precipitation reflects their limitation in water supply. presented by Papagiannopoulou et al. (2017a). We recognize that similar patterns are observed in Wu et al. (2015), but the lags at the maximal MCI value are usually lower than the one found in Wu et al. (2015), which stems from the methodical differences. Besides having used anomaly values, PCMCI regresses both NDVI and the climatic drivers on their parents before calculating the MCI value (cf. Sect. 2.1.3). This especially removes the influence of autocorrelation. Runge et al. (2014) shows how autocorrelation affects the correlative lag causing it to be larger for stronger autocorrelation; thereby the correlative 15 lag may become larger than the causal lag. Therefore, according to Fig. 6b, the causal information embedded in monthly resolution is predominantly received within one month. Finding the strongest causal links at a time lag up to one month appear in agreement with Papagiannopoulou et al. (2017b). Also the spatial distribution of the strongest climatic influences compares well. But there are certain noteworthy differences which not necessarily stem from masking differences, i.e. that we took only values belonging to the growing season while Papagiannopoulou et al. (2017b) took the whole time series. First, 20 there is little significant Granger causality of water availability found in boreal regions while there are significant negative causal dependencies detected via PCMCI. Second, NDVI in arid regions is not or barely Granger caused by radiation and temperature, but in parts shows a negative PCMCI value on those variables. There might be physiological reasons that can explain the PCMCI patterns, i.e. water logging or too high temperatures. To explain the differences though, we could identify two possible reasons. First, Papagiannopoulou et al. (2017b) masked out negative influences of radiation arguing that radiation 25 is not negatively affecting NDVI. They found, that negative influences of Rg are usually a consequence of poor conditioning on other variables. Second, a precipitation event in boreal regions coincides with a reduction in radiation and temperature. Boreal regions usually do not suffer from water shortages. Thus they respond stronger to the reduction of radiation and temperature than precipitation. As precipitation is coupled negatively to radiation and temperature at lag zero, the effect of precipitation on NDVI is found to be negative. Thus, the link P − − →NDVI might be an effect of the contemporaneous common driver scheme P study shows that causal methods can deliver better interpretability and a much improved process understanding in comparison to correlation and bivariate Granger causality analyses that are ambiguous to interpret since they do not account for common drivers.