Estimating causal networks in biosphere–atmosphere interaction with the PCMCI approach

The dynamics of biochemical processes in terrestrial ecosystems are tightly coupled to local meteorological conditions. Understanding these interactions is an essential prerequisite for predicting, e.g. the response of the terrestrial carbon cycle to climate change. However, many empirical studies in this field rely on correlative approaches and only very few studies apply causal discovery methods. Here we explore the potential for a recently proposed causal graph discovery algorithm to reconstruct the causal dependency structure underlying biosphere–atmosphere interactions. Using artificial time series with known dependencies that mimic real-world biosphere–atmosphere interactions we address the influence of non-stationarities, i.e. periodicity and heteroscedasticity, on the estimation of causal networks. We then investigate the interpretability of the method in two case studies. Firstly, we analyse three replicated eddy covariance datasets from a Mediterranean ecosystem. Secondly, we explore global Normalised Difference Vegetation Index time series (GIMMS 3g), along with gridded climate data to study large-scale climatic drivers of vegetation greenness. We compare the retrieved causal graphs to simple cross-correlation-based approaches to test whether causal graphs are considerably more informative. Overall, the results confirm the capacity of the causal discovery method to extract time-lagged linear dependencies under realistic settings. For example, we find a complete decoupling of the net ecosystem exchange from meteorological variability during summer in the Mediterranean ecosystem. However, cautious interpretations are needed, as the violation of the method’s assumptions due to non-stationarities increases the likelihood to detect false links. Overall, estimating directed biosphere–atmosphere networks helps unravel complex multidirectional process interactions. Other than classical correlative approaches, our findings are constrained to a few meaningful sets of relations, which can be powerful insights for the evaluation of terrestrial ecosystem models.


Introduction
The terrestrial biosphere responds to atmospheric drivers such as radiation intensity, temperature, vapour pressure deficit, and composition of trace gases. On the other hand, the biosphere influences the atmosphere via partitioning incoming net radiation into sensible, latent, and ground heat fluxes as well as via controlling the exchange of trace gases and volatile organic compounds. Over the past decades, many of these processes have been identified and their physical, chemical and biological 5 effects have been investigated (see e.g. Monson and Baldocchi, 2014;McPherson, 2007, for overviews). However, there are still substantial unknowns regarding the exact causal dependencies among the different processes (Baldocchi et al., 2016;Miralles et al., 2018), which leads to large uncertainties when predicting e.g. ecosystem responses to drought conditions von Buttlar et al. (2018); Sippel et al. (2017).
Multiple ecological monitoring systems have been setup to monitor ecosystem dynamics. Networks of eddy covariance 10 towers continuously monitor carbon, water, and energy fluxes in high temporal resolution (Baldocchi, 2014). Satellite remote sensing data complement this picture and can be used in tandem (Papale et al., 2015;Mahecha et al., 2017). They typically only monitor vegetation states at multi-day resolutions and some products offer nearly complete global coverage (Justice et al., 2002;Woodcock et al., 2008). The actual and future satellite missions are leading to rapid development in the field with ever higher spatial, temporal, and spectral measurements (Malenovský et al., 2012;Guanter et al., 2015;Qi and Dubayah, 2016). 15 The study of biosphere-atmosphere interactions using observations typically relies on correlative approaches, or is based on model-data i.e. requires a-priori knowledge. In recent years, a new branch in statistics aiming for causal inference from empirical data has experienced substantial progress. The idea of causal inference emerged already in the early 20th century (Wright, 1921). Later, Granger suggested one of the first applicable formalisms (Granger, 1969); since then, several efforts in ecology and climate science have concentrated on the bivariate form of Granger causality (Elsner, 2006(Elsner, , 2007Kodra et al., 20 2011; Attanasio, 2012;Attanasio et al., 2012). From an information theoretic perspective transfer entropy (Schreiber, 2000) evolved as a frequently used measure to infer directionality and amount of information flow (Kumar and Ruddell, 2010;Ruddell et al., 2015;Gerken et al., 2018;Yu et al., 2019). For instance, Ruddell and Kumar (2009) used transfer entropy to estimate networks of information flow. These networks constructed for an agricultural site under drought and non-drought conditions showed substantial differences in connectivity, especially between subsystems comprising variables of land and atmospheric 25 conditions. Those changes in connectivity are attributed to changes in the feedback patterns between the subsystems for drought and normal conditions. The original forms of both Granger causality and transfer entropy are bivariate and converge for the case of vector auto regressive models. While Granger causality is typically limited to linear relationships, transfer entropy captures also non-linear interactions, but requires very large data quantities for the estimation of the probability density function.
Aiming to mitigate some of the limitations of the traditional Granger causality, Detto et al. (2012) used a conditional spectral 30 Granger causality framework that allows to disentangle system inherent periodic couplings from external forcing. The disentanglement is enabled via decomposition into the frequency domain using wavelet theory. This method enabled the finding that soil respiration in a pine and hardwood forested ecosystem in winter is not influenced by canopy assimilation but only by temperature, a result that would not be detectable via lagged correlation or bivariate Granger causality. A time-frequency representation of Granger causality was presented by Shadaydeh et al. (2019) which allowed to identify anomalous events in marine and ecological time series. Green et al. (2017) used a similar approach as Detto et al. (2012) to investigate biosphereatmosphere feedback loops. It was found that they can explain up to 30% of variance in radiation and precipitation in certain regions. Recently, Papagiannopoulou et al. (2017a) applied a non-linear multivariate conditional Granger causality framework to study climatic drivers of vegetation at the global scale. This approach revealed that water availability dominates plant 5 productivity as 61% of the vegetated surface appeared water limited rather than controlled by radiation or temperature (Papagiannopoulou et al., 2017b). In the case of transfer entropy, Goodwell and Kumar (2017a, b) developed a redundancy measure which allows to distinguish unique, synergistic and redundant information transfer of a bi or potentially multivariate system to a target variable. This modification enables a stronger multivariate interpretation of process networks constructed with transfer entropy. Changes in connectivity then potentially point to different ecosystem response strategies to disturbances (Goodwell 10 et al., 2018). These examples highlight that unexpected interaction patterns can in principle be identified from data only and may challenge theoretical assumptions. In fact, in the last years the science of causal inference has developed a strong theoretical foundation and several algorithms have been proposed (Spirtes et al., 2001;Pearl, 2009;Peters et al., 2017;Pearl and Mackenzie, 2018;Runge et al., 2019a). However, only few studies test the suitability of this latest generation of methods to understand ecosystem dynamics (see e.g. Shadaydeh et al., 2018Shadaydeh et al., , 2019Christiansen and Peters, 2018). 15 Ecological and climate data are often time ordered. This property can be exploited to construct time series graphs (Ebert-Uphoff and Deng, 2012). Recently, Runge et al. (2019b) introduced an algorithm to estimate such graphs, called PCMCI, a combination of the PC algorithm (named after its inventors Peter and Clark, Spirtes and Glymour, 1991)

and the Momentary
Conditional Independence (MCI) test. PCMCI has been successfully applied to artificial (Runge et al., 2018) and climatological case studies (Runge et al., 2014;Kretschmer et al., 2016). Hence, this method could be potentially of high relevance for learning 20 the causal dependency structure underlying biosphere-atmosphere interactions.
In this study, we explore the potential of PCMCI for disentangling and quantifying interactions and feedbacks between terrestrial biosphere state and fluxes and meteorological variables. The study is structured as follows: In Sect. 2 we motivate and introduce the method from an ecological perspective. We also describe artificial and real world datasets explored in this study. The results in Sect. 3 describes the performance of the method on artificial time series data with known dependencies 25 that mimic some basic properties of observed land surface fluxes such as heteroscedasticity. We then report on the exploration of three replicated eddy covariance measurement towers in a Mediterranean ecosystem and explore how the identified interdependencies of carbon and energy fluxes and micro-meteorological observations vary over time. Further, we present the analysis of global satellite data of vegetation greenness to understand the lagged dependency of ecosystems with respect to climatic drivers. Based on these results, Sect.4 discusses the potentials and limitations of PCMCI for other applications in 30 land-atmosphere studies and give recommendations for further methodological developments. Monitoring an ecosystem with continuous observations of net ecosystem exchange (NEE), the underlying gross primary production (GPP) and ecosystem respiration (R textnormaleco ) together with the relevant drivers i.e. global radiation (Rg), surface air temperature (T), and soil moisture (SM), allows to study the dynamics of the carbon cycle in terrestrial ecosystems. To 5 foster its understanding, a fundamental question is how these variables causally depend on each other. This requires the identification of directional dependencies such as the well known effects of SM → GPP and GPP → R eco and their differentiation from physically implausible links such as R eco → GPP. Graphical causal models (Spirtes et al., 2001) provide a framework to represent and identify causal relations based on conditional independence relations in data streams of this kind. In the case of an ecological monitoring site as described here, we can exploit the temporal information of the observations for identifying 10 a time series graph as a type of graphical model (Runge, 2018a). Formally this can be stated as follows: The variables X i t comprise a multivariate stochastic process X (where i is the variable index, in the example i ∈ {Rg, T, GPP, R eco , SM}, and t is the time index). A time series graph G visualizes how the individual variables X i ∈ X depend on each other at specific time lags τ, i.e. X i t−τ with τ ∈ {1, .., τ max } (see Runge, 2018a, for definitions). In the following, we refer to a variable X i t−τ that is causally affecting a variable X j t as 'parent' or 'driver' and the latter as 'receiver' or 'target'. To come to a causal interpretation, 15 it is important to exclude dependencies between two variables that are due to common drivers ( For instance, when estimating the effects of GPP on R eco and Rg on R eco using a bivariate measure, one likely obtains implausible results like to strong or even unexpected links because T, respectively as the common driver and mediator (indirect path), is not accounted for. To exclude dependencies due to common drivers or indirect paths, conditional independence tests are used, denoted as CI(X i t−τ , X j t |S), with some conditioning set S. If any variable (or their combination) in 20 S explains the dependence between X i t−τ and X j t , then the CI statistic is zero. Two prominent methods that aim for directional dependencies are Granger causality and transfer entropy (Granger, 1969;Schreiber, 2000). Granger causality is typically estimated as a vector autoregressive model and thus captures only linear links.
Transfer entropy, based on information theory, captures also non linear dependencies. It can be shown that for multivariate Gaussians, transfer entropy is equivalent to Granger causality (Barnett et al., 2009). Both can be phrased as testing for con-25 ditional independence (Runge et al., 2019a). In their original bivariate form, neither of these two methods accounts for third variables. But both can also be extended to deal with multivariate time series as required here (Runge et al., 2012;Granger, 1969). There are even non-linear and spectral modifications of Granger causality which have been applied to study biosphere atmosphere interactions (Papagiannopoulou et al., 2017a;Detto et al., 2012;Claessen et al., 2019). However, the estimation of multivariate transfer entropy is challenging due to the "curse of dimensionality" (Runge et al., 2012) and also multivariate 30 Granger causality exhibits low link detection power for larger number of variables (higher dimensions) and limited sample size, as is the case in our application (Runge et al., 2019b). The strong decrease in detection power happens when using the whole past X − t = (X t−1 , X t−2 , ...) of X j t , truncated at a maximum lag τ max , as a conditioning set S. The problem is that this set can contain a high number of conditions which are irrelevant. For example, when assessing the effect of Rg at a specific time lag τ on GPP using multivariate Granger causality one would create a vector auto regressive model comprising all variables, i.e. Rg, T, SM, GPP and R eco at each available lag. But R eco , dominated by heterotrophic respiration, is not expected to affect gross primary productivity and could be removed to decrease the dimensionality. However, manually selecting conditions is not desirable when the underlying dependence structure is unknown which is why ideally the conditioning set is identified automatically. 5 PCMCI addresses this issue by reducing the set of conditions S prior to quantifying the dependence between two variables.
The two-step approach utilizes a variant of the PC algorithm (Spirtes and Glymour, 1991) and the momentary conditional independence measure (MCI) (Runge et al., 2019b). More detailed descriptions are given in Sect. 2.4 and 2.5, respectively (full description of PCMCI including proofs and quantitative comparisons with other methods are provided in Runge et al. (2019b)).
A schematic of the PCMCI approach is given in Fig. A1. PCMCI belongs to the family of causal graphical models (Spirtes 10 et al., 2001;Pearl, 2009), and follows the assumptions listed in Sect. 2.2. In the limit of infinite time series length, PCMCI converges to the true graph of dependencies, which is why we use the term "causal". As we deal with finite sample length and partially unfulfilled assumptions, spurious links can still appear (beyond the expected false positive rate) and therefore each detected link has to be interpreted with caution.

15
PCMCI assesses the causal structure of a multivariate dataset or process X by estimating its time series graph. To draw causal conclusions from observational data, any causal method must adopt a number of assumptions (Pearl, 2009;Spirtes et al., 2001). For the time series case, here we assume time order, the causal Markov condition, faithfulness, causal sufficiency, causal stationarity, and no contemporaneous causal effects (Runge et al., 2018). PCMCI is applied in combination with the ParCorr linear independence test based on partial correlations (cf. Sect. 2.3). This application additionally requires stationarity 20 in mean and variance and linear dependencies. In the following, we briefly discuss these assumptions (further details in Runge, 2018a;Runge et al., 2019b).
The time order within the time series allows to orient directed links which are only pointing forward in time. This accounts for causal information propagating forward in time only, i.e. the cause shall precede the effect. Therefore, a directed causal link X i t 1 → X j t 2 can only exist between two nodes X i t 1 , X j t 2 if t 1 < t 2 . When a contemporaneous link is found, i.e. t 1 = t 2 , it is 25 considered to be undirected. In ecological language it means that in order to claim that Rg is driving GPP any change in Rg that is affecting GPP must be measured at a time step before the change in GPP occurs. The Causal Markov and faithfulness assumptions relate the underlying physical causal mechanisms to statistical relationships manifest in the data. The Causal Markov condition states that if two processes are not directly connected by some physical mechanism, then they should be statistically independent conditional on their direct drivers, like Rg and R eco conditional on T . The faithfulness assumption 30 concerns the other direction: if two processes are statistically independent, then there cannot be a direct physical mechanism.
The causal sufficiency assumption implies that every common cause of two or more variables X i ∈ X is included in X. If this is not the case, detected links may be indirect or due to an unobserved common driver. However, the absence of a link in the detected graph still implies that no direct link is present (as this only requires the assumption of faithfulness). For example, Rg is expected to influence R eco via T, the indirect path. Though, a link between Rg and R eco might be detected if T is not included in the analysis. However, a missing link between T and R eco , might indicate conditions inhibiting respiratory processes, i.e. very cold temperatures with frozen surfaces or very dry conditions with dead vegetation coverage. Causal stationarity refers to the existence of links over time. In a deciduous forest, for example, the ecosystem's CO 2 exchange is not causally stationary as the link Rg → GPP is given in summer but not in winter. Formally, a process X with graph G is called causally stationary 5 over a time index T , if and only if for all links X i t−τ → X j t in the graph the condition X i t−τ X j t | X − t \{X i t−τ } holds for all t ∈ T .

Independence Test
At the core of PCMCI there are conditional independence tests CI(X i t−τ , X j t , S) to evaluate whether X i t−τ X j t | S given a conditioning set S. Within the PCMCI software package Tigramite (Runge, 2018b), several independence tests are implemented.
Here, we focus on the linear independence test called ParCorr. The ParCorr conditional independence test is based on partial 10 correlations and a t-test. This assumes the model with coefficients β and Gaussian noise . This leads to the residuals with estimatedβ. ParCorr removes the influence of S on X i and X j via ordinary least squares regression and tests for inde- 15 pendence of the residuals using the Pearson correlation with a t-test. The independence test returns a p-value and test statistic value I, i.e. the correlation coefficient in case of ParCorr. Thus, to identify the effect of GPP on R eco that does account for their common driver T ∈ S, ParCorr will perform a linear regression of T on both GPP and R eco accounting for time lags. The p-value of the residuals' partial correlation test can be used to assess whether the two variables are dependent.

PC algorithm 20
To efficiently estimate CI(X i t−τ , X j t |S) the conditioning set S should be as small as possible which means that it should only contain relevant conditions, which allow to isolate the unique influence of X i t−τ on X j t . For an estimation of CI(Rg t−τ , GPP t |S), for example, S should contain T and SM (at certain lags), as they influence the ability of an ecosystem to perform photosynthesis. Likewise, when estimating CI(T t−τ , GPP t |S), S should include Rg and SM for the same reasons. A sufficient set of relevant conditions includes the drivers/parents of the variable X j t . Consequently, the aim of the PC step is to identify an as small as pos- 25 sible superset of the parents of each variable included in the process. The algorithm uses a variant of the PC algorithm (Spirtes et al., 2001); a comprehensive pseudo-code of this procedure is given in the supplementary materials of Runge et al. (2019b).
In the limit of infinite sample size the relevant conditions indeed converge to the true causal parents, practically though, an estimate that contains a few irrelevant conditions, like R eco , is sufficient as well.
The PC step starts by initializing the whole past of a process: P(X j t ) = X − t = {X i t−τ : i = 1, ..., N, τ = 1, ..., τ max }. Next, by 30 evaluating CI(X i t−τ , X j t , S), conditions X i t−τ are removed from P(X j t ) that are independent of X j t conditionally on a subset S ∈ P(X j t )\ X i t−τ . S starts as the empty set ∅ and is iteratively increased. For instance, let X i t−τ be R eco (at a specific lag) and X j t be GPP. The conditional independence between GPP and R eco will be estimated first by using no conditions. If GPP and R eco appear related, one variable will be included in the conditioning set. If the residuals are still dependent, a second variable is included and so on. When T is part of the conditioning set, the residuals of GPP and R eco might not be dependent anymore and R eco is removed from the estimated set of parents of GPP. The PC algorithm adopted in PCMCI efficiently selects those 5 conditioning sets to limit the number of tests conducted.
Every conditional independence test is evaluated at a significance threshold α pc , which is usually set to a liberal value between 0.1 and 0.4. Alternatively, in tigramite one can let α pc unspecified. PCMCI then evaluates the best choice of α pc ∈ {0.1, 0.2, 0.3, 0.4} based on the Akaike information criterion which is further explained in (Runge et al., 2018).

MCI tests
10 MCI is the actual causal discovery step that ascribes a p-value and strength to each possible link. MCI iterates through all pairs (X i t−τ , X j t ) : i = 1, ..., N, τ = 0, ..., τ max and calculates CI(X i t−τ , X j t , S) where S consists of the two (super-)sets of parents P(X j t ) and P(X i t−τ ) obtained in the PC step. P(X i t−τ ) is constructed by shifting the time series of P(X i t ) by τ. In case X i t−τ ∈ P(X j t ), X i t−τ has to be removed from P(X j t ). If τ = 0, conditional dependence is estimated for contemporaneous nodes X j t and X i t . Due to missing time order, a dependence would be left undirected. Further, as the parents P(X i t ) and P(X j t ) used in each conditional dependence 15 test are defined to lie in the past of X i t and X j t , links, both contemporaneous and lagged, can be spurious due to contemporaneous common drivers or contemporaneous indirect paths. The absence of a link, though, means that a physical (contemporaneous) link is unlikely (assuming faithfulness, cf. Runge et al. (2018)). For simplicity, the previously given examples were omiting the time lag. Thus if R eco responds instantaneously (considering the sampling temporal resolution) to changes in T but T responds with a time lag to Rg, both variables will likely appear contemporaneously coupled to Rg. 20 The link strength in the PCMCI framework can be given by the effect size of the conditional independence test statistic measure CI used in combination with MCI. In case of ParCorr, the effect size is given by the partial correlation value, which is between -1 and 1. Assuming a linear Gaussian model the partial correlation value is shown to directly depend on the receiver's and driver's variance as well as the coupling coefficient (Runge et al., 2019a): are the variances of the noise/innovation terms driving X i t−τ and X j t , respectively, and c is their coupling coefficient. In practice, also non-linear links can often be well detected with ParCorr in so far they be linearly approximated. In case the linear part is even "stronger" than the non-linear part, ParCorr might also have a better detection power than a non-linear independence test (Runge et al., 2018).

Artificial Dataset -Test Model
We tested the algorithm on artificial datasets prior to its application to real world data. The artificial dataset was created using a test model which takes time series of measured global radiation (Rg) and creates three artificial time series that conceptually represent temperature (T ), gross primary production (GPP) and ecosystem respiration (Reco). Note that this test model is not 5 intended to accurately represent observed land-atmosphere fluxes, but only serves to test the procedure. The model incorporates one linear auto dependence T t−τ 1 → T t , one linear additive cross-dependence Rg t−τ 2 → T t and two non-linear dependencies, Fig. 1) according to the equations: The parameters c 1 , c 2 , ..., c 5 are referred to as coupling coefficients, and the time lags are noted as τ 1 , τ 2 ,..., τ 5 . The subscripts mo and obs abbreviate model and observation, respectively. T re f is set to 15 • C. The term ξ, termed "intrinsic" or "dynamical 15 noise", here represents values from uncorrelated, normally distributed noise. Having dynamic noise is essential for a method utilizing conditional independence tests. It is based on the assumption that a process or state is never fully described by its deterministic part because there are unresolved intrinsic processes, summarized as ξ.
The model was fitted to real observational data (using radiation, temperature and land-atmosphere fluxes) of daily time resolution, measured by the eddy-covariance method (Baldocchi et al., 1988;Baldocchi, 2003) from FLUXNET, by minimizing 20 the sum of squared residuals using the gradient descent implemented in the Optim.jl package (Mogensen and Riseth, 2018).
We fitted the model to 72 sites listed in Table B1 given  From each of the 72 sets of parameters we generated four sets of time series each having a length of 500 years. The time series generation was initiated using two types of data: first, uncorrelated, normally distributed noise, and second, unprocessed radiation data as used during the fitting (the available radiation data was repeated to 500 years). The resulting datasets are called baseline dataset and seasonality dataset, respectively. In both cases, the model was run twice, once with homoscedastic 30 (constant variance of ξ), once with heteroscedastic dynamical noise ξ. To induce heteroscedasticity, ξ was multiplied with a mean daily variance that was extracted for each variable at each FLUXNET site. In Supplementary Material Fig. F1 a five year time series excerpt from Hainich site (Knohl et al., 2003b) is shown. A third dataset is generated by anomalization (subtraction of smoothed seasonal mean) of the seasonality dataset.

Eddy Covariance Data -Majadas de Tiètar Experimental site
Data from three towers located in We expect the causal imprints in the data to vary between seasons and during the course of the day. To satisfy causal stationarity, we estimate networks separately for each month and consider only samples for which the potential radiation was above 4 5 of the potential daily maximum, which corresponds to midday samples. We used a mask type that limits only the receiver variable to the respective month and day time values (cf. Table A1 for PCMCI parameter settings). This setting causes 15 time series lengths ranging from 239 datapoints in December to 372 datapoints in July. Minimal and maximal lags were set to 0 and 8, respectively. We left the data unprocessed, i.e. we did not subtract a seasonal mean for anomalisation. Constraining the samples to separate month and midday values reduces the effect of seasonality as a common driver that would lead to spurious links. Furthermore, to correct for multiple testing we applied the Benjamini-Hochberg false discovery rate correction (Benjamini and Hochberg, 1995). Thereby, the p-values for the whole graph obtained from the MCI step are adjusted to control 20 the number of false discoveries (Runge et al., 2018). We chose a two-sided significance level of 0.01.

Gridded global data set
The second observational case study was performed on a global data set. We used data with 0.5 • spatial and monthly temporal resolution from 1982 to 2008. The dataset is composed of three climatic variables, global radiation (Rg), temperature (T) and precipitation (P), and one vegetation state index, the Normalized Difference Vegetation Index (NDVI). Both temperature and 25 precipitation datasets were taken from the Climate Research Unit (CRU), version TS3. 10 (Harris et al., 2014). The radiation data stems from the Climate Research Unit and National Centers for Environmental Prediction dataset (CRUNCEP, Viovy, 2016). The used NDVI data stems from the Global Inventory Monitoring and Modeling Systems (GIMMS) in version 3g_v1 (Pinzon and Tucker, 2014).
To examine the influence of radiation, temperature, and precipitation on NDVI by means of PCMCI we used the following 30 settings. We compute the anomalies by subtracting a smoothed seasonal mean. A maximal time lag of three months was chosen based on the largest lag with significant partial correlation among all pairs of variables, partialling out only the autocorrelation of each variable. The receiver variable was limited to the growing season defined by T>0 and NDVI>0.2, which allows good comparison to Wu et al. (2015). The significance level (α pc ) in the condition selection phase (cf. Sect. 2.4) was chosen based on the AIC selection criterion. A concise list of PCMCI parameters that were altered from default settings is given in  3 Results

Test Model
As motivating example, in Fig. 1 we show PCMCI and lagged correlation networks in the form of process graphs. It is clearly visible, that many more spurious links pass the significance threshold of 0.01 using lagged correlation as compared to using PCMCI. Those spurious links can complicate the analysis or lead to false conclusions and misleading hypotheses. We examined  Additionally, the distributions are split to show the impact of heteroscedastic noise (orange) compared to normal distributed noise (blue).
The significance level of 0.01 is given by a blue horizontal line.
ticity or seasonality is present. The effect on the FPR is an increase above 0.01 for time series length of 1 and 5 years with a much stronger increase due to seasonality (factor of 4 and 8, respectively) than for heteroscedasticity (factor of 2).
The effect of non-stationarities on the TPR differs among the links. The detection of linear links (Rg → T and T → T ) is not affected by seasonality and slightly improves for heteroscedastic dynamical noise. The detection of non-linear links is improved by seasonality with the strongest effects in the link T → GPP. The link T → Reco has a stronger non-linearity and 5 therefore the detection rate shows a weaker effect on seasonality. Furthermore, the coupling coefficient c 5 , the base of T, can be close to or be exactly one (cf. Fig. C1). This would actually cause on the one hand the effect of T → Reco to vanish, rendering a detection impossible and on the other hand result in a linear dependence of GPP on Reco which improves its detection.
Heteroscedasticity seems to have a slight negative effect on non-linear links. In general, the TPRs in the seasonality dataset are quite high, even for non-linear links, and predominantly above 80% and often reaching 100%. 10 Comparing the TPRs of the non-linear links shows some disparity. The links T → GPP and T → Reco experience zero detection in the baseline dataset but partially considerable rates in the seasonality dataset with a strong dependence on the time series length. On the contrary, the median of the TPR of the links Rg → GPP and GPP → Reco is above 95% in the seasonality dataset, even for time series length as short as 91 days, but remains high in the baseline dataset. The removal of the seasonal cycle keeps the TPRs largely unaffected, but reduces the FPR. Nevertheless, it still remains above the significance 15 level by a factor of four and two for five and one year time series length, respectively.   Additionally the distributions are split to show the impact of heteroscedastic noise (orange) compared to normally distributed noise (blue).
In summary, the seasonality dataset exhibits high TPR even for non-linear links. Compared to stationary time series, the detection of non-linear links actually benefits from seasonality. The high detection, though, comes at the cost of a high false positive rate for time series length of and above one year. To a certain degree, the increase in FPR can be counteracted by anomalization.

5
At first we look at the link consistency by comparing networks that were obtained for each tower within a month. The comparison is done for two months with strongly differing climate conditions: April and August. In Fig. 4   nantly at zero lag. Approximately half of the links with lag one are auto-dependent links (a link from the past of a variable to its present). Comparing the links between the month April and August, distinct differences can be noticed. First, August has slightly fewer significant links compared to April. Second, the only links remaining that are significant in two or three towers are between atmospheric variables. Third, the remaining link strengths tend to be weaker in August than in April.
The difference among the seasons is further investigated in Fig. 5 which shows process graphs for each month of the year.

5
We combine the networks of the three towers to one process graph by plotting every link that is significant in at least one tower.
The process graphs in Fig. 5 visualize clearly gradual changes within the interaction structure of the biosphere-atmosphere system during the course of a year. During the main growing season from February to May, NEE is coupled strongly to the energy fluxes latent (LE) and sensible heat (H). These connections weaken, disappear or even switch sign with start and course of the dry season. Less regularly NEE also shows connections to radiation (Rg) and temperature (T). Between the atmospheric 10 variables, a basic network between VPD, T, Rg and H remains intact and relatively constant in strength. The dominance of contemporaneous links is found as well, as seen already in Fig. 4. Besides the decoupling of NEE from any variable in the dry period, there are additional interesting patterns. For example, the positive reappearance of the link between NEE and LE in September. Here, the onset of precipitation events (cf. Fig. J1) occurred that lead to strong respiration peaks (Ma et al., 2012).
Creating such a network via lagged correlation would result in much more significant links (causing the network to be not 15 interpretable as opposed to PCMCI) and NEE does not decouple from the atmosphere in August (cf. Fig. G1).
The above results demonstrate that PCMCI is sensitive enough to capture seasonal differences and certain physiological reasonable biosphere behaviour. Moreover, PCMCI yields a better interpretable network structure than pure correlation approaches.

Gridded global data set 20
Subject of inspection were the significant lags and MCI values of each climatic variable on NDVI. In Fig. 6 the maximal MCI value and the corresponding lag are plotted for the links X t−τ → NDVI : τ ∈ {0, 1, 2, 3} with X being one of the climatic drivers radiation, temperature, or precipitation. The chosen significance threshold was set to 0.05. Fig. 7 shows the climatic driver with largest MCI per grid point. PCMCI detects a regionally varying influence of climatic drivers. As expected, the boreal regions are strongly driven by temperature instantaneously, while (semi-) arid regions, which correspond predominantly to 25 grass or prairie dominated areas, respond strongest to precipitation at a time lag of one month. Radiation is found to have a comparatively low spatial effect with hot spots in south and east China, central Russia and east Canada.
The dominant lags are found to be zero and one. Just a very small fraction of the total area shows a maximal MCI value at a higher lag of two or three months. The lags are also not equally distributed among the climatic drivers. Radiation and temperature are predominantly strongest at lag zero, while precipitation has a much larger fraction of area showing the strongest 30 response at lag one. Regions where the impact of Rg on NDVI is strongest at lag 1 tend to respond negatively to Rg but positively to precipitation at lag one. On the other hand, a large part of regions with the strongest impact of precipitation at lag zero respond negatively to it but positively to radiation.   In summary, PCMCI estimates coherent interaction patterns which match well with anticipated behaviour based on vegetation type and prevailing climatic conditions.

Discussion
Causal discovery methods promise an improved understanding and can help to come up with new hypotheses about the interaction between biosphere and atmosphere (Christiansen and Peters, 2018;Runge et al., 2019a). But the underlying assumptions need to be properly taken into account. The coupled biosphere-atmosphere system possesses several challenges that potentially violate the underlying assumptions of causal discovery in general and the employed method's assumptions in particular. Here, 5 we investigate the effect of a violation of assumptions on PCMCI network estimates.
With regard to expected non-linearities in biosphere-atmosphere interactions, using a linear independence test within the PCMCI framework may not be adequate. We motivate our choice with the following arguments: first, non-linearities are often approximated linearly. Second, a linear regression based test has a much higher power for detecting linear links than a non parametric test (Runge, 2018a) and can, hence, detect links already at smaller sample sizes. Third, linear partial correlation 10 is easily interpretable, for example, positive and negative MCI values. This motivation is supported on the one hand by the results of the test model and on the other hand by additionally performed analyses on the observational datasets using Gaussian regression and distance correlation as an independence test. These results (cf. Fig. H1,I1,K1) show similar patterns but due to the low sample sizes exhibit lower statistical significances. In general, the derived results show a high detection power with a strong consistency in calculated effect strengths on eddy covariance data and global, regularly gridded reanalysis data that 15 leads to well interpretable patterns. Observed drawbacks are a high FPR in case of violated assumptions, especially strong periodicity, as well as the appearance of contemporaneous lags in measurement datasets.

Lessons learned from the test model
The probability to detect a link with PCMCI depends strongly on a link's MCI effect size, which is larger for strong variance in the driver and a low variance in the receiver (cf. Sect. 2.5). Several results can be explained by this observation. First, 20 the variance of three out of five drivers of cross dependencies in the test model are either directly or indirectly (via GPP) influenced by Rg, which has the highest variance of all variables. Consequently, the detection power of the three links is large, almost 100%. In comparison, the other variables' variances are weaker, since they are influenced by T , which results in a lower detection power. This is the origin of the disparity in detection rates of the non-linear links. Second, also the partially strong increase in TPR of non-linear links (influenced in a multiplicative way by T ) from the baseline dataset to the seasonality 25 dataset can be explained by this increase in variance. A multiplicative link is actually not generally expected to be found by ParCorr (Runge, 2018a), but the value of the multiplicative factor is dominated by the seasonal value, and not the dynamical noise, which might cause rather a scaling of the dynamical noise terms rather than a random distortion. Third, the dependence on the variance ratio can also explain the difference in TPR between homoscedastic (equal error variance) and heteroscedastic (error variance changing over time) time series, i.e., the variance of Rg and GPP exhibits a strong seasonality with its peak 30 in summer, while the variance of T is rather constant. This explains, for example, the strong decrease in TPR for the link T → GPP at 91 days time series length when comparing homoscedasticity to heteroscedasticity. The decrease in TPR is less pronounced when another season, implying a different variance, is chosen for this comparison. As links with weak driver variance and strong response variance are more likely to be missed, one may ask which effect this will have on the detection of feedback loops where one variable has low and the other high variance. Here lies a limitation of the test model where no feedback loops were implemented.

Seasonality and heteroscedasticity constitute violations of the stationarity assumption underlying the independence test
ParCorr. Seasonality constitutes a common driver in this model. In general, such common drivers increase the dependence 5 among the variables and hence, lead to a higher detection rate for true links (TPR) as well as a higher false positive rate (FPR) for absent links if this driver is not conditioned out properly. This additionally causes the TPR and the FPR rate to increase in the seasonality model. As shown in Runge (2018a), including the cause of the non-stationarity as an exogenous driver in the analysis allows PCMCI to regress out its influence on the other variables. However, for ParCorr this is only valid if the dependence on the non-stationary driver is linear. Therefore, the regression on Rg fails for GPP and Reco in the test model. 10 With this ill-posed setting, the probability to detect false links increases with increasing time series length or when more periods are included. Stationarity in mean is obviously also not fully guaranteed when subtracting the seasonal mean. Here we observe that the FPR stays above the significance level for the anomalised seasonality dataset. One can ask whether the FPR stays above the significance threshold because subtracting the seasonal mean does not remove the heteroscedasticity. However, we attribute this high FPR to a not fully removed seasonality since the FPR of both homoscedastic and heteroscedastic time series

Causal interpretation of estimated networks from observational data
In both the half-hourly time resolved eddy covariance data and the monthly global dataset the predominant type of dependence found is contemporaneous. PCMCI leaves these undirected since no time order indicating the flow of causal information is available. Further, as discussed in Sect. 2.2, contemporaneous common drivers or mediators are not accounted for. The consequence is that both spurious contemporaneous and spurious lagged links can appear, if they are due to contemporaneous 5 variables. For interactions that are contemporaneous in nature since they occur on considerably shorter time scales than the time resolution, therefore, PCMCI is not the optimal choice regarding a causal interpretation and other methods should be applied in conjunction (Runge et al., 2019a). Further, we faced a trade off between fulfilling causal assumptions and detection power.
In practice, accounting for causal stationarity (by limiting the analysis to certain periods of the dataset) means decreasing the number of available data points while accounting for causal sufficiency leads to an increase in dimensionality by adding 10 variables and increasing the maximal lag. Both will lead to a decrease in detection power, which can affect the network structure. PCMCI alleviates the curse of dimensionality by applying a condition selection step, but still one cannot indefinitely add more variables. Another important factor that affects detection power and dimensionality is the time resolution. There are several points in favour and against increasing time resolution. On the one hand, increasing time resolution can resolve contemporaneous links and potentially increases the detection power due to an increased number of datapoints. On the other 15 hand, the dimensionality increases if the maximal lag is adapted. Further, causal information might be split apart and distributed over more lags, rendering the links at each individual lag less detectable. This can cause links to disappear, but links can also appear if new processes are resolved at a higher time scale. At last, observational noise (measurement errors) might be larger in higher resolution data than in aggregated data, as it is averaged out in the latter and thus affects link detection less. Consequently, when comparing network structures based on different settings, i.e. maximal lag, included variables, time resolution, and 20 considered time period, the (dis-)appearance of single links among specific variables can stem from several factors, i.e. a change in detection power, a changed (conditional) dependency, or due to a common driver. These factors together with a non-zero false positive detection rate are challenging for a causal interpretation. Therefore, detected links should be interpreted with care and can give rise to new hypotheses and analyses involving further variables. Generally, a causal interpretation is more robust regarding the absence of links (cf. Sect.2.2). In particular it does not require that all common drivers are observed. 25 Nevertheless, robust patterns were identified in our studies that are also consistent with other studies. Furthermore, a causal analysis has the advantage of an enhanced interpretability compared to correlative approaches. First of all, we could show that the networks' estimated link strengths are consistent for observational data, even though measurement error affects the data. The dataset used was suitable for this analysis, as the measurement stations are located in a reasonably homogeneous ecosystem that shows only little spatial variation (El-Madany et al., 2018). Thus, also the interaction between biosphere and 30 atmosphere is expected to change only marginally across space within this ecosystem. Second, the gradual changes in plant activity that are taking place in the ecosystem of Majadas throughout the year do very well emerge in the coupling strength of daytime NEE to the atmospheric variables. The observed decoupling during the dry season is in accordance with the one of a soybean field during drought conditions observed by Ruddell and Kumar (2009). The gradual changes in ecosystem activity are not visible in a pure (lagged) correlation analysis or are only visible in color or density changes but the large number of significant links prevents any detailed interpretation on the physical mechanisms and changes thereof. The large number of significant links compared to the PCMCI networks stems solely from the absence of conditioning on common drivers or mediating variables, which often further leads to an overestimation of the link strength in correlation networks. As a result, processes, such as the decoupling of NEE during the dry period, stay hidden. To reduce the effect of confounding, often analyses 5 utilize partial correlation (see e.g. Buermann et al., 2018). However, a partial correlation can introduce new dependencies (as opposed to removing them) if one conditions on causal effects of the variables under consideration (the 'marrying parents' effect). This issue is avoided in PCMCI by only conditioning on past variables. Additionally, PCMCI chooses only relevant variables as conditions by applying the PC condition selection step which is especially valuable in high dimensional study cases and improves detection power and computation time (Runge, 2018a). 10 The global study of climatic drivers of vegetation shows a general pattern of lags and dependence strengths of vegetation on climatic variables that is easily-interpretable. The boreal regions appear energy limited and especially driven by temperature (cf . Fig 6c), while the strongest dependence of (semi-)arid regions on precipitation reflects their limitation in water supply. water logging or too high temperatures. To explain the differences though, we could identify two possible reasons. First, Papagiannopoulou et al. (2017b) masked out negative influences of radiation arguing that radiation is not negatively affecting NDVI. They found, that negative influences of Rg are usually a consequence of poor conditioning on other variables. Second, a precipitation event in boreal regions coincides with a reduction in radiation and temperature. Boreal regions usually do not 35 suffer from water shortages. Thus they respond stronger to the reduction of radiation and temperature than precipitation. As precipitation is coupled negatively to radiation and temperature at lag zero, the effect of precipitation on NDVI is found to be negative. Thus, the link P − − →NDVI might be an effect of the contemporaneous common driver scheme P − ← − Rg + − → NDVI and therefore would not be causal. In fact, a similar argumentation can be given for the negative impact of temperature and radiation on NDVI in arid regions.

5
In summary, we pointed out the need for careful interpretations in applying causal discovery methods and especially highlighted the challenges linked to the study of biosphere-atmosphere interaction via PCMCI. We demonstrated that the network structures estimated from observational data are explainable with respect to plant physiology and climatic effects. Finally, our study shows that causal methods can deliver better interpretability and a much improved process understanding in comparison to correlation and bivariate Granger causality analyses that are ambiguous to interpret since they do not account for common 10 drivers.

Outlook
The preceding discussion has shed light on the merits of PCMCI as well as the challenges of applying causal discovery methods. Runge et al. (2019a) discuss further challenges and methods and give an outlook how multiple methods can be combined to alleviate limitations.

Conclusions
Here we tested PCMCI, an algorithm that estimates causal graphs from empirical time-series. We specifically explored two types of data sets that are highly relevant in biogeosciences: eddy covariance measurements of land-atmosphere fluxes and global satellite remote sensing of vegetation greenness. The causal graphs estimated from the eddy covariance data collected in a Mediterranean site confirm patterns we would expect in these ecosystems: During the dry season's plants senescence, 5 for instance, the ecosystem's carbon cycle (NEE) decouples from meteorological variability. On the contrary during the main growing season with warm and humid conditions strong links between NEE, LE and H characterise the graph. Not only the strongly contrasting states emerge in the graph structure using the causal framework, but also the gradual transitions that relate to minor changes like the connectivity of sensible heat to temperature with progressing dryness. A purely correlative analysis, instead, is not able to resolve these patterns. PCMCI allows us to identify and focus on much fewer, but highly relevant 10 dependencies only. Applying the approach to three replicated eddy covariance systems shows the robustness of the method to random errors in the fluxes measurements and confirm one of the assumption of eddy covariance: above a relatively homogeneous terrain the fluxes measured should be spatially invariant, and so the underlying causal relationship between climate and fluxes. The global analysis of NDVI in relation to climatic drivers confirms the known patterns of dependence strengths of vegetation on climatic variables: boreal regions are energy limited and especially driven by temperature and secondarily radiation, 15 while in semi-arid regions vegetation dynamics are strongly dependent on water supply. However, obtained response times of vegetation to climatic variations are lower using PCMCI than correlation which can be attributed to a better treatment of the autocorrelation in the time-series and cross-relations among climate variables. Compared to merely correlative approaches, this leads to a interpretable pattern of driver-response relationships. In short, the new developments achieved in causal inference allow to gain well constrained insights on processes, that would otherwise be drowning in the correlation chaos. Therefore 20 we hope that this study fosters usage of causal inference in analysing interactions and feedbacks of the biosphere-atmosphere system and furthermore exhibits our demand of further developments.

Process Graph Time series Graph
Aim: To study process X via graphical models, we need to estimate conditional independence between all variables: Problem: Conditioning on the whole past X − t leads to high dimensional estimations.
PCMCI approach PC-step 1. Start with a fully connected network ...
Remove nodes from P(X j t ) iteratively by keeping nodes that are conditional dependent to X j t at a significance level set by the parameter α pc (typically 0.2 -0.4).
Ideally the true parents are still in the set of parents (colored arrows). But likely some spurious parents as well (grey arrows).

MCI-step
1. Perform the conditional independence test for each pair of nodes using their estimated set of parents from the PC-step as conditioning set.
For example a link between X 2 t−1 und X 3 t exists iff:      April August Figure H1. Same as Fig. 4 of the manuscript but the analysis was performed using a non-linear independence test. Comparison of the networks of three eddy covariance measurement stations (LMa, LM1, LM2) located in Majadas (Spain). Links that are found to be significant in one of the three networks are included. For each link, the calculated strength of all three networks is plotted with its 90% confidence interval. The colors blue, orange, and green correspond to the towers LMa, LM1, and LM2, respectively. The significance threshold is 0.01.
If a link does not pass the significance, it is marked by a black dot. The links are grouped into lag 0 (top), lag 1 (middle) and all lags greater than 1 (bottom). Links at lag 0 are left undirected (−), yet as Rg is set as main driver, links incorporating Rg at lag 0 are directed (→). Note that GPDC only yields positive link strengths. Further, the strength values estimated with GPDC are rather weak due to the low number of datapoints and the larger sensitivity of that method to the sample size  Author contributions. CK and MDM designed the study with contributions from JR and DGM. CK conducted the analysis and wrote the manuscript. All authors helped to impove the manuscript. AC, MM, TEM, OPP conducted field experiments in Majadas, processed and provided its data as well helped with interpretation.
Competing interests. The authors declare that they have no competing interests.
Acknowledgements. We thank Maha Shadaydeh, Rune Christiansen, Jonas Peters, Milan Flach and Markus Reichstein for useful comments.