Explainable machine learning for modeling of net ecosystem exchange in boreal forests

Ezhova, Ekaterina; Laanti, Topi; Lintunen, Anna; Kolari, Pasi; Nieminen, Tuomo; Mammarella, Ivan; Heljanko, Keijo; Kulmala, Markku

doi:https://doi.org/10.5194/bg-22-257-2025

Articles | Volume 22, issue 1

Research article

13 Jan 2025

Abstract

There is a growing interest in applying machine learning methods to predict net ecosystem exchange (NEE) based on site information and climatic variables. We apply four machine learning models (cubist, random forest, averaged neural networks, and linear regression) to predict the NEE of boreal forest ecosystems based on climatic and site variables. We use data sets from two stations in the Finnish boreal forest (southern site Hyytiälä and northern site Värriö) and model NEE during the peak growing season and the whole year. For Hyytiälä, all nonlinear models demonstrated similar results with R² = 0.88 for the peak growing season and R² = 0.90 for the whole year. For Värriö, nonlinear models gave R² = 0.73–0.76 for the peak growing season, whereas random forest and cubist with R² = 0.74 were somewhat better than averaged neural networks with R² = 0.70 for the whole year. Using explainable artificial intelligence methods, we show that the most important input variables during the peak season are photosynthetically active radiation, diffuse radiation, and vapor pressure deficit (or air temperature), whereas, on the whole-year scale, vapor pressure deficit (or air temperature) is replaced by soil temperature. When the data sets from both stations were mixed, soil water content, the only variable clearly different between Hyytiälä and Värriö data sets, emerged as one of the most important variables, but its importance diminished when input variables labeling sites were added. In addition, we analyze the dependencies of NEE on input variables against the existing theoretical understanding of NEE drivers. We show that even though the statistical scores of some models can be very good, the results should be treated with caution, especially when applied to upscaling. In the model setup with several interdependent variables ubiquitous in atmospheric measurements, some models display strong opposite dependencies on these variables. 
This behavior might have adverse consequences if models are applied to the data sets in future climate conditions. Our results highlight the importance of explainable artificial intelligence methods for interpreting outcomes from machine learning models, particularly when a set containing interdependent variables is used as a model input.

Received: 01 Nov 2023 – Discussion started: 06 Dec 2023 – Revised: 29 Aug 2024 – Accepted: 29 Oct 2024 – Published: 13 Jan 2025

1 Introduction

Forests play an important role in the global carbon cycle because they remove carbon from the atmosphere through photosynthesis and store it in the wood biomass and forest soil. Recent studies suggest that in the past several decades, the net carbon uptake of the boreal forest has been increasing and that of the tropical forest has been decreasing, making the boreal forest the largest terrestrial carbon sink on the planet (Tagesson et al., 2020). The dynamics of the forest carbon cycle and its interaction with various climatic drivers are generally well understood; however, the complex responses of forests to climate change and their potential to mitigate its impacts keep boreal forests at the forefront of multidisciplinary research. This ongoing interest spans from observational studies to global modeling efforts (Artaxo et al., 2022; Petäjä et al., 2022; Kulmala et al., 2020, 2023; Tang et al., 2023). There is a growing need for more accurate models of carbon fluxes, providing reliable results in warming climate conditions (Kämäräinen et al., 2023). Hence, suitable models must correctly capture current carbon cycle dynamics using commonly measured ecosystem-level data and give reasonable predictions for, e.g., future higher temperatures. In other words, the models' performance should be adequate in the range of values currently underrepresented in the data sets.

In addition to traditional process-based models (Launiainen et al., 2022; Junttila et al., 2023), the use of machine learning (ML) models has become ubiquitous. ML models play an important role in providing an alternative for the hypothetic deductive modeling approach, i.e., an inductive approach. This means no prior assumptions are made about the data, which are modeled with a purely empirical model with a general function class. Currently, there are plenty of carbon flux data available from the FLUXNET database, as well as extensive meteorological reanalysis data sets or measurements of many different variables directly from research stations. Data availability has boosted the application of data-intensive ML methods to carbon flux modeling (Dou and Yang, 2018; Zeng et al., 2020).

Using ML, the functional relationship between carbon flux (net ecosystem exchange, NEE; gross primary production, GPP; or respiration) and the site and climatic variables, including radiation and meteorological and biospheric input parameters, can be obtained. There exists plenty of literature featuring the ML approach to quantify different components of the carbon cycle using site and climatic variables as input (Dou and Yang, 2018). In many studies (Cai et al., 2020; Wood, 2021; Zhu et al., 2023; Zeng et al., 2020), researchers identify “the best model”, which reproduces the carbon fluxes depending on the available set of input parameters better than other models. Statistical accuracy metrics are typically used as a criterion for model assessment. Many different ML models have been tested, but random forest has appeared particularly popular (Liu et al., 2021; Reitz et al., 2021).

However, these empirical machine learning models are often a “black box” in the sense that the parameters the models use to make predictions cannot be extracted directly in a form that humans can easily interpret. The results, therefore, should be treated cautiously. Recently, Shirley et al. (2023) demonstrated with an example from Alaska that the boosted regression tree ML model gave inaccurate results in current and future carbon balance estimates at high latitudes. Increasing the data set by adding more stations from the same area improved the result for the current carbon sink. Still, future estimates were unreliable, which was ascribed to the fact that data sets representing future conditions could not be used for model training.

In response to this need, various methods that attempt to make ML models more open and interpretable have emerged. They are called explainable artificial intelligence (XAI) methods (Dwivedi et al., 2023). With XAI techniques, researchers can explore and analyze the factors that influence the model outcomes, making it easier to interpret the results and enhance the utility of ML approaches, e.g., in the context of carbon cycle research.

In the present study, we model boreal forest NEE with subhourly time resolution, using an extensive set of input variables from two research stations at different latitudes: Hyytiälä at 61°51′ N and Värriö at 67°46′ N. Using the same time resolution, we use different data sets considering separately the peak growing season (defined as the period of maximum photosynthetic activity of an ecosystem) and the whole year. The Hyytiälä data set is divided into pre- and post-thinning data periods because the thinning of a forest (i.e., cutting down a share of the trees) significantly impacts not only the NEE but also many site variables.

We expect an ML model to learn differently depending on the seasonality of the time series used for model training. For example, diffuse radiation is an essential input variable for photosynthesis on a subhourly scale during the peak growing season because ecosystem photosynthesis is enhanced under higher-diffuse-radiation conditions due to better light use efficiency (Gu et al., 2002; Ezhova et al., 2018). In winter, this effect is missing, which might make diffuse radiation not as crucial a variable for the model trained on the whole-year data set. Instead, other input variables, such as air or soil temperature, can be relevant when focusing on the seasonal cycle of carbon fluxes (Kolari et al., 2009). Moreover, besides time-related factors, a spatial factor represented by latitude is also expected to affect the model buildup. The first aim of this study is to analyze how ML models treat input variables related to temporal (peak season vs. whole year) and spatial variability.

The second aim is to use different ML models to understand how the best model's outcome compares to what we know from process understanding of the carbon fluxes' dynamics. In addition to that, we compare different ML models and check if all of them reproduce CO₂ flux dynamics robustly, if they tend to choose the same important input variables, and if dependences on these variables are similar between the models.

Finally, we combine data sets from the two latitudes, include data from a post-thinning period in Hyytiälä, and use XAI to understand how the models perform on this mixed data set. We introduce additional variables (the site variables) distinguishing between the sites and model NEE with and without these variables.

In this study, we have several research goals:

- compare the ML models' performance for two ecosystems from different latitudes but with the same main tree species using accuracy metrics and XAI (with a linear regression model as a baseline); assess the reliability of results based on the robustness of their reproduction by different models;
- analyze the shift in the choice of model variables and their general performance depending on the seasonality (i.e., peak growing season or the whole year) and latitude;
- study how combining the data sets from the two studied forest ecosystems at different latitudes and including post-thinning data affects model results.

2 Materials and methods

2.1 Stations and data sets

We used atmospheric observations from the SMEAR II station in Hyytiälä, Finland (Hari and Kulmala, 2005), and the SMEAR I station in Värriö, Finland (Hari et al., 1994). The stations are located in boreal forests in central Finland (Hyytiälä: 61°51′ N, 24°17′ E; 80 m a.s.l.) and in the Finnish subarctic region (Värriö: 67°46′ N, 29°36′ E; 180 m a.s.l.). The mean annual air temperature is 3.5 °C in Hyytiälä and −0.5 °C in Värriö (source: ICOS database). The mean annual precipitation in Hyytiälä is 710 mm, and in Värriö it is 601 mm. Forest stands at both sites are dominated by 60–65-year-old Scots pines (Pinus sylvestris L.). However, the average tree height differs, being ca. 19.9 m at SMEAR II and 10 m at SMEAR I, as measured in 2023. The forest canopy at SMEAR II is closed, and at SMEAR I it is open. Both sites are part of the Integrated Carbon Observation System (ICOS) and Integrated European Long-Term Ecosystem, critical zone, and socio-ecological Research (eLTER) networks, meaning continuous observations of carbon fluxes and other ecosystem parameters. Meteorological variables and radiation are also routinely measured at the stations. The data are publicly available to download from the SmartSMEAR database (https://smear.avaa.csc.fi/, last access: September 2022; latest updated data sets can be found at https://etsin.fairdata.fi/data sets/SmartSMEAR, last access: November 2023).

Data from Hyytiälä were divided into two separate data sets: pre-thinning, referred to just as Hyytiälä data (prior to 2019), and post-thinning (post 2019), referred to as post-thinning Hyytiälä data. The separation is due to the thinning of the forest at Hyytiälä station in 2019, which involved the removal of smaller trees from the forest understory, and additional thinning (from below) conducted from January to March 2020. In the thinning, 30 % of tree basal area was removed (Aalto et al., 2023), which significantly changed NEE due to the decrease in biomass. The post-thinning data thus differed too much from the pre-thinning data to be treated as their direct continuation. The number of data points and the time intervals for each data set can be seen in Table 1.

Table 1. Summary of data sets: time periods and number of observations.

Table 2. List of input variables used for model training.

The data used in this study have a 30 min interval. This high frequency enables a more detailed study of the daily cycle of NEE and allows us to analyze variables that affect ecosystem processes on short timescales, such as the effect of changes in radiation on photosynthesis. Raw measurements of the target variable (NEE) were collected using the eddy covariance technique (Aubinet et al., 2012) and then processed into NEE through the EddyUH software (Mammarella et al., 2016). Negative NEE corresponds to the ecosystem acting as a net carbon sink, while positive NEE corresponds to the ecosystem acting as a net carbon source. We model NEE using meteorological variables such as air temperature, soil temperature, solar radiation, relative humidity, and soil moisture content. The leaf area index (LAI) is not used here as its seasonal variability in the chosen period is relatively small (Hyytiälä about 30 %; Värriö 20 %), which translates to a change of below 10 % in canopy light interception and roughly the same percentage in GPP. For some input variables, minor differences exist in how the data are measured at the two stations (e.g., soil moisture is from different depths). The data used were non-gap-filled to avoid the influence of models typically used for gap filling. At Hyytiälä, photosynthetically active radiation (PAR) was not measured before 2009, and we used global radiation multiplied by the PAR quantum efficiency of 2 $\mu\mathrm{mol}\,\mathrm{s}^{-1}\,\mathrm{W}^{-1}$ (Ross and Sulev, 2000; Ezhova et al., 2018) to calculate missing values of PAR. All variables used are listed in Table 2.

In the pre-processing of the data, time points that contained missing values of any studied input variable were discarded. In addition, all rows where the PAR value was less than 10 $\mu\mathrm{mol}\,\mathrm{s}^{-1}\,\mathrm{m}^{-2}$ were filtered out because we focus solely on modeling daytime NEE. We calculated the diffuse fraction,

$$F_{\mathrm{dif}} = \frac{\mathrm{PAR}_{\mathrm{dif}}}{\mathrm{PAR}}, \tag{1}$$

and vapor pressure deficit (Monteith and Unsworth, 2013),

$$\mathrm{VPD} = e_{s} - e_{a}, \quad \text{where} \quad e_{s} = 611 \exp\left(\frac{17.27\, T_{\mathrm{air}}}{237.7 + T_{\mathrm{air}}}\right), \quad e_{a} = e_{s}\, \frac{\mathrm{RH}}{100}. \tag{2}$$

In Eq. (2), T_air is in degrees Celsius (°C) and e_s and e_a are in pascals (Pa).
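The pre-processing steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used in the study (which was written in R): the function names and the dictionary keys `par`, `par_dif`, `t_air`, and `rh` are hypothetical, while the constants (2 µmol s⁻¹ W⁻¹, 10 µmol s⁻¹ m⁻², and the coefficients of Eq. 2) are taken from the text.

```python
import math

PAR_PER_WATT = 2.0      # µmol s^-1 W^-1, conversion factor quoted in the text
PAR_DAYTIME_MIN = 10.0  # µmol s^-1 m^-2, daytime threshold quoted in the text


def par_from_global(global_rad_w_m2):
    """Estimate PAR (µmol s^-1 m^-2) from global radiation (W m^-2)."""
    return PAR_PER_WATT * global_rad_w_m2


def diffuse_fraction(par_dif, par):
    """Eq. (1): ratio of diffuse PAR to total PAR."""
    return par_dif / par


def vapor_pressure_deficit(t_air_c, rh_percent):
    """Eq. (2): VPD in Pa from air temperature (deg C) and relative humidity (%)."""
    e_s = 611.0 * math.exp(17.27 * t_air_c / (237.7 + t_air_c))
    e_a = e_s * rh_percent / 100.0
    return e_s - e_a


def preprocess(rows):
    """Drop incomplete and night-time rows, then add the derived variables.

    `rows` is a list of dicts with the hypothetical keys
    'par', 'par_dif', 't_air', and 'rh'; missing values are None.
    """
    cleaned = []
    for row in rows:
        if any(value is None for value in row.values()):
            continue  # discard time points with any missing input
        if row["par"] < PAR_DAYTIME_MIN:
            continue  # keep daytime observations only
        row = dict(row)  # copy so the input list is not modified
        row["f_dif"] = diffuse_fraction(row["par_dif"], row["par"])
        row["vpd"] = vapor_pressure_deficit(row["t_air"], row["rh"])
        cleaned.append(row)
    return cleaned
```

Filtering on PAR before computing the diffuse fraction also avoids dividing by zero for night-time rows.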

Table 3. Overview of the training configurations for ML models across different data sets.

Table 4. List of the final model hyperparameters with their respective values for each modeling setup. Values of parameters are listed in the following order corresponding to different setups: Hyytiälä all, Hyytiälä peak, Värriö all, Värriö peak, mixed data sets with site label, and mixed data sets without site label.

The machine learning models were trained in two sets of four setups (Table 3), and the results within a set were compared against each other. For both sets, four different machine learning models were trained for each of the four setups, giving a total of 32 trained models. In the first set, models for data representing the entire year and the peak growing season (July and August) were trained using data from either pre-thinned Hyytiälä or Värriö. In the second set, models were trained by combining the data from the two sites into a single mixed data set and then training them with and without variables that denote the site from which the data originate (“Värriö”, “Hyytiälä” for Hyytiälä pre-thinned, “HyytiäläT” for Hyytiälä post-thinned). As in Set 1, setups included the entire year and the peak growing season. A summary of the configurations for all experiments can be seen in Table 3.

In all cases, the data were split into training and test data, where training data were used to train the models, while test data were used to evaluate the models' performance. For modeling NEE for pre-thinned Hyytiälä and Värriö, 75 % of their respective data were used for training the model, while the rest were used as the test data to evaluate the model performance. In the case of the mixed model, 80 % of each respective data set was used to train the model. The processed data used for training the machine learning models are publicly available at Laanti (2024).
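A reproducible random split of this kind might look as follows. This is a minimal sketch, assuming a simple shuffled split with a fixed seed; the function name, the seed value, and the use of plain random shuffling are illustrative assumptions, since the text does not specify how the partition was drawn.

```python
import random


def train_test_split(rows, train_fraction=0.75, seed=42):
    """Shuffle row indices with a fixed seed and split them into train and test parts."""
    rng = random.Random(seed)  # fixed seed makes the split repeatable
    indices = list(range(len(rows)))
    rng.shuffle(indices)
    n_train = round(train_fraction * len(rows))
    train = [rows[i] for i in indices[:n_train]]
    test = [rows[i] for i in indices[n_train:]]
    return train, test
```

With `train_fraction=0.75` this corresponds to the single-site setups, and with `train_fraction=0.8` to the mixed data set, where the split would be applied to each site's data before combining them.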

2.2 Machine learning models

To ensure robustness and reduce potential biases, we validate our findings across four distinct ML models, aiming to identify consistent patterns or insights and provide an overall picture of how well the models can use these data to predict NEE. Applying several models to the same data set provides a context on what input variables are consistently considered important across different models. The four models used were cubist (Quinlan, 1992), random forest (Breiman, 2001), averaged neural network (Kuhn, 2008), and basic linear regression (Kutner et al., 2004). All were implemented in R (v. 4.3.0: https://www.r-project.org/, last access: November 2023) using R's “caret” library (v. 6.0.94: https://github.com/topepo/caret/, last access: November 2023). Linear regression served as the baseline model, while the other models were chosen due to their proven competence in solving various regression problems (Fernández-Delgado et al., 2019). The code used for training the machine learning models is publicly available at Laanti (2024).

Random forest (RF) is a popular model that has been used in previous research (Cai et al., 2020; Liu et al., 2021; Abbasian et al., 2022; Zhu et al., 2023) due to its ease of use, high accuracy, and robustness. It is an ensemble model that uses the averaged output of random regression trees (Fernández-Delgado et al., 2019) by training different regression trees on different subsets of the data. The final prediction is the average result of the different tree predictions. The algorithm is quite robust as the different trees are trained with the different subsets of the training data. The randomForest library (Liaw and Wiener, 2002) implements the regression algorithm of RF used in this study.

Cubist is one of the best-performing regression models (Fernández-Delgado et al., 2019) across multiple types of data sets (i.e., type and size of data). Like RF, it is created from multiple individual regression trees, where each terminal leaf contains a smoothed linear regression model for prediction (Zhou et al., 2019). It creates a series of if–then rules that can be considered the branches of a tree, while the leaves are an associated multivariate linear model. The corresponding model is used to calculate the final predicted value as long as the set of covariates satisfies the conditions of the corresponding rule. Cubist also uses boosting with its training committees, which creates a series of trees with different weights and nearest-neighbor search to adjust the predictions better.

The model-averaged neural network (avNNet) is built from single-hidden-layer feed-forward neural networks. Each network consists of interconnected neurons arranged in layers, with the final layer outputting the prediction (Ripley, 2007). During the training phase, initial weights, which influence predictions, are randomly assigned. These weights are then iteratively updated, enabling the network to capture nonlinear relationships. Given the randomness in predictions due to these initial weight assignments, avNNet constructs multiple neural network models and averages their results. This averaging process promotes a more robust and stable prediction by minimizing the impact of any single model's randomness.

The basic multivariate linear regression (LinRegr) is used as a baseline to quantify how much the more advanced models improve the results. LinRegr finds a linear relationship between the independent and dependent variables, determined by minimizing the sum of the squared differences between the predicted and the actual values (Hastie et al., 2009).

2.3 Cross-validation framework, hyperparameter tuning, and validation metrics

K-fold cross-validation is a resampling method for validating model efficiency, which generally results in less biased models (Jung, 2018). The K-fold cross-validation method shuffles the data set randomly and splits it into K groups or folds. First, each fold is taken as a holdout, while the model is fit on the rest of the folds, and then the model is evaluated on the holdout set. The score is retained, and the model is discarded. In repeated K-fold cross-validation, this process is done R times on different splits. K-fold cross-validation also effectively prevents model overfitting, where a machine learning model has learned to model the inherent noise of a data set, to a point where it fails to model for points not included in the training data set (Berrar, 2019).

During the model training, repeated K-fold cross-validation was used with the grid hyperparameter search of the caret library (Kuhn, 2023). This method trains and evaluates a model using all possible combinations of specified hyperparameter values to identify the combination that yields the best model performance. It was used to tune the models' hyperparameters, i.e., configuration settings that are external to the model and can be adjusted to optimize performance. Values of R=5 repeats and K=10 folds were used to fit each model. The tuned hyperparameters can be seen in Table 4. The train and test data as well as the folds of the K-fold cross-validation were split using a predetermined random split to ensure repeatability. However, due to technical limitations, in-depth hyperparameter tuning was not used on the models that contained data from all sites. Instead, hyperparameters based on the results from the single-site models were used.
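To make the procedure concrete, the following sketch combines repeated K-fold cross-validation with a grid search that picks the hyperparameter value with the lowest mean RMSE. It is an illustration only and does not reproduce caret's implementation: the toy model (a one-dimensional ridge regression through the origin with penalty `lam`) and all function names are assumptions chosen to keep the example self-contained.

```python
import math
import random


def repeated_kfold(n, k=10, repeats=5, seed=0):
    """Yield (train_indices, holdout_indices) for each fold of each repeat."""
    rng = random.Random(seed)  # predetermined seed keeps the folds repeatable
    for _ in range(repeats):
        indices = list(range(n))
        rng.shuffle(indices)
        folds = [indices[j::k] for j in range(k)]  # k roughly equal folds
        for j in range(k):
            train = [i for m, fold in enumerate(folds) if m != j for i in fold]
            yield train, folds[j]


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true))


def fit_ridge_1d(x, y, lam):
    """Toy model (assumption): slope of y = w * x with an L2 penalty lam."""
    return sum(a * b for a, b in zip(x, y)) / (sum(a * a for a in x) + lam)


def grid_search(x, y, lambdas, k=10, repeats=5, seed=0):
    """Return the penalty with the lowest mean cross-validated RMSE and all scores."""
    scores = {}
    for lam in lambdas:
        fold_scores = []
        for train, holdout in repeated_kfold(len(x), k, repeats, seed):
            w = fit_ridge_1d([x[i] for i in train], [y[i] for i in train], lam)
            predictions = [w * x[i] for i in holdout]
            fold_scores.append(rmse([y[i] for i in holdout], predictions))
        scores[lam] = sum(fold_scores) / len(fold_scores)
    best = min(scores, key=scores.get)
    return best, scores
```

Using the same seed for every candidate value means that all candidates are scored on identical folds, so differences in the mean RMSE reflect the hyperparameter rather than the split.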

In evaluating the performance of our machine learning models, we primarily relied on two key metrics to assess the models' goodness of fit: the coefficient of determination (R²) and the root mean squared error (RMSE). RMSE measures the differences between the values predicted by a model and the actual values and provides an understanding of the magnitude of error the model might make in its predictions. A lower RMSE indicates a better fit to the data, implying that the model's predictions are more precise. The models' hyperparameters were tuned specifically based on the RMSE score.

In addition, each model was trained on five different data splits to account for variability and reduce the influence of any single fortunate or unfortunate split on the results. The performance metrics, R² and RMSE, were averaged across these splits to ensure a robust and reliable assessment of model performance.
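The two metrics and their aggregation across splits can be written compactly as below. This is a sketch under assumptions: R² is computed here as one minus the ratio of residual to total sum of squares, whereas some libraries instead report the squared Pearson correlation, and the text does not state which definition was used. The function names are illustrative.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between observations and predictions."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true))


def r_squared(y_true, y_pred):
    """Coefficient of determination, 1 - SS_res / SS_tot (assumed definition)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((a - b) ** 2 for a, b in zip(y_true, y_pred))
    ss_tot = sum((a - mean_y) ** 2 for a in y_true)
    return 1.0 - ss_res / ss_tot


def summarize(scores):
    """Mean, minimum, and maximum of a metric across data splits."""
    return {"mean": sum(scores) / len(scores), "min": min(scores), "max": max(scores)}
```

Applying `summarize` to the five per-split values of each metric gives the mean together with the spread between the least and most favorable split.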

2.4 Explainable AI methods

As machine learning models have been used more in research and industry, the demand for more transparent and interpretable models has grown (Dwivedi et al., 2023). As model accuracy has risen, so has model complexity. The highly accurate and complex models have many hyperparameters that cannot be made human-understandable. To be trustworthy, the ML model must produce interpretable or transparent results. Relying on unexplained or inaccurate predictions can lead to critical errors. Accuracy metrics do not always portray the true prediction capability of a model, so it is vital to critically evaluate the results against existing knowledge or theories. XAI methods aim to provide machine learning models and methods that enable users to better understand, analyze, and evaluate the models' decision-making.

In this study, we used two XAI methods: permutation feature importance and accumulated local effect (ALE) plots (Molnar, 2020). They provide insight into how the input variables affect a model's output. Both are model-agnostic global methods, meaning they can be used regardless of the selected model and provide interpretations on the data set as a whole rather than individual points (Molnar, 2020). Both of these methods were implemented using R's “iml” library (v.0.11.1: https://github.com/christophM/iml/, last access: November 2023, Molnar et al., 2018).

2.4.1 Permutation feature importance

Permutation feature importance measures the increase in the prediction error of a model after the values of an input variable (feature) are permuted. The relationship between a specific input variable and the variable the model tries to predict is deliberately disrupted to understand how the model's prediction accuracy is affected (Molnar, 2020). If an input variable is important, randomly rearranging its values increases the model error because the model relies on that input variable for an accurate prediction. The trained model is denoted as f, the input variable matrix as X, the target vector as y, and the error measure as L(y,f(X)). The algorithm works as follows.

1. The original model error $e = L(y, f(X))$ is estimated.
2. For each input variable with index $i \in \{1, \dots, p\}$, where p is the total number of input variables,
   - 2.1 a permutated input variable matrix $\hat{X}$ is generated by permuting input variable i in the data X, which breaks the association between input variable i and the true outcome y;
   - 2.2 the error caused by the permutation is estimated by predicting with it, $\hat{e} = L(y, f(\hat{X}))$; and
   - 2.3 the permutation input variable importance is calculated as the quotient $\mathrm{Imp}_{i} = \hat{e} / e$.
3. Input variables are sorted by descending Imp.
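The steps above can be sketched in plain Python as follows. This is an illustrative implementation, not the one in the “iml” library: the function names are assumptions, RMSE is assumed as the error measure L, and `predict` stands for any trained model f that maps one row of inputs to a prediction.

```python
import math
import random


def rmse(y_true, y_pred):
    """Root mean squared error, used here as the error measure L (assumption)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true))


def permutation_importance(predict, X, y, loss=rmse, seed=0):
    """Return (feature index, importance) pairs sorted by descending importance.

    X is a list of rows (lists of floats) and y the list of observed targets.
    """
    rng = random.Random(seed)
    base_error = loss(y, [predict(row) for row in X])  # step 1: e = L(y, f(X))
    importances = {}
    for i in range(len(X[0])):  # step 2: loop over input variables
        column = [row[i] for row in X]
        rng.shuffle(column)  # 2.1: permute variable i only
        X_perm = [row[:i] + [value] + row[i + 1:] for row, value in zip(X, column)]
        permuted_error = loss(y, [predict(row) for row in X_perm])  # 2.2
        importances[i] = permuted_error / base_error  # 2.3: quotient of errors
    return sorted(importances.items(), key=lambda item: item[1], reverse=True)  # step 3
```

An importance close to 1 means that shuffling the variable leaves the error unchanged, so the model does not rely on it; values well above 1 indicate variables the model depends on.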

Only test data are used to calculate the permutation feature importance. Assessing feature importance using the training data might result in inflated scores because a model can overfit the training data. In that case, features with very high scores might not be as important for making accurate predictions on new, unseen data. As with the metrics R² and RMSE, the permutation feature importance was calculated on multiple different data splits to ensure robustness of the results.

2.4.2 ALE plots

Accumulated local effect (ALE) plots describe how input variables influence the prediction of a machine learning model on average (Molnar, 2020). ALE reduces a complex machine learning function to a function that depends on only one, as in our case, or two input variables and visualizes the effects between a selected variable and the prediction of the target variable of a machine learning model. The idea is to remove the unwanted effects of other input variables, take partial derivatives (local effects) of the prediction function with respect to the feature of interest, and integrate (accumulate) them with respect to the same feature.

The value of ALE at a certain point can be thought of as the effect of the selected variable at a specific value compared to the average prediction made on the data. To calculate the ALE value for input variable s at point $x \in [min (x_{s}), max (x_{s})]$ , with x_s being the vector of this variable's values, the input variable values x_s are divided into K intervals, where the start of the first interval is the lowest value z₀=min(x_s), and the differences of predictions between the sequential intervals are calculated. While the exact ALE formula requires a model with a derivative, an approximate version is used here that is more widely adopted and works for models without a derivative. Initially, an uncentered effect is computed:

$$\overline{f}_{s,\mathrm{ALE}}(x) = \sum_{k=1}^{k_{s}(x)} \frac{1}{n_{s}(k)} \sum_{i:\, x_{s}^{(i)} \in\, ]z_{k-1,s},\, z_{k,s}]} \left[ f\left(z_{k,s}, x_{-s}^{(i)}\right) - f\left(z_{k-1,s}, x_{-s}^{(i)}\right) \right].$$

The values x_s of the input variable of interest s are replaced with grid values z_s, where the grid values represent the edges of the intervals. The index of the interval that an input variable value x∈x_s falls in is denoted as k_s(x), while n_s(k) denotes the number of observations inside the kth interval of x_s. A single data point is denoted as $x^{(i)} = (x_{s}^{(i)}, x_{-s}^{(i)})$, where $x_{s}^{(i)}$ denotes the ith value for the selected input variable, and $x_{-s}^{(i)}$ is the vector of all the other features of a single data point that are kept constant. The ML predicting function is denoted as f.

The differences between the predictions $f(z_{k,s}, x_{-s}^{(i)}) - f(z_{k-1,s}, x_{-s}^{(i)})$ are the effect that the input variable s has for an individual data point for predicting the dependent variable (NEE in our case) when using the upper and lower values of a certain interval. The sum $\sum_{i:\, x_{s}^{(i)} \in\, ]z_{k-1,s},\, z_{k,s}]}$ adds up the effects of all instances within an interval, $x_{s}^{(i)} \in\, ]z_{k-1,s}, z_{k,s}]$. This is then divided by the number of observations in this interval, n_s(k), to obtain the average difference of the predictions of this interval. The sum $\sum_{k=1}^{k_{s}(x)}$ accumulates the average effects across all intervals, meaning that the uncentered ALE of an input variable of interest is accumulated over all its previous intervals. After that, the effect is centered, making the mean effect zero:

$$f_{s,\mathrm{ALE}}(x) = \overline{f}_{s,\mathrm{ALE}}(x) - \frac{1}{n} \sum_{i=1}^{n} \overline{f}_{s,\mathrm{ALE}}\left(x_{s}^{(i)}\right).$$

The value of ALE can be thought of as the main effect of the input variable at a certain value compared to the average prediction of the data. The ALE plot has the advantage that it generates valid interpretations even if the variables are correlated – an issue that persists in other methods that reduce a prediction function f to a function that depends on a single input variable such as partial dependence plots or marginal plots (Molnar, 2020). As with permutation feature importance, only the test data set was used to reduce the chance of inflating scores due to a model overfitting on the training data set.
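The approximate ALE computation for a single input variable can be sketched as follows. This is a minimal illustration of the two formulas above and not the “iml” implementation: the function name, the choice of empirical quantiles as grid edges, and the default of 10 intervals are assumptions, and `predict` stands for any trained model f that maps one row of inputs to a prediction.

```python
def ale_1d(predict, X, s, n_intervals=10):
    """Return (upper grid edge, centered ALE value) pairs for input variable s.

    X is a list of rows (lists of floats); s is the column index of interest.
    """
    n = len(X)
    values = sorted(row[s] for row in X)
    # Grid edges z_0 < z_1 < ... < z_K from empirical quantiles (duplicates dropped).
    edges = sorted({values[round(k * (n - 1) / n_intervals)] for k in range(n_intervals + 1)})
    if len(edges) < 2:
        raise ValueError("the input variable must take at least two distinct values")
    n_edges = len(edges) - 1
    diff_sums = [0.0] * n_edges
    counts = [0] * n_edges
    for row in X:
        # Find the interval ]z_k, z_{k+1}] containing the value (z_0 joins the first one).
        k = 0
        while k < n_edges - 1 and row[s] > edges[k + 1]:
            k += 1
        upper = row[:s] + [edges[k + 1]] + row[s + 1:]
        lower = row[:s] + [edges[k]] + row[s + 1:]
        diff_sums[k] += predict(upper) - predict(lower)  # local effect for this row
        counts[k] += 1
    # Accumulate the interval-averaged local effects (uncentered ALE).
    uncentered = []
    running = 0.0
    for k in range(n_edges):
        running += diff_sums[k] / counts[k]
        uncentered.append(running)
    # Center: subtract the mean uncentered effect over all observations.
    mean_effect = sum(u * c for u, c in zip(uncentered, counts)) / n
    return [(edges[k + 1], uncentered[k] - mean_effect) for k in range(n_edges)]
```

Because every grid edge is itself an observed value, each interval contains at least one observation, so the interval averages are always defined. For a model that is linear in the chosen variable, successive ALE values differ by the slope times the interval width, which gives a simple sanity check.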

3 Results and discussion

3.1 NEE modeling for Hyytiälä and Värriö data sets

In this section, we report the results obtained with different models from Set 1 in Table 3 (pre-thinned Hyytiälä and Värriö, whole year and peak growing season). First, we assess models' performance with routinely used accuracy metrics (R² and RMSE), visualize diurnal and annual NEE cycles, and then use XAI methods. In each subsection, we start the discussion with the peak-growing-season results and continue with the whole-year results.

Figure 1. R² coefficients for all the models and different setups from Set 1 (Table 3). In each of the four panels, the results for the training data set are shown on the right (labeled “Train”), and the results for the test data set are shown on the left (dotted bars, labeled “Test”). Different colors are used to distinguish between the ML models; see legend. “ALL” denotes the scores for the models trained on the whole-year data sets; “PEAK” denotes the scores for the models trained on the peak-growing-season data sets. The black error bars show the min and max, and the bars show the mean of the scores trained on different splits of the data.
