We use complex network theory to better represent and understand the ecosystem connectivity in a shelf sea environment. The baseline data used for the analysis are obtained from a state-of-the-art coupled marine physics–biogeochemistry model simulating the North West European Shelf (NWES). The complex network built on model outputs is used to identify the functional groups of variables behind the biogeochemistry dynamics, suggesting how to simplify our understanding of the complex web of interactions within the shelf sea ecosystem. We demonstrate that complex networks can also be used to understand spatial ecosystem connectivity, identifying both the (geographically varying) connectivity length-scales and the clusters of spatial locations that are connected. We show that the biogeochemical length-scales vary significantly between variables and are not directly transferable. We also find that the spatial pattern of length-scales is similar across each variable, as long as a specific scaling factor for each variable is taken into account. The clusters indicate geographical regions within which there is a large exchange of information within the ecosystem, while information exchange across the boundaries between these regions is limited. The results of this study describe how information is expected to propagate through the shelf sea ecosystem, and how it can be used in multiple future applications such as stochastic noise modelling, data assimilation, or machine learning.

Although shelf seas, understood as the seas covering parts of a continental shelf, are only 7 % of the global ocean, they are responsible for 20 % of the global biological productivity, contribute to 20 % of the ocean uptake of atmospheric carbon, and are the grounds for 80 % of global fish catches

Networks are a mathematical tool for modelling the key relationships/connections between objects/data. Typically, most networks generated from real-world data are complex networks with examples being found in biochemical systems, neural networks, social networks, the Internet, and the World Wide Web

In this work, we use complex networks (CNs) and associated statistical analyses, together with NWES as a test case, to investigate three relevant topics related to shelf sea biogeochemistry. (i) We used the network connectivity to estimate the spatial horizontal correlation length-scales of the biogeochemical variables. Typically, spatial-correlation functions are identified either through ensemble runs or diagnostic methods

The analysis from this study can provide additional information to biogeochemistry modellers for building simplified (yet realistic with respect to the objectives) and computationally cheaper models than ERSEM, capable of simulating a wide range of what-if scenarios. Simultaneously, it can identify the necessary model complexity to realistically simulate the NWES biogeochemistry. Finally, for the goal of developing efficient ML-based emulators for ERSEM or for some of its critical parameterizations or sub-components, this study paves the way for how to perform efficient feature selection (i.e. how to select the minimum number of input variables to achieve the desired accuracy).

The paper is organized as follows. We first give, in Sect. 2, details on the model used, explaining each component used to output the data we analysed, as well as the relevant configurations for each component. In Sect. 3 we discuss the methods used, starting with the pre-processing step used to remove the seasonal signal from the data analysed. We then detail the approach used to estimate the mean length-scale of each biogeochemical variable, as well as how we develop a series of spatial networks that help to efficiently capture the spatial variability of these length-scales. We then explain the clustering algorithm used on these networks that splits the shelf sea into a set of regions. The final part of the methodology moves away from the spatial analysis of the variables and gives details on how we developed a CN to compare the inter-variable interactions and clusters that form. Following this, we present and discuss our results in Sect. 4, with each subsection corresponding to a subsection of the methodology. We finish with concluding remarks, in Sect. 5, summarizing the key findings and discussing future work.

The ERSEM pelagic variables.

The Atlantic Margin Model (AMM7) domain used in this study. The figure also shows the ocean bathymetry of the North West European Shelf.

To obtain a complex picture of the shelf sea biogeochemistry, including the relationships between the variety of key biogeochemical variables (detailed in Table

NEMO: Nucleus for European Modelling of the Ocean

The NEMO ocean physics component (OPA) is a finite-difference, hydrostatic, primitive equation ocean general circulation model

ERSEM

Both the physical and biogeochemical models were forced by daily-varying river discharge data from

In order to extract non-trivial interactions and dynamics of the system, we removed the dominating seasonal signal. Typically, this is achieved by phase-averaging and standardizing the data to generate an anomaly time series (with respect to climatology) with zero mean and unit variance. However, with our high temporal resolution (daily) but just 3 years of data, this phase-averaging method can heavily skew a dataset with both high inter-annual and daily variability. As a result, we instead opted to use a high-pass filter that standardizes every time step of data according to its local temporal behaviour (a running average with a 10

First, for each day, we computed the local mean that bears the signature of the seasonality. This is done by averaging the values within a chosen time window centred on that day:

Using the output from both Eqs. (

The data consist of 50 ERSEM state variables (as well as temperature and salinity) on a 375

As hinted at above, our method is designed to mitigate (or possibly remove) the skewness caused by instances of high inter-annual variability (these would not be an issue for a dataset spanning a much longer period than 3 years). Furthermore, the proposed approach is still effective in removing the seasonal cycle from data, and it is more sensitive than phase-averaging to dynamics in both low- and high-activity periods of the time series. A key limitation is that the method is not suitable for comparing data points and times that are separated by an offset significantly larger than
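The running-window standardization described in this section can be sketched as follows. This is a minimal illustration rather than the code used in the study: the function name `running_standardize`, the `half_window` value, and the synthetic series are assumptions, and the treatment of the series edges (a truncated window) is one possible choice.

```python
import numpy as np

def running_standardize(x, half_window):
    """Standardize each time step by the mean and spread of a window centred on it."""
    x = np.asarray(x, dtype=float)
    n = x.size
    out = np.empty(n)
    for t in range(n):
        lo = max(0, t - half_window)          # window is truncated at the series edges
        hi = min(n, t + half_window + 1)
        window = x[lo:hi]
        mu = window.mean()                    # local mean carries the seasonal signal
        sigma = window.std()                  # local standard deviation
        out[t] = (x[t] - mu) / sigma if sigma > 0 else 0.0
    return out

# Synthetic 3-year daily series: a seasonal cycle plus noise (illustrative only).
rng = np.random.default_rng(0)
days = np.arange(3 * 365)
seasonal = 10.0 * np.sin(2 * np.pi * days / 365.0)
series = seasonal + rng.normal(size=days.size)

# half_window=15 is a hypothetical choice, not the value used in the study.
anomaly = running_standardize(series, half_window=15)
```

Because each value is compared only with its temporal neighbourhood, the resulting anomaly series retains little of the seasonal cycle, but values separated by much more than the window length are no longer directly comparable.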

To estimate the horizontal length-scale of each biogeochemical variable, we calculated the Spearman correlation between the time series at a reference point and all of its surrounding points simultaneously, within a selected radius. Restricting the calculation to this radius reduces the number of unnecessary computations, while the radius remains large enough to capture the length-scales. We intentionally chose to use the Spearman correlation in order to capture non-linear, monotonic relations that would otherwise have been masked by the Pearson correlation. Starting with a circle of small radius (7
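As an illustration of the first step only, the sketch below computes the Spearman correlation between a reference point and every grid point within a given radius in a single matrix product. The function names, the radius in grid cells, and the synthetic field are assumptions; the criterion used to turn these correlations into a length-scale is not reproduced here. Ranks are computed without averaging ties, which is adequate for continuous data.

```python
import numpy as np

def ranks(a, axis=0):
    """Ordinal ranks along an axis (ties are not averaged)."""
    return np.argsort(np.argsort(a, axis=axis), axis=axis).astype(float)

def spearman_to_reference(field, ref, radius):
    """Spearman correlation between the series at `ref` and every grid point
    within `radius` grid cells. `field` has shape (time, ny, nx)."""
    _, ny, nx = field.shape
    jj, ii = np.mgrid[0:ny, 0:nx]
    dist = np.hypot(jj - ref[0], ii - ref[1])
    mask = dist <= radius
    r = ranks(field[:, mask], axis=0)          # (time, n_points) rank series
    r -= r.mean(axis=0)
    r /= np.linalg.norm(r, axis=0)             # unit-norm, zero-mean ranks
    ref_idx = np.flatnonzero((jj[mask] == ref[0]) & (ii[mask] == ref[1]))[0]
    corr = r.T @ r[:, ref_idx]                 # all correlations at once
    return dist[mask], corr

# Synthetic field: a shared signal whose weight decays away from the centre.
rng = np.random.default_rng(1)
T, ny, nx = 400, 21, 21
signal = rng.normal(size=T)
gj, gi = np.mgrid[0:ny, 0:nx]
weight = np.exp(-np.hypot(gj - 10, gi - 10) / 4.0)
field = weight * signal[:, None, None] + 0.5 * rng.normal(size=(T, ny, nx))

dist, corr = spearman_to_reference(field, ref=(10, 10), radius=7)
```

Plotting `corr` against `dist` gives the decay of correlation with distance from which a length-scale can then be estimated.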

Horizontal length-scale estimate;

Figure

We averaged these length-scales over 300 sample points for each variable, at least 21

The method used to construct spatial networks from the biogeochemical model is largely inspired by similar applications to models of the climate system

As opposed to the biogeochemical length-scales computed in Sect.

Method for calculating the length-scale of a given node in a network representing the horizontal connectivity of a variable. Panel

The approach introduced here started by creating a spatial network for each variable, as described in Sect.

With the spatial variation captured for each variable, we sought to find any underlying structure that was shared between variables across the surface layer. Each of the spatial networks, i.e.

With the spatial networks, the

For a network with

Thanks to this normalization, we can apply a static threshold to each of the networks (as mentioned in Sect.

For

The method has several characteristics that make it preferable to other clustering methods that are applied to the dataset directly (i.e.

Nevertheless, a key challenge in SGC is selecting the appropriate number of clusters to use with the algorithm. A common solution to this problem is to use the “eigengap heuristic”
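A compact sketch of spectral clustering combined with the eigengap heuristic is given below. It assumes the symmetric normalized Laplacian and a small k-means step on the row-normalized eigenvectors, which is one common variant and may differ in detail from the algorithm used in the study. All function names, `k_max`, and the toy adjacency matrix are illustrative, and nodes with zero degree are not handled.

```python
import numpy as np

def eigengap_k(eigvals, k_max):
    """Number of clusters suggested by the largest gap among the smallest eigenvalues."""
    gaps = np.diff(eigvals[:k_max + 1])
    return int(np.argmax(gaps)) + 1

def kmeans(X, k, n_iter=50):
    """Small Lloyd's k-means with deterministic farthest-point initialisation."""
    centres = [X[0]]
    for _ in range(1, k):
        d = np.min([np.sum((X - c) ** 2, axis=1) for c in centres], axis=0)
        centres.append(X[np.argmax(d)])
    centres = np.array(centres)
    for _ in range(n_iter):
        labels = np.argmin(((X[:, None, :] - centres[None]) ** 2).sum(-1), axis=1)
        for c in range(k):
            if np.any(labels == c):
                centres[c] = X[labels == c].mean(axis=0)
    return labels

def spectral_clusters(W, k_max=10):
    """Cluster the nodes of a symmetric, weighted adjacency matrix W."""
    deg = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    L = np.eye(len(W)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    eigvals, eigvecs = np.linalg.eigh(L)                 # ascending eigenvalues
    k = eigengap_k(eigvals, k_max)
    U = eigvecs[:, :k]
    U = U / np.linalg.norm(U, axis=1, keepdims=True)     # row-normalise the embedding
    return kmeans(U, k), k

# Toy network: three groups of five nodes, strong links within, weak links between.
n, size = 15, 5
W = np.full((n, n), 0.05)
for b in range(3):
    W[b * size:(b + 1) * size, b * size:(b + 1) * size] = 0.9
np.fill_diagonal(W, 0.0)

labels, k = spectral_clusters(W)
```

For this toy network the first three Laplacian eigenvalues are small and the fourth is much larger, so the heuristic selects three clusters. On real data the gaps are usually less pronounced, which is why the choice of cluster number remains a challenge.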

We identified “robust regions” as connected areas of ocean that rarely, or never, contain the boundaries from the clustering of any individual state variable. For the spatial network of each variable, we identified every node that is geographically adjacent to another node with a different cluster label (as found from Eq.
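The boundary identification and its aggregation over variables can be illustrated with the sketch below. It assumes that the cluster labels of each variable are stored on a regular 2D grid and that geographic adjacency means the four nearest neighbours; land points and masked cells are ignored. The function names and the toy label maps are hypothetical.

```python
import numpy as np

def boundary_mask(labels):
    """True where a grid cell has a 4-neighbour with a different cluster label."""
    b = np.zeros(labels.shape, dtype=bool)
    diff_v = labels[1:, :] != labels[:-1, :]    # vertical neighbours differ
    diff_h = labels[:, 1:] != labels[:, :-1]    # horizontal neighbours differ
    b[1:, :] |= diff_v
    b[:-1, :] |= diff_v
    b[:, 1:] |= diff_h
    b[:, :-1] |= diff_h
    return b

def boundary_heatmap(label_maps):
    """Fraction of variables for which each cell lies on a cluster boundary."""
    return np.mean([boundary_mask(m) for m in label_maps], axis=0)

# Three toy "variables" on a 4 x 6 grid: two split east-west, one splits north-south.
east_west = np.zeros((4, 6), dtype=int)
east_west[:, 3:] = 1
north_south = np.zeros((4, 6), dtype=int)
north_south[2:, :] = 1

heat = boundary_heatmap([east_west, east_west, north_south])
```

Cells with a heatmap value close to zero rarely lie on a boundary for any variable, so connected areas of such cells are candidates for robust regions.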

The work in the previous sections focused on understanding how each variable behaves separately in horizontal space. In this section, we focused on developing an understanding of how the variables interact with each other co-spatially. This was achieved by assessing the interactions between the different biogeochemical state variables of ERSEM, computing the absolute value of Spearman's rank correlation coefficient between the time series of each variable at an ERSEM grid point. As before, we chose Spearman's correlation to capture any potential non-linear, monotonic links between variables. These correlations can be represented as a weighted adjacency matrix, where the rows and columns represent the variables, and each matrix entry represents the strength of a pairwise connection between variables at a grid point. As one might expect, the strength of these correlation coefficients will vary spatially. Therefore, in order to identify the most consistent and robust connections (and in a computationally efficient way), we calculated an adjacency matrix for 300 points randomly sampled across the shelf (bathymetry

We accounted for any processes that occur on a lagged or delayed timescale through cross-correlation – determining the degree to which one time series is correlated with another time series after shifting the latter series forward or backward in time. The correlation between any variable pair in the results is always shifted by an offset that maximizes the correlation between the two variables. It should be noted however that, as a result of the pre-processing step applied to the data (cf. Sect.
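The lag search can be sketched as below: for each candidate offset, one series is shifted against the other and the absolute Spearman correlation is evaluated, keeping the offset that maximizes it. The function names, the `max_lag` bound, and the synthetic series are assumptions for illustration; ranks are computed without tie averaging.

```python
import numpy as np

def spearman(x, y):
    """Spearman correlation via Pearson correlation of ordinal ranks."""
    rx = np.argsort(np.argsort(x)).astype(float)
    ry = np.argsort(np.argsort(y)).astype(float)
    return np.corrcoef(rx, ry)[0, 1]

def best_lagged_spearman(x, y, max_lag):
    """Return (lag, |rho|) maximising the absolute Spearman correlation
    between x[t] and y[t + lag] for lags in [-max_lag, max_lag]."""
    best = (0, -1.0)
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            a, b = x[:len(x) - lag], y[lag:]
        else:
            a, b = x[-lag:], y[:len(y) + lag]
        rho = abs(spearman(a, b))
        if rho > best[1]:
            best = (lag, rho)
    return best

# Synthetic pair: y follows x with a 5-step delay through a monotonic, non-linear map.
rng = np.random.default_rng(2)
x = rng.normal(size=500)
y = np.empty_like(x)
y[5:] = x[:-5] ** 3
y[:5] = rng.normal(size=5)

lag, rho = best_lagged_spearman(x, y, max_lag=10)
```

In this example the delayed, cubic dependence is recovered at a lag of 5 steps, whereas a Pearson correlation at zero lag would miss it.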

As this inter-variable analysis provided us with a weighted adjacency matrix, we were once again able to apply the SGC algorithm described in Sect.

Figure

Estimate of mean horizontal length-scales for each ERSEM variable on the shelf (as shown in Fig.

Spatial variation of the horizontal length-scales for the ERSEM variables, with correlation network connectivity used to approximate the scaling factor. The spatial variation is consistent between different variables, as shown by the co-spatial Pearson's correlation between each variable.

Not only is there a large variation in the mean horizontal length-scales between variables (found according to Sect.

Aggregated boundary heatmap generated from community detection (clustering) of the spatial network for each of the 50 ERSEM variables (cf. Table

Beyond the applications in DA, identifying horizontal length-scales is also relevant for the design of appropriate strategies for probabilistic prediction or model error compensation. For instance, when considering how to model stochastic noise across the spatial domain, we can clearly see that simply applying white noise across each grid point would be unrealistic, as there are significant spatial correlations to consider (e.g. when applying such noise to an initialization in order to introduce uncertainty in the initial conditions). As outlined, these spatial correlations will also vary in size, meaning that the correlated noise model should be scaled differently according to the target variable.
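To make the contrast with white noise concrete, the sketch below generates spatially correlated noise by smoothing white noise with a Gaussian kernel whose width plays the role of a variable-specific length-scale. This is one common construction and is not taken from the study: the function names, the periodic boundaries implied by the FFT, and the two length-scale values (in grid cells) are assumptions.

```python
import numpy as np

def correlated_noise(shape, length_scale, rng):
    """White noise smoothed by a Gaussian kernel (periodic boundaries via FFT),
    rescaled to unit variance. `length_scale` is in grid cells."""
    ny, nx = shape
    white = rng.normal(size=shape)
    ky = np.fft.fftfreq(ny)[:, None]
    kx = np.fft.fftfreq(nx)[None, :]
    # Fourier transform of a Gaussian kernel with standard deviation length_scale
    kernel_hat = np.exp(-2.0 * (np.pi * length_scale) ** 2 * (kx ** 2 + ky ** 2))
    smooth = np.fft.ifft2(np.fft.fft2(white) * kernel_hat).real
    return smooth / smooth.std()

def neighbour_corr(f):
    """Correlation between horizontally adjacent grid cells."""
    return np.corrcoef(f[:, :-1].ravel(), f[:, 1:].ravel())[0, 1]

rng = np.random.default_rng(3)
short = correlated_noise((64, 64), length_scale=1.0, rng=rng)   # short length-scale variable
long_ = correlated_noise((64, 64), length_scale=5.0, rng=rng)   # long length-scale variable
```

The two fields have the same variance but very different neighbour correlations, which is the sense in which a noise model would need to be scaled per variable.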

Of particular interest in Fig.

Figure

We used those robust boundaries to identify 13 regions representing areas of NWES connectivity. Results of this regionalization are represented in Fig.

The 13 key regions identified based on the regionalization found in Fig.

While we see that the features from Fig.

Grouping of ERSEM state variables calculated from 300 sample points on the NWES. Panel

Figure

If two variables display a high mean correlation and low coefficient of variation, it indicates that there is a reliable and consistent connection between them in the NWES model dataset. The pairwise elements of each group within a high-correlation “block” tend to show a low coefficient of variation in the corresponding plot, indicating that these variables can be grouped together both reliably and consistently.
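The two summary statistics can be illustrated as follows: an absolute Spearman correlation matrix is computed at each sample point, and the mean and coefficient of variation (standard deviation divided by mean) are then taken over the sample points. The function names, array layout, and synthetic variables are assumptions for illustration, and ranks are computed without tie averaging.

```python
import numpy as np

def abs_spearman_matrix(series):
    """|Spearman rho| between all variable pairs; `series` has shape (n_vars, time)."""
    r = np.argsort(np.argsort(series, axis=1), axis=1).astype(float)
    return np.abs(np.corrcoef(r))

def mean_and_cv(samples):
    """Mean and coefficient of variation of the adjacency matrices over sample points.
    `samples` has shape (n_points, n_vars, time)."""
    mats = np.array([abs_spearman_matrix(s) for s in samples])
    mean = mats.mean(axis=0)
    cv = mats.std(axis=0) / mean
    return mean, cv

# Toy data at 20 sample points: v1 depends monotonically on v0, v2 is independent.
rng = np.random.default_rng(4)
v0 = rng.normal(size=(20, 200))
v1 = np.exp(v0) + 0.1 * rng.normal(size=(20, 200))
v2 = rng.normal(size=(20, 200))
samples = np.stack([v0, v1, v2], axis=1)          # (n_points, n_vars, time)

mean, cv = mean_and_cv(samples)
```

Here the pair (v0, v1) shows a high mean correlation with a low coefficient of variation, while the pair (v0, v2) shows a low mean correlation that fluctuates strongly between sample points.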

Although some of these links among variables could have been anticipated to some degree, the quantitative grouping demonstrates the opportunity, and provides the metric, for researchers aiming to either reduce the complexity of the ERSEM ecosystem model or build more simplified (but realistic with respect to the objectives) models than ERSEM. For example, Fig.

Finally, we would like to caution against over-interpreting the Spearman cross-correlation matrix from Fig.

A network derived from the correlation measures found in Fig.

Figure

The cyan cluster consists of temperature, dissolved inorganic carbon (DIC), and semi-labile organic matter, with the dissolved inorganic carbon being weakly connected to the higher-trophic-level–DOM cluster. Connections between temperature and gases, such as

Marine biogeochemistry is complex to simulate, and models represent a plethora of processes in an often computationally costly manner. As a result, such models are not well suited for addressing specific questions that necessitate extensive and long-lasting ensemble simulations, such as ecosystem response to climate change and anthropogenic stresses across a broad range of scenarios, or analyses aimed at informing policy-making decisions. Here, we aimed to use a complex network analysis to gain insight into connections found across the ecosystem while providing an understanding that will aid in the simplification of its complex interactions and dynamics.

With future observation missions that will provide new biogeochemical variables for assimilation, there is a need to further understand how transferable the spatial horizontal correlation length-scales are between the different biogeochemical variables. Using the correlation analysis and the resulting spatial networks, we can conclude that the biogeochemical horizontal correlation length-scales at the ocean surface vary significantly between variables and are not directly transferable. However, we have provided an approximation for the horizontal correlation length-scales of all variables across the whole NWES spatial domain. The spatial horizontal correlation length-scales are derived for the ocean surface but are expected to be relevant within the ocean mixed layer. The spatial length-scale distributions are similar (highly correlated) across the variables and form realistic spatial features, enhancing the confidence in those results. With this clear indication of structure embedded into the horizontal connectivity of the ecosystem, we sought to split the shelf sea into geographic regions using clustering network algorithms. This clustering process was applied to each variable independently, yet it identified a set of clear and consistent boundaries that represent areas of extremely low connectivity across which information is not shared. This resulted in 13 key regions, suggesting that each functions as a quasi-separate system but with unified biogeochemical/ecosystem characteristics within its boundaries. This also identified the Celtic Sea and the north-west section of the NWES as areas of high exchange between the shelf sea and open ocean. Finally, we demonstrated that the complex network carries important information on how the ecosystem variables cluster into natural functional groups. Our analysis demonstrated that the chemical components (nitrogen, carbon, silicon, etc.) of each pelagic variable (e.g. diatoms, nanophytoplankton, microzooplankton) are closely linked, and a simpler version of the model can be built by reducing these variables through parameterization. We also see that the pelagic variables form even larger functional groups (e.g. POM, phytoplankton, HTL/DOM), composed of variables that can be effectively parameterized through monotonic functions of each other.

These findings show that complex networks can be used as an effective tool in simplifying the complexity of the ecosystem dynamics, providing simplifications to the system extracted from the behaviour of the model itself. These simplifications will be applied in future work, e.g. building an ML-based reduced-order emulator to improve data assimilation on the NWES.

Data are available on MASS and obtainable on request. MASS is the Met Office Managed Archive Storage System and is accessed using the user interface known as MOOSE; access is now possible from both the MONSooN and JASMIN systems at the Centre for Environmental Data Analysis (CEDA,

The supplement related to this article is available online at:

IH wrote and executed all code. JS provided the model and data. All authors contributed to analysing and interpreting the results, proof-reading the manuscript, and adjusting the text.

At least one of the (co-)authors is a member of the editorial board of

Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims made in the text, published maps, institutional affiliations, or any other geographical representation in this paper. While Copernicus Publications makes every effort to include appropriate place names, the final responsibility lies with the authors.

We would like to thank Daniele Marinazzo for the insightful discussion at the early stage of this work.

Ieuan Higgs was supported by the Natural Environment Research Council via the National Centre for Earth Observation (contract no. PR140015) and the University of Reading (NCEO/Reading 26). Jozef Skákala and Stefano Ciavatta were supported by the projects SEAMLESS (funded by the European Union's Horizon 2020 research and innovation programme under grant agreement no. 101004032) and NECCTON (funded by the Horizon Europe research and innovation action under grant agreement no. 101081273). Jozef Skákala also received additional funding from NCEO. Alberto Carrassi was supported by the project SASIP (grant no. 353), funded by Schmidt Futures – a philanthropic initiative that seeks to improve societal outcomes through the development of emerging science and technologies. Ross Bannister was supported by NCEO (contract no. PR140015).

This paper was edited by Marilaure Grégoire and reviewed by Damien Couespel and one anonymous referee.