A novel method for diagnosing seasonal to inter-annual surface ocean carbon dynamics from bottle data using neural networks

. The ocean’s role in modulating the observed 1–7 Pg C yr − 1 inter-annual variability in atmospheric CO 2 growth rate is an important, but poorly constrained process due to current spatio-temporal limitations in ocean carbon measurements. Here, we investigate and develop a non-linear empirical approach to predict inorganic CO 2 concentrations (total carbon dioxide ( C T ) and total alkalinity ( A T )) in the global ocean mixed layer from hydrographic properties (tem-perature, salinity, dissolved oxygen and nutrients). The beneﬁt of this approach is that once the empirical relationship is established, it can be applied to hydrographic datasets that have better spatio-temporal coverage, and therefore provide an additional constraint to diagnose ocean carbon dynamics globally. Previous empirical approaches have employed multiple linear regressions (MLR) and relied on ad hoc geographic and temporal partitioning of carbon data to constrain complex global carbon dynamics in the mixed layer. Synthe-sizing a new global C T / A T carbon bottle dataset consisting of ∼ 33 000 measurements in the open ocean mixed layer, we develop a neural network based approach to better constrain the non-linear carbon system. The approach classiﬁes features in the global biogeochemical dataset based on their similarity and homogeneity in a self-organizing map (SOM; Kohonen, 1988). After the initial SOM analysis, which in-cludes geographic constraints, we apply a local linear optimizer to the neural network, which considerably enhances the predictive skill of the new approach. We call this new approach SOMLO, or self-organizing multiple linear output. Using independent bottle carbon data, we compare a traditional MLR analysis to our SOMLO approach to capture the spatial C T and A T distributions. We ﬁnd the SOMLO approach improves predictive skill globally by 19 % for C T , with a global capacity to predict C T to within 10.9 µmol kg − 1 (9.2 µmol kg − 1 for A T ). The non-linear SOMLO approach is particularly powerful in complex but important regions like the Southern Ocean, North Atlantic and equatorial Paciﬁc, where residual standard errors were reduced between 25 and 40 % over traditional linear methods. We further test the SOMLO technique using the Bermuda Atlantic time series (BATS) and Hawaiian ocean time series (HOT) datasets, where hydrographic data was capable of explaining 90 % of the seasonal cycle and inter-annual variability at those multidecadal time-series stations.


Introduction
The ocean's role in modulating rising atmospheric carbon dioxide (CO 2 ) levels has been found to be very important (Khatiwala et al., 2012;Sabine et al., 2004). A variety of data-based estimates suggest net oceanic uptake for CO 2 to be 2.1 ± 1.0 Pg C yr −1 (1 Pg = 10 15 g) since the year 2000, or about 25-30 % of anthropogenic CO 2 emissions over that period (Jacobson et al., 2007;Khatiwala et al., 2009;Manning and Keeling, 2006;McNeil et al., 2003;Mikaloff-Fletcher et al., 2006;Takahashi et al., 2009). Between 1990 and 2009, atmospheric CO 2 accumulation rates vary between 1 and 7 Pg C yr −1 , indicating large inter-annual variability from both the terrestrial and oceanic reservoirs (Sarmiento et al., 2010). Although our long-term, decadal-scale understanding of oceanic CO 2 uptake has advanced, our shorter-term understanding (seasonal to inter-annual) of ocean carbon dynamics remains poorly constrained due to current data limitations.
Atmospheric CO 2 observations, inversion techniques and ocean models suggest a large range for inter-annual variability in oceanic CO 2 uptake (0.1-1.5 Pg C yr −1 ) (Bender et al., 2005;Le Quéré et al., 2003;Patra et al., 2006;Rayner et al., 2008). However, from an oceanic perspective, our understanding of natural variability of ocean carbon has come about sporadically, dominated by regional time-series measurement programs (e.g. Bermuda Atlantic time series (BATS) and Hawaiian ocean time series (HOT)). Without a better understanding of shorter-scale natural variability, the ability to constrain and understand the time-evolving capacity for the ocean to absorb atmospheric CO 2 in a high-CO 2 world will be limited, particularly since some evidence suggests the ability for the ocean to absorb CO 2 has slowed since the late 1980s as a consequence of decadal-scale trends in winds and oceanic circulation (Le Quéré et al., 2010;Sarmiento et al., 2010).
Standard hydrographic measurements in the ocean (temperature, salinity, dissolved oxygen and nutrients) are sampled and analysed much more frequently than inorganic carbon. With the deployment of satellites, gliders and ARGO floats providing an immense capacity for capturing shortterm seasonal to inter-annual variability in the oceans, the question is, can this new information be used to help infer and diagnose short-term carbon dynamics in the ocean?
The oceans inorganic carbon system can be fully constrained by knowing any two measurements within its inorganic carbon constituents; partial pressure of CO 2 (pCO 2 ), total dissolved carbon dioxide (C T ), total alkalinity (A T ) or pH. Significant time and resources have been devoted on national and international levels to survey the global oceanic C T and A T distribution. However, even with approximately 330 000 bottle measurements taken sporadically over the past 30 years, our ability to globally understand natural seasonal C T and A T dynamics has been hindered due to the large spatio-temporal limitations in this current accumulated dataset .
Autonomous pCO 2 measuring devices mounted mainly onto commercial shipping vessels has resulted in a global network of approximately 6.4 million ocean surface pCO 2 measurements (Takahashi et al., 2012). This pCO 2 dataset has given us the best idea of seasonal (Takahashi et al., 2009; herein after referred to as T-09) to inter-annual (McKinley et al., 2011;Park et al., 2010;Telszewski et al., 2009) CO 2 variability within the ocean. However, the global pCO 2 dataset cannot inform us on some very important processes and biogeochemical dynamics that modulate atmospheric CO 2 . The ocean's biological carbon export flux has been estimated to be between 11 and 16 Pg C yr −1 from satellite chlorophyll a measurements (Falkowski et al., 2000), some 5-8 times the net oceanic CO 2 absorption from the atmosphere. Small changes in the biological carbon flux have large and important implications for atmospheric CO 2 . However, this large signal is yet to be constrained from inorganic carbon data itself, since it requires constraints on mixed-layer car-bon dynamics rather than just sea-surface constraints like the pCO 2 climatology. Secondly, without equivalent A T or C T measurements, pCO 2 by itself cannot provide insights into partitioning the biological carbon pump into both organic and calcification components, particularly important with regard to future ocean acidification impacts. Previous estimates on this "rain ratio" (organic/calcifier export flux) have needed to assume a constant redfield ratio on nutrient changes in the oceans mixed layer (Sarmiento et al., 2002). Finally, spatiotemporal deficiencies in the pCO 2 dataset in regions like the Southern Ocean introduce uncertainties in the direct evaluation of short-term variability. To understand seasonal to inter-annual variability in these regions requires methods that have better spatio-temporal coverage than is constrained by historical pCO 2 sampling. Here, we seek to diagnose seasonal to inter-annual C T and A T concentrations in the mixed layer that provide independent, but important additional constraints to the global sea-surface pCO 2 climatology.
To varying degrees, concentrations of C T and A T are influenced by the solubility of CO 2 , biological processes, vertical and lateral water transport and direct CO 2 exchange with the atmosphere . Ocean mixing is largely controlled by density dynamics via temperature (T ) and salinity (S) variations in the ocean, which also regulate the solubility of CO 2 (Weiss, 1974). Information on nitrate (N), silicate (Si), phosphate (P) and dissolved oxygen (DO) variations provide insight into the biological influences on oceanic inorganic carbon (Anderson and Sarmiento, 1994). From this, it should be implicit that we can derive empirical relationships between these standard hydrographical parameters and the carbon constituents. If a robust empirical relationship is established, we could apply our model to the order of magnitude more in situ measurements of these standard hydrographic parameters  or the objectively analysed 1 • × 1 • climatologies (e.g. Locarnini et al., 2010) to give us new constraints on seasonal to inter-annual carbon dynamics in the mixed layer.
The use of the global sea-surface pCO 2 dataset would be ideal to develop such empirical algorithms. However, these continuous pCO 2 measurements generally have no coinciding biogeochemical information (i.e. DO or nutrients) that could help establish an empirical relationship. Some have used satellite chlorophyll a measurements to help constrain ocean surface pCO 2 with varying degrees of success (Chen et al., 2011;Chierici et al., 2009;Telszewski et al., 2009). The benefits of using ship-based bottle measurements of C T and A T , is that they are almost always complemented by a suite of hydrographic and biogeochemical parameters (T , S, DO and nutrients) that can be used to help derive empirical relationships. Wallace (1995) verified a multiple linear regression (MLR) concept by successfully capturing C T using T , S, Si and apparent oxygen utilization (AOU) in the North Atlantic. Several studies have since investigated this MLR approach in capturing the surface distribution of C T and A T (see Table 1). Divergent biological and mixing regimes throughout the ocean have made it difficult to use linear empirical techniques on a global scale. Researchers have traditionally partitioned the global bottle dataset geographically, hydrographically and temporally in an attempt to improve the ability of linear approaches to model the non-linear relationship between inorganic carbon and the standard hydrographic parameters. Here we use a non-linear empirical modelling approach to avoid this ad hoc partitioning and show that it delivers considerable improvements in predictability. We use a self-organizing map (SOM; Kohonen, 1988) to classify or cluster measurements of hydrographic parameters into groups and then establish the relationship between these parameters and C T /A T separately for each group. SOMs have already been found to be well suited in extracting features of the ocean surface pCO 2 dataset in the North Atlantic using a combination of modelled and remotely sensed parameters to constrain the system, (Friedrich and Oschlies, 2009a, b;Lefèvre et al., 2005;Telszewski et al., 2009).
To contextualize this work, we firstly explore the use of the traditional MLR approach to diagnose global seasonal carbon dynamics in the ocean. To do this, we employ the MLR approach on a newly synthesized C T /A T bottle dataset of ∼ 33 000 mixed-layer samples. Next, we present our SOMbased approach to diagnose seasonal carbon dynamics on a global scale, which better accounts for non-linearities that would limit the ability of the MLR approach. To compare the MLR and our SOM approach, we develop an independent test approach to assess the model's skill. We then use the BATS and HOT in situ time series as an explicit test for our new approach and finally show the capacity of the model to capture coherent, spatial and temporal carbon fields over the global ocean.

Global carbon measurements and training dataset
The extraordinary effort to collate and synthesize the bottle hydrographic and biogeochemical data has been conducted by several groups; including GLODAP (GLobal Ocean Data Analysis Project; , CARINA (CARbon dioxide IN the Atlantic Ocean; CARINA Group, 2009aGroup, , b, 2010 and PACIFICA (PACIFic Ocean Interior CArbon project; Suzuki et al., 2013).
Precision in measuring bottle C T and A T samples has consistently improved over the past 30 yr as a result of advances in techniques and apparatus (Bradshaw et al., 1981;Johnson et al., 1987). However, it was not until the introduction of standard operating procedures and certified reference materials (Department of Energy, 1994;Dickson et al., 2003;Dickson et al., 2007) that the quality consistency of independent laboratory measurements was achieved and is currently estimated to be ±2 µmol kg −1 (Dickson et al., 2007). To account for any systematic measurement biases between independent laboratories when combining data, a secondary quality control (QC) method was incorporated by the project groups to identify and smooth out any inconsistencies, as outlined in Tanhua et al. (2010). The internal consistency of the CARINA C T /A T dataset has been estimated to ±2.5 µmol kg −1 (Tanhua et al., 2010). More recent additional measurements we included in the global dataset underwent a 1 st QC check to remove measurements that were flagged as bad or questionable under the World Ocean Circulation Experiment (WOCE) convention (Joyce and Corry, 1994).
For this work, 470 cruises from GLODAP, PACIFICA, CARINA, CLIVAR and miscellaneous sources were merged with the BATS and HOT measurements to form the global carbon training dataset, as shown in Table 2. We refined the global data to be within the mixed layer (Supplement A),  non-coastal (Supplement B) and data post-1980 due to large uncertainties in early measuring techniques. The final number of usable C T /A T discrete measurements in the global mixed layer was ∼ 33 000. Whilst the spatial coverage of the refined data is consistent over all major ocean basins (Fig. 1a), there are approximately 45 % less wintertime measurements than were collected during summertime (Fig. 1b), which we examine here as a potential cause for bias when applying our approach.

Normalization of C T measurements
Global atmospheric CO 2 concentrations during the 1980s, 1990s and 2000s have increased at 1.60 ± 0.56, 1.47 ± 0.66 and 1.90 ± 0.38 ppm yr −1 , respectively (Thomas Conway and Pieter Tans, NOAA/ESRL, www.esrl.noaa.gov/gmd/ ccgg/trends). Mixed-layer measurements of C T were corrected for temporal anthropogenic CO 2 uptake to the reference year 2000 by calculating the change in mixed-layer C T in equilibrium with the atmospheric CO 2 increase using observed Revelle factors (see supplement material C for details). This approach is somewhat equivalent to that of T-09 where all pCO 2 measurements values were corrected to the year 2000 using a rate of 1.5 µatm yr −1 .
There are regions of the ocean where upwelling and seaice inhibit air-sea gas exchange, resulting in considerable CO 2 disequilibrium (e.g. Southern Ocean, equatorial Pacific). The anthropogenic CO 2 correction technique used here, like those for T-09 and Lee et al. (2000), will be biased in these regions. However, by performing a test using no anthropogenic CO 2 correction (Supplement D), we demonstrate the very low impact this anthropogenic correction has to our final result. This is in part due to the large natural fingerprint of C T (±50 µmol kg −1 ) relative to the small changes (∼ 1 µmol kg −1 yr −1 ) resulting from anthropogenic CO 2 uptake.

Testing algorithm skill: a systematic independent test (SIT) approach
Most empirical studies report statistical errors calculated as the residual standard error (RSE) from linear regressions. For example, C T in the Indian Ocean was reported to be predicted to within ±5 µmol kg −1 using a suite of hydrographic parameters (Bates et al., 2006), ±8 µmol kg −1 for the Southern Ocean (McNeil et al., 2007) and ±7 µmol kg −1 for a global dataset (Lee et al., 2000). However, an independent dataset not used in the regressions is needed to accurately report true statistical uncertainty for any empirical approach.
Here, we developed a "systematic independent test" (SIT) approach in order to compare the MLR and NN empirical approaches consistently. The SIT method evaluates the algorithm's skill through an independent test of each cruise or time series without using it in the training or regression dataset. This implies that for a training data pool consisting of n cruises and i time series, n + i unique algorithms with identical model configurations are used to predict the excluded cruise or time series measurements. Calculating the residual standard error (RSE; Eq. 1) using all (or a subset) of the cruises and time-series independent predictions then provides a better and accurate estimate of the algorithms global (or regional) skill. In Eq. (1), the independent predictions and in situ measurements are represented by y indp−pred and y in−situ respectively, while N defines the number of discrete samples.
The reason we independently test each cruise dataset individually, rather than a randomly selected subset of the data, is due to similar concentrations of carbon and auxiliary measurements within local casts of the same cruise. As there are typically two to three measurements within each cast of the training dataset, the independent prediction of one of these measurements will give a misleading representation of the model's true skill, as the remaining two measurements with a very similar "biogeochemical fingerprint" will be used to train the algorithm. The prediction of an entire independent cruise is a more robust measure of the algorithm's skill.

Method description
Multiple linear regression is a numerical estimation of the linear relationship between a set of predictor variables, x = (x 1 , . . . , x n , . . . , x N ), and response variable, y, (Eq. 2).
where β 0 and β n represent the intercept and empirically derived coefficients respectively. Multi-collinearity (MCL) between predictor variables and non-normality of the residual errors are both issues that may affect the predictive and diagnostic ability of a MLR. To minimize the effect of these issues, the empirical relationships between C T /A T and the standard hydrographic parameters were constrained using a forward stepwise robust MLR routine. Following the schematic in Fig. 2, the routine initiates by ranking predictor variables p 1 , . . . , p n , . . . , p N according to their degree of linear correlation to the response variable, y; where p n,1 represents the parameter with the highest correla-tion. The primary model (M 1 ) is then established by applying a least-squares MLR between the top ranked predictor variable (p n,1 ) and y to constrain the regression coefficients β 0 and β n,1 . The routine then expands on M 1 in step 3 by regressing the top two ranked predictor variables (m = 2); where m represents the modelled predictor variable with the lowest correlation to y.
To determine if MCL exists in the expanded model (M m ), we calculate the variance inflation factor (VIF) for each modelled variable in M m and compare them to VIF values calculated for the same variables modelled in M m−1 . The existence of MCL is identified if the VIF value for any predictor variable p n,i (where i < m) increased by 5. For the scenario when MCL is detected, the model is updated with interaction terms between the newly added predictor variable (p n,m ) and any modelled variable with a VIF increase greater than 5. An Step 2) Primary least-squares MLR model (m = 1): Step 3) Expand on model (m = m +1): Step 4) MCL test based on VIF values:

Not detected
Step 5) Statistical tests

M M
Step 6) Include higher order terms Step 7) Prune MI based on t-test Step 8) Apply robust MLR Step 1) Rank predictor variables according to analysis of variance (ANOVA) between the previous model (M m−1 ) and expanded model (M * m ) is then applied to evaluate the significance of the newly added predictor variable and interaction terms. If the expanded model is found to statistically constrain the system with a higher degree of skill with a 95% confidence interval, the updates are accepted and the routine returns to step 3 to incorporate the next lowest ranked predictor variable (i.e. m = m + 1).
If MCL is not detected, a null-hypothesis test based on the t statistic is applied to determine if the coefficient of the new predictor variable is significantly different from 0 (i.e. the new predictor is important in constraining the system). If it does not differ from 0 with a 95 % confidence interval, the new predictor variable is defined as insignificant and is subsequently rejected from the model. The routine then returns to step 3 to again expand M m with the next lowest ranked predictor variable.
Once each predictor variable has had an opportunity to update the model (i.e. m = I ), any desired higher order variable terms are incorporated into the model on the provision the first order term was found to be statistically significant. The routine then prunes the model through an iterative process that removes insignificant terms based on the t test. Once all terms are statistically significant, the final stage of the routine applies a robust MLR to the set of significant terms to reduce potential influences from outliers.
This MLR routine is well suited for optimizing the model and dampening the influence of outliers that cannot be reasonably identified as bad measurements. This aspect is particularly important when the global dataset is subject to ad hoc geographical and/or temporal separation methods, where measurements not consistent with the bulk biogeochemical dynamics within a region have the potential to affect the model.

Ad hoc vs. universal MLR
To investigate the application of the traditional MLR method, we compared the skill of using one single regression globally (universal MLR) to an ad hoc approach that partitions the dataset into regions (ad hoc MLR). We based the ad hoc approach on dividing the global carbon dataset on the geographical and temporal guidelines outlined by Lee et al. (2006Lee et al. ( , 2000 and Bates et al. (2006). In this way, the global dataset was subset into 5 geographic regions to constrain the A T system, and 11 geographic regions, 8 of which were subjected to further separation into summer and winter months to constrain C T (see Fig. 3). The universal method simply uses the entire global dataset without division.

MLR results
When universally applying the traditional MLR on the ∼ 33 000 global mixed-layer C T measurements, the statistical regression RSE is 15.1 µmol kg −1 when using T , S, DO, P, N and Si as predictors (Table 3). If applying the ad hoc geographical and temporal separations, the statistical regression RSE reduces to 13.2 µmol kg −1 . However, when the independent test (SIT) is used to evaluate the regressions, errors increase to be 16 µmol kg −1 for the ad hoc approach and 15.6 µmol kg −1 for the global regression. For A T , optimal predictors were found to be T , S, S 2 , DO, P and Si, while a global MLR algorithm captured the signal to within 11 µmol kg −1 using the SIT approach. All empirical relationships for the global and ad hoc MLR models can be found in Supplement Tables T1 and T2. The MLR approach and results give us a framework to attempt to develop a better method that captures any potential non-linear biases that are contributing to errors of ±16 µmol kg −1 in C T predictions and ± 11µmol kg −1 for A T on a global scale.

Overview of the neural network approach
A self-organizing map (SOM) is an algorithm that uses an iterative approach to classify multi-dimensional data into discrete groups, or neurons, usually arranged in a 2-dimensional  grid. Using an algorithm that employs discrete clustering is appealing, as it removes the need for the type of ad hoc partitioning we discussed in Sect. 4.2. This has led to application of SOMs in a wide range of disciplines (Abramowitz, 2005;Hsu et al., 2002;Pöllä et al., 2009). Figure 4 illustrates the routine of SOM training and prediction. For a training dataset of P samples consisting of predictor variables x and response variable y, the SOM clustering process allocates each sample to one of J neurons (sometimes also called clusters, nodes or groups). The neurons are typically arranged in a 2 dimensional A × B matrix so that we represent a node as j a,b . The clustering algorithm aims to ensure that nodes that are nearby in this matrix contain samples that have similar values of the predictor variables x.
The y = f (x) input-output mapping is then completed by performing a linear regression between x and y separately for each neuron.
These SOM and regression parameters can then be used to make predictions of y for an independent set of Q predictor samples (x 1 , . . . , x q , . . . , x Q ). First, each x q is allocated to a SOM neuron, based on its similarity to the SOM weights from the training dataset. This is the "winning neuron" for a particular sample j (x q ). Then the regression parameters for j (x q ) are used to predict y q .
Here we explore two variants to this approach. The first, as described above, uses a multiple linear regression at each neuron, which we describe here as self-organizing multiple linear output (SOMLO). The second takes the mean of all response values belonging to a node, which we will call selforganizing map mean (SOMM). We now describe both in more detail.

Initialization of the model constraints
For our implementation, the input-output pairs (x p , y p ), 1 ≤ p ≤ P in the training dataset are some subset of x = (T , S, DO, N, Si, P), and y = C T or A T . To ensure each predictor variable has an equal opportunity to define the features of the SOM during the training routine, we zero-mean and scale the variables by their standard deviation so that their distribution and range are similar. For nitrate, phosphate and silicate, due to the exponential distribution of their measurements, we first log 10 scale their measurements.
The J -neuron SOM we use here is structured in a hexagonal topology for the current study (Fig. 4). Careful consideration needs to be exercised when defining the size of the SOM, as too few neurons will not capture all important fea-tures, while too many will over-fit the training dataset. Each neuron (j a,b ) is then assigned an initial weighting vector (ω) of length equal to the number of input variables, and whose values are randomly selected from the input variable range.

SOM training routine
Once all the neuron weights have been initialized, training is an iterative process designed to cluster the P samples into J neurons. For each iteration step of the model (τ ), the input data samples are individually presented to the SOM in a random order and the neuron whose weights are closest to the current input sample is declared the "winning neuron" for that sample, using That is, the "winning neuron", j (x p ), for sample x p is simply the neuron that minimizes this distance. Once the winning neuron is established, the weights of the winning neuron, as well as those neurons in its topological neighbourhood in the SOM, are then adjusted towards the value of the current sample value (x p ) via In this expression, h j,j (x p ) determines the extent to which a node's weight is brought closer to the current sample value (termed a "learning rate", h ≤ 1). It also determines the size of the neighbourhood around the winning node that receives a significant adjustment. We use where d j,j (x p ) represents the discrete distance in the SOM topology between the winning neuron j (x p ) and an arbitrary neuron j , and σ 2 (τ ) and η(τ ) are the neighbourhood width and learning rate respectively. As the model progresses through iterations, σ 2 (τ ) ensures that the neighbourhood width shrinks from a value that significantly adjusts most of the neurons to finish with only adjusting the winning neuron. Similarly, the learning rate η(τ ) decreases with iterations, so that regional features of the SOM gradually develop as iterations continue.
The form of the model used here is known as a supervised SOM, whereby distributional information of the response parameter (C T or A T ) is used as an additional constraint beyond the hydrographic information (T , S, DO, etc.) in clustering the global dataset into the set of J neurons. For more detail see Supplement E.

Completing the input-output mapping
We complete the y = f (x) in one of two ways. First, the mean of all output values y p belonging to a node is used -the SOMM. Alternatively, we use MLRs with the training data assigned to the winning neuron to establish this relationship (see Fig. 4). Here we use MLRs after the SOM training through the application of either a principal component regression (PCR; see supplement F for details) or our forward stepwise robust MLR routine (see Sect. 4.1). To ensure confidence in regression coefficients, a minimum threshold value of 10 times the number of predictor parameters was implemented. If the number of data points assigned to the winning neuron is below this threshold value, data from the second most similar neuron is merged with the winner, and then third, until the data pool reaches the threshold limit.

Predicting with the SOMLO / SOMM system
For any independent input data vector (x q ), we can predict the output value (y q ) using the SOM trained above via a twostep process. First, determine which neuron in the SOM each new data sample is closest to using the distance measure in Sect. 5.3 (Eq. 3). Then the output value (of C T or A T ) is determined using either the mean value of the winning neuron's training output values (using the SOMM) or the regression parameters established with training data.
6 Application to the global ocean

Optimization of the global model
To converge on the optimal SOMLO approach for the ocean carbon mixed-layer dataset, we employed a two-phase process. Firstly, three unique subsets of ocean carbon data were extracted to ascertain which hydrographic parameter combination worked best. In the second phase we applied the SIT approach to make an out-of-sample assessment of the global skill of the model.

Defining optimal predictor parameters
Correlations between hydrographic parameters may lead to redundancy in the information predictor variables provide.
To investigate the importance of each variable in informing the SOM or constraining the MLR, we perform tests that exclude the variables one at a time (Fig. 6). These test the ability of the models to capture three unique independent datasets that each represent about 10 % of the global carbon dataset (Table 4). As an example, Fig. 5 presents the spatial distribution of the T1 independent dataset, constituting 11.4 % of the global training dataset.
To explore the optimal SOM configuration, 800 iteration steps were used to train the SOM, using neuron map sizes ranging from 9 to 529 for every different input variable combination, with the ultimate aim to converge on the model with the lowest RSE.
Salinity was found to be the most important parameter for capturing the mixed-layer carbon signal, followed by temper- ature then nutrients (Fig. 6). The final optimal parameter set and SOM neuron size using the three independent tests were (SOPSi, 25) and (TSPO, 56) for the global A T and C T models respectively (Fig. 7). For C T , the SOMLO model using PCR constrained the system with a higher skill than the robust MLR, whilst A T was better constrained using the robust MLR model. The addition of phosphate beyond temperature, salinity and dissolved oxygen improved the prediction of C T by ∼ 27% or 5.1 µmol kg −1 (Fig. 7). Without air-sea gas exchange modulating its behaviour, phosphate likely provides clearer constraints on organic matter production and respiration than dissolved oxygen alone. The redundancy of nitrate for both C T and A T (Fig. 7) is likely due to the near constant stoichiometric uptake rate of phosphate and nitrate by photosynthesizing organisms. The preference of phosphate over nitrate may be a result of the continual production of organic matter by nitrogen fixers after the nitrate pool is completely depleted (Gruber and Sarmiento, 1997). Furthermore, the renaming of samples where only "nitrate + nitrite" was listed to nitrate in the GLODAP and CARINA products ) may serve to introduce additional biases in using nitrate.
Precipitation and dissolution of calcium carbonates (CaCO 3 ) affects the concentration of A T twice as much as C T . As waters high in silicate tend to relate to high biological respiration by diatoms (a non-calcifying organism), and waters of low silicate foster a more conducive environment for calcifying organisms (such as coccolithophores) (Kirchman, 2012), silicate helps constrain the spatial patterns of CaCO 3 cycling which influence A T .
Salinity's significant importance in constraining the A T system is likely due to the known high correlation between these two parameters (Millero et al., 1998), whereas the addition of temperature to the parameter set is redundant, as pointed out by some earlier studies (e.g. McNeil et al., 2007).

Importance of geography in the model
Carbon data from geographically diverse ocean regions will be clustered into the same neuron when input-output concentrations are similar. For example, a cluster of similar biogeochemical data in the North Atlantic Ocean can be equally  represented by those in some parts of the North Pacific Ocean, despite there being little ocean inter-connectedness between these two carbon datasets on shorter timescales. Spatial length scales of variability are known to be within ocean basins, not between them, especially those constrained by land. Without applying geographical boundary conditions, non-linearities may be introduced into the final MLR, which would limit the models predictive skill. To test this hypothesis, optimal model configurations were trained with the inclusion of geographical input parameters during the training of the SOM, but were excluded as predictor parameters in the linear regressions.
To reduce the influence of longitudinal discontinuity at ±180 • in the mid-Pacific, we shifted all longitude values by 160 • W (or 20 • E), thereby setting the 180 • discontinu- ity at a position that bisects continental Africa and Europe (see Supplement Fig. F1). We also tested a normal vector to the Earth ellipsoid (n-vector) that transforms the 2-D latitude/longitude position system into a 3-D vector while maintaining unique vectors for every geographical position. Employing a version of the n-vector presented by Gade (2010) We found introducing geographical information to be a powerful addition in improving the skill of the method for C T by 16 % or 2.2 µmol kg −1 , however there was little improvement for A T (Fig. 8). The optimal SOMLO configuration additionally incorporates longitude and n-vector geographical inputs in constraining A T and C T respectively, and increased the optimal number of neurons to 64 for C T .
To better understand and visualize why geography is important, we compare the spatial distribution of neurons for C T models trained with only biogeochemical information, and both biogeochemical and geographical information (Fig. 9a, b). To illustrate the spatial distribution of the assigned neurons for the global carbon dataset, we plot the neurons using different colours. Here, each colour represents a neuron, while shades of colours indicate close similarity in the weighting vectors. The broad regions of similarity that are captured when the SOM is constrained by only biogeochemical properties include the Southern Ocean, sub-tropical gyres, North Pacific and North Atlantic (Fig. 9a). However, these ocean "fingerprints" extend beyond the known spatial length scales, for example linking features in the Southern Ocean to those of the North Atlantic, while zonal bands stretch across ocean basins (Fig. 9a). When biogeochemical and geographical information are incorporated into the SOM training routine, the resulting distribution preserves the neuron boundaries at known frontal zones, such as the subtropical convergence zone, but is able to constrain the classification of data to be within each ocean basin (Fig. 9b). Using geography is an important additional constraint that implicitly shortens the length scales of variability which dominate seasonal mixed-layer dynamics in the ocean. It is important to note that the addition of geography did not alter the optimal parameter set for the technique.

SOMM/SOMLO comparison
Optimal model configurations were tested with neuron sizes extending up to 2500 to explore the ability of the SOMM model in constraining the three independent datasets (Fig. 10). Using all data the SOMM model converged on an RSE value of 16 µmol kg −1 in constraining C T . Although the SOMM is powerful in constraining complex non-linear datasets, spatio-temporal limitations in the current global carbon dataset hamper the SOM's mean-mode ability to predict C T on a global scale.
We found using a local multiple-linear optimizer (i.e. the MLR) in addition to the global SOM optimizer to significantly improve the model's ability to constrain global C T by  ∼ 27% or 4.4 µmol kg −1 . Similar findings are found for the A T model.

Measuring the improvement over traditional MLR
To evaluate the skill of the two independent approaches used here (MLR versus SOMLO), we tabulated the results of each technique based on the global SIT predictions divided into 5 Fig. 12. Geographical distribution of the 277 samples located within 300 km of a major coastline and with a SIT residual error greater than ±50 µmol kg −1 for C T and/or A T . geographical regions and evaluated globally ( Table 5). The SOMLO approach improves the predictive skill of C T by between 11 and 30 % for all 5 regions (Table 5). In particular, known complex dynamical regions with global CO 2 importance like the equatorial Pacific, Southern Ocean and North Atlantic are where the non-linear SOMLO approach excelled, improving the prediction of C T by between 23 and 30 % (or 4-6 µmol kg −1 ). From a global point of view, SOMLO improves the predictive skill of C T in the mixed layer by ∼ 19 %.
For A T , the benefits of using SOMLO are much weaker, with only a marginal global improvement by 6.7 % (or 0.7 µmol kg −1 ) and even deterioration of detection in the equatorial Pacific and North Atlantic. This is most likely a result of the carbonate system being less prone to nonlinearities and complexity, thereby limiting the benefits of    Fig. F2 for map of spatial division).  SOMLO, since it better constrains more complex non-linear systems.

SOMLO regional error assessment
To scrutinize the spatial skill of the SOMLO model, absolute values of the global SIT residual errors were interpolated around the in situ sample locations using VG gridding software in the Ocean Data View (ODV) program (Schlitzer, R.: Ocean Data View, http://odv.awi. de, 2011). Although the Arctic Ocean, Bay of Bengal and Sea of Okhotsk are regions not well constrained by the technique, the majority of the ocean maintains a relatively homogenous residual error range (Fig. 11). These unconstrained regions are either coastal or marginal seas with known locally complex biogeochemical regimes, so it is understandable that a trained global openocean technique will poorly constrain these local regions. Further investigation of the 395 samples with a SIT residual error greater than ±50 µmol kg −1 for C T and/or A T , revealed that 70 % (277) are located within 300 km of a major coastline (Fig. 12). Since a study by Gibbs et al. (2006) identified terrestrial influences extend up to 345 km from land and well beyond our bathymetric defined coastal ocean limit of 500 m (Supplement B), these anomalous independently predictions are likely the result of land-ocean interactions affecting the carbon and SHP concentrations. Separating the SIT predictions into 14 different regions and removing these anomalous coastal samples then provides the most accurate constraint on the models regional open-ocean skill (Table 6). Again we find the Arctic Ocean and Bay of Bengal are the two regions were the model's skill is poorest. Through the exclusion of Arctic Ocean measurements (North of 70 • ), the final estimate for the global open-ocean accuracy for C T and A T is 10.9 and 9.2 µmol kg −1 respectively.
To investigate skewness, we plot the SOMLO global SIT predictions versus the in situ measurements (Fig. 13a, c). For C T , skewness is limited (R 2 = 0.98), giving us confidence in the model's ability to accurately capture the concentrations of C T and A T for any given set of temperature, salinity, dissolved oxygen, (silicate for A T ), and phosphate measurements in the open ocean mixed layer.
Finally, we found no strong seasonal bias in our SOMLO predictions (Fig. 14).

Application to the Bermuda Atlantic and Hawaiian ocean time-series sites
The SOMLO technique was trained on a global C T and A T dataset that consisted mostly of sporadic one-time cruises in time. To test how well seasonal to inter-annual variability is captured using our technique, we use carbon time-series data from the BATS and HOT stations as a test bed.

Predicting the North Atlantic seasonal cycle for inorganic carbon (BATS)
Located in the Sargasso Sea, the BATS hydrographic site is a high frequency measurement program of carbon and auxiliary parameters that has been ongoing since 1989. To test the global SOMLO model in reconstructing the BATS seasonal cycle, we firstly re-trained the global algorithm without using the BATS 1989-2007 carbon time-series dataset. We then use the measured monthly hydrographic properties between 1987 and 2007 to independently predict C T and A T concentrations at the BATS site and finally compare our predicted carbon values to the in situ measurements to investigate the skill of the technique. We also independently predict C T /A T values with the traditional MLR approach as a further test. Figure 15a and b shows the measured versus predicted C T and A T annual cycles at BATS. Within the uncertainty of the SOMLO prediction, both the magnitude and structure of the seasonal C T cycle at BATS is well constrained, capturing 90 % of the signal (Fig. 15a). For a global MLR approach, the seasonal cycle is overestimated significantly by ∼ 50 %. For A T , the small seasonality is captured by both techniques (Fig. 15b).
To gain better insight into how the SOMLO substantially improves the prediction of the BATS seasonal cycle from the traditional MLR analysis, we investigate the neuron distribution for C T in the northwestern Atlantic (Fig. 16). Applying a traditional ad hoc MLR analysis requires defining somewhat subjective longitude and latitude boundaries for the data to be used in the linear regressions. Here, as an illustration we use the spatial boundaries of 30 to 70 • N and 40 to 85 • W that were also used by Lee et al. (2000) in their MLR approach. The traditional MLR explicitly uses all carbon data within the prescribed region, whilst the SOMLO approach partitions the data into neurons without any prior geographic constraints. The benefit in this approach is that when we are applying the SOMLO to a new dataset (in this case BATS) the SOM only uses neurons (data) most consistent with its "biogeochemical fingerprint", and therefore reduces the potential bias that would be introduced from including all data in the regression.

How well does SOMLO capture inter-annual signals?
Inter-annual variability of C T at BATS is captured to within the uncertainty of the SOMLO technique over the 18 yr period (Fig. 17). This illustrates a new potentially powerful way to diagnose year-to-year carbon variability in the ocean by using the many more long-term hydrographic time series that are available in the ocean (McNeil, 2010). To further test the SOMLO approach in capturing inter-annual variability of C T , we predict the C T signal at the HOT time series as reported by Brix et al. (2004). The SOMLO prediction captures the smoothed inter-annual trend line at the HOT site to within 85 % (Fig. 18). The BATS and HOT comparisons provide additional confidence that the SOMLO approach provides good constraints on both seasonal and inter-annual variability of C T , so that it could be used on a wider scale to help understand the ocean's role in modulating atmospheric CO 2 .

Comparison to previous techniques
It is important to emphasize reported error estimates of previous empirical studies to those calculated here. RSE values presented by previous empirical studies (see Table 1) are calculated from the regression's residual error rather than independent tests as done here, so direct comparisons between previous studies and our results are not valid. We use the systematic independent test approach (see Sect. 3) in order to accurately report the differences between our results and previous traditional MLR results.
We conduct two sets of calculations as shown in Table 7. The first set of calculations (MLR old ) involves taking the regressions from a suite of prior work (Bates et al., 2006;Lee et al., 2006Lee et al., , 2000McNeil et al., 2007) and applying it to the new larger dataset within each region. The second set of calculations (MLR new ) involved developing our own set of regressions using the same geographical and temporal boundaries and predictor variables as the previous authors within the much larger dataset. Using the SIT predictions, the skill of the models were calculated (RSE) and could then be directly compared to our SOMLO values (see Table 7).
The SOMLO, as shown at BATS/HOT, improves the predictive skill of C T and A T in most regions by between 10 and 40 %. Globally for C T , the SOMLO reduces the error by 28 % beyond the MLR method that was used to conduct the only global analysis (Lee et al., 2000).

Diagnosing global C T and A T distributions
Large historical and recent datasets up until 2008 of temperature, salinity, dissolved oxygen and nutrients has allowed researchers to objectively interpolate monthly 1 • × 1 • global climatologies Garcia et al., 2010a, b;Locarnini et al., 2010). Here, we employed the WOA09 ocean surface climatologies in conjunction with our SOMLO Fig. 17. In situ and independently predicted BATS C T measurements partitioned into years with loess fit (locally weighted scatter-plot smoothing). Large-scale features in our estimated annual mean C T and A T distributions exhibit strong agreement with bottle measurements and follows our broader understanding of spatial carbon variability (Figs. 19,20). In the Southern Ocean, for example, we find longitudinally homogenous bands driven by the Antarctic Circumpolar Current (ACC), and higher C T concentrations relative to the global-mean due to strong upwelling of CO 2 -enriched subsurface waters and cooler surface temperatures enhancing CO 2 solubility (McNeil et al., 2007;Metzl et al., 2006). In equatorial upwelling regions, cold waters enriched with remineralized organic material are brought to surface resulting in elevated C T and A T concentrations (Feely et al., 2002). As the surface water is then transported laterally from the site of upwelling, biological processes and loss of CO 2 to the atmosphere reduces C T to some of the lowest concentrations observed globally. For A T , maxima concentrations are found in the central subtropical gyres (∼ 25 • ), where stronger evaporation relative to precipitation drives higher ocean surface salinity and therefore A T concentrations (Lee et al., 2006;Millero et al., 1998). Conversely,    freshwater input from rivers and seasonal ice melt lowers A T concentrations in regions like the Bay of Bengal (George et al., 1994) and Arctic marginal waters.  interpolated bottle measurements collected between 1985 and 1999 to diagnose 1 • × 1 • resolution global climatologies for C T and A T at 33 depth surfaces (GLODAP-v1.1; available at: http://cdiac.ornl.gov/ oceans/glodap/Glop grid OV.html). Comparison between the GLODAP-v1.1 0 m carbon distributions and our results shows good general agreement (Figs. 19, 20). However, our average C T concentration between 65 • N and 77 • S is 14 µmol kg −1 higher than the GLODAP-v1.1 average of 2033 µmol kg −1 . In particular, the Southern Ocean and equatorial Pacific are where we find the largest discrepancies. This could either reflect the uptake of anthropogenic CO 2 that was not accounted for in the GLODAP study , or result from a 30 % improvement in Southern Ocean data coverage since 1999. However, it's likely that the large spatial and temporal bias within the GLODAP dataset plays the largest role in this discrepancy.

Conclusions
Here, we have exploited the global carbon C T /A T mixedlayer bottle database (∼ 33 000) to investigate two different empirical approaches that diagnose mixed-layer carbon dynamics from standard hydrographic parameters. Using independent data as a test, the traditional multiple linear regression approach constrains the C T system to 15.6 µmol kg −1 and A T to 10.4 µmol kg −1 . We then deploy a new non-linear neural network based approach that improves the predictive skill by 2.7-3 µmol kg −1 for C T , or ∼ 19 % over the MLR, and 0.7-1.4 µmol kg −1 for A T or ∼ 10 %. In particular, regions of known complexity and importance to carbon cycling like the Southern Ocean, North Atlantic and equatorial Pacific are where the new non-linear approach excels, reducing errors by up to 35 % over traditional linear approaches. We further test our neural network technique and find it to predict both seasonal and inter-annual variability of carbon at BATS and HOT very well.
The predictive skill of the neural network approach is shown to be spatially and temporally robust, making the model a powerful tool for diagnosing carbon dynamics in the ocean. In reality, the intensity of a sampling regime needed to constrain seasonal to inter-annual variability for carbon is so great that it will always be difficult to achieve on a global scale. We demonstrate here, that the use of non-linear empirical techniques on a global scale could potentially advance our understanding of oceanic carbon variability, particularly in a future where the amount of autonomous hydrographic data is increasing exponentially.