Weekly reconstruction of pH and total alkalinity in an upwelling-dominated coastal ecosystem through neural networks (ATpH-NN): The case of Ría de Vigo (NW Spain) between 1992 and 2019

10 Short and long-term variability of seawater carbon dioxide (CO2) system shows large differences between different ecosystems which are derived from the characteristic processes of each area. The high variability of coastal ecosystems, their ecological and economic significance, the anthropogenic influence on them and their behavior as sources or sinks of atmospheric CO2, highlight the relevance to better understand the processes that underlie the variability and the alterations of the CO2 system at different spatiotemporal 15 scales. To confidently achieve this purpose, it is necessary to have high-frequency data sustained over several years in different regions. In this work, we contribute to this need by configuring and training two neural networks with the capacity to model the weekly variability of pH and total alkalinity (AT) in the upper 50 m of the water column of the Ría de Vigo (NW Spain), with an error of 0.031 pH units and 10.9 μmol kg respectively. With these networks, we generated weekly time series of pH and AT in seven 20 locations of the Ría de Vigo in three depth ranges (0-5 m, 5-10 m and 10-15 m), which adequately represent independent discrete measurements. In a first analysis of the time series, a high short-term variability is observed, being larger for the inner stations of the Ría de Vigo. The lowest values of pH and AT were obtained for the inner zone, showing a progressive increase towards the outer/middle zone of the ría. The mean seasonal cycle also reflects the gradient between both zones, with a larger amplitude and variability 25 for both variables in the inner zone. On the other hand, the long-term trends derived from the time series of pH show a higher acidification than that obtained for the open ocean, with surface trends ranging from 0.020 pH units per year in the outer/middle zone to -0.032 pH units per year in the inner zone. In addition, positive long-term trends of AT were obtained ranging from 0.39 μmol kg per year in the outer/middle zone to 2.86 μmol kg per year in the inner zone. The results presented in this study show the changing 30 conditions both in the short and long-term variability as well as the spatial differentiation between the inner and outer/middle zone to which the organisms of the Ría de Vigo are subjected. The neural networks and the database provided in this study offer the opportunity to evaluate the CO2 system in an environment of https://doi.org/10.5194/bg-2021-33 Preprint. Discussion started: 15 February 2021 c © Author(s) 2021. CC BY 4.0 License.

anthropogenic processes interact, generates totally different characteristics among coastal ecosystems that highlights the need to study the peculiarities of each one to understand the variability of the CO2 system in detail.

75
The present study is centered in the Ría de Vigo (NW Spain), a coastal embayment affected by the Canary Current upwelling system. The upwelling season typically occurs between March and October (Figueiras et al., 2002), bringing high CO2, low pH and nutrient-rich waters from the Eastern North Atlantic Central Water (Fraga, 1981;Ríos et al., 1992) to the Ría de Vigo and promoting a positive estuarine circulation.
From November to March, poleward winds change the dynamics of the outer part of the ría to a negative 80 estuarine circulation and the inner part conserves the positive circulation forced by continental freshwater inputs (Álvarez-Salgado et al., 2000;Piedracoba et al., 2005). The nutrient enrichment because of upwelling, but also because of terrestrial run off (Nogueira et al., 1997), is translated into high primary production rates supporting a variety of fisheries in the area, which are very important for local economy (Fernández et al., 2016).

85
The main processes affecting the Ría de Vigo and its interaction are drivers of the variability of the CO2 system, as previously showed by the studies cited for other coastal locations. Very few studies have focused on analyzing the variability of the CO2 system in the Ría de Vigo. Essentially, the scarce studies have assessed the short-term variability of partial pressure and fluxes of CO2 (Gago et al., 2003;Pérez et al., 1999). The large short-term variability and the long-term trends presented by the drivers of the CO2 system, 90 as the weakening of the upwelling intensity (Lemos and Sansó, 2006;Pardo et al., 2011;Pérez et al., 2010), warming (Lemos andSansó, 2006;Pérez et al., 2010), increased nutrients (Doval et al., 2016) and deoxygenation (Padin et al., 2020), show the need to obtain high-frequency data over a long period of time to robustly evaluate the variability of the system at different time scales.
In this study, we reconstruct time series of in situ pH on the total hydrogen ion scale and AT at seven 95 locations in the Ría de Vigo in three depth ranges (0-5 m, 5-10 m and 10-15 m) with a weekly frequency between 1992 and 2019 to provide a database for the detailed analysis of the CO2 system in this coastal environment of great ecological and economic significance. The non-linearity of the CO2 system leads us to use the ability of neural networks to capture the nonlinear relationships among different variables and thus obtain the time series with a low uncertainty in this highly dynamic system. Based on the good results

100
of several studies to model variables of the CO2 system using feed-forward neural networks both in open ocean (Bittig et al., 2018;Broullón et al., 2019Broullón et al., , 2020aLandschützer et al., 2013;Velo et al., 2013) and in coastal environments (Li et al., 2020a(Li et al., , 2020b, and the higher performance showed among other methods (ocean: Velo et al., 2013;coastal: Li et al., 2020a), we selected this type of neural network for the obtention of the time series.

Neural Networks
Feed-forward neural networks are powerful tools to model complex nonlinear functions (Hagan et al., 2014). A neural network is fed with data pairs of inputs and a target in order to obtain a global relation between both type of variables to compute the target variable from independent data with considerably low 110 https://doi.org/10.5194/bg-2021-33 Preprint. Discussion started: 15 February 2021 c Author(s) 2021. CC BY 4.0 License. error. The network structure is made up of different layers with specific tasks (Fig S1a). The first layer (input layer) stores the input data and sends them to the next layer. In the second layer (hidden layer), a tunable number of neurons receive the data from the previous layer after being weighted by a different coefficient for each input variable and for each neuron in which the information goes in. Inside each neuron (Fig S1b), the weighted data is summed and passed through an activation function, which typically is a 115 sigmoid function, to obtain an output. The network can have several hidden layers depending on the problem to be solved. Finally, if a single target variable is being modeled, the final layer (output layer) contains one neuron with a linear activation function to obtain the desired output, following the process explained for a neuron of the hidden layer.
The goal of a neural network like the previously described is to reach output values closest to the target 120 ones, that is, with a low difference between them. To achieve the goal, this difference is backpropagated through the network to adjust the weights in order to reduce the error through the minimization of a cost function in multiple iterations of the procedure described so far. This process is known as training and its reliability must be tested in independent data to avoid the overfitting of the training data, that is, the output of the network on independent data should have a low error (good generalization). A full description on 125 how the error is reduced through the backpropagation algorithm and the different stopping conditions of the training process can be consulted in Hagan et al. (2014).
In the present study, multiple feed-forward neural networks with one hidden layer were configured to extract the relations between different combinations of input variables: temperature (T), salinity (S), phosphate (P), nitrate (N), silicate (Si), dissolved oxygen (O), position (LL: latitude and longitude; D: 130 depth) and time (Y: year; W: week); and the target ones: pH and AT. These variables were used as inputs to the networks in the combinations as follows, given their influence in the variability of the target variables, which have been shown in different studies (Sect. 1): DTS, DTSN, DTSO,DTSON,DTSOPN,DTSPN,DTSPNSi,DTSPNSiO,DTSPNSiOW,DTSPNSiOY,DTSPNSiOYW,DTSPNSiW,DTSPNSiY,DTSPNSiYW,LLDTSPNSi,LLDTSPNSiO,LLDTSPNSiOW,LLDTSPNSiOY,LLDTSPNSiOYW,135 LLDTSPNSiW, LLDTSPNSiY and LLDTSPNSiYW. The position and time were included to disentangle the different processes occurred in different locations of the Ría de Vigo, which may not always be fully captured by biogeochemical variables (Broullón et al., 2020a), such as continental runoff (Gago et al., 2005;Perez et al., 1992). The periodicity of the input week was represented by: For each combination of the input variables, the tested number of neurons in the hidden layer were: 28, 34, 40, 46, and 52. The number of neurons was selected increasing and decreasing by six the first number tested and testing in which direction the error in the independent dataset decreases to select the following number to test. A sigmoid activation function was selected for the hidden layer, since together with the linear 145 function of the output layer make the network able to fit almost any continuous function (Hagan et al., 2014). The Levenberg-Marquardt algorithm (Hagan and Menhaj, 1994;Levenberg, 1944;Marquardt, 1963)  was used for the training based on several good results obtained in similar studies (Broullón et al., 2019(Broullón et al., , 2020aLandschützer et al., 2013;Li et al., 2020aLi et al., , 2020bVelo et al., 2013). The training process with this algorithm was carried out in the MATLAB software with the trainlm function, detailed in Beale et al.
The dataset used to train the networks (Sect. 2.2) was divided in three sets: data used to assess the generalization of the network (testing dataset; 10% of the total data); data used to train the network (90% of the total data), which was divided in: training (85%) and validation (15%) datasets. The training dataset is used to extract the relations between the input variables and the target. The validation dataset is used to 155 stop the training process when the performance on this dataset does not improve during six consecutive iterations of the training process.

Data
The neural networks designed in the present study were trained with data collected by different cruises carried out by the Instituto de Investigaciones Marinas (IIM), dependent of the Consejo Superior de

160
Investigaciones Científicas (CSIC) from 1976 to 2018. Temperature, salinity, phosphate, nitrate, silicate, dissolved oxygen, pH and AT were measured in discrete samples from multiple profiles around the waters off the Northwestern Iberian Peninsula, mostly collected in the Ría de Vigo (Fig .1). We selected the data inside the Ría de Vigo and in the upper 50m of the water column to specifically capture the processes that control the variability of AT and pH in the area of study (Fig .1). A full description of this database can be 165 consulted in Padin et al. (2020), where sampling techniques are widely described.
The relations extracted with the neural networks from the previously mentioned database were used to create the time series of pH and AT at seven locations in the Ría de Vigo ( where the sampling and analytical techniques were extensively described. Time series of these variables were compared with discrete data from the IIM-database to assess their consistency (Appendix A).
Temporal gaps in nutrients time series were filled using neural networks (Appendix B). Furthermore,

185
Weekly time series of pH and AT were obtained passing through the neural networks the inputs from the INTECMAR-database. The measured pH and AT from the IIM-database were used to validate the neural network derived time series. The validation data used for each time series station is depicted in Fig. S2.
Distances from time series stations to validation points were selected based on a trade-off between spatial variability and the availability of samples with high temporal resolution around the stations. As an example, 190 station V1 (Fig. S2) has a wider validation area than others to capture discrete samples with both monthly and weekly time resolution.
The seasonal cycle of pH and AT was obtained averaging the time series to describe the annual amplitude, the variability and the differences among stations. The long-term trends were also obtained for the two variables to show the possible future conditions in the Ría de Vigo. A seasonal detrending to remove the 195 seasonality following the detrending method used by Bates et al. (2014) was applied to obtain the trends.
The uncertainty of the trends was obtained based on a Monte Carlo simulation. To properly compute the aforementioned values, the outliers were removed from each time series based on the scaled median absolute deviation (sMAD; Hampel, 1974), which has been proposed as a robust outlier estimator (e.g., Leys et al., 2013), following Eq. (3): where is each value of the time series and the full time series. Outliers were those values that exceeded more than three sMAD away from the median. biological activity and dissolution and precipitation of biogenic calcium carbonate (Brewer et al., 1975;Brewer and Goldman, 1976;Cai et al., 2010).
When each network is individually analyzed, other combinations of inputs show better statistics (Table 1) than the ones of the previously mentioned combinations, where averages of the three best networks per each number of neurons were analyzed. Based on these results, a trade-off between the RMSE and r 2 in the 225 test dataset and the most accurate representation of the validation data when constructing the time series of pH and AT (Sect. 3.2), was followed to select the networks which best represent the weekly variability of pH and AT from 1992 to 2019. Considering this evaluation, DTSPNSiOYW trained with 28 neurons in the hidden layer for pH (hereinafter referred to as pH_NN) and LLDTSPNSiYW trained with 52 neurons in the hidden layer for AT (hereinafter referred to as AT_NN) were selected to reconstruct the time series
The modeled pH and AT with pH_NN and AT_NN show the largest differences with the measured values in samples where temperature or dissolved oxygen are higher than 16 ºC or 300 µmol kg -1 respectively, and in those where salinity, nitrate or depth are lower than 35, 2.5 µmol kg -1 or 15 m respectively (Fig. 3). This result suggests the same biogeochemical conditions are similarly affecting the modeling of the two 235 variables. Nevertheless, most of the differences in these samples are not very large, considering the high variability in the Ría de Vigo regarding both the physical (Nogueira et al., 1997) and the biogeochemical (Doval et al., 2016;Nogueira et al., 1997) context. In samples where the measured pH is lower than 8 and higher than 8.1 and AT is lower than 2300 µmol kg -1 , the computed values are slightly biased (Fig. 3). These biases are mainly determined by samples where absolute differences between the measured and modeled 240 values are higher than 3 times the root mean square error (3RMSE), which represent 1% of the samples used to create each network. For pH, 48% of the samples with differences higher than 3RMSE in the biased ranges are in the 0-15 m layer with nitrate concentrations lower than 2.5 µmol kg -1 . For AT, 86% of the samples with differences higher than 3RMSE in the biased range are in the 0-10 m layer with salinities lower than 35 and most of them with nitrate concentrations lower than 2.5 µmol kg -1 . Although it is 245 important to know the conditions where the neural networks present the largest differences with the measured data, these samples only represent 1.7% for pH and 5.7% for AT of the total samples in the mentioned ranges, suggesting that pH and AT values with low uncertainties can be obtained for very different conditions. The importance of each input variable extracted with the connection weight approach (Olden and Jackson,250 2002) (Table 2) reveals that pH is mainly modeled through salinity, temperature and year. Depth, sW and dissolved oxygen are the second group of variables with the largest influence on the modeled pH. Nutrients and cW are the input variables with the lowest relative importance. On the other hand, the modeled AT is highly influenced by silicate, with a secondary importance of phosphate and with temperature and year in the third place of relevance. The other input variables have a low influence. Although surface AT is highly 255 correlated with salinity worldwide (e.g., Millero et al., 1998), the low influence of this input variable in the modeled AT does not reduce the high correlation between the two variables (measured data = 0.96; https://doi.org/10.5194/bg-2021-33 Preprint. Discussion started: 15 February 2021 c Author(s) 2021. CC BY 4.0 License. modeled data = 0.97) nor the linear regression (measured data = 61.9 · + 132; modeled data = 62 · + 129).  (Table 3). This rate of change is larger in the inner stations than in the outer ones in the 0-5 m layer (Table 3), showing the higher variability of the inner 280 zone of the Ría de Vigo. In the 10-15 m layer, the rate of change presents the mentioned difference between zones for AT but to the contrary for pH (Table 3).

Time series reconstruction
Differences in the magnitude of the values can be appreciated between the outer and the inner zones (  input variables (Table S1), is transferred to the computed variables resulting in the aforementioned difference between zones. In general, a reduction of both the amplitude and the variability with depth is obtained for pH and AT (Table 4). The reduction of the amplitude of the seasonal cycle is also obtained for all the input variables, except for dissolved oxygen in the inner stations (Table S1).
Maxima of pH occur in March and April and minima in October and November (Fig. 6). In general, the 300 seasonal cycle is highly correlated with that of dissolved oxygen (Fig. S7), which is mainly driven by the net balance of production and respiration of organic matter. High photosynthetic activity of the phytoplankton community is maintained in the Ría de Vigo from early spring to late summer thanks to influence of the upwelling events (Cermeño et al., 2006;Tilstone et al., 1999), which in average occur Maxima of AT seasonal cycle emerge in August and September, but minima are not clearly defined since low concentrations are maintained between November and May (Fig. 7). This feature is determined by the correlation between AT and salinity, reflecting the net freshwater input through the year, with high salinity values in summer due to the excess of evaporation and low values in winter and spring determined by 315 precipitation and continental runoff (Fig. S8). The higher amplitude and variability in the inner zone (Table   4 and Fig. 7) are mainly determined by the hydrological cycle of the river which flows into the inner zone of the Ría de Vigo.
Spatial differences in the magnitude of the seasonal cycle of pH and AT follow a longitudinal gradient, showing higher values in the outer zone with a progressively decrease towards the inner zone (Fig. 8). As 320 previously remarked, these differences mainly reflect the dissolved oxygen gradient for pH and the salinity gradient for AT, since both input variables decrease towards the inner zone (Table S1 and Figs. S7 and S8).
These spatial differences in the CO2 system variables generate a gradient in the buffer capacity between the outer and the inner zones. The ratio DIC/AT, which can be seen as an indicator of the sensitivity of seawater to changes in CO2 since the Revelle factor is proportional to it (Egleston et al., 2010), is higher in the inner 325 zone (0.91) than in the outer one (0.89), indicating a lower capacity of the seawater of the inner zone to buffer changes because of the increasing anthropogenic CO2. An analysis of the average pH for each decade reveals that the acidification process is stronger for all the stations between the 90s and the 00s (Fig. 9). Furthermore, the lack of data in 1990 and 1991 would likely 340 increase the pH difference between both decades because of the probably lower pH values for those two years that in the following ones. It can also be appreciated that the fact that pH trends are larger in the inner zone than in the outer/middle one (Table 5) (Table 5). As for pH, the largest trends are in the surface of the inner zone and decrease outwards. The AT trends cannot be explained by salinity trends (except in V7), contrary to the expected by the correlation 365 between both variables. The higher trends in the inner zone suggest that an increase in the riverine AT could be contributing to generate them. This process has been associated to positive AT trends in other locations (Kapsenberg et al., 2017). Nevertheless, it is necessary a detailed analysis to find what processes cause the trends, but this is beyond the scope of the present study. A first analysis of the product exposed differences between the inner and the outer/middle zone of the Ría de Vigo, where a larger variability and amplitude of the seasonal cycle and lower values for pH and AT were obtained for the inner zone. Furthermore, significant negative long-term trends were obtained for pH with a higher magnitude than the ones typically found for open ocean, revealing a higher acidification for 380 this coastal ecosystem. On the other hand, significant positive long-term trends were obtained for AT. A spatial gradient was also revealed for the magnitude of the trends presented by the two variables, increasing towards the inner zone.

Long
The database provided in the present study offers an opportunity to evaluate the variability of the CO2 system in an ecologically and economically important ecosystem as the Ría de Vigo. It can be very useful 385 in regional and high-resolution modeling, in driver analysis for different variables of the CO2 system, as some of high relevance for calcifying organisms like calcium carbonate saturation state, and to stablish more accurate conditions in experiments evaluating the impacts of climate change in the organisms living in the Ría de Vigo.

390
Time series of the input variables from the INTECMAR-database have been compared to discrete measurements from the IIM-database to assess the consistency between both datasets. This analysis is necessary because of the differences in the methods used to measure some of the variables. For example, dissolved oxygen in IIM-database was measured using the Winkler method, as others described in Padin et al. (2020), which generally provide more accurate values than the measured in the INTECMAR-database 395 through a SBE 43 dissolved oxygen sensor integrated in a CTD.
Temperature comparison shows a good agreement between the two datasets (Fig. A1). The typical annual cycle with the highest values in summer and the lowest in winter is depicted in the two datasets.
Furthermore, episodes of temperature decrease in the warm season because of upwelling events are well defined in the two datasets.

400
In general, salinity from the two datasets also shows a good agreement (Fig. A1). Nevertheless, there are some discrepancies in all the stations in the winter of 2018, where salinity in the IIM-database shows a homogeneous increase from winter to spring and in the INTECMAR-database a high short-term variability with an extremely high peak (which is still higher with depth) was obtained. These differences are probably because of some problem (e.g., bad calibration) in the conductivity sensor of the CTD used by INTECMAR

405
and are probably translated to the computed values of pH and AT.
Nutrients are very consistent between the two databases (Fig. A1) Dissolved oxygen is the variable that shows the largest discrepancies (Fig. A1).

Appendix B: Gap filling in nutrient time series
Time series of nutrients measured by INTECMAR have some gaps which reduce the time range and the time resolution (Fig. A1). To fill these gaps, different neural networks were configured. As nutrient measurements in the INTECMAR-database are consistent with the ones in the IIM-database (Fig. A1), we 425 decided to train the neural networks with data from the two databases to obtain highly representative relations of the sampling sites between inputs and targets and properly fill the gaps in the time series. The inputs used for the three modeled nutrients were position (latitude, longitude and depth), temperature, salinity, nitrate (to model phosphate and silicate since it was measured where there are gaps in these two variables), silicate (to model nitrate since it was measured where there are gaps in this variable) and week 430 of sampling (as in Eq. (3)). The position and time were included to disentangle the different processes occurred in different locations of the Ría de Vigo, which may not always be fully captured by biogeochemical variables, such as continental runoff (Pérez et al., 1992;Gago et al., 2005). We decided to take as inputs all the possible variables related to nutrients variability based on the results of Sect The RMSE and r 2 in the test dataset for the number of neurons tested were between: 0.13-0.14 µmol kg -1 and 0.66-0.73 for phosphate; 1.5-1.6 µmol kg -1 and 0.70-0.84 for nitrate; and 1.5-1.7 µmol kg -1 and 0.75-440 0.8 for silicate respectively. There are not significant differences among the RMSE for each number of neurons and slight differences in the r 2 . We decided to select the networks with the best statistics regarding these two parameters. The selected networks show that the inclusion of a nutrient as an input is relevant to obtain accurate values of the modeled nutrients as it is showed by the relative importance of the input variables (Table B1). In general, all the networks adequately represent the time series from the INTECMAR-database (Fig. B1), except the extremely high peaks described in Appendix A, which are not fully represented. The filling of phosphate time series before 1995 captures the annual cycles and their amplitude as it is showed by the consistency with the discrete samples from the IIM-database (Fig. B1). The gaps filled in both the nitrate and silicate time series also represent the annual cycles and their amplitude with a similar magnitude than As in Sect. 3.1, the RMSE and the r 2 improves with increasing number of inputs, when the statistics of the 465 three networks with the best performance in the test dataset are averaged (Fig. C1). For the best combination of inputs, that is, LLDTSPNSiYW, we selected one of the ten networks trained with 28 neurons to reconstruct the time series of dissolved oxygen, since the computed values with this network showed the highest correlation with the measured data in the test dataset (RMSE: 16.2 µmol kg -1 ; r 2 : 0.85). The relative importance of each input variable (Table C1)

475
The comparison between the time series of dissolved oxygen from the INTECMAR-database, those reconstructed with the neural network created in this section and the discrete data from the IIM-database shows that the reconstructed time series better represent the discrete measurements for all the stations (Fig.   C2). More precisely, the good correlation between the discrete measurements of dissolved oxygen and the Similarly, the unreliability of the negative anomalies can be seen at V5 and V6 around 2018 (Fig. C2). With these results, we demonstrated that the new dissolved oxygen time series may be useful in obtaining the 485 time series of the target variables of this study: pH and AT.

Code availability
The MATLAB code used in the present study is available through the data repository of the Spanish The two neural network objects used to compute pH and AT are also included.

Data availability
The      Table B1. RMSE, r 2 and the relative importance of each input for the three selected networks to fill the gaps in the time series of nutrients from the INTECMAR-database.