A Comparative Assessment of Gap-filling Techniques for Ocean Carbon Time Series

Regularized time series of ocean carbon data are necessary for assessing seasonal dynamics, annual budgets, interannual variability and long-term trends. There are, however, no standardized methods for imputing gaps in ocean carbon 15 time series, and only limited evaluation of the numerous methods available for constructing uninterrupted time series. A comparative assessment of eight imputation models was performed using data from seven long-term monitoring sites. Multivariate linear regression (MLR), mean imputation, linear interpolation, spline interpolation, Stineman interpolation, Kalman filtering, weighted moving average and multiple imputation by chained equation (MICE) models were compared using cross-validation to determine error and bias. A bootstrapping approach was employed to determine model sensitivity to varied 20 degrees of data gaps and secondary time series with artificial gaps were used to evaluate impacts on seasonality and annual summations and to estimate uncertainty. All models were fit to DIC time series, with MLR and MICE models also applied to field measurements of temperature, salinity and remotely sensed chlorophyll, with model coefficients fit for monthly mean conditions. MLR estimated DIC with a mean error of 8.8 μmol kg-1 among 5 oceanic sites and 20.0 μmol kg-1 among 2 coastal sites. The empirical methods of MLR, MICE and mean imputation retained observed seasonal cycles over greater amounts 25 and durations of gaps resulting in lower error in annual budgets, outperforming the other statistical methods. MLR had lower bias and sampling sensitivity than MICE and mean imputation and provided the most robust option for imputing time series with gaps of various duration.

the feedbacks between climate, ecosystems, and biogeochemical cycles. To that end, the value of sustained time series 35 observations have been well recognized for decades, as they are essential to characterizing processes, quantifying natural variability, identifying regime shifts and detecting long-term changes in our environment (Ducklow et al., 2009). Monitoring ocean carbon over the last three decades has revealed the decline in ocean pH concurrent with the uptake of 40% of anthropogenic CO2 by the global ocean (Bates et al., 2014;DeVries et al., 2017). Quantification of the ocean carbon sink and the impacts of ocean acidification remain actively researched given the significance of the ocean's role in controlling climate 40 feedbacks as well as the ecological and economical importance of our marine systems (Kroeker et al., 2013;DeVries et al., 2019;Krissansen-Totton et al., 2018;Bernardello et al., 2014). Ocean carbon programs have led to a growth in surface pCO2 data from 250,000 global measurements in 1997 to 13.5 million in 2019; however, continuity and coverage of this inorganic carbon data in space and time remains a challenge for understanding seasonal and interannual variability (Takahashi and Sutherland, 2019;Takahashi et al., 1997). 45

Filling the gaps
Consistent sampling intervals for physical and biogeochemical parameters over several decades are critical for understanding ocean processes, establishing variability and detecting long-term changes (Henson et al., 2016). In addition to constraints arising from limitations in technology, logistics and funding, ocean science takes place in a particularly harsh environment where data loss is a common occurrence. Whether from equipment failure, cancelled field campaigns, budget cuts, or a global 50 pandemic, gaps in time series are ubiquitous and must be appropriately filled in order to carry out various statistical analyses and modelling applications which require serially complete data sets.
Machine learning techniques such as neural network methods, regression trees, and random forests have been widely used to fill gaps in meteorological and some oceanographic data, including surface ocean pCO2 (Laruelle et al., 2017;Sasse et al., 2013;Coutinho et al., 2018). While these methods are successful in the context of geospatial data, there remains little 55 standardization in methods for imputing data gaps in oceanographic time series, particularly carbonate chemistry, at monitoring sites where there are not sufficiently close neighboring values (in time or space) that can be leveraged. Linear interpolation and mean imputation are among the most common methods for handling missing data over short to moderate time scales (Reimer et al., 2017;Kapsenberg and Hofmann, 2016;Currie et al., 2011), but comparative assessment and validation of approaches overall is lacking. Gap-filling studies and standardization have been pursued in other terrestrial and atmospheric 60 disciplines, such as eddy covariance carbon flux, solar radiation, air temperature, surface hydrology, and soil respiration (Moffat et al., 2007;Demirhan and Renwick, 2018;Zhao et al., 2020;Henn et al., 2013;Pappas et al., 2014), many of which focused on high temporal resolution data and imputing missing values over time scales from seconds to days. However it is important that the imputation method not only focuses on minimizing error but also minimizing bias, as the preservation of variance and trends is imperative for accurate analyses and understanding of climate (Serrano-Notivoli et al., 2019). 65 This study aims to identify the optimal gap-filling methods for carbonate time series by establishing which techniques perform with sufficiently low error and bias to assess seasonal and interannual variability of carbonate biogeochemistry and the biological and physical processes that determine it. Here we present an empirical multiple linear regression (MLR) model for estimating site-specific DIC concentration in the surface ocean using remotely sensed data products to fill gaps in field measurement records. We compare this MLR approach to two other empirical methods and five statistical methods for time 70 series imputation with the goal of informing best practice for gap-filling temporal ocean carbonate data. Although the focus here is on DIC time series, the principals of this study should extend to other carbonate parameters.  Fig. 1), which have been characterized in other studies (Bates et al., 2014;Fassbender et al., 2016;Fassbender et al., 2017;Zeldis and Swaney, 2018). Additionally, these time series have sufficient sampling frequencies and length of record to assess the monthly mean climatological conditions and seasonal cycle, so to allow inclusion of empirical imputation methods in this comparative assessment. Table 1 lists the site details including the carbonate parameters measured, the duration of the time series, and the gap rate based on the expected sampling frequency for each of the seven sites. 85

Field data
All mixed layer temperature, salinity and dissolved inorganic carbon (DIC) data were averaged to monthly means for each time series site. For non-moored sampling sites with bottle sampling (BATS, CARIACO, HOT, Munida), monthly values were treated as the monthly mean condition. For each site the mixed layer depth was determined according to the temperature profile and a threshold of DT > 0.2 o C relative to 10 m depth (de Boyer Montégut, 2004). For sites that did not measure DIC directly, the measured carbonate parameters were used with in situ temperature and salinity to calculate the DIC concentration using 90 the function carb within the R package seacarb (Jean-Pierre Gattuso et al., 2012;Orr et al., 2018) with K1, K2 from (Lueker, https://doi.org/10.5194/bg-2021-78 Preprint.

Remotely sensed data products
Monthly composites of satellite-derived surface ocean chlorophyll (O'Reilly et al., 1998) from MODIS (4 km resolution) data 95 were paired with field data from each site except FOT. The mean surface chlorophyll was taken from a ~20 km 2 cell surrounding each of these sampling locations. For FOT, surface chlorophyll was estimated from monthly composite of VIIRS data (750 m resolution), with the mean from a ~ 4 km 2 cell surrounding the mooring used in this case given the greater spatial heterogeneity in this semi-closed coastal system.

Estimation of DIC with MLR 100
DIC, pCO2 and other carbonate parameters have been successfully estimated in a variety of marine systems using multiple linear regression (MLR) approaches (Bostock et al., 2013;Velo et al., 2013;Hales et al., 2012;Lohrenz et al., 2018). In addition, empirical estimates of pCO2 using remotely sensed chlorophyll and sea surface temperature (SST) have proven useful for investigating seasonal and interannual dynamics across spatial gradients, particularly in coastal systems where sustained observations may be limited Lohrenz et al., 2018). We investigated using an MLR model to estimate DIC 105 from remotely sensed chlorophyll, SST and salinity in order to fill gaps in the seven monthly time series data. Parametric correlation matrices of DIC with remote chlorophyll, in situ SST and salinity showed significant linear correlation (Table 2), across most sites, with temperature having the strongest and most consistent correlation with DIC.
DIC at time t can be estimated using MLR relationships described in the form of Equation 1.
where has units of µmuol kg -1 , ℎ has units of mg m -3 , has units of o C, and has units of psu and the coefficients and " through $ are the regression coefficients fit using a generalized linear model with a Gaussian error distribution and link function. The sensitivity to each predictor variable was assessed by selectively omitting chlorophyll, temperature, and salinity from the model fit.

Imputation of DIC time series 115
Six general methods were compared for imputing DIC time series: classical, interpolation, Kalman filtering, weighted moving average (WMA) and regression and multiple imputation by chained equations (MICE). To apply the six methods, it was assumed that the gaps in the time series were data 'Missing at Random' (Little, 2002) that is, not missing systematically (i.e., as caused by seasonal sea ice cover or season-specific sampling regimes). Given this assumption, these methods can be used to handle data gaps with limited biasing. 120 The primary goal was imputing timeseries at monthly resolution to investigate variability and trends over seasonal, interannual and decadal timescales. Therefore, random sampling and persistence methods were not considered as these methods can lead to distortion of seasonal structure in the time series. Within the 6 methods chosen, 8 models were evaluated. These imputation models vary in complexity and flexibility and represent a range that have been widely applied to time series data, with 6 of the 125 8 models utilizing formalized packages (Demirhan and Renwick, 2018;Moritz, 2017). These methods limit overfitting and can be implemented with relative ease and low computational cost. Artificial data gaps were created as described below (Section 2.5) for the time series from each site in order to assess the performance of each method. In addition to the MLR model described by Equation 1, alternate models are described next.

130
The classical (and simplest) method applied was mean imputation, where missing values were replaced by the monthly climatological average. The climatological mean was taken as the monthly averaged means across the duration of the time series, which was over 1-2 decades in most cases. Linear interpolation was used to estimate missing values by drawing a straight line between existing values in the time series and using the slope of each of these segments to determine the value of DIC at a time point(s) between known values. Spline interpolation utilized piecewise cubic polynomials to fit a curve with 135 knots at % , K = 1,2…k, to the data, providing more flexibility with the ability to interpolate between each point of the training data. Stineman interpolation was developed to provide the flexibility of polynomials while reducing unrealistic estimations during abrupt changes in slope within the time series (Stineman, 1980) (see Demirhan and Renwick (2018) for algorithm details). Kalman filtering was implemented using a structural model. In this case a linear Gaussian state-space model was fit to the univariate time series by maximum likelihood based on decomposition (Demirhan and Renwick, 2018). A single 140 weighted moving average model was evaluated. Missing values were replaced by weighted average of observations in the averaging window with size = ±2 and weighting was exponential such that the exponent increases linearly to the ends of the window, here ¼, ½… ½, ¼.
Multiple Imputation by Chained Equations (MICE), also known as fully conditional specification (FCS) and sequential 145 regression multivariate imputation, was applied to time series data with artificial gaps and fit using the mice library (Van Buuren, 2011) (cite1) in R version 3.5.2 (Team, 2020) (cite2), with function call mice (data = TimeSeries$data,m = 5,method = "pmm",maxit = 20), where m is the number of multiple imputations, method is predictive mean matching and maxit is the maximum number of iterations. This method progresses through the following steps: 1) missing values are filled by random sampling from the observations for a given variable; 2) the first variable with missing values is regressed against all other 150 variables, while using only those with observed values; 3) moving iteratively, the remaining variables are regressed against the others but now including imputed values fitted by the regression models (White et al., 2011). This process is repeated according to the set iterations, in this case 20, to allow stabilization and convergence of the results. Regression models used in MICE allow for both linear and nonlinear relationships across variables, making this method very flexible.

Model performance and comparison 155
Each imputation model was evaluated using two schemes that assessed model performance and sampling sensitivity.

Cross validation
Leave one out cross validation (LOOCV) was chosen to assess the predictive error of the MLR model as well as the standard error for each imputation method. In this approach a single observation ( !&" ) is held out for validation while the remaining observations ( !&# … !&' ) are used for training the model. This process is repeated n-1 times, allowing each data point 160 in the time series to be treated as both training data and testing data, thus maximizing the efficiency when the data sets are of modest sampling size. Predicted DIC values and model parameters determined in each iteration were collated for the time series and performance statistics were evaluated on the total output.

Bootstrap sampling sensitivity
A bootstrapping approach was used to evaluate the sensitivity of the imputation models to the amount of data gaps in each 165 time series. For each year of input data in the time series, artificial gaps were created by random removal of 1:8 monthly samples resulting in data gaps of 8.33%, 16.67%, 25.00%, 33.33%, 41.67%, 50.00%, and 66.67%. Random sampling was replicated 1000 times for each gap amount to ensure that an even distribution of sampling combinations was evaluated to assess the impacts of degree of data gaps on imputation error. Only years with 12 monthly samples were used to evaluate the sampling sensitivity in order to ensure consistency. It should be noted that most data sets used in this study do not have monthly 170 mean data available for all years. Table 3 shows which years of data were used from each site and the distribution of years across sites.

Statistical performance metrics
The performance of each model was evaluated by comparing the predicted DIC values to the observed DIC measurements.
The performance metrics included the coefficient of (multiple) determination ( # ) for indicating correlation; the root mean 175 square error (RMSE), the relative root mean square error (RRMSE), and the mean absolute error (MAE) for establishing the distribution of individual errors; and the bias error (BIAS) for indicating bias induced on annual sums. Performance metrics were calculated according to Equations 2-6, where ( and ( denote the individual observed and predicted values respectively.

Error propagation, seasonality and annual summation
To evaluate the propagation of error associated with each imputation model we calculated the net annual air-sea CO2 flux at 185 BATS and compared the observed and imputed time series using 3 synthetic gap schemes: bimonthly, as well as 3-month and 6-month sequential gaps. We used a Monte Carlo method to estimate combined uncertainty of the net annual CO2 flux (Fassbender et al., 2016).

Calculation of air-sea CO2 flux
The flux of CO2 (µmol m -2 d -1 ) across the air-sea interface where a positive value denotes flux from the sea to the air was 190 calculated according to Equation 7: Where is the gas transfer velocity coefficient (Wanninkhof, 1992), 7 is the temperature-and salinity-dependent solubility of CO2 in seawater (Weiss, 1974), #, 19:;' and #, ;!< are the partial pressure of CO2 in the surface ocean and the atmosphere respectively. The ocean # was calculated from DIC and TA collected at BATS using the R package seacarb 195 (see Section 2.1). The #,;!< was calculated according to Equation 8: Where #,;!< is the atmospheric concentration of CO2, =;>1 is the barometric pressure at sea level and ? " 6 is the vapor pressure of water at the sea surface temperature and salinity (Zeebe and Wolf-Gladrow, 2001 barometric pressure by taking the daily mean of the 6-hourly FNMOC Sea Level Pressure data product across a 100 km 2 cell (https://data.noaa.gov/dataset/dataset/fnmoc-sea-level-pressure-360x180-6-hourly1).

Monte Carlo simulations
Given that the air-sea CO2 flux depends non-linearly on the sea surface temperature and salinity, the wind speed, barometric pressure, and the # in the atmosphere and the surface ocean, a Monte Carlo approach provides a straightforward means 210 of determining uncertainty. We followed the approach of (Fassbender et al., 2016) in which the independent input values were varied randomly in a Gaussian distribution about the given time series values ±3 9 , where 9 is the standard uncertainty of the input variable. This standard uncertainty was taken as either the measurement uncertainty or the combined standard uncertainty that accounted for measurement uncertainty and uncertainty associated with derived values, such as # from DIC and TA, and the imputed DIC values. The calculations of the air-sea flux were performed n = 1000 iterations using these 215 input values and the overall combined uncertainty was taken as 9,! = 3 ! , where ! is the standard deviation of the outputs for each point in the time series.

Seasonal and interannual variability across sites
Box and whisker plots (Fig. 2) show the seasonal climatology and interannual variability for DIC across the sites tested. The 220 bar plots in Fig. 2 show the seasonal amplitude, which was taken as the difference between maxima and minima of the climatological monthly means, and the interannual variability, which was taken as the standard deviation of the monthly means.
The amplitude of the seasonal cycle of DIC spanned 11 -90 µmol across sites, while interannual variability ranged from 8 -22 µmol. The seasonal cycles, including amplitude, timing and interannual variability illustrate diversity among the test sites so enabling robust assessment of the empirical MLR model for surface layer DIC and other imputation methods. 225 Fig. 3 shows the performance of the MLR model to estimate DIC using the available time series data from each site (N = 897).

DIC estimation by MLR
The cross validated MLR exhibited an R 2 of 0.9352 with an RMSE of 12.04 µmol kg -1 , RRMSE of 0.59%, MAE of 8.76 µmol kg -1 and bias of 0.030 µmol kg -1 . The high R 2 and low error and bias indicate that the MLR model worked well for prediction of DIC from remotely sensed chlorophyll, temperature, and salinity across different ecosystems. The predictions and errors 230 for the data from each site are provided in Table 4, which includes the means of the model coefficients and their standard deviations for the N iterations of LOOCV per site.
The MLR performed best at PAPA with a RMSE of 4.85 µmol kg -1 . The greatest error was associated with the CARIACO and FOT coastal sites, however, most of the predicted values still fell within 1% of observed DIC. When the sites were separated 235 into oceanic (BATS, HOT, KEO, PAPA and MUNIDA) and coastal (CARIACO, FOT) categories, the RMSE was 8.75 µmol kg -1 and 19.97 µmol kg -1 respectively. When comparing the predictive accuracy of the MLR to the DIC variability at each site ( Fig. 4), the interannual variability is strongly correlated ((R) = 0.8532, p < .02) to the RMSE while the seasonal amplitude has no apparent impact ((R)= 0.0771, p > .8), meaning the error in the predictions is most strongly related to interannual variability at each site.
To assess the sensitivity of the MLR to the predictor variables, the model was adjusted by selectively removing predictor variables and refitting the model. The changes in RMSE per site due to the omission of a given variable are shown as an anomaly in the tile plot of Fig. 5. BATS exhibited the greatest sensitivity to chlorophyll relative to other sites; FOT, HOT and KEO were relatively more sensitive to the effect of salinity; and temperature omission had the greatest impact for CARIACO, 245 KEO, MUNIDA, and PAPA. The mean effects of variable omissions are given in Table 5, which indicates that collectively temperature had the greatest impact among the predictor variables on the predictive error. This was consistent with the expectations resulting from the correlation matrix provided in Table 2. The selective omission of predictor variables indicates that salinity contributes the most to the bias error although the bias error was low (<0.1) across all sites.

Performance of imputation methods 250
The results of the LOOCV (Fig. 6 and Table 6) indicated that each of the imputation models performed reasonably well with only 12% of all residuals exceeding 1% error and zero estimated DIC values exceeding 5% error. Differences in the performance of the models to fill singular gaps were generally minor. Table 6 shows the performance metrics for the cross validated models across all sites. Overall, the MICE and MLR models exhibited the highest R 2 and lowest error (MAE, RMSE and RRMSE), followed by Linear Interpolation, Mean Imputation, Stineman Interpolation, Kalman Filtering, Weighted 255 Moving Average and Spline Interpolation in order of increasing RMSE. MLR exhibited the least amount of bias, while Mean Imputation exhibited the greatest amount of bias.
There was considerable variability among the performance of each method across sites. The key in Fig. 7 indicates the RMSE normalized to the mean value for comparing the relative error across sites and methods. The individual cross-validated errors 260 for each imputation method per site are given as the numerical value in each tile of the figure. Fig. 7 provides further evidence that CARICO and FOT exhibit the greatest error overall, while BATS and HOT exhibit the lowest error.

Sampling sensitivity
Sampling sensitivity was assessed by the RMSE for randomized artificial gaps totaling 8.33%, 16.67%, 25.00%, 33.33%, 41.67%, 50.00%, and 66.67%. The randomized approach resulted in a mixture of sequential and non-sequential gaps, while 265 bootstrapping achieved equivalent representation of all months for each assessment. Fig. 8a shows boxplots of the RMSE for each imputation method as a function of percent of data missing at each site. Spline interpolation resulted in much greater error and necessitated separate scaling. There was significant variability in both the performance of different imputation methods within sites and within imputation methods across different sites. In general, mean imputation and MLR converge on a maximum error once data gaps reached 20-40%, whereas the error for other imputation models is positively correlated with 270 the percent of data missing. While the performance of the cross validated Kalman filtering model did not differ greatly from the other interpolation methods, Fig. 8A indicates it leads to a greater number of outliers overall, in particular at BATS, KEO and PAPA. Spline interpolation also resulted in a high number of outliers, with the most extreme error over other methods. series. The MICE model shows the highest level of sensitivity to the percent of data missing despite performing very well under the LOOCV and low numbers of data gaps.

Time series gaps
The imputed secondary time series synthesized with artificial gaps at bimonthly, and sequential 3-month and 6-month durations that were used for assessing imputation error on seasonality and annual sums are shown in Fig. 9a. The time series indicated 280 that most methods performed well to impute bimonthly data gaps. The positive bias error exhibited by mean imputation in the LOOCV analysis was evident in the bimonthly timeseries as well, and it had the highest RMSE at 7.90 µmol kg -1 . The MICE model, which outperformed other methods in the LOOCV analysis, including at BATS, had the second highest error at 7.10 µmol kg -1 for bimonthly imputation. However, mean imputation and MICE performed equally well, with the second lowest RMSE for the 3-month gap time series. One-way ANOVA indicated that the differences in imputation error between methods 285 for both the bimonthly and 3-month gap time series are not significant (Table 7). However, it is notable that the MLR model exhibited an RMSE that was less than half of the other methods at 3.10 µmol kg -1 . The 3-month gap length does not result in significant divergence from seasonal variability when using linear, spline and Stineman interpolation and also Kalman filtering and weighted moving average.

290
There was however greater variability among imputation errors across methods for the 6-month gap time series (Fig. 9b). This was further evidenced by the 1-way ANOVA, with a p-value < 0.0001. The MLR model had the lowest imputation error at 5.02 µmol kg -1 followed by mean imputation and MICE with errors of 7.95 and 8.19 µmol kg -1 respectively. Spline interpolation exhibited the greatest error at 29.18 µmol kg -1 , while the remaining models had errors of approximately 20 µmol kg -1 ( Table 7). The 6-month gap length appears long enough to force linear, spline and Stineman interpolation as well as 295 Kalman filtering and weighted moving average models to diverge from the seasonal variability observed in the time series.
Spline interpolation dramatically overestimated the maximum DIC and seasonal signal in 1998 in the 6-month gap time series. Fig. 10 shows the net annual air-sea CO2 fluxes calculated form BATS observations collected 1998-2001 compared to the CO2 flux calculated from time series imputed using each of the models for the bimonthly, 3-month gap and 6-month gap secondary 300 time series. In general, the models tended to overestimate the flux when the observed sums were lower (1998)(1999) and underestimate the flux when the observed sums were higher (2000)(2001). The anomaly observed from the poor fit using the spline interpolation model for the 6-month gap time series resulted in a large positive flux which was opposite in sign but nearly equal in magnitude to the observations during 1998.

305
The absolute percent difference in the annual summation of air-sea CO2 flux using each imputation model for the bimonthly, 3-month gap and 6-month gap time series are shown in Fig. 11. The relative error of the MLR was similar to MICE and mean models for 6-month gaps but was consistently lowest across all other time series. However, consistent with the time series imputation, 1-way ANOVA indicates that only the mean errors for the 6-month gap time series are significantly different. At this gap duration, the MICE method, mean imputation and MLR model perform significantly better than other approaches. 310 performed nearly as well as MLR over 6-month gaps.

MLR estimation of DIC
The development of remote sensing and MLR-based approaches for carbonate chemistry have been used extensively for extrapolating over broad spatial and temporal scales to investigate regional to basin scale phenomena (Bostock et al., 320 2013;Hales et al., 2012;Evans et al., 2013;Lohrenz et al., 2018;Juranek et al., 2011;Alin et al., 2012). Remote sensing applications have focused primarily on predicting pCO2 and estimating air-sea flux in coastal waters to better understand the seasonal and spatial heterogeneity of carbon sources and sinks and their implications for regional and global carbon budgets Lohrenz et al., 2018). Many MLR models that predict carbonate parameters have been developed using large observational data sets that include either dissolved oxygen (O2) (Juranek et al., 2009;Kim et al., 2010;Alin et al., 325 2012;Bostock et al., 2013) or nitrate (NO3) (Evans et al., 2013) as a predictor variable along with temperature and salinity.
MLR models that incorporate O2 and NO3 can perform particularly well in coastal environments where ecosystem metabolism has a dominant effect the carbonate chemistry , Juranek et al., 2009. However, there are currently no remotely sensed O2 and NO3 data products and the chances of glider or float data being available at a given time series site to coincide with a gap in carbonate measurements are limited. The MLR model presented herein serves as a method for imputing missing 330 DIC values in time series using remotely sensed chlorophyll and temperature and in-situ salinity. Model-based estimates of salinity could be used when in-situ data are not available; however, the error associated with reanalysis salinity data products, Forecast System Reanalysis (CFSR), and the Bluelink Reanalysis (BRAN), would need to be assessed for a given location and included in the uncertainty budget (de Souza et al., 2020) 335 The variability in the MLR model coefficients indicated that the relationships between DIC, chlorophyll, temperature and salinity were location-specific and cannot be spatially extrapolated to different water masses and ecosystems. This was indicated by the variability seen among the correlations of predictor variables to DIC across sites and clearly evidenced by the differences in model performance between the coastal sites (FOT and CARIACO) and the oceanic sites. However, when the 340 MLR was trained with sufficient observations to capture the seasonal cycle, it could predict DIC with error that was far less than the natural variability over seasonal and interannual time scales and was typically on the order of, or better than the variability on monthly time scales. The RMSE of 4.85 -10.67 µmol kg -1 at the oceanic sites is consistent with other MLR studies which have ranged from ~4-11 µmol kg -1 (Evans et al., 2013;Juranek et al., 2011;Bostock et al., 2013), while the RMSE at coastal sites (FOT and CARIACO) of approximately 20 µmol kg -1 is larger than exhibited in a California Current study 345 . The Alin study, like others (Juranek et al., 2009;Juranek et al., 2011), estimated DIC based on O2 and density, incorporating a multiplicative relationship. While O2 may improve the performance of MLR approaches, particularly in biologically active coastal environments, the MLR model here only utilized remotely sensed chlorophyll and temperature and therefore only applied to the surface layer. O2 and CO2 may become decoupled in the surface layer due to varying time scales for air sea gas exchange, making O2 a less reliable predictor variable for surface concentrations of DIC (Juranek et al., 350 2011). Despite somewhat higher RMSE in coastal environments relative to the results of Alin et al. (2012), the MLR model here exhibited predictive error that is still less than 1% at such sites. We contend that this is an acceptable error in such environments for the purposes of gap filling and temporal extrapolation.

DIC time series imputation
Cross validation of the imputation models evaluated in this study indicated that each of these models have reasonably low 355 (typically <1%) error when imputing a single value at monthly timescales. This was similar to other comparative gap-filling studies that focused on higher temporal resolution data and imputing missing values over time scales from seconds to days (Moffat et al., 2007;Zhao et al., 2020;Demirhan and Renwick, 2018). For the assessment of annual budgets in the studies of (Zhao et al., 2020;Moffat et al., 2007), the error associated with the imputation methods was similar to the uncertainty in the fluxes across sites (Lavoie et al., 2015). As a result, the choice of imputation model yielded limited improvement on the 360 accuracy of budget estimates. By contrast, we found that the relative uncertainty for annual air-sea CO2 fluxes at BATS (3.5%) was much less than the percent error for the annual flux values associated with each imputation method (Fig. 11), indicating the importance of the choice of imputation model for a given data set. Evaluation of the BATS time series with synthetic gaps showed that selection of imputation method can have significant effects on the calculated timing, magnitude and structure of seasonal variability. 365 While all methods had similar performance imputing a bimonthly time series, which was consistent with the cross-validation analysis, at a sequential gap duration of 3 months the linear, spline, and Stineman interpolation, Kalman filtering and weighted moving average methods all began to affect the shape and timing of calculated seasonal DIC dynamics. Despite the mean errors and 1-way ANOVA indicating that there was no significant difference in the performance of these methods with gaps 370 of 3 months or less, it is clear visually that mean imputation, MICE and MLR did better in reproducing the timing of seasonal changes. This became even more apparent for the time series with 6-month gaps, where important seasonal dynamics were lost or misrepresented using the linear, spline, and Stineman interpolation, Kalman filtering and weighted moving average methods.

375
In addition to the importance of retaining seasonal dynamics, the timing of the gap within a given year had significant impact on the error in the annual sum of the CO2 flux. For example, in 1998, the 3-month gap was associated with the seasonal maximum and resulted in an underestimate of the DIC and an overestimate of CO2 flux. However, the 3-month gaps in 1999-2001 were associated with the shoulder seasons and did not have as large an impact on the flux error. As noted above, at the 6-month gap duration, the timing within the year had a greater impact on the over-and underestimation of DIC for most 380 methods, resulting in much greater error in the CO2 flux. The annual fluxes in Fig. 10 show the stability of mean imputation, MICE and MLR compared to other methods evaluated in this study. This was further supported by the fact that the mean absolute percent error for these three techniques increased less than with other methods, when the sequential gap length was increased (Fig. 11). This was also evidenced by the uncertainty of the CO2 fluxes that were determined from time series imputed using mean imputation, MICE and MLR remained much lower than other methods when gap durations increased (Fig. 12). 385 The stability of mean imputation, MICE and MLR was expected, because they are based on climatological and empirical relationships rather than the other statistical approaches evaluated here. The bootstrapping assessment of sampling sensitivity for each method provided additional insight into how the imputation methods performed at randomized data missingness rates.
Linear and Stineman interpolation, and weighted moving average had responses similar to each other in terms of the median 390 error and range of outliers in response to varied rates of missingness in the data, while spline interpolation produced a far greater range of outliers for all sites (over 5 times greater at FOT and CARIACO). This was also exhibited in the BATS time series assessment where the flexibility of the spline interpolation led to a tendency to overestimate seasonal maxima and minima, as observed in other comparative studies (North and Livingstone, 2013). Stineman interpolation performed better than basic spline interpolation by providing greater constraint, but no better than linear interpolation, despite the increased 395 flexibility. Interestingly, MICE performed very well at lower percentages of data missing and led to relatively low error in estimating the annual budget, yet it is highly sensitive to the percent of data missingness. However, outliers produced by MICE were constrained by the observational range because it is an empirical model. Outliers were most tightly constrained when using mean imputation and MLR given these empirical approaches are based on the climatology. This was shown in Fig. 8A which illustrated how error variability decreased with increasing percentages of data missing for mean imputation and MLR.
Though the sampling sensitivity for each imputation model varied across sites (Fig. 8B), MLR exhibited the lowest sensitivity and overall error and bias for imputing missing DIC data.
The results presented here indicate that care should be taken when considering what method to use to fill data gaps in ocean carbon time series, with criteria for selection including the percent of missing data, gap lengths and site characteristics. In 405 general, the empirical models performed better than statistical models evaluated in this study. Mean imputation provides a stable and straightforward approach to filling longer gaps but may lead to higher bias in annual budgets, possibly impacting the interpretation of interannual variability and long-term trends. MICE appeared to be well suited to environmental time series data that have covariate parameters such as the correlation between DIC, chlorophyll, temperature and salinity. This could be extended to other nutrients such as phosphate and nitrate as well as dissolved oxygen in order to train the models used in 410 MICE. MICE also offers the opportunity to impute data gaps over multiple variables in larger time series data sets. Our MLR model provides a stable option that performs well over all rates of data missingness once it is sufficiently trained with field data. This MLR approach also provides the benefit of utilizing remotely sensed and modelled data products in the absence of covariate field data. This MLR model consistently performed with lower error and uncertainty than other models in this study and should be considered when assessing best practices for imputing ocean carbon time series. 415

Conclusions
This study provides the first comparative assessment of several gap-filling methods that may be applied to ocean carbon time series. Regularized carbonate time series data are necessary for understanding seasonal dynamics, annual budgets, interannual variability and long-term trends in the ocean carbon cycle and changes to the ocean carbon sink, which are of particular importance in the face of global climate change. Our assessment indicates that the amount and distribution of gaps in the data 420 should be a determining factor in choosing an imputation method that optimizes uncertainty. Imputed values, however, cannot be treated as measurements and the uncertainty of imputation methods must be included in the overall uncertainty budget of broader ocean carbon analyses. The results presented above indicate the performance and behavior of select empirical and statistical approaches and the methods used provide a simple approach for estimating uncertainty using the RMSE of DIC predicted by a given imputation method. 425 This study provides evidence that DIC can be estimated with an empirical MLR approach that uses remotely sensed chlorophyll and temperature, and in-situ salinity. This method performs consistently well across 7 disparate ecosystems in oceanic and coastal environments, but the model coefficients are unique to the water mass and ecosystem and further study is needed to assess the spatial extent over which regional extrapolation is still valid. However, when using this method to impute data gaps 430 in carbonate time series, it performs better than several options, particularly for larger gaps. We conclude that when trained with sufficient field data (e.g., captures the seasonal cycle and some interannual variability), this empirical MLR method accurately predicts DIC from remotely sensed data and provides the most robust option for imputing gaps over a variety of data gap scenarios.

Acknowledgments, Samples, and Data 435
The authors thank the following for long-term contributions to ocean carbon time series and access to high quality data: the Institute for Marine Remote Sensing team for making data from the Cariaco Basin publicly available; the members of the   Table 1 for additional information about each sampling site.          1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2004, 2005, 2007, 200817 HOT 1998, 2004