Articles | Volume 23, issue 3
https://doi.org/10.5194/bg-23-967-2026
Research article | 03 Feb 2026

Reconstruction and spatiotemporal analysis of global surface ocean pCO2 considering sea area characteristics

Huisheng Wu, Yunlong Ji, Lejie Wang, Xiaoke Liu, Wenliang Zhou, Long Cui, Yang Wang, Min Liu, and Zhuang Li
Abstract

The partial pressure of carbon dioxide (pCO2) on the surface of the ocean is crucial for quantifying and evaluating the ocean carbon budget. Insufficient consideration of the effects at the sea area scale makes it difficult to comprehensively evaluate the spatiotemporal distribution characteristics and variation patterns of pCO2. This study constructed a pCO2 evaluation dataset based on LDEO measurement data and multi-source data. After conducting correlation testing on a global, far sea, and near sea scale, an ocean surface pCO2 evaluation model was constructed using multiple linear regression, convolutional neural network, gated recurrent unit, long short-term memory network, generalized additive model, extreme gradient boosting, least squares boosting, and random forest. Performance evaluation indicates that the random-forest model consistently achieves the best accuracy across all spatial scales, yielding a global RMSE of 6.123 µatm and an R2 of 0.986. In the open ocean, RMSE decreases to 4.699 µatm and R2 rises to 0.988, whereas in coastal waters RMSE increases to 8.044 µatm and R2 declines to 0.972. Based on this, the annual sea surface pCO2 distribution of 0.25° × 0.25° from 2000 to 2019 was reconstructed. The reconstructed field shows a typical equatorial high/polar low pattern, as well as an overall upward trend consistent with independent observations, with acceleration particularly evident in specific regions of subtropical coastal oceans.

1 Introduction

The partial pressure of carbon dioxide on the surface of the ocean (pCO2) is an important indicator for measuring the exchange of CO2 between the ocean and the atmosphere, and can evaluate the contribution of the ocean's carbon absorption and storage capacity to the global carbon cycle (Falkowski et al., 2000).

Numerous studies have estimated pCO2 and reconstructed its distribution by combining satellite remote sensing data with machine learning algorithms. For regional seas, Telszewski et al. (2009) reconstructed the distribution of pCO2 in the North Atlantic using self-organizing neural networks, and Landschützer et al. (2013) mapped Atlantic sea surface pCO2 using a combined self-organizing map and feedforward neural network method. Chierici et al. (2012) evaluated the feasibility of jointly estimating sea surface pCO2 in Antarctica and the Pacific region using shipborne measurements and remote sensing data. Nakaoka et al. (2013) established a nonlinear relationship between sea surface pCO2 and multiple parameters based on self-organizing neural networks, and reconstructed the spatiotemporal variation of sea surface pCO2 in the North Pacific. Marrec et al. (2015) used multiple linear regression to estimate the sea surface pCO2 in the waters of the Northwest European continental shelf. Gregor et al. (2019) used methods such as support vector regression and random forest regression to reconstruct the Southern Ocean surface pCO2, and Wang et al. (2021) reconstructed the distribution of pCO2 on the surface of the Southern Ocean using correlation analysis and feedforward neural networks. Lohrenz et al. (2018) reconstructed the sea surface pCO2 in the northern Gulf of Mexico using a regression tree algorithm. Chen et al. (2019) compared the performance of various methods in estimating surface pCO2 in the Gulf of Mexico, and Fu et al. (2020) applied Cubist models to estimate pCO2 on the surface of the Gulf of Mexico. Zhang et al. (2021) constructed a sea surface pCO2 regression model for the Baltic Sea region. At the global scale, Landschützer et al. (2014) reconstructed the pCO2 distribution from 1998 to 2011 and later extended it to 1982 to 2011 (Landschützer et al., 2014, 2016). Gregor et al. (2017) reconstructed the pCO2 distribution using various nonlinear regression methods. Zhong et al. (2022) used a generalized regression neural network and a stepwise regression algorithm to construct the pCO2 distribution map and, by combining the stepwise regression algorithm with a feedforward neural network, constructed a 1° × 1° pCO2 distribution map from 1992 to 2019 for the 11 biogeochemical provinces defined by the self-organizing map method.

Previous research on sea surface pCO2 reconstruction has three key limitations:

  1. Insufficient Consideration of Spatial Heterogeneity. Most existing studies either focus on a single local sea area (e.g., the North Atlantic, Gulf of Mexico, Baltic Sea) or adopt a unified global modeling framework, neglecting the significant differences in environmental conditions, driving factors, and pCO2 variation characteristics between far sea areas and near sea areas.

    To address this issue, our study constructs a multi-scale analysis framework covering the global ocean, far sea areas (water depth > 200 m), and near sea areas (water depth ≤ 200 m), and establishes a scale-specific pCO2 evaluation model for each. For the environmentally stable far sea areas, we emphasize capturing long-term temporal dependencies and signals of large-scale hydrological and biological processes. For near sea areas affected by various complex factors, we incorporate region-specific driving factors and optimize the model structure to adapt to high variability. This targeted approach improves the fitting accuracy and adaptability of the models in different sea area types.

  2. Inadequate Adaptability Between Models and Driving Factors. Existing studies mostly adopt fixed model structures or globally unified combinations of driving factors, failing to fully consider the requirements of environmental complexity differences in different sea areas for model adaptability. Additionally, the selection of driving factors lacks targeting, making it difficult for the models to accurately capture the core impact mechanisms of pCO2 in different regions.

    We resolve this limitation through the comprehensive optimization of models and driving factors: we compared eight machine learning models and identified the Random Forest (RF) model as the optimal model across all scales. Its advantage in capturing complex nonlinear relationships enables it to adapt to the environmental characteristics of different sea areas. Meanwhile, based on Spearman correlation analysis and the SHAP (SHapley Additive exPlanations) method, we screened key driving factors for each scale (e.g., Total alkalinity in sea water (talk) serves as the secondary key factor at the global scale, while the contribution rate of mole concentration of dissolved molecular oxygen in sea water (O2) significantly increases in near sea areas), ensuring the rationality and targeting of driving factor selection.

  3. Low Reconstruction Resolution. Some existing studies lack the overall processing of spatiotemporal differences in multi-source data, resulting in low spatial resolution of pCO2 reconstruction products (mostly 1° × 1° or coarser), which makes it difficult to accurately reflect the spatiotemporal variation characteristics of pCO2 within small scales.

    We address this limitation through high-resolution and high-precision reconstruction strategies: by processing multi-source data (including strict data matching, outlier handling, and data balancing strategies), we reconstructed the annual pCO2 distribution with a high resolution of 0.25° × 0.25° from 2000 to 2019. The results demonstrate that the accuracy of pCO2 reconstruction is significantly improved compared with existing studies.

2 Methodology

2.1 Research Area

The study area is the global ocean, excluding the perennially ice-covered waters in the core of the Arctic Ocean and the permanently frozen areas around the Antarctic continent. It has a total area of 336 million square kilometers, accounting for approximately 92.8 % of the global ocean surface area. This research focuses on the 0–10 m surface layer, which is a critical interface for air-sea exchange. Because water body types are complex, sea surface pCO2 is influenced by various factors. The global ocean was therefore divided by water depth: areas beyond the continental shelf (water depth > 200 m) were identified as far sea areas, and areas on the shelf (water depth ≤ 200 m) as near sea areas.

2.2 Data sources

2.2.1 Actual measurement data

The measured pCO2 data are sourced from the Global Surface pCO2 (LDEO) Database V2019 (OCADS – Global Surface pCO2 (LDEO) Database (noaa.gov)). This dataset contains 14.2 million measurements collected in the global ocean from 1957 to 2019 using equilibrator-CO2 analyzer systems, and provides several types of sea surface pCO2 values. This study selected ocean surface pCO2 values reported at the in-situ measurement temperature from 2000 to 2019, which reflect the pCO2 level at the time of measurement.

2.2.2 Other data

A total of 25 potential influencing factors were selected for the study (Table 1), and their abbreviations are used for convenience. These data are divided into three types of sources: in-situ observations, satellite observations, and numerical models, with good spatiotemporal resolution and coverage, providing reliable data sources for research.

Table 1. Specific information about influencing factors (sorted by resolution and name).


2.3 Data Processing

2.3.1 Data Matching

To reduce the impact of spatial and temporal resolution differences in multi-source data, we adopted a dual matching strategy to process pCO2 measured data and potential influencing factors. In the temporal dimension, influencing variables were first aligned with the in-situ pCO2 observations; temporal gaps were subsequently infilled via nearest-time interpolation to ensure chronological consistency. In the spatial dimension, data points were aligned through precise geographic coordinate matching algorithms, and nearest neighbor interpolation was used to supplement missing points to improve spatial accuracy. After matching, each point contains the measured value of pCO2, environmental variables, and corresponding spatiotemporal information (year, month, lat, lon).
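The nearest-time and nearest-neighbor alignment described above can be illustrated with a minimal sketch. The grid layout (`field[t][i][j]`), the `month_index` key, and the function names are assumptions made for illustration only; the matching algorithm actually used in the study is not specified in the text, and longitude wrap-around is not handled here.

```python
from bisect import bisect_left


def nearest_index(sorted_axis, value):
    """Return the index of the entry in an ascending axis closest to value."""
    pos = bisect_left(sorted_axis, value)
    if pos == 0:
        return 0
    if pos == len(sorted_axis):
        return len(sorted_axis) - 1
    before, after = sorted_axis[pos - 1], sorted_axis[pos]
    # Ties go to the lower index.
    return pos - 1 if value - before <= after - value else pos


def match_observation(obs, lats, lons, months, field):
    """Attach the nearest gridded driver value to one pCO2 observation.

    field[t][i][j] is a hypothetical driver grid indexed by month, lat, lon.
    """
    t = nearest_index(months, obs["month_index"])
    i = nearest_index(lats, obs["lat"])
    j = nearest_index(lons, obs["lon"])
    return {**obs, "driver": field[t][i][j]}


# Toy grid: 2 months x 3 latitudes x 2 longitudes.
lats = [-0.25, 0.0, 0.25]
lons = [10.0, 10.25]
months = [0, 1]
field = [[[1, 2], [3, 4], [5, 6]], [[7, 8], [9, 10], [11, 12]]]
obs = {"lat": 0.1, "lon": 10.2, "month_index": 1, "pco2": 380.0}
matched = match_observation(obs, lats, lons, months, field)
```

Each matched record keeps the original observation fields and gains one driver value; in practice the same lookup would be repeated for every influencing factor.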

2.3.2 Analysis of Outliers

Quality control was applied to the matched data by first removing the missing values generated during matching. Following data statistics and previous research experience (Wu et al., 2024), measured pCO2 values below 200 µatm or above 600 µatm were flagged as outliers. These outliers are concentrated mainly in coastal areas, reflecting the variability caused by land-sea interaction, and they are valuable samples for the study of pCO2. A cruise-by-cruise comparison showed that many outliers were consistent with the rest of their cruise track, indicating that they were caused by environmental changes rather than measurement errors. Valid outliers were therefore retained, and only obvious measurement errors were removed. For the environmental variables, abnormal values were identified and removed using the 3σ criterion (μ ± 3σ).
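As a rough illustration of the two screening rules above, the sketch below flags pCO2 values outside 200–600 µatm (flagging only, since valid outliers are retained) and applies the μ ± 3σ criterion to one environmental variable. The function names and the use of the population standard deviation are assumptions; the text does not say which estimator was used.

```python
from statistics import mean, pstdev


def is_pco2_outlier(value, low=200.0, high=600.0):
    """Flag a pCO2 value (µatm) outside the 200-600 range for manual review."""
    return value < low or value > high


def three_sigma_mask(values):
    """Return True for values that lie within mu ± 3*sigma, False otherwise."""
    mu = mean(values)
    sigma = pstdev(values)  # assumption: population standard deviation
    return [abs(v - mu) <= 3 * sigma for v in values]


# Toy example: twenty typical readings and one extreme reading.
readings = [10.0] * 20 + [100.0]
mask = three_sigma_mask(readings)
cleaned = [v for v, keep in zip(readings, mask) if keep]
```

Note that with very small samples the 3σ rule cannot flag anything, because a single point cannot lie more than about (n − 1)/√n standard deviations from the mean.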

2.3.3 Data Balancing

The processed global ocean data were divided into far sea and near sea datasets (Fig. 1a, b, c). Statistical analysis shows that the data are unevenly distributed in space and time. Therefore, a 0.25° × 0.25° grid was used for spatial binning and monthly intervals for temporal binning, which together define a joint spatiotemporal bin. This bin size meets the accuracy requirements of the study and remains compatible with the spatiotemporal resolution of the multi-source data.

The arithmetic mean of the data within each bin was taken as its representative value, the spatial position was represented by the grid center point, and the time was calculated as a weighted average over the data points in the bin (Eq. 1). This method balances the data distribution while preserving accuracy.

$$ t_{\mathrm{avg}} = \frac{\sum_{i=1}^{n} w_i t_i}{\sum_{i=1}^{n} w_i} \tag{1} $$

$$ w_i = \Delta t_i \tag{2} $$

In the formula, $t_{\mathrm{avg}}$ is the weighted average time of the spatiotemporal bin, $n$ is the total number of data points in the bin, $w_i$ is the weight of the $i$th data point, $t_i$ is the time of the $i$th data point, and $\Delta t_i$ is the sampling time interval between the $i$th data point and the previous point. After data balancing, the dataset for this study was finally constructed, laying a solid data foundation for the construction of multi-scale models.
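The binning and weighted-time step can be sketched as follows. This is an illustrative reading of Eqs. (1) and (2), not the authors' code: the record layout (`lat`, `lon`, `year`, `month`, `t`, `pco2`) is hypothetical, and because the text does not define the interval for the first point in a bin, the sketch assumes it reuses the first available gap.

```python
import math
from collections import defaultdict


def weighted_time(times):
    """Weighted mean time of one bin, following Eqs. (1)-(2) with w_i = dt_i."""
    ts = sorted(times)
    if len(ts) == 1:
        return ts[0]
    gaps = [later - earlier for earlier, later in zip(ts, ts[1:])]
    # Assumption: the first point has no predecessor, so it reuses the first gap.
    weights = [gaps[0]] + gaps
    total = sum(weights)
    if total == 0:  # all samples share one timestamp
        return ts[0]
    return sum(w * t for w, t in zip(weights, ts)) / total


def balance(records, cell=0.25):
    """Collapse records into one value per (grid cell, year, month) bin."""
    groups = defaultdict(list)
    for r in records:
        key = (math.floor(r["lat"] / cell), math.floor(r["lon"] / cell),
               r["year"], r["month"])
        groups[key].append(r)
    balanced = []
    for (i, j, year, month), rows in groups.items():
        balanced.append({
            "lat": (i + 0.5) * cell,   # grid center point
            "lon": (j + 0.5) * cell,
            "year": year,
            "month": month,
            "t": weighted_time([r["t"] for r in rows]),
            "pco2": sum(r["pco2"] for r in rows) / len(rows),  # arithmetic mean
        })
    return balanced


records = [
    {"lat": 10.1, "lon": 20.3, "year": 2005, "month": 6, "t": 1.0, "pco2": 380.0},
    {"lat": 10.2, "lon": 20.4, "year": 2005, "month": 6, "t": 3.0, "pco2": 390.0},
    {"lat": -5.0, "lon": 100.0, "year": 2005, "month": 7, "t": 40.0, "pco2": 400.0},
]
binned = balance(records)
```

The first two records fall into the same 0.25° cell and month and are merged into one sample; the third stays on its own.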

https://bg.copernicus.org/articles/23/967/2026/bg-23-967-2026-f01

Figure 1. The spatiotemporal distribution of datasets at different scales. (a) Global spatial distribution of ocean data. (b) Spatial distribution of data in far sea areas. (c) Spatial distribution of data in near sea areas.

2.4 Spearman correlation analysis of pCO2 drivers

The potential influencing factors do not all follow a normal distribution, and their relationships with pCO2 are nonlinear. Selecting an appropriate correlation measure is therefore crucial. The Spearman correlation coefficient, which is based on ranks and does not assume normality or linearity, was used to quantify the correlation between variables (Eq. 3).

$$ \rho = 1 - \frac{6 \sum_{i=1}^{n} D_i^2}{n\left(n^2 - 1\right)} \tag{3} $$

In the formula, $\rho$ represents the correlation coefficient, $D_i$ represents the rank difference of the $i$th pair of values, and $n$ represents the sample size. The value of $\rho$ ranges from −1 to 1, where −1 indicates a complete negative correlation between the influencing factor and pCO2, 1 indicates a complete positive correlation, and 0 indicates no correlation.
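A small worked example may make Eq. (3) concrete. The sketch below ranks two series and applies the rank-difference formula directly; the function names are illustrative. The closed form is exact only when there are no tied values, so for real data with ties a library routine that computes the Pearson correlation of the ranks would be the safer choice.

```python
def ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        average_rank = (pos + end) / 2 + 1
        for k in order[pos:end + 1]:
            result[k] = average_rank
        pos = end + 1
    return result


def spearman_rho(x, y):
    """Spearman coefficient via Eq. (3); exact only in the absence of ties."""
    n = len(x)
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


x = [1, 2, 3, 4, 5]
rho_up = spearman_rho(x, [2, 4, 6, 8, 10])     # perfectly increasing
rho_down = spearman_rho(x, [10, 8, 6, 4, 2])   # perfectly decreasing
rho_mixed = spearman_rho(x, [1, 3, 2, 5, 4])   # two adjacent swaps
```

For the last pair the rank differences are 0, −1, 1, −1, 1, so the sum of squares is 4 and ρ = 1 − 24/120 = 0.8.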

2.5 Model selection

To evaluate the modeling ability of different algorithms for pCO2, we constructed eight comparative models at each research scale: multiple linear regression (MLR), convolutional neural network (CNN), gated recurrent unit (GRU), long short-term memory (LSTM), generalized additive model (GAM), extreme gradient boosting (XGBoost), least squares boosting (LSBoost), and random forest (RF). MLR serves as a baseline that linearly links temperature, salinity and nutrients to sea-surface pCO2. CNN extracts spatial features via convolution and pooling layers to produce fine-scale pCO2 distributions, while GRU and LSTM, with their update-reset gates and memory cells, capture long-term temporal dependencies of oceanic periodic changes on pCO2 for historical-to-future prediction. GAM relaxes the linearity assumption by modeling each predictor's additive nonlinear effect on pCO2. XGBoost and LSBoost iteratively optimize tree ensembles through gradient boosting or weighted residuals to uncover complex nonlinear relationships between high-dimensional features and pCO2. Finally, RF constructs and averages many decision trees on random feature subsets, delivering robust pCO2 estimates for large-scale ocean datasets.

2.6 Performance evaluation

The dataset at each research scale was randomly divided into training, validation, and testing sets in an 8:1:1 ratio. Model performance was assessed with five statistical metrics (Eqs. 4–8): Mean Absolute Error (MAE, µatm), the average absolute difference between predicted and in-situ pCO2, indicating the typical error magnitude; Mean Absolute Percentage Error (MAPE, %), the relative error scaled by the observed pCO2, enabling comparison across regions with contrasting background concentrations; Mean Squared Error (MSE, µatm2), the squared deviations averaged over all samples, emphasizing larger pCO2 discrepancies; Root Mean Squared Error (RMSE, µatm), the square root of MSE, providing a metric in the original pCO2 units that is sensitive to outliers; and the Coefficient of Determination (R2), the proportion of pCO2 variance explained by the model, with values approaching unity signifying high predictive skill.

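$$ \mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| \hat{y}_i - y_i \right| \tag{4} $$

$$ \mathrm{MAPE} = \frac{100\,\%}{n} \sum_{i=1}^{n} \left| \frac{\hat{y}_i - y_i}{y_i} \right| \tag{5} $$

$$ \mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( \hat{y}_i - y_i \right)^2 \tag{6} $$

$$ \mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( \hat{y}_i - y_i \right)^2} \tag{7} $$

$$ R^2 = 1 - \frac{\sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}{\sum_{i=1}^{n} \left( y_i - \bar{y} \right)^2} \tag{8} $$

In the formula, $n$ is the number of pCO2 observations; $y_i$ denotes the in-situ measured pCO2 (µatm) for the $i$th sample, $\hat{y}_i$ is the corresponding model-estimated pCO2, and $\bar{y}$ represents the mean of all measured pCO2 values.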
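The 8:1:1 split and the five metrics in Eqs. (4) to (8) can be written compactly as in the sketch below. The function names, the fixed random seed, and the integer rounding of the split sizes are assumptions made for illustration; the study does not describe these details.

```python
import math
import random


def split_811(n_samples, seed=0):
    """Shuffle sample indices and split them 8:1:1 into train/val/test."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_train = n_samples * 8 // 10
    n_val = n_samples // 10
    return (indices[:n_train],
            indices[n_train:n_train + n_val],
            indices[n_train + n_val:])


def evaluate(y_true, y_pred):
    """Return MAE, MAPE (%), MSE, RMSE and R2 as defined in Eqs. (4)-(8)."""
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    mse = sum(e * e for e in errors) / n
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return {
        "MAE": sum(abs(e) for e in errors) / n,
        "MAPE": 100 * sum(abs(e / t) for e, t in zip(errors, y_true)) / n,
        "MSE": mse,
        "RMSE": math.sqrt(mse),
        "R2": 1 - ss_res / ss_tot,
    }


train, val, test = split_811(100)
scores = evaluate([300.0, 400.0, 500.0], [310.0, 390.0, 500.0])
```

For the toy values the errors are 10, −10 and 0 µatm, giving MAE ≈ 6.67 µatm, MSE ≈ 66.67 µatm2, RMSE ≈ 8.16 µatm and R2 = 0.99.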

3 Results and discussion

3.1 Correlation detection

3.1.1 Interaction detection

Interaction detection between variables was conducted for the global ocean, far sea areas, and near sea areas (Fig. 2). The chlorophyll concentration and the volume attenuation coefficient of downwelling radiative flux have a ρ value of 1 at all research area scales, indicating that they are numerically collinear. However, they reflect marine biological activity and optical properties respectively, and together provide comprehensive information for fitting surface pCO2. The ρ value between the aragonite saturation state in sea water and aragonite in seawater is also 1, meaning that they vary together in the same direction and with the same magnitude. This usually stems from chemical equilibrium processes in seawater, where dissolution and precipitation are influenced by similar physical and chemical conditions. The correlation between sea water potential temperature and sea water temperature is extremely high, but their physical meanings differ: the former is the equivalent temperature after accounting for pressure, while the latter is the actual temperature. Together they capture temperature characteristics comprehensively and improve the accuracy of surface pCO2 evaluation.

https://bg.copernicus.org/articles/23/967/2026/bg-23-967-2026-f02

Figure 2. Results of interaction detection between variables at different research area scales. (a) Global ocean interaction detection results. (b) Interaction detection results in far sea areas. (c) Interaction detection results in near sea areas.


https://bg.copernicus.org/articles/23/967/2026/bg-23-967-2026-f03

Figure 3. Single factor detection results at different research area scales. (a) Global ocean single factor detection results. (b) Far sea single factor detection results. (c) Near sea single factor detection results.


https://bg.copernicus.org/articles/23/967/2026/bg-23-967-2026-f04-part01

Figure 4. Model performance at the global ocean scale: (a) MLR, (b) CNN, (c) GRU, (d) LSTM, (e) GAM, (f) XGBoost, (g) LSBoost, (h) RF.


3.1.2 Single factor detection

The correlation between surface pCO2 and each influencing factor was analyzed (Fig. 3). The results indicate that at all regional scales there is a significant negative correlation between pCO2 and pH: the more acidic the seawater, the higher the surface pCO2, and the more alkaline, the lower the surface pCO2. At the same time, surface pCO2 is significantly positively correlated with temperature. In far sea areas, the negative correlation between pCO2 and both chlorophyll concentration and the diffuse attenuation coefficient is more significant, indicating that these factors regulate pCO2 more stably there. In near sea areas these correlations are weaker because of land-based pollution, human activities, and environmental changes, but the negative correlation between pCO2 and pH is stronger. When selecting variables, the study retained factors with a ρ value greater than 0.1 or less than −0.1 to ensure the validity of the results and improve model performance (Table 2). Additionally, the SHAP method was used to quantitatively evaluate the contributions of the influencing factors to surface pCO2 (Ge et al., 2022a, b). The contributions differ between scales. The pH is the core driving factor at all scales, but its contribution follows the pattern "far sea areas > global oceans > near sea areas"; the contributions of other factors show significant regional heterogeneity. For example, talk is the second key factor at the global ocean scale, while the contribution of O2 increases significantly in near sea areas, making ar a region-specific factor.
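The variable screening rule can be illustrated with a short sketch. It assumes the threshold applies to the Spearman coefficient (factors are kept when ρ > 0.1 or ρ < −0.1), which is one reading of the text; the coefficient values below are made up for the example, and `sst` and `wind` are hypothetical factor names.

```python
def screen_factors(rho_by_factor, threshold=0.1):
    """Keep factors whose Spearman coefficient with pCO2 exceeds the threshold
    in absolute value (assumed reading of the screening rule)."""
    return [name for name, rho in rho_by_factor.items() if abs(rho) > threshold]


# Hypothetical coefficients, not values from the study.
example_rho = {"ph": -0.82, "talk": 0.35, "o2": -0.15, "sst": 0.41, "wind": 0.04}
selected = screen_factors(example_rho)
```

Applying the rule separately to the global, far sea, and near sea datasets would yield a different factor list per scale, which is the kind of result summarized in Table 2.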

3.2 Model construction and evaluation

3.2.1 Construction and evaluation of global ocean surface pCO2 model

Based on the correlation analysis above, this study selected key driving factors to construct and evaluate a global sea surface pCO2 reconstruction model. Owing to the large amount of data, a random subset of the fitting results is shown to illustrate performance against observations. The models differ markedly in performance at the global ocean scale (Fig. 4). The values fitted by MLR, CNN, and GRU deviate substantially from the true values, especially in the low (< 300 µatm) and high (> 500 µatm) ranges (Table 3). These deviations arise because the models capture nonlinear relationships in complex marine environments poorly, handle extreme values badly, and have structures that cannot adapt to complex data features. LSTM and GAM also have relatively large errors, indicating deficiencies in capturing the characteristics of surface pCO2 changes; their fitting ability decreases significantly when surface pCO2 fluctuates strongly. XGBoost and LSBoost perform considerably better, with MAE reduced to 15–18 µatm, RMSE reduced to 25–30 µatm, and R2 exceeding 0.7. Their ability to represent multivariate nonlinear relationships and their ensemble strategies improve accuracy within the normal range (300–500 µatm), but their handling of extreme values still needs improvement. RF performs best among all models, with MAE below 4 µatm, RMSE around 6 µatm, and R2 above 0.9. It fits accurately not only in the 300–500 µatm range but also in the low and high ranges. Its good adaptability to high-dimensional data and large sample sizes makes it well suited to fitting tasks in complex marine environments.

3.2.2 Construction and evaluation of surface pCO2 model in far sea areas

The far sea environment is relatively stable, and model performance improves accordingly (Table 4). The bias of MLR, CNN, and GRU decreases, with MAE of 14–15 µatm, RMSE above 26 µatm, and R2 around 0.6. For LSTM and GAM, MAE is around 14 µatm, RMSE is above 25 µatm, and R2 is around 0.64. The performance of these two models in the extreme value ranges has improved, because LSTM can process time series data and capture the dynamic characteristics of surface pCO2 over time, while GAM fits the relationship between surface pCO2 and the influencing factors with a nonlinear additive model. XGBoost and LSBoost perform even better in far sea areas, with particularly high fitting accuracy in the 300–500 µatm range, MAE of 11–13 µatm, RMSE below 23 µatm, and R2 around 0.8. RF again performs best in far sea areas, relying on its strong generalization ability and feature selection mechanism to handle the variability of marine environments.

Table 2. Selection results of influencing factors at different area scales.

Note: The asterisk * in Mlost is used only to visually distinguish it from the separately sourced Mlost variable listed below. It does not denote a special property or uncertainty.


Table 3. Performance parameters of different models in the global ocean.


Table 4. Performance parameters of different models in the far sea areas.


3.2.3 Construction and evaluation of surface pCO2 model in near sea areas

Owing to various complex factors, the spatiotemporal distribution of surface pCO2 in near sea areas is highly variable, which lowers the performance of the constructed surface pCO2 models. Table 5 shows that MLR, CNN, and GRU have limitations in handling complex nonlinear relationships. In the low and high value ranges, the MAE of these three models exceeds 34 µatm, the RMSE exceeds 62 µatm, and R2 is below 0.5. LSTM, through its gating mechanism, and GAM, by constructing a nonlinear additive model, improve the fitting ability to a certain extent. Their MAE is 33–34 µatm, their RMSE is 56–58 µatm, and their R2 is 0.55–0.60, but deviations remain in the extreme value ranges. XGBoost and LSBoost improve the fitting of extreme values by combining the results of multiple weak learners. The MAE of both models decreases to around 23–27 µatm, the RMSE is around 35–42 µatm, and R2 increases to 0.75–0.85. RF, which constructs multiple decision trees and integrates their results, adapts to the variability of the near sea environment and shows robust fitting performance. Its MAE is below 5 µatm, its RMSE is about 8 µatm, and its R2 remains above 0.95, significantly outperforming the other models.

Table 5. Performance parameters of different models in the near sea areas.


3.3 Independent validation of the model

The surface pCO2 models were independently validated at each regional scale using data that were not involved in model construction. The fitted values were compared with the true values to evaluate the applicability and accuracy of the models in complex marine environments. Scatter plots were drawn with true values on the x-axis and fitted values on the y-axis, with colors representing kernel density to show how the points are distributed. At the global ocean scale (Fig. 5), the scatter distributions of MLR, CNN, GRU, LSTM, and GAM form large ellipses, and the fitted values deviate significantly from the true values, especially near the extreme values of sea surface pCO2. The scatter distributions of XGBoost and LSBoost are narrower. The RF model fits best: its scatter distribution converges clearly and is concentrated along the y = x line, with no marked errors in the extreme value regions, indicating that its fitted values are consistent with the true values and that the model is stable.

https://bg.copernicus.org/articles/23/967/2026/bg-23-967-2026-f05

Figure 5. Independent verification performance of the models in the global ocean; right axis: normalized probability density of model residuals. (a) MLR, (b) CNN, (c) GRU, (d) LSTM, (e) GAM, (f) XGBoost, (g) LSBoost, (h) RF.


In far sea areas (Fig. 6), the scatter points of the MLR, CNN, GRU, LSTM, GAM, and XGBoost models form ellipses that diverge at both ends, indicating their limitations in dealing with extreme fluctuations of surface pCO2. The scatter ellipse of the LSBoost model is considerably narrower and diverges less at extreme values, reflecting improved fitting accuracy. The scatter distribution of the RF model is a flat ellipse, with the smallest difference between fitted and true values and the smallest errors at the extremes.

https://bg.copernicus.org/articles/23/967/2026/bg-23-967-2026-f06

Figure 6. Independent verification performance of the models in the far sea areas; right axis: normalized probability density of model residuals. (a) MLR, (b) CNN, (c) GRU, (d) LSTM, (e) GAM, (f) XGBoost, (g) LSBoost, (h) RF.


In the independent validation for near sea areas, the models performed differently (Fig. 7). The scatter of MLR, CNN, GRU, and LSTM is irregular, with large differences between fitted and true values and severe divergence in the high-value range. This is due to the high variability of near sea areas, which these models cope with poorly. The scatter distributions of GAM and XGBoost begin to take an elliptical shape, showing some adaptability to complex environments. The scatter distribution of LSBoost is clearly elliptical, indicating more stable fitting. The RF model improves markedly on the others, with an overall convergent scatter distribution and no significant divergence in either the low or the high value range. It reduces extreme errors and reconstructs surface pCO2 with high accuracy in complex near sea environments.

https://bg.copernicus.org/articles/23/967/2026/bg-23-967-2026-f07

Figure 7. Independent verification performance of the models in the near sea areas; right axis: normalized probability density of model residuals. (a) MLR, (b) CNN, (c) GRU, (d) LSTM, (e) GAM, (f) XGBoost, (g) LSBoost, (h) RF.


3.4 Reconstruction of surface pCO2

The multi-source data were input into the RF model constructed for each area scale. The values of the influencing factors were extracted from the multi-source data grid cell by grid cell and used to fit the surface pCO2 value of the corresponding cell. If any input is missing in a grid cell, the surface pCO2 value at that location is output as a blank value, which ensures that the reconstruction is based entirely on the original data. Blank values arise mainly from the systematic exclusion of land pixels and from the limits of data acquisition in high-latitude sea areas: land pixels are excluded because they do not take part in ocean processes, while in high-latitude areas sea ice cover or insufficient light leaves key satellite parameters unavailable, so that values cannot be reconstructed there. The final product is a set of surface pCO2 distribution maps for the years 2000–2019 at 0.25° × 0.25°.
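The cell-by-cell reconstruction with blank values for incomplete inputs can be sketched as follows. Here `predict` is a hypothetical stand-in for the trained RF model, and the dictionary layout keyed by grid-cell center is an assumption made for the example.

```python
import math


def reconstruct(grid_features, predict):
    """Predict pCO2 for every grid cell; cells with any missing driver stay blank.

    grid_features maps (lat, lon) to a list of driver values, where a missing
    value is None or NaN. predict is any callable that maps a complete feature
    list to a pCO2 estimate (a stand-in for the trained model).
    """
    field = {}
    for cell, features in grid_features.items():
        missing = any(v is None or math.isnan(v) for v in features)
        field[cell] = math.nan if missing else predict(features)
    return field


def toy_model(features):
    # Placeholder model: NOT the RF model from the study.
    return 300.0 + sum(features)


grid = {
    (0.125, 10.125): [20.0, 35.0],          # complete inputs
    (0.125, 10.375): [21.0, None],          # missing driver -> blank
    (89.875, 10.125): [math.nan, 34.0],     # e.g. ice-covered cell -> blank
}
pco2_field = reconstruct(grid, toy_model)
```

Leaving incomplete cells blank instead of imputing them keeps every reconstructed value traceable to observed inputs, at the cost of gaps over land and in ice-covered or poorly lit high-latitude waters.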

The reconstructed surface pCO2 at the global ocean scale is consistent with the distribution of the LDEO observations, confirming that the RF model captures the spatial distribution pattern of global ocean surface pCO2. The reconstruction (Fig. 8) shows that the spatial distribution of surface pCO2 depends clearly on latitude, with a pattern of "high at the equator and low at the poles". Independent observations along cruise tracks were compared with the reconstruction closest in time to their collection. The global ocean reconstruction has an MAE of 11.067 µatm, an MAPE of 0.037 (3.7 %), an MSE of 396.060 µatm2, an RMSE of 19.901 µatm, and an R2 of 0.816. This indicates that the deviation between the reconstruction and the observations is small, and that the reconstruction accurately reflects the average distribution of surface pCO2.

https://bg.copernicus.org/articles/23/967/2026/bg-23-967-2026-f08

Figure 8. Surface ocean pCO2 products in the global ocean.

https://bg.copernicus.org/articles/23/967/2026/bg-23-967-2026-f09

Figure 9. Comparison of surface ocean pCO2 products from different studies: (a) Zhong et al. (2022) product; (b) Copernicus global ocean surface carbon product.

The products of existing surface pCO2 reconstructions (Fig. 9) are highly consistent with our results in their spatial patterns (Chau et al., 2022, 2024; Zhong et al., 2022). Although these studies used different data sources, models, or methods, they lead to similar conclusions about the overall distribution of pCO2 on the global ocean surface, which to some extent supports the reliability and accuracy of our reconstruction. By using high-resolution data and RF models, this study resolves more detail, especially in the high-latitude marginal seas near both poles.

The reconstruction results for the far sea region show that surface pCO2 is higher in the equatorial low-latitude region and lower in the polar high-latitude region (Fig. 10). We evaluated the difference in fitting accuracy between the far sea regional model and the global ocean model in the far sea areas by comparing independent route-based observations with the reconstructed results of the two models. The MAE of the far sea model was 9.060 µatm, the MAPE was 0.027, the MSE was 269.511 µatm2, the RMSE was 16.417 µatm, and R2 was 0.826; the MAE of the global model was 9.125 µatm, the MAPE was 0.027, the MSE was 275.582 µatm2, the RMSE was 16.601 µatm, and R2 was 0.822. The far sea model is thus slightly more accurate than the global ocean model in the far sea areas (Fig. 11), indicating that regional optimization improves the reconstruction accuracy. However, the global ocean model still provides accurate surface pCO2 fitting in the far sea area by adapting to the overall ocean environment.
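Comparing route-based observations with a gridded reconstruction requires pairing each observation with a grid cell and a reconstruction time. The sketch below shows one way to do this for annual 0.25° fields. The grid layout (row 0 starting at 90°S, column 0 starting at 180°W), the array shapes, and all function names are assumptions made for illustration; the study's own matching procedure is not described in this detail.

```python
import numpy as np

RES = 0.25                                   # grid spacing in degrees (from the text)
N_LAT, N_LON = int(180 / RES), int(360 / RES)

def cell_index(lat, lon):
    """Row/column of the 0.25-degree cell containing each point.

    Assumed layout: row 0 starts at 90°S, column 0 starts at 180°W.
    """
    lat = np.asarray(lat, dtype=float)
    lon = np.asarray(lon, dtype=float)
    row = np.clip(np.floor((lat + 90.0) / RES).astype(int), 0, N_LAT - 1)
    col = np.floor(((lon + 180.0) % 360.0) / RES).astype(int) % N_LON
    return row, col

def sample_reconstruction(fields, field_years, lat, lon, obs_year):
    """For each observation, read the field of the nearest year at its cell.

    fields has shape (n_years, N_LAT, N_LON); obs_year may be fractional.
    """
    field_years = np.asarray(field_years, dtype=float)
    obs_year = np.asarray(obs_year, dtype=float)
    t = np.abs(field_years[None, :] - obs_year[:, None]).argmin(axis=1)
    row, col = cell_index(lat, lon)
    return fields[t, row, col]

# Synthetic demonstration: every annual field is filled with its own year,
# so the sampled value reveals which year was matched.
years = np.arange(2000, 2020)
fields = np.broadcast_to(
    years[:, None, None].astype(float), (years.size, N_LAT, N_LON)
)
values = sample_reconstruction(
    fields, years, [21.3, -60.0], [-158.0, 179.9], [2004.6, 1998.0]
)
rows, cols = cell_index([21.3, -60.0], [-158.0, 179.9])
```

Observations that fall outside the reconstruction period are matched to the first or last available year in this sketch; a stricter comparison might discard them instead.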

To verify the accuracy of the model's time series reconstruction, the temporal changes in the Hawaii Ocean Time-series (HOT) observations were compared with the reconstruction results for the global ocean and far sea areas (Fig. 12). The temporal trends at both scales are consistent with the measurements at the Hawaii station. This shows that the model fits the dynamic changes of the time series well and reflects the temporal evolution of surface pCO2.
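A simple way to quantify the agreement in temporal trend between a station record and a reconstruction is to compare their least-squares slopes and their correlation. The sketch below uses synthetic placeholder series, not HOT measurements or output of the models described here, and the function name is hypothetical.

```python
import numpy as np

def annual_trend(years, values):
    """Least-squares slope (units of `values` per year), ignoring NaNs."""
    years = np.asarray(years, dtype=float)
    values = np.asarray(values, dtype=float)
    ok = np.isfinite(values)
    x = years[ok] - years[ok].mean()      # centre years for numerical stability
    slope, _ = np.polyfit(x, values[ok], 1)
    return slope

years = np.arange(2000, 2020)
# Synthetic placeholders in µatm, NOT station data or reconstruction output
station = 365.0 + 1.8 * (years - 2000)
recon = station + np.where(years % 2 == 0, 1.0, -1.0)

station_slope = annual_trend(years, station)
recon_slope = annual_trend(years, recon)
correlation = np.corrcoef(station, recon)[0, 1]
```

Similar slopes and a high correlation would support the statement that the reconstruction follows the observed temporal evolution; a full assessment would also consider seasonal variability and uncertainty in the fitted slopes.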

https://bg.copernicus.org/articles/23/967/2026/bg-23-967-2026-f10

Figure 10. Surface ocean pCO2 products in the far sea areas.

https://bg.copernicus.org/articles/23/967/2026/bg-23-967-2026-f11

Figure 11. Comparison of reconstruction accuracy in the far sea areas using different scale models. Right axis: normalized probability density of model residuals.

https://bg.copernicus.org/articles/23/967/2026/bg-23-967-2026-f12

Figure 12. Independent verification based on time-series observation stations.

The reconstruction of surface pCO2 in the near sea area (Fig. 13) shows that surface pCO2 is higher in the low-latitude near sea areas on both sides of the equator, which is closely related to factors such as high seawater temperature and vigorous evaporation. In high-latitude oceans, seawater temperature is lower, which affects ocean circulation and mixing processes, and surface pCO2 tends to be lower overall. The fitting accuracy of the near sea area model and the global ocean model was compared in the near sea region. The MAE of the near sea model was 20.145 µatm, the MAPE was 0.065, the MSE was 983.726 µatm2, the RMSE was 31.364 µatm, and R2 was 0.797; the MAE of the global model was 20.324 µatm, the MAPE was 0.065, the MSE was 999.147 µatm2, the RMSE was 31.609 µatm, and R2 was 0.794. The near sea area model reconstructs surface pCO2 in the near sea area slightly better than the global ocean model (Fig. 14), indicating that RF can model the complex marine environment of the near sea area and reflect the distribution characteristics of surface pCO2 in the region.
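To put the regional gains in perspective, the relative RMSE reduction of each regional model over the global model can be computed directly from the values reported in the text. The short calculation below uses only those reported numbers; the dictionary layout and function name are illustrative.

```python
# RMSE values in µatm, copied from the independent-validation results in the text
reported_rmse = {
    "far sea": {"regional": 16.417, "global": 16.601},
    "near sea": {"regional": 31.364, "global": 31.609},
}

def relative_reduction(regional, global_):
    """Fractional RMSE reduction of the regional model relative to the global one."""
    return (global_ - regional) / global_

reductions = {
    region: relative_reduction(v["regional"], v["global"])
    for region, v in reported_rmse.items()
}

for region, value in reductions.items():
    print(f"{region}: RMSE reduced by {100 * value:.2f} %")
```

The reductions are roughly 1.1 % in the far sea areas and 0.8 % in the near sea areas, which is consistent with describing the improvement as slight.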

https://bg.copernicus.org/articles/23/967/2026/bg-23-967-2026-f13

Figure 13. Surface ocean pCO2 products in the near sea areas.

https://bg.copernicus.org/articles/23/967/2026/bg-23-967-2026-f14

Figure 14. Comparison of reconstruction accuracy in the near sea areas using different scale models. Right axis: normalized probability density of model residuals.

3.5 Spatiotemporal analysis of surface pCO2

At the global ocean scale (Fig. 15), the equatorial region experiences strong solar radiation and high temperatures, resulting in relatively low CO2 solubility. In addition, upwelling brings deep, CO2-rich seawater to the surface, which raises surface pCO2. In the low-temperature environment of the polar oceans, the solubility of CO2 in seawater increases significantly. Sea ice coverage and strong wind fields in polar waters promote gas exchange between the atmosphere and the ocean, resulting in relatively low surface pCO2. Surface pCO2 in the Antarctic region is generally higher than in the Arctic region, because the circulation system transports a large amount of seawater with high surface pCO2 from low to high latitudes. The melting and formation of sea ice also have an important influence on the distribution of surface pCO2. Because of its wider sea ice coverage, the Arctic region is less affected by the North Atlantic warm current, and its surface pCO2 is lower than that of the Antarctic region. Over time, global ocean surface pCO2 increases year by year, which is related to global warming. Rising sea temperature in mid-latitude waters lowers CO2 solubility and promotes an increase in surface pCO2.
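The latitude dependence and the global annual mean discussed above can be summarized from a gridded field with a zonal mean and an area-weighted mean. The sketch below assumes a regular 0.25° latitude-longitude grid with NaN over land or ice and uses cosine-of-latitude weights, since cells shrink toward the poles. The synthetic field and all names are illustrative and are not the reconstructed product.

```python
import numpy as np

RES = 0.25
N_LAT, N_LON = int(180 / RES), int(360 / RES)
lat_centres = -90.0 + RES * (np.arange(N_LAT) + 0.5)

def zonal_mean(field):
    """Mean over longitude per latitude row, skipping NaN cells."""
    valid = np.isfinite(field)
    count = valid.sum(axis=1)
    total = np.where(valid, field, 0.0).sum(axis=1)
    return np.where(count > 0, total / np.maximum(count, 1), np.nan)

def area_weighted_mean(field, lat):
    """Global mean weighted by cos(latitude), skipping NaN cells."""
    weights = np.cos(np.deg2rad(lat))[:, None] * np.ones_like(field)
    valid = np.isfinite(field)
    return np.sum(np.where(valid, field * weights, 0.0)) / np.sum(weights[valid])

# Synthetic "high at the equator, low at the poles" field in µatm (illustrative)
profile_true = 340.0 + 60.0 * np.cos(np.deg2rad(lat_centres)) ** 2
field = profile_true[:, None] * np.ones((1, N_LON))
field[:, :200] = np.nan                  # pretend these columns are land

profile = zonal_mean(field)
global_mean = area_weighted_mean(field, lat_centres)
simple_mean = np.nanmean(field)          # unweighted, over-represents polar cells
```

For this synthetic field the unweighted mean is noticeably lower than the area-weighted mean, because the many small polar cells carry the low values; the same effect applies to any annual global mean derived from a regular latitude-longitude grid.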

https://bg.copernicus.org/articles/23/967/2026/bg-23-967-2026-f15-part01

Figure 15. Annual spatiotemporal variations of surface ocean pCO2 in the global ocean.

In the far sea areas (Fig. 16), surface pCO2 is higher in the low-latitude areas near the equator, particularly in the eastern equatorial Pacific. This is mainly due to upwelling in the region, which brings cold, CO2-rich water from deep layers to the ocean surface and raises surface pCO2. In the mid to high latitudes of the far sea region, surface pCO2 is low, which the ocean circulation pattern contributes to by promoting the mixing of surface and deep seawater. Low temperature and a strong biological pump also enhance the ocean's absorption of atmospheric CO2, leading to low surface pCO2. Over time, surface pCO2 increases year by year, especially after 2015. This is closely related to global climate change, changes in ocean circulation patterns, and the impact of human activities.

https://bg.copernicus.org/articles/23/967/2026/bg-23-967-2026-f16-part01

Figure 16. Annual spatiotemporal variations of surface ocean pCO2 in the far sea areas.

The reconstruction results of surface pCO2 in the near sea area (Fig. 17) show that the equatorial region is characterized by strong solar radiation, high seawater temperature, and the influence of tropical cyclones and trade winds; the exchange of CO2 between seawater and atmosphere is frequent there, and surface pCO2 is relatively high. In mid to high latitude oceans, low-temperature seawater, the sinking of polar cold water, and the upwelling of deep seawater result in relatively low surface pCO2. Surface pCO2 shows distinct distribution characteristics along the eastern coast of Asia in the mid-latitude Northern Hemisphere: surface pCO2 in the Yellow Sea and Bohai Sea is significantly lower than in the coastal areas of eastern North America, which is related to the East Asian monsoon circulation and complex marine ecosystems. Surface pCO2 is relatively high in the waters around Southeast Asia and the Indian Peninsula and in the waters between North America and South America. There, the monsoon climate, tropical cyclones, high sea temperatures, and marine pollution caused by human activities have collectively led to an increase in surface pCO2. Temporally, surface pCO2 in near sea areas has been increasing year by year. Because temperatures in low-latitude sea areas are rising, the solubility of CO2 in seawater decreases, and the upward trend of surface pCO2 is more pronounced there.
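Regional differences in the rate of increase, such as the more pronounced rise at low latitudes, can be examined by fitting a linear trend to the annual fields in every grid cell. The following is a minimal sketch using the closed-form least-squares slope; the array shape, the function name, and the synthetic data are assumptions for illustration only.

```python
import numpy as np

def trend_map(years, fields):
    """Least-squares slope per grid cell.

    fields has shape (n_years, n_lat, n_lon). In this simplified version,
    a cell with a missing value (NaN) in any year returns NaN.
    """
    years = np.asarray(years, dtype=float)
    x = years - years.mean()                       # centred time axis
    anomalies = fields - fields.mean(axis=0)       # remove per-cell mean
    return np.tensordot(x, anomalies, axes=(0, 0)) / np.sum(x ** 2)

# Synthetic demonstration on a tiny 3 x 4 grid with known slopes (µatm per year)
years = np.arange(2000, 2020)
true_slopes = np.linspace(0.5, 2.5, 12).reshape(3, 4)
fields = 360.0 + true_slopes[None, :, :] * (years - 2000)[:, None, None]

slopes = trend_map(years, fields)
```

Applied to the annual 2000 to 2019 fields, a map of this kind would show where the increase is fastest; restricting `years` and `fields` to a sub-period allows a comparison of trends before and after a given year.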

https://bg.copernicus.org/articles/23/967/2026/bg-23-967-2026-f17-part01

Figure 17. Annual spatiotemporal variations of surface ocean pCO2 in the near sea areas.

4 Conclusions

This study is based on a multi-scale analysis framework covering the global ocean, far sea areas, and near sea areas. Using LDEO measured data combined with multi-source data, multiple machine learning models were built and used to reconstruct the annual surface pCO2 distribution at 0.25° × 0.25° resolution from 2000 to 2019, revealing its spatiotemporal variation patterns and driving mechanisms. The results indicate that the random forest (RF) model performs best at all scales and effectively captures the spatiotemporal distribution characteristics of surface pCO2. Surface pCO2 is “high at the equator and low at the poles” in space and increases year by year in time. Different ocean regions exhibit different patterns of change due to the combined effects of natural factors and human activities. The acidity and alkalinity of seawater are the main driving factors for changes in surface pCO2, and the contributions of other influencing factors vary across scales.

Although this study has achieved certain results, the complexity of ocean carbon sinks still needs further exploration. Future research can focus on optimizing models, developing hybrid models, and combining advanced algorithms with ocean mechanism models. At the same time, interdisciplinary work across oceanography, ecology, and climatology should be strengthened to comprehensively reveal the process of ocean carbon cycling and provide a scientific basis for addressing climate change.

Code availability

All raw data and code are available from the corresponding authors upon reasonable request.

Data availability

This study reconstructs global ocean surface pCO2 (2000–2019) using multi-source data and machine learning, identifying RF as the optimal model and revealing equatorial-high/polar-low patterns with rising trends. Data will be made available on request.

Author contributions

Conceptualization: HW; methodology: HW and YJ; software: XL and YJ; validation: WZ, LC, and LW; formal analysis: YJ; investigation: WZ, LW, and LC; resources: XL and YJ; data curation: XL and YJ; writing–original draft preparation: YJ, YW, and ML; writing–review and editing: HW and ZL; visualization: XL and LC; supervision: HW; project administration: HW; funding acquisition: HW. All authors have read and agreed to the published version of the manuscript.

Competing interests

The contact author has declared that none of the authors has any competing interests.

Disclaimer

Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims made in the text, published maps, institutional affiliations, or any other geographical representation in this paper. The authors bear the ultimate responsibility for providing appropriate place names. Views expressed in the text are those of the authors and do not necessarily reflect the views of the publisher.

Special issue statement

This article is part of the special issue “Biogeochemical processes and Air–sea exchange in the Sea-Surface microlayer (BG/OS inter-journal SI)”. It is not associated with a conference.

Acknowledgements

The sea surface pCO2 data used in this study were obtained from the Global Surface pCO2 (LDEO) Database Version 2019, hosted by the Ocean Carbon Data System of NOAA's National Centers for Environmental Information (NCEI). We thank the principal investigators and all contributors to this database. We thank the two anonymous reviewers for their thoughtful comments on this manuscript.

Financial support

This research was funded by the Key Laboratory of Land Satellite Remote Sensing Application, Ministry of Natural Resources of the People's Republic of China (grant number G202211), the Ministry of Education Industry-University Collaborative Education Project (grant number 220504039151258), and the National Natural Science Foundation of China (grant number 42574035).

Review statement

This paper was edited by Peter S. Liss and reviewed by two anonymous referees.

References

Chau, T. T. T., Gehlen, M., and Chevallier, F.: A seamless ensemble-based reconstruction of surface ocean pCO2 and air–sea CO2 fluxes over the global coastal and open oceans, Biogeosciences, 19, 1087–1109, https://doi.org/10.5194/bg-19-1087-2022, 2022. 

Chau, T. T. T., Chevallier, F., and Gehlen, M.: Global analysis of surface ocean CO2 fugacity and air-sea fluxes with low latency, Geophys. Res. Lett., 51, e2023GL106670, https://doi.org/10.1029/2023GL106670, 2024. 

Chen, S., Hu, C., Barnes, B. B., Wanninkhof, R., Cai, W.-J., Barbero, L., and Pierrot, D.: A machine learning approach to estimate surface ocean pCO2 from satellite measurements, Remote Sens. Environ., 228, 203–226, https://doi.org/10.1016/j.rse.2019.04.019, 2019. 

Chierici, M., Signorini, S. R., Mattsdotter-Björk, M., Fransson, A., and Olsen, A.: Surface water fCO2 algorithms for the high-latitude Pacific sector of the Southern Ocean, Remote Sens. Environ., 119, 184–196, https://doi.org/10.1016/j.rse.2011.12.020, 2012. 

Falkowski, P., Scholes, R. J., Boyle, E., Canadell, J., Canfield, D., Elser, J., and Linder, S.: The global carbon cycle: a test of our knowledge of earth as a system, Science, 290, 291–296, https://doi.org/10.1126/science.290.5490.291, 2000. 

Fu, Z., Hu, L., Chen, Z., Zhang, F., Shi, Z., Hu, B., and Liu, R.: Estimating spatial and temporal variation in ocean surface pCO2 in the Gulf of Mexico using remote sensing and machine learning techniques, Sci. Total Environ., 745, 140965, https://doi.org/10.1016/j.scitotenv.2020.140965, 2020. 

Ge, W., Patino, J., Todisco, M., and Evans, N.: Explaining deep learning models for spoofing and deepfake detection with SHapley Additive exPlanations, Paper presented at the ICASSP, https://doi.org/10.1109/ICASSP43922.2022.9747476, 2022a. 

Ge, W., Patino, J., Todisco, M., and Evans, N.: Explaining deep learning models for spoofing and deepfake detection with SHapley Additive exPlanations, Paper presented at the ICASSP, https://doi.org/10.1109/ICASSP43922.2022.9747476, 2022b. 

Gregor, L., Kok, S., and Monteiro, P.: Empirical methods for the estimation of Southern Ocean CO2: support vector and random forest regression, Biogeosciences, 14, 5551–5569, https://doi.org/10.5194/bg-14-5551-2017, 2017. 

Gregor, L., Lebehot, A. D., Kok, S., and Scheel Monteiro, P. M.: A comparative assessment of the uncertainties of global surface ocean CO2 estimates using a machine-learning ensemble (CSIR-ML6 version 2019a) – have we hit the wall?, Geosci. Model Dev., 12, 5113–5136, https://doi.org/10.5194/gmd-12-5113-2019, 2019. 

Landschützer, P., Gruber, N., Bakker, D. C., Schuster, U., Nakaoka, S.-i., Payne, M. R., and Zeng, J.: A neural network-based estimate of the seasonal to inter-annual variability of the Atlantic Ocean carbon sink, Biogeosciences, 10, 7793–7815, https://doi.org/10.5194/bg-10-7793-2013, 2013. 

Landschützer, P., Gruber, N., Bakker, D. C., and Schuster, U.: Recent variability of the global ocean carbon sink, Global Biogeochem. Cy., 28, 927–949, https://doi.org/10.1002/2014GB004853, 2014. 

Landschützer, P., Gruber, N., and Bakker, D. C.: Decadal variations and trends of the global ocean carbon sink, Global Biogeochem. Cy., 30, 1396–1417, https://doi.org/10.1002/2015GB005359, 2016. 

Lohrenz, S. E., Cai, W.-J., Chakraborty, S., Huang, W.-J., Guo, X., He, R., and Tian, H.: Satellite estimation of coastal pCO2 and air-sea flux of carbon dioxide in the northern Gulf of Mexico, Remote Sens. Environ., 207, 71–83, https://doi.org/10.1016/j.rse.2017.12.039, 2018. 

Marrec, P., Cariou, T., Macé, E., Morin, P., Salt, L. A., Vernet, M., Taylor, B., Paxman, K., and Bozec, Y.: Dynamics of air–sea CO2 fluxes in the northwestern European shelf based on voluntary observing ship and satellite observations, Biogeosciences, 12, 5371–5391, https://doi.org/10.5194/bg-12-5371-2015, 2015. 

Nakaoka, S.-i., Telszewski, M., Nojiri, Y., Yasunaka, S., Miyazaki, C., Mukai, H., and Usui, N.: Estimating temporal and spatial variation of ocean surface pCO2 in the North Pacific using a self-organizing map neural network technique, Biogeosciences, 10, 6093–6106, https://doi.org/10.5194/bg-10-6093-2013, 2013. 

Telszewski, M., Chazottes, A., Schuster, U., Watson, A., Moulin, C., Bakker, D., and Lüger, H.: Estimating the monthly pCO2 distribution in the North Atlantic using a self-organizing neural network, Biogeosciences, 6, 1405–1421, https://doi.org/10.5194/bg-6-1405-2009, 2009.  

Wang, Y., Li, X., Song, J., Li, X., Zhong, G., and Zhang, B.: Carbon sinks and variations of pCO2 in the Southern Ocean from 1998 to 2018 based on a deep learning approach, IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens., 14, 3495–3503, https://doi.org/10.1109/JSTARS.2021.3066552, 2021. 

Wu, H., Wang, L., Ling, X., Cui, L., Sun, R., and Jiang, N.: Spatiotemporal reconstruction of global ocean surface pCO2 based on optimized random forest, Sci. Total Environ., 912, 169209, https://doi.org/10.1016/j.scitotenv.2023.169209, 2024. 

Zhang, S., Rutgersson, A., Philipson, P., and Wallin, M. B.: Remote sensing supported sea surface pCO2 estimation and variable analysis in the Baltic Sea, Remote Sens., 13, 259, https://doi.org/10.3390/rs13020259, 2021. 

Zhong, G., Li, X., Song, J., Qu, B., Wang, F., Wang, Y., and Wang, Z.: Reconstruction of global surface ocean pCO2 using region-specific predictors based on a stepwise FFNN regression algorithm, Biogeosciences, 19, 845–859, https://doi.org/10.5194/bg-19-845-2022, 2022. 
