|I want to thank the authors for their extensive review, their detailed answers to all major and minor comments, and their patience with me when I did not understand everything. The extension of the analysis to the upper 2000 m are a great plus and after the response, I entirely agree that it is best to apply the method to the ARGO data in a separate study.|
However, I have two last outstanding major question with respect to the Methods that need to be clarified before publication and that could substantially improve the results. Although I have clicked 'major revisions', I do not want you to see this as a 'traditional' major revisions. The manuscript is in great shape but I think the two major points would make it much better. The first point would have to be adressed in text if the model drift was accounted for and needs additional analysis if it was not. The second point could be adressed by changing the text, but I believe an additional analysis might substantially strengthen the study.
1) I could not find information if the model output was detrended using the pre-industrial control output. Trends are usually small to negligible close to the ocean surface due to exchanges with the atmosphere but can be substantial in the subsurface ocean. If such drifts exist and vary across the models in the ensemble, it might be almost impossible to obtain the optimal weights from these models. Such a drift may well cause the bipolar optimal coefficient pattern for T and S that is shown in Figure 7. If the drift is not accounted for, I think it has to be accounted for. The best way is probably to fit a spline to the pre-industrial control run and to remove it from the historical simulation. In this way, the inter-annual or decadal variability, which are likely similar in the piControl and historical run, is not removed with the drift.
2) It is somehow concerning that problems arise for MPI-ESM1.2-LR because of the higher spatial resolution. I believe that a mistake was made. Following Séférian et al. (2020) (https://link.springer.com/article/10.1007/s40641-020-00160-0), MPI-ESM1.2-LR has one of the coarsest resolutions. If the MPI-ESM1.2-LR resolution is indeed coarser, the argument would have to be altered. At the moment, the weak performance of the DIC reconstruction under MPI-ESM1.2-LR is discussed essential weakness of this method when being applied to observations and regions with lots of mesoscale processes. However, this might be wrong if MPI-ESM1.2-LR was not highly resolved. From my experience, MPI-ESM1.2-LR is more likely an outlier and may not affect the performance of the methods when applied to observations. I believe an adjustment of the Discussion is absolutely necessary. However, I believe the authors could significantly increase trust in the reconstruction method, and I believe the method merits it, by including GFDL-ESM4 (1/2° resolution) and GFDL-CM4 (1/4° resolution). Both model versions are highly resolved compared to other models and GFDL-ESM4 is the best performing model with respect to historical carbon uptake (Annex in Terhaar et al., 2022). It would take a large amount of work and I hence cannot ask for it but I think it would be a worth it.
In addition, I have one comment that is neither major nor minor:
1) Section 3.1 seems to be complicating the message. The co-evolving trends of atmospheric CO2 and warming lead to correlations that are not related to the effect of warming on the ocean system, i.e., changes in circulation and solubility. In fact, the ensemble coefficients in Figure 4 show this as the positive correlation from temperature is entirely accounted for by the increase in pCO2 and the pCO2 coefficient, whereas the T coefficient accounts for the solubility. I believe that Figure 4 really shows the strength of the approach and merits to be a centerpiece of the results.
I am not an author of this study but would seriously consider removing section 3.1, as it confused me more than it helped. However, I am only one person and others may disagree. And as nothing is wrong with the section, I only wanted to voice my opinion and do not want to ask for changes.
Please find below some minor comments:
1) Figure 8: It seems that the reconstruction performs best, where most carbon is stored (North Atlantic and Southern Ocean) and performs poorly where little to no carbon is stored, for example in the deep North Pacific. Thus, the maps in Figure 8 might suggest that the reconstruction performs poorer than it does. Maybe it would be better to show differences between the reconstructed and the true DIC, like Figure 9.
2) Line 3 of the abstract: I would maybe highlight here that the temperature and salinity coverage is much less sparse than the carbon observation coverage. The contrast between both coverages is what this is all about.
3) Line 20: Should probably be Terhaar et al. (2022) and not 2020.
4) Line 28: would replace ‘sufficient’ by ‘enough’. The coverage of, 1.5% is likely not sufficient (L. Gloege, G. A. McKinley, P. Landschützer, A. R. Fay, T. L. Frölicher, J. C. Fyfe, T. Ilyina, S. Jones, N. S. Lovenduski, K. B. Rodgers, S. Schlunegger, Y. Takano, Global Biogeochem. Cycles, in press, doi:https://doi.org/10.1029/2020GB006788.)
5) Lines 38 to 47: This paragraph should probably also include reference to the application of the eMLR* method by Gruber et al. (2019) (https://www.science.org/doi/10.1126/science.aau5153)
6) Lines 56-57: What exactly are these well-understood relationships? Maybe worth to elaborate a bit more.
7) Line 59: Maybe worth to reference Weiss et al. (1974). (https://www.sciencedirect.com/science/article/pii/0304420374900152)
8) Line 234-235: I believe that this sentence is not correct. Fig. 2 is rather a combination of the parallel increase in atm CO2 and temperature and the solubily effect of T on pCO2.
9) Line 336: How many regions are ‘most regions’.
The manuscript “Reconstructing ocean carbon storage with CMIP6 models and synthetic Argo observations” by Turner et al. is well written and easy to follow. The figures are well chosen, good to read, and facilitate the understanding of the study. The methods are sound and well presented, although some precisions at a few points would be needed, I think.
However, I am somewhat concerned about the novelty and significance of this work for this journal, although this might well be due to a misunderstanding of the methods. To be sure that no misunderstandings have occurred, I am first summarizing the paper very(!) briefly in my own words. If the authors detect significant flaws, please consider them when reading my following comments.
As far as I understand, the here presented method builds on the assumption that the DIC (averaged over the first 100 m in the ocean) at a point x,y can be expressed as the average DIC from 1955 to 2014 at this point x,y plus a delta DIC. This delta DIC can, for any moment in time, be approximated by a linear combination of delta pCO2, delta T, and delta S at locations more or less close to point delta x,y.
My comments are:
To me, it seems somewhat trivial that the long-term trajectory of DIC in the first 100 m, a depth that is more or less representative of the average mixed layer depth, is mainly determined by the atmospheric pCO2. The provided BATS example is an excellent example (blue dashed line in Figure 7b). By knowing also temperature, which affects solubility, and salinity, which is closely related to alkalinity, at only one point close to the point of interest, slight deviations of this long-term trend can be represented, as well as inter-annual and decadal variability that is not due to changes in primary production or remineralization can be represented. This is nicely represented by the HOT example (blue dashed line in Figure 7c). As the authors describe in their paper, the underlying mechanisms are well-understood: air-sea CO2 flux due to increasing atmospheric pCO2, decreased solubility with increasing temperatures, and more DIC with increasing alkalinity. Therefore, I think that the predictability of DIC in the mixed layer from pCO2, T, and S as well as a background DIC, is not surprising.
When the detrended response is looked at, pCO2, which is responsible for the largest part of the trend, should not play any role anymore as annual globally averaged pCO2 has little variability. At this point, the RMSE improvement seems to drop (Figures 5d-f and 6d-f). It remains mainly high in the relatively slow and calm subtropical gyres and becomes smaller in the more dynamical regions like the tropical Pacific, eastern upwelling regions, or the Southern Ocean. In these regions, variability in close-to-surface DIC is likely influenced by a variability in the circulation and hence the upwelling of nutrients and carbon. The negative correlation between DIC and T in the tropical eastern Pacific (2d) due to El Nino is a good example. It would be nice to discuss and analyze a little bit more, why these regions are less predictable. As these regions are usually also the regions with the highest variability in surface ocean DIC, I was wondering how the statement that the RMSE of the detrended signal was reduced by 60% was calculated. Did you calculate the area-averaged eta from Figures 5e and 6e? Or did you compare the detrended true signal of globally averaged DIC to the predicted signal of globally averaged DIC? I am not sure which way is better as both have potential pitfalls. Taking the area-weighted average of eta gives potentially to strong emphasize to regions with little variability, as the subtropical gyres, that contribute only little to the global ocean DIC signal. However, by taking the area-averaged DIC first, errors in different directions may globally compensate. Maybe it would be best to use the local eta at each point x,y and weight them by the true DIC variability at that point x,y?
Another point that I am not sure about is the importance of this analysis for the global ocean carbon sink. How much of the additional DIC is in the first 100 m? The Global Carbon Budget 2021 (https://essd.copernicus.org/articles/14/1917/2022/essd-14-1917-2022.html) estimates that from 1960 to 2020, 115 Pg C has entered the global ocean. Recent estimates from observationally constrained ESM output suggests that it might be a bit more. Thus from 1960 to 2014, the ocean has taken up very approximately ~100 Pg C. The upper ocean carbon inventory shoes an uptake of ~14 Pg C (tried to read that number from Figure 7a). Thus, the upper ocean is ‘only’ responsible for 14% of the global ocean carbon sink. Wouldn’t it be more important to quantify how much is transferred from the surface ocean to the interior ocean, i.e., subducted below the mixed layer (https://agupubs.onlinelibrary.wiley.com/doi/full/10.1002/gbc.20092, https://agupubs.onlinelibrary.wiley.com/doi/full/10.1002/2015GL065073). Recent studies indicate that salinity and temperature might indeed be able to reproduce the interior ocean Cant (https://www.science.org/doi/10.1126/sciadv.abd5964, https://www.nature.com/articles/s41586-020-2360-3, https://bg.copernicus.org/articles/18/2221/2021/bg-18-2221-2021.html#&gid=1&pid=1, and https://bg.copernicus.org/preprints/bg-2022-134/). However, the interior ocean transport might make this much harder. The authors mention that this will happen in the next step, but I think it could/should also be incorporated in here if(!) the long-term trend remains a focus of this study.
Figure 7a suggests, however, that the fist 100 m might be very interesting for the inter-annual or decadal variability. In Figure 7a, you show the variability of the upper ocean 100 m DIC. How much of this is caused by the air-sea CO2 flux and how much by transport to the interior ocean. Could the variability in the air-sea CO2 flux maybe largely be derived from variability in the ocean T and S? Or in the variability of subduction on these time-scales? Can your method provide an estimate of the subduction of DIC on an annual basis using the surface ocean air-sea CO2 flux estimates in combination with your estimates? The example of HOT in NorESM is striking as it suggests that the upper-ocean has taken up almost no carbon between 1990 and 2014 despite an increase in atmospheric pCO2.
After having read through the manuscript, I was really disappointed that this new method is not yet applied to observations right now. There is much potential, and I understand the idea of making two publications, one for ‘proof-of-concept’ and one for the application, but now it seems as if something is missing in this paper. If no institutional, or PhD-related restrictions exists, I would recommend to also present the application here.
My overall recommendation would therefore be to de-emphasize the long-term trend, focus on the variability, apply the method to observations, compare it to estimates of the air-sea CO2 flux, and try to address the subject of the decadal and inter-annual variability of the ocean carbon sink.
As you already have the models, you might even be able to draw conclusions why and where models do not capture the variability. Are the regions with the largest variability in the air-sea CO2 flux also the regions with the largest variability in the upper ocean DIC?
If the focus is more on the detrended signal, the subsurface ocean would not fit anymore and could be left for another study.
While this sounds somewhat negative, I am in strong favor of publishing this manuscript in Biogeoscience but with an application to observations. If not applied, maybe GMD is more suited for 'proof-of-concept' studies? But that is an editorial decision.
Line 22, page 1: While the Global Carbon Budget summarizes the numbers, I prefer to give credit to the original studies (for example: https://bg.copernicus.org/articles/10/2169/2013/, https://agupubs.onlinelibrary.wiley.com/doi/full/10.1002/2013GB004739, and recently https://bg.copernicus.org/preprints/bg-2022-134/)
Line 25, page 2: Gruber et al. is about compound extreme events and rather summarizes literature. I would recommend to replace it by this earlier study on global ocean acidity extremes (https://bg.copernicus.org/articles/17/4633/2020/) and this regional study (https://www.nature.com/articles/s43247-021-00254-z) that provides one example in detail.
Line 57: I think this is the first time that you use the term ‘upper ocean’. Maybe define it here.
Lines 60-66 and lines 99-100: Would suggest to add the Southern Ocean studies (https://www.science.org/doi/10.1126/sciadv.abd5964, https://www.nature.com/articles/s41467-022-27979-5), and one study from the Arctic Ocean (https://www.nature.com/articles/s41586-020-2360-3) that show how surface observations (and variations) of S and T can alter the CO2 uptake. Or maybe keep that for later in the Discussion when you describe the potential to go below the surface.
Line 112: Does the method account for inter-dependencies between the predictors? Sorry about the question, but I am not very familiar with this kind of method.
Line 132-133: Just wanted to say that I really like that you provided the regridding method here in such detail. Often it is not possible to reproduce data because the regridding software is not named.
Line 141: Something is wrong in the sentence, I think.
Lines 147-154: Just to be sure that I understand it right. When you use the ARGO 2002 coverage, you assume that you have data at each of these points for all years from 1955-2014? Is that right? If yes, how do changes in the temporal coverage effect time series as shown in Figure 7? If you have little data for the first 30 years and then more, you should get a better representation of the variability over time, right? Figure 7c seems to suggest that there is only 1 observation for the blue line, but that means one observation per year in the vicinity of the HOT station, right?
Section 3: I found this section a little bit difficult to follow. What could help (maybe it doesn’t), is to look only at the long-term signal and only at the detrended for the correlations. Than the long-term should give you a correlation between pCO2, T, and DIC. T and DIC because the ocean is warming and taking up carbon (but one does not cause the other) and the other between pCO2 in the atmosphere that causes DIC in the ocean to increase. For the detrended signal, I would expect no correlation between pCO2 and DIC as the pCO2 trend is gone. T and DIC should be negatively correlated (like shown in Figure 3b). And salinity should be positively correlated. A little bit along the lines of Figure 2 in https://www.nature.com/articles/s41467-022-32120-7
General: Did you think about including the area of sea ice in each cell as well? As this blocks the air-sea gas exchange to some extent, it might be a powerful predictor in the high-latitude oceans. Just a guess.
Section 6: While I completely understand the reason to only choose Nor-ESM to make these tests, I would be curious how other models perform.