Using machine learning and Biogeochemical-Argo (BGC-Argo) floats to assess biogeochemical models and optimize observing system design

Mignot, Alexandre; Claustre, Hervé; Cossarini, Gianpiero; D'Ortenzio, Fabrizio; Gutknecht, Elodie; Lamouroux, Julien; Lazzari, Paolo; Perruche, Coralie; Salon, Stefano; Sauzède, Raphaëlle; Taillandier, Vincent; Teruzzi, Anna

doi:https://doi.org/10.5194/bg-20-1405-2023

Articles | Volume 20, issue 7

© Author(s) 2023. This work is distributed under
the Creative Commons Attribution 4.0 License.

Special issue:

Biogeochemistry in the BGC-Argo era: from process studies...

Research article | 12 Apr 2023

Alexandre Mignot, Hervé Claustre, Gianpiero Cossarini, Fabrizio D'Ortenzio, Elodie Gutknecht, Julien Lamouroux, Paolo Lazzari, Coralie Perruche, Stefano Salon, Raphaëlle Sauzède, Vincent Taillandier, and Anna Teruzzi

Final revised paper (published on 12 Apr 2023)
Preprint (discussion started on 20 Jan 2021)

Interactive discussion

Status: closed

RC1:
'Comment on bg-2021-2', Anonymous Referee #1, 17 Feb 2021

The comment was uploaded in the form of a supplement: https://bg.copernicus.org/preprints/bg-2021-2/bg-2021-2-RC1-supplement.pdf

Citation: https://doi.org/10.5194/bg-2021-2-RC1
- AC1: 'Reply on RC1', Alexandre Mignot, 30 Jun 2021
  
  The comment was uploaded in the form of a supplement: https://bg.copernicus.org/preprints/bg-2021-2/bg-2021-2-AC1-supplement.pdf
  
  Citation: https://doi.org/10.5194/bg-2021-2-AC1
RC2:
'Comment on bg-2021-2', Marcello Vichi, 07 Mar 2021
General comments

This manuscript is indeed a valid compendium of diagnostics for assessing global ocean ecosystem models, which has been prepared with the aim to demonstrate the use of the multi-disciplinary dataset made available by the BGC-Argo array. The authors should thus be praised for their intention to bring together the community and follow the steps taken by Russell et al. (2018). However, that paper had different entry points, since it was specifically dedicated to a poorly sampled oceanic region and offered a multi-model analysis. This manuscript is well written and constructed, but only conveys a demonstrative message. I am thus not fully convinced by the scope of this present version of the manuscript, as well as by its effective novelty, since it does not add further knowledge to the existing literature.

I have chosen to sign this review because I deem it important to fully disclose my intentions and avoid any misunderstanding. Over the past 20 years, I have personally been one of the modellers who felt the need to engage with the thorough validation of global ocean models and the related limitations. My review may sound over-critical, and I would like the authors to appreciate my intention to give a constructive critique, which is meant to assist with model improvements.

Hence, I have carefully thought about how to write this review, and realised that the most relevant point of clarity would be to illustrate some cases of how readers could approach it. From the point of view of someone approaching model validation as a student or early career researcher, this manuscript offers a limited perspective, and one would gain more theoretical and methodological background in the 2009 JMS special issue (Lynch et al., 2009, and all the other papers in the issue), if not from earlier papers in the ecological modelling literature (Oreskes et al., 1994; Rykiel, 1996). If a reader is interested in the validation of the global version of PISCES, this manuscript is insufficient, because it provides a series of figures with few comments and discussions. It is surely of interest to the PISCES developers who are knowledgeable of the model details and possible deficiencies, but then an internal report would suffice. Finally, for experienced global ocean modellers, this manuscript is an illustration of the minimum set of assessments (which I prefer to the term “validation”) that serious modellers have been doing in the last ten years when evaluating their model results. In terms of “metrics”, it gives indications to compare the model output against the state variables that can be measured by the array of floats and to add derived state variables from applications of artificial intelligence. Ultimately, the assessment is based on visual comparisons of coarsely gridded spatial maps and time series, or through the use of basic univariate scores (bias and RMSD) and cumulative diagrams that combine the same skill scores (e.g. the Taylor diagram, which also includes linear correlation).
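For readers unfamiliar with the basic univariate scores named above, the following is a minimal Python sketch of bias, RMSD and linear correlation for paired model and observation values. The function name `skill_scores` and the sample numbers are hypothetical and are not taken from the manuscript under review.

```python
import math

def skill_scores(model, obs):
    """Return bias, RMSD and Pearson correlation for paired values."""
    if len(model) != len(obs) or len(obs) < 2:
        raise ValueError("need two equal-length series with at least 2 points")
    n = len(obs)
    mean_m = sum(model) / n
    mean_o = sum(obs) / n
    bias = mean_m - mean_o
    rmsd = math.sqrt(sum((m - o) ** 2 for m, o in zip(model, obs)) / n)
    # Population standard deviations and covariance (divide by n).
    std_m = math.sqrt(sum((m - mean_m) ** 2 for m in model) / n)
    std_o = math.sqrt(sum((o - mean_o) ** 2 for o in obs) / n)
    if std_m == 0 or std_o == 0:
        raise ValueError("correlation is undefined for a constant series")
    cov = sum((m - mean_m) * (o - mean_o) for m, o in zip(model, obs)) / n
    return {"bias": bias, "rmsd": rmsd, "corr": cov / (std_m * std_o)}

# Hypothetical matched values (assumed units: mg m-3); the model is offset by +0.1.
obs = [0.10, 0.20, 0.40, 0.30]
model = [0.20, 0.30, 0.50, 0.40]
scores = skill_scores(model, obs)
print(scores)
```

In this constructed case the correlation is perfect while bias and RMSD are both non-zero, which is one reason the scores are usually reported together rather than singly.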

The BGC-Argo data are certainly invaluable, and this is the reason why the community has strived to develop the technology and the financial support to deploy them. The authors did not however succeed in showing their enhanced value for model assessment, beyond the obvious consideration that this increases the number of data, which would be much more evident if this same assessment was done by comparing datasets with and without the contribution of the BGC-Argo.

In summary I have found two major issues with this manuscript that the authors have not considered to a satisfactory extent:

1. The loose definition of metrics and the absence of uncertainties’ treatment. The authors use the term metrics in a rather ambiguous way. They also do not differentiate between measured data and artificially generated data. This implies that the evaluation process does not necessarily lead to an improvement of the model(s).

2. The unconvincing enhancement of the effective role of BGC-Argo data in model assessment. Basically, the question I have is: why are BGC-Argo data good enough to be used separately and not as part of a global compilation of data such as the World Ocean Atlas (which incidentally includes or will include the BGC-Argo data)? Since BGC-Argo floats are ultimately increasing the availability of data that are usually collected by means of traditional oceanographic cruises, what is indeed their value in model validation? The authors state that their aim was to demonstrate the invaluable opportunities offered by the BGC-Argo observations for evaluating global BGC models. I’m afraid this intent cannot be met unless some of the above questions are addressed.

For clarity, I would like to elaborate more on the first concept above, while the second point is mostly derived from the specific comments detailed in the next section. Russell et al. (2018) also use the concept of metrics in a wider sense, although they define metrics as “any quantity or quantifiable pattern that summarizes a particular process or the response in a model to known forcings”. The strength of the ACC transport at Drake Passage or the latitude of the maximum zonal mean winds over the Southern Ocean are “metrics” in this context. They are combinations of state variables, or values of state variables at specific locations.

In this context, all the surface state variables listed in Table 2, are indeed components of the biological carbon pump, but they are not metrics. They are simply state variables. Only when considered together to evidence emergent patterns they may give indications of proper process functionality (e.g. the ratio of particulate organic carbon to total chlorophyll, de Mora et al, 2016). I agree that the DCM and the “nutricline” (which would deserve a more appropriate definition, see specific points below) are “metrics”, as well as the depth of the hypoxic layer. Mixing together indicators of processes with state variables is confusing, unless a rigorous link between a single state variable and the process is established.

This manuscript increases the risk of misinterpretation by mixing together “metrics” and skill scores. Neither Russell et al. (2018) nor this manuscript expands on the concept of metrics performance and objective assessment (performance indicators, skill scores, cost functions, are all synonyms that depend on the specific discipline), which was instead done by Allen et al. (2007), Friedrichs et al. (2009), Vichi and Masina (2009) and others in the JMS special issue. For simplicity, I will use the term skill score, which is the one used in the more mature field of weather forecasting. State variables can be assessed using univariate skill scores, and this is a necessary exercise for any modeller to ensure the model has some grip with reality. Figure 3 and the other density plots in the Appendix give a visual indication of the skill score, but they do not quantify it (e.g. Smith and Rose, 1995; Rose and Smith, 1998). I also have another question linked to my Point 2 (and further detailed in the specific comments): why should this exercise be done only with the BGC-Argo and not also including the other existing data? Since BGC-Argo are evaluated against cruise cast benchmarks, those data are usually considered superior, and should be used. Again, the real value of the BGC-Argo would have been shown if the score had been substantially modified with the inclusion of the Argo data.

Specific comments

P2L1 - Earlier work has specifically addressed the impact of assimilation on the carbonate system (Visinelli et al., 2016).

P2L26-29 - This sentence is mixing together sensor accuracy (which has been assessed by Johnson et al. and Mignot et al. in two specific regions of the world ocean) and temporal/vertical resolutions, which have not been assessed as far as I am aware. This is misleading. 10 days may not be sufficient for all variables, as well as the vertical binning that is done. The comparisons have assessed the equivalence between rosette casts and the floats, but they say nothing about the temporal and vertical resolution. For certain processes, such as carbon exchange and phytoplankton biomass through chlorophyll and backscattering proxies, a resolution of 10 days would lead to sampling aliases either of the mean or of the variability (Monteiro et al., 2015; Little et al., 2018). These are examples from the Southern Ocean, where there is the highest density of buoys.
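The aliasing concern can be illustrated with a minimal Python sketch, assuming an idealized sinusoidal signal. The 12-day period is a hypothetical choice made only for illustration; the 10-day interval matches the float cycle discussed above. The helper `alias_period` is not from any library.

```python
import math

def alias_period(true_period, sampling_interval):
    """Apparent period (same units) of a sinusoid sampled at a fixed interval.

    Returns math.inf when every sample falls on the same phase.
    """
    f = 1.0 / true_period
    fs = 1.0 / sampling_interval
    # Fold the true frequency onto the band resolvable at sampling rate fs.
    f_alias = abs(f - round(f / fs) * fs)
    return math.inf if f_alias < 1e-12 else 1.0 / f_alias

# Hypothetical 12-day oscillation sampled every 10 days.
period = alias_period(12.0, 10.0)

# The 10-day samples of the 12-day signal coincide with those of a 60-day signal.
t = [10 * k for k in range(7)]
fast = [math.cos(2 * math.pi * tk / 12.0) for tk in t]
slow = [math.cos(2 * math.pi * tk / 60.0) for tk in t]
print(period, max(abs(a - b) for a, b in zip(fast, slow)))
```

Any variability with a period shorter than 20 days (twice the sampling interval) folds onto a longer apparent period in this way, so it cannot be distinguished from genuine low-frequency variability in the sampled record alone.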

P2L32-34 - The authors should be more specific. Other datasets, such as for instance remote sensing, are less limited in terms of temporal and spatial resolutions. This is connected to the concerns expressed in Point 1 above.

P4L3-5 This sentence seems to imply that one can only perform point-by-point comparisons when there are few floats, which is odd. Again linked to my main Point 1 above. The authors should explain why given the current computing capability, they only suggest to perform diagnostics for few selected tracks and not for the overall dataset (Section 5.d).

P4L12-16 The connection between the variables and the ocean health/ecosystem functioning is not made explicit in the text. Taking as an example the ocean health index (http://www.oceanhealthindex.org/), establishing ocean health is obtained as a multivariate analysis of several data layers, forming a selected set of drivers and their associated thresholds. The authors should be more explicit about their intent here.

P5L12-13 This is not an objective criterion. What is an acceptable level of compromise?

P5L22 There are many other relationships, and they have been shown to give different results (e.g. Thomalla et al., 2017). The authors should explain why they are recommending this one.

P6L12-15 It appears that this method of linear resampling would artificially increase the number of data, and hence bias the statistical results, especially in conditions where there are not enough data.
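A minimal Python sketch of this concern, assuming a hypothetical profile of five measured values resampled linearly onto a 1 m grid. The helper names and numbers are illustrative and do not reproduce the authors' processing.

```python
import math

def interp_linear(x, y, x_new):
    """Piecewise-linear interpolation of (x, y) at x_new; x must be increasing."""
    out = []
    for xn in x_new:
        for i in range(len(x) - 1):
            if x[i] <= xn <= x[i + 1]:
                w = (xn - x[i]) / (x[i + 1] - x[i])
                out.append(y[i] + w * (y[i + 1] - y[i]))
                break
        else:
            raise ValueError("x_new outside the sampled range")
    return out

def naive_standard_error(values):
    """Standard error of the mean, assuming independent samples."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / (n - 1)
    return math.sqrt(var / n)

# Hypothetical profile: 5 measured depths (m) and values.
depth = [0.0, 10.0, 20.0, 30.0, 40.0]
value = [1.0, 3.0, 2.0, 5.0, 4.0]

fine_depth = [float(d) for d in range(0, 41)]  # 1 m grid, 41 points
resampled = interp_linear(depth, value, fine_depth)

se_measured = naive_standard_error(value)
se_resampled = naive_standard_error(resampled)
print(len(value), len(resampled), se_measured, se_resampled)
```

The resampled series has 41 values but still only 5 independent measurements, so the smaller naive standard error reflects the inflated sample count rather than added information.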

P7L10-12 The authors do not discuss what would happen if the MLD is different between the observations and the model.

P7L29-30 Related to my point 1 above. The relationship between the state variables and the ecosystem functions is not made explicit. The term “useful” should be motivated.

P8L7-8 Same as above, the value of DCM as an indicator should be contextualized. Why are BGC-Argo data providing a better estimate of this metric than other data?

P8L13 Please explain what H is.

P8L14-16 This may be confusing for some readers, since it's not technically a gradient. The cited paper uses and justifies this definition. I'd suggest the authors to be more precise and give their definition and how this is an effective metric of the carbon pump. Also, there is a difference in sampling between argo and the layers of discrete models. How is this taken into account?

P8L28-30 At P4L11 it is reported “depth of the OMZ”. This is the depth of the oxygen minimum. It should be explained how and why this is a good indicator, and why the BGC-Argo data are superior in its identification.

P9L26 This statement about non-linearity is odd in the context of model goodness-of-fit (Smith and Rose, 1995; Pineiro et al, 2008; Vichi and Masina, 2009). If it’s non-linear, then the assessment is failed.

P10L8-12 The choice of the binning interval should be discussed. What is the advantage of losing the variability measured by the floats? Why not use the standard deviation as an indicator of the model skill to reproduce the proper scales? These are enhanced features that only the BGC-Argo data would allow to compute.
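One way to retain the within-bin variability is sketched below in Python, assuming hypothetical float and model series on the same sampling days and a 30-day bin. The names `bin_stats` and `std_ratio` and all values are illustrative assumptions.

```python
import math
from collections import defaultdict

def bin_stats(times, values, bin_width):
    """Group values into fixed-width time bins; return {bin_start: (mean, std, n)}."""
    groups = defaultdict(list)
    for t, v in zip(times, values):
        groups[math.floor(t / bin_width) * bin_width].append(v)
    stats = {}
    for start in sorted(groups):
        vals = groups[start]
        n = len(vals)
        mean = sum(vals) / n
        # Population std; a single-sample bin has zero spread by construction.
        std = math.sqrt(sum((v - mean) ** 2 for v in vals) / n)
        stats[start] = (mean, std, n)
    return stats

# Hypothetical float and model series sampled on the same days.
days = [0, 10, 20, 30, 40, 50]
float_chl = [0.2, 0.6, 1.0, 0.4, 0.3, 0.5]
model_chl = [0.5, 0.6, 0.7, 0.4, 0.4, 0.4]

obs_bins = bin_stats(days, float_chl, bin_width=30)
mod_bins = bin_stats(days, model_chl, bin_width=30)

# Ratio of within-bin spread (model / float); values below 1 suggest the
# model is smoother than the observations at sub-bin time scales.
std_ratio = {b: mod_bins[b][1] / obs_bins[b][1]
             for b in obs_bins if obs_bins[b][1] > 0}
print(obs_bins, mod_bins, std_ratio)
```

In this constructed case the first bin has identical model and float means (0.6) but a spread ratio of 0.25, a mismatch that a comparison of bin means alone would not reveal.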

P10L22-24 Allen et al (2007) warned against the visual comparison of time series. This sentence is generic and should be explained in the context of the augmented data provided by the BGC-Argo.

P11L11-14 The results are not presented according to the concept of the biological carbon pump “metric”. It is evident that the nutrients are correlated while all carbon flux variables are not performing. Which ultimately questions the use of surface nutrients as indicators of carbon cycling.

P11L31 I cannot see the data “around” the line. I rather see an overestimation. (it is either Cape Verde or Cap Vert)

P12-L2-17 Linked to Point 2 above. The authors seem to imply that BGC-Argo data are more suitable than ocean colour for model assessment. I acknowledge that this is not explicitly written, but there is no clear rationale. This kind of map would certainly be superior in terms of spatial and temporal resolution when using that product as benchmark.

P12-section-d This is the section that mostly led to the inclusion of Point 2 above. The shown time series is 2 years long, which is an invaluable source of data from a region that has been influential in shaping our understanding of the spring bloom. I am missing the point why the authors are writing the term spring bloom in quotes. The advantage of time series from floats that remained in a given province of the global ocean is of huge potential in model validation. The offered description is quite generic, which could have been done even using monthly climatological time series obtained from the WOA, or from the existing long-term observational ocean sites (BATS, PAPA, HOT). The BGC-Argo floats are an unprecedented source of multiple opportunities to do validation in several regions of the world ocean (with some limitations), but this present form of the manuscript does not offer any specific recommendation of what numerical modellers should do to unleash this potential. I would be very interested in seeing an exploitation of the multivariate nature of BGC-Argo, while I only see multi-panel plots.

P13L4-5 The authors should do more than simply say “correctly represented”. This is a subjective statement, which is based on a visual comparison, exactly what the community challenged in the last 10-15 years. The advantage is that now we can use a frequency of 10 days, when initially phenology analysis was based on monthly data. Again, the authors are missing an opportunity to demonstrate the intrinsic value of this new data set.

P13-L13-20 This is a more detailed analysis of this specific model, which indeed brings in some of the advantages of a multivariate data set. However, there is a combination of measured and derived variables, which are treated as if they were equivalent. Quite a few questions come to mind: Is there a possibility that there is artificial correlation in the derivation of the phosphate and silicate concentration? What is the error associated with the CANYON-B method? Which is the effective (measured) variable mostly responsible for the response of the other estimated nutrients? The reduced consumption occurs during the spring period, and is continued during summertime. Hence, there is a factor at play during the late spring period, which is less likely to be reduced uptake from smaller phytoplankton during summer as suggested. It may thus be a delayed onset of the phytoplankton succession, or maybe a faster remineralization occurring in the upper layers, which retain more inorganic nutrients closer to the surface. This may indeed be beyond the scope of the manuscript, but it has been the authors’ decision to propose some mechanistic explanations of this discrepancy. Showing a complete example of how the use of multivariate data allows modellers to investigate model deficiencies would offer guidelines to other modellers.

P13-L22-23 This sentence bears lots of assumptions. This is really where BGC-Argo can make a difference. The related uncertainties should however be highlighted, together with recommendations to other modellers on how to best approach the assessment of the carbon cycle metrics.

P13L26-29 This argument is flawed. If the occurrence of the peak is matched in the mesopelagic layer rather than at the surface, it is a clear indication of vertical mismatches in the export. I would thus argue that POC concentration is a proper metric for the export component of the carbon cycle. I would again encourage the authors to replace the use of subjective terms such as “consistent” with objective indicators (see Allen et al., 2007). For instance the comparison of the skill score computed in two consecutive years would give indication if there is some variability or if the model tends to repeat the same pattern.

P14L16-19 I would recommend more clarity on this statement. Are these sensors not available on the global ocean floats? It is not clear why this example is presented for Mediterranean floats, and not introduced earlier as one major advantage of the BGC-Argo floats.

P14L26-28 This sentence is similar to the statements done in the earlier sections. This is not technically a perspective statement.

P15L1-6 The question is whether these data should be used “on their own” or in conjunction with the other existing datasets. The authors should clearly explain in the conclusion why this dataset should be exploited as a separate unit.

P15L32-P16L3 I would thus recommend the authors to thoroughly address the issue of how the uncertainties should be treated. This is particularly important in the case of mixing measured and derived variables. If BGC-Argo are capable, within their limits, to reduce uncertainties in model assessment exercise, this should be adequately argued. The fact that there are more data available is undoubtedly of relevance, but I wonder if it does help to reduce uncertainties in model states.

P16L15-18 Please highlight in which part of the results this is shown.

P17L2 Please add in the caption the meaning of the codes (or a link to where they are explained more in detail). Also, in the heading of the 3rd column, correct Date with Data.

Figure 2 Taylor diagrams are based on geometric properties of the circle. Hence they should be presented using equal axes.
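The geometric property behind this comment can be written out explicitly. The notation below is an assumption introduced for illustration: $\sigma_m$ and $\sigma_o$ are the standard deviations of model and observations, $R$ is their linear correlation, and $E'$ is the centred (bias-removed) RMSD.

```latex
E'^2 = \sigma_m^2 + \sigma_o^2 - 2\,\sigma_m \sigma_o R,
\qquad R = \cos\theta
```

This is the law of cosines for a triangle with sides $\sigma_m$, $\sigma_o$ and $E'$, where $\theta$ is the azimuthal angle on the diagram. The distance between the model point and the reference point reads as $E'$ only if both axes share the same scale.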

References

Allen, J.I., Somerfield, P.J., Gilbert, F.J., 2007. Quantifying uncertainty in high-resolution coupled hydrodynamic-ecosystem models. Journal of Marine Systems 64, 3–14.

de Mora, L., Butenschön, M., Allen, J.I., 2016. The assessment of a global marine ecosystem model on the basis of emergent properties and ecosystem function: a case study with ERSEM. Geoscientific Model Development 9, 59–76. https://doi.org/10.5194/gmd-9-59-2016

Friedrichs, M.A.M., Carr, M.-E., Barber, R.T., Scardi, M., Antoine, D., Armstrong, R.A., Asanuma, I., Behrenfeld, M.J., Buitenhuis, E.T., Chai, F., Christian, J.R., Ciotti, A.M., Doney, S.C., Dowell, M., Dunne, J., Gentili, B., Gregg, W., Hoepffner, N., Ishizaka, J., Kameda, T., Lima, I., Marra, J., Melin, F., Moore, J.K., Morel, A., O’Malley, R.T., O’Reilly, J., Saba, V.S., Schmeltz, M., Smyth, T.J., Tjiputra, J., Waters, K., Westberry, T.K., Winguth, A., 2009. Assessing the uncertainties of model estimates of primary productivity in the tropical Pacific Ocean. Journal of Marine Systems 76, 113–133.

Little, H.J., Vichi, M., Thomalla, S.J., Swart, S., 2018. Spatial and temporal scales of chlorophyll variability using high-resolution glider data. Journal of Marine Systems 187, 1–12. https://doi.org/10.1016/j.jmarsys.2018.06.011

Lynch, D.R., McGillicuddy, D.J., Werner, F.E., 2009. Skill assessment for coupled biological/physical models of marine systems. Journal of Marine Systems 76, 1–3. https://doi.org/10.1016/j.jmarsys.2008.05.002

Monteiro, P.M.S., Gregor, L., Lévy, M., Maenner, S., Sabine, C.L., Swart, S., 2015. Intraseasonal variability linked to sampling alias in air-sea CO2 fluxes in the Southern Ocean. Geophysical Research Letters 42, 8507–8514. https://doi.org/10.1002/2015GL066009

Oreskes, N., Shrader-Frechette, K., Belitz, K., 1994. Verification, Validation, and Confirmation of Numerical Models in the Earth Sciences. Science 263, 641–646.

Pineiro, G., Perelman, S., Guerschman, J.P., Paruelo, J.M., 2008. How to evaluate models: Observed vs. predicted or predicted vs. observed? Ecological Modelling 216, 316–322.

Rykiel, E.J., 1996. Testing ecological models: the meaning of validation. Ecological Modelling 90, 229–244.

Rose, K.A., Roth, B.M., Smith, E.P., 2009. Skill assessment of spatial maps for oceanographic modeling. Journal of Marine Systems 76, 34–48.

Rose, K.A., Smith, E.P., 1998. Statistical assessment of model goodness-of-fit using permutation tests. Ecological Modelling 106, 129–139.

Russell, J.L., Kamenkovich, I., Bitz, C., Ferrari, R., Gille, S.T., Goodman, P.J., Hallberg, R., Johnson, K., Khazmutdinova, K., Marinov, I., Mazloff, M., Riser, S., Sarmiento, J.L., Speer, K., Talley, L.D., Wanninkhof, R., 2018. Metrics for the Evaluation of the Southern Ocean in Coupled Climate Models and Earth System Models. Journal of Geophysical Research: Oceans 123, 3120–3143. https://doi.org/10.1002/2017JC013461

Smith, E.P., Rose, K.A., 1995. Model goodness-of-fit analysis using regression and related techniques. Ecological Modelling 77, 49–64.

Thomalla, S.J., Ogunkoya, A.G., Vichi, M., Swart, S., 2017. Using Optical Sensors on Gliders to Estimate Phytoplankton Carbon Concentrations and Chlorophyll-to-Carbon Ratios in the Southern Ocean. Frontiers in Marine Science 4, 34. https://doi.org/10.3389/fmars.2017.00034

Vichi, M., Masina, S., 2009. Skill assessment of the PELAGOS global ocean biogeochemistry model over the period 1980-2000. Biogeosciences 6, 2333–2353.

Visinelli, L., Masina, S., Vichi, M., Storto, A., Lovato, T., 2016. Impacts of data assimilation on the global ocean carbonate system. Journal of Marine Systems 158, 106–119. https://doi.org/10.1016/j.jmarsys.2016.02.011
Citation: https://doi.org/10.5194/bg-2021-2-RC2
- AC2: 'Reply on RC2', Alexandre Mignot, 30 Jun 2021
  
  The comment was uploaded in the form of a supplement: https://bg.copernicus.org/preprints/bg-2021-2/bg-2021-2-AC2-supplement.pdf
  
  Citation: https://doi.org/10.5194/bg-2021-2-AC2

Peer review completion

AR: Author's response | RR: Referee report | ED: Editor decision | EF: Editorial file upload

ED: Reconsider after major revisions (11 Aug 2021) by Tina Treude

ED: Reconsider after major revisions (11 Aug 2021) by Ciavatta Stefano (Co-editor-in-chief)

AR by Alexandre Mignot on behalf of the Authors (01 Oct 2021)

ED: Referee Nomination & Report Request started (05 Oct 2021) by Tina Treude

RR by Anonymous Referee #3 (29 Oct 2021)

Suggestions for revision or reasons for rejection

The manuscript “Using BGC-Argo floats for the assessment of marine biogeochemical models: a case study with CMEMS global forecast system” by Mignot et al. proposes 22 metrics for the assessment of biogeochemical models and applies them to a single model. As such, the analysis is a welcome and comprehensive application of ocean BGC Argo observations, but is done in a vacuum without reference to previous or alternative modeling efforts. While this approach is fine from a technical report documentation perspective, it does not fit the standard of a scientific research paper. As such, it would seem more appropriate for “Geoscientific Model Development” than “Biogeosciences” in its present form. The null hypothesis for establishing that the model is “good” should be defined. Also, there are some really interesting discussions of the value and needs for BGC Argo observations in the conclusions that are completely unsupported by the body of the manuscript. If the authors want to bring some of this Appendix material into the manuscript body so as to support these conclusions (and leave the focus on just the current model), that would also be an appropriate means of turning the paper from a technical report on diagnostics into a scientific research paper. Finally, the paper includes a multitude of language mistakes which I have tried to rectify in my technical comments.
Technical comments:
1-16 – “a major tool” should be “major tools”
2-1 – “or” should be “and”. Also, is there a difference between “These metrics” and “The metrics” in the sentence before? If not, “. These metrics” should be “and”
2-3 – “suggest” seems an odd word here given that nearly all scientific papers display plots. Perhaps instead of “suggest” should be “recommend as a community standard”
2-7 – “had” should be “has”
2-14 – No, numerical simulations are not necessary “to monitor these ongoing changes”. Instead, the authors could say, “to contextualize monitoring of ongoing changes”
2-23 – remove “being”. Also, the attribution here with “mostly” is overconfidently placed on lack of BGC understanding. In many instances it is lack of understanding of the physics and lack of characterization of the forcing that are the bigger issues than the BGC parameterization.
2-25 – add comma before “and” .
2-30 – add “a” before “few”
3-3 – The list should reflect back to the same part of the sentence, not three different parts. If reflecting back to “to test their”, then it should be “to test their predictive skills, ability to reproduce BGC processes, and confidence intervals on model predictions” or if these are separate statements reflecting back to “their” and “to”, then “to test their predictive skills and ability to reproduce BGC processes and estimate confidence intervals on model predictions”
3-11 – “All these datasets neither have a” Should be “These datasets have neither”
3-12 – remove “can”
3-23 – “so far essentially sampled” should be “well sampled only”
3-24 – Add “the” before “regional”
3-24 – remove comma before “large”
3-25 – remove comma before “like”, and replace “or” with “and”
4-4 – “represent” should be “represents”. Also, while this statement may be true in terms of quantity of data for a few parameters, it is not true in terms of either accuracy or comprehensiveness. Just because you can derive an estimate of SiO4 from an O2 sensor does not mean the dataset is better than actually measuring O2 from a Winkler titration, much less using that value to extrapolate SiO4.
4-6 – I do not know what “interoperability” means in this context. Is it something about the inherent environmental variability, or the measurement uncertainty?
4-8 – I don’t know what “separately” is being used for here. Are the authors saying that they are initializing the model with “WOA/WOD” and then evaluating performance separately with BGC-Argo? Or that the initialization of the model is done independently from WOA/WOD and then both WOA/WOD and BGC-Argo are used for independent evaluation?
4-18 – The sentence “We expect that the methodology employed here (from the data handling to the use of assessment metrics) would be useful and informative for other research teams interested in model evaluation with BGC-Argo floats.” Belongs in the discussion/conclusions, not in the introduction.
4-27 – “them” should be “these metrics”
4-29 – “. These metrics” should be “and”
4-31 to 5-5 – Again, the sentences beginning “Further, our validation framework could...” to “… is not addressed in this study” Belongs in the discussion/conclusions, not in the introduction.
4-33 – This sentence needs a lot of work. The authors could try adding “have” before “demonstrated”, “flux” before “calculation” and “the” before “basin”, remove the commas and remove “of mass fluxes and process rates” and see if it makes sense.
5-3 – Use of the word “arduous” seems odd here. Whether something is hard to do is not necessarily relevant. More relevant is whether the effort is warranted… would it be too uncertain so as not to be robust?
5-7 – “follow: s” Should be “follows. S”
5-26 – “variable” should be “variables,”
5-32 – It would be helpful to cite the WCRP standard here for essential climate variables, e.g. Bojinski et al., 2014, “The concept of essential climate variables in support of climate research, applications, and policy”, BAMS
6-1 – Unclear what is intended for “highest” here. Is it “highest quality” or “highest density” or something else?
6-11,12 – “points” should be “point”
6-18 – So this means that the low values are biased high as the chance of a low positive value includes the possibility of the value being zero. How big is this problem? What fraction of the data had to be adjusted to zero?
6-23 – “floats” should be “float”
6-24 – add comma after “salinity”
6-25 – there should be a statement here on the carbon system data source that is used for the training of the algorithm… eventually the skill has to be traced back to the GLODAP or other data source.
6-34:7-2 – The authors should note that whether or not it is “reasonable” to draw these conclusions is also entirely reliant on both the BGC Argo data and the model capturing the underlying environmental variability.
7-8 – remove comma
7-10 – remove “it” after “and”
7-19 – what is the advantage, if there is one, of saving only weekly and then recreating the daily values with interpolation? Is this to speed the model or otherwise reduce data size?
7-26 – remove “values”. Again, is there an advantage of calculating output offline? Are CO2 fluxes calculated online and saved out? Perhaps it would be better to move this to the next section where the CO2 flux calculation is discussed.
7-32 – add “space to” between “and” and “the”
8-3 – The bias in MLD is provided, but what is the average MLD that would allow me to know the % bias?
8-28 – This is a strange phrasing. It sounds from this that acidification does not impact the subsurface down to 200 m, only the “surface” and the 200-400 m range… Why not just say that acidification is expected to have its largest impact in the upper 400 m and then separately that the present analysis chooses the 200-400 m range of Kwiatkowski? Presumably the surface and 200-400 m ranges are shown to highlight different signals rather than to suggest the area in between is unimportant. This should be clarified.
9-12 – I would replace “first level” with “most simple but indirect level”, 9-13 – replace “of” with “associated with” and 9-17 – replace “second level” with “more process level” since the “second level” isn’t being pursued.
9-22:9-26 – A brief statement and reference on the motivation for providing these mesopelagic estimates is warranted. Also, is there a reference or other rationale for this choice of varying depth range? This MLD-1000 m variable depth definition would seem to include the part of the euphotic zone below the mixed layer as “mesopelagic”, at least during the growing season. I would have thought the area below the mixed layer within the euphotic zone to look more like the surface than the mesopelagic, or “twilight zone”, a constant 200-1000 m range would have been easier to interpret, particularly against the 200-400 definition for pH.
9-33 – remove second “processes” and end sentence after “production”
10-10 – This sentence is very misleading. The vertical supply of NO3 to the surface is accompanied with remineralized DIC which is the reverse of the biological carbon pump. This sentence should be reworded.
20-32 – Why define a biased average for O2 300? Shouldn’t the average oxygen between 250-300 m be referred to as O2 275? Why not use the same 200-400 m definition as pH? Or 250-350 m?
11-2 – Similarly, why define O2 1000 as O2 950-1000? Should this be O2 975 or, alternatively, be defined as 950-1050 m? Do the floats only go down to 1000 m? This would seem a reasonable justification if it were the case, since gradients at this depth tend to be weak, but it still wouldn’t explain the odd 250-300 m definition.
12-12 – “on a climatological level” should be “as a climatology”
12-13 – what is the purpose of “etc..”? “imposes” should be “requires”
12-14 – “in a climatological way” should be “as a climatology”
12-19 – Why is “Biogeochemical-Argo Planning Group, 2016” in parentheses here? Was this means of gridding a recommendation from this group? If so, please be explicit.
12-21 – “clarity” should be “clarity in visualization” or “simplicity in visualization”
12-26 – Add “While” before “Taylor” and replace “but” with a comma in the next sentence. That would make it clearer that you are introducing a new topic rather than simply revisiting how great the first three presentation methods are.
12-33 – “for” should be “in”
13-7 – The sentence “Examples of the diagnostic plots described in section 4 in combination with the metrics defined in Section 3 are shown.” seems redundant with the orientation statement in the introduction section and should be removed.
13-16:13-25 – The null hypothesis that the reader should use to define “well represented” is not clear. Isn’t much or all of this fidelity due to the initial condition derived through the assimilation? I am not sure what to take from this. Is there an “unassimilated” version of the model with which the assimilation should be compared? Or a previous generation model? Or other non-assimilative models such as CMIP6? Or is the objective just to show the broad contrast in pattern agreement between model and observations across variables? Why is pH so poorly predicted?
14-1:14-4 – This discussion of the value of Taylor diagrams is very superficial and somewhat misleading. The presentation here certainly shows what patterns and variability in different variables are relatively well reproduced, but whether this should inform future model development priorities entirely depends on the intended use of the model and associated requirements. Further, the most common scientific use of Taylor diagrams is the comparison of the same metric across models so that one can quantify the improvements.
14-25 – Without a frame of reference, it is not at all clear whether the model is good or bad. Like in the case of the Taylor diagrams, it seems like the analysis is being done in a vacuum without any awareness of other modeling efforts. There is also the lack of appreciation of the satellite derived estimate for this metric.
18-18 – The conclusions “Here, we showed that the spatial maps of model-observations comparison are also informative a posteriori, with respect to the network design, as they highlight sensitive areas where BGC-Argo observations are critical and where sustained BGC-Argo observations are required to better constrain the model. These maps correspond to the regions where the model uncertainty (see RMSD spatial maps in Figs. A22-A44) is the highest, i.e., the Equatorial belt with respect to the carbonate system variables, the Southern Ocean with respect to the nutrients and the DCM variables, and the western boundary currents and OMZs with respect to oxygen.” are very interesting scientific research conclusions, but they are not discussed at all in the body of the manuscript. This is totally unacceptable. The paper cannot bring in unsupported information at the conclusion stage referencing Appendix material. The authors need to show this or restate these conclusions as hypotheses for future work.
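
Two of the comments above come down to simple arithmetic: comment 8-3 (a bias is only interpretable as a percentage once the mean observed MLD is known) and comments 20-32 and 11-2 (an unweighted average over a depth layer is most naturally labelled by the layer midpoint). The following is a minimal sketch of both points. The function names and all numbers are hypothetical and are not taken from the manuscript or from any BGC-Argo processing code.

```python
def percent_bias(model, obs):
    """Mean bias (model minus observation) as a percentage of the observed mean."""
    if len(model) != len(obs) or not obs:
        raise ValueError("model and obs must be non-empty and of equal length")
    mean_obs = sum(obs) / len(obs)
    bias = sum(m - o for m, o in zip(model, obs)) / len(obs)
    return 100.0 * bias / mean_obs


def layer_mean(depths, values, top, bottom):
    """Unweighted mean of the values with top <= depth <= bottom.

    Returns the mean together with the layer midpoint, which is the depth
    an unbiased label for the average would refer to.
    """
    selected = [v for d, v in zip(depths, values) if top <= d <= bottom]
    if not selected:
        raise ValueError("no samples inside the requested layer")
    return sum(selected) / len(selected), 0.5 * (top + bottom)


# Made-up mixed layer depths (m): a 10 m bias on a 50 m mean is a 20 % bias.
print(percent_bias([55.0, 65.0], [50.0, 50.0]))  # 20.0

# Made-up oxygen profile: the 250-300 m average is centred on 275 m.
depths = [200.0, 250.0, 275.0, 300.0, 350.0]
oxygen = [180.0, 170.0, 165.0, 160.0, 150.0]
print(layer_mean(depths, oxygen, 250.0, 300.0))  # (165.0, 275.0)
```

The same absolute bias would correspond to a much smaller percentage over a deep winter mixed layer, which is why the mean value is needed alongside the bias.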


RR by Marcello Vichi (31 Oct 2021)

Suggestions for revision or reasons for rejection

I have been positively impressed with the authors’ revision of the previous manuscript. This new version and their answers to my comments have convinced me of the value of the exercise. The aims are now much clearer, which allowed me to focus more on the context of the presented results. I therefore have some further, more specific comments, which I would like the authors to address before the manuscript can be published.

General comments

1. There is some need to further strengthen the concept of why the BGC-Argo data should be considered the most appropriate reference dataset for global model assessment, and how they relate to the other existing datasets (especially satellites, which are going to be superior for evaluating surface chlorophyll than BGC-Argo; see my comment 4 below). There is little doubt that the BGC-Argo program will become a reference climate data record in the longer term. Maybe the authors should provide some clearer recommendations to the readers in their final section. As it stands, the conclusion section appears truncated, with a series of comments that one would mostly expect in a report rather than in a journal article (see in this regard my comment 3 below).
2. Section 3 is still confusing. I apologise to the authors if this is due to my own limitation, but I feel there could be other readers raising questions like mine. Somehow, the previous version of the manuscript was clearer, although I realize that this may be a consequence of all the other changes in this revision. I would suggest the authors be clear with their definitions. They now indicate that 22 metrics can be extracted from the BGC-Argo datasets, but they do not explain clearly that these metrics have been grouped according to key components/processes of marine ecosystem functioning (i.e. the 4 sub-sections presented in Sec. 3). This grouping is evident in Table 2, but the text is unsatisfactory. The confusion is further augmented by naming one of the key processes “Oceanic pH” (one of the metrics) instead of “Ocean acidification”. The authors say: “The metrics are described below”, but actually they first describe the processes, and then how the metrics derived from the BGC-Argo data can be used to quantify these processes. They should also explain why certain metrics are included in one grouping rather than another. For instance, the surface partial pressure of CO2, which is essential for estimating the air-sea flux, can be computed from pH and DIC, which have been included in two different groups. It is true that inorganic carbon is linked to both the physical solubility pump and the biological carbon pump, and this ambiguity should be recognized.
3. I am (now) aware of the main intent of this manuscript. However, more effort should be put into demonstrating that this exercise is a contribution to the literature on global biogeochemical models and their assessment, rather than a report that could have been produced by CMEMS as part of their operational endeavour. For this reason, I would recommend that the authors improve their description of results, which is often written as a dry reporting of the model discrepancies. This is instead well done in Sec. 6, which is now very clear and combines the demonstrative aims with the provision of some directions for future research and/or analyses. I have given some more specific comments in the next section.
4. The authors rightly claim the uniqueness of this data set as well as its multivariate nature. However, this is not always put into practice in a demonstrative sense. I am particularly critical of Section 5.c, in which surface Chl is presented as an example of the maps. Why use sChl as the demonstrative metric? This field is far better represented in terms of temporal frequency and spatial coverage by the satellite record and I’m sure the authors would recommend modellers to use this product for their validation. I am also sure satellite Chl has been used thoroughly before making the CMEMS model publicly available as a shared product. I question the decision not to include in the main text one of the other variables that would not be available without the BGC-Argo dataset and CANYON-B. They are in the Appendix, and to me are far more informative than sChl. As a modeller, my main question when reading this section is not how relevant the BGC-Argo dataset is to assess model performances, but rather how surface Chl from that dataset compares with the satellite record. This issue also applies to the results presented for the Atlantic time series in Sec 4.d. Why choose variables that have previously been used to assess models (nutrients and chlorophyll), instead of selecting new variables such as pH, DIC and POC, which would definitely give information on the processes of interest? In this case, these figures are not provided in the Appendix, which is a missed opportunity to demonstrate one of the main aims of this paper. If this is done because these results are not very good, then it is even more worrying.

Specific comments
P1 L20-22 This is a generic statement for an abstract. The same can be said of BGC-Argo data, since rates are also not directly measured

P2 L7 Has taken

P3 L7 All datasets are incomplete and have limitations, including the BGC-Argo

P3 L30 Please explain why these AI methods cannot be applied to the other datasets

P4 L4 The dataset represents

P5 L2-L5 This sentence is unclear. What does it mean to be arduous? Is it a problem with the data set? Should the readers abstain from attempting it because it would not be possible? This sentence would be understandable if further discussed in the conclusions. As it stands, it seems the authors are justifying themselves for not having done it.

P5 L19-20 There is a need to clarify from the beginning which are the variables directly measured with the on-board sensors of the BGC-Argo devices (primary variables?) and which ones are further derived (secondary or derived variables?). This would help in understanding the authors' definition of metrics. This is further complicated in the remainder because some variables are a combination of derived and measured products (pH, NO3), and it is not always clear what the percentage is (for instance, in Sec. 5.d).

P7 L1-2 This is a very relevant addition. However, I do not see this concept further used in the presentation of the results. It is for instance not discussed when showing the RMSD in Fig. 5 and 6.
P7 L20-22 This sentence is not connected with the following. It is customary to use the lower time frequency or coarser spatial resolution when comparing data and models (as done with the spatial maps in the results section). Why did the authors decide not to use weekly averages of the Argo data?

P7 L25 to match

P7 L30 Was this done using the daily interpolation?

P12 L13 I would suggest using sparseness rather than scarcity. Argo data are still scarce.

P12 L21-22 Unclear sentence. Does it mean that showing this would confound the reader?

P13 L14 This section is presented as a technical report. I would recommend adding a few more sentences that point at the relationship between the metrics and the processes in section 3. For instance, when referring to the oxygen levels, make an explicit connection with sec. 3.a, and the same with the other variables. I think the value of the message would be further enhanced if there is a more direct connection between Sec. 3 and Sec. 5. The demonstrative aim of the manuscript is clear, but because there is no discussion section it would help to have some additional comments. Many questions arise, for instance, why Chl performs badly while nutrients don’t, while DIC is also good and spCO2 and spH are similarly worse? I am not asking the authors to offer full explanations, since this would be beyond the scope of the work, but rather to indicate that the BGC-Argo data help to highlight these discrepancies, which would not be possible with other datasets.

P13 L20 close to the

P14 L2-4 This is another sentence that would be improved through references and linkages to Sec 3.

P14 L12 as well as

P14 L16 There is also a lack of sensitivity in the model for very low oxygen regions close to 0 µmol kg-1. The model can have any number between 0 and 30 µmol kg-1 when observed values are close to 0 µmol kg-1. The feature reported in the text is relevant, but the number of data points is not very high, while the discrepancy around zero has a higher data density.

P14 L17 Cape Verde (https://en.wikipedia.org/wiki/Cape_Verde)

P14 L29 Figure 1 shows data counts, not Chl patterns. Please clarify.

P14 L31 Please explain the meaning of coherent. This should not be the first time this model is assessed against surface chl from satellites.

P14 L34 This is another comment that I would expect in a report. My understanding is that the aim of the work is to highlight what can be learned from the use of BGC-Argo data that is not possible with other datasets (e.g. satellite data).

P15 L6 Is there a reason for using quotation marks for the spring bloom?

P15 L18-22 This is another dry sentence used for a major misestimation, which would require some more context or a brief discussion. The percentages are extremely high. I am not questioning the model quality, rather the value of offering interpretations based on the assessment exercise.

P15 L28 Please indicate if these percentages are satisfactory with respect to the reference uncertainties indicated in the methods (P7 L1-2 and previous lines). This comment also applies to the previous point.

P15 L30-32 It would be helpful if the authors could add some comments on how the multivariate data from BGC-Argo allow models to be constrained in a way that was more difficult 15 years ago, when data were sparse. Consider for instance Vichi, Masina and Navarra (2007), in which all possible existing data were used to assess a global ocean BGC model. There is no need to add this reference, it’s just one of the examples of how model assessment has been done in the literature.

P17 L10 I suggest using “limited” instead of “lack”

P17 L11 Increased number is not the only advantage. They are coherent, consolidated and sustainable. They could become equivalent to the concept of climate data records used for satellite data.

P17 L20-21 I would suggest referring to the processes presented in Sec. 3

P21 Table 2 Please clarify in the section text if the definition used here is the same for both the model and the data

Fig. 4 and all the maps. It would be very helpful to add the maximum and minimum values of the range in the colorbar, to better understand the spread of data values
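
Comments P7 L20-22 and P7 L30 above contrast two ways of matching weekly model output with float observations: interpolating the weekly model fields to the day of each observation, or averaging the observations onto the coarser weekly grid. The sketch below illustrates both under simple assumptions (time expressed in days since an arbitrary origin, linear interpolation, weeks defined as consecutive 7-day bins). The function names and numbers are hypothetical, and neither function is claimed to reproduce the procedure used in the manuscript.

```python
from bisect import bisect_right
from collections import defaultdict


def interpolate_to_day(times, values, day):
    """Linearly interpolate model output, saved at sorted `times`, to `day`."""
    if not times[0] <= day <= times[-1]:
        raise ValueError("day lies outside the model output window")
    i = bisect_right(times, day)
    if i == len(times):  # day coincides with the last saved time
        return values[-1]
    t0, t1 = times[i - 1], times[i]
    w = (day - t0) / (t1 - t0)
    return (1.0 - w) * values[i - 1] + w * values[i]


def weekly_means(obs_days, obs_values, week_length=7):
    """Average observations falling in the same week (bin index = day // 7)."""
    bins = defaultdict(list)
    for day, value in zip(obs_days, obs_values):
        bins[int(day // week_length)].append(value)
    return {week: sum(v) / len(v) for week, v in sorted(bins.items())}


# Weekly model output (made-up values) interpolated to mid-week.
print(interpolate_to_day([0, 7, 14], [1.0, 2.0, 4.0], 3.5))  # 1.5

# Made-up observations on days 1, 3, 8 and 13, averaged per week.
print(weekly_means([1, 3, 8, 13], [2.0, 4.0, 10.0, 20.0]))  # {0: 3.0, 1: 15.0}
```

The first option keeps every observation but compares it with a smoothed model value. The second compares like with like in time, at the cost of fewer matchups.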


ED: Reconsider after major revisions (03 Nov 2021) by Tina Treude

ED: Reconsider after major revisions (03 Nov 2021) by Katja Fennel (Co-editor-in-chief)

AR by Alexandre Mignot on behalf of the Authors (21 Dec 2022) Author's response Author's tracked changes Manuscript

ED: Referee Nomination & Report Request started (04 Jan 2023) by Tina Treude

RR by Anonymous Referee #3 (06 Jan 2023)

ED: Publish subject to minor revisions (review by editor) (30 Jan 2023) by Tina Treude

ED: Publish subject to minor revisions (review by editor) (30 Jan 2023) by Katja Fennel (Co-editor-in-chief)

AR by Alexandre Mignot on behalf of the Authors (06 Mar 2023) Author's response Author's tracked changes Manuscript

ED: Publish subject to technical corrections (08 Mar 2023) by Tina Treude

ED: Publish subject to technical corrections (08 Mar 2023) by Katja Fennel (Co-editor-in-chief)

AR by Alexandre Mignot on behalf of the Authors (09 Mar 2023) Manuscript

Short summary

Numerical models of ocean biogeochemistry are becoming a major tool to detect and predict the impact of climate change on marine resources and monitor ocean health. Here, we demonstrate the use of the global array of BGC-Argo floats for the assessment of biogeochemical models. We first detail the handling of the BGC-Argo data set for model assessment purposes. We then present 23 assessment metrics to quantify the consistency of BGC model simulations with respect to BGC-Argo data.