The authors have included most of my comments and I understand the reason for not expanding the study to real observations. At the same time, I think that DA results obtained from synthetic observations may not generalize well to real observations for reasons pointed out in my last comments.
# General comments
At this point I have mainly one major comment that is regarding the accessibility/readability of the manuscript text. The manuscript is difficult to follow in places and at times I had to reread sentences several times to understand their intended meaning. Some sentences are convoluted and difficult to follow, for example, "Also, this indicates that the improvement in correlation estimation in the offline experiment of Fig. 3 translates (at least on average) into the online testing period, where the updates of a given DA cycle feed into subsequent cycles." (l. 658) Here, it would be helpful to put the main message of the sentence front and center. Other sentences seem to just restate what was described before: "This exceeds a value of 1 as we are measuring expected error of ensemble members, not error to the ensemble mean, as described in Sect. 3.4.1. ... This means that when we calculate the error of each ensemble member, rather than the error of the ensemble mean, we get this difference in error." (l. 640). There are also instances where it is still not entirely clear to me what is meant: "However, as will become clear in Sec. 4.3, this strategy for assigning importance or priority to variables in the DA scheme does not necessarily correspond to the dynamical importance of a variable in the model, highlighting the need for specific results." (l. 578) Here, I am not sure what exactly is meant by "specific results", and the sentence again seems a bit convoluted. I have highlighted other instances below; overall I would encourage all coauthors to go through the text again with fresh eyes and rephrase sentences that may be difficult to understand for readers, including those without extensive knowledge of BGC DA.
# Specific comments (line numbers are based on the tracked changes version)
L 99: "non-linear/flow-dependent relations": This statement seems to imply that flow-dependent is equivalent to non-linear. Because this term is used throughout the manuscript, it would be useful to define and explain it briefly.
L 101: "Thus, we are not attempting to emulate or improve BGC models via ML, ...": But of course the goal is to improve the estimates of BGC models -- perhaps not directly, but indirectly through improvements to the DA system. I would suggest adding a "directly" to this sentence, as in "Thus, we are not attempting to emulate or improve BGC models directly via ML, ...".
L 103: "problem of propagating information from observed to unobserved variables in single-model deterministic runs": As opposed to probabilistic multi-model ensembles using different types of BGC models? I think the sudden emphasis on single-model is confusing -- also does not the model itself propagate information from observed to unobserved variables? I would suggest rephrasing the description of this problem in the context of DA.
L 118: "BGM" -> "BGC"
L 118: "end-to-end learning": Please introduce or define this term.
L 123: "the existing univariate scheme": add "DA" to make it clear to the reader what scheme this statement is referring to. Here, too, would be a good place to mention that the scheme is 1D only.
L 220: "Offline refers here to a setup in which the ML-OI analysis is not then used as the initial condition for the next forecast": This is the first mention of "ML-OI" and it is not explained or defined. Either introduce the term or use "DA" instead of "ML-OI" in this sentence.
L 251: It should be plural: "state vectors".
L 254: "While the state vector only considers surface values (0D), the entire DA process itself is 1D, using an idealised structure function to spread increments uniformly throughout the mixed layer.": Mention here somewhere that the mixed layer consists of more than one GOTM grid cell; otherwise, it appears like a handwavy argument that 0D is really 1D because it represents a column of water.
L 267: "We assimilate only total chlorophyll, y, at the surface.": Why introduce the symbol y here? It is not used in the following sentences and makes it seem like y is used to from hereon to denote "chlorophyll" when it is really meant to represent a chlorophyll observation. I would suggest introducing y when it is first used, which may be in Eq. 6.
L 244: "x^b is the background state (which in this work is equivalent to a seven-day forecast state)": The phrase "seven-day forecast state" suggests a vector containing entries for 7 days of model states. Is that the case?
L 247: What does it mean that P^b is "appropriately flow-dependent"? The authors just mentioned that P^b is of special interest in this study, but then this sentence does not provide a good explanation why it is of special interest. I would suggest including more information.
L 248: Here would be a good place to mention that the DA schemes described later use different ways to update other variables, but that all schemes use the same technique to update PFT chlorophyll which is described in the following.
Eq 4. and 5.: The mathematical notation here is not ideal as \chi and \zeta are each meant to represent 4 different symbols, one for each PFT. It is not clear, for example, that the x^b_\chi / x^b_\zeta ratio is based on the same PFT.
L 253: What does the "potentially" imply in this statement?
L 323: "long training EnKF run": Does this mean it was trained or run for a long time? It is unclear to the reader why "training" is used here.
L 368: "ANN" has not yet been defined.
L 400: "while making a bold assumption": This phrasing is not very scientific and suggests that the assumption is likely false.
L 405: This text is difficult to follow and I do not understand the jump to variances here. The previous sentences described the way correlations are computed and normalized and now there is a jump to variances and a statement about variances and how they can be estimated more reliably than correlations. How does this help when estimating the correlations?
L 548: Is the light limitation due to the deep mixing or low seasonal light level? If it is the seasonal light level, mention this point first.
L 615: "Though, it is clear that the correlations of the ensemble still vary more strongly than either other method." It is not the methods that vary but their correlation estimates: "Though, it is clear that the correlations of the ensemble still vary more strongly than the climatological and ML-based correlation estimates."
L 619: Contrary to the statement, based on Figure 3, it looks like both the climatological and ML estimates can capture the "moderately negative correlation" state. What am I getting wrong?
L 622: "... the DA updates are also small and have very little impact. This means that any improvement or degradation in correlation estimates at this time of year are less likely to result in any great improvement to the system, as there is weak relationship between chlorophyll and nitrate DA increments.": This last sentence is confusing: the "This means" links the lack of great improvement to the small DA updates in the previous sentence. However, the second part of the sentence beginning with "as there is weak relationship" links the lack of great improvement to the weak relationship between chlorophyll and nitrate increments. What is the main reason? Please rephrase.
L 730: "It also slightly degrades the ammonium and silicate concentrations": What is meant by degrading concentrations?
L 941: I can see why one would choose synthetic observations in some initial DA experiments but I would not consider it a "necessary" first step. In fact, as I have mentioned in my last comments, synthetic observations may be too idealised to provide guidance for DA experiments with real observations. I would suggest removing "necessary" here. |
The manuscript presents machine learning-based techniques aimed to improve biogeochemical data assimilation, by speeding it up or increasing its performance through increments to unobserved variables. The results of this study are compelling but the data assimilation setup is highly idealized, limiting its significance.
General comments
Machine learning (ML) techniques will likely gain more and more traction in marine biogeochemical data assimilation applications. The authors present an interesting study with compelling results, but I worry that the choice of using synthetic data limits strongly how much we can learn about how to implement ML-assisted DA in less idealized situations. The allure of synthetic data is apparent: there is a true model state, we know all about this truth and can thus examine DA improvement even regarding unobserved variables. However, in this study, the synthetic observations were generated from a nature run that uses the same biogeochemical and physical model as the DA systems. Thus, the setup is highly idealized, and the ML techniques can learn correlations between variables that directly relate to the dynamics that generated the observations. Consequently, the ML-based DA strongly benefits from accurately improving (some) unobserved variables. But is this still an advantage if the true unobserved variables do not follow model dynamics? I would suggest a follow-up experiment in which the DA techniques are confronted with "real" data. Various data are collected for the L4 station, satellite data could be used. Do the improvements in chlorophyll forecast skill seen for the synthetic data translate to improvements for real data compared to the more standard DA approaches?
The authors single out zooplankton as unobserved variables that "do not update well" (Sec 5, par 3). However, it could be that this issue with zooplankton updates is due to the way the biogeochemical model is parameterized. Depending on the way zooplankton grazing, phytoplankton mortality and other processes are parameterized in the model, phytoplankton/PFTs and zooplankton can be more strongly or weakly linked. This effect could mean that the result that zooplankton, specifically is difficult to estimate, does not generalize to other model configurations. Yet, I agree with the authors that there can always be some variables that are difficult to estimate, perhaps even more so in a DA setup without synthetic observations.
When I first read the title, I was expecting an approach to better inform 3D biogeochemical models with observations. The abstract then mentions that 1D models are used -- which seems like a good choice given the complexity of 3D models. However, when I read that the DA was basically performed in 0D, I felt that an opportunity was missed to examine if ML approaches can yield improved spatial increments. Variational DA requires the specification of length scales, ensemble-based approaches typically rely on localization to limit the DA update spatially, but ML approaches could potentially learn how to best spread the increment spatially. But by eliminating all spatial dimensions from the DA and simply applying any DA update throughout and only in the mixed layer, this important DA aspect is not examined in this study at all. Why are only 0D increments used in this study, how much additional effort would be required to create a full 1D update?
Depending on the setup (and more obviously in a DA configuration with real observations), the surface-only DA approach could lead to worse performance of the DA systems. The study does not provide many specifics about the ensemble generation or the nature run, but if the model state below the mixed layer is sufficiently different in the data-generating nature run and the simulation used in the DA, then there is an error source that the DA system (with or without ML) cannot adequately correct. An example: Higher nitrate (or higher DOM, perhaps simply more C or N) below the mixed layer in the DA compared to the nature run could result in continuous overestimation and subsequent downward corrections of nitrate, chlorophyll and zooplankton in the mixed layer without addressing the underlying issue of too much nutrient input. This type of scenario could also manifest itself in zooplankton drifting away from the truth despite a series of beneficial corrective DA updates.
The authors trained their ML models in one location and show that this approach yields comparatively bad results when transferring the models to a new location. Here it would have been interesting to train the models on data from 2 or more (sufficiently different) locations to examine if these more generalized models would (1) perform similarly well on one of their training locations as the specialized models and (2) if the generalized training would make the models more transferable to new locations. This is likely beyond the scope of this study, but perhaps the authors could discuss this point.
As mentioned in the manuscript, the DA update in Eq. 1 represents the best linear unbiased estimator (BLUE) and the authors kept the ML-based DA approaches in the linear framework by estimating elements of Eq. 1. But one could imagine going beyond that and estimate nonlinear relationships between model state, observations, and increment, for example using neural networks. In how far do the authors think the DA framework presented here could be improved further using other ML techniques?
Specific comments (based on the preprint document, which does not have numbered lines)
P 1, par 1: I would suggest changing "and lead" to "which can lead to".
P 2, par 1: "size-class chlorophyll [...] and other types of in situ data": This makes it sound like size-class chlorophyll is an in situ data product, when it typically is based on satellite estimates.
P 2, par 2: "The multivariate updates can happen in the DA step, through ensemble-informed background covariances (as in the EnKF), or, through balancing schemes, such as the scheme of Hemmings et al. (2008) based on nitrogen mass conservation ...": In variational DA, multivariate updates also happen through the tangent-linear and adjoint models. However, this type of update has issues as well and can lead to unrealistic updates because there are typically no prescribed covariance terms between different variables. For example, in Mattern et al. (2017; DOI: 10.1016/j.ocemod.2016.12.002), the authors had to reduce the updates to unobserved nitrate in order to avoid unrealistic nitrate accumulation at the ocean surface.
P 3, Sec 2.1: "It provides a sufficient balance between realism and computational cost": How "sufficient" a 1D model is, may be very dependent on the application. I would suggest motivating it a bit better based on the goals of this study.
Eq. 1: There is no error here, but it looks like the Δx was meant to be included in Eq 1.
P 5, Sec 2.4: "x^f is the background state": This notation is used in many studies but maybe for some who do not know it, either motivate the superscript "f" by mentioning "forecast" or change it to "b" to match "background".
P 6, par 1: "In this work, the state of the system, both x^f and x^a, comprises the surface values of most pelagic variables in ERSEM.": Based on the abstract and previous text, I was expecting a 1D DA system, and not 0D. Reading a bit further, I see that this is not simply a 0D update, but that the full vector is updated but only within the mixed layer
Eq. 4: Why is the summing of the 4 chlorophyll variables to total chlorophyll not included in H? That is, why doesn't H have the form [1, 1, 1, 1, 0, 0, ...] or similar, where the four non-zero entries are associated with the chlorophyll variables?
Eq. 7: Mention what "R" is here.
Table 2, line 3: I would consider a phrase like "ensemble-based" more useful than the "itself".
Eq. 9: This may be a Latex issue, but the "COR" looks very much like "CO*R" with the R from the previous equation. Why not use a lower-case rho, which is often used to denote correlations?
Eq. 9: Are these properties obtained from the long simulation or prescribed some other way? Reading on, I see that COR_i,c is being estimated using the ML techniques. This could be made explicit here.
Sec 3.2, par 1: "to predict the state-dependent correlations between observed and unobserved quantities in Eq. (9)." I would suggest adding the symbol (currently COR_i,c) so the reader knows immediately what is being estimated. Furthermore, I would suggest using "estimate" rather than "predict", as this is not done in a forecast sense.
Sec 3.2, par 1: "is scaled according its climatological maximum": Does this mean that COR_i,c is estimated with data from all locations, and it is then rescaled for each location? A better description and some more information would be useful here. Apparently COR_i,c is location-specific, is some time-dependence assumed, if so, what kind? How many parameters need to be estimated here? Stating these basic facts early on helps the reader understand the main idea before going into implementation details.
Sec 3.2, par 1: What motivates the use of the OI name here?
P 9, par 2 "In ML-EtE, we assume that the essential properties of each statistical object that creates an analysis increment, such as covariances and observation uncertainty, can be more effectively captured by directly predicting the analysis increment rather than predicting every component individually and allowing errors to compound across multiple independent predictions that are then combined into a single value.": This sentence is too long and difficult to understand. Also, "each statistical object that creates an analysis increment, such as covariances and observation uncertainty" sounds like the covariances create an analysis increment and also the observation uncertainty creates an analysis increment. Please rephrase. But furthermore, if I understand this sentence correctly, it seems to argue for directly estimating the Kalman gain matrix rather than its components?
Sec 3.4, par 1: "(1) a set-up where we choose to update only nitrate": I presume chlorophyll is updated as well? Maybe rephrase to "a set-up where the only unobserved variable that is updated is nitrate".
Sec 3.6.1: What are "cycles" here, and is the trajectory more than just a snapshot? Based on Eq. 10, one could assume a cycle is a time step.
Sec 3.6.1: "The expected RMSE": Why not call it the "ensemble average RMSE"?
Fig. 2: I think it is counter-intuitive to have nitrate shown in green and chlorophyll in black. I would suggest switching the colors. Also counter-intuitive, but maybe to a lesser extent: the light-limited time-period is shown in the brightest color. Finally, why show an unspecified "arbitrary" year instead of the climatology in the top panel, when the correlation is based on climatological values?
P 12, par 1: Make it explicit that these RMSE values are for the correlation estimates.
P 12, par 3: Explain better what the terms offline and online are referring to here. I would suggest introducing the offline term at the very start of the section when the experiment is introduced.
P 12, par 3: "so that any update to the system can have dynamical impact on later DA cycles as the model": This sentence ends abruptly without a period.
Fig. 4: The "relative ratio" label is not helpful, "relative to observational error" would be better. The title says "Nitrates"; there is a syntax error in the unit label of the second panel -- which also appears in Fig. 5.
Fig. 4: Why is the performance of the EnKF seemingly much worse than the RUS scheme for observed chlorophyll? Mention this in the text.
P 13, par 1: "as a percentage of the observation error": Are these truly percentage values?
P 13, par 1: "error. exceeds": There is an issue with the sentence.
Fig. 5: I find it difficult to see improvement in the plots, but perhaps I haven't stared at them long enough. I would think it would be more informative to show "forecast-truth" and "analysis-truth": it would clearly show when improvement occurs ("analysis-truth" closer to zero) and analysis-forecast is simply the space between the curves (which could be shaded in the figure).
Fig. 5 and others: Why aren't EnKF results shown for comparison?
P 17, par 1: "so values shown in Fig. 7 are RMSEs for 7-day forecasts relative to the RMSE of the RUS method": But Fig. 7 only shows a one value for each RMSE value, are these averages across several 7-day forecasts? Please explain better what is shown.
P 19: "(Sect. ??)"