The authors have improved the manuscript and addressed my initial comments. The addition of coastal regions in their neural network (NN) model evaluation is very helpful, and it provides better support for the reported results of the NN estimates. The revised manuscript is also easier to follow, and only minor revisions are needed to improve the writing in places.
In parts of the manuscript, one of my earlier criticisms still applies; some sentences remain difficult to understand. I especially had trouble with the two NN prediction approaches that are not clearly distinguished from one another in Section 2.4. Having read through the full manuscript, I think I now understand the two approaches, but when I initially read Section 2.4, I was confused by the description. There are a few more instances where sentences are confusing or misleading for a first-time reader; please see my specific comments for details.
The revised manuscript contains many large figures; some of them could be moved to a supporting information document. For example, the open ocean and coastal region raw-value chlorophyll-a figures (11 and 13) are shown alongside anomaly figures (12 and 14), each taking up a page. Here the authors could decide to focus on either the raw values or the anomalies, or to show only a subset of the regions.
# specific comments (line numbers are based on the revised manuscript, not the tracked changes version)
L 4: "a neural architecture based on the U-Net that reconstructs surface, near-global chlorophyll-a based on observations and four physical predictors": This sentence can easily be interpreted to mean that observations are used as inputs to the neural network, when actually model (reanalysis or forecast) fields are used. The next sentence is a bit more helpful, but the reader may still think that the listed inputs "mixed layer depth, sea surface height, salinity, and temperature" are observations. Please rephrase to avoid confusion.
Abstract: This is just an aside that the authors can ignore: the authors modified the title and changed "resource-efficient" to "lightweight". Yet in the abstract "lightweight" has been removed and "resource-efficient" is used twice.
L 71: "Because the target exhibits strong seasonality and our focus was not necessarily maximal architectural expressiveness, we did not adopt a single monolithic model covering the full range of variability.": What exactly does this mean? The following sentences help the reader understand the methodology, but this sentence is too general and not helpful. Please rephrase.
L 75: "... each network was trained on six-month time series starting from its initialization month (m_1-m_6)": The use of "m_1-m_6" is confusing here: the symbols are not used elsewhere in the section, and the reader is led to believe forecasts start in months 1 through 6, when I presume m_1 through m_6 are meant to refer to the "six-month time series" rather than the "initialization month". Please rephrase, and perhaps simply do not introduce the symbols yet.
L 75: "For example, the network initialized in January was trained on January-June data from 1998-2016, while the December network was trained on December 1998-May 2017, and so on.": The date range "December 1998-May 2017" may make most readers believe all months in that time period were used when (I assume) this is not the case. I suggest using the same formulation from earlier in the sentence: "December-May data (December 1998 to May 2017)". For some readers it may further be helpful to mention the word "climatology" in this section.
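To make the comment above concrete, here is a tiny Python sketch of the training windows as I understand them (this is my interpretation of the text, not the authors' code): each of the 12 networks sees its initialization month plus the next five months, wrapping across the year boundary.

```python
import calendar

def window(init_month):
    """Six-month window of month abbreviations starting at init_month (1-12), wrapping."""
    return [calendar.month_abbr[(init_month - 1 + k) % 12 + 1] for k in range(6)]

print(window(1))   # January network: Jan through Jun
print(window(12))  # December network: Dec through May of the following year
```

If this is indeed what is meant, stating it this plainly (and mentioning "climatology") would remove the ambiguity of the "December 1998-May 2017" date range.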
L 110: "As mentioned above, the network operates on log-transformed chl-a values, where p_i = log(ŷ_i)...": because it was mentioned above that the NN produces log-transformed chl-a values, I would have expected ŷ to be in log-space already.
L 147: "we used 6-month forecasts from SEAS5": The previous sentence stated that the neural network is "using the reanalysis (GLORYS12) as input". Are both SEAS5 and GLORYS12 being used here, and how? This is a bit confusing. Currently it reads like the first sentences of Section 2.4 are describing one single approach, please be more specific and mention the two approaches explicitly.
L 160: "monthly resolution": Here, everything is expressed in months, for both the temporal resolution and the lead time. What happened to the 5-day predictions mentioned in the first sentence of the section? Are these 5-day predictions? If so, please add that information somewhere, if not, modify the first sentence. -- This comment is related to the last one, please describe the forecast generation approaches better.
L 199: "Physics=G12": Is G12 the same as GLORYS12, GLO12 used in line 179, or something else? Without additional information this statement is not very helpful. If it is just a reference to Fig. 5, I don't think it needs to be included here.
Fig. 5 and 7: Why not include the BIO4 results here?
Fig. 8: The color scale appears broken: white does not correspond to 0. When recreating the figure, I would suggest using the same color scale as in Fig. 6 for easy comparison.
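In case it is useful, white can be pinned to zero with a diverging normalization; the snippet below is only an illustration using a synthetic anomaly field (the variable names and value ranges are mine, not taken from the manuscript).

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend, for illustration only
import matplotlib.pyplot as plt
from matplotlib.colors import TwoSlopeNorm

# synthetic anomaly field standing in for the chl-a anomalies of Fig. 8
anomaly = np.random.default_rng(0).normal(0.0, 0.2, size=(90, 180))

# vcenter=0 maps zero anomalies to the midpoint (white) of a diverging colormap
norm = TwoSlopeNorm(vmin=float(anomaly.min()), vcenter=0.0, vmax=float(anomaly.max()))

fig, ax = plt.subplots()
im = ax.pcolormesh(anomaly, cmap="RdBu_r", norm=norm)
fig.colorbar(im, ax=ax)
```

With this normalization, positive and negative anomalies get distinct hues and zero is always white, even when the data range is asymmetric.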
Fig. 11 and following: I would suggest using the same color for the same NN output, i.e., "Init. JAN" should have the same color in different figures.
L 308: "and so this framework would fall short whenever three-dimensional biogeochemical consistency is required.": There are many cases where surface chlorophyll-a is not representative of phytoplankton biomass, for example in the presence of a deep chlorophyll maximum. So, I agree that the framework as presented falls short in some cases, but the expression "three-dimensional biogeochemical consistency" is awkward and not helpful to the reader.
L 309: "lead time 1" I think it would be useful to mention the units here and say that this is a 1-month lead time. I would prefer if the months were added throughout the manuscript, but they are important to mention here, as some readers may just skip to the Discussion.
L 329: "... can be penalizing of modest timing and spatial shifts in bloom features.": What is meant by "modest timing" here? I suggest changing to something like "can penalize even modest changes in timing or small spatial shifts of chl-a blooms."
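To illustrate the point with a synthetic example (a Gaussian "bloom" of my own construction, not the manuscript's data): a bloom reproduced perfectly in shape but arriving one month late already incurs a sizeable squared error.

```python
import numpy as np

months = np.arange(12)
bloom = np.exp(-0.5 * (months - 5) ** 2)    # observed bloom peaking in month 5
shifted = np.exp(-0.5 * (months - 6) ** 2)  # identical bloom, one month late

# RMSE is large relative to the bloom amplitude of 1, despite a perfect shape
rmse = np.sqrt(np.mean((bloom - shifted) ** 2))
print(rmse)
```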
L 330: "the monthly evaluation record over is relatively short": something is missing here.
The authors use a neural network to estimate surface chlorophyll-a, a computationally efficient approach that appears to outperform traditional approaches like mechanistic biogeochemical ocean models. The manuscript presents some compelling results, but the experimental setup is not described well enough, and it is unclear why the comparison of chl-a estimates does not include any coastal regions.
general comments:
The manuscript is mostly well written and was easy to follow -- with a major exception: the basic setup of the experiments and implementation details are not well described and after reading through the whole manuscript I still do not quite know what, for example, "6-month predictions" are in the manuscript. Does the "6-month" imply a 6-month lead time, a 6-month forecast length, a 6-month time average or something else? Is there a distinction between "prediction" and "forecast" in the manuscript, if so, what is it? Sentences that are meant to explain experiments sometimes increase the reader's confusion, for instance: "These months correspond to lead-times one out of the six months of each forecast." (l. 156). Sentences like this example are confusing to the reader and could be improved considerably by rephrasing and adding some details. Please take the time and space to clarify how the experiments are set up and what is compared at what resolution (this includes space and time).
Even a reader who does not know much about marine chl-a might find it surprising that the regions where performance is evaluated, shown in Fig. 3, do not include any "yellow" values and seem to focus only on open-ocean regions (as an aside, a color bar or at least a description of what property is shown in Fig. 3 would be useful). That is, why weren't any coastal regions with high chl-a concentrations included in the comparison? The authors mention "fisheries management" and "harmful algal blooms" but then neglect to evaluate the model in the biologically active regions where most blooms occur and fishing is prevalent. In general, the chl-a estimates were compared mostly as a global average (Fig. 4, 5) or as averages in the large open-ocean regions (Fig. 7, 9); only Fig. 6 shows the performance on a finer spatial scale. Even in the computation of the RMSE, a spatial average appears to be used: "The spatially-averaged reconstructed time series has a RMSE of 0.01 ..." (l. 151). Why is the RMSE based on a spatial average? The use of spatial averaging is not explained well or mentioned when the RMSE is introduced. Please ensure that the reader knows at all times how key metrics are being computed. In addition, I would suggest using nearshore regions in the comparison and evaluating the model performance at a higher resolution, both in space and time.
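To illustrate why this matters, the two plausible readings of "RMSE" can differ substantially. The snippet below uses synthetic data of my own (not the authors' fields) purely to show the contrast between averaging spatially before versus after computing the error.

```python
import numpy as np

rng = np.random.default_rng(42)
truth = rng.lognormal(mean=-1.0, sigma=0.5, size=(24, 10, 10))  # (time, lat, lon)
pred = truth + rng.normal(0.0, 0.1, size=truth.shape)           # add local errors

# (a) spatially average first, then compute one RMSE over the time series
rmse_of_mean = np.sqrt(np.mean((pred.mean(axis=(1, 2)) - truth.mean(axis=(1, 2))) ** 2))

# (b) compute the RMSE at every grid point, then average spatially
rmse_pointwise = np.sqrt(np.mean((pred - truth) ** 2, axis=0)).mean()

# Spatial averaging cancels anomalies of opposite sign, so (a) is typically
# much smaller than (b) even when local errors are substantial
print(rmse_of_mean, rmse_pointwise)
```

If the reported RMSE follows reading (a), much of the local anomaly signal would be averaged away before the metric is computed, which is why the order of operations needs to be stated explicitly.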
Furthermore, the authors later ponder how the decrease in ACC observed in Fig. 6 aligns with little to no increase in RMSE and other metrics in Fig. 9. They explain that "it is likely that the neural network’s ability to capture the strong seasonal dynamics in the data (Figs. 7 and 8) is compensating for the decrease in performance with respect to the anomalies" (l. 168). That could well be, but if the RMSE is based on some spatially averaged chl-a, the averaging could have removed most of the effect of the anomalies. Unfortunately, a reader can only guess here, as it is unclear how the RMSE was computed.
Because of their skewed distribution, chl-a values are often log-transformed when plotted and compared. The authors mention once that a log-transformation was used, but it is unclear where and to what extent: "The physical ocean data was normalized using min-max normalization and the chl-a data was log-transformed" (l. 82) is the only information the reader gets. Was a log-transformation used when computing the ACC, NRMSE, etc.? Are r_i and p_i in Eq 1-4 log-transformed? How were the climatologies computed? More importantly, perhaps, was a log-transformation used in the loss function for the neural network? The authors mention that they needed to modify the loss function: "so we modified the standard mean squared error (MSE) loss function by adding a small penalty for underestimation." (l. 79). With a log-transformation applied to chl-a, one would expect underestimation to be quite heavily penalized by the MSE. More information is needed to better interpret the results and the setup of the neural network.
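For concreteness, my best guess at what the modified loss might look like is sketched below; the weighting scheme and the value of alpha are assumptions on my part, not the authors' implementation.

```python
import numpy as np

def asymmetric_mse(pred_log, target_log, alpha=0.2):
    """MSE in log space with extra weight on underestimates (pred < target).
    The weighting and alpha are illustrative guesses, not the authors' loss."""
    err = pred_log - target_log
    weight = np.where(err < 0.0, 1.0 + alpha, 1.0)  # penalize underestimation
    return float(np.mean(weight * err ** 2))

target = np.log(np.array([0.1, 0.5, 2.0]))
under = target - 0.3  # uniform underestimate in log space
over = target + 0.3   # uniform overestimate of the same magnitude
# the same-magnitude error costs more when it is an underestimate
print(asymmetric_mse(under, target), asymmetric_mse(over, target))
```

Whether something like this is applied before or after the log-transform changes its effect considerably, which is exactly why the manuscript should spell it out.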
specific comments:
L 1: "Marine chlorophyll-a is an important indicator of ecosystem health, and accurate forecasting, even at the surface level, can have significant implications for climate studies and resource management a lightweight, resource-efficient neural architecture based on the U-Net that reconstructs surface, near-global chlorophyll-a from four physical predictors.": Accurately forecasting/estimating surface chl-a is a good check for "traditional" mechanistic models to verify that they can recreate some key biogeochemical dynamics. How would the output of a neural network model that only estimates surface chl-a be able to inform climate studies and resource management? Maybe this is a point that could be discussed further in Section 4.
L 59: "The goal of this work is to demonstrate that we can not only estimate chl-a from these four variables, but that by using publicly available forecasts of these as input, we are able to generate an ensemble of skillful chl-a predictions for six months into the future.": Here it would be useful for the reader to be more specific: are the 6-month predictions reliant on a 6-month forecast or are they produced from input 6 months into the past?
L 74: "Skip connections link matching layers in the encoder and decoder, facilitating the transfer of information.": Does this mean the first Conv3D layer is linked to the last one, etc.?
Eq 1: It would be good to explain the terms in the equation a bit better (is the data log-transformed?) and move the equation up to where MSE and the terms are introduced.
L 85: What motivated the choice of the 12 "monthly" neural networks? How much worse is the use of a single one for all months?
L 90: "The optimal architecture found for this task has approximately six million trainable parameters...": Is this for one or all 12 of the networks?
L 97: "...provides daily and monthly data...": Here, or somewhere early on, mention if the networks produce daily or monthly mean estimates.
L 122: "lead-time two": Does this mean a 2-month lead time?
Eq 2-4: How do these metrics compare to the cost function used for training the network? Why not report/show that value as well? Also mention whether any of the chl-a values are log-transformed in these metrics.
L 146: "Rather than a direct comparison, we use BIO4 as a benchmark, recognizing that it simulates a wide range of interconnected biogeochemical processes across various depths, whereas our data-driven approach is specifically designed for surface chl-a prediction.": This sentence is a bit confusing. It makes sense to compare the neural network approach to a more classic reference approach for estimating surface chl-a. But why is this dependent on BIO4 also estimating a wide range of other properties? Maybe I just do not understand what "direct comparison" refers to in this context.
L 150: The first sentence of Sec 3 is almost identical to that of Sec 2.2. Unfortunately, it is still not clear to me what a "set of 5-day predictions" means.
L 151: "The spatially-averaged reconstructed time series...": What kind of spatial averaging is performed here, before computing the RMSE etc.?
L 168 and following figures: Are the BIO4 estimates that are shown forecasts as well? For what lead time?