The authors have improved the manuscript and addressed my initial comments. The addition of coastal regions in their neural network (NN) model evaluation is very helpful, and it provides better support for the reported results of the NN estimates. The revised manuscript is also easier to follow, and only minor revisions are needed to improve the writing in places.
In parts of the manuscript, one of my earlier criticisms still applies; some sentences remain difficult to understand. I especially had trouble with the two NN prediction approaches that are not clearly distinguished from one another in Section 2.4. Having read through the full manuscript, I think I now understand the two approaches, but when I initially read Section 2.4, I was confused by the description. There are a few more instances where sentences are confusing or misleading for a first-time reader; please see my specific comments for details.
The revised manuscript contains many large figures; some of them could be moved to a supporting information document. For example, the open ocean and coastal region raw-value chlorophyll-a figures (11 and 13) are shown alongside anomaly figures (12 and 14), each taking up a page. Here the authors could decide to focus on either the raw values or the anomalies, or to show only a subset of the regions.
# specific comments (line numbers are based on the revised manuscript, not the tracked changes version)
L 4: "a neural architecture based on the U-Net that reconstructs surface, near-global chlorophyll-a based on observations and four physical predictors": This sentence can easily be interpreted to mean that observations are used as inputs to the neural network, when actually model (reanalysis or forecast) fields are used. The next sentence is a bit more helpful, but the reader may still think that the listed inputs "mixed layer depth, sea surface height, salinity, and temperature" are observations. Please rephrase to avoid confusion.
Abstract: This is just an aside that the authors can ignore: the authors modified the title and changed "resource-efficient" to "lightweight". Yet in the abstract "lightweight" has been removed and "resource-efficient" is used twice.
L 71: "Because the target exhibits strong seasonality and our focus was not necessarily maximal architectural expressiveness, we did not adopt a single monolithic model covering the full range of variability.": What exactly does this mean? The following sentences help the reader understand the methodology, but this sentence is too general and not helpful. Please rephrase.
L 75: "... each network was trained on six-month time series starting from its initialization month (m_1-m_6)": The use of "m_1-m_6" is confusing here: the symbols are not used elsewhere in the section, and the reader is led to believe forecasts start in months 1 through 6, when I presume m_1 through m_6 are meant to refer to the "six-month time series" rather than the "initialization month". Please rephrase, and perhaps simply do not introduce the symbols yet.
L 75: "For example, the network initialized in January was trained on January-June data from 1998-2016, while the December network was trained on December 1998-May 2017, and so on.": The date range "December 1998-May 2017" may make most readers believe all months in that time period were used when (I assume) this is not the case. I suggest using the same formulation from earlier in the sentence: "December-May data (December 1998 to May 2017)". For some readers it may further be helpful to mention the word "climatology" in this section.
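To make the comment above concrete, here is a tiny Python sketch of the training windows as I understand them (this is my interpretation of the text, not the authors' code): each of the 12 networks sees its initialization month plus the next five months, wrapping across the year boundary.

```python
import calendar

def window(init_month):
    """Six-month window of month abbreviations starting at init_month (1-12), wrapping."""
    return [calendar.month_abbr[(init_month - 1 + k) % 12 + 1] for k in range(6)]

print(window(1))   # January network: Jan through Jun
print(window(12))  # December network: Dec through May of the following year
```

If this is indeed what is meant, stating it this plainly (and mentioning "climatology") would remove the ambiguity of the "December 1998-May 2017" date range.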
L 110: "As mentioned above, the network operates on log-transformed chl-a values, where p_i = log(ŷ_i)...": because it was mentioned above that the NN produces log-transformed chl-a values, I would have expected ŷ to be in log-space already.
L 147: "we used 6-month forecasts from SEAS5": The previous sentence stated that the neural network is "using the reanalysis (GLORYS12) as input". Are both SEAS5 and GLORYS12 being used here, and how? This is a bit confusing. Currently it reads like the first sentences of Section 2.4 are describing one single approach, please be more specific and mention the two approaches explicitly.
L 160: "monthly resolution": Here, everything is expressed in months, for both the temporal resolution and the lead time. What happened to the 5-day predictions mentioned in the first sentence of the section? Are these 5-day predictions? If so, please add that information somewhere, if not, modify the first sentence. -- This comment is related to the last one, please describe the forecast generation approaches better.
L 199: "Physics=G12": Is G12 the same as GLORYS12, GLO12 used in line 179, or something else? Without additional information this statement is not very helpful. If it is just a reference to Fig. 5, I don't think it needs to be included here.
Fig. 5 and 7: Why not include the BIO4 results here?
Fig. 8: The color scale appears broken: white does not correspond to 0. When recreating the figure, I would suggest using the same color scale as in Fig. 6 for easy comparison.
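In case it is useful, white can be pinned to zero with a diverging normalization; the snippet below is only an illustration using a synthetic anomaly field (the variable names and value ranges are mine, not taken from the manuscript).

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend, for illustration only
import matplotlib.pyplot as plt
from matplotlib.colors import TwoSlopeNorm

# synthetic anomaly field standing in for the chl-a anomalies of Fig. 8
anomaly = np.random.default_rng(0).normal(0.0, 0.2, size=(90, 180))

# vcenter=0 maps zero anomalies to the midpoint (white) of a diverging colormap
norm = TwoSlopeNorm(vmin=float(anomaly.min()), vcenter=0.0, vmax=float(anomaly.max()))

fig, ax = plt.subplots()
im = ax.pcolormesh(anomaly, cmap="RdBu_r", norm=norm)
fig.colorbar(im, ax=ax)
```

With this normalization, positive and negative anomalies get distinct hues and zero is always white, even when the data range is asymmetric.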
Fig. 11 and following: I would suggest using the same color for the same NN output, i.e., "Init. JAN" should have the same color in different figures.
L 308: "and so this framework would fall short whenever three-dimensional biogeochemical consistency is required.": There are many cases where surface chlorophyll-a is not representative of phytoplankton biomass, for example in the presence of a deep chlorophyll maximum. So, I agree that the framework as presented falls short in some cases, but the expression "three-dimensional biogeochemical consistency" is awkward and not helpful to the reader.
L 309: "lead time 1" I think it would be useful to mention the units here and say that this is a 1-month lead time. I would prefer if the months were added throughout the manuscript, but they are important to mention here, as some readers may just skip to the Discussion.
L 329: "... can be penalizing of modest timing and spatial shifts in bloom features.": What is meant by "modest timing" here? I suggest changing to something like "can penalize even modest changes in timing or small spatial shifts of chl-a blooms."
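To illustrate the point with a synthetic example (a Gaussian "bloom" of my own construction, not the manuscript's data): a bloom reproduced perfectly in shape but arriving one month late already incurs a sizeable squared error.

```python
import numpy as np

months = np.arange(12)
bloom = np.exp(-0.5 * (months - 5) ** 2)    # observed bloom peaking in month 5
shifted = np.exp(-0.5 * (months - 6) ** 2)  # identical bloom, one month late

# RMSE is large relative to the bloom amplitude of 1, despite a perfect shape
rmse = np.sqrt(np.mean((bloom - shifted) ** 2))
print(rmse)
```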
L 330: "the monthly evaluation record over is relatively short": something is missing here.
The authors use a neural network to estimate surface chlorophyll-a, a computationally efficient approach that appears to outperform traditional approaches like mechanistic biogeochemical ocean models. The manuscript presents some compelling results, but the experimental setup is not described well enough, and it is unclear why the comparison of chl-a estimates does not include any coastal regions.
general comments:
The manuscript is mostly well written and was easy to follow -- with a major exception: the basic setup of the experiments and implementation details are not well described and after reading through the whole manuscript I still do not quite know what, for example, "6-month predictions" are in the manuscript. Does the "6-month" imply a 6-month lead time, a 6-month forecast length, a 6-month time average or something else? Is there a distinction between "prediction" and "forecast" in the manuscript, if so, what is it? Sentences that are meant to explain experiments sometimes increase the reader's confusion, for instance: "These months correspond to lead-times one out of the six months of each forecast." (l. 156). Sentences like this example are confusing to the reader and could be improved considerably by rephrasing and adding some details. Please take the time and space to clarify how the experiments are set up and what is compared at what resolution (this includes space and time).
Even a reader who does not know much about marine chl-a might find it surprising that the regions where performance is evaluated, shown in Fig. 3, do not include any "yellow" values and seem to focus only on open-ocean regions (as an aside, a color bar or at least a description of what property is shown in Fig. 3 would be useful). That is, why weren't any coastal regions with high chl-a concentrations included in the comparison? The authors mention "fisheries management" and "harmful algal blooms" but then neglect to evaluate the model in the biologically active regions where most blooms occur and fishing is prevalent. In general, the chl-a estimates were compared mostly as a global average (Fig. 4, 5) or as averages in the large open-ocean regions (Fig. 7, 9); only Fig. 6 shows the performance on a finer spatial scale. Even in the computation of the RMSE, a spatial average appears to be used: "The spatially-averaged reconstructed time series has a RMSE of 0.01 ..." (l. 151). Why is the RMSE based on a spatial average? The use of spatial averaging is not explained well or mentioned when the RMSE is introduced. Please ensure that the reader knows at all times how key metrics are being computed. In addition, I would suggest using nearshore regions in the comparison and evaluating the model performance at a higher resolution, both in space and time.
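To illustrate why this matters, the two plausible readings of "RMSE" can differ substantially. The snippet below uses synthetic data of my own (not the authors' fields) purely to show the contrast between averaging spatially before versus after computing the error.

```python
import numpy as np

rng = np.random.default_rng(42)
truth = rng.lognormal(mean=-1.0, sigma=0.5, size=(24, 10, 10))  # (time, lat, lon)
pred = truth + rng.normal(0.0, 0.1, size=truth.shape)           # add local errors

# (a) spatially average first, then compute one RMSE over the time series
rmse_of_mean = np.sqrt(np.mean((pred.mean(axis=(1, 2)) - truth.mean(axis=(1, 2))) ** 2))

# (b) compute the RMSE at every grid point, then average spatially
rmse_pointwise = np.sqrt(np.mean((pred - truth) ** 2, axis=0)).mean()

# Spatial averaging cancels anomalies of opposite sign, so (a) is typically
# much smaller than (b) even when local errors are substantial
print(rmse_of_mean, rmse_pointwise)
```

If the reported RMSE follows reading (a), much of the local anomaly signal would be averaged away before the metric is computed, which is why the order of operations needs to be stated explicitly.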
Furthermore, the authors later ponder how the decrease in ACC observed in Fig. 6 aligns with little to no increase in RMSE and other metrics in Fig. 9. They explain that "it is likely that the neural network’s ability to capture the strong seasonal dynamics in the data (Figs. 7 and 8) is compensating for the decrease in performance with respect to the anomalies" (l. 168). That could well be, but if the RMSE is based on some spatially averaged chl-a, the averaging could have removed most of the effect of the anomalies. Unfortunately, a reader can only guess here, as it is unclear how the RMSE was computed.
Because of their skewed distribution, chl-a values are often log-transformed when plotted and compared. The authors mention once that a log-transformation was used, but it is unclear where and to what extent: "The physical ocean data was normalized using min-max normalization and the chl-a data was log-transformed" (l. 82) is the only information the reader gets. Was a log-transformation used when computing the ACC, NRMSE, etc.? Are r_i and p_i in Eq 1-4 log-transformed? How were the climatologies computed? More importantly, perhaps, was a log-transformation used in the loss function for the neural network? The authors mention that they needed to modify the loss function: "so we modified the standard mean squared error (MSE) loss function by adding a small penalty for underestimation." (l. 79). With a log-transformation applied to chl-a, one would expect underestimation to be quite heavily penalized by the MSE. More information is needed to better interpret the results and the setup of the neural network.
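For concreteness, my best guess at what the modified loss might look like is sketched below; the weighting scheme and the value of alpha are assumptions on my part, not the authors' implementation.

```python
import numpy as np

def asymmetric_mse(pred_log, target_log, alpha=0.2):
    """MSE in log space with extra weight on underestimates (pred < target).
    The weighting and alpha are illustrative guesses, not the authors' loss."""
    err = pred_log - target_log
    weight = np.where(err < 0.0, 1.0 + alpha, 1.0)  # penalize underestimation
    return float(np.mean(weight * err ** 2))

target = np.log(np.array([0.1, 0.5, 2.0]))
under = target - 0.3  # uniform underestimate in log space
over = target + 0.3   # uniform overestimate of the same magnitude
# the same-magnitude error costs more when it is an underestimate
print(asymmetric_mse(under, target), asymmetric_mse(over, target))
```

Whether something like this is applied before or after the log-transform changes its effect considerably, which is exactly why the manuscript should spell it out.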
specific comments:
L 1: "Marine chlorophyll-a is an important indicator of ecosystem health, and accurate forecasting, even at the surface level, can have significant implications for climate studies and resource management a lightweight, resource-efficient neural architecture based on the U-Net that reconstructs surface, near-global chlorophyll-a from four physical predictors.": Accurately forecasting/estimating surface chl-a is a good check for "traditional" mechanistic models to verify that they can recreate some key biogeochemical dynamics. How would the output of a neural network model that only estimates surface chl-a be able to inform climate studies and resource management? Maybe this is a point that could be discussed further in Section 4.
L 59: "The goal of this work is to demonstrate that we can not only estimate chl-a from these four variables, but that by using publicly available forecasts of these as input, we are able to generate an ensemble of skillful chl-a predictions for six months into the future.": Here it would be useful for the reader to be more specific: are the 6-month predictions reliant on a 6-month forecast or are they produced from input 6 months into the past?
L 74: "Skip connections link matching layers in the encoder and decoder, facilitating the transfer of information.": Does this mean the first Conv3D layer is linked to the last one, etc.?
Eq 1: It would be good to explain the terms in the equation a bit better (is the data log-transformed?) and move the equation up to where MSE and the terms are introduced.
L 85: What motivated the choice of the 12 "monthly" neural networks? How much worse is the use of a single one for all months?
L 90: "The optimal architecture found for this task has approximately six million trainable parameters...": Is this for one or all 12 of the networks?
L 97: "...provides daily and monthly data...": Here, or somewhere early on, mention if the networks produce daily or monthly mean estimates.
L 122: "lead-time two": Does this mean a 2-month lead time?
Eq 2-4: How do these metrics compare to the cost function used for training the network? Why not report/show that value as well? Also mention whether any of the chl-a values are log-transformed in these metrics.
L 146: "Rather than a direct comparison, we use BIO4 as a benchmark, recognizing that it simulates a wide range of interconnected biogeochemical processes across various depths, whereas our data-driven approach is specifically designed for surface chl-a prediction.": This sentence is a bit confusing. It makes sense to compare the neural network approach to a more classic reference approach for estimating surface chl-a. But why is this dependent on BIO4 also estimating a wide range of other properties? Maybe I just do not understand what "direct comparison" refers to in this context.
L 150: The first sentence of Sec 3 is almost identical to that of Sec 2.2. Unfortunately, it is still not clear to me what a "set of 5-day predictions" means.
L 151: "The spatially-averaged reconstructed time series...": What kind of spatial averaging is performed here, before computing the RMSE etc.?
L 168 and following figures: Are the BIO4 estimates that are shown forecasts as well? For what lead time?