An observation-based evaluation and ranking of historical Earth
System Model simulations for regional downscaling in the northwest
North Atlantic Ocean

Abstract. Continental shelf regions in the ocean play an important role in the global cycling of carbon and nutrients, but their responses to global change are understudied. Global Earth System Models (ESMs), as essential tools for building understanding of ocean biogeochemistry, are used extensively and routinely for projections of future climate states; however, their relatively coarse spatial resolution is likely not appropriate for accurately representing the complex patterns of circulation and elemental fluxes on the shelves along ocean margins. Here, we compared 29 ESMs used in the IPCC's Assessment Reports (AR) 5 and 6 and a regional biogeochemical model for the northwest North Atlantic (NWA) shelf to assess their ability to reproduce observations of temperature, nitrate, and chlorophyll. The NWA region is biologically productive, influenced by the large-scale Gulf Stream and Labrador Current systems, and particularly sensitive to climate change. Most ESMs compare relatively poorly to observed nitrate and chlorophyll and show differences with observed temperature due to spatial mismatches in their large-scale circulation. Model-simulated nitrate and chlorophyll compare better with available observations in AR6 than in AR5, but none of the models performs equally well for all 3 parameters. The ensemble means of all ESMs, and of the five best performing ESMs, strongly underestimate observed chlorophyll and nitrate. The regional model has a much higher spatial resolution and reproduces the observations significantly better than any of the ESMs. It also reproduces reasonably well the vertically resolved observations from gliders and bi-monthly ship-based monitoring. A ranking of the ESMs suggests that the top 3 models are appropriate as boundary forcing for regional projections of future changes in the NWA region.



Regional model
The ACM is a high-resolution, regional configuration of the Regional Ocean Modeling System (ROMS, version 3.5; Haidvogel et al., 2008) for the NWA, nested within the larger ocean-ice model of Urrego-Blanco and Sheng (2012), that includes the Gulf of Maine, Scotian Shelf and Grand Banks (Figure 1). The coupled physical-biogeochemical model has 30 vertical layers and an average horizontal resolution of 9.5 km on the shelf (Table 1). Detailed descriptions and physical model validation are presented in Brennan et al. (2016) and Rutherford and Fennel (2018). The biogeochemical model is based on Fennel et al. (2006, 2008) but was expanded by splitting phytoplankton and zooplankton state variables into size-based functional groups, i.e. nano-micro-phytoplankton and micro-meso-zooplankton. The model was also modified by including temperature-dependent biological rates for nutrient uptake, phytoplankton and zooplankton mortality, grazing and zooplankton egestion and excretion (see supporting text). The model has 10 state variables: nitrate, ammonium, and two size classes each for phytoplankton, chlorophyll, zooplankton and detritus (Figure 2). This ecosystem structure is of intermediate complexity similar to the model of Aumont et al. (2015), which is used in 6 of the ESMs included in our study. Model parameters were optimized by Kuhn (2017) and are listed in supporting Table S1. The model description and equations are available in the Supporting Information.
Initial and open boundary conditions for nitrate (NO3) were defined from a monthly climatology (Kuhn, 2017) based on in situ observations and the World Ocean Atlas 2009 (Garcia et al., 2010). Other biological variables were set to 0.1 mmol N m⁻³ with a phytoplankton-to-chlorophyll ratio of 0.76 mmol N (mg Chl)⁻¹ (Bianucci et al., 2016). The model was initialized on January 1, 1999 and run through December 31, 2014. The first year was considered spin-up. Monthly climatologies of surface chlorophyll, nitrate, and temperature were calculated for comparison with the ESMs.
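As an illustration of the climatology step, the following minimal sketch averages model output by calendar month after discarding the spin-up year. The record layout `(year, month, value)` and the function name `monthly_climatology` are assumptions made for this example and are not taken from the ACM post-processing code.

```python
from collections import defaultdict


def monthly_climatology(records, spinup_years=(1999,)):
    """Average values by calendar month, skipping spin-up years.

    records: iterable of (year, month, value) tuples (hypothetical layout).
    Returns a dict {month: mean value} for the months that have data.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for year, month, value in records:
        if year in spinup_years:
            continue  # discard the spin-up period
        sums[month] += value
        counts[month] += 1
    return {month: sums[month] / counts[month] for month in sorted(sums)}


# Toy example: one spin-up year and two retained years, January and July only.
records = [
    (1999, 1, 9.0), (1999, 7, 9.0),    # spin-up, ignored
    (2000, 1, 2.0), (2000, 7, 16.0),
    (2001, 1, 4.0), (2001, 7, 18.0),
]
clim = monthly_climatology(records)
```

In this toy case the January and July means are computed from the 2000 and 2001 values only.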

Model resolution
The 30 models differ dramatically in their horizontal resolution and do not evenly cover the 3 regions of interest (Figure 3, Table 1). The regional ACM has a much higher resolution than any of the ESMs with about 16 times more horizontal grid cells than the highest resolution ESM and almost 300 times more than the lowest resolution ESM. Among the ESMs the highest resolution is achieved by models 16 and 28, which share the same grid. These two have more than twice the number of horizontal grid cells than the next highest resolution models (3, 18, 20-21). The lowest resolution ESMs are models 3 and 12-14 with only 26 horizontal grid cells within the NWA shelf resulting in a coarse representation, particularly in the SS region. The median number of grid cells in the NWA shelf region is 72 and 102 for the CMIP5 and CMIP6 models, respectively, compared to 6875 in the ACM.

Comparison metrics
For comparison with the observations, each model was mapped onto the SeaWiFS, WOA and OSTIA grids. Since some areas, such as the nearshore and the Bay of Fundy, are covered by only a few models, grid cells that are active in fewer than 85% of all models were excluded from the analysis to avoid biases. In the low-resolution WOA climatology, the months November to January were excluded because poor data availability in these months resulted in unrealistic patterns.
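The coverage criterion can be sketched as follows, assuming that each model has already been mapped onto the common observational grid and reduced to a list of booleans marking its active (ocean) cells. The function name `coverage_mask` and the input layout are hypothetical.

```python
def coverage_mask(active_by_model, min_fraction=0.85):
    """Flag grid cells that are active in at least min_fraction of the models.

    active_by_model: list with one entry per model; each entry is a list of
    booleans, one per cell of the common grid (hypothetical layout).
    Returns a list of booleans: True = keep the cell in the analysis.
    """
    n_models = len(active_by_model)
    n_cells = len(active_by_model[0])
    keep = []
    for i in range(n_cells):
        n_active = sum(1 for model in active_by_model if model[i])
        keep.append(n_active / n_models >= min_fraction)
    return keep


# Toy example with 10 models and 3 cells:
# cell 0 is active in all 10 models, cell 1 in 9, and cell 2 in 8.
models = [[True, True, True] for _ in range(8)]
models.append([True, True, False])
models.append([True, False, False])
mask = coverage_mask(models)
```

Here the third cell is active in only 80% of the models and is therefore excluded.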
Three zones were defined for a high-level comparison with the observations: the Gulf of Maine (GoM), Scotian Shelf (SS), and Grand Banks (GB) (Figure 1). Hereafter, the term NWA shelf refers to the region covered by all 3 zones (GoM, SS and GB).
Following the method of Rickard et al. (2016), a score $S(v,t)$ is calculated for each model variable, $v$ (i.e., surface temperature, chlorophyll, and nitrate), for each month, $t$, in the climatology as the sum of the centered Root Mean Square Difference (RMSD) and bias between the observations ($O$) and the model ($M$), such that:
$$S(v,t) = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left[\left(M_i - \bar{M}\right) - \left(O_i - \bar{O}\right)\right]^2} + \left|\frac{1}{N}\sum_{i=1}^{N}\left(M_i - O_i\right)\right|$$
where the index $i$ refers to a grid cell, $N$ is the total number of grid cells within the NWA shelf, and $\bar{M}$ and $\bar{O}$ are the means of $M_i$ and $O_i$ over these grid cells. The lower the score the better the match between model and observations. Annual mean scores $\bar{S}(v)$ were calculated for each model variable by averaging over $t$. For each variable, the models were ranked based on their annual mean score. The overall rank was determined by ranking models by the averages of their ranks for surface temperature, chlorophyll, and nitrate. For models with equal averages the ranking was determined by the average of chlorophyll and nitrate ranks.
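The scoring and ranking procedure described above can be sketched in a few lines. This is an illustrative implementation only: it assumes that model and observed values are paired lists on the same grid cells, that the bias enters the score as an absolute value, and that a rank of 1 denotes the best (lowest) score. The function names are hypothetical.

```python
import math


def score(model, obs):
    """Centered RMSD plus absolute bias over paired grid-cell values."""
    n = len(obs)
    m_mean = sum(model) / n
    o_mean = sum(obs) / n
    centered_sq = sum(((m - m_mean) - (o - o_mean)) ** 2
                      for m, o in zip(model, obs))
    return math.sqrt(centered_sq / n) + abs(m_mean - o_mean)


def rank(values):
    """Return ranks (1 = lowest value) in the input order."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0] * len(values)
    for position, k in enumerate(order, start=1):
        ranks[k] = position
    return ranks


def overall_rank(temp_scores, chl_scores, no3_scores):
    """Rank models by their mean rank over the three variables.

    Ties in the mean rank are broken by the mean of the chlorophyll
    and nitrate ranks.
    """
    r_t, r_c, r_n = rank(temp_scores), rank(chl_scores), rank(no3_scores)
    keys = [((t + c + n) / 3, (c + n) / 2) for t, c, n in zip(r_t, r_c, r_n)]
    order = sorted(range(len(keys)), key=lambda k: keys[k])
    ranks = [0] * len(keys)
    for position, k in enumerate(order, start=1):
        ranks[k] = position
    return ranks


# A model that is uniformly 1 unit too warm has zero centered RMSD
# and a bias of 1, hence a score of 1.
s_offset = score([2.0, 3.0, 4.0], [1.0, 2.0, 3.0])

# Three hypothetical models with annual mean scores per variable.
# All three have the same mean rank, so the tie-break decides.
overall = overall_rank(temp_scores=[1.0, 2.0, 3.0],
                       chl_scores=[0.9, 0.5, 0.7],
                       no3_scores=[2.0, 3.0, 1.0])
```

In the toy ranking, each model has a mean rank of 2, so the order is set entirely by the chlorophyll and nitrate ranks, which illustrates how the tie-break rule favours models with better biogeochemical fields.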
To facilitate the comparison with observations, the ESMs were grouped into CMIP5 and CMIP6 and the ensemble means of all models and of the 5 highest ranked models were calculated for each group.
[...] deviate from each observed variable and subsequently used to calculate the scores and then rank the models. Finally, additional, high-resolution comparisons between models and observations are presented to further assess the regional model's performance.

Model-data comparisons
First, we compare the spatially averaged climatological surface temperature (Figure 4Figure

Model statistics
Error statistics, i.e. RMSD and bias, are now analyzed and used to calculate the model scores. The distribution and relationships between scores are explored and then the ranks calculated.
The RMSDs between the spatially averaged climatological observations and models are not consistent between variables, as indicated by the increasing temperature RMSD in Figure 6. However, temperature and chlorophyll RMSDs are correlated (r = 0.51, p = 0.0043). For temperature, models 3, 20-21, and 24-25 have the largest discrepancy with observations and some clearly represent the annual cycle better than others. For chlorophyll, the largest discrepancies with observations are in models 4, 8, 14 and 19-21, but overall chlorophyll RMSDs are relatively large and homogeneous, except for a few models that have lower RMSDs (e.g. models 22-23). Interestingly, the magnitude of the spring bloom in model 18 (CMIP5 group) is somewhat close to the observations. However, the time shift of the bloom (May-June) results in a poor agreement with observations. The mismatch between observed and simulated nitrate is much higher for models 5, 7, 18 and 29 and some models are much better at representing the observed annual cycle (Figure 6). The models with lowest RMSD for all 3 parameters are models 22-23 (CMIP6 group). The RMSDs of the ACM are about a third of the average RMSD of the ESMs for both chlorophyll (ESM RMSDs are 2.0-4.1 times that of the ACM) and nitrate (1.4-11.4 times) and a quarter for temperature (1.1-10.4 times).
Model scores (see Sect. 2.3) represent the spatial and temporal mismatch within the NWA shelf region (Figure 7). In general, the scores provide similar results as the RMSDs in Figure 6, although groups tend to emerge from the score calculation. As observed previously in Figure 6, the scores of ESMs have a much larger range of variability for temperature (1.5-7.8) and nitrate (1.4-13.2) than for chlorophyll (0.81-1.42) due to the large mismatch observed with a few models (Figure 7, supporting Figures S1-S5). For temperature, 4 of the 6 poorest (largest) scores (> 4.5) are in the CMIP6 group. They all markedly overestimate temperature, especially in the GoM (see supporting Figures S1, S4-S5). The range of variability in chlorophyll scores did not decrease from CMIP5 to CMIP6 and, given the improvement of a few CMIP6 models (i.e. 22 and 23), the range is larger in the CMIP6 group (0.8-1.4, Figure 7, right panel) than in the CMIP5 group (1-1.4, Figure 7, left panel). With the exception of model 29, which has a very poor (high) score for nitrate, the range of variability in nitrate is reduced in the CMIP6 group. In total, 5 models (3, 5, 7, 18, 29) have very poor scores for nitrate (> 4), strongly overestimating surface nitrate, except for model 3 in the Gulf of Maine (see supporting Figure S1). The remaining models have more homogeneous nitrate scores (Figure 7) with the best (lowest) scores in models 25, 24, 9 and 6 (Table 2). Models that underestimate nitrate (2, 8, 14 and 19, see supporting Figures S1-S4) have a better score because they match the low nitrate observations in late spring-summer (Table 2). Overall, ACM has the best annual mean scores, $\bar{S}(v)$, for temperature (1.14), chlorophyll (0.64) and nitrate (1.27).
https://doi.org/10.5194/bg-2020-265 Preprint. Discussion started: 17 July 2020 © Author(s) 2020. CC BY 4.0 License.
Among the 3 variables, and including the regional model, we found a correlation between the scores of chlorophyll and temperature (r = 0.53, p = 0.0025), but not between nitrate and chlorophyll (r = 0.03, p = 0.88) or nitrate and temperature (r = 0.06, p = 0.74). As can be seen in Figure 6, the ESMs with a poor representation of nitrate are not necessarily performing poorly with respect to chlorophyll or temperature. Model 7 for instance has the poorest score for nitrate and a relatively poor score for temperature but the best score of the CMIP5 group for chlorophyll (Figure 7, left panel). In fact, only models 3 and 18 have poor scores for all variables. Similarly, models 24 and 25 have the best scores for chlorophyll among the ESMs but are among the worst for temperature. On average, models have worse scores in the GoM (3.97, 1.73, 3.15) than on the SS (3.35, 0.94, 2.22) and GB (2.53, 0.72, 2.46) for temperature, chlorophyll and nitrate, respectively.
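For readers who wish to reproduce this type of test, the sketch below computes a Pearson correlation coefficient and the associated Student's t statistic using only the Python standard library. The sample size of n = 30 (29 ESMs plus the regional model) is an assumption inferred from the text, and the function names are hypothetical; converting t to a p value requires a t distribution, which is not included here.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def t_statistic(r, n):
    """Student's t for testing r = 0, with n - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1.0 - r * r))


# Reported chlorophyll-temperature score correlation, assuming n = 30 models.
t_chl_temp = t_statistic(0.53, 30)   # about 3.3 with 28 degrees of freedom

# Reported nitrate-chlorophyll score correlation under the same assumption.
t_no3_chl = t_statistic(0.03, 30)    # close to zero
```

Under this assumption the chlorophyll-temperature correlation corresponds to a t value well above typical significance thresholds, whereas the nitrate correlations do not, in line with the p values reported above.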
Overall, 4 groups emerge on the chlorophyll-nitrate space in Figure 7. This grouping is somewhat arbitrary but follows the general ranking presented in Figure 8, with a few exceptions. Group A includes the 14 best models (6 CMIP5 and 8 CMIP6) except for models 9 and 30, whose ranking is degraded due to poor representation of temperature. Group B includes the 4 intermediate-score models (15, 16, 17, 2). Group C includes the 8 models with poor chlorophyll scores (5 CMIP5 and 3 CMIP6) and Group D the 5 models with poor nitrate scores (4 CMIP5 and 1 CMIP6). Most of the models with poor scores for temperature are included in Group C, i.e. with the poor chlorophyll scores.
The overall model ranking (average of chlorophyll, nitrate and temperature ranks) indicates the gap between ACM and ESMs, as well as within ESMs (Figure 8). As expected, ACM ranks first, following the best scores for both chlorophyll and nitrate. The gap between ACM and model 22 (the best overall ESM) indicates that none of the ESMs performs best for both chlorophyll and nitrate. This is also shown by the large range in individual ranks (dark grey lines in Figure 8) in most models. Group A includes the 5 best ranking models, all from CMIP6 (22, 28, 25, 24, 23, respectively). The most consistent in terms of chlorophyll and nitrate ranking is model 28, the other ones having a relatively large spread. The best ranked CMIP5 models are 10 and 13. On the other side of the spectrum, models 20, 3, 21 and 18 (Groups C and D) have the poorest ranks because of their consistently poor scores for chlorophyll and nitrate. Despite their poor performance with respect to nitrate, models 7 and 29 are ranked within the mid-range of the ESMs because they are among the best ESMs with respect to chlorophyll (rank 4 and 8, respectively).

Additional model-data comparisons for regional ACM
While the resolution of the ESMs does not allow for a comparison at smaller spatial scales, we further compare the regional ACM to cross-shelf transects and station observations (Figure 9) along the Halifax Line (see Figure 1). The ACM reproduces the seasonal variation and the vertical gradient in chlorophyll and nitrate along the transect (Figure 9), although the simulated distributions are smoother than the glider observations. The summer subsurface chlorophyll maximum is located at the appropriate depth (28 m simulated versus 32 m observed, on average). The ACM somewhat underestimates the depth of the nitracline in the offshore waters (34 m versus 43 m, > 150 km) and overestimates surface nitrate in spring and fall, as seen in Figure 4.
Station 2, which is located nearshore on the Halifax Line (see Figure 1), provides additional, vertically resolved information with high temporal resolution that is useful for model validation (Figure 10

Overall model performance
There are significant discrepancies with observations and a large variability among ESMs in the representation of surface temperature, chlorophyll and nitrate in the NWA shelf (Table 2, Figure 6 and supporting Figures S1-S5). A warm bias resulting from a mismatch in the location of the Gulf Stream was present in most models, in line with the previous results of Loder et al. (2015) and Saba et al. (2016). Chlorophyll concentration was also systematically underestimated. The spring and fall blooms, which are characteristic annual features of the NWA region (Greenan et al., 2004, 2008), were absent in some and most models, respectively. The correlation between temperature and chlorophyll scores indicated that errors in surface chlorophyll concentration were likely driven by the misrepresentation of the general circulation and, more generally, of ocean physics.
Following Rickard et al. (2016), who used a similar ranking procedure, the 29 ESMs can be divided into an inner and an outer model ensemble. The outer ensemble includes 17 models that clearly misrepresent surface conditions in the NWA shelf (models 2-5, 7-8, 11, 14-16, 18-21, 24-25 and 29) and were selected as follows. The 8 models with lowest ranks (2-4, 8, 18-21) were included because they consistently misrepresent surface fields on the NWA shelf. Five of those were different generations (CMIP5 and CMIP6) of the same model, i.e. CanESM (2, 19) and CESM (3, 20-21). Their large scores imply that CanESM and CESM have fundamental issues with representing biogeochemistry in the NWA. Models 15-16 and 24-25 were also included in the outer ensemble because of their misrepresentation of surface nitrate and temperature, respectively. Since nitrate scores correlate with neither chlorophyll nor temperature scores, the mismatch with nitrate observations is likely related to intrinsic biogeochemical model behaviour rather than to a mismatch in circulation. Models with persistent positive or negative biases in surface nitrate (4-5, 7-8, 11, 14, 19 and 29, Figures S1-S5) were selected because they misrepresent the seasonal nitrate dynamics, which makes the other biogeochemical variables driven by nitrate questionable.
The inner ensemble includes 12 models (6, 9-10, 12-13, 17, 22-23, 26-28, 30). Can those be used as a multi-model (optimal) ensemble to characterize the future state of the NWA shelf region? Unfortunately, we found that an ensemble mean of these models, and even of the best five models, poorly represents historical surface fields due to the large variability within the ensemble (Figure 5) and the biases in the ensemble surface temperature and chlorophyll concentration (Figure 4).
The regional model clearly outperformed the ESMs in our assessment, with a consistent representation of the surface and subsurface fields in all shelf areas. The high spatial resolution of the regional model also allowed for a fine scale model validation that was not possible for the ESMs. The complementary glider transects and time series stations provide a high-resolution dataset of in situ chlorophyll and nitrate concentrations and show that the regional model resolves seasonal and vertical variation in chlorophyll and nitrate on the Scotian Shelf, something that none of the ESMs was able to reproduce.

Impact of spatial resolution
In general, the coarse horizontal resolution of the ESMs affects the representation of the NWA region in comparison to the regional model, particularly on the relatively narrow Scotian Shelf. The poor representation of coastal areas is a known limitation of global models (Holt et al., 2017) and results in a global underestimation of primary productivity in these regions (Bopp et al., 2013; Schneider et al., 2008).
There is no correlation between grid resolution and ESM rank (Figure 11) despite the fact that the best ranked ESM (MPI-ESM1-2-HR) also has the highest resolution (Tables 1 and 2). This result shows that higher grid resolution, as called for by Lavoie et al. (2013) for the NWA and by McKiver et al. (2015) for the global ocean, is not a guarantee for improved model performance. In fact, some very coarse resolution models from the CMIP5 group were ranked as well as or better than the other models, and the models with the second highest resolution (3, 18, 20-21) all had low ranks. The improved ranks at constant (e.g. models 22, 24, 25, 28) and even lower (model 29) ocean grid resolution in the CMIP6 group (Table 2, Figure 12) were also an indication that the discrepancies with observations, and the improvement in the CMIP6 models (see below), were not associated with the ocean grid resolution but rather resulted from the physical and biogeochemical setup of the models. Another hint at the lack of relationship between resolution and model rank is the similar ranking of the two MPI models in the CMIP5 group, MPI-ESM-LR and MPI-ESM-MR, despite an important difference in model grid resolution (Figure 8). Much higher resolution will be necessary to refine the projections in coastal areas (e.g., Holt et al., 2017; Saba et al., 2016), which is not currently computationally feasible in ESMs (Holt et al., 2009, 2017).

Impact of biogeochemical model structure
Although model performance is likely influenced by the biogeochemical model structure, we did not find a clear relationship between biogeochemical model and performance. While the inner and outer ensembles share only 4 biogeochemical models (PISCES, HAMOCC, TOPAZ2, NOBM) out of 13, there was no indication of consistently better performance for the biogeochemical models in the inner ensemble. For example, models using similar ocean biogeochemistry (e.g., PISCES: 5, 12-14, 22, 26, 18, 28-29) had very different ranks, with no obvious relationship between overall model rank and the ocean biogeochemical model component. Moreover, 4 biogeochemical models were represented in the 5 best ranked ESMs, similar to previous findings by Rickard et al. (2016).

Improvement from CMIP5 to CMIP6
Model performance improved in the new CMIP generation, but not uniformly across models and variables. The 4 best ranked ESMs were from the CMIP6 group, although the average rank was not very different between the two groups, i.e. $\bar{R}$ = 17.4 and 14.0 for CMIP5 and CMIP6, respectively (Figure 8, Table 2). The change in performance between the two generations of models can be assessed by evaluating the subset of models that are available for CMIP5 and CMIP6. There are nine such models (Figure 12). All CMIP6 models have improved overall ranks, indicating better performance (Figure 12). The overall improvement was large only for models that had average to low ranks in the CMIP5 group (ranks 15-22, x-axis in Figure 12).
Temperature did not improve except for GFDL-ESM2M and degraded in some cases. The change in ranking is therefore mainly associated with better surface fields for chlorophyll and nitrate. This is particularly the case for model pairs 3, 5, 6 and 8, which ranked much better for chlorophyll (+8.2) and nitrate (+12.7) in the CMIP6 group (Figure 12). The chlorophyll rank in model pair 4 improved significantly (+18) but this improvement was counteracted by degraded temperature and nitrate ranks. The lack of improvement in surface temperature indicates that the temperature bias detected in the CMIP5 group was not solved in CMIP6, as seen in Figure 4.
We can only speculate about the source of improvement in the CMIP6 models. Kwiatkowski et al. (2020) recently showed that projected surface temperature, nitrate and net primary production differ significantly in CMIP5 and CMIP6 model ensembles. Higher climate sensitivity in CMIP6 models partly explains this difference but the source of change in primary production was not resolved. In the historical simulations, better surface chlorophyll and nitrate fields in CNRM-ESM2-1 may be associated with the transition from a climate model with ocean biogeochemistry to a fully coupled ESM, even though such a transition may degrade historical simulations due to the replacement of observations by prognostic schemes that are poorly constrained (Séférian et al., 2019). Updated land and ocean biogeochemistry may have improved the representation of surface chlorophyll and nitrate in MPI-ESM1-2-HR (Müller et al., 2018), whereas the improvement in surface temperature and nitrate fields from GFDL-ESM2M to GFDL-ESM4 seems to be associated with the physical ocean component of the model, given that GFDL-ESM2G already performed well in the CMIP5 group. Danabasoglu et al. (2020) found a significant improvement for CESM2 at the global scale but a poor representation of the Gulf Stream-North Atlantic Current system, resulting in a large surface temperature bias. This is in line with our assessment for the NWA shelf where both physical and biological parameters had poor scores and the model was not found appropriate for shelf studies in the NWA.

Other coastal regions
Our results may also apply to other coastal regions, given the poor representation of coastal areas in ESMs, but the details are probably region specific. Discrepancies with observations in the NWA are partly driven by poor representation of large-scale circulation features such as the Gulf Stream and Labrador Current in most of the models. The representation of large-scale currents may improve (or worsen) in other regions, resulting in a different ranking there. For example, Rickard et al. (2016) found a different model selection in the inner model ensemble around New Zealand. Seven (out of 11) of their inner ensemble models (models 2-5, 7-8, 14) are not included in our inner ensemble. Model 3, perhaps the best model in their assessment, ranked 29 out of 30 in the NWA shelf region (Figure 8, supporting Figure S1). The representation of the dynamic NWA circulation is a known issue in ESMs and further regional comparisons will be necessary to assess if our results are representative for the global coastal ocean.
We caution against using model ensembles, either directly or in downscaling future projections for the NWA shelf. The regional model (ACM) clearly outperformed the global models and is a good candidate for downscaled projections in combination with one of the top ranked ESMs. Further refinement in the ACM should focus on the mechanisms that determine the magnitude of the spring bloom.
Similar comparisons should be carried out in coastal areas before using CMIP model projections. While it is not clear how the presented model ranking will hold in other regions, it is highly likely that some models do not perform well in coastal areas generally and should not be used for regional investigations.
Given the lack of a direct relationship between model skill and horizontal resolution, it is unlikely that feasible grid refinement will significantly improve model performance in the NWA region. The improvement in scores from CMIP5 to CMIP6 shows that refining ocean biogeochemical components can improve the model performance.
Code and data availability. The ROMS code and the observations are available from the links referenced in the manuscript.
Supplement link. The supplement related to this article is available on-line at:
Author contribution. AL and KF conceived the study. AL and AK set up the ACM model. AL conducted the analyses. AL wrote the manuscript with input from KF.
Competing interests. The authors declare that they have no conflict of interest.
Acknowledgements. The ACM was run on Compute Canada resources under the resource allocation project qqh-593-ac.
Financial statement. We acknowledge funding from the Canada First Research Excellence Fund, through the Ocean Frontier Institute, the MEOPAR Network of Centres of Excellence through the Prediction Core, and an NSERC Discovery Grant held by KF.
Table 1. Information about the regional model and the 29 ESMs. For the CMIP5 models (2-18) the r1i1p1 ensemble was used.