Evaluating the ocean biogeochemical components of Earth system models using atmospheric potential oxygen and ocean color data

The observed seasonal cycles in atmospheric potential oxygen (APO) at a range of midto high-latitude surface monitoring sites are compared to those inferred from the output of six Earth system models (ESMs) participating in the fifth phase of the Coupled Model Intercomparison Project phase 5 (CMIP5). The simulated air–sea O2 fluxes are translated into APO seasonal cycles using a matrix method that takes into account atmospheric transport model (ATM) uncertainty among 13 different ATMs. Three of the ocean biogeochemistry models tested are able to reproduce the observed APO cycles at most sites, to within the large TransCom3-era ATM uncertainty used here, while the other three generally are not. Net primary production (NPP) and net community production (NCP), as estimated from satellite ocean color data, provide additional constraints, albeit more with respect to the seasonal phasing of ocean model productivity than overall magnitude. The present analysis suggests that, of the tested ocean biogeochemistry models, the community ecosystem model (CESM) and the Geophysical Fluid Dynamics Laboratory (GFDL) ESM2M are best able to capture the observed APO seasonal cycle at both northern and southern hemispheric sites. In most models, discrepancies with observed APO can be attributed to the underestimation of NPP, deep ventilation or both in the northern oceans.


Introduction
Ocean physical and biogeochemical processes have profound influences on Earth's climate.Phytoplankton in the sunlit part of the ocean convert carbon from inorganic to organic form via photosynthesis, thereby establishing the base of the ocean food chain.Primary production and subsequent export of organic carbon from the mixed layer (export production, EP) and remineralization at depth are key components of the so-called biological pump, which regulates the partition of carbon between the ocean and atmosphere (Gruber and Sarmiento, 2002;Boyd and Doney, 2003).
Net community production (NCP) and the related process of EP are also important for understanding the distribution of dissolved O 2 within the ocean and the flux of O 2 (F O2 ) at the air-sea interface.NCP is defined here as the net amount of organic carbon fixed through photosynthesis over the depth of the mixed layer after accounting for grazing and both autotrophic and heterotrophic respiration.NCP is closely linked to F O2 , since each mole of photosynthetically fixed carbon that persists beyond the timescale of air-sea exchange (2-3 weeks) leaves a stoichiometric amount of O 2 available for release to the atmosphere.This release of O 2 to the atmosphere in association with NCP occurs mainly in the spring and summer at extratropical latitudes (Keeling et al., 1993).EP more or less balances NCP when averaged over a full year or if the upper ocean is in a long-term steady state and advec-tive fluxes are zero (Laws et al., 2000).The exported carbon subsequently is respired in the subsurface ocean, leading to O 2 depletion at depth.O 2 is replenished by absorption from the atmosphere when deep waters mix back to the surface in fall and winter.Deep ventilation and NCP thus are distinct processes that are largely separate in time and space but are both closely linked to the biological pump that draws carbon out of surface waters and is critical for ocean uptake of atmospheric CO 2 .
To explore the impacts of future climate change on Earth's climate and ecosystems, the Coupled Model Intercomparison Project phase 5 (CMIP5) relies on three-dimensional numerical Earth system models (ESMs), which incorporate descriptions of biogeochemical impacts of land and marine biota.Projections of future atmospheric CO 2 levels and associated climate warming in the CMIP5 depend not only on fossil fuel use projections but also on assumptions about uptake and storage of carbon by the land and ocean.The oceans have absorbed approximately one-third of the anthropogenic carbon released to the atmosphere since the beginning of the industrial era (Khatiwala et al., 2009), but this fractional rate of uptake is unlikely to continue in the future as the buffering capacity of surface waters declines and the export of carbon from the surface to the deep ocean fails to keep pace with anthropogenic fossil fuel combustion (Arora et al., 2013).Changes in ventilation of abyssal deepwater are an additional possible consequence of future climate forcing that current models may or may not be able to predict accurately (Sigman et al., 2010).
Recent studies have tested the present-day skill of the ocean components of ESMs and some have also examined future projections (Schneider et al., 2008;Steinacher et al., 2010, Bopp et al., 2013;Anav et al., 2013).These evaluations have compared model output to both hydrographic measurements and remotely sensed ocean color products, most commonly net primary production (NPP).The models predict spatial-annual patterns in NPP that reproduce some of the main features seen in satellite data, but differ over a factor of 2 in NPP magnitude.Some evaluations have examined seasonal variability and have found that ocean models tend to underestimate observed NPP at high latitudes (poleward of 44 • ) in the Northern Hemisphere and overestimate it in the Southern Hemisphere.The models also fail to capture the timing of the observed high-latitude peak in NPP in both hemispheres, with predictions that are often 1-2 months earlier than observations (Anav et al., 2013;Henson et al., 2013).However, ocean color-derived NPP values are uncertain, especially in the Southern Ocean, reducing confidence in the observed constraints.
Many biogeochemical processes that are expected to occur in the future, such as responses to warming and stratification, are also highly relevant on seasonal timescales (Keeling et al., 2010;Anav et al., 2013).Thus, challenging models against known seasonal variations can aid in the development of credible predictions of future changes.Here, we evaluate six ESMs used in the CMIP5 against two cross-cutting metrics, which test the models' ability to account for changes in ocean biogeochemistry on seasonal time frames.This work is intended primarily as a demonstration of method using an available subset of the CMIP5 ESMs rather than as a comprehensive evaluation of all the CMIP5 models.The first metric is based on satellite-derived estimates of ocean color, focusing on NPP and NCP.The second metric is based on the seasonal cycles in atmospheric potential oxygen (APO), an atmospheric tracer that varies seasonally mainly due to air-sea exchanges of O 2 (Stephens et al., 1998;Manning and Keeling, 2006).
EP is the ocean color-derived flux most relevant to the biological pump, but cannot be directly observed by remote sensing.It is derived by a combination of remote measurements and poorly constrained models, which inherently increases its uncertainty (Schneider et al., 2008;Nevison et al., 2012a).The quantity actually observed from space is spectral top-of-the-atmosphere radiance, which is used to estimate chlorophyll (or another proxy of phytoplankton biomass); chlorophyll and other variables such as photosynthetic radiation are used to estimate NPP and, finally, NPP is used to estimate EP.The first step, estimation of chlorophyll, is known to have significant bias (underestimation by ∼ 2-3 times) in the Southern Ocean which is transferred to higher level products.We correct for that bias by using algorithms tuned to Southern Ocean data sets blended with more or less standard products elsewhere (Mitchell and Kahru, 2009;Kahru and Mitchell, 2010).While our satellite estimates of EP are improved, they are still subject to high uncertainty.
Observed seasonal cycles in APO provide a new benchmark for the ocean biogeochemistry model components of ESMs.They offer evaluation metrics complementary to ocean color products by providing additional information on deep ventilation processes unavailable from satellite data alone.The main drawback of APO seasonal cycles is that atmospheric transport models (ATMs) are needed to translate ocean model air-sea O 2 fluxes into a seasonal APO signal, which inevitably introduces uncertainty (Stephens et al., 1998;Nevison et al., 2012a).A first attempt has been made to use APO seasonal cycles to evaluate ocean-only marine biogeochemistry models (Naegler et al., 2007), but the models in that study implemented a simplified parameterization of the biological processes affecting O 2 and CO 2 air-sea fluxes and were considerably less advanced than the current ecosystem dynamics and biogeochemical components used in state-ofthe-art ESMs.Further, while Naegler et al. (2007) asserted that the uncertainty introduced by ATMs was too large to provide a strong constraint on ocean model fluxes, their study relied on only two ATMs.Here, we translate the model air-sea fluxes into APO signals using a wider range of ATMs and show that, in many cases, the discrepancies between modeled and observed APO seasonal cycles transcend ATM uncertainty.
The six ESMs differ in their physical components and implement ocean biogeochemical schemes that vary in their specifics, but have many common features.All include explicit representations of upper ecosystem dynamics that distinguish at least one phytoplankton group and one size class of zooplankton.Four of the models (CESM, both GFDL variants and IPSL) divide phytoplankton further into at least two size classes: large (micro) and small (nano + pico).GFDL and CESM also explicitly model diazotrophs.Phytoplankton growth rates in all models are co-limited by light, temperature and nutrient (N, P, Si, Fe) availability.Carbon export flux is closely linked to ecosystem structure and dynamics, with higher sinking rates assumed for large phytoplankton, representing, e.g., diatoms.
For each model, the following output fields were obtained for the CMIP5 standard historical simulation, which is driven by prescribed atmospheric CO 2 from 1850 to 2005: carbon export flux at 100 m depth (EP 100 ), vertically integrated NPP, net air-sea O 2 and CO 2 fluxes, net surface heat flux (Q), and sea surface salinity and temperature (SST).Many of these fields were available through public web interfaces, but some variables, particularly Q, required assistance from the individual modeling groups, which effectively limited the study to the six models listed above.The EP 100 and NPP fields were compared directly to the corresponding satellite ocean color products.The remaining five output fields were used in the estimation of APO time series, with the final three fields used to estimate air-sea N 2 fluxes based on the Q(dS/dT ) N 2 /C p equation (Keeling et al., 1993; Man- In this equation, Q is heat flux, (dS/dT ) N 2 is the temperature derivative of the N 2 solubility coefficient and C p is the heat capacity of sea water.The resulting N 2 fluxes, together with the prognostic O 2 and CO 2 air-sea fluxes, were used as described below to force atmospheric transport model simulations to compute atmospheric time series of APO (Naegler et al., 2007;Nevison et al., 2008;2012a).Since all the ocean models operated on an irregular, off-polar grid with two-dimensional latitude and longitude coordinates, these were first interpolated to a regular 1 • × 1 • latitudelongitude grid using climate data operator (CDO) freeware (https://code.zmaw.de/projects/cdo).The climate data operator interpolation was not mass conservative, but resulted in global O 2 flux differences generally of less than 1 %.An exception was the CESM, whose output was converted conservatively to a regular grid using a CESM-specific mapping file.

Matrix method
A matrix method was used to translate the ocean model air-sea O 2 , N 2 and CO 2 fluxes into corresponding annual mean cycles in atmospheric potential oxygen (APO).The method was based on the pulse-response functions from the TransCom3 Level 2 (T3L2) ATM intercomparison.Each of the 13 ATMs that participated in T3L2 conducted forward simulations in which a uniformly distributed CO 2 flux, normalized to 1 PgC yr −1 , was released from each of 11 ocean regions (Fig. 1) for each of 12 emission months, i.e., January-December, allowed to decay for 35 months, using an annually repeating cycle of meteorology that was model specific for each ATM, and sampled every month at a range of surface monitoring sites (Gurney et al., 2003;2004).The APO code was developed from an earlier pulse-response matrix code, which has been described in detail in The pulse-response matrix code was applied separately to the O 2 , N 2 and oceanic CO 2 fluxes from the last 12 years of the historical simulations, spanning 1994-2005, converting from carbon to oxygen or nitrogen units where appropriate, to create three separate time series of atmospheric O 2 , N 2 and CO 2 as mole fraction anomalies (µmol mol −1 ) on a H 2 Ofree basis, where the O 2 and N 2 anomalies are computed as though O 2 and N 2 were trace gases, similar to CO 2 .These were combined to calculate a 9-year time series in APO in per meg units, spanning 1997 to 2005, according to Eq. ( 1) (Stephens et al., 1998): where X O 2 and X N 2 are the dry air mole fractions of O 2 and N 2 in H 2 O-free air, treated here as constants (0.2094 and 0.7808, respectively).The mean seasonal cycle was computed by detrending the time series with a third-order polynomial and then taking the average of the detrended data for all Januaries, Februaries, etc.The matrix method involves calculating separately the components of APO at each measurement site arising from fluxes from each ocean region.These components are then summed to compute the net APO signal.The model definition of APO in Eq. ( 1) ignores contributions to APO from land biospheric exchanges at ratios other than 1 : 1 and fossil fuel burning, but these are very small in comparison to oceanic contributions on seasonal timescales (Manning and Keeling, 2006;Nevison et al., 2008).

Evaluation of matrix method based on APO TransCom
An evaluation exercise was conducted in which the APO pulse-response matrix code was forced by climatological O 2 and N 2 fluxes from Garcia and Keeling (2001) and used to compute the mean seasonal cycle in APO as described above using Eq. ( 1) (minus the oceanic CO 2 term).The matrixbased results were evaluated against the mean seasonal cycles from archived station output from the forward ATM simulations of the APO TransCom Experiment, which also used the Garcia and Keeling O 2 and N 2 forcing fluxes (Blaine, 2005;Nevison et al., 2012b).This evaluation was conducted using a subset of 9 of the original 13 T3L2 ATMs that also participated in APO TransCom.For this subset, the matrix method performed well in relatively homogeneous regions like the Southern Ocean and at northern high-latitude sites like Barrow, Alaska (BRW) and Alert, Canada (ALT).It was less reliable in capturing the forward simulation cycle at sites located within northern mid-latitude ocean regions, including Cold Bay, Alaska, and La Jolla, California, where the uniform distribution of fluxes assumed by T3L2 did not accurately capture the impact of strong heterogeneity in air-sea fluxes from these regions (Tables S1, S2 and Figs.S1, S2 in the Supplement).These same North Pacific stations are subject to large uncertainty in full forward ATM simulations due to uncertainty in vertical mixing (Stephens et al., 1998;Battle et al., 2006;Tohjima et al., 2012).We therefore focus in Sect. 3 on ALT, BRW and three Southern Ocean sites, including Macquarie Island (MQA), Palmer Station, Antarctica (PSA) and the South Pole (SPO) in our use of APO to evaluate the ESM-simulated air-sea O 2 , N 2 and CO 2 fluxes.The locations of these five sites with respect to the T3L2 ocean regions are shown in Fig. 1.
While the evaluation exercise indicates that the matrix method reproduces the shape and phase of the seasonal cycles with high reliability at the above sites, it tends to underestimate the seasonal amplitude by about 4-5 % at ALT and BRW and by 11-12 % at MQA and SPO and to slightly overestimate the amplitude at PSA.In applying the matrix code to the ESM oceanic fluxes, we therefore scaled up the estimated cycles by site and ATM-specific scaling factors obtained from the evaluation exercise (Tables S1, S2, Fig. S2).Since these scaling factors were only available for the subset of 9 of the 13 T3L2 ATMs that also participated in APO TransCom, we subsequently (Sect.3.1) compare observations alternatively to the scaled nine-model subset, or to all 13 unscaled models.

Component O 2 fluxes
The net air-sea O 2 flux for each ESM can be divided into three components, associated with NCP, deep ventilation and thermal processes (Nevison et al., 2012a): (2) These in turn can be used to force the matrix model and the resulting total APO cycle can be presented as the sum of component signals according to Eq. ( 3).
Here, the APO therm term also includes the effects of N 2 fluxes, as per the second right-hand term in Eq. ( 1).The atmospheric signal due to oceanic CO 2 (last term in Eq. 1) is not easily included in any of the component terms in Eq. ( 3) based on available ESM output, but in principle all three component processes may lead to changes in CO 2 fluxes as well as O 2 fluxes.In practice, CO 2 has only a small influence on the amplitude and phasing of APO in most of the ESMs and thus is ignored in the component analysis.An exception is the MPIM, in which the oceanic CO 2 signal has a peaktrough seasonal amplitude of up to 5 ppm in the Southern Ocean that opposes the O 2 cycle, as noted previously (Anav et al., 2013) and discussed further below.Among the terms in Eq. ( 2), F O 2 ,total was provided outright by the ESMs and the thermal component F O 2 ,therm can be derived easily from standard ESM output following the approach described above for N 2 .The remaining terms, F O 2 ,NCP and F O 2 ,vent are more challenging to estimate from available ESM output.In Nevison et al. (2012a), F O 2 ,NCP was estimated from EP multiplied by a molar ratio of 1.4 mol O 2 per mol C exported.The assumption that F O 2 ,NCP = 1.4 EP was shown in Nevison et al. (2012a) to yield reasonable results for EP derived from satellite data (and indeed was applied to the satellite data described below in Sect.2.3), but this approach proved unsatisfactory for EP 100 from the ESMs, especially in the Southern Ocean as discussed further below, since it yielded an atmospheric signal that was unreasonably small.The assumption also led to phasing uncertainties for some models (IPSL, NorESM1 and MPIM) that use finite sinking velocities for particulate organic carbon (as opposed to instantaneous vertical redistribution, as assumed, e.g., by CESM) with a resulting delay in EP 100 relative to NPP.Since the timing of F O 2 ,NCP is likely to be more closely related to NPP than EP 100 (Nevison et al., 2012a), we estimated F O 2 ,NCP from the ESMs alternatively as 1.4 EP 100 and 1.4 ef * NPP, where NPP is the standard, vertically integrated ESM output variable and ef is the model-specific annual mean EP 100 /NPP ratio, integrated over the 40-60 • N or 40-60 • S latitude band for northern and southern stations, respectively (Table 1).
Finally, F O 2 ,vent in principle can be estimated as a residual of the other three terms in Eq. ( 2).F O 2 ,vent was estimated with reasonable success at the northern hemispheric sites, but generally looked unreasonable in the Southern Ocean for most models, with the exception of the IPSL.The signals were judged to be unreasonable on the basis of whether the APO vent term, if estimated as a residual from Eq. (3), dominated the APO NCP term in driving the springtime rise in APO.In reality, the APO NCP term must be primarily responsible for this rise (Keeling et al., 1993;Bender et al., 1996;Nevison et al., 2012a).We therefore do not attempt to explicitly resolve or present APO vent signals in the Southern Hemisphere.While the problems with APO vent necessarily imply a corresponding problem in one or both of the other component terms APO NCP and APO therm , as discussed below, the shape of these latter terms is still informative and is less sensitive to the uncertainties inherent in the residually estimated APO vent term.

Satellite ocean color data
The primary output product of satellite ocean color measurements historically has been the concentration of chlorophyll a (Chl), which is also the main input to most satellitebased ocean primary productivity models (Behrenfeld and Falkowski, 1997).However, the standard Chl product based on empirical band ratios of reflectances represents primarily the coefficient of total absorption of blue light and is inherently biased if the distributions of the optically active components deviate from the global mean (Lee et al., 2011;Siegel et al., 2005;Sauer et al., 2012).In the Southern Ocean the standard Chl algorithms underestimate in situ Chl by 2-3 times (Mitchell and Kahru, 2009), whereas in the Arctic they overestimate it (Mitchell, 1992).These errors are directly transferred into errors in estimates of NPP and EP.
For the Southern Hemisphere we used an empirical Chl algorithm (Scripps Photobiology) that was tuned to in situ Chl in the Southern Ocean and spatially blended with the standard SeaWiFS OC4 algorithm (Kahru and Mitchell, 2010).The same blending scheme was applied when blending NPP between two versions of the Vertically Generalized Productivity Model (VGPM) algorithm (Behrenfeld and Falkowski, 1997): the Southern Ocean version and the low-latitude version of Kahru et al. (2009).EP was calculated using a modified version of the Laws (2004) model according to Nevison et al. (2012a).The mean annual cycles for Chl, NPP and EP were calculated for 1997-2009 using data derived from Sea-WiFS.
For the Northern Hemisphere we used NPP data calculated according to the standard VGPM using MODIS-Aqua Chl.NPP was downloaded from http://science.oregonstate.edu/ocean.productivity.EP was calculated according to Dunne et al. (2005).The mean annual cycles for NPP and EP were calculated for 2002-2011 using monthly composites derived from MODIS-Aqua.While the Laws (2004) and Dunne et al. (2005) methods of deriving EP are not identical, they both estimate export efficiency as a function of sea surface temperature and NPP, are fitted to in situ data, and generally produce similar estimates.In Nevison et al. (2012a) the Southern Ocean EP derived with the Laws (2004) model was modified by constraining to the bulk nutrient budget estimated in the ocean inversion of Schlitzer (2002).That reduced the unrealistically high export efficiency of the Laws (2004)  Both the SPGANT and VGPM/OSU (Oregon State University) satellite algorithms for NCP were converted to airsea O 2 fluxes using F O 2 ,NCP = 1.4 NCP, where 1.4 refers to the molar ratio between O 2 produced and carbon fixed in photosynthesis.F O 2 ,NCP was used to force the pulseresponse code to estimate the corresponding APO signal associated with NCP as per Nevison et al. (2012a).

APO data
APO is a unique atmospheric tracer of ocean biogeochemistry that is calculated by combining high precision O 2 and CO 2 data according to APO = O 2 + 1.1CO 2 (Stephens et al., 1998).By design, APO is mostly insensitive to exchanges with the land biosphere, which have a nearly fixed stoichiometry that produces compensating changes in O 2 and CO 2 .In contrast, the exchanges of O 2 and CO 2 across the air-sea interface are not strongly correlated, largely because variability in dissolved CO 2 is strongly damped by carbonate chemistry in seawater on seasonal timescales.As a result, seasonal variability in APO reflects changes in atmospheric oxygen occurring almost solely due to oceanic processes (Manning and Keeling, 2006).
APO was calculated according to where δ(O 2 / N 2 ) is the relative deviation in the O 2 / N 2 ratio from a reference ratio in per meg units, X O 2 = 0.2094 is the O 2 mole fraction of dry air (Tohjima et al., 2005), CO 2 is the mole fraction of carbon dioxide in parts per million (µmol mol −1 ), and 1.1 is a qualitative estimate of the −O 2 : C ratio of terrestrial respiration and photosynthesis.Mean seasonal cycles for observed APO were obtained using the same detrending and averaging methodology described in Sect.2.2.1.The uncertainty in the observed mean seasonal  cycles over the timespan of available data is less than 6 % at extratropical latitudes, reflecting a combination of instrumental precision, synoptic variability and interannual variability (IAV) in the seasonal cycle.The current study is focused on the mean seasonal cycle in APO as a first-order challenge for the CMIP5 ocean models.Here, model, APO and satellite seasonal cycles are evaluated over roughly comparable periods that are dictated by data availability.The examination of IAV is deferred to future research, which will require ATM simulations of APO driven by interannually varying meteorology.

Phase metrics
The time of year of the seasonal maximum in APO and NPP was used as a phase metric.For APO, monthly mean, stationspecific time series, both modeled and observed, were fit to a third-order polynomial plus first two-harmonics function.
The harmonic components of the fit were used to construct a mean seasonal cycle with daily resolution and the day of the  seasonal maximum was identified.The same approach was used to derive the day of the seasonal NPP maximum, except that the fit was applied to monthly mean satellite-derived and ESM NPP integrals summed from 40 to 60 • S and 40 to 60 • N, which were compared to the APO phase metric at southern and northern stations, respectively.

APO comparison to ESMs
The APO cycles estimated from the six sets of ESM airsea fluxes were compared to observations at three Southern Ocean and two northern monitoring sites (Fig. 2).In these plots, the green envelope reflects our best estimate of the ATM uncertainty in the ocean model APO signal based on the nine scaled ATM results, while the gray window reflects the more complete range of uncertainty using all 13 unscaled ATM results.In general, the distinction between the green and gray windows is only moderately important, as the The MPIM and related NorESM1 ocean biogeochemistry models are examples in which the observed APO cycle lies outside both ranges of uncertainty at all five evaluation sites (Fig. 2, lower middle and right panels).For these models, the rise in the APO cycle occurs too early in the springtime in both hemispheres, while the overall amplitude of the cycle is too large at all the southern stations.Here, it is notable that the MPIM APO amplitude would be even larger in the Southern Ocean if it were not offset by the unrealistically large seasonal cycle in oceanic CO 2 described above.The large CO 2 cycle, however, does not substantially alter the phase of APO, which is determined mainly by the timing of the O 2 fluxes.
The IPSL is another ocean biogeochemistry model for which the observed APO cycle lies outside of both the best guess and full range of uncertainty at all monitoring sites, with the exception of Palmer Station (64.9 • S), where observed APO falls within the wider gray window of uncertainty (Fig. 2b, lower left panels).Unlike the MPIM and NorESM1, the rise in the IPSL APO cycle occurs somewhat later in the springtime than observed, while the overall amplitude of the cycle tends to be underestimated.The underestimate is mild at all the southern stations, and even falls within the broader range of uncertainty at PSA, but is more pronounced at the northern monitoring sites, where the IPSL amplitude is too small by nearly a factor of 2.
The CESM is the top performer among the six ESMs evaluated, consistently yielding green (gray) windows that encompass the observed APO cycle at most (all) of the five monitoring sites (Fig. 2, upper left panels).The GFDL ESM2M (depth-based coordinates) is the second most consistent performer, yielding cycles that generally agree with observations, with exceptions at BRW, where ESM2M tends to mildly underestimate the depth of the APO trough, and at PSA, where the rise in the APO cycle may be up to 1 month too early.The sigma-coordinate GFDL ESM2G model is the third best performer, capturing the observed APO cycle relatively well at most southern stations, but underestimating the seasonal amplitude at the northern stations.

Regional analysis of APO cycle
The matrix method can partition the ocean model APO cycles into regional contributions from the 11 ocean regions used in T3L2.At the southern stations of SPO, PSA and MQA, this partitioning reveals, not surprisingly, that the Southern Ocean (defined as all ocean regions south of 44 • S) dominates the APO cycle (not shown).However, at BRW and ALT at least three regions make important contributions, including the temperate North Pacific (extending from 15 • N to the Bering Strait around 65 • N and thus including the subpolar region), the temperate North Atlantic (extending from 15 to 48 • N) and the Northern Ocean (including the Arctic Ocean and the North Atlantic north of 48 • N).The Northern Ocean is the most important contributor to the APO seasonal cycle at both BRW (Fig. 3) and ALT and is by far the most variable component among the six ESMs.The largest APO amplitudes in the Northern Ocean are produced by the CESM and NorESM1, which are the only two models that capture the total observed APO amplitude at BRW (Fig. 2d).

Partitioning of APO cycle among component signals
To probe further into the underestimate of the APO amplitude at BRW by most of the ESMs, we partitioned APO into thermal and NCP-related components, as described in Sect.2.2.3 (Fig. 4).A comparison of the CESM and ESM2M in Fig. 4 indicates that both have similar APO therm and APO NCP signals, but that the CESM captures total APO more or less correctly while ESM2M underestimates the total APO amplitude.By inference, the missing APO vent term accounts for the difference.However, as discussed in Sect.2.2.3, APO vent can be estimated only as a residual of three other terms using standard CMIP5 output and thus its shape and phasing are sensitive to even small uncertainties in those other terms.Thus, the residual ventilation curves in Fig. 4 should be interpreted with caution (e.g., the MPIM curve is clearly unreasonable in phasing).The four remaining ESMs have APO NCP cycles of similar or smaller amplitude than the CESM, which in the case of the ESM2G and MPIM is due primarily to their relatively low export fraction ratios (ef ratios), and all these models substantially underestimate the total APO amplitude at BRW.This suggests that these models probably also underestimate some combination of deep ventilation and NCP.A similar partitioning of APO was attempted in the Southern Ocean, but the estimation of APO NCP from model EP 100 generally did not give plausible results in this region.This problem is discussed in more detail in Sect. 4.

Satellite data compared to ESMs
Estimates of NPP display a wide variety of spatial patterns among models and satellite data (Fig. 5).Global totals range over more than a factor of 2 (34-82 Pg C yr −1 ) among the ESMs, with most models tending to exceed the VGPM satellite-based estimate of 45 Pg C yr −1 (Table 1).Global EP is more consistent among the models, with a value around 8 Pg C yr −1 in most cases, in good agreement with the satellite-based estimate.Global EP converges among the ESMs because the model with the highest global NPP (ESM2M) has the smallest ef ratio of < 0.1 and the models with the lowest global NPP (IPSL, NorESM1) have the largest ef ratios of about 0.2 (Table 1).
The high global NPP totals in the ESMs are driven in large part by high tropical NPP values, which generally are not reflected in the satellite data except along coastlines (Fig. 5).In this paper, we focus on the 40-60 • latitude bands, which are more important than the tropics in driving the seasonal cycles in NPP, EP (NCP) and APO (Garcia and Keeling, 2001;Anav et al., 2013).In the Southern Ocean 40-60 • S band, global NPP ranges among ESMs from 5.2 to 12.5 Pg C yr −1 , encompassing the satellite-based estimates (Table 1, Fig. 6).However, the ESMs tend to underestimate EP relative to the satellite-derived values, particularly the SPGANT Laws (2004) model, due largely to the small model ef ratios.In the 40-60 • N band, the ESMs generally underestimate both NPP and ef ratios relative to the satellite-derived values.This combination leads to model EP values that are smaller than satellite EP by a factor of 2 on average (Table 1).In both hemispheres, the model NPP maximum tends to occur earlier than the satellite-derived maximum, with some models (IPSL, MPIM) predicting a maximum that is up to 1-2 months early (Fig. 6).

Combining APO and satellite data
In the previous sections we considered APO and satellite data as separate evaluation metrics for ESMs.Below we consider the two as combined metrics.While this analysis is limited by uncertainties in the absolute magnitude of satellite NPP and EP/NCP and our imperfect ability to partition the ESM total APO signal into its NCP and other components, it nevertheless provides some additional insight into the behavior of the ESMs.

Phase metrics
The phase metrics defining the timing of the observed and model seasonal maximum in APO reveal characteristic patterns for each ESM, which are relatively consistent across APO monitoring sites (Fig. 7).The APO seasonal maxima of the MPIM and NorESM1 are earlier than observed by about 1 month and 3 weeks, respectively, on average, while the IPSL APO maximum (with the exception of PSA) tends to be later than observed by 2-3 weeks.The remaining models, CESM, ESM2M and ESM2G, have seasonal APO maxima that are relatively consistent with observations, although with some variation among different stations.
The observed seasonal maximum of NPP occurs about 30-40 days earlier than the observed APO maximum in the Southern Ocean stations and about 50 days earlier at BRW and ALT.Of the models, ESM2G, CESM and ESM2M capture the phase of the NPP maximum to within about 1-3 weeks, although as noted above in Fig. 6 the model NPP maxima tend to occur earlier than the satellite-based maxima.In the MPIM, the NPP maximum is about 1 to 1.5 months earlier than observed, and the APO maximum is also corresponding early (Fig. 7).The IPSL is an outlier from the general slope of the APO vs. NPP phase relationship, as defined by the rest of the ESMs.The IPSL NPP maximum occurs about 40 days earlier than observed in the Southern Hemisphere and nearly 2 months earlier than observed in the Northern Hemisphere, but the IPSL, curiously, also has the latest APO seasonal maximum of any of the models.NorESM1 is another outlier in the opposite direction off the general APO vs. NPP phase slope, at least in the Northern Hemisphere.There, NorESM1's seasonal maximum in NPP has a relatively small lag from the APO maximum compared to the other models.NorESM1 is also unusual in that  the APO therm seasonal maximum at Barrow occurs about 1 month later than in any of the other ESMs (Fig. 4).

Seasonal amplitudes
In addition to evaluating the phasing of the ocean model APO and NPP cycles, we examined the amplitude of the cycles, with the caveat that the absolute magnitude of satellite-based NPP is not well determined and at present provides a relatively weak constraint on the models.Furthermore, the APO seasonal amplitude in principle is more closely related to NCP (or EP) than NPP.However, we chose NPP for the seasonal amplitude analysis due to the strong discrepancies in ef ratio among models and satellite data indicated in Table 1, which may unduly bias the results.
A cross plot of the seasonal amplitude in APO against the seasonal amplitude of NPP integrated between 40 and 60 • S suggests a strong correlation between the amplitudes of APO and NPP among the ocean biogeochemistry models, with larger NPP amplitudes associated with larger APO cycles.The strong correlation holds at all Southern Ocean stations and is illustrated in Fig. 8a at MQA.The cluster of top-performing ESMs (CESM, ESM2M, ESM2G) agrees relatively well with the observed APO and SPGANT amplitudes.Meanwhile both amplitudes are underestimated by the IPSL and overestimated by the NorESM1 and MPIM.
Cross plots of the seasonal amplitudes of APO and NPP in the Northern Hemisphere reveal that these amplitudes are positively correlated at BRW (Fig. 8b) and ALT (not shown), although the correlation is weaker than in the Southern Hemisphere.The CESM, ESM2G, ESM2M and MPIM all capture the satellite-based NPP seasonal amplitude relatively well, while both the CESM and NorESM1 capture the observed APO amplitude accurately.However, the CESM is the only model that reproduces both the NPP and APO seasonal amplitudes well relative to the observations.

Northern oceans
Most ESMs tend to underestimate substantially the observed seasonal amplitude of APO at Barrow, Alaska.A combination of region-specific results (Fig. 3) and O 2 component analysis (Fig. 4) suggests that some combination of fallwinter deep ventilation and spring-summer EP in the northern oceans (defined to include the North Atlantic north of 48 • N) in particular may be underestimated in many models.The combined analysis of the APO vs. NPP seasonal amplitudes (Fig. 8b) supports these conclusions and suggests that, while several models may be capturing primary production well in the northern oceans, accurate representation of EP and deep ventilation is also important for reproducing the observed APO cycle.The inference from the APO component analysis in Fig. 4 that the GFDL models may have weak ventilation in the North Atlantic appears to contradict the analysis of Dunne et al. (2012), who found robust North Atlantic Deep Water (NADW) formation in both the ESM2M and ESM2G versions, but possibly could be reconciled if the biogeochemical gradients across which deep water formation acts are too weak.
We investigated several mechanisms that might explain the differences among models in the APO cycle at high northern latitudes, including subpolar heat transport and Arctic sea ice cover.Here, stronger northward heat transport should lead to more deep ventilation, while lower sea ice cover will permit more production and ventilation in the Arctic Ocean.Subdividing the northern oceans region into the Arctic Ocean and North Atlantic components revealed that some models (IPSL and ESM2G) have a very small component (< 2 per meg) of APO seasonal amplitude coming from the Arctic Ocean alone (Fig. 9).In ESM2G this may be related to the extensive winter sea ice cover, which exceeds the observed covered area reported by the National Snow and Ice Data Center (http://nsidc.org/data/seaice_index/archives.html) by about 2 × 10 6 km 2 .However, sea ice cover is lower than observed in the IPSL, suggesting the small Arctic APO component in that model is more related to general underestimate of primary and EP (e.g., as shown in Figs.6b and 8b).
While it seems clear that the strong APO seasonality in the CESM can be attributed in part to its high productivity and EP in the northern subpolar and polar regions (Fig. 6 and Table 1), a full explanation for the underlying mechanisms of the CESM fidelity on APO compared to the other models is not readily apparent from surface-only data.This suggests the need for a more detailed exploration of ocean interior ventilation and biological response interactions outside the scope of the present work.

Southern Ocean
Compared to the Northern Hemisphere stations, the ESMs generally are more successful in the Southern Ocean in capturing the observed APO cycle (Fig. 2).Within the range of ATM uncertainty, at least three models, CESM, ESM2M, ESM2G (and IPSL at Palmer Station), predict seasonal APO amplitudes in agreement with observations.Although the Southern Ocean APO amplitude in these models varies over as much as 20 per meg, we currently are not able to distinguish which of the underlying air-sea O 2 flux fields is the most realistic, due to the uncertainty associated with translating these fluxes into an atmospheric signal using TransCom3 era model responses to uniformly distributed regional fluxes.However, even with our current matrix method, the APO constraint is sufficiently robust to indicate that the NorESM1 and MPIM substantially overestimate some combination of production and deep ventilation in the Southern Ocean, while the IPSL probably tends to underestimate these fluxes (Table 1, Fig. 8a).Notably, the ESMs that reproduce APO the best in the Southern Ocean tend to predict a smaller net carbon uptake between 44 and 75 • , and are in better agreement with independent estimates (Lenton et al., 2013) of carbon uptake from ocean inversions and observed pCO 2 databases (Fig. 10).
Reducing ATM uncertainty is a challenge that potentially can be addressed by using column-integrated APO signals from aircraft data (Wofsy, 2011), or conversely, by using vertical profiles to identify top-performing ATMs (Stephens et al., 2007).In addition, the spread in ATM results has been reduced substantially for CO 2 inversions using post-TransCom3-era ATMs (Peylin et al., 2013), suggesting that ATM uncertainty also may be reduced for forward simulations of APO.If this is the case, then new forward simulations with several different modern-era ATMs may be sufficient to characterize ATM uncertainty.Alternatively, it may be valuable to continue with a matrix-based approach, using basis functions from many ATMs, but with redefined regional boundaries that are not defined based simply on latitude, as in T3L2 (Fig. 1), but rather that correspond to the biogeography of major ocean regions (Fay and McKinley, 2014).The definition of such basis functions could help extend the utility of the matrix approach to lower latitude APO monitoring sites and allow for the partitioning of the Southern Ocean into multiple regions defined around biogeochemical function, while still retaining the advantages of the matrix method, i.e., the ability to quickly and easily compare multiple ATMs forced with the same air-sea fluxes.
A second complication in the Southern Ocean analysis is that the EP 100 values reported by the ESMs clearly are not directly comparable to satellite NCP(EP) data, particularly our SPGANT product, and thus can not be translated with confidence into air-sea O 2 fluxes associated with NCP.A likely problem is that the 100 m depth horizon used to compute EP may not be comparable across satellite algorithms and ocean biogeochemistry models.EP 100 will underestimate the model's true NCP-related O 2 outgassing flux if organic matter is respired as it sinks from the actual model mixed layer depth to 100 m depth (Najjar et al., 2007).It is also puzzling that the ef ratios predicted by the ESMs (Table 1) appear to have decreased considerably in some cases relative to those reported for earlier versions of the same models (Steinacher et al., 2010).For example, the Southern Ocean ef ratios for the MPIM and IPSL in that earlier study were about 0.2 and 0.4, respectively, compared to 0.14 and 0.27, respectively, in the current study.The mean global ef ratio for the six ESMs in the current study is only 0.14 and, even in the Southern Ocean, is only 0.17 on average, compared to satellite-based estimates of 0.18 globally and about 0.3 at high latitudes.
The small ef ratios in the GFDL models (of less than 0.1 globally and only 0.10 to 0.13 in the Southern Ocean) appear consistent with the relatively deep summer mixed layer depths (MLDs) in the Southern Ocean, which even at their minimum are often deeper than 100 m in both ESM2M and ESM2G (Dunne et al., 2012).In the CESM the Southern Ocean summer MLDs are generally shallower than 100 m and in many regions are only around 10-40 m deep (Moore et al., 2013).The shallower summer MLDs may contribute to the CESM's larger ef ratio of 0.18, although this ratio is still small compared to the satellite-based estimates.The small GFDL ef ratios may also be related to an overvigorous picophytoplankton component wherein a prochlorococcus-like form is capable of competing relatively well even in cold polar waters.Small picophytoplankton are more likely to be reoxidized and remineralized within the mixed layer, whereas larger, heavier microphytoplankton (e.g., diatoms) are more likely to be exported out of the oceanic mixed layer (Uitz et al., 2010).

Phase relationships
While much of our analysis focuses on the seasonal amplitude of APO and NPP at mid-to high latitudes, both of these metrics involve relatively large uncertainty.This derives from TransCom3-era uniform flux ATM uncertainty in the case of APO, while for NPP the uncertainty results from the lack of strong constraints on the absolute magnitude of the satellite fluxes.In contrast, we have relatively high confidence in the phasing of model APO, as represented by the matrix method (see the Supplement) and in NPP observationally derived from satellite data, based on the close correspondence in phasing between the SPGANT and VGPM algorithms.For these reasons, we used a phase metric, i.e., the timing of the seasonal maximum, to examine relationships between observed and model APO and NPP.As in the seasonal amplitude analysis, MPIM, NorESM1 and IPSL displayed phasing patterns that tended to deviate from observations and the other three top-performing models, albeit in diverging ways.A complete diagnosis of the model physics responsible for the phasing anomalies (e.g., IPSL's early NPP maxima and late APO maxima) described in Sect.3.3.1 is beyond the scope of this paper.Here we note mainly that the phase metrics are a robust and relatively good indicator of overall model performance with respect to APO.

Summary
We have used measurements of the seasonal cycles in APO to challenge and test the ocean model components of six ESMs.The model/data comparison reveal that three of the ESMs tested reproduce the observed cycles reasonably well, within the range of ATM uncertainty, while three do not.ESM performance in general is more favorable in the Southern Hemisphere than in the Northern Hemisphere, where most models appear to underestimate the wintertime ventilation of O 2 -depleted deepwater that drives the declining branch of the APO seasonal cycle and many may also underestimate both primary and EP, particularly at high northern latitudes.We used NPP and NCP(EP) products derived from satellite ocean color data as complementary constraints on the models in an effort to tighten the APO constraint, which reflects a combination of production and ventilation processes.However, while the satellite data provide relatively strong constraints with respect to phasing, they are more uncertain with respect to the absolute magnitudes of NPP and NCP(EP).
At least two primary uncertainties limit our ability to place stronger constraints on ocean model biogeochemistry based on currently available information from APO and satellite data: (1) the relatively large ATM uncertainty involved in translating air-sea O 2 fluxes into APO signals.
(2) The uncertainty in how model EP 100 relates to the true model F O 2 ,NCP flux and how this relationship varies across models and satellite algorithms.The first of these, ATM uncertainty, is large, as quantified using our TransCom3-based matrix method.However, it probably has been overstated in previous analyses, which in some cases went so far as to suggest that APO does not provide a useful constraint on ocean model fluxes (e.g., Naegler et al., 2007).Further, ATM uncertainty could be reduced substantially in future work with modern ATMs and O 2 -specific flux patterns, or with new regional boundaries defined based on ocean biogeography rather than simple latitude.Even within the limits of our current approach, we have shown that half of the six ESMs tested here produce APO cycles whose mismatch with observed APO clearly transcends ATM uncertainty, suggesting underlying deficiencies in those models' physics and biogeochemistry.
Improving the understanding of the relationship between model air-sea O 2 fluxes and quantities like NPP, NCP and EP is a more tractable problem that can be dissected with appropriate model diagnostics, e.g., as per Manizza et al. (2012).In the current analysis, using standard CMIP5 output from six ocean biogeochemistry models, we encountered difficulties in relating F O2 to EP and NCP, which hindered our ability to diagnose the mechanisms responsible for model performance and to compare ESM-derived APO NCP directly to satellitebased APO NCP signals.Extending model-derived insights to satellite products likely will require a shift in emphasis from EP at an arbitrary reference depth to near-surface processes like NCP, which are more relevant for exchanges of O 2 and CO 2 at the air-sea interface and more directly related to upward radiances detected by satellites.
The Supplement related to this article is available online at doi:10.5194/bg-12-193-2015-supplement.
) ESMs (depthbased ESM2M and density-based ESM2G vertical oceans; Dunne et al., 2012) from Princeton, New Jersey; the Institut Pierre Simon Laplace Coupled Model 5 in its lowresolution version (IPSL-CM5A-LR, hereafter referred to as IPSL) model from Paris, France; the community ecosystem model (CESM) from the National Center for Atmospheric Research in Boulder, CO; the Max Planck Institute for Meteorology (MPIM) ESM, version MPI-ESM-LR, from Hamburg, Germany; and the Norwegian ESM (NorESM1-ME, hereafter referred to as NorESM1).The ocean biogeochemical models embedded in the respective ESMs are represented by the Tracers of Phytoplankton with Allometric Zooplankton (TOPAZ) (GFDL) (Dunne et al., 2013), Pelagic Interaction Scheme for Carbon and Ecosystem Studies (PISCES) (IPSL)

Figure 1 .
Figure 1.TransCom3 Level 2 ocean regions used in the matrixbased atmospheric transport method.Locations of the five APO monitoring sites featured in Fig. 2 are superimposed.

Figure 2a .
Figure 2a.Results of the pulse-response code forced by O 2 , N 2 and CO 2 air-sea fluxes from six ESM ocean biogeochemistry model components.The dark green line and light green window show the mean and standard deviation, respectively, of the nine ATMs participating in both T3L2 and APO TransCom.The amplitudes are scaled for each ATM and monitoring site based on the validation exercise described in Sect.2.2.2 and illustrated in the Supplementary material.The gray window shows the full range of responses from all 13 T3L2 ATMs, uncorrected based on the APO TransCom validation exercise.The heavy black line shows the observed APO mean annual cycle.(a) Results at South Pole, compared to SIO observations.
model at cold temperatures and brought it into closer agreement with the Dunne et al. (2005) export efficiency.

Figure 3 .
Figure 3. Partitioning the APO cycle at Barrow, Alaska, into its main regional contributions, North Pacific (black), temperate North Atlantic (cyan) and Northern Ocean (magenta), which includes the North Atlantic north of 48 • N and the Arctic Ocean.All curves reflect the unscaled model mean of the 13 ATMs used in the matrix method.

Figure 4 .
Figure 4. Partitioning of the model mean APO cycle into NCP, thermal and residual ventilation components at Barrow, Alaska.The APO NCP components are estimated alternatively based on ocean model EP at 100m (Prod EP light green, solid curve) and vertically integrated NPP (Prod NPP ) scaled by the mean ratio of EP 100 / NPP (f ratio) between 40 and 60 • N of the given ocean model (dark green, dashed curve).All components were translated into atmospheric signals as described in Sect.2.2.3.Also shown is APO vent (blue), calculated as a residual of APO -APO NCP -APO therm .With the exception of observed APO, all curves reflect the unscaled mean of the 13 ATMs used in the matrix method.

Figure 5 .
Figure 5. Annual mean NPP (in mg C m −2 day −1 ).Top row: MODIS-Aqua data input to the VGPM NPP model and (b) SeaWiFS data input to the SPGANT algorithm as described in Nevison et al. (2012a).Rows 2 and 3 show the corresponding NPP fields from six ESMs for the mean of 1997-2005.

Figure 7 .
Figure 7. Day of APO maximum plotted against day of NPP maximum.The observed data point is derived from APO data at (a) Palmer Station, (b) Macquarie Island, (c) Barrow and (d) Alert plotted against satellite NPP data integrated over the 40-60 • latitude band of the appropriate hemisphere.

Figure 8 .
Figure 8.(a) Seasonal amplitude in APO at Macquarie Island (MQA), located at 54.5 • S, 159 • E, as estimated from the air-sea O 2 , CO 2 and heat fluxes from six ESMs, plotted against the seasonal amplitude of NPP integrated from 40 to 60 • S. Error bars represent the ATM uncertainty in model APO as estimated with the matrix method.The observed (Obs) data points (in red) are based on APO data from the PU network at MQA and NPP from the SP-GANT satellite ocean color algorithms, as described in the text.The correlation coefficient R refers to regression through ESM points only, (b) Same as 8a, but plotting seasonal amplitude in APO at Barrow, Alaska, against the seasonal amplitude of NPP integrated from 40 to 60 • N. The observed data point is based on APO data from the PU network and the VGPM algorithm with MODIS-Aqua input.

Figure 9 .
Figure 9. APO cycle at Barrow, Alaska, from the TransCom northern oceans region, restricted to latitudes north of 65 • N to estimate the contribution of the Arctic Ocean.All curves reflect the unscaled model mean of 13 ATMs used in the matrix method.

Figure 10 .
Figure 10.Annual mean CO 2 uptake in the Southern Ocean for 1997-2005 integrated from 44-75 • S and plotted vs. mean APO amplitude at Macquarie Island (MQA) over the same period, as predicted by six ESMs.Independent estimates of carbon uptake from ocean inversions and observed pCO 2 databases (Lenton et al., 2013), plotted against the observed APO amplitude at MQA are shown for reference.

Table 1 .
Vertically integrated NPP, EP at 100 m (both in Pg C yr −1 ) and EP/PP for six CMIP5 models and two satellite products.
a SPGANT totals are only shown for the 40-60 • S band because the algorithm is optimized for the Southern Ocean but not well validated in the Northern Hemisphere.