Spatially resolved evaluation of Earth system models with satellite column-averaged CO2

. Earth system models (ESMs) participating in the Coupled Model Intercomparison Project Phase 5 (CMIP5) showed large uncertainties in simulating atmospheric CO 2 concentrations. We utilize the Earth System Model Evaluation Tool (ESMValTool) to evaluate emission-driven CMIP5 and CMIP6 simulations with satellite data of column-average CO 2 mole fractions (XCO 2 ). XCO 2 time series show a large spread among the model ensembles both in CMIP5 and CMIP6. Compared to the satellite observations, the models have a bias of + 25 to − 20 ppmv in CMIP5 and + 20 to − 15 ppmv in CMIP6, with the multi-model mean biases at + 10 and + 2 ppmv, respectively. The derived mean atmospheric XCO 2 growth rate (GR) of 2.0 ppmv


Introduction
The Intergovernmental Panel on Climate Change (IPCC) Fifth Assessment Report (AR5) concluded that since 1950 many of the observed changes in the climate system have Published by Copernicus Publications on behalf of the European Geosciences Union.
been unprecedented in the instrument record, confirming an unequivocal warming of the climate system (IPCC, 2013).Increasing emissions of greenhouse gases (GHGs) are the key drivers of anthropogenic climate change.The most important anthropogenic greenhouse gas is carbon dioxide (CO 2 ), with CO 2 emissions contributing more than half of the total global radiative forcing in 2011 relative to 1750 (IPCC, 2013).It is therefore important to monitor the longterm changes in atmospheric CO 2 concentrations, to understand the sources and sinks of carbon and to provide reliable projections of future CO 2 concentrations under various scenarios.
Photosynthesis causes a net uptake of atmospheric CO 2 and thus declining atmospheric CO 2 concentrations in the growing season.Conversely, atmospheric CO 2 concentrations rise throughout the dormant season when there is a net release of CO 2 from the land due to decomposition of organic matter in soils.This uptake and release of carbon by the terrestrial biosphere throughout the year causes a seasonal cycle of atmospheric CO 2 (Keeling et al., 1989).The seasonal cycle amplitude (SCA) has been increasing over the last 50 years, with higher increases in higher latitudes (Barnes et al., 2016;Graven et al., 2013;Yin et al., 2018;Keeling et al., 1995;Keeling et al., 1996;Myneni et al., 1997;Piao et al., 2018).A number of studies have explored the effects of CO 2 fertilization, land-use change and climate warming on the SCA (Bastos et al., 2019;Zhao et al., 2016;Fernández-Martínez et al., 2019).Although models do not agree unanimously, the dominant effects are a positive trend in SCA due to the CO 2 fertilization combined with a negative trend due to climate warming.Some models, however, show a large positive trend due to climate warming (Zhao et al., 2016).Land use is found to be a weaker effect in comparison to CO 2 fertilization and climate warming (Bastos et al., 2019;Fernández-Martínez et al., 2019).
Most long-term measurements of CO 2 are from groundbased stations.In situ ground-based measurements at Mauna Loa (Hawaii, USA) started in 1958, providing the first evidence that fossil fuel combustion leads to a measurable increase in atmospheric CO 2 concentrations (Keeling et al., 1976).Other observatories around the globe now also measure atmospheric CO 2 , reporting an increase of about 45 % since pre-industrial times (Ciais et al., 2013;Friedlingstein et al., 2019).
Satellite measurements of CO 2 , with first nearinfrared/short-wave-infrared (NIR/SWIR) nadir-based (downward-looking) satellite retrievals starting in 2002, can complement the ground-based measurement network and provide regional and spatial distributions of CO 2 .The quantity obtained from measurements with NIR/SWIR satellite instruments is the column-average dry-air mole fraction of atmospheric CO 2 , denoted as XCO 2 .XCO 2 is a dimensionless quantity defined as the vertical column of CO 2 divided by the vertical column of dry air (i.e., all air molecules except water vapor) often given in ppmv (parts per million per volume).An analysis of growth rates (GRs) and SCA from satellite data and their sensitivity to growing season temperature anomaly presented in Schneising et al. (2014) shows a negative correlation between SCA and growing season temperature anomaly for the period 2003-2011, which was confirmed by Yin et al. (2018) for SCA anomaly in this timeframe.Satellite XCO 2 products are often used in combination with atmospheric transport inverse modeling approaches to obtain information on surface fluxes by using a global or regional transport model with free fit parameters (Basu et al., 2013;Houweling et al., 2015;Reuter et al., 2014;Chevallier et al., 2014).The satellite data can also be used to constrain process parameters of a terrestrial biosphere model, e.g., as part of the Carbon Cycle Data Assimilation System (CCDAS; e.g., Kaminski et al., 2013), and have been used for the evaluation of chemistry-climate models (Hayman et al., 2014;Shindell et al., 2013).In the last few years, satellite data have also been used in direct comparison to output from climate models (e.g., Calle et al., 2019), characterizing rise and fall segments in seasonal cycles from GOSAT and comparing them to model output.
A large ensemble of climate model simulations for different type of experiments under common forcings is provided by the Coupled Model Intercomparison Project (CMIP), with output available for CMIP5 (Taylor et al., 2012) and more recently phase 6 (CMIP6; Eyring et al., 2016a).Earth system models (ESMs) produce a large range in projected atmospheric CO 2 , as a result of uncertainties in the future evolution of natural fluxes (Arora et al., 2013;Friedlingstein et al., 2006).Overall CMIP5 models overestimate the carbon content of the atmosphere (Friedlingstein et al., 2014;Hoffman et al., 2014).The largest uncertainties are associated with the response of the land carbon cycle to changes in climate and atmospheric CO 2 (Friedlingstein et al., 2014;Hajima et al., 2014).The ability of ESMs to simulate the land and ocean contemporary carbon cycle has previously been investigated by Anav et al. (2013).They showed that most models were able to correctly reproduce the main climatic variables and their seasonal evolution but found weaknesses in reproducing specific biogeochemical fields, such as a general overestimation of leaf area index and photosynthesis.However, the magnitude of the global photosynthesis is not well constrained by observations, with estimates ranging between 112 and 169 PgC yr −1 (Ryu et al., 2019), and the dataset used by Anav et al. (2013) is on the lower end of this range.For CMIP6, Arora et al. (2020) analyzed simulations with a CO 2 increase of 1 % per year to quantify the carbon-climate feedbacks.They found no significant change in behavior from CMIP5 to CMIP6 but lower absolute values for models which included a nitrogen cycle.
In this paper, we focus on evaluating the growth rate and the seasonal cycle amplitude of simulated CO 2 , converted to XCO 2 , from CMIP ESMs which performed emissiondriven simulations with satellite observations in CMIP5 and CMIP6.The paper is structured as follows: the data prod-ucts used in this study are introduced in Sect. 2. Section 3 describes the methods used, including the calculation of all derived quantities.A comparison between CO 2 flask measurements and XCO 2 measurements at different locations is given in Sect. 4. The evaluation of CMIP simulations with satellite data is presented in Sect.5, divided into sections focusing on the models' ability to simulate XCO 2 time series, growth rate and seasonal cycle amplitude.A summary and conclusion are given in Sect.6.
We use the Observations for Model Intercomparisons Project (Obs4MIPs) version 3 (O4Mv3) XCO 2 satellite data (Buchwitz et al., 2017a(Buchwitz et al., , 2018)).Obs4MIPs hosts observationally based datasets which have been formatted according to the CMIP model output requirements (e.g., variable definitions, coordinates, frequencies) in order to facilitate an easier comparison between observations and models (Ferraro et al., 2015;Teixeira et al., 2014;Waliser et al., 2020).The satellite product used here is a gridded (level-3) monthly data product with a 5 • ×5 • spatial resolution following the Obs4MIPs format, produced as part of the Copernicus Climate Change Service (C3S).The O4Mv3 product is retrieved from two satellite instruments: Scanning Imaging Absorption Spectrometer for Atmospheric Chartography/Envisat (Bovensmann et al., 1999;Burrows et al., 1995) and the Thermal And Near infrared Sensor for carbon Observation Fourier transform spectrometer/Greenhouse Gases Observing Satellite (TANSO-FTS/GOSAT) (Kuze et al., 2009).
The number of observations depends significantly on the location with most points over locations with low cloud cover, high surface reflectivity and (at least) moderate to high Sun elevation.Coverage over ocean is sparse as ocean retrievals are only included from GOSAT Sun-glint mode observations -outside of glint conditions, the reflectivity of water is very low in the NIR/SWIR spectral region.Figure 1 shows the mean monthly coverage of the dataset for 2003-2014.In Sect.5, we will show that taking into account this sampling in the evaluation of ESMs is essential for a proper comparison.
The dataset also contains uncertainty estimates for each grid cell, with a mean value of 0.92 ppmv, accounting for both statistical uncertainties from the individual soundings and uncertainties from potential regional and temporal biases (Buchwitz et al., 2017a).However, the overall uncertainties are small compared to inter-model differences (see Sect. 3.1) and are therefore neglected in our analysis.

Surface CO 2 measurements
For the comparison of satellite XCO 2 and surface CO 2 data in Sect.4, we have obtained surface flask measurements from the NOAA ESRL Carbon Cycle Cooperative Global Air Sampling Network (Dlugokencky et al., 2020).Measurement sites at locations with no available satellite data were excluded from the analysis, which ruled out the four baseline observatories in Mauna Loa and Samoa, as well as the South Pole and Point Barrow sites.Furthermore, sites which did not collect data during the period from 2003-2014 were discarded.From the remaining sites, a sample of five sites was chosen which had the best coverage of different latitudes, and when latitudes were similar, different longitudes were selected for increased spatial coverage.The selected sites are listed in Table 1.

Model simulations
We use monthly mean output data from 10 CMIP5 and 10 CMIP6 models which performed emission-driven simulations, with three of the CMIP5 and five of the CMIP6 models including a nitrogen cycle.Tables 2 and 3 list all the CMIP5 and CMIP6 models used in this paper along with their atmosphere, land and ocean model component, respectively.Only models with an interactive carbon cycle are able to perform the emission-driven simulations, in which the emissions rather than the concentrations of the greenhouse gases are prescribed (Taylor et al., 2012;Eyring et al., 2016a).This allows the carbon cycle in the models to react to changes in climate and atmospheric CO 2 by adjusting their carbon fluxes to the new climate conditions and providing the atmospheric CO 2 concentration as an output (Friedlingstein et al., 2014).In order to facilitate the comparison between the https://doi.org/10.5194/bg-17-6115-2020 Biogeosciences, 17, 6115-6144, 2020  several models have three or more ensemble members.We consider all of them in Fig. 3 for the time series to show the models' intrinsic variability but then proceed with the analysis with only the first ensemble member for each model.The different initial value ensemble members perform similarly to each other for the analysis presented in this paper, and using an ensemble mean would reduce the interannual variability found in each individual member.

Sampling of observations and models
For comparison of model and satellite data, first the CO 2 data of the models were converted to XCO 2 data.The model data were interpolated to the grid of the satellite dataset using a bilinear interpolation scheme and grid cells with missing values in the satellite data were also set to missing values in the model fields.Further sampling considerations are discussed in more detail in Sect.5.3.2 and in Appendix B.
Most analysis is carried out with regional averages covering several grid cells.Unless specifically stated otherwise, these are calculated by taking the arithmetic averages over all grid cells weighted by their area for each month.

Calculation of growth rate, seasonal cycle amplitude and growing season temperature anomaly
We compute the GR following the method described in Buchwitz et al. (2018).Monthly resolved annual growth rates are calculated by subtracting the XCO 2 value 6 months in the future from the one 6 months in the past.Then these monthly resolved growth rates are averaged to a yearly GR for a calendar year, and any year with less than 7 months of data is flagged as missing.The addition of the 7-month data availability was introduced to be consistent with the constraint on SCA as explained below.
We define the SCA as the peak-to-trough amplitude in a calendar year of the detrended time series.The time series is detrended with the cumulative sum of monthly growth rates, using the annual mean growth rates as substitution for missing values where necessary.The SCA is calculated by subtracting the minimum from the maximum value for each year with a minimum data availability of 7 months.When investigating the seasonal cycle of observationally sampled simulations at higher latitudes, the maximum value of the time series was generally only accounted for if more than 7 months of data were available.We therefore introduce the cutoff of 7-month data availability to preserve as many peaks as possible without restricting the data too much.However, as peak preservation cannot be guaranteed when any missing values are present, we can only speak of an effective SCA.The absolute SCA is not as important in our comparison, because we use the same sampling for both the model and observations.
The growing season temperature anomaly T is calculated from the NASA Goddard Institute for Space Studies (GISS) Surface Temperature Analysis (GISTEMP) version 4 (Hansen et al., 2010) temperature anomaly map following Schneising et al. (2014).The data are masked to include only vegetated areas, using the MODIS land cover classification (Friedl et al., 2010a).Surface temperature anomalies are calculated with respect to their monthly climatologies.The data are averaged over the growing season if they cover only one hemisphere (April-September for the Northern Hemisphere; December to May for the Southern Hemisphere).Additionally, if the data cover both hemispheres, the whole year is taken into account.The growing season averages are taken because the temperature has a large influence on the plant growth and the resulting biospheric CO 2 fluxes, which in turn drive both the SCA and interannual variability of the GR (Schneising et al., 2014).
Since its first release in 2016 (Eyring et al., 2016b), the ESMValTool has been further advanced, facilitating analysis of many different ESM components, providing welldocumented source code and scientific background of implemented diagnostics and metrics, and allowing for traceability and reproducibility of results (provenance).ESMVal-Tool v2.0 has been developed as a large community effort to specifically target the increased data volume of CMIP6 and the related challenges posed by analysis and evaluation of output from multiple high-resolution and complex ESMs.For this, the core functionalities have been completely rewritten in order to take advantage of state-of-the-art computational libraries and methods to allow for efficient and userfriendly data processing (Righi et al., 2020).Common operations on the input data such as regridding or computation of multi-model statistics are now centralized in a highly optimized preprocessor written in Python.ESMValTool v2.0 includes an extended set of large-scale diagnostics for quasioperational and comprehensive evaluation of ESMs (Eyring et al., 2020), new diagnostics for extreme events, regional model and impact evaluation and analysis of ESMs (Weigel et al., 2020), as well as diagnostics for emergent constraints and analysis of future projections from ESMs (Lauer et al., 2020).For the study here, a new ESMValTool recipe has been developed that can be used to reproduce all plots of this paper.
https://doi.org/10.5194/bg-17-6115-2020Biogeosciences, 17, 6115-6144, 2020   Until recent years, most model-observation comparisons have been carried out using in situ surface CO 2 data (e.g., Wenzel et al., 2016).As such, it is interesting to compare the differences between XCO 2 and surface CO 2 at different locations.Figure 2 shows a comparison of time series between both kinds of observations and the multi-model mean for both XCO 2 and surface CO 2 for CMIP6 (top) and CMIP5 (bottom) models.The multi-model mean for both XCO 2 and surface CO 2 is offset to have the same mean value as the satellite data, and this offset is noted above each time series panel.It is interesting to note that the offset appears to be larger at higher latitudes, thus showing a different latitudinal gradient between the models and the satellite data, indicating potential issues with surface fluxes or transport in the models.The multi-model mean and satellite data are averaged between all grid cells covering a 5 • × 5 • radius around the stations, which results in a maximum of four grid cells to be considered.The mean and growth rate of XCO 2 and surface CO 2 are in very good agreement, while the multimodel mean overestimates both variables at all sites, with the overestimated mean XCO 2 arising from the effect of higher growth rates over time.Furthermore, the offset from the modeled surface CO 2 is higher than that of XCO 2 , while this difference is smaller for CMIP5.This might be due to the fact that the CMIP5 offset for multi-model mean XCO 2 was larger overall with approximately 10 ppmv compared to the CMIP6 offset of approximately 2 ppmv.
SCA is higher at higher latitudes and also generally higher at the surface compared to the column average.This is to be expected as the processes dominating the seasonal cycle, respiration and photosynthesis, take place at the surface, leading to the higher SCA for station data and surface CO 2 from models compared to the XCO 2 values.Mixing of air coming from lower latitudes with lower SCA dampens the SCA in the column compared to surface SCA.This is evident in the increasing SCA difference between XCO 2 and surface CO 2 going from low-latitude to high-latitude stations, with no discernible seasonal cycle in the Southern Hemisphere due to the lack of land and vegetation.The multi-model mean follows this trend in the observations, although it underestimates the higher-latitude SCA, with a larger underestimation at the surface while capturing the XCO 2 SCA relatively well.As this study aims at evaluating model simulations with satellite data, further analysis is restricted to XCO 2 .
5 Evaluation of CMIP simulations with satellite data

XCO 2 time series
The globally averaged time series of XCO 2 is shown in Fig. 3 in the top panel, with CMIP6 (a) and CMIP5 (b) models sampled as the satellite observations (see Sect. 3.1).The obser-vational uncertainty is too small to be seen in this plot and is therefore neglected.The middle panel shows the monthly resolved annual growth rate and the bottom panel the detrended seasonal cycle.All available ensemble members for CMIP6 models are used to show their internal variability.All ensemble members perform similar to one another.The multimodel mean is computed using the first ensemble member, which is also used in the further analysis.As in Fig. 2, an increase of XCO 2 with time and a pronounced seasonal cycle for all models can be seen.The focus here is on the absolute values, as the trend (GR) and SCA are discussed in dedicated sections below.The CMIP6 models display a large range of absolute XCO 2 values, ranging from an underestimation by 15 ppmv (MRI-ESM2.0and MIROC-ES2L) to an overestimation by 20 ppmv (GFDL-ESM4).The model closest to the observations is CNRM-ESM2-1, which reproduces the mean value well, with the next closest models being NorESM2-LM and MPI-ESM1.2-LR,both overestimating XCO 2 by about 5 ppmv.The multi-model mean shows an overestimation by approximately 2 ppmv or equivalently a time shift of 1 year.While the spread in the models has not decreased much since CMIP5, the overestimation of the multi-model mean has decreased from 10 to 2 ppmv.Furthermore, CMIP6 models which have predecessors in CMIP5 show similar biases as their predecessors, besides the MIROC models, which overestimated the mean by 15 ppmv in CMIP5 and underestimates it by that much in CMIP6.Both MRI models underestimate XCO 2 significantly, while GFDL-ESM4 overestimates the atmospheric content even more.The MRI-ESM1 model was the only model in CMIP5 to underestimate XCO 2 with respect to the observations, and this was by about 20 ppmv.This model underestimates the historical warming, causing plant and soil respiration to be too low, which leads to a larger land sink and a reduced atmospheric CO 2 concentration (Adachi et al., 2013).This underestimation has been reduced by about 5 ppmv in CMIP6.The GFDL models show an overestimation of about 15 ppmv in both ensembles, and both CanESM models are 10 ppmv too high.A minor improvement can be seen for NorESM-LM over NorESM1-ME, with a decrease of the overestimation from 15 to 10 ppmv.

Growth rate
The middle panel of Fig. 3 shows that while models capture the interannual variability of the growth rate quite well, they overestimate the mean growth rate compared to the observations.The correlation coefficient for the multi-model mean is at 0.48 in CMIP6 and 0.07 in CMIP5 which shows a large improvement in this area.The pronounced feature in 2009 is due to the introduction of the GOSAT data which changed the shape of the seasonal cycle and thus due to its calculation the monthly resolved annual growth rate.Fortunately, this feature gets averaged out when computing the annual growth rate and does not tangibly affect Biogeosciences, 17, 6115-6144, 2020 https://doi.org/10.5194/bg-17-6115-2020 Figure 2. Comparison of time series from satellite XCO 2 (black), multi-model mean XCO 2 (orange) and surface CO 2 (red), and NOAA surface CO 2 station data (blue) at selected sites, with the coordinates noted in brackets above the time series and the altitudes shown in the map plot (see Table 1).The multi-model mean for both XCO 2 and surface CO 2 was offset to have the same average value as the satellite XCO 2 for better comparison, and this offset is noted above each time series.CMIP6 and CMIP5 multi-model means are shown in the top (a) and bottom (b) panels, respectively.our conclusions.Figure 4 shows the global mean GR of XCO 2 for 2003-2014 and its standard deviation over all years depicted as error bars, with the observations shown in black and the multi-model mean in red.The annual mean GR of the satellite data is 1.9 ± 0.4 ppmv yr −1 , while the CMIP5 models (right) range from 1.5 ± 0.4 (MRI-ESM1) to 3.0 ± 0.9 ppmv yr −1 (CanESM2) with a multi-model mean of 2.4 ± 0.4 ppmv yr −1 .In CMIP6 (left), the multi-model mean is slightly lower at 2.3 ± 0.3 ppmv yr −1 , and the spread has decreased by 0.6 ppmv yr −1 , with a range from 1.7 ± 0.4 (MRI-ESM2.0)to 2.6 ± 0.7 ppmv yr −1 (GFDL-ESM4).As expected from Fig. 3, the models -with the exception of MRI-ESM1, MRI-ESM2.0and MIROC-ES2L -overestimate the growth rate, leading to an increased XCO 2 level in the present-day atmosphere compared to observations.The interannual variability of the growth rate for the models is generally higher than that of the observations but well reproduced in the multi-model mean.
Emergent constraints are relationships defined using an ensemble of models, between a measurable aspect of current or past climate and an aspect of Earth system feedback in the future, which can be constrained using observational data (Eyring et al., 2019).In Appendix C, we have tried to reproduce the trend in interannual variability (IAV) of CO 2 growth rate to IAV of tropical temperature used by Cox et al. (2013) to develop an emergent constraint on the sensitivity of tropical land carbon to climate change but were unable to find a https://doi.org/10.5194/bg-17-6115-2020 Biogeosciences, 17, 6115-6144, 2020  significant trend for this much shorter satellite-derived time series.
The spatial variability of the GR is small, as CO 2 is long lived and well mixed in the atmosphere with a 1-year mean interhemispheric crossing time.Thus, there should be no significant regional changes on an annual level.Buchwitz et al. (2018) found the growth rate of the satellite dataset to be in agreement with those quoted by NOAA ESRL's global and Mauna Loa time series, as well as robust over several latitude bands.Our own analysis also shows only very small regional differences in the growth rate (not shown).No significant changes to the annual growth rates due to the satellite spatial coverage were found.

Seasonal cycle amplitude
This section about the SCA is divided into two subsections, with the first one taking a closer look at inter-model differences, while the second subsection is devoted to the impact of observational sampling.

Model differences
The lower panel in Fig. 3 shows the detrended global seasonal cycle for all models.Models in CMIP6 (a) show a strong improvement in their ability to capture both the seasonal cycle amplitude, as well as its phase compared to CMIP5 (b) but still underestimate the SCA.The correlation coefficient to the observed seasonal cycle is 0.93 in CMIP5 and 0.98 in CMIP6 for the multi-model mean.The only model in CMIP6 to significantly underestimate the seasonal cycle amplitude is CNRM-ESM2-1.Two errors have been identified causing this dampened seasonal cycle: ocean carbon fluxes are lagged in time, and in the emission-driven simulations, CO 2 is considered as an active tracer and coupled with atmospheric chemistry.These chemical fields are restored to global mean climatological concentrations at the model surface, acting as a damping component to the CO 2 concentrations (Séférian et al., 2019).Figure 5 shows maps of the climatological mean SCA (2003)(2004)(2005)(2006)(2007)(2008)(2009)(2010)(2011)(2012)(2013)(2014) for all models, with the global mean given in the top right and the zonal averages shown in the panel attached to the right of the maps.All CMIP6 models (Fig. 5a) underestimate the SCA compared to satellite observations (Fig. 6, middle) in the global mean, with the closest mean SCA being MIROC-ES2L.This underestimation was already present in CMIP5 (Fig. 5b), with several studies discussing it for surface CO 2 SCA (Wenzel et al., 2016;Graven et al., 2013).In CMIP6, the multi-model mean has a globally averaged mean SCA of 3.25 ppmv, compared to 2.92 ppmv for CMIP5, while the observations show an effective SCA of 5.89 ppmv (Fig. 6).
Both models and observations show the well-known increasing SCA with increasing latitude, due to the more pronounced seasonal cycle of the climate at higher latitudes.Most models show a decreased growth from 0 to 30 • N, with higher SCA increases in the lower tropics and northern midlatitudes.Overall, the zonal distribution is quite similar throughout the models, with UK-ESM1-0-LL showing increased SCA at 30-90 • S. Tropical land regions in northern South America, Africa and southeast Asia show increased SCA values compared to the ocean SCA at this latitude for the same model.While in the GFDL CMIP5 models this was so pronounced that these regions showed a higher SCA than https://doi.org/10.5194/bg-17-6115-2020 Biogeosciences, 17, 6115-6144, 2020 the higher latitudes (Dunne et al., 2013), this is no longer the case for GFDL-ESM4 in CMIP6.Dunne et al. (2013) attributed the GFDL problem in CMIP5 to the seasonality of respiration in the northern latitudes and an Amazonian lowprecipitation bias which reversed the seasonal cycle synchronizing it with the African and Oceanian rain forests.The improvement in CMIP6 is due to a reduced ocean-atmosphere CO 2 flux in the Southern Hemisphere, as well as the reduction of the high tropical land-atmosphere fluxes expressed over the ocean (Dunne et al., 2020).
The SCA in the CMIP5 MPI-ESM-LR model is on average twice as large as the observed one.The high SCA has been discussed in Giorgetta et al. (2013), where it was attributed to a combination of an overestimation of net primary productivity in ocean and land biology and uncertainties in atmospheric tracer transport.A particularly severe overestimation was seen in the Southern Hemisphere when comparing to station data.As shown in Fig. 6, we additionally find a large overestimation in XCO 2 SCA in the Northern Hemisphere, in particular in the extratropics.For the CMIP6 successor model, MPI-ESM1.2-LR,the SCA is still the highest of the model ensemble but is no longer twice as high as the other models.However, it now shows a more pronounced SCA over the tropical land regions mentioned above, which was not as dominant in CMIP5.
It is known that nitrogen limitations tend to suppress CO 2 fertilization (Reich et al., 2006)

Influence of sampling
There are a number of ways to compare model SCA to observational SCA, beginning with a grid box comparison.
Figure 6 shows a comparison of the multi-model mean of CMIP6 (Fig. 6a) and CMIP5 (Fig. 6b) to observations.The top right shows the unsampled SCA.The top left panel shows the effective SCA when using observational sampling and the middle panel the satellite data's effective SCA.All numbers are given in ppmv.For an easier comparison, the bottom panels show the absolute difference plots, with the left panel depicting the difference between sampled model and observations, and the right panel the difference between the sampled and unsampled model.Observational sampling slightly lowers the SCA, which is to be expected, as it could lead to masking out the peaks or troughs.While this effect was minimized by imposing the restriction of only computing the SCA of a year when at least 7 months of data are available, it is still a possibility.We therefore classify this SCA as an "effective SCA".However, the SCA does not seem to decrease significantly through sampling and the difference does not follow a trend in latitude, so a grid box comparison seems feasible.This paves the way for more comprehensive spatial investigations, which previously relied on data from ground-based stations with sparse spatial coverage.While the stations provide data in higher latitudes that the satellite dataset does not cover, in the tropics and midlatitudes the spatial coverage of the satellite is superior to the ground-based stations.A downside with this approach is the sparsity of the data when using observational sampling.Furthermore, this becomes a com-putationally expensive operation, as the SCA will need to be calculated for each grid box.
Another approach often used in model analysis is area averaging, e.g., over different latitude bands like the tropics or the northern midlatitudes.Using surface flask measurements, Wenzel et al. (2016) found an increased SCA with rising CO 2 concentration for CMIP5 using model data from the full historical simulation (1850-2005) -CO 2 fertilization -and used this to establish an emergent constraint on the fertilization of terrestrial gross primary productivity (GPP).trend in the unsampled models, while the right one shows the trend when following the sampling of the satellite data.The SCA was computed after a weighted area average was determined on the XCO 2 time series.While some of the unsampled models show an increasing SCA trend with increasing XCO 2 , which is in agreement with the findings from Wenzel et al. ( 2016), others do not show a statistically significant trend and the multi-model mean shows an insignificant positive trend.The sampled model data (right) show a significant negative trend.Calculating the average with a zonal average before summing up the latitude bands does not change this result.
To investigate this change in trend due to observational sampling, Fig. 8 shows the data coverage for different time periods, 2003-2008 for SCIAMACHY only measurements (top), the overlap between the two satellites in 2009-2012 (middle) and 2013-2014 for the GOSAT satellite only (bottom), with the pattern marking areas with a coverage of 50 % or above.Above 50 • N, SCIAMACHY measurements include more areas with 50 % or more coverage compared to GOSAT measurements.With a larger SCA in higher latitudes, it implies that SCIAMACHY would have a larger average SCA over this region compared to GOSAT, hence artificially generating a decreasing trend in the observed SCA, when moving from SCIAMACHY to GOSAT. Figure 9 shows the CMIP6 effective SCA trend with XCO 2 using SCIAMACHY and GOSAT masks obtained from masking out points with less than 50 % coverage.While the slopes remain largely the same, the mean effective SCA is higher in the models using the SCIAMACHY mask than when using the GOSAT mask.This mean SCA difference is larger than the total SCA difference within a model using the same sampling for the whole time period.Thus, when considering the observational time series and its full sampling, the trend intrinsic to the model is dominated by the negative SCA difference going from the SCIAMACHY to the GOSAT data coverage and thus changed to the negative trend seen in the observations.We can therefore attribute at least part of the negative trend in the satellite data to the different data coverage of the two satellites.We are able to reproduce this negative trend with the models, when these are sampled consistently with the satellite data.This study on sampling also holds true for CMIP5 models, with the equivalent figures shown in Appendix B (Figs. B1 and B2).
Further impacts on CO 2 concentrations could come through temporal sampling, such as the fact that the satellite data only include measurements with low cloud cover and are limited to 13:00 LT.While cloud cover can impact photosynthesis, the response can be fundamentally different for various ecosystems (Still et al., 2009); we expect a larger effect from the diurnal cycle in CO 2 which is included in the model means but not the satellite data.Due to a lack of CO 2 data from models with a higher temporal resolution, this effect cannot be estimated in this study.
The XCO 2 time series comparison shows that most models overestimate the carbon content of the atmosphere relative to the satellite observations in both model ensembles, with a lower overestimation for the CMIP6 models of 2 ppmv for the multi-model mean and a wide range of individual model differences of −15 to +20 ppmv.The CMIP5 models overestimate by 5 to 25 ppmv with the exception of the MRI-ESM1 model, which underestimates by 20 ppmv.The CMIP5 multi-model mean overestimates by 10 ppmv compared to the observations, which has also previously been found for surface comparisons (Friedlingstein et al., 2014;Hoffman et al., 2014).Overall, CMIP6 models follow the same trends as their CMIP5 counterparts but with reduced systematic biases.
All models capture the expected increase of the SCA with increasing latitudes, but most models underestimate the SCA to differing degrees in different regions.This result is in line with previous studies (Wenzel et al., 2016;Graven et al., 2013).Models with similar model components show similar behavior, with models including a nitrogen cycle generhttps://doi.org/10.5194/bg-17-6115-2020 Biogeosciences, 17, 6115-6144, 2020 ally showing a lower SCA in CMIP5, but this influence is not clear in CMIP6.Finally, the connection between SCA and XCO 2 was investigated in the northern midlatitudes.Most models from both ensembles show a positive trend, i.e., an increase of the SCA with XCO 2 , consistent with findings for surface CO 2 (Wenzel et al., 2016).However, the satellite product shows a strong negative trend in contrast to the models and surface based observations.We have attributed this trend reversal to the sampling characteristics of the satellite products.The average effective SCA is higher for models sampled according to the SCIAMACHY/Envisat as opposed to the TANSO-FTS/GOSAT mean data coverage.As the early time series is based solely on the SCIA-MACHY/Envisat data and the last years only use data from TANSO-FTS/GOSAT, this introduces an artificial negative trend which dominates the positive trend shown by the unsampled models.This demonstrates the importance of equal sampling of models and observations in model evaluation studies.
There are several ways to improve on this analysis in the future.With more available future scenario simulations, the analysis can be extended to a longer time series, making use of longer observational time series, such as the one introduced in Reuter et al. (2020).Higher temporal resolution of the models would enable studies on the effect of the diurnal cycle of CO 2 on the monthly mean and also allow for the construction of a co-located time series with the level-2 satellite data.This could help highlight some of the causes of model biases by being able to pinpoint time and space where they occur more precisely.Model biases may also result from the CMIP experimental design, such as requiring the climate state to be in equilibrium in 1850 while the real world may not have been (Bronselaer et al., 2017), or the parameterizations of biological and physical processes not allowing the system to change rapidly enough (Hoffman et al., 2014).Along with a longer time series, newer satellites, such as OCO-2 or the planned Sentinel 7 bring higher resolutions and more data, potentially helping to fill the gaps and reduce the impact of the sampling we discussed in Sect.5.3.2.
Overall, the CMIP6 ensemble shows improved agreement with the satellite data in all considered quantities (mean XCO 2 , growth rate, SCA and trend in SCA), with the biggest improvement shown in the mean XCO 2 content of the atmosphere.The paper demonstrates the great potential of satellite data for climate model evaluation as this allows us to go beyond regional means or single point observations from in situ data and also enables the investigation of regional effects on SCA, such as the increase in SCA at higher latitudes.
Biogeosciences, 17, 6115-6144, 2020 https://doi.org/10.5194/bg-17-6115-20202013) developed an emergent constraint on the sensitivity of tropical land carbon to climate change using the sensitivity of the IAV of CO 2 growth rate to the IAV of tropical temperature, which was later adapted by Wenzel et al. (2014) to CMIP5 models.Figure C1 shows the sensitivity of the IAV of XCO 2 growth rate to the tropical growing season temperature IAV for CMIP6 (left) and CMIP5 (right) models, both compared with observations.The observational temperature is taken from the GISTEMP v4 dataset (Lenssen et al., 2019) and the models use their own modeled temperature.We find an observational value of −0.23 ± 0.70 ppmv yr −1 K −1 for the 2003-2014 period.However, when using the full span of the satellite data until 2016, the slope increases to 0.75 ± 0.6 ppmv yr −1 K −1 (not shown), as the additional years show both a high growing season temperature and GR IAV, coinciding with a strong El Niño.This shows that the time period 2003-2014 is not sufficiently long to reproduce the emergent constraint, although this may become feasible once CMIP6 emission-driven future simulations are available for a longer time overlap between models and observations.CMIP5 model values for the timeframe 2003-2014 range from 0.53 ± 0.51 (NorESM1-ME) to 3.14 ± 0.63 ppmv yr −1 K −1 (MRI-ESM1), with only CESM1-BGC showing a negative trend of −0.64 ± 0.55 ppmv yr −1 K −1 .The multi-model mean has a value of 1.79 ± 0.80 ppmv yr −1 K −1 .In CMIP6, the range is significantly decreased with a minimum of 1.14 ± 0.56 ppmv yr −1 K −1 (ACCESS-ESM1.5)to a maximum of 3.37 ± 0.71 ppmv yr −1 K −1 (CanESM5-CanOE) and a multi-model mean of 1.14 ± 0.37 ppmv yr −1 K −1 .3. ESMValTool v2.0 (Eyring et al., 2020;Lauer et al., 2020;Righi et al., 2020) is released under the Apache License, VERSION 2.0.The latest release of ESMValTool v2 is publicly available on Zenodo at https://doi.org/10.5281/zenodo.3401363(Andela et al., 2020a).The source code of the ESMValCore package, which is installed as a dependency of the ESMValTool v2, is also publicly available on Zenodo at https://doi.org/10.5281/zenodo.3387139(Andela et al., 2020b).ESMValTool and ESMValCore are developed on the GitHub repositories available at https://github.com/ESMValGroup,last access: 30 November 2020.The corresponding recipe that can be used to reproduce the figures of this paper will be included in ESMValTool v2 at the time of publication of the paper.
Author contributions.BG led the writing and analysis of the paper.MB and MR provided the satellite dataset.VE, PC and PF contributed to the evaluation of the CMIP simulations.All authors contributed to the writing of the manuscript.
Competing interests.The authors declare that they have no conflict of interest.
Acknowledgements.This work has been supported by ESA Climate Change Initiative (CCI) Climate Modelling User Group (CMUG) project, the Horizon 2020 project Climate-Carbon Interactions in the Current Century (4C) funded by the EU (grant agreement no.821003) and the EVal4CMIP project funded by the Helmholtz Society.The generation of the satellite XCO 2 data product has received funding from ESA (GHG-CCI) and the EU (Copernicus Climate Change Service -C3S, led by ECMWF).We acknowledge the World Climate Research Programme (WCRP), which, through its Working Group on Coupled Modelling, coordinated and promoted CMIP6.We thank the climate modeling groups (listed in Tables 2 and 3 of this paper) for producing and making available their model output, the Earth System Grid Federation (ESGF) for archiving the data and providing access and the multiple funding agencies who support CMIP and ESGF.We also thank both anonymous reviewers for their helpful comments on this paper and the handling editor, Trevor Keenan, for taking on this paper.
Financial support.This research has been supported by Horizon 2020 (grant no.4C(821003)), the European Space Agency (ESA Climate Change Initiative Climate Model User Group, ESA CCICMUG) and the Helmholtz Association (Advanced Earth SystemModel Evaluation for CMIP, EVal4CMIP).
The article processing charges for this open-access publication were covered by the University of Bremen.
Review statement.This paper was edited by Trevor Keenan and reviewed by two anonymous referees.

Figure 1 .
Figure 1.Mean fractional coverage of monthly satellite data for 2003-2014.A value of 0 (white) signifies no available data, while a value of 1 (dark green) means that this grid cell contains data for all years of this month.

Figure 3 .
Figure 3. (a) Global time series of monthly mean column-averaged carbon dioxide (XCO 2 ) from 2003 to 2014 for the emission-driven CMIP6 model simulations in comparison to satellite XCO 2 data (bold black line).The model output is sampled as the satellite data.The top panels show the time series, while the middle panels show the computed monthly growth rate, which has been used to detrend the data to obtain the seasonal cycle shown in the bottom panel.All available ensemble members for each model are shown to demonstrate the intrinsic variability of the models.(b) Same as Fig. 3a but for CMIP5.Only one ensemble member is shown.

Figure 4 .
Figure 4. Global mean and standard deviation over all years of annual growth rate of XCO 2 during 2003-2014 for CMIP6 models (a) and CMIP5 models (b).The black bar represents the satellite observations, while the red bar depicts the multi-model mean.

Figure 5 .
Figure 5. Maps of mean annual seasonal cycle amplitude for 2003-2014 for the CMIP6 (a) and CMIP5 (b) models.The model name is given in the top left of each panel, and the top right shows the global average of the mean annual seasonal cycle.The panel to the right of the maps shows the same zonal mean SCA.
Figure 6.(b) Same as Fig. 6a but for the CMIP5 multi-model mean.

Figure 7 .
Figure 7. Trend of SCA with XCO 2 for 2003-2014 for the northern midlatitudes (30-60 • N), including a linear regression with slope and mean SCA given in the top left of each panel and the Pearson correlation coefficient as well as the p value in the bottom right.Symbols denote the different years and model colors are consistent with previous figures.The left panels (a) show unsampled CMIP6 models, while CMIP6 models sampled according to the satellite data are shown on the right (b).Note that the y-axis range for each plot is the same and only differs by a shift.

Figure 8 .
Figure 8. Data coverage of the satellite observations for (a) 2003-2008 containing only SCIAMACHY data, (b) 2009-2012 containing the overlap of SCIAMACHY and GOSAT data and (c) 2013-2014 containing only GOSAT data.The patterned area highlights values above 0.5.

Figure 9 .
Figure 9. Same as Fig. 7 but with CMIP6 models masked using (a) the SCIAMACHY mask and (b) the GOSAT mask, with the masks derived from Fig. 8, masking out points with less than 50 % coverage in those time periods.

Figure B2 .
Figure B2.Same as Fig. B1 but with CMIP5 models masked using (a) the SCIAMACHY mask and (b) the GOSAT mask, with the masks derived from Fig. 8, masking out points with less than 50 % coverage in those time periods.

Figure C1 .
Figure C1.Sensitivity of the interannual variability of the XCO 2 growth rate in the tropics (30 • S-30 • N) to the interannual variability of tropical growing season temperature for CMIP6 models (a) and CMIP5 models (b).The observational temperature taken from the GISTEMP temperature anomaly map, while the models use their own simulated temperature.A linear regression is performed on the data for each dataset.Model colors are the same as in Fig. 3, and symbols denote the years.In the top left of each panel, the regression coefficient and its uncertainty is shown, while the bottom right states the Pearson correlation coefficient and p value.

Table 1 .
List of active NOAA surface flask measurement sites used in this study.

Table 2 .
CMIP5 models analyzed in this study.D stands for models including dynamic vegetation, and N stands for models including nitrogen cycles.

Table 3 .
CMIP6 models analyzed in this study.D stands for models including dynamic vegetation, and N stands for models including nitrogen cycles.