Emergent relationships with respect to burned area in global satellite observations and ﬁre-enabled vegetation models

. Recent climate changes have increased ﬁre-prone

etation effects on fire are a main deficiency regarding fireenabled dynamic global vegetation models' ability to accurately simulate the role of fire under global environmental change.

Introduction
About 3 % of the global land area burns every year (Chuvieco et al., 2016;Giglio et al., 2013;Randerson et al., 2012).Fire represents a strong control on large-scale vegetation patterns and structure (Bond et al., 2004) and can significantly accelerate the impacts of changing climate or land management on global ecosystems (Aragão et al., 2018;Beck et al., 2011).Fire directly affects global and regional climate through changing surface albedo (López-Saldaña et al., 2015;Randerson et al., 2006), atmospheric trace gas, and aerosol concentrations (Andreae and Merlet, 2001;Ward et al., 2012), and on longer timescales by affecting vegetation composition and structure with subsequent impacts on the carbon cycle and hydrology (Li and Lawrence, 2016;Pausas and Dantas, 2017;Tepley et al., 2018;Thonicke et al., 2001).
Climate influences several aspects of the fire regime, including the seasonal timing of lightning ignitions (Veraverbeke et al., 2017), temperature and moisture controls on fuel drying, and wind-driven fire spread (Jolly et al., 2015).Climate also influences the nature and availability of fuel, through its impact on vegetation productivity and structure (Harrison et al., 2010).Vegetation structure, in turn, influences the patterns of available fuel and moisture that directly determine fire spread, severity, and extent (Krawchuk and Moritz, 2011;Pausas and Ribeiro, 2013).People set and suppress fires and use them to manage agricultural and natural ecosystems, for land use change and deforestation practices (Andela and van der Werf, 2014;van Marle et al., 2017).Human-induced modifications and fragmentation of natural vegetation through agricultural expansion and urbanization limit fire spread (Bowman et al., 2011).Thus, climate, vegetation, and human controls on fire are multivariate and have strong interactions with one another (Bowman et al., 2009;Harrison et al., 2010;Krawchuk et al., 2009).Empirical analyses of fire regimes using machine learning algorithms have identified the most important variables and their sensitivities for fire occurrence and spread (Aldersley et al., 2011;Archibald et al., 2009;Bistinas et al., 2014;Forkel et al., 2017;Krawchuk et al., 2009;Moritz et al., 2012).However, because of the difficulty of factoring out interactions between predictor variables, such sensitivities represent emergent relationships rather than specific physical controls on fire.Thus, it has proved difficult to disentangle the role of changes in any single factor on the trajectory of changes in fire regimes.For example, changes in climate result in weather conditions that are increasingly favourable for fire and fire activity in some temperate regions (Holden et al., 2018;Jolly et al., 2015;Müller et al., 2015); however, it has been suggested that changes in land use compensate for climate effects and result, for example, in declining burned areas in African savannahs (Andela and van der Werf, 2014).Hence, there is still uncertainty regarding factors such as the cause of the recent observed decline in global burned area (Andela et al., 2017).Furthermore, there is even greater uncertainty about the potential trajectory of changes in fire regimes in the future (Settele et al., 2014).
Fire-enabled dynamic global vegetation models (DGVMs) or Earth system models are process-oriented tools used to predict the consequences of future climate change on fire regimes and associated feedbacks (Hantson et al., 2016).Our faith in these projections is contingent on the ability of these models to capture features of the current situation.State-ofthe-art fire-enabled DGVMs partly capture the spatial patterns of burned area (Andela et al., 2017;Kelley et al., 2013); however, doubt has been cast on the ability of these models to capture the response to extreme events and recent trends in burned area (Andela et al., 2017).This suggests that they inaccurately represent the response of fire to combined changes in climate, vegetation, and socio-economic drivers.
Here we aim to test how fire-enabled DGVMs reproduce emergent relationships with the drivers of burned area.We apply a machine learning algorithm to the output from seven fire-enabled DGVMs and a suite of satellite and other observation-based data sets in order to derive emergent relationships between a number of potential drivers of burned area.By comparing the model-and data-derived emergent relationships, we assess the degree to which DGVMs reproduce these relationships.While we make no assumption about the actual physical controls on burned area, this comparison allows us to pinpoint relationships between drivers and burned area that are unrealistically represented in fireenabled DGVMs.

Method summary
In order to infer relationships between potential drivers of fire in satellite data and fire-enabled DGVMs, we applied the random forest (RF) machine-learning algorithm to predict monthly burned area (response variable) from climate, vegetation, and socio-economic predictor variables (Fig. 1).Predictor variables and burned area were either taken from satellite and other observation-based data sets or from simulations by a suite of fire-enabled DGVMs from the Fire Modeling Intercomparison Project (FireMIP) (Rabin et al., 2017) to derive relationships for data sets and models, respectively.
The RF algorithm is a regression approach that allows non-linear, non-monotonic, and non-additive relations between multiple predictor variables and the target variable.trees that are built based on the training data set (Breiman, 2001;Cutler et al., 2012).We built three sets of RF models.We first built RF models for satellite-observed burned area based on a multitude of predictor variables to derive relationships from data.There are differences between the available burned area data sets; therefore, we used five recent and/or well-established data sets to encompass these uncertainties.The fire-enabled DGVMs do not use some of the predictor variables in the satellite-derived RF.Hence, we built a second set of satellite-derived RF models with a reduced set of predictor variables.The third set of RF models was derived for each FireMIP model using the simulated burned area as the target variable and simulations of gross primary production (averaged over precedent months), biomass and land cover predictor variables, and the population density and climate predictor variables that were used as inputs for the models (according to the FireMIP protocol).

RF averages predicted values across an ensemble of decision
From each RF model we then derived the importance (Sect.2.7), relationships, and sensitivity (Sect.2.8) of each predictor variable to burned area.Relationships and sen-sitivities were derived by computing the individual conditional expectation (ICE) and the partial dependencies curves (Goldstein et al., 2013).These dependencies represent the emergent relationships of burned area to drivers in the observation-or model-variable space.We then compared the data-and model-derived emergent relationships and sensitivities both globally and on a per grid cell basis.

Burned area from satellite data sets
There are several global burned area data sets; both the spatial patterns and temporal dynamics differ between these models (Hantson et al., 2016;Humber et al., 2018) as they use different satellite sensors and retrieval algorithms, and have different sensitivities to small fires (Chuvieco et al., 2016;Giglio et al., 2013;Randerson et al., 2012).We used the variability between five global data sets (Table 1) as an estimate of uncertainty.However, by doing so we might still have underestimated the real uncertainty in burned area observations because all data sets rely on active fire detec-  (Giglio et al., 2009).Giglio et al. (2013) 1995-2015

GFED4s
GFED4s 0.25 tions (thermal anomaly) and reflectance changes from the same sensor (MODIS).As an exception, CCI_MERIS uses MERIS reflectances combined with MODIS active fires.We restricted our analysis to burned area data with high observational quality.Observational quality indicates the degree to which missing input satellite imagery or contamina-tion from clouds, smoke, snow, and shadows limit burned area detection.Especially MERIS land observations are subject to substantial gaps in raw data acquisitions (Tum et al., 2016).Low observational coverage can result in a strongly underestimated burned area.Here, we used the CCI_MERIS "observed area fraction" layer as a time-variant mask to all burned area data sets and only included estimates for months with observational coverage higher than 80 %.We also excluded burned area in months with < 0 • C to remove suspicious small burned areas in polar regions or in winter months which are likely caused by insufficiently corrected gas flares and other industrial activities.Analyses were made with monthly burned area observations for the period from 2005 to 2011, which is the common period between the five data sets.

Burned area from FireMIP models
A detailed description of FireMIP DGVMs and the simulation protocol is given by Rabin et al. (2017).Here we used monthly burned area from seven models that made transient simulations from 1700 to 2013 (bottom half of Table 1).The models were forced using inputs of meteorological variables from the CRUNCEP V5 data set (Wei et al., 2014), monthly cloud-to-ground lightning strikes (Rabin et al., 2017), annually updated values of human population density from the HYDE 3.1 data set (Goldewijk et al., 2010), annually updated land use and land cover changes from the Hurtt et al. (2011) data set, and annually updated values of global atmospheric CO 2 (Le Quéré et al., 2014).Although forcing data sets are common across DGVMs, they do not use the same set of forcing variables, i.e. wind speed (WSPEED), or use population density (PopDens) for fire ignitions and/or fire suppression.
The model outputs were aggregated to a common spatial resolution of 1.89 • latitude×2.5 • longitude.Aggregation was undertaken by averaging the fractional burned area from all high-resolution grid cells that belong to the same coarseresolution grid cell.Nearest neighbour resampling was carried out if less than two high-resolution grid cells were within one coarse-resolution grid cell.Analyses were also undertaken for the same period as the common window of the satellite data (2005)(2006)(2007)(2008)(2009)(2010)(2011) and by applying the "observed area mask" from the satellite data.

Evaluation of data-data and model-data temporal agreement
We evaluated the temporal agreement of monthly burned area time series for 2005-2011 between the data sets and between the data sets and the fire-enabled DGVMs based on various model performance metrics (Janssen and Heuberger, 1995) on a per-grid cell basis.We selected the Spearman rank correlation coefficient to compare the temporal agreement and the fractional variance (FV) to compare the variability of burned area per grid cells: where σ ref and σ x are the variances of the reference and observed or simulated burned area, respectively.FV ranges be- In the case where a single satellite data set (e.g.x = BA.CCI_MERIS) was compared with the other satellite data sets, this data set was not used in the reference vector.This approach directly considers the differences between data sets in the computation of model performance metrics and implies that it is impossible for a FireMIP model or for one single satellite data set to reach an optimal correlation of unity or a FV of zero as long as the satellite burned area data sets show differences.We used the median of the correlation coefficient and of the FV for each grid cell to quantify the data-data or model-data agreement over the ensemble of data sets or models.As a single global agreement metric, we computed the percentage of the land area that showed "good" agreement from the spatial patterns of the Spearman correlation Cor and FV: good agreement for an individual grid cell was defined based on a positive and nonrandom relationship (i.e.Cor ≥ 0.25) and a comparable variance (−0.75 ≤ FV ≤ 0.75) between simulated and observed burned area.

Predictor variables and data sets
Several variables have been identified as predictors of global fire in previous studies, inter alia the number of dry or wet days per month (WET), diurnal temperature range (DTR), maximum temperature (TMAX), grass and shrub cover, leaf area index (LAI), net primary production (NPP), population density (PopDens), and gross domestic product (GDP) (Aldersley et al., 2011;Bistinas et al., 2014).Other variables have been found important for fire at a regional scale, including total precipitation, tree cover, forest cover type, tree height, biomass and litter fuel loads, and grazing (Archibald et al., 2009;Chuvieco et al., 2014;Parisien et al., 2010;Pettinari and Chuvieco, 2017).We created a combined set of potential variables used in these studies to predict burned area (Table A1).We used data on gross primary production (GPP) instead of NPP as GPP can be estimated from eddy covariance observations and does not require model assumptions about autotrophic respiration.

Climate data
Climate data were taken from the CRUNCEP V5 data set (Wei et al., 2014).CRUNCEP provides 6-hourly time series of precipitation, maximum and minimum temperature, and wind speed.From these time series, we derived the monthly mean of daily maximum temperature (CRUNCEP.TMAX) and minimum temperature (CRUNCEP.TMIN), the monthly mean daily diurnal temperature range (CRUNCEP.DTR = TMAX − TMIN), the monthly 90th percentile of daily wind speed (CRUNCEP.WSPEED), the monthly total precipitation (CRUNCEP.P), and the number of wet days per month (CRUNCEP.WET).A wet day was defined as a day with ≥ 0.1 mm precipitation (Harris et al., 2014).

Land cover
Land cover was taken from the ESA CCI Land cover V2.0.7 data set which provides annual land cover maps for the period from 1992 to 2015 (Li et al., 2018).Land cover classes were converted into the fractional coverage of plant functional types (PFTs).For this conversion, we used the cross-walking approach (Poulter et al., 2011(Poulter et al., , 2015) ) based on the conversion table in Forkel et al. (2017).Individual PFTs combine growth form (tree, shrubs, herbaceous vegetation, or crops) with leaf type (broadleaved or needle-leaved) and leaf longevity (evergreen or deciduous).The variable Tree.BD, for example, is the fractional coverage of broadleaved deciduous trees (Table A1).We created an additional category combining trees and shrubs (e.g.TreeS.BD = Tree.BD + Shrub.BD) because most of the FireMIP models simulate woody vegetation rather than separating shrubs and trees explicitly (Table S1 in the Supplement).JULES, LPJG-SIMF, and LPJG-SPITF dynamically simulate the fractional coverage of PFTs, but CLM, CTEM, JSBACH, and ORCHIDEE used prescribed PFT distributions.We reclassified the PFTs of each model into the same set of PFTs that we derived from the CCI land cover data set (Table S1).

Vegetation productivity
Data on gross primary production (GPP) and leaf area index (LAI) were taken to account for the seasonal effects of vegetation productivity and canopy development.GPP was taken from the FLUXCOM data set which is up-scaled from GPP estimates at FLUXNET measurement sites (Tramontana et al., 2016).We used the FLUXCOM data set that used satellite and CRUNCEP meteorological data for the upscaling.LAI was taken from MODIS (Myneni et al., 2015).GPP and LAI were averaged to monthly mean values (e.g.variable name GPP.orig).To account for seasonal fuel accumulation, we also computed previous-season GPP or LAI values as the mean over the 3 and 6 months before the month of comparison with burned area (e.g.GPP.pre3mon and GPP.pre6mon).

Biomass and fuels
We used temporally static vegetation data sets to account for the effects of vegetation biomass, fuel properties, and ecosystem structure on burned area dynamics.Total above-and below-ground vegetation biomass was obtained from Carvalhais et al. ( 2014), which is based on an above-ground forest biomass map for the tropics for the early 2000s (Saatchi et al., 2011), a total forest biomass map for temperate and bo-real forests for the year 2010 (Thurner et al., 2014), and an estimate of herbaceous biomass (Carvalhais et al., 2014).The vegetation biomass data set does not cover southern Australia or New Zealand.Although fire is common in these regions, we did not fill the global vegetation biomass map with a regional map to avoid potential artefacts in the derived sensitivities that would likely result from merging different biomass maps.From each FireMIP model, we used the simulated vegetation carbon averaged for the years from 2005 to 2011 as the equivalent to this data set.We used canopy height from Simard et al. (2011); this data set provides a snapshot of the average canopy height in 2005.Factors related to fuel properties, specifically grass height, litter depth, woody litter depth, and the amount of woody litter in different size classes were extracted from the global fuel bed database (Pettinari and Chuvieco, 2016).This database is based on a land coverbased extrapolation of regional fuel databases for the globe and provides a generic picture of the conditions around 2005.

Socio-economic data
We used the annually varying population density data set from the HYDE V3.1 database (Klein Goldewijk et al., 2011), which was utilized as a forcing data set for the FireMIP simulations.We also used annually varying gross domestic product per capita (GDP; World Bank, 2018), a static satellite-derived index of socio-economic development based on night-time lights for the year 2006 (Elvidge et al., 2012), and a data set on cattle density for the year 2007 (Wint and Robinson, 2007).

Random forest experiments and selection of predictor variables
We performed our analysis using the randomForest package V4.6-12 in R (Liaw and Wiener, 2002).We trained the RF with 500 regression trees.The training target was either a "satellite-observed" or a "model-simulated" burned area, i.e. we trained one RF against each burned area data set and each individual FireMIP model simulation, respectively.We used two sets of predictor variables in three sets of RF experiments (Table A1): -"RF.Satellite.full" for satellite-derived RF experiments: we used 23 of the 28 predictor variables to train RF models for each burned area data set.Five predictor variables were not included in the RF because they were highly correlated with others (r > 0.8, i.e. nightlight development index, cattle density, woody litter for the 10 h fuel size class, precedent 3-monthly GPP, and precedent 3-monthly LAI; Fig. S1 in the Supplement).The purpose of these experiments was to identify the relationship between burned area and each predictor variable from data sets.
-"RF.Satellite.fm" for satellite-derived RF experiments: these experiments were also trained against burned area data sets but included only the reduced set of 16 databased predictor variables that are available from both observational data sets and the FireMIP (fm) models.
-"RF.FireMIP.fm"for model-derived RF experiments: these experiments used the reduced set of predictor variables with land cover, GPP, biomass, and the response variable burned area taken from simulations of each FireMIP model.The purpose of these experiments was to compare relationships and sensitivities from satelliteand FireMIP-derived RF experiments.

Importance of predictor variables in random forest
The normal method of determining the importance of predictor variables for RFs (increment in the mean-squared error -MSE) was found to be overly sensitive to the burnt area data set that was used in training because of the highly skewed distribution of burned area, and this hampers its interpretability (Figs.S8, S9).To overcome this issue and to obtain additional information about the regional (i.e.grid cell-level) importance of predictor variables, we developed an alternative approach.This alternative approach uses the fractional variance (FV) and Spearman correlation (r) instead of the MSE and is computed for each grid cell.The importance of variables is quantified as a distance D in a two-dimensional space based on these metrics: where FV 0 and r 0 , and FV p and r p are the performance metrics based on the original RF predictions and based on the RF predictions after permuting a single predictor variable, respectively.The FV-related term was multiplied by 0.5 to obtain the same range as the correlation.FV and r are computed at the grid cell-level based on the monthly burned area time series from the RF predictions and the training data (i.e.burned area from a satellite data set or from a FireMIP model).As the metric D depends on the permutation, we permutated each predictor variable 10 times and averaged the D metric.

Deriving emergent relationships and sensitivities from random forest
Insight into the shape of a relationship between a predictor and the target variable in a trained RF can be obtained from partial dependence (PDP) (Friedman, 2001) and individual conditional expectation (ICE) plots (Goldstein et al., 2013; Fig. S2).PDPs show the partial relationship between the predicted target variable and one predictor variable when other predictor variables are set to their mean value.ICE plots show the relationship between the predicted target variable and one predictor variable for individual cases of the predictor data set (Goldstein et al., 2013).In our application, an individual case is a specific combination of climate, land cover, vegetation, and socio-economic data for a given grid cell in a given month (Fig. S2).The average of all ICE curves corresponds to the PDP.We used the ICEbox package V1.1.2for R for the computation of ICE curves and partial dependencies (Goldstein et al., 2013).We computed ICE curves for all predictor variables and from all RF experiments (Supplement Sects.S4 and S5).We computed ICE curves and PDPs based on the global data set to analyse and compare global emergent relationships.The Pearson correlation coefficient was computed between pairs of satellite-and model-derived ICE curves to quantify the agreement of the emergent relationships (Fig. S15).We also computed PDPs for each grid cell to produce global maps of partial sensitivities for selected predictor variables.To summarize and map the PDP of each grid cell in a single number, we fitted a linear quantile regression to the median between the partial dependence of burned area and the corresponding predictor variable and mapped the slope of this regression.In the following, we name this slope "sensitivity".

Evaluation of temporal burned area dynamics
Here we compare the monthly temporal dynamics of burned area from the satellite data sets, FireMIP model simulations, and random forest predictions for the overlapping period from 2005 to 2011.The satellite data sets showed relatively good agreement with each other on average (i.e."good" is r ≥ 0.25 and −0.7 ≤ FV ≤ 0.7) over 70 % of the global vegetated land area, with the best agreement found in frequently burning grasslands and savannahs (Fig. 2a).However, individual data sets only showed good agreement over 31 %-56 % of the land area (Fig. S3).The largest dissimilarities between burned area data sets occurred in temperate regions with a high land use intensity (e.g.North America, Europe, China), tropical forests, and in sparsely vegetated arid and tundra regions.These difference are likely caused by limited detection possibilities under cloud cover (e.g. in the Amazon) and the sensitivities of the algorithms regarding the detection of small fires (temperate and sparsely vegetated regions).As the CCI_MERIS data set is based on a different sensor, it is the most different from the other data sets (31 % of land area showed good agreement, Fig. S3).Hence, these uncertainties make it necessary to separately train RF to each data set in order to assess how such uncertainties translate into emergent relationships to burned area.
FireMIP models showed good agreement with satellite data sets over 9 % of the land area (Fig. 2b).In particular, models tended to underestimate the variability of burned area in key biomass burning regions, while overestimating  S3), (b) FireMIP model simulations and all satellite data sets (Fig. S4), (c) predicted burned area from RF and all satellite data sets ("full" set of predictor variables, Fig. S5), and (d) predicted burned area from RF trained against FireMIP models and the corresponding simulated burned area from each FireMIP model (Fig. S7).Green percentage numbers indicate the land area with "good" agreement (correlation ≥ 0.25 and −0.7 ≤ FV ≤ 0.7).Regions with missing data (white) either had no vegetation cover (e.g.deserts, ice sheets), had no burned area (e.g.parts of the Amazon and tundra), or were not covered by the vegetation carbon map used (i.e.regions in southern Australia and New Zealand).
fire variability in arid and some temperate regions with infrequent fire activity.Individual FireMIP models displayed weaker performance than the model ensemble (6 % to 8 % with good agreement, Fig. S4).
The RF models can reproduce the temporal dynamics of the satellite burned area data sets reasonably well in most frequently burned regions (Fig. 2c).The overall proportion of the vegetated land area that showed good agreement in "full" experiments was only 36 % but individual RF models reached better performances (up to 41 % with good agreement, Fig. S5).The "fm" RF models had slightly weaker performance (22 % to 38 % with good agreement, Fig. S6).However, the performance of RF models was much higher than the performance of FireMIP models (Fig. 2b).RF was also able to largely emulate the simulated burned area from FireMIP models (85 % with good agreement with the FireMIP simulation, Fig. 2d).The RF models most closely emulated simulated burned area in FireMIP models that are based on empirical relationships (JULES, LPJG-SIMF, Fig. S7).In summary, the ability of RF models to emulate simulated or observed monthly burned area dynamics is sufficient for the purposes of comparing satellite-derived and FireMIP-derived relationships.

Importance of predictor variables in random forest
Satellite-derived RF experiments show that temperaturerelated variables were the most important predictors for temporal burned area dynamics in temperate and boreal regions, and land cover-or productivity-related variables were the most important in subtropical and tropical regions (Fig. 3a).Maximum temperature had on average the highest importance globally and was the most important predictor over 30 %-40 % of the land area in satellite-derived RF (Fig. 3c and e).Productivity and land cover-related variables (i.e.mostly precedent 6-monthly GPP and broadleaved deciduous tree cover in savannahs) were the most important predictors over another 20 %-30 % of the land area.Dryness-related predictor variables (WET and P ) were most important in tropical forest regions.Human-related predictor variables were only most important in a few grid cells, whereby cropland cover was, on average, of higher importance than population density (Fig. 3c).The satellite-derived importance was very similar among the burned area data sets (Fig. 3e).
On average, the FireMIP model-derived RF experiments broadly reproduced the satellite-derived importance of predictor variables (Fig. 3b).However, maximum temperature, precedent 6-monthly GPP, and number of wet days had a lower importance, but diurnal temperature range, cropland cover, and precipitation had a higher importance than in the satellite-derived RFs (Fig. 3d).In addition, the modelderived importance of predictor variables differed among FireMIP models (Figs.3e, S10).Most model-derived RF experiments underestimated the importance of precedent 6-monthly GPP and showed large differences in the importance of land cover-related predictors.The strongly varying size of the yellow and green bars in Fig. 3e indicate that differences in simulated burned area between FireMIP models mostly originate from how productivity and land cover effects on fire are represented.

Climate
The satellite-derived global relationships showed expected patterns between burned area and climate variables: burned area increased exponentially with maximum temperature, decreased with an increasing number of wet days per month, and increased with diurnal temperature range (Fig. 4a-c).ables were robust among the burned area data sets (Fig. S11).However, burned area data sets show offsets between the relationship curves: for example, the curves that were derived from the GFED4s and CCI_MERIS data sets usually show higher burned area than the curves from the other data sets (Fig. S11).These positive offsets are caused by the fact that GFED4s and CCI_MERIS include more small fires and therefore have an overall higher burned area than the other data sets.RF experiments that either use the "full" or "fm" set of predictor variables resulted in largely similar relationships (Fig. S12).
The relationships between burned area and climate variables were broadly similar for the FireMIP models (Figs.4ac, S15, S16).Most model-derived global relationships agreed relatively well (r > 0.5) with satellite-derived relationships for maximum temperature, diurnal temperature range, and the number of wet days (Fig. 5).However, LPJG-SPITF and ORCHIDEE did not reproduce the satellite-derived increase of burned area with maximum temperature (Fig. 4a).In the case of LPJG-SPITF, this is likely due to a modification to the calculation of dead fuel moisture.In contrast to other SPITFIRE implementations, LPJG-SPITF uses soil moisture in part to determine dead fuel moisture.This likely explains the failure of LPJG-SPITF to reproduce the dependency on maximum temperature and the markedly different behaviour from the other SPITFIRE models seen here.CLM and JS-BACH did not reproduce the decrease in burned area with increasing number of wet days (Fig. 4b).
Regionally, sensitivities to maximum temperature were positive over most land areas in satellite-derived RF experiments (Figs.6a, S18).Regional sensitivities to the number of wet days were negative over most land areas but were positive in arid regions and in the boreal regions of northern America (Figs.6d, S20).Most FireMIP models tended to overestimate the regional sensitivities between maximum temperature and burned area in comparison to the satellitederived sensitivities in most non-forested regions (Fig. 6b-c).Regional sensitivities to wet days were very different among FireMIP models and were different to the satellite-derived sensitivities (Figs.6e-f, S21).In summary these results show that fire-enabled DGVMs broadly reproduced the overall relationships and sensitivities of burned area with climate variables.

Socio-economics
The satellite-derived global relationships showed that burned area increased exponentially as population density decreased at very low values (< 20 people km −2 ) and, generally, showed no sensitivity when population density was > 40 people km −2 (Figs.4d, S13a).Regionally, the satellitederived sensitivity to population density varied with vegetation type.It was negative in most grassland and savannah ecosystems but positive in infrequently burned forested ecosystems (Fig. 6g).Burned area exponentially increased at a very low gross domestic product per capita (Fig. S13b).The relationship between burned area and cropland area was nonmonotonic: all data sets showed a burned area peak at < 5 % cropland, minimum burned area at 5 %-30 % cropland cover, and an increasing burned area at > 30 % cropland cover (Fig. S13c).The satellite-derived relationships with cropland cover only had moderate correlations with the other satellitederived relationships for some data sets (e.g.r = 0.53 for CCI_MODIS, Fig. S15k) because global burned area products are not very accurate for agricultural fires (Hall et al., 2016).
The relationships between burned area and population density were very different among FireMIP models and contrasted somewhat with the satellite-derived relationships (Figs. 4d,S23).ORCHIDEE, LPJG-SIMF, and partly CLM and JSBACH reproduced the satellite-derived decline of burned area with increasing population density (r > 0.4, Fig. S15).LPJG-SPITF, CTEM, and JULES had a weak agreement with the satellite-derived sensitivities (r < −0.34).However, the model ensemble median reproduced the regionally negative relationships in savannahs and the partly positive relationships in forest regions (Fig. 6h-i).FireMIP model sensitivities to cropland cover showed large differences in comparison to satellite-derived sensitivities (Fig. S16g).Only LPJG-SIMF reached a comparable corre-lation (r = 0.41) to the satellite-derived sensitivity as its internal formulation reduces burned area with increasing cropland cover; however, it does not simulate crop fires.These large differences in the sensitivities of burned area to socioeconomic variables demonstrate that fire-enabled DGVMs mostly disagree on how human effects on fire should be represented.

Land cover, vegetation productivity, and biomass
The satellite-derived global relationships to vegetationrelated predictor variables showed that burned area increased with increasing herbaceous vegetation cover (Fig. 4e), with precedent 6-monthly GPP (Fig. 4f), with precedent 6monthly LAI (Fig. S14b), and with woody litter (Fig. S14h).The satellite-derived relationships were moderately to highly correlated for most land cover types and for vegetation carbon (Fig. S15f-o).Regionally, the satellite-derived relationship with herbaceous vegetation cover was positive in most ecosystems but negative in agricultural areas in Europe, India, eastern Asia, and North America (Fig. 6j).The regional sensitivity to precedent 6-monthly GPP was strongly positive in most semi-arid regions (Fig. 6m).These relationships reflect the importance of plant productivity and fuel production for burned area.Burned area decreased with increasing actual-month LAI (Fig. S14a-d), reflecting the fact that fires usually do not occur during the wet season when LAI is high in semi-arid regions.Globally, burned area showed a bimodal sensitivity to grass height and litter depth (Fig. S14fg).In summary, the satellite-derived sensitivities demonstrate a strong global dependence of burned area dynamics on vegetation type and coverage, litter fuels, pre-season plant productivity, and fuel accumulation.

Discussion and conclusions
In summary, fire-enabled DGVMs showed the best correlations with monthly observed burned area in some savannah regions in Africa and South America.However, models generally underestimated the variance of burned area in most fire-prone semi-arid ecosystems and overestimated the variance in temperate regions.Using the RF machine learning  (d-f) number of wet days per month, (g-i) population density, (j-l) herbaceous vegetation cover, and (m-o) precedent 6-monthly GPP.Sensitivities are slopes of a linear quantile regression fit to grid cell-level partial dependencies between burned area and the predictor variables as derived from satellite-derived "fm" RF experiments (left column) and model-derived RF experiments (middle column).The right column shows the difference between model-and satellite-derived sensitivities.Stippling indicates locations where fewer than two model-derived sensitivities are within the range of satellite-derived sensitivities.Sensitivities for individual satellite data sets and FireMIP models are shown in Figs.S18 to S27.Regions with missing data (white) had no vegetation cover (e.g.deserts, ice sheets), had no burned area (e.g.parts of the Amazon and tundra), or were not covered by the vegetation carbon map used (i.e.regions in southern Australia and New Zealand).algorithm, we were able to diagnose reasons for these differences between data and models: fire-enabled DGVMs largely reproduced data-derived relationships and sensitivities between burned area and climate variables.However, models showed very different relationships with socio-economic variables and generally underestimated sensitivities to preseason plant productivity in all semi-arid ecosystems.As a consequence, these results point towards fuel properties and fuel dynamics, and human-fire interactions as components of fire-enabled DGVMs that should be the focus of future model development.In the following, we will discuss methodological aspects of our applied pattern-oriented model evaluation approach (Sect.4.1), discuss controls on fire in data and models (Sect.4.2), and finally provide suggestions on how to improve fire-enabled DGVMs using current Earth observation data sets (Sect.4.3).

Pattern-oriented evaluation of DGVMs using machine learning
Simply speaking, simulations of fire (e.g.burned area) in DGVMs can be wrong because (1) the vegetation model simulates incorrect vegetation distributions, plant productivity, and hence fuels, or (2) because the fire module misrepresents the response of fire to weather, humans, or fuel properties.Classical model benchmarking uses, for example, maps of burned area, biomass, and tree cover to quantify the model-data mismatch between these variables (Kelley et al., 2013;Schaphoff et al., 2018).However, classical model benchmarking does not allow one to disentangle the individual effects of the vegetation or fire module on the simulated burned area, as errors in the simulated vegetation might be caused by errors in burned area and vice versa.Because we use the same climate forcing, and vegetation state variables derived from each model in our machine learning approach, we are able to evaluate the response of fire models independent of their underlying DGVMs.This allows us to derive (as partial dependencies or individual conditional expectations) and evaluate the relationships between predictors and the response for each fire module separately.Hence, we are able to attribute deficiencies in fire-enabled DGVMs to human-and productivity-influences on fire.Previously, a similar approach also used a tree-based machine learning algorithm to evaluate drivers of soil carbon stocks in observational databases and in Earth system models (Hashimoto et al., 2017).Unlike classical model benchmarking, such pattern-oriented model evaluation approaches help to diagnose the reasons for model-data mismatches.
The core of our pattern-oriented model evaluation is the application of a machine learning algorithm to learn emergent relationships from data or models.We used the random forest algorithm because this algorithm has previously been used to identify drivers of burned area (Aldersley et al., 2011;Archibald et al., 2009) and does not require any assumptions about the statistical distribution of predictor variables, the shape of relationships, or the interactions between predictor variables, unlike algorithms such as generalized additive/linear models (Bistinas et al., 2014;Forkel et al., 2017;Krawchuk et al., 2009).Other flexible algorithms such as maximum entropy have also been used in empirical fire modelling (Moritz et al., 2012;Parisien et al., 2016) with very similar prediction performance and importance of variables compared to random forest (Arpaci et al., 2014).In addition, the emergent relationships between predictors and burned area that we identified here show the same directions as the relationships that were found in a previous study based on generalized linear models (Bistinas et al., 2014).These findings suggest that the choice of machine learning algorithm only marginally affects the direction and overall shape of the derived relationships.

Controls on burned area
Following previous studies, we found that climate is the primary control of global burned area which directly affects fire through weather and fuel moisture conditions, and indirectly through ecosystem productivity, vegetation type, and fuel loads (Archibald et al., 2013;Krawchuk and Moritz, 2011).Fire results from an interplay of several meteorological variables, thereby maximum temperature is an important predictor globally -especially in northern temperate and boreal ecosystems.Fire-enabled DGVMs generally reproduced the relationships with maximum temperature but overestimated the sensitivity in grassland and savannah ecosystems on average.Relationships and sensitivities with the number of wet days showed larger differences among models and, when compared to satellite-derived relationships, suggest that climate effects on fuel moisture need to be improved in fireenabled DGVMs.
As an indirect climate effect, we found that previous season plant productivity was among the most important predictor variables globally and was the dominant predictor with the strongest sensitivity to burned area in semi-arid savannah regions.It has long been recognized that the occurrence and development of fires is affected by the production and accumulation of fuels (Krawchuk and Moritz, 2011;Pausas and Ribeiro, 2013).Plant productivity in fire-prone semi-arid ecosystems has a high year-to-year variability (Ahlström et al., 2015).Our results demonstrate that the inter-annual variability in productivity and hence fuel accumulation is an important driver of the variability in burned area.Most fireenabled DGVMs poorly captured the importance, relationship, and sensitivity of previous-season plant productivity to burned area.This may be a reason why they underestimate observed variability in burned area and why they misrepresent trends in fire occurrence in Africa as well as globally (Andela et al., 2017).
While climate and fuel controls when and where fires can burn, humans are responsible for the majority of fire ignitions while also simultaneously suppressing fire.We found a www.biogeosciences.net/16/57/2019/Biogeosciences, 16, 57-76, 2019 strong decline of burned area with increasing population density between 0 and 20 people km −2 which confirms previous findings (Bistinas et al., 2014;Knorr et al., 2014).Human effects on fire emerge from various activities such as traditional land use practices (shifting cultivation, hunting, grazing, and grassland burning), the use of fires for land clearing or as tool in land conflicts, from prescribed small fires within fire management, and from unintended or illegal ignitions (Archibald, 2016;Bowman et al., 2011;Lauk and Erb, 2009;van Marle et al., 2017).The modest performance of random forest regarding reproducing satellite burned area suggests that we did not capture the complexity of humanfire interactions with the set of predictor variables used.For example, the complex non-monotonic relationship between burned area and cropland cover suggests that agricultural land use has diverging effects on fire in different agricultural regions of the world (Fig. S13c) (Korontzi et al., 2006).However, alternative variables such as cattle density or the night light-based index of socio-economic development were highly correlated with population density or cropland cover at the coarse resolution of our analysis; therefore, they did not add to prediction performance of random forest.At regional scales, land use or infrastructure-related variables are important predictors for fire (Archibald et al., 2009;Arpaci et al., 2014;Chuvieco and Justice, 2010;Parisien et al., 2010).However, these regional findings also show that the importance of human-related predictors largely differs between regions, which complicates its applicability for global-scale fire modelling.However, random forest largely emulated the simulated burned area from FireMIP models, which suggests that we indeed included the main predictors for the model world.Although some newer global fire models include the effects of cropland and pasture management on fires (Rabin et al., 2018), the complexity of human-fire interactions currently lacks a solid and large-scale empirical basis that would allow researchers to derive alternative formulations on human-fire interactions for fire-enabled DGVMs.

Improving vegetation controls on fire in DGVMs
Our results demonstrate that the role of vegetation on fire needs to be better represented in fire-enabled DGVMs to accurately simulate the variability of burned area.The links between vegetation productivity, fuel production, and fire need to be improved.Fuel production depends on plant productivity, and on the allocation, turnover, and respiration processes of carbon in different fuel types.As a first step for model improvement, fire-enabled DGVMs need to be tested and possibly re-calibrated against observations or observation-based estimates of plant productivity, above-ground biomass, and carbon turnover (Carvalhais et al., 2014;Thurner et al., 2016Thurner et al., , 2017)).Beyond total above-ground biomass, the evaluation of different fuel types (e.g.biomass in wood, canopy and understory, and litter size classes) is currently hampered by the availability of data.Only a few in situ measurements of fuel loads exist (van Leeuwen et al., 2014) and global maps of fuel properties are based on spatial extrapolations including various assumptions and uncertainties (Pettinari and Chuvieco, 2016).As an alternative, hybrid data-model-based approaches such as land carbon cycle data assimilation systems (Bloom et al., 2016) may provide consistent information to benchmark vegetation productivity, turnover, and litter fuel dynamics in DGVMs.Fire largely depends on the vegetation type (Rogers et al., 2015).Also our results show consistent land cover-specific relationships to burned area in satellite data, but these relationships differ among FireMIP models and in comparison to the satellite-derived relationships (Fig. S17).Vegetation types and associated morphological, biochemical, and structural characteristics of plants affect the flammability and fire tolerance of vegetation (Archibald et al., 2018;Pausas et al., 2017).Although global fire models have PFT-specific parameterizations for flammability (Thonicke et al., 2010), such fire-relevant plant characteristics need to be incorporated in DGVMs (Zylstra et al., 2016).Such efforts should be complemented by calibrating DGVMs against satellite observations that provide relevant information about the spatial distributions of fuel structure (Pettinari and Chuvieco, 2016;Riaño et al., 2002), fuel moisture content (Yebra et al., 2013(Yebra et al., , 2018)), fire ignitions and spread (Laurent et al., 2018), fuel consumption (Andela et al., 2016), and fire radiative energy (Kaiser et al., 2012).In summary, besides human-fire interactions, we identified vegetation effects on fire as a main deficiency of fire-enabled dynamic global vegetation models in simulating temporal dynamics of burned area.
Data availability.Data are available from the references as indicated in Table A1.
Appendix A P u blis h e r s p a g e : h t t p s:// doi.o r g/ 1 0. 5 1 9 4/ b g-1 6-5 7-2 0 1 9 Pl e a s e n o t e: C h a n g e s m a d e a s a r e s ul t of p u blis hi n g p r o c e s s e s s u c h a s c o py-e di ti n g, fo r m a t ti n g a n d p a g e n u m b e r s m a y n o t b e r efl e c t e d in t hi s v e r sio n.Fo r t h e d efi nitiv e v e r sio n of t hi s p u blic a tio n, pl e a s e r ef e r t o t h e p u blis h e d s o u r c e .You a r e a d vis e d t o c o n s ul t t h e p u blis h e r's v e r sio n if yo u wis h t o ci t e t hi s p a p er.This ve r sio n is b ei n g m a d e a v ail a bl e in a c c o r d a n c e wi t h p u blis h e r p olici e s. S e e h t t p://o r c a .cf. a c. u k/ p olici e s. h t ml fo r u s a g e p olici e s.Co py ri g h t a n d m o r al ri g h t s fo r p u blic a tio n s m a d e a v ail a bl e in ORCA a r e r e t ai n e d by t h e c o py ri g h t h ol d e r s .

Figure 1 .
Figure1.Overview of the approach for using the random forest machine learning algorithm to derive emergent relationships between several predictor variables for burned area from satellite and other observation-based data and in fire-enabled DGVMs.

tween − 2
and 2 with negative values indicating an underestimation and positive values indicating an overestimation of the observed variance.The reference ref is a vector of the monthly burned area time series from all satellite data sets: ref = [BA.CCI MERIS , BA.CCI MODIS , BA.GFED4, BA.GFED4s, BA.MCD64C6]

Figure 2 .
Figure 2. Comparison of temporal burned area dynamics from satellite data sets, fire-enabled DGVMs, and random forest.The maps show the median Spearman rank correlation coefficient and median fractional variance of the monthly burned area from 2005 to 2011 between (a) satellite data sets and the other satellite data sets (Fig.S3), (b) FireMIP model simulations and all satellite data sets (Fig.S4), (c) predicted burned area from RF and all satellite data sets ("full" set of predictor variables, Fig.S5), and (d) predicted burned area from RF trained against FireMIP models and the corresponding simulated burned area from each FireMIP model (Fig.S7).Green percentage numbers indicate the land area with "good" agreement (correlation ≥ 0.25 and −0.7 ≤ FV ≤ 0.7).Regions with missing data (white) either had no vegetation cover (e.g.deserts, ice sheets), had no burned area (e.g.parts of the Amazon and tundra), or were not covered by the vegetation carbon map used (i.e.regions in southern Australia and New Zealand).

Figure 3 .
Figure3.Grid cell-level importance of predictor variables in satellite-and FireMIP-derived RF experiments.The importance of variables is quantified as the change in the grid-cell level performance of the RF predictions after a predictor variable is permuted (D metric, see Sect.2). (a, b) Maps of the group of variables with the highest importance.For example, "temperature" (red) indicates that either TMAX or DTR had the maximum D metric and were the predictors with the highest importance in a grid cell.Please note that the predicted burned area in random forest (and in reality) emerges from multiple predictors and that the second-most important predictor (not shown in the maps) might have similar importance.(c, d) Global distributions of D for each variable from satellite-and model-derived "fm" RF experiments, respectively.Variables with the same colour are grouped together for the figures in (a), (b), and (e).(e) Area distribution of the variable groups with the highest D for each RF experiment.In (a) and (b), regions with missing data (white) either had no vegetation cover (e.g.deserts, ice sheets), had no burned area (e.g.parts of the Amazon and tundra), or were not covered by the vegetation carbon map used (i.e.regions in southern Australia and New Zealand).

Figure 4 .
Figure 4. Example of global emergent relationships of the fractional burned area per month to predictors from satellite-derived (grey; mean and range based on the five burned area data sets) and FireMIP model-derived (colours) "fm" random forest experiments for six selected variables.Tick marks along the x axis show the deciles (minimum, quantile 0.1 to maximum) of the global distribution of each predictor variable.Maximum temperature was cut at 0 • C in (a) and population density was cut at 100 people km −2 in (d).Emergent relationships for other predictor variables are shown in Figs.S16 and S17.

Figure 5 .
Figure 5. Correlations between global relationships from satellitederived and model-derived RF "fm" experiments.Pearson correlations were computed from the relationships as shown in Fig. 4. Boxes show the distribution of all model-data correlations (five satellite-derived relationships × seven FireMIP model-derived relationships).Correlations for individual satellite-and model-derived RFs are shown in Fig. S15.

Figure 6 .
Figure 6.Regional sensitivities of burned area to the driving factors for five selected variables (a-c) maximum temperature, (d-f) number of wet days per month, (g-i) population density, (j-l) herbaceous vegetation cover, and (m-o) precedent 6-monthly GPP.Sensitivities are slopes of a linear quantile regression fit to grid cell-level partial dependencies between burned area and the predictor variables as derived from satellite-derived "fm" RF experiments (left column) and model-derived RF experiments (middle column).The right column shows the difference between model-and satellite-derived sensitivities.Stippling indicates locations where fewer than two model-derived sensitivities are within the range of satellite-derived sensitivities.Sensitivities for individual satellite data sets and FireMIP models are shown in Figs.S18 to S27.Regions with missing data (white) had no vegetation cover (e.g.deserts, ice sheets), had no burned area (e.g.parts of the Amazon and tundra), or were not covered by the vegetation carbon map used (i.e.regions in southern Australia and New Zealand).

Table 1 .
Overview of used burned area data sets and FireMIP models.

Table A1 .
Overview of predictor variables, data sets used, and their utilization in random forest experiments.