Emergent relationships on burned area in global satellite observations and fire-enabled vegetation models

Recent climate changes increases fire-prone weather conditions in many regions and likely affects fire occurrence, which might impact ecosystem functioning, biogeochemical cycles, and society. Prediction of how fire impacts may change in the future is difficult because of the complexity of the controls on fire occurrence and burned area. Here we aim to assess how process-based fire-enabled Dynamic Global Vegetation Models (DGVMs) represent relationships between controlling 30 factors and burned area. We developed a pattern-oriented model evaluation approach using the random forest (RF) algorithm to identify emergent relationships between climate, vegetation, and socioeconomic predictor variables and burned area. We applied this approach to monthly burned area time series for the period 2005-2011 from satellite observations and from DGVMs from the Fire Model Inter-comparison Project (FireMIP) that were run using a common protocol and forcing datasets. The satellite-derived relationships indicate strong sensitivity to climate variables (e.g. maximum temperature, number of wet 35 days), vegetation properties (e.g. vegetation type, previous-season plant productivity and leaf area, woody litter), and to socioeconomic variables (e.g. human population density). DGVMs broadly reproduce the relationships to climate variables


2
-P 8, L 22-24: "The vegetation biomass dataset does not cover southern Australia and New Zealand. Although fire is common in these regions, we did not fill the global vegetation biomass map with a regional map to avoid potential artefacts in the derived sensitivities that would likely result from merging different biomass maps." -P 11, L3-4 and P 13, L 11-13 and P 17, L 3-5: "Regions with missing data (white) are either without vegetation cover (e.g. deserts, ice sheets), had no burned area (e.g. parts of the Amazon and tundra), or were not covered by the used 5 vegetation carbon map (i.e. regions in southern Australia and New Zealand)." -P 13, L 6-8: "Please note that the predicted burned area in random forest (and in reality) emerges from multiple predictors and that the second-most important predictor (not shown in the maps) might have similar importance." -P 19, L 23-24: "Fire results from an interplay of several meteorological variables, thereby maximum temperature was an important predictor globally and especially in northern temperate and boreal ecosystems. " 10 -P 20 L 25-26: "Our results demonstrate that the role of vegetation on fire needs to be better represented in fire-enabled DGVMs to accurately simulate the variability of burned area." -Supplement, Figs. S 16 and S17: The placement of the legend within the figures was improved. and some models to population density. Interestingly, satellite-derived responses show a strong increase of burned area with previous-season leaf area index and plant productivity in most fire-prone ecosystems which was largely underestimated by most DGVMs. Hence our pattern-oriented model evaluation approach allowed us to diagnose that vegetation effects on fire are a main deficiency of fire-enabled dynamic global vegetation models to accurately simulate the role of fire under global environmental change. 5

Introduction
About 3% of the global land area burns every year Giglio et al., 2013;Randerson et al., 2012). Fire represents a strong control on large-scale vegetation patterns and structure (Bond et al., 2004) and can significantly accelerate impacts of changing climate or land management on global ecosystems (Aragão et al., 2018;Beck et al., 2011). Fire affects global and regional climate directly through changing surface albedo (López-Saldaña et al., 2015;Randerson et al., 2006), 10 atmospheric trace gas and aerosol concentrations (Andreae and Merlet, 2001;Ward et al., 2012), and on longer time scales by affecting vegetation composition and structure with subsequent impacts on the carbon cycle and hydrology (Li and Lawrence, 2016;Pausas and Dantas, 2017;Tepley et al., 2018;Thonicke et al., 2001).
Climate influences several aspects of the fire regime, including the seasonal timing of lightning ignitions (Veraverbeke et al., 2017), temperature and moisture controls on fuel drying, and wind-driven fire spread (Jolly et al., 2015). Climate also 15 influences the nature and availability of fuel, through its impact on vegetation productivity and structure .
Vegetation structure in turn influence the patterns of fuel amounts and moisture that directly determine fire spread, severity, and extent (Krawchuk and Moritz, 2011;Pausas and Ribeiro, 2013). People set and suppress fires and use them to manage agricultural and natural ecosystems, for land use change and deforestation practices (Andela and van der Werf, 2014;Marle et al., 2017). Human-induced modifications and fragmentation of natural vegetation through agricultural expansion and 20 urbanization limits fire spread (Bowman et al., 2011). Thus, climate, vegetation, and human controls on fire are multivariate and have strong interactions with one another (Bowman et al., 2009;Harrison et al., 2010;Krawchuk et al., 2009). Empirical analyses of fire regimes by using machine learning algorithms have identified the most important variables and their sensitivities for fire occurrence and spread (Aldersley et al., 2011;Archibald et al., 2009;Bistinas et al., 2014;Forkel et al., 2017;Krawchuk et al., 2009;Moritz et al., 2012). However, because of the difficulty of factoring out interactions between 25 predictor variables, such sensitivities represent emergent relationships rather than specific physical controls on fire. Thus, it has proved difficult to disentangle the role of changes in any single factor on the trajectory of changes in fire regimes. For example, changes in climate result in increasing fire weather conditions and fire activity in some temperate regions (Holden et al., 2018;Jolly et al., 2015;Müller et al., 2015) but it has been suggested that changes in land use compensate climate effects and result for example in declining burned areas in African Savannahs (Andela and van der Werf, 2014). Hence there is still 30 uncertainty, for example, about the cause of the recent observed decline in global burned area (Andela et al., 2017). There is even greater uncertainty about the potential trajectory of changes in fire regimes in the future (Settele et al., 2014).
Fire-enabled Dynamic Global Vegetation Models (DGVMs) or Earth System Models are process-oriented tools to predict the consequences of future climate change on fire regimes and associated feedbacks (Hantson et al., 2016). Our faith in these projections is contingent on the ability of these models to capture features of the current situation. State-of-the-art fire-enabled DGVMs partly capture the spatial patterns of burned area (Andela et al., 2017;Kelley et al., 2013) but doubt has been cast on their ability to capture the response to extreme events and recent trends in burned area (Andela et al., 2017). This suggests that 5 these models inaccurately represent the response of fire to combined changes in climate, vegetation, and socioeconomic drivers.
Here we aim to test how fire-enabled DGVMs reproduce emergent relationships with the drivers of burned area. We apply a machine learning algorithm to the output from seven fire-enabled DGVMs and a suite of satellite and other observation-based datasets in order to derive emergent relationships between a number of potential drivers of burned area. By comparing the 10 model-and data-derived emergent relationships, we assess the degree to which DGVMs reproduce these relationships. While we make no assumption about the actual physical controls on burned area, this comparison allows us to pinpoint relationships between drivers and burned area that are unrealistic represented in fire-enabled DGVMs.

Method summary 15
In order to infer relationships between potential drivers of fire in satellite data and fire-enabled DGVMs, we applied the random forest (RF) machine-learning algorithm to predict monthly burned area (response variable) from climate, vegetation, and socioeconomic predictor variables ( Figure 1). Predictor variables and burned area were taken either from satellite and other observation-based datasets or from simulations by a suite of fire-enabled DGVMs from the Fire Model Inter-comparison project (FireMIP) (Rabin et al., 2017) to derive relationships for datasets and models, respectively. 20 The RF algorithm is a regression approach that allows non-linear, non-monotonic, and non-additive relations between multiple predictor variables and the target variable. RF averages predicted values across an ensemble of decision trees that are built based on the training data set (Breiman, 2001;Cutler et al., 2012). We built three sets of RF models. We first built RF models for satellite-observed burned area based on a multitude of predictor variables to derive relationships from data. There are differences between the available burned area datasets; we therefore used five recent and/or well-established datasets to 25 encompass these uncertainties. The fire-enabled DGVMs do not use some of the predictor variables in the satellite-derived RF. We hence built a second set of satellite-derived RF models with a reduced set of predictor variables. The third set of RF models was derived for each FireMIP model by using the simulated burned area as target variable and simulations of gross primary production (averaged over precedent months), biomass and land cover predictor variables, and the population density and climate predictor variables that were used as inputs for the models (according to the FireMIP protocol  (Breiman, 2001;Cutler et al., 2012) and partial dependencies curves (Goldstein et al., 2013). These dependencies represent the emergent relationships of burned area to drivers in the observation-or model-variable space. We then compared the data-and model-derived emergent relationships and sensitivities both globally and per grid cell basis. 5 Figure 1: Overview of the approach on using the random forest machine learning algorithm to derive emergent relationships between several predictor variables for burned area in satellite and other observation-based data and in fire-enabled DGVMs.

Burned area from satellite datasets
There are several global burned area datasets, and both the spatial patterns and temporal dynamics differ between them 10 (Hantson et al., 2016;Humber et al., 2018) because they use different satellite sensors and retrieval algorithms, and have different sensitivities to small fires Giglio et al., 2013;Randerson et al., 2012). We used the variability between five global datasets (Table 1) as an estimate of uncertainty. However, by doing so we might still underestimate the real uncertainty in burned area observations because all datasets rely on active fire detections (thermal anomaly) and on reflectance changes from the same sensor (MODIS). As exception, CCI_MERIS uses MERIS reflectances combined with MODIS active fires.
We restricted our analysis to burned area data with high observational quality. Observational quality indicates to which degree missing input satellite imagery or contaminations by clouds, smoke, snow and shadows limit burned area detection. Especially MERIS land observations are subject to substantial gaps in raw data acquisitions . Low observational 5 coverage can result in strongly underestimated burned area. Here, we used the CCI_MERIS "observed area fraction" layer as a time-variant mask to all burned area datasets and only included estimates for months with observational coverage higher than 80 %. We also excluded burned area in months with < 0 °C to remove suspicious small burned areas in polar regions or in winter months that are likely caused by insufficiently corrected gas flares and other industrial activities. Analyses were made with monthly burned area observations for the period 2005-2011, which is the common period between the five datasets. 10

Burned area from FireMIP models
A detailed description of FireMIP DGVMs and the simulation protocol is given by Rabin et al. (2017). Here we used monthly burned area from seven models that made transient simulations from 1700 to 2013 (Table 1, bottom half). The models were forced using inputs of meteorological variables from the CRUNCEP V5 dataset (Wei et al., 2014), monthly cloud-to-ground lightning strikes (Rabin et al., 2017), annually-updated values of human population density from the HYDE 3.1 data set ( Klein 15 Goldewijk et al., 2010), annually-updated land use and land cover changes from the Hurtt et al. (2011) data set, and annuallyupdated values of global atmospheric CO2 (Le Quéré et al., 2014). Although forcing datasets are common across DGVMs, they do not use the same set of forcing variables, i.e. wind speed (WSPEED), or use population density (PopDens) for fire ignitions and/or fire suppression.
The model outputs were aggregated to a common spatial resolution of 2.5° longitude x 1.89° latitude. Aggregation was done 20 by averaging the fractional burned area from all high-resolution grid cells that belong to the same coarse-resolution grid cell.
Nearest neighbour resampling was done if less than two high-resolution grid cells were within one coarse-resolution grid cell.
Analyses were made for the same period as the common window of the satellite data (2005)(2006)(2007)(2008)(2009)(2010)(2011) and by also applying the "observed area mask" from the satellite data.

Evaluation of data-data and model-data temporal agreement
We evaluated the temporal agreement of monthly burned area time series in 2005-2011 between the datasets and between the datasets and the fire-enabled DGVMs based on various model performance metrics (Janssen and Heuberger, 1995) on a per-5 7 grid cell basis. We selected the Spearman rank-correlation coefficient to compare the temporal agreement and the fractional variance (FV) to compare the variability of burned area per grid cells: where σref and σx are the variances of the reference and observed or simulated burned area, respectively. FV ranges between - In the case of a comparison of a single satellite datasets (e.g. x = BA.CCI_MERIS) with the other satellite datasets, this dataset was not used in the reference vector. This approach directly considers the differences between datasets in the computation of model performance metrics and implies that it is impossible for a FireMIP model or for one single satellite dataset to reach an 10 optimal correlation of unity or a FV of zero as long as the satellite burned area datasets show differences. We used the median of the correlation coefficient and of the FV for each grid cell to quantify the data-data or model-data agreement over the ensemble of datasets or models. As a single global agreement metric, we computed the percentage of the land area that shows a "good" agreement from the spatial patterns of Spearman correlation Cor and FV, where good agreement for an individual grid cell was defined based on a positive and non-random relationship (i.e. Cor ≥ 0.25) and a comparable variance (-0.75 ≤ FV 15 ≤ 0.75) between simulated and observed burned area.

Predictor variables and datasets
Several variables have been identified as predictors of global fire in previous studies, inter alia the number of dry or wet days per month (WET), diurnal temperature range (DTR), maximum temperature (TMAX), grass and shrub cover, leaf area index (LAI), net primary production (NPP), population density (PopDens), and gross domestic product (GDP) (Aldersley et al., 20 2011;Bistinas et al., 2014). Other variables have been found important for fire at a regional scale, including total precipitation, tree cover, forest cover type, tree height, biomass and litter fuel loads, and grazing (Archibald et al., 2009;Chuvieco et al., 2014;Parisien et al., 2010;Pettinari and Chuvieco, 2017). We created a combined set of potential variables used in these studies to predict burned area (Table A 1). We used data on gross primary production (GPP) instead of NPP as GPP can be estimated from eddy covariance observations and does not require model assumptions about autotrophic respiration. 25 Climate data. Climate data was taken from the CRUNCEP V5 dataset (Wei et al., 2014). CRUNCEP provides six-hourly time series of precipitation, maximum and minimum temperature, and wind speed. From these time series, we derived the monthly mean of daily maximum temperature (CRUNCEP.TMAX) and minimum temperature (CRUNCEP.TMIN), the monthly mean daily diurnal temperature range (CRUNCEP.DTR = TMAX -TMIN), the monthly 90 th percentile of daily wind speed (CRUNCEP.WSPEED), monthly total precipitation (CRUNCEP.P) and the number of wet days per month (CRUNCEP.WET). 30 A wet day was defined as a day with≥ 0.1 mm precipitation (Harris et al., 2014).

35
Land cover. Land cover was taken from the ESA CCI Land cover V2.0.7 dataset which provides annual land cover maps for the period 1992-2015 (Li et al., 2018). Land cover classes were converted into the fractional coverage of plant functional types (PFTs). For this conversion, we used the cross-walking approach (Poulter et al., 2011 based on the conversion table in Forkel et al. (2017). Individual PFTs combine growth form (tree, shrubs, herbaceous vegetation, or crops) with leaf type (broad-leaved or needle-leaved) and leaf longevity (evergreen or deciduous). The variable Tree.BD, for example, is the 5 fractional coverage of broad-leaved deciduous trees (Table A 1). We created an additional category combining trees and shrubs (e.g. TreeS.BD = Tree.BD + Shrub.BD) because most of the FireMIP models simulate woody vegetation rather than separating shrubs and trees explicitly (Table S 1). JULES, LPJG-SIMF, and LPJG-SPITF dynamically simulate the fractional coverage of PFTs, but CLM, CTEM, JSBACH, and ORCHIDEE used prescribed PFT distributions. We reclassified the PFTs of each model into the same set of PFTs that we derived from the CCI land cover dataset (Table S 1

). 10
Vegetation productivity. Data on gross primary production (GPP) and leaf area index (LAI) were taken to account for the seasonal effects of vegetation productivity and canopy development. GPP was taken from the FLUXCOM dataset which is up-scaled from GPP estimates at FLUXNET measurement sites (Tramontana et al., 2016). We used the FLUXCOM dataset that used satellite and CRUNCEP meteorological data for the upscaling. LAI was taken from MODIS (USGS, 2001, p.2). GPP and LAI were averaged to monthly mean values (e.g. variable name GPP.orig). To account for seasonal fuel accumulation, we 15 also computed previous-season GPP or LAI values as the mean over the three and six months before the month of comparison with burned area (e.g. GPP.pre3mon and GPP.pre6mon).

Biomass and fuels.
We used temporally-static vegetation datasets to account for the effects of vegetation biomass, fuel properties, and ecosystem structure on burned area dynamics. Total above-and below-ground vegetation biomass was obtained from Carvalhais et al. (2014), which is based on an above-ground forest biomass map for the tropics for the early 2000s 20 (Saatchi et al., 2011), a total forest biomass map for temperate and boreal forests for the year 2010 , and an estimate of herbaceous biomass . The vegetation biomass dataset does not cover southern Australia and New Zealand. Although fire is common in these regions, we did not fill the global vegetation biomass map with a regional map to avoid potential artefacts in the derived sensitivities that would likely result from merging different biomass maps. From each FireMIP model, we used the simulated vegetation carbon averaged for the years 2005-2011 as the equivalent to this data 25 set. We used canopy height from Simard et al. (2011); this data set provides a snapshot of average canopy height in 2005.
Factors related to fuel properties, specifically grass height, litter depth, woody litter depth, and amounts of woody litter in different size classes were extracted from the global fuelbed database (Pettinari and Chuvieco, 2016). This database is based on a land cover-based extrapolation of regional fuel databases to the globe and provides a generic picture of conditions around

30
Socioeconomic data. We used the annually-varying population density dataset from the HYDE V3.1 database (Klein Goldewijk et al., 2011), which was used as a forcing dataset for the FireMIP simulations. We also used annually-varying gross

Random forest experiments and selection of predictor variables
We performed our analysis using the randomForest package V4.6-12 in R (Liaw and Wiener, 2002). We trained the RF with 500 regression trees. The training target was either a "satellite-observed" or a "model-simulated" burned area, i.e. we trained 5 one RF against each burned area dataset and each individual FireMIP model simulation, respectively. We used two sets of predictor variables in three sets of RF experiments (Table A 1):  "RF.Satellite.full" for satellite-derived RF experiments: We used 23 out of all 28 predictor variables to train RF models for each burned area dataset. Five predictor variables were not included in the RF because they were highly correlated with others (r > 0.8, i.e. night-light development index, cattle density, woody litter for the 10 h fuel size 10 class, precedent 3-monthly GPP, and precedent 3-monthly LAI, Figure S 1). The purpose of these experiments was to identify the relationship between burned area and each predictor variable from datasets.
 "RF.Satellite.fm" for satellite-derived RF experiments: These experiments were also trained against burned area datasets but included only the reduced set of 16 data-based predictor variables that are available from both observational datasets and the FireMIP (fm) models. 15  "RF.FireMIP.fm" for model-derived RF experiments: These experiments used the reduced set of predictor variables with land cover, GPP, biomass, and the response variable burned area taken from simulations of each FireMIP model.
The purpose of these experiments was to compare relationships and sensitivities from satellite-and FireMIP-derived RF experiments.

Importance of predictor variables in random forest 20
The normal method of determining the importance of predictor variables for RFs (increment in mean-squared error, MSE) was found to be overly sensitive to the burnt area dataset that was used in training because of the highly skewed distribution of burned area, and this hampers its interpretability (Figure S 8, Figure S 9). To overcome this issue and to obtain additional information about regional (i.e. grid cell-level) importance of predictor variables, we developed an alternative approach.
This alternative approach uses the fractional variance (FV) and Spearman correlation (r) instead of the MSE and is computed 25 for each grid cell. The importance of variables is quantified as a distance D in a two-dimensional space based on these metrics: where FV0 and r0, and FVp and rp are the performance metrics based on the original RF predictions and based on the RF predictions after permuting a single predictor variable, respectively. The FV-related term was multiplied with 0.5 to obtain the same range as the correlation. FV and r are computed at grid cell-level based on monthly burned area time series from the RF 30 Deleted: like predictions and the training data (i.e. burned area from a satellite dataset or from a FireMIP model). As the metric D depends on the permutation, we permutated each predictor variable 10 times and averaged the D metric.

Deriving emergent relationships and sensitivities from random forest
Insight into the shape of a relationship between a predictor and the target variable in a trained RF can be obtained from partial dependence (PDP) (Friedman, 2001) and individual conditional expectation (ICE) plots (Goldstein et al., 2013) (Figure S 2). 5 PDPs show the partial relationship between the predicted target variable and one predictor variable when other predictor variables are set to their mean value. ICE plots show the relationship between the predicted target variable and one predictor variable for individual cases of the predictor dataset (Goldstein et al., 2013). In our application, an individual case is a specific combination of climate, land cover, vegetation, and socioeconomic data for a given grid cell in a given month (Figure S 2).
The average of all ICE curves corresponds to the PDP. We used the ICEbox package V1.1.2 for R for the computation of ICE 10 curves and partial dependencies (Goldstein et al., 2013).
We computed ICE curves for all predictor variables and from all RF experiments (Supplementary Information 4 and 5). We computed ICE curves and PDPs based on the global dataset to analyse and compare global emergent relationships. Pearson's correlation coefficient was computed between pairs of satellite-and model-derived ICE curves to quantify the agreement of the emergent relationships ( Figure S 15). We also computed PDPs for each grid cell to produce global maps of partial 15 sensitivities for selected predictor variables. To summarize and map the PDP of each grid cell in a single number, we fitted a linear quantile regression to the median between the partial dependence of burned area and the corresponding predictor variable and mapped the slope of this regression. In the following, we name this slope "sensitivity".

Evaluation of temporal burned area dynamics 20
Here we compare the monthly temporal dynamics of burned area from the satellite datasets, FireMIP model simulations, and random forest predictions for the overlapping period 2005-2011. The satellite datasets showed in average relatively good agreement with each other (i.e. "good" is r ≥ 0.25 and -0.7 ≤ FV ≤ 0.7) over 70% of the global vegetated land area, with best agreement in frequently burning grasslands and savannahs (Figure 2 a). However, individual datasets showed good agreement in only 31-56% of the land area ( Figure S 3). The largest dissimilarities between burned area datasets occurred in temperate 25 land use-intense regions (North America, Europe, China), tropical forests, and in sparsely vegetated arid and tundra regions.
These difference are likely caused by limited detection possibilities under cloud cover (e.g. in the Amazon) and by the sensitivities of the algorithms to detect small fires (temperate and sparsely-vegetated regions). As the CCI_MERIS dataset is based on a different sensor, it is the most different from the other datasets (31% of land area with good agreement, Figure S 3). Hence these uncertainties make it necessary to train RF to each dataset separately in order to assess how such uncertainties 30 translate into emergent relationships to burned area.
FireMIP models showed good agreement with satellite datasets in 9% of the land area (Figure 2 b). In particular, models tended to underestimate the variability of burned area in key biomass burning regions, while overestimating fire variability in arid and some temperate regions of infrequent fire activity. Individual FireMIP models had weaker performance than the model ensemble (6% to 8% with good agreement, Figure S 4).  predicted burned area from RF trained against FireMIP models and the corresponding simulated burned area from each FireMIP model (Figure S 7). Green percentage numbers indicate land area with "good" agreement (Correlation ≥ 0.25 and -0.7 ≤ FV ≤ 0.7). Regions with missing data (white) are either without vegetation cover (e.g. deserts, ice sheets), had no burned area (e.g. parts of the Amazon and tundra), or were not covered by the used vegetation carbon map (i.e. regions in southern Australia and New Zealand).

5
The RF models can reproduce the temporal dynamics of the satellite burned area datasets reasonably well in most frequently burning regions (Figure 2 c). The overall proportion of the vegetated land area showing good agreement in "full" experiments was only 36% but individual RF models reached better performances (up to 41% with good agreement, Figure S 5). The "fm" RF models had slightly weaker performance (22% to 38% with good agreement, Figure S 6). However, the performance of RF models was much higher than the performance of FireMIP models (Figure 2 b). RF was also able to largely emulate the 10 simulated burned area from FireMIP models (85% with good agreement with the FireMIP simulation, Figure 2 d). The RF models most closely emulated simulated burned area in those FireMIP models that are based on empirical relationships (JULES, LPJG_SIMF, Figure S 7). In summary, the ability of RF models to emulate simulated or observed monthly burned area dynamics is sufficient for the purposes of comparing satellite-derived and FireMIP-derived relationships.

Importance of predictor variables in random forest 15
Satellite-derived RF experiments show that temperature-related variables were the most important predictors for temporal burned area dynamics in temperate and boreal regions, and land cover-or productivity-related variables were most important in subtropical and tropical regions (Figure 3 a) On average, the FireMIP model-derived RF experiments broadly reproduced the satellite-derived importance of predictor 25 variables (Figure 3 b). However, maximum temperature, precedent 6-monthly GPP, and number of wet days had a lower importance, but diurnal temperature range, cropland cover, and precipitation had a higher importance than in the satellitederived RFs (Figure 3 d). In addition, the model-derived importance of predictor variables differed among FireMIP models

Climate
The satellite-derived global relationships showed expected patterns between burned area and climate variables: burned area 10 increased exponentially with maximum temperature, decreased with an increasing number of wet days per month, and increased with diurnal temperature range (Figure 4 a-c). The shapes of the relationships of burned area to climate variables were robust among the burned area datasets (Figure S 11). However, burned area datasets show offsets between the relationship curves: For example, the curves that were derived from the GFED4s and CCI_MERIS datasets show usually higher burned area than the curves from the other datasets (Figure S 11). These positive offsets are caused by the fact that GFED4s and 15 CCI_MERIS include more small fires and have hence an overall higher burned area than the other datasets. RF experiments that either use the "full" or "fm" set of predictor variables resulted in largely similar relationships (Figure S 12).
The relationships between burned area and climate variables were broadly similar for the FireMIP models (Figure 4 a-c, Figure   S 15, Figure S 16). Most model-derived global relationships agreed relatively well (r > 0.5) with satellite-derived relationships for maximum temperature, diurnal temperature range, and the number of wet days ( Figure 5). However, LPJG-SPITF and 5 ORCHIDEE did not reproduce the satellite-derived increase of burned area with maximum temperature (Figure 4 a). In the case of LPJG_SPITF, this is likely due to a modification to the calculation of dead fuel moisture. In contrast to other SPITFIRE implementations, LPJG_SPITF uses soil moisture in part to determine dead fuel moisture. This likely explains the failure of LPJG_SPITF to reproduce the dependency on maximum temperature and the markedly different behaviour from the other SPITFIRE models seen here. CLM and JSBACH did not reproduce the decrease of burned area with increasing number of wet 10 days (Figure 4 b).
Regionally, sensitivities to maximum temperature were positive over most land areas in satellite-derived RF experiments

Socioeconomics
The satellite-derived global relationships showed that burned area increased exponentially as population density decreased at very low values (< 20 persons km -2 ) and, generally, showed no sensitivity when population density was > 40 persons km -2 (Figure 4 d, Figure S 13 a). Regionally, the satellite-derived sensitivity to population density varied with vegetation type. It 10 was negative in most grassland and savannah ecosystems but positive in infrequently burning forested ecosystems (Figure 6 g). Burned area exponentially increased at very low gross domestic product per capita ( Figure S 13 b). The relationship between burned area and cropland area was non-monotonic: all datasets showed a burned area peak at < 5% cropland, minimum burned area at 5-30% cropland cover, and an increasing burned area at > 30% cropland cover ( Figure S Figure 6 h-i). FireMIP model sensitivities to cropland cover showed large differences in comparison to satellitederived sensitivities ( Figure S 16 g). Only LPJG_SIMF reached a comparable correlation (r = 0.41) to the satellite-derived sensitivity because its internal formulation reduces burned area with increasing cropland cover, it however does not simulate 25 crop fires. These large differences in the sensitivities of burned area to socioeconomic variables demonstrate that fire-enabled DGVMs mostly disagree on how human effects on fire should be represented.

Land cover, vegetation productivity, and biomass
The satellite-derived global relationships to vegetation-related predictor variables showed that burned area increased with increasing herbaceous vegetation cover (Figure 4e), with precedent 6-monthly GPP (Figure 4 f), with precedent 6-monthly 30 LAI ( Figure S 14 b), and with woody litter (Figure S 14 h). The satellite-derived relationships were for most land cover types and for vegetation carbon moderately to highly correlated ( Figure S 15 f-o). Regionally, the satellite-derived relationship to herbaceous vegetation cover was positive in most ecosystems but negative in agricultural areas in Europe, India, Eastern Asia, and North America (Figure 6 j). The regional sensitivity to precedent 6-monthly GPP was strongly positive in most semi-arid regions (Figure 6 m). These relationships reflect the importance of plant productivity and fuel production for burned area.
Burned area decreased with increasing actual-month LAI (Figure S 14 a-d), reflecting the fact that fires usually do not occur during the wet season when LAI is high in semi-arid regions. Globally, burned area showed a bimodal sensitivity to grass height and litter depth ( Figure S 14 f-g). In summary, the satellite-derived sensitivities demonstrate a strong global dependence 5 of burned area dynamics on vegetation type and coverage, litter fuels, pre-season plant productivity and fuel accumulation.
FireMIP models reproduced the general increase of burned area with increasing herbaceous vegetation (Figure 4 e, Figure 5). However, regional sensitivities to herbaceous cover differed among models (Figure S 25). The satellite-derived increase of burned area with precedent 6-monthly GPP was reproduced by LPJ_SPITF, ORCHIDEE, JSBACH, and JULES (r > 0.6) while LPJG_SIMF had a reverse relationship (Figure 4 f). However, the FireMIP models underestimated the regional sensitivity to 10 precedent 6-monthly GPP especially in most fire-prone semi-arid regions such as African savannahs, Australia, the Mediterranean, and temperate steppes ( Figure 6 n-o) but patterns strongly differed among models (Figure S 27).

Discussion and Conclusions
In summary, fire-enabled DGVMs showed the best correlations with monthly observed burned area in some Savannah regions in Africa and South America. However, models generally underestimated the variance of burned area in most fire-prone semi-15 arid ecosystems and over-estimated the variance in temperate regions. By using the RF machine learning algorithm, we were able to diagnose reasons for these differences between data and models: Fire-enabled DGVMs largely reproduced data-derived relationships and sensitivities between burned area and climate variables. However, models showed very different relationships with socioeconomic variables and generally underestimated sensitivities to pre-season plant productivity in all semi-arid ecosystems. As a consequence, these results point towards fuel properties and fuel dynamics, and human-fire interactions as 20 components of fire-enabled DGVMs that should be in the focus of future model development. In the following, we will discuss methodological aspects of our applied pattern-oriented model evaluation approach (4.1), discuss controls on fire in data and models (4.2), and finally provide suggestions on how to improve fire-enabled DGVMs by using current Earth observation datasets (4.3).

Pattern-oriented evaluation of DGVMs using machine learning 25
Simply speaking, simulations of fire (e.g. burned area) in DGVMs can be wrong because #1 the vegetation model simulates wrong vegetation distributions, plant productivity, and hence fuels, or #2 because the fire module misrepresents the response of fire to weather, humans, or fuel properties. Classical model benchmarking uses, for example, maps of burned area, biomass, and tree cover to quantify the model-data mismatch between these variables (Kelley et al., 2013;Schaphoff et al., 2018). However, classical model benchmarking does not allow to disentangle the individual effects of the vegetation or fire module 30 on the simulated burned area because errors in the simulated vegetation might be caused by errors in burned area and vice versa. Because we use the same climate forcing, and vegetation state variables derived from each model in our machine learning approach, we are able to evaluate the response of fire models independent from their underlying DGVMs. This allows us to derive (as partial dependencies or individual conditional expectations) and evaluate the relationships between predictors and response for each fire module separately. Hence we are able to attribute deficiencies in fire-enabled DGVMs to humanand productivity-influences on fire. Previously, a similar approach used also a tree-based machine learning algorithm to 5 evaluate drivers of soil carbon stocks in observational databases and in Earth System Models (Hashimoto et al., 2017). Unlike classical model benchmarking, such pattern-oriented model evaluation approaches help to diagnose the reasons for model-data mismatches.
The core of our pattern-oriented model evaluation is the application of a machine learning algorithm to learn emergent relationships from data or models. We used the random forest algorithm because this algorithm has been previously used to 10 identify drivers of burned area (Aldersley et al., 2011;Archibald et al., 2009) and does not require any assumptions about the statistical distribution of predictor variables, the shape of relationships, and the interactions between predictor variables unlike algorithms such as generalized additive/linear models (Bistinas et al., 2014;Forkel et al., 2017;Krawchuk et al., 2009). Other flexible algorithms such as maximum entropy have been used as well in empirical fire modelling (Moritz et al., 2012;Parisien et al., 2016) with very similar prediction performance and importance of variables compared to random forest (Arpaci et al., 15 2014). In addition, the emergent relationships between predictors and burned area that we identified here show the same directions like the relationships that have been found in a previous study based on generalized linear models (Bistinas et al., 2014). These findings suggest that the choice of the machine learning algorithm only marginally affects the direction and overall shape of the derived relationships.

Controls on burned area 20
Following previous studies, we found that climate is the primary control of global burned area which affects fire directly through fire weather and fuel moisture conditions, and indirectly through ecosystem productivity, vegetation type, and fuel loads (Archibald et al., 2013;Krawchuk and Moritz, 2011). Fire results from an interplay of several meteorological variables, thereby maximum temperature was an important predictor globally and especially in northern temperate and boreal ecosystems. Fire-enabled DGVMs generally reproduced the relationships with maximum temperature but on average 25 overestimated the sensitivity in grassland and Savannah ecosystems. Relationships and sensitivities with the number of wet days showed larger differences among models and in comparison to satellite-derived relationships, suggesting that climate effects on fuel moisture need to be improved in fire-enabled DGVMs.
As an indirect climate effect, we found that previous season plant productivity was among the most important predictor variables globally and was the dominant predictor with the strongest sensitivity to burned area in semi-arid savannah regions. 30 It has been long recognized that the occurrence and development of fires is affected by the production and accumulation of fuels (Krawchuk and Moritz, 2011;Pausas and Ribeiro, 2013). Plant productivity in fire-prone semi-arid ecosystems has a high year-to-year variability (Ahlström et al., 2015). Our results demonstrate that the inter-annual variability in productivity and hence fuel accumulation is an important driver for the variability in burned area. Most fire-enabled DGVMs poorly captured the importance, relationship, and sensitivity of previous-season plant productivity on burned area. This may be a reason why they underestimate observed variability in burned area and it might be one reason why they misrepresent trends in fire occurrence in Africa and globally (Andela et al., 2017).
While climate and fuel controls when and where fires can burn, humans are on the one hand responsible for the majority of 5 fire ignitions and on the other hand suppress fire. We found a strong decline of burned area with increasing population density between 0 and 20 person km -2 which confirms previous findings (Bistinas et al., 2014;Knorr et al., 2014). Human effects on fire emerge from various activities such as from traditional land use practices (shifting cultivation, hunting, grazing, and grassland burning); the use of fires for land clearing or as tool in land conflicts; from prescribed small fires within fire management; and from unintended or illegal ignitions (Archibald, 2016;Bowman et al., 2011;Lauk and Erb, 2009;Marle et 10 al., 2017). The modest performance of random forest in reproducing satellite burned area suggests that we did not capture the complexity of human-fire interactions with the used set of predictor variables. For example, the complex non-monotonic relationship between burned and cropland cover suggests that agricultural land use has diverging effects on fire in different agricultural regions of the world (Figure S 13 c) (Korontzi et al., 2006). However, alternative variables such as cattle density or the night light-based index of socio-economic development were highly correlated with population density or cropland 15 cover at the coarse resolution of our analysis and did therefore not add to prediction performance of random forest. At regional scales, land use or infrastructure-related variables are important predictors for fire (Archibald et al., 2009;Arpaci et al., 2014;Chuvieco and Justice, 2010;Parisien et al., 2010). However, these regional findings also show that the importance of humanrelated predictors largely differs between regions, which complicates its applicability for global-scale fire modelling. However, random forest largely emulated the simulated burned area from FireMIP models, which suggests that we indeed included the 20 main predictors for the model world. Although some newer global fire models include effects of cropland and pasture management on fires (Rabin et al., 2018), the complexity of human-fire interactions lacks currently a solid and large-scale empirical basis that would allow to derive alternative formulations on human-fire interactions for fire-enabled DGVMs.

Improving vegetation controls on fire in DGVMs
Our results demonstrate that the role of vegetation on fire needs to be better represented in fire-enabled DGVMs to accurately 25 simulate the variability of burned area. The links between vegetation productivity, fuel production, and fire need to be improved. Fuel production depends on plant productivity, and on the allocation, turnover, and respiration processes of carbon in different fuel types. As a first step for model improvement, fire-enabled DGVMs need to be tested and possibly re-calibrated against observations or observation-based estimates of plant productivity, above-ground biomass, and carbon turnover Thurner et al., 2016Thurner et al., , 2017. Beyond total above-ground biomass, the evaluation of different fuel types 30 (e.g. biomass in wood, canopy and understory, and litter size classes) is currently hampered by the availability of data. Only a few in-situ measurements of fuel loads exists (van Leeuwen et al., 2014) and global maps of fuel properties are based on spatial extrapolations including various assumptions and uncertainties (Pettinari and Chuvieco, 2016). As an alternative, hybrid data/model-based approaches such as land carbon cycle data assimilation systems (Bloom et al., 2016) may provide consistent information to benchmark vegetation productivity, turnover, and litter fuel dynamics in DGVMs.
Fire largely depends on the vegetation type (Rogers et al., 2015). Also our results show consistent land cover-specific relationships to burned area in satellite data, but these relationships differed among FireMIP models and in comparison to the satellite-derived relationships (Figure S 17). Vegetation types and associated morphological, biochemical, and structural 5 characteristics of plants affect the flammability and fire tolerance of vegetation .
Although global fire models have PFT-specific parametrisations for flammability (Thonicke et al., 2010), such fire-relevant plant characteristics need to be incorporated in DGVMs (Zylstra et al., 2016). Such efforts need to be complemented by calibrating DGVMs against satellite observations that provide relevant information about the spatial distributions of fuel structure (Pettinari and Chuvieco, 2016;Riaño et al., 2002), fuel moisture content (Yebra et al., 2013(Yebra et al., , 2018, fire ignitions 10 and spread (Laurent et al., 2018), fuel consumption (Andela et al., 2016), and fire radiative energy (Kaiser et al., 2012). In summary, besides human-fire interactions, we identified vegetation effects on fire as a main deficiency of fire-enabled dynamic global vegetation models in simulating temporal dynamics of burned area.

Data availability
Data is available from the references as indicated in Table A1. 15

Code availability
This analysis is based on R (version 3.3.2) by using the packages randomForest (version 4.6-12) and ICEbox (version 1.1.2). R and the packages are available from the Comprehensive R Archive Network (CRAN, https://cran.r-project.org/).     and the most important variable is coloured in red and denoted with 1. White fields denote predictor variables that were not used in the respective RF experiment. The importance of variables depends on the burned area dataset that was used to train RF and differs between t he "full" and "fm" set of predictor variables.

Deleted:
SI 6: Regional sensitivities Figure S 18: Regional sensitivities of the partial fractional burned area per month to monthly maximum temperature from satellitederived "fm" RF models.

5
Figure S 19: Regional sensitivities of the partial fractional burned area per month to monthly maximum temperature from modelderived "fm" RF models.

5
Figure S 20: Regional sensitivities of the partial fractional burned area per month to the monthly number of wet days from satellitederived "fm" RF models.
Figure S 21: Regional sensitivities of the partial fractional burned area per month to the monthly number of wet days from modelderived "fm" RF models.

5
Figure S 22: Regional sensitivities of the partial fractional burned area per month to the population density from satellite-derived "fm" RF models.
Figure S 23: Regional sensitivities of the partial fractional burned area per month to the population density from model-derived "fm" RF models.

5
Figure S 24: Regional sensitivities of the partial fractional burned area per month to the herbaceous vegetation cover from satellitederived "fm" RF models. Figure S 25: Regional sensitivities of the partial fractional burned area per month to the herbaceous vegetation cover from modelderived "fm" RF models.

5
Figure S 26: Regional sensitivities of the partial fractional burned area per month to precedent 6-monthly GPP from satellite-derived "fm" RF models.
Figure S 27: Regional sensitivities of the partial fractional burned area per month to precedent 6-monthly GPP from model-derived "fm" RF models.