In this study, high-resolution bathymetric multibeam and optical image data,
both obtained within the Belgian manganese (Mn) nodule mining license area by
the autonomous underwater vehicle (AUV)

High-resolution quantitative predictive mapping of the distribution and
abundance of manganese nodules (Mn nodules) is of interest for both the
deep-sea mining industry and scientific fields such as marine geology,
geochemistry, and ecology. The distribution and abundance of Mn nodules are
affected by several factors such as local bathymetry (Craig, 1979; Kodagali,
1988; Kodagali and Sudhakar, 1993; Sharma and Kodagali, 1993),
sedimentation rate (Glasby, 1976; Frazer and Fisk, 1981; von Stackelberg and
Beiersdorf, 1991; Skornyakova and Murdmaa, 1992), availability of nucleus
material (Glasby, 1973), and bottom current strength (Frazer and Fisk, 1981;
Skornyakova and Murdmaa, 1992). As a consequence, the distribution and
abundance of Mn nodules are heterogeneous (Craig, 1979; Frazer and Fisk, 1981;
Kodagali, 1988; Kodagali and Sudhakar, 1993; Kodagali and Chakraborty, 1999;
Kuhn et al., 2011), even on fine scales of 10 to 1000 m (Peukert et al.,
2018a; Alevizos et al., 2018). This increases the difficulty of quantitative
predictive mapping using remote-sensing methods. Vast areas of the seafloor
can be mapped by ship-mounted, multibeam echo-sounder systems (MBESs).
State-of-the-art MBESs feature a low frequency (12 kHz) and can map
ca. 300 km

Schematic workflow of the data sets used in this study to enable the spatial assessment of Mn nodules inside the study area. The medium resolution of AUV MBES (meter scale) refers to the comparison of the optical and physical data (centimeter scale).

AUVs have proven their usefulness for multibeam data acquisition in the deep-sea environment (Grasmueck et al., 2006; Deschamps et al., 2007; Haase et al., 2009; Wynn et al., 2014; Clague et al., 2014, 2018; Pierdomenico et al., 2015; Peukert et al., 2018a). They achieve higher spatial and vertical resolution than ship-mounted MBESs. This is due to their operation close to the seafloor, which results in a smaller footprint at a given beam angle and enables the use of higher frequencies (Henthorn et al., 2006; Mayer, 2006; Caress et al., 2008; Paduan et al., 2009). Additionally, AUVs avoid problems like near-surface turbulence, bubbles, ship noise, and strong sound velocity changes (Kleinrock et al., 1992a, b; Jakobson et al., 2016; Paul et al., 2016). They work independently from the surface vessel and operate at a stable altitude. AUVs can efficiently conduct a dive pattern of dense survey lines and thus reduce survey effort and costs (Chance et al., 2000; Bellingham, 2001; Bingham et al., 2002; Danson, 2003; Roman and Mather, 2010). High-resolution bathymetry enables computing bathymetric derivatives like slope and rugosity with a similarly high resolution. These derivatives play an important role in predicting Mn nodules' distribution and abundance (Craig, 1979; Kodagali, 1988; Skornyakova and Murdmaa, 1992; Kodagali and Sudhakar, 1993; Sharma and Kodagali, 1993; Ko et al., 2006). However, only a small number of recent studies have investigated this role on an AUV scale (Okazaki and Tsune, 2013; Peukert et al., 2018a; Alevizos et al., 2018).

Underwater optical data have generally played an important role in the qualitative analysis of the seafloor features and for the specific task of assessing Mn nodules' distribution explicitly (Glasby, 1973; Rogers, 1987; Skornyakova and Murdmaa, 1992; Sharma et al., 1993). The development of automated detection algorithms enabled quantitative optical image data analysis and subsequent statistical interpretation of Mn-nodule densities. The spatial coverage of optical imaging is much higher than for box core sampling. The data resolution remains high enough to reveal the high variance in the spatial distribution of nodules at meter scale. Thus optical data can fill the investigation gap between ground-truth sampling and hydroacoustic remote sensing (Sharma et al., 2010, 2013; Schoening et al., 2012a, 2014, 2015, 2016, 2017a; Kuhn and Rathke, 2017). Moreover, mosaicking of optical data could reveal mining obstacles such as outcropping basements or volcanic pillow lava flows. In addition, seafloor photos are the source for evaluating benthic fauna occurrences and related habitats on a wider area (Schoening et al., 2012b; Durden et al., 2016).

Box coring is commonly used to obtain physical samples of Mn nodules and sediments
for resource assessments and biological studies. While optical data reveal
only the exposed and semi-buried Mn nodules, box corers collect the top
30–50 cm of the seafloor with minimum disturbance, allowing an accurate
measure of the Mn nodules' abundance (kg m

Random forests (RF) is an ensemble machine learning (ML) method composed of multiple weaker learners, namely classification or regression trees (Breiman, 2001a). Within RF, an ensemble of distinct tree models is trained using a random subsample of the training data for each tree until a maximum tree size is reached. In each tree, each node is split using the best among a subset of predictors randomly chosen at that node instead of using the best split among all variables (Liaw and Wiener, 2002). Thus, the process is doubly randomized, which further reduces the correlation between trees. For each tree, about two-thirds of the training data are used to grow it, while the remaining “out-of-bag” (OOB) samples are used for internal validation. By aggregating the predictions of all trees (majority votes for classification, the average for regression), new values can be predicted. This aggregation keeps the bias low while it reduces the variance, resulting in a more powerful and accurate model. RFs can estimate the importance of each predictor variable, which enables data mining of the high-dimensional prediction data. Terrestrial studies use RFs in prospectivity mapping of mineral deposits (Carranza and Laborte, 2015a, b, 2016; Rodriguez-Galiano et al., 2014, 2015). In the marine environment, RFs have been used to combine MBES bathymetry, backscatter, their derivatives, sediment sampling, and optical data for various seabed classification and regression tasks (e.g., Li et al., 2010, 2011a; Che Hasan et al., 2014; Huang et al., 2014). Further studies showed the robustness of RFs for selected data sets compared to other ML algorithms (Che Hasan et al., 2012; Stephens and Diesing, 2014; Diesing and Stephens, 2015; Herkül et al., 2017), as well as to geostatistical and deterministic interpolation methods (Li et al., 2010, 2011a, b; Diesing et al., 2014).
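
The "about two-thirds" share of in-bag data follows from sampling with replacement: the chance that an observation is never drawn in n draws is (1 − 1/n)^n, which approaches exp(−1) ≈ 0.37. The following minimal Python sketch illustrates this under the assumption that each bootstrap sample has the same size as the training set (a common default, not a setting confirmed by this study); the function name and sample sizes are hypothetical.

```python
import random

def oob_fraction(n_samples, n_trees, seed=0):
    """Average share of observations left out-of-bag across bootstrap samples."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_trees):
        # One bootstrap sample per tree: n_samples draws with replacement.
        in_bag = {rng.randrange(n_samples) for _ in range(n_samples)}
        total += 1.0 - len(in_bag) / n_samples
    return total / n_trees

frac = oob_fraction(n_samples=2000, n_trees=50)
print(f"mean OOB fraction: {frac:.3f}")  # expected to be close to exp(-1)
```

Because every tree leaves out a different third of the data, the OOB error can serve as an internal validation without a separate hold-out set.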

The study area lies in the Clarion–Clipperton Zone (CCZ; ca.

The data (Greinert, 2016) were collected in March 2015 during cruise SO239
EcoResponse (Martínez Arbizu and Haeckel, 2015) with the German research
vessel ^{™} software. The largest
uncertainties during AUV operations result from inaccurate navigation and
localization in the deep-sea environment (Paull et al., 2014). AUV

The bathymetric derivatives computed in SAGA GIS and used as predictor variables.

High-resolution optical data (20.2 megapixels) were acquired by the
DeepSurveyCamera system on board AUV

The data exploration, spatial plotting, and analysis were performed with
ArcMap™ 10.1, PAST v3.19 (Hammer et al.,
2001), and R (R Development Core Team, 2008). All data were projected to the UTM
Zone 10N coordinate system to enable spatial analysis. The existence of
spatial autocorrelation in the distribution of Mn nodules m^{™} 10.1 (for parameter settings see
Appendix A). One decimal was retained in the presentation of the results from
statistical analysis and RF modeling.
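
As an illustration of the spatial autocorrelation check, the sketch below computes the global Moran's I in its standard form, I = (n / S0) · Σ w_ij z_i z_j / Σ z_i², using the settings named in Appendix A (inverse Euclidean distance weights, a distance threshold, row-standardized weights). It is a minimal Python sketch and not the ArcMap implementation; the function name, the toy transect, and its values are hypothetical.

```python
import math

def global_morans_i(points, values, threshold=50.0):
    """Global Moran's I with inverse-distance, row-standardized weights.

    points: list of (x, y) coordinates in meters (projected, e.g. UTM).
    values: attribute per point (e.g. nodules per square meter).
    threshold: neighbors farther apart than this get zero weight.
    """
    n = len(values)
    mean = sum(values) / n
    z = [v - mean for v in values]  # deviations from the mean
    denom = sum(zi * zi for zi in z)
    if denom == 0:
        raise ValueError("values are constant; Moran's I is undefined")

    num = 0.0  # sum of w_ij * z_i * z_j
    s0 = 0.0   # sum of all weights
    for i in range(n):
        # Inverse-distance weights to all neighbors within the threshold.
        row = {}
        for j in range(n):
            if i == j:
                continue
            d = math.dist(points[i], points[j])
            if 0.0 < d <= threshold:
                row[j] = 1.0 / d
        row_sum = sum(row.values())
        for j, w in row.items():
            w_std = w / row_sum  # row standardization
            num += w_std * z[i] * z[j]
            s0 += w_std
    if s0 == 0:
        raise ValueError("no point has a neighbor within the threshold")
    return (n / s0) * num / denom

# Toy transect: 10 points spaced 10 m apart (hypothetical values).
pts = [(10.0 * k, 0.0) for k in range(10)]
clustered = [1, 1, 1, 1, 1, 9, 9, 9, 9, 9]
alternating = [1, 9, 1, 9, 1, 9, 1, 9, 1, 9]
print(global_morans_i(pts, clustered, threshold=15.0))    # positive: similar values cluster
print(global_morans_i(pts, alternating, threshold=15.0))  # negative: neighbors differ
```

Positive values indicate that similar nodule densities tend to lie close together, values near zero indicate spatial randomness, and negative values indicate dispersion.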

A total of five box corers (0.5 m

The number of Mn nodules on the sediment surface, the total number of Mn nodules per box core, the ratio of those two values, and the distance of the box corer deployments from the study area in block G77.

The RF modeling was performed with the Marine Geospatial Ecology Tools (MGET)
toolbox in ArcMap™ 10.1. MGET (Roberts et al.,
2010) uses the randomForest R package for classification and
regression (Liaw and Wiener, 2002). Our target variable (number of Mn nodules
m

As the optimal RF model was applied to the entire block G77, an estimate of
the abundance (kg m

The analysis of AUV photos with the CoMoNoD algorithm (Schoening et al.,
2017a) revealed a rather heterogeneous pattern of Mn nodules m

Spatial analyses revealed the presence of a spatial autocorrelation in the
distribution of Mn nodules m

Number and percentage of samples in each type of spatial clustering.

The application of the LMI reveals a bias that exists in the data due to the
sampling procedure, especially in the subarea b (Fig. 5b). Here, the presence
of the slope around 2.8

Scatterplot of the AUV altitude (m) and the estimated number of
Mn nodules m

Adjacent AUV photos from consecutive dive tracks that were obtained inside subarea b from

The results of the modeling procedure demonstrate that the RF algorithm is
influenced by the size of the training sample (Fig. 11a). This finding is in
accordance with other studies, in which larger training samples tended to
increase the performance of RF (Li et al., 2010, 2011b; Millard and
Richardson, 2015). The inclusion of a more representative range of the
observed values, and consequently a larger spectrum of the causal underlying
relationships, assists the RF to build a better model for the prediction of
the value distribution inside the study area. For our data, the decrement in error
becomes smaller as the size of the training sample increases further; it
reaches a minimum value of 0.2 between 80 % and 90 %, showing that
the additional 10 % does not notably benefit the RF model. However, the
absence of stabilization of the error to a minimum value indicates that more
optical data are needed from this block. The small decrement in error between
80 % and 90 % was the decisive factor to select 80 % of the data
as training samples (also considering the larger number of remaining
validation data and the reduced computational effort). Based on this data
set, the examination of different numbers of trees showed that the RF error
remains constant after 600 trees (Fig. 11b). Fewer trees result in a larger
error; this becomes particularly evident with fewer than 300 trees. With more
than 300 trees the range of the error is reduced (Appendix B). A higher
number of trees enables higher

Based on the abovementioned findings, the optimal RF regression model, which
uses 80 % of the data for training, 600 trees, and 6 predictor variables
randomly selected at each node, was chosen and applied to the entire block
G77. The comparison of the predicted values with the observed values from the
remaining 20 % (2255 observations) of validation data showed a good
predictive performance (Table 4). Analytically, MAE and RMSE have very low
values,

The values of validation measures between predicted and observed data.

The scatterplot and box plot (Fig. 12a and b) illustrate this good match between predicted and observed values, as also confirmed by the descriptive statistics (Table 5). The residual analysis further confirmed the robustness of the model (Appendix B).
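
For reference, the MAE and RMSE used as validation measures follow their standard definitions. The minimal Python sketch below shows both; the observed and predicted numbers are hypothetical and are not taken from Table 4, and any other measures reported there are not reproduced here.

```python
import math

def mae(observed, predicted):
    """Mean absolute error between paired observed and predicted values."""
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)

def rmse(observed, predicted):
    """Root mean square error; penalizes large deviations more than MAE."""
    squared = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    return math.sqrt(squared / len(observed))

# Hypothetical nodule counts per square meter for three validation photos.
observed = [30.0, 40.0, 50.0]
predicted = [32.0, 38.0, 50.0]
print(mae(observed, predicted), rmse(observed, predicted))
```

Both measures are in the units of the target variable, so they can be compared directly with the observed range of nodule counts.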

Comparison between observed and predicted values: scatterplot

Descriptive statistics of observed and predicted values.

The statistical analysis also reveals a limitation of the RF model, which cannot predict beyond the range of the training values. It underestimates the maximum values and overestimates the minimum values (Fig. 12b and Table 5), a limitation also mentioned by other authors (e.g., Horning, 2010). This happens because, in regression RF, the result is the average of the predictions of all trees (Breiman, 2001a).
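
The bounded prediction range can be made explicit with a short derivation. The notation is introduced here for illustration only: B is the number of trees, T_b(x) the prediction of tree b at location x, L_b(x) the set of training observations in the leaf of tree b that contains x, and y_i the observed training values.

```latex
\hat{y}(x) = \frac{1}{B}\sum_{b=1}^{B} T_b(x), \qquad
T_b(x) = \frac{1}{|L_b(x)|}\sum_{i \in L_b(x)} y_i
\quad\Longrightarrow\quad
\min_i y_i \;\le\; \hat{y}(x) \;\le\; \max_i y_i .
```

Since every tree returns a mean of training values and the forest averages those means, a prediction can never fall outside the observed training range, and extreme values are pulled toward the center.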

The final application of the RF model for the entire block G77 predicts that
the majority of the area is covered by 30–45 Mn nodules m

The RF-predicted distribution of Mn nodules m

The analysis of the RF variable importance showed that the best explanatory
variable for the distribution of Mn nodules m

The predicted Mn-nodule distribution was combined with the abundance from box
corer data (and corrected with the ratio of buried to unburied Mn nodules, in
order to include the top

The total abundance of Mn nodules from the surface and embedded in
the sediment (max 15 cm), in areas with slope

The estimated amount of metal mass for five metals, based on the average values of metal content inside CCZ and a five-metal HCL-leach recovery method (Volkmann, 2015).

We present a case study that highlights the applicability of the combination
of AUV bathymetric and optical data for Mn-nodule resource modeling using RF
machine learning. The use of AUVs for collecting hydroacoustic and optical
data in areas of scientific and commercial interest can provide more precise
bathymetric and Mn-nodule distribution maps. Regarding the bathymetric maps,
the accurate and detailed reconstruction of the seafloor bathymetry at
meter-scale resolution enables the use of bathymetry and its derivatives as
source data layers within a high-resolution RF model. These data should be of
high quality, as the presence of acquisition artefacts may
affect the robustness of the modeling procedure (Preston, 2009; Herkül et
al., 2017). The combined use of cameras such as the DeepSurveyCamera
(Kwasnitschka et al., 2016) for
acquiring high-resolution photographs and an automated analysis with a
state-of-the-art algorithm (Schoening et al., 2017a) provides essential
quantitative information about the distribution of Mn nodules. Image analysis
results are more robust for constant AUV altitudes (7–9 m) above flat areas
(< 3

Inside block G77, the number of Mn nodules m

This study did not consider the geochemical properties of the sediments as input data in the modeling process, which might give additional clues as to why Mn nodules are distributed as they are. However, RF importance and partial dependence plots show that bathymetric and topographic factors tend to affect this distribution in a nonlinear way, with the bulk of the data plotting in specific ranges of the bathymetric derivatives. Classic studies have shown that the bathymetry and the variation in the topographic characteristics of the seafloor affect the sediment deposition environment and bottom currents and thus also geochemical processes in the sediment. All these factors determine Mn-nodule growth and thus affect the distribution of Mn nodules on regional scales (e.g., Craig, 1979; Sharma and Kodagali, 1993). It is still unknown how these properties influence the Mn-nodule distribution on scales of meters to tens of meters, as seen in our AUV data. The nonlinear relationship between Mn nodules and bathymetry on such high-resolution scales has only recently begun to be investigated (e.g., Peukert et al., 2018a; Alevizos et al., 2018). To elaborate more on the hydrodynamic and geochemical reasons behind the observed distribution pattern, we would need more investigations at and in the sediment on the same scale.

It should be acknowledged that the aim of any ML predictive model is to
derive accurate predictions based on an existing (large) number of
measurements to capture a complex underlying relationship (e.g., nonlinear
and multivariate) between different types of data, for which our theoretical
knowledge or conceptual understanding is still under development (Schmueli,
2010; Lary et al., 2016). Especially due to the constantly increasing size of
scientific multivariate data in marine sciences and the existence of such
nonlinear relationships between predictor and response variables (e.g., Zhi
et al., 2014; Li et al., 2017), ML and RF are considered important analytic
tools that can objectively reveal patterns of an (unknown) phenomenon (Genuer
et al., 2017; Kavenski et al., 2009; Lary et al., 2016). Such predictions may
be used to derive causalities or may drive the creation of new hypotheses. In
other words, for a predictive model, the “unguided” data analyses come
first and the interpretation follows (Breiman, 2001b; Schmueli, 2010;
Obermeyer and Emanuel, 2016). This “a priori” knowledge of the distribution
of the Mn-nodule number and size on such a scale can contribute to the
biological data survey planning, too. Recent studies showed that the
abundance and species richness of nodule fauna inside the CCZ is affected by
the abundance of Mn nodules (Amon et al., 2016; Vanreusel et al., 2016) as
well as their size (Veillette et al., 2007). Thus, high-priority areas (e.g.,
those with the highest commercial interest) can be targeted for sampling based on
the results of optical data and RF modeling. The RF modeling takes advantage
of the multilayer information (here: hydroacoustic and optical data), handling their complex relationships effectively while being resistant to
overfitting (Breiman, 2001a). Moreover, the randomization of the input
training points in each tree in each run results in a completely different
training data set each time with mixed points from the entire study area. This
random selection and mixing of points is appropriate for clustered data, as
it ignores their spatial locations and consequently limits the influence of
spatial autocorrelation (Appendix B). Along these lines, several authors have
included the values of latitude/longitude and even the LMI values as
predictor variables in order to increase the model performance (e.g., Li,
2013; Li et al., 2011b, 2013). RF is highly operational owing to its
relatively simple calibration, which does not require extensive data
preparation/transformation or geostatistical assumptions (e.g.,
stationarity). The selection of the MGET toolbox (Roberts et al., 2010) further increased the simplicity of the workflow, as the RF modeling was
performed entirely inside a graphic environment familiar to many
geoscientists. As RF model runs can be implemented inside various software
packages in future implementations of this workflow, it would be interesting
to include the uncertainty for the associated predictions, e.g., with the use
of the quantile regression forests (Meinshausen, 2006) from the
quantregForest R package (Meinshausen, 2012). However, this will
increase the computational time (Tung et al., 2014) and reduce the simplicity of the
procedure, especially if other recently proposed methodologies of
estimating the uncertainty are used: the jackknife method (Wager et al., 2014), the
Monte Carlo approach (Coulston et al., 2016), and the

Similarly to other studies (e.g., Cutler et al., 2007; Millard and Richardson,
2015), RF showed increased stability in its performance, allowing a small
number of iterations to compute sufficient results. The examination of the
main two tuning parameters (

Finally, the resource assessment showed that block G77 is a potential mining
area with high average Mn-nodule density and gentle slopes. While the
threshold of 3

The results of this study show that the acquisition and analysis of optical seafloor data can provide quantitative information on the distribution of Mn nodules. This information can be combined with AUV-based MBES data using RF machine learning to compute predictions of Mn-nodule occurrence on small operational scales. Linking such spatial predictions with sampling-based physical Mn-nodule data provides an efficient and effective tool for mapping Mn-nodule abundance.

The data used in this work are available at PANGAEA. This includes MBES ship-based data (Greinert, 2016), optical imagery (Greinert et al., 2017; Schoening, 2017c), and the source code of the CoMoNoD algorithm (Schoening, 2017b). The MBES AUV-based data are not publicly available due to the confidentiality of coordinates.

The calculation of the bathymetric derivatives was performed with the SAGA
GIS v6.3.0 Morphometry library
(

Global Moran's I and local Moran's I were computed with the
ArcMap™ 10.1 software, using the Spatial
Statistics toolbox, according to the equations provided. As a null
hypothesis, it is assumed that the examined attribute is randomly distributed
among the features in the study area. For the optimal conceptualization of
spatial relationships, the inverse Euclidean distance method was selected, as
it is appropriate for modeling processes with continuous data in which the
closer two samples are in space, the more likely they are to
interact/influence each other or have been influenced for the same reasons.
The distance threshold was set at 50 m, and the increment analysis was
performed with a step of 50 m. Moreover, the spatial weights were
standardized in order to minimize any bias that exists due to sampling design
(uneven number of neighbors). Apart from the index value, the

Spearman's correlation coefficient for each pair of predictor variables.

Descriptive statistics of different training samples.

Correlation among the derivatives was checked with Spearman's correlation
coefficient (

The nine training samples with different sizes were created by the MGET tool “Randomly Split Table into Training and Testing Records”. The spatial randomness of the procedure, combined with the many available data, resulted in training samples with similar descriptive statistics.
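
The random split described above can be sketched in a few lines of Python. This is not the MGET implementation; the function name is hypothetical, and the training fractions of 10 % to 90 % in 10 % steps are an assumption, since the text only states that nine sizes were used.

```python
import random

def split_records(records, train_fraction, seed=None):
    """Randomly split records into (training, testing) lists."""
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be between 0 and 1")
    rng = random.Random(seed)
    shuffled = list(records)   # copy so the input order is left untouched
    rng.shuffle(shuffled)      # ignores spatial location by design
    n_train = round(len(shuffled) * train_fraction)
    return shuffled[:n_train], shuffled[n_train:]

# Hypothetical table of 1000 photo records, split at nine assumed fractions.
records = list(range(1000))
for pct in range(10, 100, 10):
    train, test = split_records(records, pct / 100, seed=pct)
    print(pct, len(train), len(test))
```

Because the shuffle draws from the whole table, each training set mixes points from across the study area, which is why the descriptive statistics of the samples end up similar.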

The descriptive statistics of the performance of each model were used as decision factors for the number of iterations (Tables B1–B5). In all cases, the mean value with very low standard error, very low standard deviation, range, and the 95 % confidence interval indicate a rather stable performance, without the need for further iterations.
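
The stability check over repeated runs can be sketched as follows. This minimal Python example summarizes the error of several iterations with the statistics named above; the function name and the input values are hypothetical, and the 95 % confidence interval uses a normal approximation (z = 1.96), which is an assumption and may differ from the method used for Tables B1–B5.

```python
import math
import statistics

def summarize_runs(errors, z=1.96):
    """Mean, standard deviation, standard error, range, and approximate 95 % CI."""
    n = len(errors)
    mean = statistics.fmean(errors)
    sd = statistics.stdev(errors)  # sample standard deviation
    se = sd / math.sqrt(n)         # standard error of the mean
    return {
        "mean": mean,
        "sd": sd,
        "se": se,
        "range": max(errors) - min(errors),
        "ci95": (mean - z * se, mean + z * se),  # normal approximation
    }

# Hypothetical error values from five repeated model runs.
print(summarize_runs([1.0, 2.0, 3.0, 4.0, 5.0]))
```

A narrow interval and a small range across runs suggest that additional iterations would change the mean error only marginally.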

Descriptive statistics of MSR from different training set sizes, after 10 iterations with default settings.

Descriptive statistics of MSR from a different number of the

Descriptive statistics of MSR from different number of the

Descriptive statistics of MSR for the optimum selected RF
model, after 30 iterations with 80 % of the sample as
training data,

Descriptive statistics of RF importance for the optimum RF
model, after 30 iterations with 80 % of the sample as training data,

The histogram of Mn nodules m

The descriptive statistics of the number of
Mn nodules m

The potential linear correlation
between depth, bathymetric derivatives, and the number of Mn nodules m

The Spearman's rank correlation coefficient between
Mn nodules m

Although RF is a fully nonparametric technique and there is no
need for the residuals to follow specific assumptions (Breiman, 2001a),
examining them can provide an in-depth look at RF
performance characteristics. The scatterplot of residuals against predicted
values shows a random pattern, which is also confirmed by the low values of
Pearson, Spearman, and

Scatterplot between residuals and predicted values.

Pearson, Spearman, and

Main descriptive statistics of residuals and 5 % trimmed residuals.

Residuals range.

The spatial autocorrelation analysis of the residuals using the global
Moran's index (same settings as Appendix A) showed low spatial
autocorrelation (

Spatial plotting of the RF residuals (absolute values). The intervals of their range are in accordance with Table B10.

The RF partial dependence plots show the nonlinear
character between the Mn nodules m

Partial dependence plots for each of the predictor variables. The

IZG processed the MBES and AUV data, performed the RF modeling, the statistical and GIS analysis, and wrote the paper. TS contributed to the survey design with respect to the optic data, developed the CoMoNoD algorithm, and processed the optic data. EA was involved in developing the idea of using RF for modeling and contributed to the GIS analysis. JG contributed to the survey by designing the MBES and the optic data survey planning, acquiring the MBES and the optic data, verifying the analytical methods, and supervising the project. All authors discussed the results, provided critical feedback, and contributed to the final paper.

The authors declare that they have no conflict of interest.

This article is part of the special issue “Assessing environmental impacts of deep-sea mining – revisiting decade-old benthic disturbances in Pacific nodule areas”. It is not associated with a conference.

We thank the captain and crew of RV