Geochemical zones and environmental gradients for soils from the central Transantarctic Mountains, Antarctica

Previous studies have established links between biodiversity and soil geochemistry in the McMurdo Dry Valleys, Antarctica, where environmental gradients are important determinants of soil biodiversity. However, these gradients are not well established in the central Transantarctic Mountains, which are thought to represent some of the least hospitable Antarctic soils. We analyzed 220 samples from 11 ice-free areas along the Shackleton Glacier (∼ 85 S), a major outlet glacier of the East Antarctic Ice Sheet. We established three zones of distinct geochemical gradients near the head of the glacier (upper), its central part (middle), and at the mouth (lower). The upper zone had the highest watersoluble salt concentrations with total salt concentrations exceeding 80 000 μg g−1, while the lower zone had the lowest water-soluble N : P ratios, suggesting that, in addition to other parameters (such as proximity to water and/or ice), the lower zone likely represents the most favorable ecological habitats. Given the strong dependence of geochemistry on geographic parameters, we developed multiple linear regression and random forest models to predict soil geochemical trends given latitude, longitude, elevation, distance from the coast, distance from the glacier, and soil moisture (variables which can be inferred from remote measurements). Confidence in our random forest model predictions was moderately high with R2 values for total water-soluble salts, watersoluble N : P, ClO−4 , and ClO − 3 of 0.81, 0.88, 0.78, and 0.74, respectively. These modeling results can be used to predict geochemical gradients and estimate salt concentrations for other Transantarctic Mountain soils, information that can ultimately be used to better predict distributions of soil biota in this remote region.

. Hot deserts are typically viewed as one of the least biologically diverse environments, but cold deserts can often be even less diverse (Freckman and Virginia, 1998). Soils in Antarctica typically serve as end-members for low habitat suitability due to their high salt concentrations, low organic carbon, low soil moisture, and low mean annual temperatures (Courtright et al., 2001).
In the McMurdo Dry Valleys (MDV), organic matter and salt concentrations influence soil communities, where soils with higher amounts of organic carbon, lower water-soluble N : P ratios, and lower total water-soluble salt concentrations generally harbor the greatest biomass and biodiversity Bottos et al., 2020;Caruso et al., 2019;Magalhães et al., 2012). These Antarctic ecosystems are relatively simple and are among the few known soil systems where nematodes and microarthropods (Collembola, Acari) are at the top of the food chain (Freckman and Virginia, 1998;Hogg and Wall, 2012). Studies of soils in the MDV and Transantarctic Mountains (TAM) have been key to understanding ecosystem structure and function in extreme terrestrial environments (e.g., Caruso et al., 2019;Collins et al., 2019Collins et al., , 2020Convey and McInnes, 2005;Freckman and Virginia, 1998;Hodgson et al., 2010).
Biological processes in Antarctic soils are largely dependent on the availability, duration, and proximity of soils to liquid water . Due to the seasonality of thawing events, liquid water acts as a pulse to the ecosystem, providing water for organisms but also wetting surface soils and dissolving soluble salts (Webster-Brown et al., 2010;Zeglin et al., 2009). Experiments of salt thresholds on Antarctic nematodes found that no individuals survived in highly saline soils over ∼ 2600 mg L −1 total dissolved solids (TDS) (Nkem et al., 2006). Concentrations of soluble salts exist at these concentrations or higher at high elevation and inland locations in the TAM (Bockheim, 2008;Lyons et al., 2016). Additionally, studies on TAM soils have found that increased salt concentrations lead to a decrease in soil biodiversity in older soils compared to younger soils (Magalhães et al., 2012). Yet, despite these inhospitable conditions (e.g., high salt concentrations and glacial advance and retreat), some organisms are postulated to have found suitable refugia in TAM soils and persisted in isolation for millions of years and through glacial cycles (Beet et al., 2016;Collins et al., 2019Collins et al., , 2020Stevens et al., 2006;Stevens and Hogg, 2003).
It is generally accepted that habitat suitability for invertebrate species in Antarctic soils is driven by a combination of geochemical, geographic, hydrologic, and geomorphic variables (Bottos et al., 2020;Courtright et al., 2001;Freckman and Virginia, 1998;Magalhães et al., 2012). Geographic variables, such as elevation, can be measured with advanced mapping tools and satellite imagery; however, surface exposure ages, soil geochemistry and nutrient content require extensive logistical support and resource allocation for sample collection and analysis. A better understanding of the relationship between geographic variables and on-theground measurements is needed to aid in our ability to understand and predict habitat suitability for invertebrates throughout the TAM.
With this study, we determined and evaluated geochemical patterns and gradients of water-soluble ions in soils collected from 11 ice-free areas along the Shackleton Glacier, central Transantarctic Mountains (CTAM). Particular attention was given to total water-soluble salt concentrations, N : P ratios, and ClO − 4 and ClO − 3 concentrations based on their influence on biodiversity as determined in previous studies (e.g., Ball et al., 2018;Barrett et al., 2006;Courtright et al., 2001;Dragone et al., 2020;Nkem et al., 2006). The geochemical data were compared to geographic parameters to understand how the physical environment influences the observed geochemical variability. Our results show that water-soluble ion concentrations and distributions are driven largely by soil geography and surface exposure age. Finally, we implemented statistical and machine learning techniques to interpolate and predict the soil geochemistry across the region using geographic variables. Our multiple linear regression and random forest models show that latitude, longitude, elevation, distance from the coast, distance from the glacier, and soil moisture (all variables currently or soon to be remotely measurable using maps and satellites) are moderately effective at estimating spatial patterns in TAM soil geochemistry with R 2 values as high as 0.87. These data will be particularly useful for ecologists seeking to understand refugia and habitat suitability in Antarctica and similarly harsh desert environments.

Study sites
The Shackleton Glacier (∼ 84.5 to 86.4 • S; ∼ 130 km long and ∼ 10 km wide) is a south-north trending outlet glacier of the East Antarctic Ice Sheet (EAIS) located to the west of the Beardmore Glacier, and it flows through the Queen Maud Mountains (CTAM) into the Ross Sea (Fig. 1). The elevations of exposed soils range from ∼ 150 to >3500 m a.s.l. (above sea level) from the coast towards the Polar Plateau. Long-term climate data are not yet available, but the Shackleton Glacier region is a polar desert regime, similar to the Beardmore Glacier region, with average annual temperatures well below freezing and little precipitation (LaPrade, 1984).
During the Last Glacial Maximum (LGM) and glacial periods throughout the Pleistocene, the size and thickness of the EAIS was likely greater than current levels (Golledge et al., 2013;Nakada and Lambeck, 1988;Talarico et al., 2012;Wilson et al., 2018). Outlet glaciers, such as the Shackleton Glacier, may have had the greatest increases in extent, especially at the glacier terminus (Golledge et al., 2012;Golledge and Levy, 2011). The behavior of local alpine and tributary glaciers is not well-constrained, but these glaciers are also believed to have advanced and retreated over the last 2 million years (Diaz et al., 2020a;Jackson et al., 2018). As a re-Figure 1. Samples were collected and analyzed from the exposed soils along the Shackleton Glacier, a major outlet glacier of the EAIS (a), in three zones. The upper zone (b) was located at the head of Shackleton Glacier, the middle zone (c) was the central portion, and the lower zone (d) was at the terminus of the glacier where it drains into the Ross Sea. Satellite images were provided courtesy of the Polar Geospatial Center (PGC). The Ross Ice Shelf (RIS) is indicated in panel (a).
sult, currently exposed soils were overlain and reworked by fluctuations of the Shackleton Glacier and other tributary and alpine glaciers in the region. Exposure ages range from the early Holocene to the Miocene, and they generally increase with distance from the coast and distance from the glacier (Balter-Kennedy et al., 2020;Diaz et al., 2020a).
The soils contain a range of water-soluble salts derived primarily from atmospheric deposition and chemical weathering (Claridge and Campbell, 1968;Diaz et al., 2020b). The major salts are typically nitrate and sulfate salts, especially at higher elevations and further inland from the coast of the Ross Sea (Diaz et al., 2020b). The solubilities of the salts vary, but nitrate salts are highly soluble, and their occurrence at high elevation and inland locations suggests that those soils have maintained persistent arid conditions.

Sample collection and preparation
During the 2017-2018 austral summer, 220 surface soil samples (∼ top 5 cm) were collected from 11 distinct ice-free ar-eas (Roberts Massif, Schroeder Hill, Mt. Augustana, Bennett Platform, Mt. Heekin, Thanksgiving Valley, Taylor Nunatak, Mt. Franke, Mt. Wasko, Nilsen Peak, and Mt. Speed) along the Shackleton Glacier, including a subset of 27 samples previously analyzed for S, N, and O isotopes in nitrate and sulfate (Diaz et al., 2020b). At each area, we collected samples in transects (ranging from ∼ 200 to ∼ 2000 m in length) to maximize the geochemical variability. Our transects were also designed to capture the LGM transition, with some soils exposed throughout the LGM and others exposed following glacier retreat. GPS coordinates and elevations were recorded with each sample and later used to estimate the distance from the coast and distance from the glacier (defined as linear distance from the nearest glacier -Shackleton, tributary, or alpine). Once collected, the samples were stored and shipped frozen (−20 • ) to the Ohio State University.
Prior to geochemical analysis, the samples were dried at 50 • C for at least 72 h, with the loss in mass attributed to soil moisture content. The dried soils were leached at a 1 : 5 soil to deionized water (DI) ratio, and the leachate was filtered through 0.4 µm Nucleopore membrane filters (Diaz et al., 2018(Diaz et al., , 2020bNkem et al., 2006). Due to the low sediment to water ratio, this leaching technique only dissolves the more water-soluble salts (Toner et al., 2013). These include salts with ClO − 4 , NO − 3 , Cl − , SO 2− 4 , ClO − 3 , and CO 2− 3 + HCO − 3 . Process blanks were generated and analyzed to account for any contamination from the leaching process.

Analytical analysis of water-soluble anions, cations, and nutrients
The analytical techniques used here are similar to those reported by Diaz et al. (2020b). In brief, the analytes included anions (F − , Cl − , Br − , and SO 2− 4 ) which were measured on a Dionex ICS-2100 ion chromatograph, cations (K + , Na + , Ca 2+ , Mg 2+ , and Sr 2+ ) which were measured on a PerkinElmer Optima 8300 inductively coupled plasma optical emission spectrometer (ICP-OES), and nutrients (NO − 3 + NO − 2 , PO 3− 4 , H 4 SiO 4 , and NH 3 ) which were measured on a Skalar San++ automated wet chemistry analyzer at the Ohio State University. Perchlorate (ClO − 4 ) and chlorate (ClO − 3 ) were measured using an ion chromatographtandem mass spectrometry technique (IC-MS/MS) at Texas Tech University (Jackson et al., 2012(Jackson et al., , 2015. All analytes are reported as listed. Total water-soluble salt concentration was calculated as the sum of all measured cations and anions. The precision of replicated check standards and samples was typically better than 10 % for all major anions, cations, and nutrients and better than 20 % for perchlorate and chlorate. Accuracy was typically better than 5 % for all major anions, cations, and nutrients, as determined by the NIST 1643e external reference standard and the 2015 United States Geological Survey (USGS) interlaboratory calibration standard (M-216), and better than 10 % for perchlorate and chlorate, as determined by spike recoveries. Precision and accuracy for individual analytes are located in Table S1. Detection limits for the analytes have been previous reported (Diaz et al., 2018;Jackson et al., 2012).

Data interpolation and machine learning
Inverse distance weighted (IDW) interpolations were performed for Bennett Platform, Thanksgiving Valley, and Roberts Massif using the Geostatistical Analyst tool in Ar-cMap 10.3. Since IDW is a deterministic method where unknown values are predicted based on proximity to known values, we chose those three sites as they had the most defined transects and relatively higher sample density. The interpolation parameters were constant with a power of 4, maximum neighbors of 15, minimum neighbors of 5, 4 sectors, and a variable search radius. These parameters were chosen such that they optimize for the lowest mean absolute error (MAE).
Multiple linear regressions were generated for all geochemical analytes except H 4 SiO 4 (total of 15 dependent variables) with latitude, longitude, elevation, distance from the coast, distance from the glacier, and soil moisture as independent variables using built-in functions in R 3.6.3 (R Core Team, 2020). Random forest regression models were sim-ilarly generated using the RandomForest library. The random forest model is a machine learning algorithm that utilizes supervised learning algorithms to predict values given input predictor variables (Breiman, 2001). Multiple decision trees are run in parallel with a randomized subset of predictor variables, and the aggregate result of each tree is used to generate a predicted outcome. Since each tree is generated using a random sample and random predictor variables, the random forest model is effective at minimizing overfitting and handling outliers (Breiman, 2001). For both models, all geochemical data were log-transformed to ensure the data were normally distributed (verified using a Jarque-Bera normality test). Missing values were input as NA (not available).
Machine learning algorithms are widely used in a variety of disciplines from finance (Patel et al., 2015) to ecology (Davidson et al., 2009;Peters et al., 2007;Prasad et al., 2006) for both data prediction (regression) and classification. Recently, these techniques have been used for earth science applications, including geologic mapping (Heung et al., 2014;Kirkwood et al., 2016), air quality monitoring (Stafoggia et al., 2019), and water contaminant tracing (Tesoriero et al., 2017). We developed a novel application of machine learning to predict concentrations and gradients of water-soluble salts in Antarctic soils given set geographic parameters, similar to the approaches developed for stock market and real estate predictions (Antipov and Pokryshevskaya, 2012;Patel et al., 2015).
For our random forest models, any sparse missing values in Table S2 were estimated by averaging the geochemistry of the samples collected immediately before and after in the same transect. The new imputed dataset was split into a training set representing 86 % of the data (n = 189; Table S3) and a testing set representing the remaining 14 % (n = 31; Table S4) based on ideal model parameters described by Breiman (2001). The training dataset was used to generate the random forest models for each analyte. Each of the models was run with 2000 decision trees (ntree = 2000) to minimize the mean squared error. The number of random variables used for each node split in the decision trees was set to the recommended regression default of one third of the total number of variables to optimize the model randomness, which in our case was 2 (mtry = 2), following parameters described previously (Breiman, 2001). The scripts developed for both the multiple linear regression and random forest models are included in the Supplement.

Geochemistry of upper, middle, and lower zones
The maximum, minimum, mean, standard deviation, and coefficient of variation are reported in Table 1 for the measured geographic and geochemical data. Concentrations of watersoluble ions span up to 5 orders of magnitude and are variable across the region. Elevation, distance from the coast, distance from the glacier, and soil moisture are also variable and span up to 3 orders of magnitude. The highest elevation samples (>2000 m a.s.l.) were collected from Schroeder Hill, and the greatest soil moisture content is from Mt. Wasko at 12.3 % with a mean of 2.1 % for all samples.
Shackleton Glacier region surface soils can be separated into three zones based on their water-soluble geochemistry: an upper zone near the Polar Plateau, a middle zone near the center of the glacier, and a lower zone where the glacier flows into the Ross Sea (Figs. 1, 2). The upper zone samples are characterized by the highest total water-soluble salt concentrations, with the highest values greater than 80 000 µg g −1 at Schroeder Hill, while the lower zone samples have the lowest total salt concentrations, with the lowest values near 10 µg g −1 at Mt. Wasko (Fig. 2a- Table S2).
Observed trends between the zones appear to be driven, at least partially, by geography. Regressions of total watersoluble salt concentration, water-soluble N : P ratio, and ClO − 3 concentration with elevation, distance from the coast, and distance from the glacier are all positive (Fig. 2). The strongest relationships are between total salts and elevation, and N : P ratio and elevation with R 2 values of 0.59 and 0.52, respectively, and p values <0.001 with a Bonferroni correction which was applied to minimize the Familywise Type 1 error rate associated with multiple comparisons (Fig. 2a, d). The weakest relationships are between ClO − 4 and distance from the coast and ClO − 3 and distance from the glacier with R 2 values of 0.11 and 0.06, respectively (Fig. 2h, i). Distance from the glacier varies widely between individual zones with frequent overlaps, but there appears to be a moderate relationship with N : P ratio and total salts (Fig. 2c, f). Overall, total salt concentration has the strongest relationship with geography and ClO − 4 has the weakest relationships. Ternary diagrams highlight the specific geochemical gradients within and between the zones (Fig. 3). The anion ternary diagram only includes SO 2− 4 , NO − 3 , and Cl − , which are the major water-soluble salts in the region (Claridge and Campbell, 1968;Diaz et al., 2020b). Though carbonate and bicarbonate salts have been identified in both lacustrine sediments and soils in Antarctica, previously measured concentrations in the Shackleton Glacier region were low, ranging from 0.07 % to 2.5 %, and bicarbonate salts were not identified in the highest elevation and furthest inland soils (Claridge and Campbell, 1968;Diaz et al., 2020b;Lyons et al., 2016). The most abundant anion for the upper zone is SO 2− 4 , which is greater than 99 % of the total anion budget in some Schroeder Hill and Roberts Massif samples, though other locations are dominated by NO − 3 (Fig. 3). The anions are more evenly distributed in the middle zone, though the majority of samples are most abundant in NO − 3 and Cl − . The lower zone has much lower SO 2− 4 fractions than the upper zone, and the dominant anion is generally Cl − . The cation distribution is very similar for all three zones (Fig. 3). Na + + K + is the most abundant cation pair representing over 90 % of the total cations for many upper and middle zone samples, while Ca 2+ is the second most abundant. In general, Mg 2+ is the least abundant cation across all sampling locations.

Statistical geochemical variability
A principal component analysis (PCA) using the correlation matrix (i.e., scale = TRUE) was performed in R (using factoextra; Kassambara and Mundt, 2017) and built in R software libraries to determine which geochemical variables most strongly differ across the samples. For the PCA, the first two principal components (PC1 and PC2) account for over 50 % of the total dataset variability at 44.2 % and 11.6 %, respectively. The different zones are correlated with different principal components (Fig. 4). The samples from the middle zone are positively correlated with PC1 and PC2. In the biplot, they plot in the upper-right quadrant with high concentrations of Cl − , NO − 3 , water-soluble N : P ratio, and Ca 2+ and with a minor influence from soil moisture and H 4 SiO 4 . The upper zone samples generally plot along PC1 and are most associated with Sr 2+ , SO 2− 4 , Mg 2+ , Na + , K + , F − , ClO − 4 , and ClO − 3 . The samples from the lower, more coastal zone are negatively correlated with PC1 and are distinguished by their higher PO 3− 4 concentrations. Most samples from all locations plot within the 95 % confidence interval ellipses. However, there are two strong outliers from Schroeder Hill and Mt. Heekin.
Similar to the PCA, we performed a simple Spearman's rank correlation for the entire dataset to visualize the statistical dependence between all variables. Since a goal of this study is to relate water-soluble ion concentrations to geography, we focused on latitude, longitude, distance from the coast, distance from the glacier, and soil moisture. The strongest correlation coefficients are between Cl − and latitude, elevation, and distance from the coast and Sr 2+ and soil moisture (Fig. 5). Most other correlations are moderate to weak, though there appear to be notably stronger correlations between ClO − 3 and latitude and distance from the coast, Ca 2+ and longitude, elevation, and distance from the coast, NO − 3 and latitude, and SO 2− 4 and distance from the glacier. Longitude, elevation, and distance from the coast have the greatest number of strong and moderate correlations with the geochemistry data. Outside of the geographic parameters, Na + is highly correlated with total water-soluble salts, likely representative of the high Na + + K + percentages (Fig. 3), and Sr 2+ is highly correlated with K + , likely reflecting a common ion source.

Spatial interpolation and machine learning model performance
The total salt concentrations of individual samples at Bennett Platform produce the most defined interpolation gradient from the glacier front to further inland compared to Roberts Massif and Thanksgiving Valley (Fig. 6). Bennett Platform also has the smoothest salt concentration contours suggesting that the interpolation model is the strongest and most robust at this location. The second strongest interpolation is Thanksgiving Valley. Contrary to the measurements at Bennett Platform, Thanksgiving Valley has the highest salt concentrations in the center of the valley with lower concentrations to both the east and west. The lowest concentration contours are closest to the glacier for both Bennett Platform and Thanksgiving Valley, which is likely related to glacial history since the soils near the glacier are relatively younger than those further inland based on meteoric 10 Be data (Diaz et al., 2020a). The interpolation from Roberts Massif does not have a distinguishable spatial trend. The multiple linear regression and random forest models vary in their strength for the individual analytes. The highest R 2 value from the linear regression is 0.65 for Na + , while total water-soluble salts, water-soluble N : P ratio, ClO − 4 , and ClO − 3 have values of 0.61, 0.60, 0.44, and 0.55, respectively ( Table 2). The lowest R 2 value is for PO 3− 4 at 0.17. The p values for all analytes are 0.001 even with a Bonferroni cor-rection. The highest out-of-the-box explained variance values from the random forest models are for total salts and ClO − 3 at 75 % and 63 %, respectively. The lowest explained variance is for Sr 2+ at 37 %. The values of N : P ratio and ClO − 4 are 52 % and 48 %, respectively. We also evaluated the most important and least important variables in the random forest models based on node purity. The most important variable for the majority of analytes is elevation, while distance from the glacier is most important for N : P ratio and latitude for ClO − 3 ( Table 2). The least important variables are distance from the coast and latitude for every analyte except ClO − 3 , for which distance from the glacier is least important.

Implications for ecological habitat suitability
By establishing geochemical zones for the Shackleton Glacier region, we can better understand the relationship between geochemistry and geography and ultimately biogeography. As stated in the introduction, we focused particularly on total water-soluble salt concentrations, water-soluble N : P ratios, and ClO − 4 and ClO − 3 concentrations. Figure 2. Total water-soluble salts, water-soluble N : P molar ratio, and ClO − 4 and ClO − 3 concentrations (log scale) were compared to elevation, distance from the coast, and distance from the glacier for samples from the three geographic zones (blue for upper, yellow for middle, gray for lower zones). Linear regression lines are plotted, for which dashed lines represent regressions where p>0.05 with a Bonferroni correction, and R 2 values are reported for each relationship. The horizontal orange lines represent nematode salt tolerance of ∼ 2600 (Nkem et al., 2006), and the green lines represent the Redfield ratio, N : P = 16, for phytoplankton in the ocean.

Elevation and moisture controls on total water-soluble salt gradients
The elevational trends of total salt concentrations at the Shackleton Glacier are similar to those previously described in the TAM, where soils from higher elevation sites typically have higher salt concentrations (Bottos et al., 2020;Lyons et al., 2016;Magalhães et al., 2012). Our results are also consistent with those from Scarrow et al. (2014) who found that salt concentrations typically decreased with distance from the glacier in the Beardmore and Lennox-King glacial regions. Our total water-soluble salt interpolation maps high-light the spatial variability in Shackleton Glacier region soils (Fig. 6). The most spatially variable location is Robert Massif, which does not appear to follow local elevational, latitudinal, and/or distance inland gradients. This heterogeneity is not necessarily due to currently active soil leaching as the soil moisture values are not drastically different between the samples (Table S2). Though the variability in cation concentrations is likely due to the weathering of tills, scree, and bedrock (Claridge and Campbell, 1968), recent work on the isotopic composition of water-soluble nitrate and sulfate, the major anions in the upper zone, suggests a common, atmospheric source (Diaz et al., 2020b). We argue that the heterogeneity in the total salt concentrations at Roberts Massif (Figs. 2, 6) is probably related to different and complex wetting history, in which seasonal snow patch melt may pool in local depressions, transporting watersoluble salts from slightly higher elevations and/or from saline wet patches (Levy et al., 2012). This is demonstrated on a larger scale at Thanksgiving Valley, a glacially carved valley where the higher concentrations of salts in the center of the valley are likely due to the transport of salts from nearby higher elevation slopes during melting events. This is further evidenced by the presence of two small, closedbasin ponds in the center of the valley which likely formed from glacial melt and may have been larger in size in the recent past . Similarly, streams and meltwater tracks in the MDV leach soils and carry salts into closedbasin, brackish to hyper-saline lakes where salts are cryoconcentrated over time (Lyons et al., 1998). Our results suggest that elevation and wetting history are important contributors to total salt gradients in the Shackleton Glacier region as they influence the accumulation of salts and subsequent leaching from soils.

Influence of glacial history on water-soluble N : P ratios
Stoichiometric dependencies have been identified for Antarctic terrestrial organisms, in which nutrient concentrations, in addition to soil aridity, limit ecosystem development (Nkem et al., 2006). Since nitrate is primarily derived from atmospheric deposition and phosphorus is primarily liberated from minerals by chemical weathering in the CTAM, many inland and higher elevation soils have accumulated high concentrations of NO − 3 , resulting in stoichiometric imbalance with soluble PO 3− 4 (Ball et al., 2018;Barrett et al., 2007;Diaz et al., 2020b;Lyons et al., 2016;Nkem et al., 2006). As in the MDV, younger and coastal soils at lower elevations in the Shackleton Glacier region have the lowest water-soluble N : P ratios driven by relatively low concentrations of NO − 3 and high concentrations of PO 3− 4 due to an increase in moisture content and chemical weathering (Heindel et al., 2017) (Figs. 2, 4). It is not surprising that life was conspicuous in these soils with thick lichen growth on several rocks and the presence of both Collembola and mites at Mt. Speed and Mt. Wasko (Fig. S1). However, despite overall elevational and latitudinal gradients, some inland locations in the middle and upper zones have water-soluble N : P ratios near those from the lower zone (Fig. 2).
The interpolation model from Bennett Platform shows that some locations near the glacier have lower total watersoluble salt concentrations (Fig. 6), similar to soils surveyed in the MDV (Bockheim, 2002). However, the samples near the glacier at Bennett Platform not only have lower total salt concentrations, they also have lower N : P ratios than samples collected further inland. This is also the case for the middle zone locations (Fig. 2f). We argue that this is due to differences in glacial history between the locations. Our previous work showed that soils near the glacier are younger than soils further inland in the Shackleton Glacier region (Diaz et al., 2020a). These soils are shielded from nitrate accumulation during glacial periods, and the recently exposed rocks likely serve as fresh mineral weathering material for PO 3− 4 mobilization (Heindel et al., 2017). Recently exposed and relatively nutrient-rich soils might be important refugia for invertebrates. Previous hypotheses have suggested that organisms may have persisted at higher elevations during glacial periods (Bennett et al., 2016;Stevens and Hogg, 2003). However, abiotic gradients in the Beardmore Glacier region suggest that higher elevation soils have salt concentrations that would classify them as unsuitable habitats (Lyons et al., 2016). If few organisms survived glaciations, the nearglacier, relatively P-rich soils may be important in helping communities recover and restructure post-glaciation.    trations from the Atacama and Mojave deserts (Jackson et al., 2015). Though not a strong correlation, the highest elevation samples (upper zone) have the highest ClO − 4 and ClO − 3 concentrations (Fig. 2g, j). Similar to NO − 3 , ClO − 4 and ClO − 3 are derived from atmospheric deposition, and, because of their high solubilities, their accumulations are related to wetting and glacial histories (Jackson et al., 2016(Jackson et al., , 2015. Therefore, soils which have been exposed for long periods of time and have not experienced snow or ice melt, such as those from Schroeder Hill and Roberts Massif, are able to accumulate high concentrations of ClO − 4 and ClO − 3 . Interestingly, our ClO − 4 concentrations are lower -maximum of ∼ 1.9 g L −1 -than the highest recorded tolerance -1.1 M (∼ 130 g L −1 ) NaClO 4 -for the extremotolerant bacteria Planococcus halocryophilus, yet a recent study shows no detectable biomass for Schroeder Hill samples (Dragone et al., 2020). (Per)chlorates are strong oxidizers and are well established as toxic; thus the concentrations of ClO − 4 and ClO − 3 might be additional, crucial indicators of habitat suitability. However, the concentrations are highly heterogenous across our sampled locations (Fig. 2k-l), and unlike ClO − 3 , neither the multiple linear regression nor random forest models were able to adequately capture the variability in ClO − 4 concentrations (Table 2).

Machine learning as a tool to predict soil geochemical trends
We sought to evaluate our multiple linear regression and random forest models using a testing dataset from the Shackleton Glacier region (n = 31) and a second dataset from the Darwin Mountains (∼ 80 • S) (n = 10) (Magalhães et al., 2012). Few published/available TAM datasets include sam-ple GPS coordinates, soil moisture, and water-soluble ion geochemistry. As stated in Sect. 3.3, the Shackleton Glacier region test data were not included in the random forest model generation so we could evaluate our models with an independent dataset. For the Darwin dataset, distance from the glacier, distance from the coast, and elevation were determined using the Reference Elevation Model of Antarctica (REMA), while location, soil moisture, and geochemistry were retrieved from the literature (Howat et al., 2019;Magalhães et al., 2012). We evaluated all 15 analytes from the original models with the Shackleton dataset and, due to a lack of data, only evaluated seven analytes from the Darwin soils (Fig. 7). Both the multiple linear regression and random forest model outputs are moderately well-correlated for the Shackleton dataset, as determined by Pearson correlations between the measured and predicted values ( Fig. 7a; Table 3). The random forest models outperform the linear regression models for nearly every analyte with the exception of Sr 2+ , NH 3 , and PO 3− 4 , and nearly all p values are <0.001. For Cl − , in particular, the random forest model significantly outperforms the multiple linear regression model with R 2 values of 0.67 and 0.16, respectively. N : P molar ratio is the most accurately predicted analyte with R 2 values of 0.88 and 0.59 for the random forest and linear regression models, respectively. However, the highest R 2 value for the multiple linear regression model is for Na + at 0.64 (Table 3). In terms of our analytes of interest regarding habitat suitability, total salts have the second strongest correlation (following N : P ratio) with the random forest model (R 2 = 0.81), followed by ClO − 4 (R 2 = 0.78) and ClO − 3 (R 2 = 0.74). MAE and root mean squared error (RMSE) values indicate that the random forest models also have a smaller error compared to Table 3. Multiple linear regression and random forest statistics between predicted and measured concentrations from the Shackleton and Darwin Glacier regions. R 2 and p values are reported for the correlations between measured and predicted concentrations. Regression slopes and test statistic values (t) were calculated using the smatr library (Warton et al., 2012) in R to evaluate the null hypothesis (H 0 ) of slope = 1. Higher test statistic values (closer to 1) indicate that we reject the null hypothesis. All geochemical data were log-transformed. the multiple linear regression models (Table 4). MAE values are lower than RMSE values for both models, indicating the strong influence of outliers in the testing dataset. This is unsurprising as the standard deviation and coefficient of variation values for the entire dataset are relatively large for all analytes. Additionally, the outliers are likely one reason why the random forest models are stronger than the multiple linear regression models. Similar to the model performance in the Shackleton Glacier region, the water-soluble ion predictions for the Darwin Glacier region are more strongly correlated with measured values in the random forest models compared to the multiple linear regressions (Fig. 7b). In fact, the linear regression models fail for nearly all the Darwin samples, and most concentration outputs are negative, which is likely due to overfitting during model generation. Here, Ca 2+ and K + are exceptions, and the multiple linear regression models out-perform the random forest models in both cases. MAE and RMSE values for both models are higher than those for the Shackleton dataset (Table 4). On the other hand, the random forest models perform particularly well for some analytes. Though a small sample size, the R 2 values for N : P molar ratio and Ca 2+ are 0.68 and 0.66, respectively, with p values 0.001. Total salts is moderately correlated (R 2 = 0.47). It is unclear why some analytes, such as N : P molar ratio, are the most accurately predicted, though we suspect that this is due to (1) weathering trends of local lithology across the TAM since chemical weathering is probably the major source of these ions and (2) deposition and accumulation of atmospherically derived ions at higher elevations (Diaz et al., 2020b).
It should be noted that the R 2 values simply measure the strength of the correlations between the measured and predicted values. We performed slope tests by fitting bivariate lines using the standardized major axis (SMA) to further understand the relationship between the two values using the smatr library in R (Warton et al., 2012). For this test, we specifically evaluated the null hypothesis (H 0 ) in which slope = 1, which would indicate whether an ideal, direct 1 : 1 relationship exists between the measured and predicted values. Test statistic values (t) were used to measure the sample correlation between the residuals and fitted values (Warton et al., 2012). Test statistic values near 1 indicate that we reject the null hypothesis. In other words, higher absolute test statistic values indicate a slope other than 1. Of the 15 analytes in the Shackleton dataset, 5 analytes have slopes near 1 for the multiple linear regression models and 11 for the random forest, as indicated by test statistic values less than 0.5. For the Darwin, no analytes have test statistic values less than 0.5 ( Fig. 7; Table 3).
These data indicate that while some analytes have high correlations between measured and predicted values, the models perform best with the Shackleton Glacier region soils. Additionally, though the relationship may not be 1 : 1, Figure 7. R 2 values for the multiple linear regression and random forest predicted and measured values for the different analytes (Table 3). Test datasets include the Shackleton Glacier region (n = 31) and the Darwin Glacier region (n = 10) (Magalhães et al., 2012). Analytes with slopes near 1, indicating good agreement between measured and predicted values, are indicated ( * t<0.5; * * t<0.20).
the random forest models are effective at predicting the measured geochemical gradients. For example, similar to our data, the Darwin Glacier samples generally have greater water-soluble N : P ratios and total water-soluble salt concentrations further from the glacier and at higher elevations (Magalhães et al., 2012), a trend that is reflected by our model results despite offset values. Additionally, corrections for the offset of the model from a slope = 1 (i.e., multiplying the model output value by the regression slope) can be made to better estimate specific concentrations, though the difference between modeled and measured values can still be up to 2 times greater. Our sample size for building the multiple linear regression and random forest models is small. We anticipate that as more data are collected throughout the CTAM, these data can be added to the model training dataset, expanding our prediction capabilities and increasing model reliability.

Conclusions
The soil ecosystems found in the Transantarctic Mountains are among the least diverse on Earth, and their structure is influenced by environmental variables. We characterized environmental and geochemical gradients in the Shackleton Glacier region, which will aid in our understanding of the abiotic properties in soils governing biodiversity and biogeography. The 220 samples we analyzed represent a wide range of soil environments: those with different elevation, latitude, longitude, glacial history, and geochemistry. We determined three soil zones: an upper zone near the head of the glacier which is characterized by high total water-soluble salt concentrations, high water-soluble N : P ratios, and high ClO − 4 and ClO − 3 concentrations, a lower zone with low total salt concentrations and higher PO 3− 4 concentrations, and a middle zone with intermediate values. The zones help elucidate the geographic influences on soil geochemistry. In addition, our total water-soluble salt interpolations at Roberts Massif, Bennett Platform, and Thanksgiving Valley reflect the local small-scale variability in salt concentrations and possible influences from soil age and wetting history.
Similar to previous studies, our results suggest that high elevation and inland soils, such as those from the upper zone, were likely unsuitable candidates for refugia during the Last Glacial Maximum. However, glacial advance and retreat and climate shifts may leach soils, lowering otherwise toxic total water-soluble salt concentrations and N : P ratios. These more recently exposed soils may be particularly important in maintaining and reviving contemporary and past biological communities.
Five geographic variables (latitude, longitude, elevation, distance from the coast, and distance from the glacier) and soil moisture were correlated with soil geochemistry. We used these variables to develop multiple linear regression and random forest models to predict ion concentrations and geochemical gradients. The model results generally reflected the measured geochemical variability across the region. Test datasets from the Shackleton and Darwin Glacier regions showed that the random forest models typically outperformed the multiple linear regression models when correlating measured and predicted values, especially for the Darwin region. Though most correlations did not exhibit a 1 : 1 relationship and had varying slopes, the random forest models were able to adequately predict geochemical gradients, as demonstrated by moderate to high R 2 values between measured and model predicted concentrations. As terrestrial Antarctic geochemical databases expand and are included in the random forest model training dataset, we anticipate the model's predictive capabilities will expand and improve as well. While these results are currently most applicable for central Transantarctic Mountain soils, similar techniques can be applied to other hyper-arid environments (e.g., Namib and Atacama deserts, Mars) to inform patterns of biodiversity and biogeography.
Data availability. The datasets generated for this study are included in the article or Supplement.
Author contributions. The project was designed and funded by BJA, DHW, IDH, NF, and WBL. Fieldwork was conducted by BJA, DHW, IDH, NF, and MAD. CBG, SAW, and MAD prepared and analyzed the samples for water-soluble ion and nutrient analyses. WAJ prepared and analyzed the samples for ClO − 4 and ClO − 3 . MAD generated the scripts and performed the analyses for the IDW interpolations and multiple linear regression and random forest models. MAD wrote the article with contributions and edits from all authors.
Competing interests. The authors declare that they have no conflict of interest.