Uncovering chemical signatures of salinity gradients through compositional analysis of protein sequences

Prediction of the direction of change of a system under specified environmental conditions is one reason for the widespread utility of thermodynamic models in geochemistry. However, thermodynamic influences on the chemical compositions of proteins in nature have remained enigmatic despite much work that demonstrates the impact of environmental conditions on amino acid frequencies. Here, we present evidence that the dehydrating effect of salinity is detectable as chemical differences in protein sequences inferred from (1) metagenomes and metatranscriptomes in regional salinity gradients and (2) differential gene and protein expression in microbial cells under hyperosmotic stress. The stoichiometric hydration state (nH2O), derived from the number of water molecules in theoretical reactions to form proteins from a particular set of basis species (glutamine, glutamic acid, cysteine, O2, H2O), decreases along salinity gradients, including the Baltic Sea and Amazon River and ocean plume, and decreases in particle-associated compared to free-living fractions. However, the proposed metric does not respond as expected for hypersaline environments. Analysis of data compiled for hyperosmotic stress experiments under controlled laboratory conditions shows that differentially expressed proteins are on average shifted toward lower nH2O. Notably, the dehydration effect is stronger for most organic solutes compared to NaCl. This new method of compositional analysis can be used to identify possible thermodynamic effects in the distribution of proteins along chemical gradients at a range of scales from microbial mats to oceans.

Introduction
How microbial populations adapt to environmental gradients is a major challenge at the intersection of geochemistry, microbiology, and biochemistry. Patterns of amino acid usage in proteins are important indicators of microbial adaptation, and amino acid composition at the genome level is well known to depend on growth temperature (Zeldovich et al., 2007). Furthermore, measures of evolutionary distance and community composition based on protein sequences predicted from metagenomic sequencing are strongly associated with environmental temperature and pH (Alsop et al., 2014). It is widely acknowledged that the effect of amino acid substitutions on the structural stability of proteins is a major factor affecting amino acid usage in thermophiles (Sterner and Liebl, 2001;Zeldovich et al., 2007). Similarly, a large body of work has demonstrated amino acid signatures associated with proteins from halophilic organisms (Kunin et al., 2008;Paul et al., 2008;Oren, 2013;Boyd et al., 2014). The most common interpretation of these trends is that particular amino acid substitutions are selected through evolution to increase the stability and solubility of the folded conformation and enhance other structural properties such as flexibility (Paul et al., 2008).
An interrelated approach to interpreting patterns of amino acid composition is based on the energetics of amino acid synthesis. Energetic costs in terms of ATP (adenosine triphosphate) requirements have been used to model protein expression levels in bacterial and yeast cells (Akashi and Gojobori, 2002; Wagner, 2005). Although ATP demands depend on environmental conditions (Akashi and Gojobori, 2002), a limitation of ATP-based models is that they are derived for specific biosynthetic pathways and growth conditions, such as respiratory or fermentative (i.e., aerobic or anaerobic) growth (Wagner, 2005). A different class of models, based on thermodynamic analysis of the overall Gibbs energy of reactions to synthesize metabolites from inorganic precursors, quantifies the energetics of the reactions in terms of temperature, pressure, and chemical activities of all the species in the reactions, including those that define pH and oxidation-reduction potential (Shock et al., 2010). Notably, the overall Gibbs energies for amino acid synthesis become more favorable, but to a different extent for each amino acid, between cold, oxidizing seawater and hot, reducing hydrothermal solution (Amend and Shock, 1998). A recent systems biology study demonstrates trade-offs between Gibbs energy of alternative pathways for amino acid synthesis and cofactor use efficiency (which affects ATP costs) in the model organism Escherichia coli and suggests that pathway thermodynamics play a role in thermophilic adaptation (Du et al., 2018). The oxidation state of proteins as well as lipids has been shown to be associated with oxidation-reduction (redox) gradients in a hot spring (Dick and Shock, 2011; Boyer et al., 2020), but so far energetic models have not been broadly adopted as a tool for relating metagenomic and geochemical data. This may be because few studies have asked whether specific changes in the chemical composition of biomolecules reflect specific environmental conditions.
To help close this gap, here we use compositional analysis of protein sequences to identify chemical signatures of two types of environmental conditions: redox and salinity gradients. In a previous study, we compared one broad class of geochemical conditions (redox gradients) with one compositional metric for proteins (carbon oxidation state). Here, we expand the geobiochemical framework to two dimensions by considering another set of environments (salinity gradients) and another compositional metric (stoichiometric hydration state). Thermodynamic considerations predict that redox gradients supply a driving force for changes in the oxidation state of biomolecules (similar reasoning applies to the oxygen content of proteins; Acquisti et al., 2007), while salinity gradients, through the dehydrating potential associated with osmotic effects, exert a force that selectively alters the hydration state of biomolecules.
To test these predictions, we used two compositional metrics: the carbon oxidation state (ZC) and stoichiometric hydration state (nH2O). ZC is computed from the chemical formulas of organic molecules and takes values between the extremes of −4 for CH4 and +4 for CO2, although the range for particular classes of biomolecules is much smaller (Amend et al., 2013). nH2O is derived from the number of water molecules in theoretical formation reactions of proteins from basis species (Dick, 2016, 2017). Through the compositional analysis of representative metagenomic and metatranscriptomic datasets, we show that ZC and nH2O are most closely aligned with environmental redox and salinity gradients, respectively. These findings apply to freshwater and marine environments, but trends for hypersaline environments deviate from the thermodynamic predictions, most likely due to evolutionary optimizations of hydrophobicity and isoelectric point to stabilize the structures of proteins in halophilic organisms.

Conceptual background
In this study we use compositional analysis to uncover environmental imprints in protein sequences. Analysis of compositional data is used by geochemists to study processes such as water-rock interaction and ore deposition and is often one of the first steps in constructing thermodynamic models, but its application to living systems is relatively uncommon. Therefore, it is important to describe the conceptual basis for our methods. To do this, we identified six areas of concern summarized as (1) intracellular or environmental conditions, (2) amino acids or atoms, (3) condensation or theoretical formation reactions, (4) chemical composition or conformational stability, (5) oxidation and hydration state or temperature and pH, and (6) mathematical or biosynthetic models.
A first concern is that intracellular conditions are maintained within physiological ranges, so the influence of external conditions on the composition of microbial biomolecules may be limited. However, cell membranes are permeable to uncharged species such as hydrogen (Slonczewski et al., 2009), supporting the argument that the oxidation state of the cytoplasm and therefore the energetics of metabolic reactions are influenced by the external environment (Poudel et al., 2018;Canovas and Shock, 2020). Likewise, oxygen diffuses rapidly through lipid membranes, depending on their composition and structure, and rates of diffusion increase with temperature (Möller et al., 2016). Cell membranes are also permeable to water (Record et al., 1998). For E. coli, which grows most rapidly at about 0.3 Osm L −1 (osmolarity), increasing the extracellular osmotic strength from 0.1 to 1.0 Osm L −1 (approximately the osmotic concentration of seawater; BioNumbers BNID 100802 (Milo et al., 2010)) reduces the amount of free cytoplasmic water by more than half (Record et al., 1998). Halophiles, which thrive at even higher salinities, accumulate inorganic salts or organic solutes to maintain osmotic balance with the environment (Garner and Burg, 1994;Oren, 2013). The result is that, with few exceptions, intracellular conditions must be isosmotic with the environment, or somewhat higher, to maintain turgor pressure (Gunde-Cimerman et al., 2018). Water activity is lower in more concentrated solutions, and intracellular water activity estimated from freezing point and cell composition data closely follows that of the growth medium but is often offset to lower values (Chirife et al., 1981), perhaps due to macromolecular crowding effects (Garner and Burg, 1994). To summarize, high osmotic strength causes a decrease in hydration potential, measured as water activity, both outside and inside cells.
This brief review suggests that oxidation and hydration potentials in cell interiors, at least under experimental conditions, are influenced by (but not equal to) environmental conditions. Ideally, we would like to compare the compositions of biomolecules to conditions actually measured inside cells or in the immediate surroundings of cells, but these measurements are generally not available for microbial communities in their natural environments; thus, we make comparisons with large-scale geochemical gradients, except for different layers of the Guerrero Negro microbial mat, where metagenomic and chemical data are available on the scale of millimeters.
Second, previous authors have emphasized the importance of changes in elemental stoichiometry (that is, atomic composition) and not only amino acid composition in the molecular evolution of proteins (Baudouin-Cornu et al., 2001). Although stoichiometric predictions are amenable to experimental tests, such as the long-term evolution of E. coli in the laboratory (Turner et al., 2017), the omission of a major bioelement, hydrogen, and the oxidation state of organic matter from most stoichiometric models (Karl and Grabowski, 2017) means that there are also significant opportunities for theory development. Because redox reactions are inherent in many aspects of metabolism, while hydration and dehydration reactions are essential for the synthesis of biomacromolecules (Braakman and Smith, 2013), our approach is shaped by the assumption that O2 and H2O are two primary components that link environmental conditions to the energetics of biomolecular synthesis.
The third point follows from the previous one. The polymerization of amino acids is a condensation reaction that releases one H2O per bond formed, independent of the particular amino acids that are involved. By contrast, our analysis depends crucially on the concept of a "formation reaction", which in the thermodynamic literature represents the composition of a chemical species, either in terms of elements (Warn and Peters, 1996) or in terms of other species (May and Rowland, 2018). When these other species are restricted in number to the minimum needed to represent the composition of all possible species in the system, they constitute a set of "basis species", which can be thought of as the building blocks of the system, similar to the concept of thermodynamic components (Anderson, 2005). Therefore, a formation reaction from basis species is a mass-balanced (but nonunique) stoichiometric representation of the chemical composition of the protein. This type of reaction in general does not correspond to amino acid biosynthesis or polymerization, so to avoid confusion, we refer to these formation reactions as "theoretical formation reactions"; the number of water molecules in the theoretical formation reactions, normalized by the protein length, is the "stoichiometric hydration state".
From a mechanistic standpoint, an analysis using any set of basis species is inadequate, since the number of basis species (five, corresponding to the elements C, H, N, O, and S) is smaller than the number of biochemical precursors and inorganic species that are actually involved in amino acid synthesis (Du et al., 2018). The use of O 2 , H 2 O, and other basis species to represent the composition of proteins reflects the hypothesis that they are conjugate to thermodynamically meaningful descriptive variables (specifically, chemical potentials) even if they are not directly involved in the biosynthetic mechanisms for amino acids. The projection of amino acid composition (20-D) into the compositional space represented by basis species (5-D) is a type of dimensionality reduction, but the variables are chosen based on a physicochemical hypothesis, unlike principal components analysis (PCA) or other unsupervised methods, where the projection is determined by the data.
A fourth concern is that this analysis is based on the hypothesis that thermodynamic forces affect the chemical compositions of proteins over evolutionary time, which is different from the more common hypothesis of optimization of structural stability. Thermodynamic models define the "cost" of a protein as a function of not only amino acid composition but also environmental conditions. Conceptually, this follows from Le Chatelier's principle, in that increasing the chemical activity of a reactant (on the left-hand side of a reaction) drives the reaction toward the products. Stated in more general terms, the overall Gibbs energy of a reaction depends on the activities of species in the reaction (Shock et al., 2010;Amend and LaRowe, 2019). Consider two proteins with different amino acid compositions and therefore also different chemical compositions and theoretical formation reactions, which should be normalized by the number of residues in order to compare proteins of different length. The formation of the protein with more water as a reactant is theoretically favored by increasing the water activity, whereas the formation of the protein with more oxygen as a reactant is favored by increasing the oxygen activity. The water and oxygen activity are thermodynamic measures of hydration and oxidation potential and can be converted to other scales, such as oxidation-reduction potential (ORP).
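The dependence on water activity described above can be written out as a short worked relation. This is a sketch under stated assumptions, not a derivation taken from the source: the theoretical formation reaction is written per residue with all basis species as reactants, \(\nu_j\) denotes the (hypothetical notation) coefficients of the three amino acid basis species, \(a\) denotes chemical activity, and \(R\) and \(T\) are the gas constant and temperature.

\[
\Delta G = \Delta G^{\circ} + RT\left(\ln a_{\mathrm{protein}} - \sum_j \nu_j \ln a_j - n_{\mathrm{H_2O}} \ln a_{\mathrm{H_2O}} - n_{\mathrm{O_2}} \ln a_{\mathrm{O_2}}\right)
\]

\[
\frac{\partial\left(\Delta G_1 - \Delta G_2\right)}{\partial \ln a_{\mathrm{H_2O}}} = -RT\left(n_{\mathrm{H_2O},1} - n_{\mathrm{H_2O},2}\right)
\]

Under these assumptions, if protein 1 has the higher stoichiometric hydration state, raising the water activity lowers its Gibbs energy of formation relative to protein 2, and lowering the water activity (as at higher salinity) does the opposite. The same form applies to O2 with the corresponding coefficients.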
This reasoning provides the theoretical justification for using chemical composition as an indicator of molecular adaptation to specific environmental conditions but does not replace interpretations based on structural considerations. Halophilic organisms exhibit well documented patterns of amino acid usage, including lower hydrophobicity and higher abundance of acidic residues that impart greater stability, solubility, and flexibility of proteins (Paul et al., 2008). These adaptations are reflected in lower values of the GRAVY hydrophobicity scale (Paul et al., 2008;Boyd et al., 2014) and/or isoelectric point of proteins (pI) (Oren, 2013). In Sect. 4.3 and 4.4, we compare the compositional metrics with GRAVY and pI for the same datasets.
Fifth, temperature, pH, and other environmental parameters besides redox and salinity might influence the oxidation and hydration state of proteins. For instance, the redox gradients in hydrothermal systems are also temperature gradients, due to the mixing of seawater and hydrothermal fluid, and we have not attempted to disentangle the effects of temperature and redox conditions. However, our previous analysis of other redox gradients, including stratified hypersaline lakes, indicates that the carbon oxidation state of biomolecules can vary even in systems where temperature changes are much smaller. It is an axiomatic statement that changes in oxidation state can be associated with one thermodynamic component of a system; our objective in the present study is to explore the differences between this and one other component, represented by hydration state. Future work should also account for the effects of pH and temperature, which is possible using thermodynamic models for proteins (Dick and Shock, 2011).
Finally, it should be noted that the basis species used in the stoichiometric analysis are chosen primarily for mathematical convenience and not because of evolutionary or biosynthetic requirements. The main criterion we consider for the choice of basis species is to reduce the covariation between the metrics for oxidation and hydration state, which arises as a mathematical consequence of projecting the atomic formulas of proteins into a particular compositional space, and may not reflect meaningful differences of chemical composition. Additional considerations are described in Sect. 3.2.

Carbon oxidation state
The most common metric used in geochemistry for the oxidation state of organic molecules is the average oxidation state of carbon (ZC), which also goes by other names such as nominal oxidation state of carbon (NOSC) (LaRowe and Van Cappellen, 2011). This quantity measures the average degree of oxidation of carbon atoms in organic molecules. For a protein for which the primary sequence has the chemical formula $\mathrm{C}_c\mathrm{H}_h\mathrm{N}_n\mathrm{O}_o\mathrm{S}_s$, the value of ZC can be calculated from the following (Dick and Shock, 2011; Dick, 2014):

\[
Z_\mathrm{C} = \frac{-h + 3n + 2o + 2s}{c} \tag{1}
\]

The derivation of Eq. (1) is based on the relative electronegativities of the elements, expressed as oxidation numbers (e.g., Kauffman, 1986; Minkiewicz et al., 2018). When bonded to carbon, H is assigned an oxidation number of +1, and N, O, and S have oxidation numbers of −3, −2, and −2, respectively.
Equation (1) gives the remaining charge that must be present on each C atom, on average, to satisfy overall neutrality. Because of the relatively simple structures of amino acids and the primary structure of proteins, in which N, O, and S are bonded to only H and C, it is possible to calculate the average oxidation state of carbon using Eq. (1). However, this equation is not necessarily valid for other classes of organic molecules or some types of post-translational modifications of proteins, including the formation of disulfide bonds. An important relation inherent in Eq. (1) is the redox neutrality of hydration and dehydration reactions; any pair of hypothetical (or real) proteins whose formulas differ only by some amount of H2O have equal carbon oxidation states.
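As a minimal sketch of the calculation described here, the following Python snippet computes ZC from a chemical formula using the oxidation numbers stated in the text (H: +1; N: −3; O: −2; S: −2), which give ZC = (−h + 3n + 2o + 2s)/c. The function names and the simple formula parser are hypothetical helpers written for illustration; they are not part of any package mentioned in the text.

```python
import re

def parse_formula(formula):
    """Parse a simple formula such as 'C5H10N2O3' into a dict of element counts."""
    counts = {}
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] = counts.get(element, 0) + (int(number) if number else 1)
    return counts

def carbon_oxidation_state(formula):
    """Average oxidation state of carbon for a neutral C-H-N-O-S formula.

    Assumes N, O, and S are bonded only to H and C, as for the primary
    structure of proteins; not valid for, e.g., disulfide bonds.
    """
    atoms = parse_formula(formula)
    c = atoms.get("C", 0)
    if c == 0:
        raise ValueError("formula contains no carbon")
    h, n = atoms.get("H", 0), atoms.get("N", 0)
    o, s = atoms.get("O", 0), atoms.get("S", 0)
    return (-h + 3 * n + 2 * o + 2 * s) / c

# Extremes quoted in the text, and the redox neutrality of hydration:
print(carbon_oxidation_state("CH4"))        # methane
print(carbon_oxidation_state("CO2"))        # carbon dioxide
print(carbon_oxidation_state("C5H10N2O3"))  # alanylglycine
print(carbon_oxidation_state("C5H12N2O4"))  # same formula plus one H2O
```

The last two calls illustrate the point made above: adding H2O to a formula changes h by 2 and o by 1, which cancel in the numerator, so ZC is unchanged.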

Choice of basis species: theoretical considerations
A major premise of this study is that oxidation state and hydration state are two primary variables in geobiochemical systems. Accordingly, when choosing the basis species that can be combined to make the proteins, O2 and H2O are the only fixed requirements. This leaves three basis species that when combined with each other and with O2 and H2O must be able to give any possible formula written as $\mathrm{C}_c\mathrm{H}_h\mathrm{N}_n\mathrm{O}_o\mathrm{S}_s$. We reiterate that this analysis refers to the chemical formulas of polypeptide sequences, that is, the primary structure of proteins, not post-translational modifications or H2O molecules in the hydration shell of folded proteins.
Equation (1) is derived from electronegativity relations and therefore allows for the calculation of the carbon oxidation state from a given chemical formula, independent of any chemical reactions. In contrast, there is no way to count the number of H2O molecules in a chemical formula; H2O appears only in chemical reactions. But it is important to note that any particular reaction that involves only H2O is redox neutral. Conversely, the coefficient of O2 in redox reactions is closely related to the number of electrons transferred. Let us consider the 20 protein-forming amino acids as a baseline for compositional analysis; the numbers of H2O and O2 in the formation reactions of the amino acids from a particular set of basis species are denoted by nH2O and nO2. The choice of basis species in our study is guided by the dual objectives that (1) nH2O of amino acids should have very little correlation with ZC and (2) nO2 of amino acids should be strongly correlated with ZC. It should be emphasized that these are not criteria for "correctness", since basis species, like thermodynamic components, only have to be the minimum number needed to represent the chemical composition of all the species that can be formed from them (Anderson, 2005). Instead, basis species selected using these conditions yield a convenient mathematical projection of elemental composition; that is, nearly horizontal or vertical trends on nH2O-ZC scatterplots for proteins specifically reflect changes in oxidation state or hydration state, respectively.
An additional consideration is that a biologically meaningful set of basis species is likely to comprise metabolites that have high network connectivity, that is, are involved in reactions with many other metabolites. Reactions involving glutamine and glutamic acid (or its ionized form glutamate) are major steps of nitrogen metabolism (Morowitz, 1999; DeBerardinis and Cheng, 2010), and these amino acids have been characterized as "nodal point" metabolites (Walsh et al., 2018). Either methionine or cysteine would provide the sulfur required for the system, but cysteine is relevant as a constituent of the glutathione molecule, which has important roles in cellular redox chemistry (Walsh et al., 2018). These considerations support the proposal of the amino acids glutamine, glutamic acid, and cysteine (collectively abbreviated QEC) together with O2 and H2O as a biologically relevant set of basis species for describing the chemical compositions of proteins (Dick, 2016). These three amino acids are among the top eight amino acids ranked by number of reactions in a metabolic model for E. coli (Feist et al., 2007).

Choice of basis species: stoichiometric analysis
Here we compute the stoichiometric hydration state by analyzing the compositions of the 20 proteinogenic amino acids in detail. We start with a "default" set of basis species chosen for their common occurrence in overall catabolic reactions (Amend and LaRowe, 2019): CO2, NH3, H2S, H2O, and O2. Using these basis species (designated CHNOS), the theoretical formation reaction of alanine (C3H7NO2) is

\[
3\,\mathrm{CO_2} + \mathrm{NH_3} + 2\,\mathrm{H_2O} \rightarrow \mathrm{C_3H_7NO_2} + 3\,\mathrm{O_2}
\]

and the oxygen and water content of the amino acid (i.e., nO2 = −3 and nH2O = 2) are the opposite of the coefficients on O2 and H2O in the reaction. Analogous reactions for the other amino acids were used to make Fig. 1a-b. Using glutamine (C5H10N2O3), glutamic acid (C5H9NO4), cysteine (C3H7NO2S), H2O, and O2 (the QEC basis species), the theoretical formation reaction of alanine is

\[
0.4\,\mathrm{C_5H_{10}N_2O_3} + 0.2\,\mathrm{C_5H_9NO_4} + 0.6\,\mathrm{H_2O} \rightarrow \mathrm{C_3H_7NO_2} + 0.3\,\mathrm{O_2}
\]

showing that the oxygen and water content are nO2 = −0.3 and nH2O = 0.6. Calculations for all the amino acids using the QEC basis were used to make Fig. 1c-d.
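The mass balance behind these theoretical formation reactions amounts to solving a 5 × 5 linear system: one equation per element (C, H, N, O, S) and one unknown per basis species. The sketch below is an illustration only, with hypothetical function names; it uses exact fractions and plain Gauss-Jordan elimination rather than any particular library, and it returns the amount of each basis species contained in the amino acid, so that the H2O and O2 entries are nH2O and nO2 directly.

```python
from fractions import Fraction

ELEMENTS = ("C", "H", "N", "O", "S")

# Elemental compositions of the two sets of basis species named in the text
CHNOS = {"CO2": {"C": 1, "O": 2}, "NH3": {"N": 1, "H": 3},
         "H2S": {"H": 2, "S": 1}, "H2O": {"H": 2, "O": 1}, "O2": {"O": 2}}
QEC = {"glutamine": {"C": 5, "H": 10, "N": 2, "O": 3},
       "glutamic acid": {"C": 5, "H": 9, "N": 1, "O": 4},
       "cysteine": {"C": 3, "H": 7, "N": 1, "O": 2, "S": 1},
       "H2O": {"H": 2, "O": 1}, "O2": {"O": 2}}
ALANINE = {"C": 3, "H": 7, "N": 1, "O": 2}

def solve_linear(a, b):
    """Solve a x = b exactly by Gauss-Jordan elimination (square, nonsingular a)."""
    n = len(b)
    m = [[Fraction(v) for v in row] + [Fraction(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        # Find a row with a nonzero entry in this column and move it into place
        pivot = next(r for r in range(col, n) if m[r][col] != 0)
        m[col], m[pivot] = m[pivot], m[col]
        m[col] = [v / m[col][col] for v in m[col]]
        for r in range(n):
            if r != col and m[r][col] != 0:
                factor = m[r][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [row[-1] for row in m]

def basis_coefficients(species, basis):
    """Amounts of each basis species that sum to the composition of `species`."""
    names = list(basis)
    # Rows are elements, columns are basis species
    a = [[basis[name].get(el, 0) for name in names] for el in ELEMENTS]
    b = [species.get(el, 0) for el in ELEMENTS]
    return dict(zip(names, solve_linear(a, b)))

for label, basis in (("CHNOS", CHNOS), ("QEC", QEC)):
    coeffs = basis_coefficients(ALANINE, basis)
    print(label, {name: str(value) for name, value in coeffs.items()})
```

For alanine this reproduces the values quoted in the text (nH2O = 2 and nO2 = −3 in the CHNOS basis; nH2O = 0.6 and nO2 = −0.3 in the QEC basis), which illustrates that the hydration and oxidation coefficients depend on the choice of basis species even though the elemental composition is fixed.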
As measured by R² in linear regressions, the CHNOS basis yields a strong negative correlation between ZC and nH2O for the amino acids (Fig. 1a) but a relatively weak correlation between ZC and nO2 (Fig. 1b). The QEC basis provides a stronger association between ZC and nO2 and reduces the correlation between ZC and nH2O (Fig. 1c-d). However, there is still a small negative correlation for amino acids (Fig. 1c). A plot with the R² values for all possible combinations of H2O, O2, and three amino acids indicates that QEC has relatively low R² of nH2O-ZC and high R² of nO2-ZC (Fig. 1e). Therefore, it is a suitable candidate to meet the objectives described above. Although another combination of amino acids (methionine, tryptophan, and tyrosine; MWY) has even lower R² for the nH2O-ZC fit (Fig. 1e), tryptophan and tyrosine are not highly connected metabolites and therefore are less preferable as basis species. By strengthening the association between ZC and nO2, which represent alternative metrics for oxidation state, and by reducing the correlation between ZC and nH2O, the QEC basis species provides a more convenient projection of elemental composition than a default choice of inorganic species, such as CO2, NH3, H2S, H2O, and O2, which commonly appear in overall catabolic reactions (Amend and LaRowe, 2019). The selection of basis species is an evolving method, and further analysis with other metabolites may lead to a more convenient set of basis species to project the elemental composition of proteins into chemical variables.

[Table 1 caption, partially recovered: "... and number of carbon atoms (nC). Standard one-letter abbreviations for the amino acids (denoted AA) are used."]

Compositional metrics for proteins and metagenomes
For a given protein, the stoichiometric hydration state was calculated from

\[
n_{\mathrm{H_2O}} = \frac{\sum_i n_i \left(n_{\mathrm{H_2O},i} - 1\right) + 1}{\sum_i n_i}
\]

where n_i is the frequency of the ith amino acid (i = 1 to 20) in the protein and n_{H2O,i} is the stoichiometric hydration state of that amino acid (Table 1). The "−1" in the numerator accounts for the loss of H2O in the polymerization of amino acids, and the "+1" after the summation accounts for the N-terminal H and C-terminal OH of the polypeptide. Unlike nH2O, ZC for proteins must be weighted by the number of carbon atoms in each amino acid, i.e.,

\[
Z_\mathrm{C} = \frac{\sum_i n_i \, n_{\mathrm{C},i} \, Z_{\mathrm{C},i}}{\sum_i n_i \, n_{\mathrm{C},i}}
\]

where n_{C,i} and Z_{C,i} are the number of carbon atoms and carbon oxidation state of the ith amino acid (see Table 1). For example, ZC of the dipeptide Ala-Gly can be calculated as (3 × 0 + 2 × 1)/(3 + 2), where 3 and 2 are the numbers of carbon atoms and 0 and 1 are the ZC of Ala and Gly, respectively. The result, 0.4, can be checked by applying Eq. (1) to the chemical formula of alanylglycine (C5H10N2O3). The methods for calculating nH2O and ZC from elemental composition and amino acid composition are shown schematically in Fig. 2.
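The two protein-level metrics can be sketched in a few lines of Python, following the description above: each residue contributes its amino acid value of nH2O minus 1 (one H2O lost per residue on polymerization), 1 is added back for the terminal H and OH, and the sum is divided by the number of residues, while ZC is a carbon-weighted average. The function names are hypothetical. The per-amino-acid values for ZC and nC are the ones quoted in the Ala-Gly example; nH2O = 0.6 for alanine is quoted in the text, whereas nH2O = 0.4 for glycine is an assumption obtained here by the same QEC mass balance, not a value read from Table 1.

```python
def protein_nH2O(counts, nH2O_aa):
    """Stoichiometric hydration state per residue from amino acid counts."""
    length = sum(counts.values())
    # Subtract one H2O per residue for condensation; add one back for the termini
    total = sum(n * (nH2O_aa[aa] - 1) for aa, n in counts.items()) + 1
    return total / length

def protein_ZC(counts, ZC_aa, nC_aa):
    """Carbon oxidation state, weighted by the carbon number of each amino acid."""
    numerator = sum(n * nC_aa[aa] * ZC_aa[aa] for aa, n in counts.items())
    denominator = sum(n * nC_aa[aa] for aa, n in counts.items())
    return numerator / denominator

# Illustrative values for two amino acids only (see the caveats in the lead-in)
ZC_AA = {"A": 0, "G": 1}
NC_AA = {"A": 3, "G": 2}
NH2O_AA = {"A": 0.6, "G": 0.4}  # G is an assumed value from QEC mass balance

ala_gly = {"A": 1, "G": 1}
print(protein_ZC(ala_gly, ZC_AA, NC_AA))
print(protein_nH2O(ala_gly, NH2O_AA))
```

A useful consistency check is that alanylglycine (C5H10N2O3) has the same chemical formula as glutamine, one of the QEC basis species, so its theoretical formation reaction involves no H2O and its stoichiometric hydration state should be zero to within floating-point rounding.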

Amino acid composition of proteomes of nif-bearing organisms
In a separate study, Poudel et al. (2018) used carbon oxidation state as a metric for comparing proteomes of organisms containing the nitrogenase gene (nif). The evolution of these organisms is associated with rising atmospheric oxygen through geological history. fewer than 1000 RefSeq protein sequences. As a result, the numbers of organisms included in the present calculations (Nif-A: 155, Nif-B: 68, Nif-C: 14, Nif-D: 7) are less than those identified in Poudel et al. (2018). Note that values of Z C calculated here (Fig. 3a) are lower than those shown in Fig. 5 of Poudel et al. (2018). This difference is associated with the weighting by carbon number (described above), which was not performed by Poudel et al. (2018).

GRAVY and pI
The grand average of hydropathicity (GRAVY) was calculated using published hydropathy values for amino acids (Kyte and Doolittle, 1982). The isoelectric point (pI) was calculated using published pKa values for terminal groups (Bjellqvist et al., 1993) and side chains (Bjellqvist et al., 1994); however, the calculation does not implement position-specific adjustments (Bjellqvist et al., 1994). The pKa values used for calculating pI (Bjellqvist et al., 1993, 1994) and transfer free energies used in the derivation of the GRAVY scale (Kyte and Doolittle, 1982) correspond to 25 °C and 1 bar, and no attempt was made here to account for the temperature effects on these properties. The charge for each ionizable group was precalculated from pH 0 to 14 at intervals of 0.01, and the isoelectric point was computed as the pH where the sum of charges of all groups in the protein is closest to zero. These calculations were implemented as new functions in the canprot R package (Dick, 2017) (see "Code and data availability" section). Comparisons for selected proteins (UniProt IDs: LYSC_CHICK, RNAS1_BOVIN, AMYA_PYRFU) show that the calculated values of GRAVY and pI are equal to those obtained with the ProtParam tool (Gasteiger et al., 2005).
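The grid-scan approach to pI and the averaging used for GRAVY can be illustrated with the following Python sketch. It is not the canprot implementation: the function names are hypothetical, the charges are computed on the fly rather than precalculated, and the pKa values below are rounded, illustrative numbers assumed for this example rather than the Bjellqvist et al. values, so the resulting pI values should not be compared with ProtParam output. The hydropathy values are those of the Kyte and Doolittle (1982) scale.

```python
# Kyte-Doolittle hydropathy values
KD = {"A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5,
      "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9,
      "M": 1.9, "F": 2.8, "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9,
      "Y": -1.3, "V": 4.2}

# Illustrative pKa values (ASSUMED for this sketch; substitute published values)
PKA = {"Nterm": 8.0, "K": 10.0, "R": 12.0, "H": 6.0,
       "Cterm": 3.5, "D": 4.0, "E": 4.5, "C": 9.0, "Y": 10.0}

def gravy(seq):
    """Grand average of hydropathicity: mean hydropathy over all residues."""
    return sum(KD[aa] for aa in seq) / len(seq)

def net_charge(seq, pH):
    """Net charge from Henderson-Hasselbalch terms for each ionizable group."""
    positive = [PKA["Nterm"]] + [PKA[aa] for aa in seq if aa in "KRH"]
    negative = [PKA["Cterm"]] + [PKA[aa] for aa in seq if aa in "DECY"]
    charge = sum(1 / (1 + 10 ** (pH - pKa)) for pKa in positive)
    charge -= sum(1 / (1 + 10 ** (pKa - pH)) for pKa in negative)
    return charge

def isoelectric_point(seq):
    """pH on a 0.01 grid from 0 to 14 where the net charge is closest to zero."""
    grid = [i / 100 for i in range(1401)]
    return min(grid, key=lambda pH: abs(net_charge(seq, pH)))

print(gravy("AG"))
print(isoelectric_point("DDDD"), isoelectric_point("KKKK"))
```

Scanning a fixed grid avoids root-finding and is robust for any sequence, at the cost of a resolution limited to the grid spacing.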

Prediction of protein sequences
Protein sequences were predicted from metagenomic reads using a previously described workflow. Briefly, reads were trimmed, filtered, and dereplicated using scripts adapted from the MG-RAST pipeline (Keegan et al., 2016). For metatranscriptomic datasets, ribosomal RNA sequences were removed using SortMeRNA (Kopylova et al., 2012). Protein-coding sequences were identified using FragGeneScan (Rho et al., 2010), and the amino acid sequences of the predicted proteins were used in further calculations. For large datasets, only a portion of the available reads was processed (at least 500 000 reads; see Supplement Tables S1 and S2). This reduces the computational requirements without noticeably affecting the calculated average compositions. Means and standard deviations of ZC, nH2O, GRAVY, and pI were calculated for 100 random subsamples of protein sequences from each metagenomic or metatranscriptomic dataset. The number of sequences included in each subsample was chosen to give a total length closest to 50 000 amino acids on average. The subsample density (or number of sequences included in each sample) depends on the average length of the metagenomic or metatranscriptomic sequences and is listed in Tables S1 and S2. This number ranges from 251 for the dataset with the highest mean protein fragment length (199.1; metagenome of hot-spring source of Bison Pool) to 1696 for the dataset with the lowest mean protein fragment length (29.5; metatranscriptome of site GS684 in the Baltic Sea).
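The subsampling step can be sketched as follows. This is a hedged illustration, not the workflow's actual code: the function names are hypothetical, the number of sequences per subsample is assumed to be the target length divided by the mean fragment length and rounded (which gives 251 for a mean length of 199.1), sampling is assumed to be without replacement, and `metric` stands for any function that maps a list of sequences to a single number, such as a pooled ZC or nH2O.

```python
import random
import statistics

def subsample_size(mean_length, target_length=50000):
    """Number of sequences whose total length is closest to the target on average."""
    return max(1, round(target_length / mean_length))

def subsample_stats(sequences, metric, n_subsamples=100,
                    target_length=50000, seed=0):
    """Mean and standard deviation of `metric` over random subsamples."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    mean_length = sum(len(s) for s in sequences) / len(sequences)
    # Cannot draw more sequences than are available
    k = min(len(sequences), subsample_size(mean_length, target_length))
    values = [metric(rng.sample(sequences, k)) for _ in range(n_subsamples)]
    return statistics.mean(values), statistics.stdev(values)

def glycine_fraction(seqs):
    """Toy metric for demonstration: fraction of glycine in the pooled sequences."""
    pooled = "".join(seqs)
    return pooled.count("G") / len(pooled)

toy_sequences = ["AGAG", "GGGA", "AAAA", "GAGA", "AAGG", "GGGG"] * 50
print(subsample_size(199.1))
print(subsample_stats(toy_sequences, glycine_fraction, target_length=400))
```

Fixing the total subsample length rather than the number of sequences keeps the sampling variance comparable between datasets with short and long protein fragments.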

Comparison of redox and salinity gradients
To search for the hypothesized dehydration signal in metagenomic data, we began with redox gradients as a negative control. Submarine hydrothermal vents are zones of complex interactions between reduced endmember fluids and relatively oxidized seawater (Reeves et al., 2014;Ooka et al., 2019). Terrestrial hydrothermal systems, such as the hot springs in Yellowstone National Park, USA, provide a source of reduced fluids that are oxidized by degassing and mixing with air and surface groundwater as well as biological activity including sulfide oxidation (Lindsay et al., 2018). Redox gradients can also develop over smaller length scales. The surface of the Guerrero Negro microbial mat (Baja California Sur, Mexico) is exposed to ca. 1 m deep hypersaline, oxygenated water (approximately 200 µM O 2 ), but in the mat, oxygen rises during the daytime and is depleted within a few millimeters, giving way to anoxic and then sulfidic conditions (Ley et al., 2006).
Using metagenomic data for these redox gradients (Kunin et al., 2008; Havig et al., 2011; Swingley et al., 2012; Reveillaud et al., 2016; Fortunato et al., 2018), Dick et al. (2019) showed that the carbon oxidation states of DNA, messenger RNA, and proteins increase down the outflow channel of Bison Pool and between fluids from diffuse hydrothermal vents and relatively oxidizing seawater. Moreover, intact polar lipids extracted from the microbial communities of Bison Pool and other alkaline hot springs also exhibit downstream increases in carbon oxidation state (Boyer et al., 2020), revealing that parallel compositional trends characterize many major types of biomacromolecules in these hot springs. The Z C of proteins increases more subtly toward the surface in the upper few millimeters of the Guerrero Negro microbial mat; it also increases at greater depths, perhaps due to heterotrophic degradation and/or horizontal gene transfer. Furthermore, an evolutionary trajectory associated with the occurrence of different homologs of nitrogenase (nif) in anaerobic and aerobic organisms is characterized by increasing Z C of the proteomes of these organisms (Poudel et al., 2018). The trends of carbon oxidation state described above are visible in the scatter plot in Fig. 3a, with an added dimension: stoichiometric hydration state. The guidelines in this plot are parallel to the n H 2 O-Z C trend for amino acids (Fig. 1c); their slope represents the background correlation between n H 2 O and Z C that is associated with the choice of basis species. Sample data for Bison Pool and the submarine vents are distributed parallel to these guidelines. Therefore, the decrease of n H 2 O along these redox gradients can be attributed to the background correlation in the stoichiometric analysis, and the differences between samples within each dataset are specifically associated with changes in carbon oxidation state and not stoichiometric hydration state.
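To make the role of the basis species concrete, the following Python sketch balances a formation reaction from the basis species named in the abstract (glutamine, glutamic acid, cysteine, H2O, O2) and reports the coefficient on each species. It is an illustration under stated assumptions rather than the authors' implementation: the function name `formation_coefficients` is hypothetical, the elemental formulas are those of the neutral molecules, and the example works with free amino acids; how the H2O coefficient is normalized for proteins is not restated in this section.

```python
from fractions import Fraction

ELEMENTS = ("C", "H", "N", "O", "S")

# Elemental formulas of the basis species (neutral molecules).
BASIS = {
    "glutamine": {"C": 5, "H": 10, "N": 2, "O": 3},
    "glutamic acid": {"C": 5, "H": 9, "N": 1, "O": 4},
    "cysteine": {"C": 3, "H": 7, "N": 1, "O": 2, "S": 1},
    "H2O": {"H": 2, "O": 1},
    "O2": {"O": 2},
}


def formation_coefficients(formula):
    """Solve the element balance for the number of each basis species
    needed to form one molecule with the given elemental formula."""
    names = list(BASIS)
    n = len(ELEMENTS)
    # Augmented matrix: one row per element, one column per basis species,
    # plus a final column holding the target composition.
    rows = [
        [Fraction(BASIS[name].get(el, 0)) for name in names]
        + [Fraction(formula.get(el, 0))]
        for el in ELEMENTS
    ]
    # Gauss-Jordan elimination with exact rational arithmetic.
    for col in range(n):
        pivot = next(r for r in range(col, n) if rows[r][col] != 0)
        rows[col], rows[pivot] = rows[pivot], rows[col]
        pivot_value = rows[col][col]
        rows[col] = [v / pivot_value for v in rows[col]]
        for r in range(n):
            if r != col and rows[r][col] != 0:
                factor = rows[r][col]
                rows[r] = [a - factor * b for a, b in zip(rows[r], rows[col])]
    return {name: rows[i][n] for i, name in enumerate(names)}


# Free amino acids used as examples.
GLYCINE = {"C": 2, "H": 5, "N": 1, "O": 2}
ASPARTIC_ACID = {"C": 4, "H": 7, "N": 1, "O": 4}
LEUCINE = {"C": 6, "H": 13, "N": 1, "O": 2}
```

With these formulas, the H2O coefficient is 2/5 for glycine, -1/5 for aspartic acid, and 6/5 for leucine. The relatively reduced leucine requires more water in its formation reaction than the relatively oxidized aspartic acid, which is the sense of the background correlation between n H 2 O and Z C described above.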
This is an expected outcome, as the redox gradients considered here do not have large changes in salinity. In particular, concentrations of Cl − , a conservative ion, increase by less than 10 % (6.1 to 6.6 mM) in the outflow of Bison Pool due to evaporation (Swingley et al., 2012). The diffuse vents considered here have concentrations of Cl − between 515 and 624 mM, not greatly different from bottom seawater at 545 mM (Dataset S1 of Reeves et al., 2014).
As a well-known example of a regional salinity gradient, the Baltic Sea exhibits a freshwater to marine transition over 1800 km, but dissolved oxygen at the surface is at or near saturation with air (Dupont et al., 2014), so this transect does not represent a redox gradient. For protein sequences derived from metagenomes in the 0.1-0.8 µm size fraction, there are large changes in stoichiometric hydration state along the Baltic Sea transect but relatively small differences in the carbon oxidation state (Fig. 3b). This pattern holds for samples from both the surface and chlorophyll a maximum (9-30 m deep; Fig. 3c).

Multifactorial hydration effects
The stoichiometric hydration state of proteins can be influenced by factors other than just salinity. Previous authors have observed large differences in microbial community composition between free-living and particle-associated fractions, which may be due in part to anoxic conditions arising from limited diffusion in particles (Simon et al., 2014). As described below, we found a trend of relatively low n H 2 O in particles compared to free-living fractions in both the Baltic Sea and Amazon River. This effect is probably associated with phylogenetic differences among the size fractions, but reduced accessibility to bulk water may be a contributing factor. Further support for the possible influence of physical accessibility is the reduced n H 2 O in the interior compared to upper layers of the Guerrero Negro microbial mat.
For the Baltic Sea metagenomes and metatranscriptomes, the 0.1-0.8 and 0.8-3.0 µm size fractions of particles that do not pass through the filter, which are used for subsequent DNA extraction and sequencing, represent free-living bacteria, while the 3.0-200 µm fraction contains particle-associated bacteria with larger average genome sizes and greater inferred metabolic and regulatory capacity (Dupont et al., 2014). Figure 4a-c shows that proteins inferred from metagenomes for larger particles have lower n H 2 O than those for the smallest size fraction. The Guerrero Negro microbial mat offers another opportunity to compare exposed and interior environments. Unlike Z C, which reaches a minimum a few millimeters into the mat, n H 2 O decreases throughout the mat, but the changes are most pronounced in the upper few millimeters (Fig. 3a).
One hypothesis that could explain these findings is that the interiors of particles and the mat are sequestered to some extent from the surrounding aqueous environment. If limited accessibility to the aqueous phase were manifested as lower water activity, perhaps due to surface effects associated with geological nanomaterials (Wang et al., 2003) and/or higher concentrations of solutes, it would provide a thermodynamic drive that favors lower n H 2 O of proteins. However, it should be noted that particles are also suitable habitats for multicellular and eukaryotic populations (Simon et al., 2014). Therefore, the trends in stoichiometric hydration state may require an explanation in terms of both physical and phylogenetic differences, which should be explored in future studies.
An important evolutionary transition is the emergence of heterotrophic metabolism, which is a later innovation than autotrophic core metabolism (Morowitz, 1999; Braakman and Smith, 2013). It is notable that the deeper layers of the Guerrero Negro mat show greater evidence for heterotrophic metabolism (Kunin et al., 2008); likewise, heterotrophs in the "photosynthetic fringe" in Bison Pool may outcompete the autotrophs that dominate at higher and lower temperatures (Swingley et al., 2012). These putative heterotroph-rich zones show locally lower values of n H 2 O (Fig. 3a). If decreasing stoichiometric hydration state is a common theme across some evolutionary transitions, then the relatively high n H 2 O in the proteomes of organisms carrying the ancestral nitrogenase Nif-D (Fig. 3a) is not unexpected. A better understanding of these trends would require more extensive phylogenetically resolved comparisons of the compositional differences as well as quantitative analyses of water fluxes in different metabolic pathways.

Compositional trends in rivers, lakes, and hypersaline environments
The Amazon River and ocean plume provide another example of a freshwater to marine transition, with salinities that range from below the scale of practical salinity units (PSU) in the river to 23-36 PSU in the plume (Satinsky et al., 2014, 2015). We used published metagenomic and metatranscriptomic data for filtered samples classified as free-living (0.2 to 2.0 µm) and particle-associated (2.0 to 156 µm) (Satinsky et al., 2014, 2015). River samples form a tight cluster on a plot of stoichiometric hydration state against carbon oxidation state of proteins, and the plume samples are scattered over lower Z C and low values of n H 2 O, particularly for the particle-associated fraction (Fig. 5a). For metatranscriptomes, there is a noticeable decrease of n H 2 O from the river to the ocean plume but little difference in carbon oxidation state (Fig. 5b), and the particle-associated samples again exhibit a generally lower n H 2 O than the free-living samples. Together with the lower n H 2 O for proteins inferred from metagenomes and metatranscriptomes in the larger size fractions from Baltic Sea samples, this could reflect a lower availability of H 2 O to organisms living near the particle surface due to physical separation from the bulk aqueous phase and associated diffusion limitation or lower water activity (Wang et al., 2003).
Figure 5. Compositional analysis and hydropathicity and isoelectric point calculations for proteins from the Amazon River and plume and other metagenomes. Samples representing freshwater, marine, and hypersaline environments are indicated by the colored convex hulls. (a) Metagenomic and (b) metatranscriptomic data for particle-associated and free-living fractions from the lower Amazon River (Satinsky et al., 2015) and plume in the Atlantic Ocean (Satinsky et al., 2014). (c) Freshwater (lakes in Sweden and USA) and marine metagenomes considered in a previous comparative study (Eiler et al., 2014).
We also considered data used in a previous comparative study and data for hypersaline environments including evaporation ponds (salterns) and lakes in desert areas. Eiler et al. (2014) characterized microbial communities using metagenomic data for various freshwater samples (lakes in the USA and Sweden) and marine locations. For hypersaline settings, we used metagenomic data from the Santa Pola salterns in Spain (Ghai et al., 2011; Fernandez et al., 2013), natural soda lakes of the Kulunda Steppe in Siberia (Vavourakis et al., 2016), and South Bay salterns in California, USA (Kimbrel et al., 2018). The compositional analysis reveals a relatively low n H 2 O of proteins inferred from the marine metagenomes compared to freshwater samples in the Eiler et al. (2014) dataset (Fig. 5c). Surprisingly, hypersaline metagenomes have ranges of n H 2 O of proteins that are similar to marine environments but considerably higher Z C (Fig. 5c). To interpret these results, we considered other factors that are known to influence the amino acid compositions of proteins in halophiles.
"Salt-in" halophilic organisms have proteins with relatively low isoelectric point that remain soluble at high salt concentrations (Ghai et al., 2011). It should be noted that proteins with a lower pI also tend to have relatively high Z C due to higher abundances of aspartic acid and glutamic acid, which are relatively oxidized (see Amend and Shock, 1998;Dick, 2014;and Fig. 1). Consequently, the lower pI characteristic of salt-in organisms is also associated with an increase of carbon oxidation state. Because of the large pI differences (Fig. 5f), the increase of Z C in hypersaline environments can not be interpreted as an indicator of an environmental redox gradient. Some halophilic organisms are also known to have proteins that are less hydrophobic, with lower values of GRAVY (Paul et al., 2008;Boyd et al., 2014). Because hydrophobic amino acids have relatively low values of Z C (Dick, 2014), a negative correlation between GRAVY and Z C is also expected. Consistent with these well known features of halophilic adaptation, marine metagenomes exhibit lower hydrophobicity than most of the freshwater samples, and hypersaline metagenomes are shifted to both lower GRAVY and pI (Fig. 5f). However, there are irregular trends in the Amazon River data. Compared to the river, the proteins in plume metagenomes exhibit lower GRAVY and either higher or lower pI (Fig. 5d). Similarly, other authors have reported that although lower pI is a signature of many hypersaline environments, it does not clearly distinguish marine from lower-salinity environments (Rhodes et al., 2010). In contrast, the plume metatranscriptomes do show decreased pI but no major difference in GRAVY compared to river samples (Fig. 5e).
There is not enough space here to comprehensively examine all the available metagenomic data for environmental salinity gradients. However, we have identified one dataset that gives a contradictory result and therefore offers more perspective on the compositional relationships of proteins coded by metagenomes in salinity gradients. This dataset was generated in a time series study of microbial and viral community dynamics in a freshwater aquaculture facility ("tilapia channel" and "prebead pond") and low-, medium-, and high-salinity salterns in southern California (Rodriguez-Brito et al., 2010). Here, we have used only the reported microbial sequences (not the viral dataset) and considered all time points together. Contrary to our starting hypothesis, the stoichiometric hydration state of proteins is lowest in the freshwater samples, which is the reverse of the trend from the Baltic Sea (Fig. 6a-b). A side-by-side comparison of the Baltic Sea and the datasets by Rodriguez-Brito et al. (2010) shows large changes of GRAVY in the former but pI in the latter (Fig. 6c-d), which is another indication that these variables are responsive only in certain ranges of salinity.
This counterexample demonstrates that the sign of differences of n H 2 O is not predictable in all environments; however, the large negative offset in the freshwater samples may be a signal of some other influence, perhaps related to the human control of these ponds, which are used as fish nurseries. Specifically, the microbial communities in the aquaculture ponds may not be responding as they would in a typical natural system that is less nutrient rich. As noted above for putative heterotroph-rich zones in other systems, the lower stoichiometric hydration state could be associated with the enrichment of heterotrophic taxa, in this case due to the addition of organic compounds to the aquaculture ponds.
Considering all the datasets shown in Figs. 5 and 6, there appears to be no globally consistent metric for environmental salinity gradients that can be derived from amino acid composition. If we exclude the Rodriguez-Brito et al. (2010) dataset, then n H 2 O exhibits a consistent decreasing trend in marine compared to freshwater samples. However, this trend does not continue into hypersaline environments.

Compositional analysis of differentially expressed proteins
While biomolecular data for environmental salinity gradients reflect both ecological and evolutionary differences, laboratory experiments provide information on the physiological effects of osmotic conditions on protein expression in particular organisms. It is also important to recognize that osmotic stress can be imposed by solutes other than NaCl; the effects of organic solutes differ in relation to their ability to permeate or depolarize cell membranes and to be sensed by cellular osmoregulatory systems (Kanesaki et al., 2002; Shabala et al., 2009; Withman et al., 2013). Because microbial acclimation to changes in osmotic conditions is a dynamic process, it is helpful to look at gene and protein expression data for a range of times and conditions that can be controlled in the lab. We searched the literature to compile data for differential gene and protein expression in non-halophilic bacteria in NaCl or other osmotic stress conditions. As a general rule, we only included datasets with a minimum of 20 down-regulated and 20 up-regulated genes or proteins; however, smaller datasets were included if they are part of a study with larger datasets. This compilation consists of 49 transcriptomics and 30 proteomics datasets from 36 studies (note that different time points and treatments are considered separate datasets); descriptions and references for all datasets are given in Figures S1 and S2. In addition, four datasets for differential expression of proteins in halophilic archaea in hyperosmotic stress were located (Leuko et al., 2009; Zhang et al., 2016; Lin et al., 2017; Jevtić et al., 2019) (see Fig. S3). This is a major update to an earlier compilation of data for hyperosmotic stress experiments (Dick, 2017), but we have limited the present compilation to data for bacteria or archaea; data for osmotic stress induced by NaCl or glucose in eukaryotic cells are considered in a separate paper (Dick, 2020a).
We assembled the lists of up- and down-regulated proteins in each dataset or, for gene expression studies, the proteins corresponding to the up- and down-regulated genes and converted gene names or accession numbers to UniProt accessions using the UniProt mapping tool (Huang et al., 2011). The compiled data are available as CSV files in R packages (see the "Code and data availability" section). After removing genes or proteins with unavailable or duplicated UniProt IDs and those with ambiguous differences (appearing in both the down- and up-regulated groups), the amino acid compositions computed for protein sequences downloaded from UniProt (The UniProt Consortium, 2019) were used for the compositional analysis of carbon oxidation state and stoichiometric hydration state. Median differences (i.e., Δn H 2 O and ΔZ C) were calculated as the median value for all up-regulated proteins minus the median value for all down-regulated proteins in each dataset.
Figure 7a shows results for time-course experiments for hyperosmotic stress. Note that all values are differences calculated relative to the same control (initial time point) in a given study. In transcriptomic experiments for a commensal species (Enterococcus faecalis), a soil bacterium (Methylocystis sp. strain SC2), and two pathogens (E. coli O157:H7 and Salmonella enterica serovar Typhimurium) (Solheim et al., 2014; Han et al., 2017; Kocharunchitt et al., 2014; Finn et al., 2015), there is a marked progression toward lower n H 2 O of the associated proteins with time. In a transcriptomic experiment for salt stress in Synechocystis sp. PCC 6803 (Qiao et al., 2013), n H 2 O is shifted negatively between 24 and 48 h but rises to a slightly positive value at 72 h. Proteomic data are available from two of these studies, indicating that the differentially expressed proteins in E. coli (Kocharunchitt et al., 2014) also show decreasing n H 2 O with time, but in the proteomic experiment for Synechocystis sp. PCC 6803 (Qiao et al., 2013), n H 2 O changes sign from negative to positive between 24 and 48 h (Fig. 7a).
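The filtering and median-difference steps described above can be sketched as follows. This is an illustrative outline, not the code in the R packages: the function names are hypothetical, "duplicated" IDs are handled here by keeping one copy of each (an assumption, since the text does not say whether duplicates are collapsed or dropped), and `values` stands for any per-protein metric such as n H 2 O or Z C keyed by accession.

```python
import statistics


def clean_groups(up_ids, down_ids):
    """Collapse duplicated IDs and drop IDs that appear in both groups."""
    up, down = set(up_ids), set(down_ids)
    ambiguous = up & down
    return sorted(up - ambiguous), sorted(down - ambiguous)


def median_difference(values, up_ids, down_ids):
    """Median of the metric for up-regulated proteins minus the median
    for down-regulated proteins, skipping IDs with no available value."""
    up, down = clean_groups(up_ids, down_ids)
    up_values = [values[i] for i in up if i in values]
    down_values = [values[i] for i in down if i in values]
    return statistics.median(up_values) - statistics.median(down_values)
```

For example, with placeholder values `{"P1": 0.1, "P2": -0.2, "P3": 0.3, "P4": -0.4, "P5": 0.0}`, an up-regulated list `["P1", "P1", "P3", "P5"]`, and a down-regulated list `["P2", "P4", "P5", "P9"]`, the ambiguous entry P5 and the unavailable entry P9 are excluded and the median difference is 0.2 - (-0.3) = 0.5, up to floating-point rounding.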
Perhaps the most striking result to emerge from this analysis is the strong dehydrating signal associated with osmotic stress imposed by organic solutes. We compared pairs of datasets from the same study for NaCl and another solute at concentrations that give similar total osmolalities. Transcriptomic data for sorbitol (Kanesaki et al., 2002; Han et al., 2005), sucrose (Kohler et al., 2015), and glycerol (Finn et al., 2015) compared to controls all show a lower n H 2 O of the associated proteins than for NaCl compared to controls (Fig. 7b). Data from the study of Finn et al. (2015) are plotted at 1 and 6 h in the experiment, indicating a time-dependent decrease of n H 2 O under both NaCl and glycerol treatment as well as more negative values for glycerol than NaCl. Experiments with different strains of E. coli show a slightly more positive value for sucrose than NaCl (Shabala et al., 2009) and a much larger positive difference for urea compared to NaCl (Withman et al., 2013). The available proteomic data also show lower n H 2 O for sucrose (Kohler et al., 2015) and glucose (Schmidt et al., 2016) compared to NaCl (Fig. 7b). Note that the latter dataset is actually a comparison between growth on glucose and glucose with NaCl; growth on glucose alone produces a lower n H 2 O of the differentially expressed proteins.
The marked decrease of n H 2 O induced by solutes such as sorbitol, which does not permeate the plasma membrane, could result from a higher effective osmotic pressure compared to NaCl (Kanesaki et al., 2002). Because it permeates cells, solutions of urea are not considered hypertonic (Burg et al., 2007), which may be one reason for the higher n H 2 O for urea compared to NaCl. Sucrose, which permeates but unlike NaCl does not depolarize the plasma membrane (Shabala et al., 2009), produces a slightly higher n H 2 O than NaCl in one transcriptomics dataset for E. coli (Shabala et al., 2009) but has a more marked dehydrating effect in both transcriptomics and proteomics datasets for Caulobacter crescentus (Kohler et al., 2015). The negative shift of n H 2 O associated with most organic solutes compared to NaCl lends support to the notion that high organic loading could contribute to the relatively low n H 2 O of protein sequences from metagenomes of freshwater aquaculture systems (Fig. 6b).
Considering all transcriptomic datasets together (see Fig. S1 for references), the proteins coded by differentially expressed genes in non-halophilic bacteria under hyperosmotic stress do not show significant differences in Z C, n H 2 O, pI, or GRAVY (Fig. 7c-d). However, the average difference of n H 2 O would become more negative if the early time points in individual time-course experiments were excluded from the average (see Fig. 7a). Unlike the results for transcriptomes, the average value of GRAVY for all proteomics datasets (see Figs. S2 and S3 for references) increases significantly (Fig. 7f; p = 0.011). The proteomic data also exhibit a small decrease of pI (p = 0.083), which is expected for halophiles, but the increase of GRAVY (that is, higher hydrophobicity) is the opposite of the evolutionary trend for proteomes of halophilic organisms (Paul et al., 2008) and the metagenomic comparisons described above. Overall, the proteomic experiments record a significant decrease of n H 2 O in hyperosmotic stress (Fig. 7e; p = 0.016). We therefore conclude that n H 2 O is a metric with consistent behavior for field and laboratory datasets, since it records decreasing hydration state of proteins with increasing salinity in the Baltic Sea and Amazon River and plume and of differentially expressed proteins in microbial cells grown under hyperosmotic stress.

Conclusions
This study was focused on describing the chemical compositions of proteins in a geochemical context. The theoretical novelty of this study is the derivation of a compositional metric for stoichiometric hydration state (n H 2 O) that is largely decoupled from changes in oxidation state (Z C) of proteins. Therefore, based on mass-action effects in thermodynamics, n H 2 O is predicted to decrease toward higher salinity but be mostly insensitive to redox gradients. We found that protein sequences inferred from metagenomes in regional salinity gradients, including the Baltic Sea freshwater-marine transect and Amazon River and plume, are characterized by changes of n H 2 O in the predicted direction. Although this trend does not continue into hypersaline environments, the applicability of the compositional analysis to microbial cells is supported by compilations of transcriptomic and proteomic data, which indicate decreasing n H 2 O on average for the differentially expressed proteins in hyperosmotic stress experiments. The dehydration signal becomes larger during many time-course experiments and is stronger for most organic solutes than for NaCl.
The central message of this study is that geochemical and laboratory conditions can influence, but naturally do not completely determine, the chemical compositions of proteins. As a step toward constructing multidimensional chemical thermodynamic models of microbial communities, the present results provide evidence that different compositional metrics, representing the oxidation state and hydration state of molecules, can be associated specifically with redox and salinity gradients, respectively. The findings of this study underscore an opportunity for the integration of hydration state into evolutionary models that already consider changes in oxidation state or oxygen content of proteins (Acquisti et al., 2007; Poudel et al., 2018).
Code and data availability. All metagenomic and metatranscriptomic data analyzed here were obtained from public databases using the accession numbers listed in Supplement Table S1 for salinity gradients and Table S2 for redox gradients. The amino acid compositions of subsampled sequences from the metagenomic and metatranscriptomic data are available in the JMDplots R package, version 1.2.4 (https://github.com/jedick/JMDplots), which is archived on Zenodo (Dick, 2020b). Specifically, the data are contained in the file inst/extdata/gradH2O/MGP.rds, which can be read using the R function readRDS (minimum R version: 2.3.0). The compilation of differential gene expression data is available in the JMDplots package as xz-compressed CSV files in the directory inst/extdata/expression/osmotic/. The compilation of differential protein expression data is in the corresponding directory of the canprot R package, version 1.1.0 (https://cran.r-project.org/package=canprot), which is also archived on Zenodo (Dick, 2020c). The results of the compositional analysis of differential expression data, which are used for Fig. 7, are in the inst/vignettes/ directories of the JMDplots and canprot packages. The code used to make all of the figures and perform statistical testing is in the JMDplots package. The gradH2O.Rmd vignette in the package demonstrates the functions used to make the figures.
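For readers working outside R, the xz-compressed CSV files mentioned above can also be read with the Python standard library. The sketch below is self-contained and uses a small demonstration file that it writes itself; the file name `demo_dataset.csv.xz` and the column names `entry` and `log2FC` are hypothetical and do not describe the actual layout of the files in the packages, which should be checked there.

```python
import csv
import lzma
import os
import tempfile


def read_xz_csv(path):
    """Read an xz-compressed CSV file into a list of dicts (one per row)."""
    with lzma.open(path, mode="rt", newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


def write_demo_file(directory):
    """Write a tiny xz-compressed CSV with hypothetical columns and
    return its path; used only to make this example self-contained."""
    path = os.path.join(directory, "demo_dataset.csv.xz")
    with lzma.open(path, mode="wt", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["entry", "log2FC"])
        writer.writerow(["ID_A", "1.5"])
        writer.writerow(["ID_B", "-2.0"])
    return path


def demo():
    """Round-trip the demonstration file and return the parsed rows."""
    with tempfile.TemporaryDirectory() as directory:
        return read_xz_csv(write_demo_file(directory))
```

All values are returned as strings, so numeric columns need an explicit conversion (for example `float(row["log2FC"])`) before any calculation.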
Author contributions. JMD designed and carried out the analysis. JMD, MY, and JT interpreted the results. JMD wrote the article with editing input from MY and JT.