Diversity and distribution of nitrogen fixation genes in the oxygen minimum zones of the world oceans

Diversity and community composition of nitrogen (N) fixing microbes in the three main oxygen minimum zones (OMZs) of the world ocean were investigated using operational taxonomic unit (OTU) analysis of nifH clone libraries. Representatives of three of the four main clusters of nifH genes were detected. Cluster I sequences were most diverse in the surface waters, and the most abundant OTUs were affiliated with Alpha- and Gammaproteobacteria. Cluster II, III, and IV assemblages were most diverse at oxygen-depleted depths, and none of the sequences were closely related to sequences from cultivated organisms. The OTUs were biogeographically distinct for the most part: there was little overlap among regions, between depths, or between cDNA and DNA. In this study of all three OMZ regions, as well as in the few other published reports from individual OMZ sites, the dominance of a few OTUs was commonly observed. This pattern suggests a dynamic response of the components of the overall diverse assemblage to variable environmental conditions. Community composition in most samples was not clearly explained by environmental factors, but the most abundant OTUs were differentially correlated with the obvious variables: temperature, salinity, oxygen, and nitrite concentrations. Only a few cyanobacterial sequences were detected. The prevalence and diversity of microbes that harbor nifH genes in the OMZ regions, where low rates of N fixation are reported, remain an enigma.


Introduction
Nitrogen fixation is the biological process that introduces new biologically available nitrogen (N) into the ocean and, thus, constrains the overall productivity of large regions of the ocean where N is limiting to primary production. The most abundant and most important diazotrophs in the ocean are cyanobacteria, members of the filamentous genus Trichodesmium and several unicellular genera, including Crocosphaera sp. and the symbiotic genus Candidatus Atelocyanobacterium thalassa (UCYN-A). Although these cyanobacterial species are widespread and have different biogeographical distributions (Moisander et al., 2010), they are restricted to sunlit surface waters, mainly in tropical or subtropical regions.
Because diazotrophs have an ecological advantage in N-depleted waters, and because those conditions occur in the vicinity of oxygen minimum zones, due to the loss of fixed N by denitrification, it has been proposed that N fixation should be favored in regions of the ocean influenced by OMZs (Deutsch et al., 2007). It has also been suggested that the energetic constraints on N fixation might be partially alleviated under reducing, i.e., anoxic, conditions (Großkopf and LaRoche, 2012). In response to these ideas, the search for organisms with the capacity to fix nitrogen has recently focused on regions of the ocean that contain OMZs. That search usually takes the form of characterizing and quantifying one of the genes involved in the fixation reaction, nifH, which encodes the dinitrogenase reductase enzyme. Diverse nifH assemblages have been reported from the oxygen minimum zone of the eastern tropical South Pacific (Turk-Kubo et al., 2014; Loescher et al., 2016; Fernandez et al., 2011) and the Costa Rica Dome, at the edge of the OMZ in the eastern tropical North Pacific (Cheung et al., 2016). The search for non-cyanobacterial diazotrophs has resulted in the discovery of diverse nifH genes, but they have rarely been associated with significant rates of N fixation (Moisander et al., 2017; Bentzon-Tilia et al., 2015). Thus, the occurrence and diversity of putative diazotrophs in nitrogen-rich aphotic waters remains unexplained.
A. Jayakumar and B. B. Ward: Diversity and distribution of nitrogen fixation genes
Here, we report on the distribution and diversity of nifH genes in all three of the world ocean's major OMZs. The two Pacific OMZs, the eastern tropical North Pacific (ETNP) and the eastern tropical South Pacific (ETSP), are both highly productive eastern boundary regions. The ETSP is one of the most productive regions in the world ocean and has an oxygen-depleted layer of about 400 m at its greatest depth. The ETNP is less well ventilated and less productive, with an anoxic layer of more than 700 m. The third major OMZ is the Arabian Sea, which is geographically constrained to the northern Indian Ocean. It experiences an annual monsoon cycle but is permanently and stably stratified with a maximum anoxic layer of about 800 m. Both surface and anoxic depths as well as both DNA and cDNA (i.e., both the presence and expression of the nifH genes) were investigated. The approach used here to investigate diazotroph assemblages is based on clone library analysis of nifH sequences. Next-generation amplicon sequencing would yield greater numbers of sequences, although it might not overcome the primer bias associated with polymerase chain reaction (PCR) and cloning. The strength of the current study is the inclusion of similar data from all three OMZs. By comparing these results to previous studies using the same and other methods, we find robust biogeographical patterns and community structure among the non-cyanobacterial diazotroph assemblages.

Materials and methods
Samples analyzed for this study were collected from the three major OMZ regions of the world oceans (16 total samples, Table 1) from the surface and from oxygen minimum zone (OMZ, including oxycline and anoxic) depths. Particulate material from water samples (5-10 L), collected using Niskin samplers mounted on a CTD (Conductivity-Temperature-Depth) rosette system (Sea-Bird Electronics), was filtered onto Sterivex capsules (0.2 µm filter, Millipore, Inc., Bedford, MA) immediately after collection using peristaltic pumps. The filters were flash frozen in liquid nitrogen and stored at −80 °C until DNA and RNA could be extracted. For samples from the Arabian Sea, DNA extraction was carried out using the PUREGENE™ Genomic DNA Isolation Kit (Qiagen, Germantown, MD), and the RNA was extracted using the AllPrep DNA/RNA Mini Kit (Qiagen, Germantown, MD). For samples collected from the ETNP and ETSP, DNA and RNA were simultaneously extracted using the AllPrep DNA/RNA Mini Kit (Qiagen, Germantown, MD). Immediately after extraction and purification of the RNA, cDNA was synthesized using a SuperScript III First Strand Synthesis System (Invitrogen, Carlsbad, CA, USA) following the procedure described by the manufacturer, including no-RT controls. The extracted RNA was treated with DNase before reverse transcription, and the no-RT controls verified the absence of nifH DNA in the RNA preps. DNA was quantified using PicoGreen fluorescence (Molecular Probes, Eugene, OR) calibrated with several dilutions of phage lambda standards.
PCR amplification of nifH genes from environmental sample DNA and cDNA was done on an MJ100 Thermal Cycler (MJ Research) using a Promega PCR kit following the nested reaction (Zehr et al., 1998), with slight modification as in Jayakumar et al. (2017). Briefly, 25 µL PCR reactions containing 50 pmol of each outer primer and 20-25 ng of template DNA were amplified for 30 cycles (1 min at 98 °C, 1 min at 57 °C, and 1 min at 72 °C), followed by amplification with 50 pmol of each inner PCR primer (Zehr and McReynolds, 1989). Water for negative controls and PCR was freshly autoclaved and UV irradiated every day. Negative controls were run with every PCR experiment in order to minimize the possibility of amplifying contaminants (Zehr et al., 2003). The PCR preparation station was also UV irradiated for 1 h before use each day, and the number of amplification cycles was limited to 30 for each reaction. Each reagent was tested separately for amplification in negative controls. nifH bands were excised from PCR products after electrophoresis on 1.2 % agarose gel, and they were cleaned using a QIAquick Nucleotide Removal Kit (Qiagen). Clean nifH products were inserted into a pCR® 2.1-TOPO® vector using One Shot® TOP10 Chemically Competent E. coli and a TOPO TA Cloning® Kit (Invitrogen), according to the manufacturer's specifications. This process resulted in 30 clone libraries, 16 of DNA and 14 of cDNA, from the 16 samples (Table 1).
Inserted fragments were amplified with M13 Forward (−20) and M13 Reverse primers from randomly picked clones. PCR products were sequenced at Macrogen DNA Analysis Facility using Big Dye™ terminator chemistry (Applied Biosystems, Carlsbad, CA, USA). Sequences were edited using FinchTV version 1.4.0 (Geospiza Inc.), and they were checked for identity using BLAST. Consensus nifH sequences (359 bp) were translated to amino acid (aa) sequences (108 aa after trimming the primer region) and aligned using ClustalW in MEGA X (Kumar et al., 2018; Stecher et al., 2020) along with published nifH sequences from the NCBI database. The alignment was used to construct a maximum likelihood (ML) phylogenetic tree in MEGA X, based on the Poisson model, and the phylogenetic tree was edited using iTOL (Letunic and Bork, 2016). Bootstrap analysis was used to estimate the reliability of phylogenetic reconstruction (1000 iterations). The nifH sequence from Methanosarcina lacustris (AAL02156) was used as an outgroup. The GenBank accession numbers for the nifH sequences in this study are as follows: Arabian Sea DNA sequences JF429940-JF429973 and cDNA sequences JQ358610-JQ358707, ETNP DNA sequences KY967751-KY967929 and cDNA sequences KY967930-KY968089, and ETSP DNA sequences MK408165-MK408307 and cDNA sequences MK408308-MK408422. The nifH nucleotide alignment (of 787 sequences) was used to define operational taxonomic units (OTUs) on the basis of DNA sequence identity. Distance matrices based on this nucleotide alignment were generated in mothur (Schloss and Handelsman, 2009). The relative nifH richness within each clone library was evaluated using rarefaction analysis. OTUs were defined as sequences that differed by ≤ 3 % using the furthest-neighbor method in the mothur program (Schloss and Handelsman, 2009).
The 3 % OTU definition is similar to the level at which species are conventionally defined using 16S rDNA sequences, so it may overestimate the meaningful diversity of the functional gene. Redundancy analysis was performed in R using the vegan package. Environmental variables were transformed using decostand.
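The furthest-neighbor OTU definition described above can be illustrated with a minimal Python sketch. It is not the mothur implementation: the uncorrected p-distance, the function names, and the toy sequences are assumptions made for illustration, and real analyses would start from the aligned nifH nucleotide sequences.

```python
from itertools import combinations


def p_distance(a, b):
    """Fraction of differing positions between two aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def furthest_neighbor_otus(seqs, cutoff=0.03):
    """Complete-linkage (furthest-neighbor) clustering of named sequences.

    Two clusters are merged only while the LARGEST pairwise distance
    between their members stays <= cutoff.
    """
    names = list(seqs)
    dist = {frozenset((i, j)): p_distance(seqs[i], seqs[j])
            for i, j in combinations(names, 2)}
    clusters = [[n] for n in names]

    def linkage(c1, c2):
        # Cluster-to-cluster distance is the maximum over cross-cluster pairs.
        return max(dist[frozenset((i, j))] for i in c1 for j in c2)

    while len(clusters) > 1:
        d, a, b = min((linkage(clusters[a], clusters[b]), a, b)
                      for a, b in combinations(range(len(clusters)), 2))
        if d > cutoff:
            break
        clusters[a] = clusters[a] + clusters[b]  # a < b, so deleting b is safe
        del clusters[b]
    # Number OTUs by decreasing size (OTU-1 is the largest).
    clusters.sort(key=len, reverse=True)
    return {f"OTU-{k}": sorted(c) for k, c in enumerate(clusters, start=1)}


def substitute(seq, positions):
    """Hypothetical helper: introduce point substitutions at the given positions."""
    swap = {"A": "C", "C": "A", "G": "T", "T": "G"}
    return "".join(swap[ch] if i in positions else ch for i, ch in enumerate(seq))


# Toy 100-nt "alignment" (hypothetical data, not sequences from this study).
ref = "ACGT" * 25
toy = {
    "s1": ref,
    "s2": substitute(ref, {0, 10}),       # 2 % from s1
    "s3": substitute(ref, {20, 30, 40}),  # 3 % from s1, 5 % from s2
    "s4": "TGCA" * 25,                    # differs from s1 at every position
}
print(furthest_neighbor_otus(toy, cutoff=0.03))
print(furthest_neighbor_otus(toy, cutoff=0.10))
```

With these toy sequences the 3 % cutoff yields three OTUs, because s3 is within 3 % of s1 but 5 % from s2, whereas the 10 % cutoff yields two. This is the behavior expected of furthest-neighbor clustering: every member of an OTU must be within the threshold of every other member.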

Results and discussion
DNA and cDNA sequences (787 in total) derived from the OMZ regions of the Arabian Sea (AS), the eastern tropical North Pacific (ETNP), and the eastern tropical South Pacific (ETSP) were subjected to OTU and phylogenetic analyses to compare the diversity and community composition, biogeography, and gene expression of nifH-possessing microbes among the three OMZ regions. Phylogenetic analyses of the sequences from the AS, ETNP, and ETSP have been reported separately in previous publications (Jayakumar et al., 2012; Jayakumar et al., 2017; Chang et al., 2019), but the sequences have been combined for additional global analyses here. We compared the threshold OTU definitions at 3 % and 10 % and found that the number of OTUs decreased, as expected, as the resolution decreased. Even at the 3 % threshold, however, OTUs tended to separate by depth and location, indicating a functionally useful distinction at this level. OTU thresholds of 3 %-5 % correspond to within- and between-species distinctions for nifH (Gaby et al., 2018). The sequences from the OMZ regions represented three of the four sequence clusters (I, II, III, and IV) described by Zehr et al. (1998).

Cluster I nifH OTU distributions
Diversity analysis of the nifH Cluster I sequences from the three OMZs, based on OTUs defined in mothur, identified 41 OTUs at a distance threshold of 3 % (Table 2). The number of sequences and the number of OTUs varied widely among depths and stations, so the results are grouped by region (AS, ETNP, and ETSP), by depth horizon (surface or OMZ, including upper oxycline depths), or by cDNA vs. DNA (Table 2). This grouping allows for the detection of patterns that are not driven by the relatively low number of sequences obtained from some of the individual clone libraries. The OTUs are numbered in order of decreasing abundance in the clone library, i.e., OTU-1 was the most common OTU; OTU designations for Cluster I are listed in Table 1 in the Supplement.
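The numbering convention above (OTU-1 is the most abundant) and the ≥ 2 % cut-off used to call an OTU "common" in the Fig. 1 caption can be sketched as follows. The function name and the input labels are hypothetical; the counts are toy values, not data from this study.

```python
from collections import Counter


def rank_otus(cluster_labels, min_fraction=0.02):
    """Number clusters by decreasing abundance and flag the common ones.

    cluster_labels: one arbitrary cluster label per sequence.
    Returns a list of dicts, most abundant first.
    """
    counts = Counter(cluster_labels)
    total = sum(counts.values())
    table = []
    for rank, (label, n) in enumerate(counts.most_common(), start=1):
        table.append({
            "otu": f"OTU-{rank}",
            "label": label,
            "n": n,
            "fraction": n / total,
            "common": n / total >= min_fraction,  # >= 2 % of all clones
        })
    return table


# Toy assignment of 100 clones to four clusters (hypothetical).
labels = ["a"] * 60 + ["b"] * 30 + ["c"] * 9 + ["d"] * 1
for row in rank_otus(labels):
    print(row["otu"], row["n"], row["common"])
```

In this toy case the single-sequence cluster falls below the 2 % cut-off and would be excluded from a histogram of common OTUs.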
For all regions and depths combined, the number of OTUs detected (41) was less than the sum of OTUs detected when each region was analyzed separately (45), indicating that there was some overlap of OTUs among regions. However, the overlap was not large. Only 3 of the 12 most abundant OTUs contained sequences from more than one region, and none contained sequences from all three regions (Fig. 1a). When sequences for all three regions were combined, only a minority of OTUs contained sequences from both depth horizons (Fig. 1b). Most OTUs represented a single depth, and many represented a single sample. This suggests a pattern of dominance, rather than evenness, in the nifH assemblage. Therefore, deeper sequencing is expected to discover a larger number of rare OTUs, but it might not change the picture that emerges here of a small number of abundant clades. Interestingly, Cheung et al. (2016) reported a similar pattern of dominance based on a larger DNA sequence dataset from only one location. They used 454 pyrosequencing and obtained a similar number of OTUs (37 total) from the Costa Rica Dome; all 15 samples that they investigated were dominated (> 50 %) by one of five major OTUs. The Arabian Sea was strikingly less diverse than other regions and sample subsets (Fig. 2). For example, when all DNA and cDNA sequences for all depths are grouped together, the Arabian Sea (OTUs = 14, Chao = 21) has lower OTU richness than the combined surface samples from all three regions (OTUs = 25, Chao = 52), despite having a similar number of total sequences (178 for the Arabian Sea and 198 for all surface samples combined). This lack of diversity in the AS data may be partly due to the preponderance of cDNA sequences, which generally contained less diversity than a similar number of DNA sequences (see below).
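The Chao richness values quoted above estimate total OTU richness from the number of rare OTUs. A minimal sketch follows, assuming the bias-corrected Chao1 form; the text does not state which variant was computed, and the counts below are toy values rather than data from Table 2.

```python
def chao1(otu_counts):
    """Bias-corrected Chao1 richness estimate.

    S_chao1 = S_obs + F1 * (F1 - 1) / (2 * (F2 + 1)),
    where F1 and F2 are the numbers of OTUs seen exactly once and twice.
    """
    s_obs = sum(1 for n in otu_counts if n > 0)
    f1 = sum(1 for n in otu_counts if n == 1)  # singletons
    f2 = sum(1 for n in otu_counts if n == 2)  # doubletons
    return s_obs + f1 * (f1 - 1) / (2 * (f2 + 1))


# Toy library: 7 observed OTUs, 3 singletons, 2 doubletons (hypothetical).
counts = [10, 5, 2, 2, 1, 1, 1]
print(chao1(counts))  # 7 + 3*2 / (2*3) = 8.0
```

Because the estimate depends only on the singleton and doubleton counts, small clone libraries with many singletons give large and uncertain values, which is consistent with the caution expressed in the text about these estimates.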
Although similar numbers of sequences were obtained for cDNA (255) vs. DNA (257), the OTU "density", i.e., number of OTUs per number of sequences analyzed, was higher for DNA (0.136 for DNA, 0.094 for cDNA). The Chao statistic verified this observation for the combined data from each region in predicting higher total numbers of OTUs for DNA (Chao = 42) than for cDNA (Chao = 24). This difference could indicate that some of the nifH genes present were not expressed at the time of sampling, but the cDNA sequences were not simply a subset of the DNA community. Half of the 12 most abundant OTUs contained either cDNA or DNA but not both (Fig. 1c), meaning that some genes were not detectably expressed and some expressed genes could not be detected in the DNA. Based on a similar number of sequences from each sample (1-52 per sample) from the ETSP, Turk-Kubo et al. (2014) also found that DNA and cDNA clones were differently distributed among stations; one phylotype was recovered exclusively from cDNA and only one phylotype occurred in both DNA and cDNA. The relatively low sequencing depth associated with clone library studies limits the sensitivity of this comparison, but it clearly shows that dominant components of the DNA and cDNA libraries frequently represent different subsets of the total assemblage. For all regions combined, similar numbers of OTUs were detected in surface waters (OTUs = 25) and in OMZ samples (OTUs = 23), although a larger number of sequences was analyzed for the OMZ environment (198 vs. 314 sequences for surface and OMZ depths, respectively). It might be expected that the presence of phototrophic diazotrophs in the surface water would lead to greater diversity there, but only one OTU representing a known cyanobacterial phototroph (Katagnymene spiralis or Trichodesmium in OTU-12) was identified, so most of the additional diversity must be present in heterotrophic or unknown sequences.
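The OTU "density" and the DNA vs. cDNA comparison above amount to simple arithmetic and set bookkeeping, sketched below. The OTU counts of 35 (DNA) and 24 (cDNA) are back-calculated here from the reported densities and sequence totals and are an inference, not values quoted in the text; the function names and the toy OTU-to-template mapping are hypothetical.

```python
def otu_density(n_otus, n_sequences):
    """Number of OTUs per sequence analyzed."""
    return n_otus / n_sequences


def split_by_template(otu_sources):
    """Group OTUs by whether they were found in DNA, cDNA, or both.

    otu_sources: dict mapping OTU name -> set of templates ({"DNA"}, {"cDNA"}, or both).
    """
    groups = {"DNA only": [], "cDNA only": [], "both": []}
    for otu, sources in otu_sources.items():
        if sources == {"DNA"}:
            groups["DNA only"].append(otu)
        elif sources == {"cDNA"}:
            groups["cDNA only"].append(otu)
        elif sources == {"DNA", "cDNA"}:
            groups["both"].append(otu)
        else:
            raise ValueError(f"unexpected template set for {otu}: {sources}")
    return groups


# Densities consistent with the reported 0.136 (DNA) and 0.094 (cDNA).
print(round(otu_density(35, 257), 3), round(otu_density(24, 255), 3))

# Toy mapping (hypothetical, not the Fig. 1c data).
toy = {"OTU-1": {"DNA", "cDNA"}, "OTU-2": {"DNA"}, "OTU-3": {"cDNA"}}
print(split_by_template(toy))
```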
Figure 1. Histogram of the 12 most common OTUs from the Cluster I nifH clone libraries from the three OMZ regions. OTUs were considered common if the total number of sequences in an OTU was ≥ 2 % of the total number of nifH clones analyzed (the common OTUs contained 441 of the 512 Cluster I sequences). OTUs were defined according to 3 % nucleotide sequence difference using the furthest-neighbor method. OTU designation is from most common (OTU-1) to least common.
Rarefaction curves (Fig. 2) indicate that sampling did not approach saturation for any region or depth. The Chao statistic also indicated that much diversity remains to be explored, despite the great uncertainty in these estimates. The total number of OTUs detected, the shape of the rarefaction curve, and the diversity indicators (Fig. 2, Table 2) all indicate that the greatest nifH diversity occurred in surface waters and that much of that diversity was in singletons, i.e., not represented in the 12 most abundant OTUs, which represented 441 (86 %) of the total 512 nifH Cluster I sequences analyzed. Most of that diversity was contained in the ETNP, and this was not solely a function of the number of sequences analyzed (Fig. 2).
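The rarefaction analysis referred to above asks how many OTUs would be expected in a random subsample of n sequences. A minimal sketch of the standard analytical (hypergeometric) rarefaction formula is given below; it is offered as an illustration of the calculation rather than as the mothur procedure, and the counts are toy values.

```python
from math import comb


def rarefy(otu_counts, n):
    """Expected number of OTUs in a random subsample of n sequences.

    E[S_n] = sum_i (1 - C(N - N_i, n) / C(N, n)),
    where N is the library size and N_i the size of OTU i.
    """
    total = sum(otu_counts)
    if not 0 <= n <= total:
        raise ValueError("n must be between 0 and the library size")
    # math.comb returns 0 when n exceeds N - N_i, so abundant OTUs
    # contribute a probability of 1 once n is large enough.
    return sum(1 - comb(total - c, n) / comb(total, n) for c in otu_counts)


# Toy library of 22 sequences in 7 OTUs (hypothetical).
counts = [10, 5, 2, 2, 1, 1, 1]
curve = [rarefy(counts, n) for n in range(0, sum(counts) + 1)]
print([round(v, 2) for v in curve])
```

A curve that is still rising steeply at the full library size, as in this toy example with several singletons, indicates that sampling has not approached saturation.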

Cluster I nifH phylogeny
Phylogenetic affiliations at both the DNA and protein level are shown for the 12 most abundant OTUs in Table 3. The most abundant OTU (129 sequences), OTU-1, contained Gammaproteobacterial DNA and cDNA sequences from both the surface and OMZ depths of the ETNP as well as cDNA sequences from oxycline and OMZ depths in the Arabian Sea (Fig. 3). Although these sequences were very similar to each other, none had higher than 91 % identity at the DNA level (96 % at the aa level) with cultivated strains; they were most closely related to Pseudomonas stutzeri. P. stutzeri is a commonly isolated marine denitrifier, but it is also known to possess the capacity for N fixation (Krotzky and Werner, 1987). OTU-4, OTU-6, and OTU-8 also contained Gammaproteobacterial sequences. All had high identity with cultivated strains at the protein level, but none were > 91 % identical to cultivated strains at the DNA level.
Gammaproteobacterial sequences with very close identities to Azotobacter vinelandii have been reported from the Arabian Sea ODZ and also from the ETSP (Turk-Kubo et al., 2014). (Here, ODZ, or oxygen-deficient zone, refers to the depths where oxygen concentrations are low enough to induce anaerobic metabolism, whereas OMZ denotes the oceanographic region where low-oxygen waters are found.) This group of nifH sequences with close identities to A. vinelandii was also retrieved from the English Channel, Himalayan soil, the South Pacific gyre, the Gulf of Mexico, mangrove soil, and many other environments (Fig. 3). Azotobacter-like sequences were included in OTU-6 but were not the closest identity at the DNA level. Although a large number of clones were analyzed here, no sequence that was closely associated with A. vinelandii was retrieved from the three regions. None of the g-244774A11 sequences, Gammaproteobacterial relatives that were abundant in the South Pacific (Moisander et al., 2014), were detected in this study.
Table 3. OTU identities for both clusters. Cultivated species with closest nucleotide identity to the OTUs identified in the nifH clone libraries from three OMZ regions. Only the 12 most common OTUs (out of 41 total) are listed for Cluster I sequences, and the 11 most common (out of 18 total) are listed for the Cluster II, III, and IV libraries.
OTUs-2, -3, -5, -10, and -11 all represented Alphaproteobacterial sequences, with closest identities to various Bradyrhizobium, Sphingomonas and Methylosinus species. Thus, Alphaproteobacterial sequences (206 sequences) were the most abundant in the clone library. OTU-2 almost exclusively contained ETSP ODZ DNA and cDNA sequences (as well as one AS ODZ DNA sequence). OTU-3 contained DNA sequences from ETNP surface waters. OTU-5 exclusively contained Arabian Sea DNA sequences from Station 3, whereas OTU-10 contained only surface samples from the ETNP. An OTU threshold of 11 % grouped all (179 sequences in five OTUs) of these Alphaproteobacterial sequences together, but the 3 % threshold is consistent with the phylogenetic tree, which shows small-scale biogeographical separation of sequence groups.
OTUs-7 and -9 were identified as Betaproteobacteria with closest identities to Rubrivivax gelatinosus and Burkholderia, 91 % and 90 %, respectively, at the DNA level. However, at the aa level, these sequences were 99 % and 100 % identical to Novosphingobium malaysiense and S. azotifigens, respectively, both Alphaproteobacteria, and they were again biogeographically distinct. OTU-7 contained 25 DNA sequences from the ODZ depths in the Arabian Sea, and OTU-9 contained 17 Burkholderia-like sequences from the oxycline at Station 1 in the Arabian Sea. No Betaproteobacterial nifH sequences were detected in the ETNP or ETSP, but sequences similar to Burkholderia phymatum, Cupriavidus sp., and Sinorhizobium meliloti have previously been reported from the ETSP (Fernandez et al., 2015). Consistent with our previous report, however, there is no clear separation between the alpha and the beta groups in nifH phylogeny (Jayakumar et al., 2017).
Most of the Cluster I ETSP sequences from this study were contained in two OTUs (2 and 4). OTU-2 contained 89 Alphaproteobacterial sequences with > 98 % identity to nifH sequences from Bradyrhizobium sp. Uncultured bacterial sequences retrieved from the South China Sea, the English Channel, mangrove sediment, wastewater treatment, and grassland soil were related to these ETSP sequences. OTU-4 contained 29 Gammaproteobacterial sequences retrieved from both the surface and ODZ depths. Four of the remaining ETSP Cluster I sequences were grouped together as OTU-17 (Alphaproteobacteria, 89 % and 96 % identities with Methyloceanibacter sp. and Bradyrhizobium sp. at the DNA and aa level, respectively), three were in OTU-23 (Bradyrhizobium 100 % identity), and two were singletons. One of the singletons was most closely related to uncultured soil and sediment sequences and to Azorhizobium sp. (86 %) and one had 97 % identity with Bradyrhizobium denitrificans and many sequences from marine sediments.
OTU-22 represents the Deltaproteobacterial group. This novel group has been previously reported from the ETNP (Jayakumar et al., 2017) and is represented here by three sequences from the Arabian Sea (OTU-22) and two singletons from ETNP surface waters. nifH-possessing Deltaproteobacteria have been reported not only from all three ODZs but also from several other marine environments including the Chesapeake Bay water column, microbial mats from an intertidal sandy beach on a Dutch barrier island, Jiaozhou Bay sediment, Rongcheng Bay sediment, the Bohai Sea, the Mediterranean Sea, Narragansett Bay, and the South Pacific gyre.
Proteobacteria-like sequences, especially Alpha- and Gammaproteobacteria, are the most frequently reported nifH sequences from the OMZs studied here and similar environments. A total of 31 of 37 OTUs detected by Cheung et al. (2016) in the Costa Rica Dome OMZ were Proteobacteria, with the two most common OTUs being closely related to the Alphaproteobacterium Methylocella palustris and the Gammaproteobacterium Vibrio diazotrophicus. Loescher et al. (2014, 2016) also found V. diazotrophicus-like sequences as well as several other Gammaproteobacteria in the ETSP. V. diazotrophicus has been previously reported in the Arabian Sea (Jayakumar et al., 2012) but was not prominent in the present study. Sequences most similar to V. diazotrophicus, other Vibrio species, and other Gammaproteobacteria, including P. stutzeri, were the most common non-cyanobacterial Cluster I sequences reported for the low-oxygen waters of the Southern California Bight (Hamersley et al., 2011). Bradyrhizobium spp., one of the most common genera reported here, in surface waters of the Arabian Sea (Bird and Wyman, 2013), and in the ETSP (Fernandez et al., 2011), were also detected in the Costa Rica Dome OMZ and were the dominant OTU at 1000 m at one station (Cheung et al., 2016). Bradyrhizobium-like sequences were the most abundant among those amplified from ODZ incubations in which the N2 fixation rate was enhanced by the addition of glucose (Bonnet et al., 2013). In addition to Bradyrhizobium-like and Teredinibacter-like nifH sequences, Turk-Kubo et al. (2014) found four other abundant Gammaproteobacteria-like nifH sequences, which were entirely novel. The "Gamma A" sequences, which are commonly reported non-cyanobacterial diazotroph nifH sequences from non-OMZ environments (Langlois et al., 2015; Moisander et al., 2017), were represented by a singleton from the ETNP in the present study.
Figure 4. Histogram of the six most common OTUs from the Cluster II, III, and IV nifH clone libraries from the three OMZ regions. OTUs were considered common if the total number of sequences in an OTU was ≥ 2 % of the total number of nifH clones analyzed (the common OTUs contained 252 of the 275 Cluster II, III, and IV sequences). OTUs were defined according to 3 % nucleotide sequence difference using the furthest-neighbor method. OTU designation is from most common (OTU-1) to least common.
nifH sequences related to various Alphaproteobacterial methylotrophs are commonly found in OMZs: Methylosinus trichosporium-like sequences, which are reported here in OTU-5 from the Arabian Sea both at the surface and at ODZ depths, were also reported by Fernandez et al. (2011) in the ETSP. Methylocella palustris-like nifH genes comprised the most common OTU in the ODZ core depths in the Costa Rica Dome (Cheung et al., 2016). M. trichosporium and M. palustris represent obligate and facultative methanotrophs, respectively, and are both also obligately aerobic. Detection of nifH genes closely related to those of methanotrophs does not prove that methanotrophy is present or important in the anoxic environment of the ODZ, but the consistency of this finding across sites motivates further investigation of the potential for methane production and consumption in ODZs.
The pattern of high diversity of nifH-bearing, mostly heterotrophic microbes, in addition to the dominance of one or a small number of nifH OTUs in each sample, suggests a bloom-and-bust pattern of organic-matter-supported growth. That is, we suggest that organic matter, which is supplied episodically in the upwelling regimes, stimulates the growth of copiotrophic microbes that respond rapidly in a bloom-like fashion. This bloom scenario has been described for denitrifying bacteria based on the OTU patterns observed in the nirS and nirK genes as a function of the stage of denitrification in both natural assemblages and incubated samples from OMZs (Jayakumar et al., 2009). Amino acids and glucose both stimulated N2 fixation in OMZ samples from the ETSP, and nifH sequences associated with Alpha- and Gammaproteobacteria, as well as Cluster III phylotypes, were found in a glucose enrichment experiment (Bonnet et al., 2013). The role of nifH in these heterotrophic microbes is unclear, especially because rates of nitrogen fixation in these locations in the absence of cyanobacteria or nutrient enrichment are often very low (Turk-Kubo et al., 2014; Loescher et al., 2016; Chang et al., 2019).
Although Trichodesmium-like clones have been retrieved from the surface waters of the Arabian Sea and the ETNP OMZs, only 10 clones in the combined clone library analyzed here were related to Trichodesmium (98 % identity), including both cDNA and DNA from the Arabian Sea and cDNA from the ETNP. These sequences were actually 100 % identical to Katagnymene spiralis, a close relative of Trichodesmium isolated from the South Pacific Ocean. Turk-Kubo et al. (2014) also retrieved only a few cyanobacterial sequences from the ETSP. No other cyanobacterial nifH sequences were identified.

Clusters II, III, IV nifH OTU distributions
The other three nifH clusters were combined for OTU analysis due to the limited number of sequences and OTUs obtained. A total of 18 OTUs were identified in the combined set of 275 sequences with a 3 % distance threshold (Table 2); OTU designations for Cluster II, III, and IV are listed in Table 2 in the Supplement. Most of the Cluster II, III, and IV sequences were from the ETNP and ETSP. As with the Cluster I sequences, there was very little geographic and depth overlap among these OTUs (Fig. 4a, b). Only OTU-1 contained sequences from more than one site, the ETNP and the ETSP. OTU-2 contained only cDNA sequences representing ODZ depths at both ETNP stations. OTU-3 exclusively contained ETSP DNA sequences from the surface and cDNA sequences from ODZ depths. Only 10 of the Cluster II, III, and IV sequences were from the Arabian Sea, and they formed three separate OTUs, a greater "OTU density" than was present at either of the Pacific sites. As observed for Cluster I, most of the OTUs that were detected in the DNA were not being expressed, and those that were expressed were not detected in the DNA (Fig. 4c).
Rarefaction curves (Fig. 5) indicate that sampling for Cluster II, III, and IV did not approach saturation. The Chao statistic also indicated that much diversity remains to be explored, despite the great uncertainty in these estimates. Unlike the Cluster I analysis, there were relatively few singletons in the Cluster II, III, and IV data, and the assemblages were dominated by a few types.

Cluster II, III, and IV nifH phylogeny
Four large OTUs (OTU-1, -2, -4, and -6) in clusters II, III, and IV belonged to nifH Cluster IV, and Alphaproteobacteria/Spirochaeta and Deltaproteobacteria were the dominant phylogenies (Table 3, Fig. 6). The largest OTU, OTU-1, contained 88 DNA sequences from the ETNP ODZ depths from both stations and from both depths in the ETSP. This OTU had no similarity to any cultured microbe. OTU-4 contained 30 sequences from the ETSP, all cDNA from one surface station, in nifH Cluster IV.
Figure 6. Maximum likelihood (ML) phylogenetic tree, based on the Poisson model, of Cluster II, III, and IV partial nifH-translated amino acid sequences from DNA and cDNA. Bootstrap values > 50 % of 1000 replications are labeled with black circles on the branches. Accession numbers of reference sequences from NCBI are provided at the end of each reference name. Positions of the OTUs are shown relative to their nearest neighbors from the database. Individual sequence identities comprising each OTU are listed in Table 3.
OTU-2 (75 sequences) in Cluster IV contained only cDNA sequences, all from ODZ samples in the ETNP (both stations), and had no close relatives among cultivated species. Although Turk-Kubo et al. (2014) retrieved a few clones identified as belonging to Cluster II from the euphotic zone of the ETSP, we did not find any sequence falling into this cluster. OTU-3 contained 35 sequences in Cluster III and was dominated by DNA sequences from surface depths of the ETSP. OTU-5 represented Deltaproteobacteria in nifH Cluster III and contained 18 identical DNA sequences from 90 m at Station BB1 in the ETNP. Thus, of the five most common OTUs (89 % of the total Cluster II, III, and IV sequences analyzed), only one could be assigned to a closely related genus (i.e., OTU-4 with 90 % identity with R. palustris), and there was no overlap between DNA and cDNA OTUs from the same depths.
Figure 7. Redundancy analysis for (a) Cluster I and (b) clusters II, III, and IV, illustrating the relationships among OTUs (green circles containing the OTU number) and sites. DNA is represented using squares, and cDNA is represented using circles. The Arabian Sea is cyan (surface) and blue (OMZ), the ETNP is pink (surface) and red (deep), and the ETSP is yellow (surface) and orange (deep). Panel (a) shows the 12 most abundant OTUs for Cluster I and the four most independent environmental variables: T denotes temperature, S denotes salinity, NO2 denotes the nitrite concentration, and O2 denotes the oxygen concentration. Panel (b) shows the six most abundant OTUs for clusters II, III, and IV and all six environmental variables; NO3 denotes the nitrate concentration and Z denotes depth.
The other 13 OTUs in the Cluster II, III, and IV sequences represented either Cluster III or IV. None of these were very closely related to any sequences from cultivated organisms. OTU-6 contained both DNA and cDNA from the OMZ at one ETSP station. OTU-7 contained four sequences from ETNP surface waters with close identities to a sequence retrieved from the Bohai Sea. OTU-11 had one DNA and one cDNA sequence from the ETSP. All of the other sequences were less than 84 % identical to any sequence in the database and could only be loosely identified as Firmicutes or Proteobacteria.
Although there were few high identities with known species, many of the Cluster II, III, and IV sequences (OTUs -2, -5, -7, -9, and -10) were most closely affiliated with sulfate-reducing clades at either the DNA or protein level.
Four OTUs with highest identity to known sulfate reducers were reported by Cheung et al. (2016), and one of them comprised nearly 40 % of the sequences in one anoxic sample. nifH sequences that cluster with Desulfovibrio spp. are often reported from ODZ samples (Turk-Kubo et al., 2014; Loescher et al., 2014; Fernandez et al., 2011). Consistent reports of nifH genes associated with obligate anaerobes involved in sulfate reduction suggest a role for this metabolism in the ODZ, again motivating further research on the significance of both sulfate reduction and associated N2 fixation in ODZ waters.

Biogeography and environmental correlations
The dominant factor determining OTU composition and distribution is clearly biogeography (Fig. 4). That geographical factor is also evident in the redundancy analysis (Fig. 7). (Only sites that contained sequences from one of the top OTUs are represented in the plots, so the number of site symbols is fewer than 30 for both plots.) For example, Cluster I OTU-5, which contained only Arabian Sea surface sequences, was positively correlated with both temperature (T) and salinity (S), and all of the Arabian Sea samples clustered in the quadrant associated with high T and S (Fig. 7a). Surface samples from the ETSP were also in that quadrant, but surface ETNP samples were negatively correlated with S. The surface ETNP samples correlated with OTUs-3, -6, -10, and -11, all of which contained exclusively surface samples. The two largest Cluster I OTUs were associated with the deep samples from the ETNP and ETSP and correlated positively with nitrite concentration and negatively with oxygen, a signature of the OMZ. Nitrate concentration and depth did not increase the power of the analysis and were omitted from the Cluster I RDA. Most of the sites and five of the most common Cluster I OTUs were not well differentiated by any of the usual environmental parameters.
The Arabian Sea contained very few sequences in clusters II, III, and IV, and none of them were in the top six OTUs, so only ETNP and ETSP samples are represented in the RDA for these clusters (Fig. 7b). The two largest OTUs in clusters II, III, and IV were negatively correlated with T and S but separated along the second RDA axis, demonstrating opposite relationships with oxygen, nitrite, and nitrate concentrations. OTU-1 included ETSP surface sequences, as well as ODZ sequences from both the ETNP and ETSP, whereas OTU-2 contained only ODZ sequences; nevertheless, both OTUs were phylogenetically related to anaerobic clades (Table 2). Inclusion of all six environmental variables was necessary to obtain maximum separation of the sites and OTUs for clusters II, III, and IV.
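For readers unfamiliar with redundancy analysis, the calculation behind ordinations of this kind can be sketched in a few lines. The example below is a minimal illustration under stated assumptions, not the procedure or software used in this study: it assumes the classical formulation of RDA (multivariate linear regression of centered OTU abundances on standardized environmental variables, followed by a principal component analysis of the fitted values). The function name `rda`, the matrix dimensions, and the data are hypothetical; the data are randomly generated solely so that the code runs.

```python
import numpy as np


def rda(Y, X):
    """Classical redundancy analysis (illustrative sketch).

    Y: sites x OTUs abundance matrix; X: sites x environmental variables.
    Assumes no environmental variable is constant across sites.
    """
    Y = np.asarray(Y, dtype=float)
    X = np.asarray(X, dtype=float)
    n = Y.shape[0]

    # Center the responses and standardize the predictors.
    Yc = Y - Y.mean(axis=0)
    Xs = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

    # Multivariate least-squares regression of Yc on Xs.
    B, *_ = np.linalg.lstsq(Xs, Yc, rcond=None)
    Y_fit = Xs @ B

    # PCA of the fitted values via singular value decomposition.
    U, s, Vt = np.linalg.svd(Y_fit, full_matrices=False)
    eigenvalues = s**2 / (n - 1)

    # Number of constrained axes that can carry variance.
    rank = min(X.shape[1], Y.shape[1], n - 1)
    total_variance = Yc.var(axis=0, ddof=1).sum()

    return {
        "eigenvalues": eigenvalues[:rank],
        "site_scores": (U * s)[:, :rank],       # site positions on RDA axes
        "otu_scores": Vt.T[:, :rank],           # OTU loadings on RDA axes
        "constrained_fraction": eigenvalues.sum() / total_variance,
    }


# Synthetic data only (NOT values from this study): 12 sites,
# 4 environmental variables (e.g., T, S, NO2, O2), and 5 OTUs.
rng = np.random.default_rng(0)
X = rng.normal(size=(12, 4))
Y = X @ rng.normal(size=(4, 5)) + 0.5 * rng.normal(size=(12, 5))

result = rda(Y, X)
```

In this sketch, `constrained_fraction` is the share of total OTU variance explained by the environmental variables, and the first two columns of `site_scores` and `otu_scores` are the coordinates that would be plotted in a biplot such as Fig. 7. Adding predictors cannot decrease the constrained fraction, which is one reason to check whether extra variables actually improve the separation of sites and OTUs, as was done here for nitrate and depth.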