Ideas and perspectives: When ocean acidification experiments are not the same, repeatability is not tested

Can experimental studies on the behavioural impacts of ocean acidification be trusted? That question was raised in early 2020 when a high-profile paper failed to corroborate previously observed responses of coral reef fish to high CO2. New information on the methodologies used in the “replicated” studies now provides a plausible explanation: the experimental conditions were substantially different. High sensitivity to test conditions is characteristic of ocean acidification research; such response variability shows that effects are complex, interacting with many other factors. Open-minded assessment of all research results, both negative and positive, remains the best way to develop process-based understanding. As in other fields, replication studies in ocean acidification are most likely to contribute to scientific advancement when carried out in a spirit of collaboration rather than confrontation.


Introduction
Ocean acidification involves a reduction in seawater pH (increased hydrogen ion concentration), currently caused by increased carbon dioxide (CO2) in the atmosphere. Associated chemical changes include an increased concentration of bicarbonate ions and dissolved inorganic carbon and a decreased concentration of carbonate ions in the ocean and, unless compensated for, the body fluids of marine organisms. Although the chemistry of the carbonate system has been well understood for decades, research on the biological and ecological implications of anthropogenic ocean acidification only began in earnest about 20 years ago (Gattuso and Hansson, 2011). A wide range of potential consequences have since been identified, with an early appreciation of the diverse vulnerability of plant and animal species (Kroeker et al., 2013; Wittmann and Pörtner, 2013). Effects on the production of shells and skeletons have been a major research focus; however, reduced calcification is not the only impact, since there is also strong evidence for low pH affecting many other physiological processes (Pörtner et al., 2014; Baumann, 2019; Hurd et al., 2020), including vertebrate and invertebrate behaviour (Clements and Hunt, 2015; Cattano et al., 2018; Zlatkin and Heuer, 2019). Laboratory experiments have investigated the biological impacts of ocean acidification through a reductionist approach; i.e. conditions are deliberately simplified. This approach has the advantage of enabling statistical testing of cause and effect for single factors, yet necessarily omits many of the complexities of natural conditions, which may involve temporal as well as biotic and abiotic environmental factors (Kapsenberg and Cyronak, 2019).

The challenge of contradictory results
A two-step experiment has been used by many research groups to investigate the possible effects of ocean acidification on fish behaviour. Initially, individual fish are given a binary choice of water conditions in a flume tank, with one choice including an odour (e.g. from predators or a conspecific alarm cue) known to elicit an avoidance response. Those observations of discriminatory ability then provide the "control" strength of preference, for comparison with treatment results using the same choice under raised CO2 (lowered pH) conditions throughout the test tank. Several versions of such experimental conditions and treatments have been developed, with differences between protocols known to affect the strength of the response change (Jutfelt et al., 2017).
Based on that binary-choice approach and with the intention of replicating previous work, Clark et al. (2020a) reported their findings in an unambiguously titled paper: "Ocean acidification does not impair the behaviour of coral reef fishes". To exclude the possibility of inadvertent observer bias, they deployed video recording and automatic tracking software in their study, making that digital information openly available. They also used data simulations to conclude that previously reported results were "highly improbable", with an estimated likelihood of 0 out of 10 000, assuming identical experimental conditions and that their own data were valid. Since Clark et al. (2020a) went to "great lengths" (in their own words) to replicate earlier work yet failed to observe the same effects, there was the implication that other researchers' work was either flawed or fraudulent, reflecting earlier concerns expressed by Clark et al. (2016) and Clark (2017).
In the context of a reported "crisis" in research reproducibility for many disciplines (Baker, 2016; Nature, 2018), the study by Clark et al. (2020a) attracted media coverage and scientific responses, including praise for its thoroughness by several independent commentators (Enserink, 2020; Science Media Centre, 2020). However, those initial reactions also identified three potential weaknesses. First, Clark et al. (2020a) did find several significant ocean acidification effects, contrary to the paper's title, although less dramatic than those previously reported. Second, their analysis gave scant attention to the extensive literature on factors causing variability in ocean acidification research. The third, more fundamental, concern related to how closely the original experiments had been repeated and whether that issue had been thoroughly checked before the paper was published.

Experimental differences
Any deficiencies in the peer review of Clark et al. (2020a) were addressed 9 months after its publication, with a detailed (online) critique by Munday et al. (2020a) that challenged the effectiveness of the claimed replication: "Clark et al. did not closely repeat previous studies, as they did not replicate key species, used different life stages and ecological histories, and changed methods in important ways that reduce the likelihood of detecting the effects of ocean acidification".
Experimental differences identified by Munday et al. (2020a) between the original and repeated results included the following.
- Clark et al. (2020a) did not use clownfish, one of the original test species.
- Adult and sub-adult fish were mostly used, rather than larvae and small juveniles (with older fish known to be less responsive to risk cues).
- For one species, the juveniles were from an inbred aquarium population (likely to be pre-adapted to high CO2 and hence less sensitive).
- Many experiments were carried out during a marine heatwave (with high temperatures known to reduce or reverse responses in the studied species).
- Dissolved CO2 levels were unstable, with an average daily pCO2 range of 581 µatm in 2016 treatments. Such variability can reduce behavioural impacts (Jarrold et al., 2017) and did not match the stable conditions of directly compared studies.
There were also crucial changes to the design of the testing apparatus, the dilution and nature of odour cues, and the duration of tests. Such changes weakened the control response, hence reducing the likelihood of significant CO2 treatment effects. In total, 16 differences between the original studies and the re-runs were described by Munday et al. (2020a), any one of which could potentially invalidate the comparisons.
The counter-argument, made at the time of the original publication (Enserink, 2020) and subsequently reiterated by Clark et al. (2020b), is that minor experimental differences are inevitable and can be considered as reflecting natural environmental variability. They should not matter if the original findings are widely applicable and robust. The question of what does or does not constitute a valid replication is therefore critical, yet inherently problematic. Since it is widely accepted that a fully exact repeat of a biological study is impossible, due to the dynamic nature of both animate and inanimate factors ("No man ever steps in the same river twice; it is not the same man, nor is it the same river", widely ascribed to Heraclitus), it is valid to distinguish "reproducibility" from "replicability". Whilst both terms relate to the repeatability of outcomes, the test for reproducibility is conventionally limited to conditions where very tight control is achievable, e.g. in data treatments, or when re-using the original experimental set-up. In contrast, greater flexibility is allowed for investigating replicability, reflected in a definition of replication as "a study for which any outcome would be considered diagnostic evidence about a claim from prior research" (Nosek and Errington, 2020a). This broad definition has merit, although consistency is needed across disciplines (e.g. Stevens, 2017; Junk and Lyons, 2020), to avoid contributing to semantic confusion in a contested topic area.
Three further generic issues are also relevant here. First, it is important that the design of a replication study adequately addresses all key components of existing hypotheses, for example, the strong life-stage dependence of the response to high CO2 conditions. Second, the limitations of statistical analyses need to be recognized: statistically non-significant results do not necessarily mean there is no effect (Amrhein et al., 2019). Third, no single study disproves the consensus, since broadening the concept of replication has the clear corollary that novel outcomes need to be interpreted using all available lines of evidence, with awareness of both similarities and differences in relation to previous work.
Table 1 of the Supplement to Munday et al. (2020a) identified 110 research papers published between 2009 and 2019 that investigated how ocean acidification might, or might not, affect the behaviour and sensory physiology of fish. Of the 44 that involved coral reef fish, 41 (carried out by 68 researchers at 35 institutions in 15 countries) reported significant effects, including several that used video recording, blind-testing, and raw-data publication. The remaining 66 papers (for other tropical, temperate and polar fish; marine and freshwater) provided additional support: 44 of those reported significant behavioural effects of ocean acidification. We are aware of five more recent publications on this topic, in addition to Clark et  A closely similar result was found in a meta-analysis of 95 marine and freshwater studies by Clements et al. (2020), with T. D. Clark included in the authorship team: they found that 64 of those papers reported either strong or weak behavioural effects. Whilst the proportion showing a strong effect declined over the period 2009-2019, that decrease is unsurprising, since the early strong-effect studies were all on the most sensitive (marine) species. Additional independent evidence is provided by molecular studies, showing direct effects of high CO2 on neurotransmission in fish (e.g. Schunter et al., 2019) and other taxa (e.g. Moya et al., 2016; Zlatkin and Heuer, 2019); further biochemical and pharmacological examples are given by Munday et al. (2020a). An objective summary of the global evidence is that ocean acidification can adversely affect fish behaviour under experimental conditions, whilst also recognizing that the occurrence and scale of such impacts vary with circumstances, species and the life stage tested.
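The paper counts quoted above can be converted to proportions with simple arithmetic. The following is a minimal illustrative sketch, not part of either cited analysis: the variable and function names are ours, and only the counts themselves come from the text. Note that the two sources classify outcomes differently ("significant effects" in Munday et al., 2020a; "strong or weak effects" in Clements et al., 2020), so the percentages are indicative rather than strictly comparable.

```python
# Counts quoted in the text: (papers reporting an effect, papers examined).
# Source: Table 1 of the Supplement to Munday et al. (2020a).
tallies = {
    "coral reef fish": (41, 44),
    "other fish": (44, 66),
}


def percent(k, n):
    """Share of k out of n, as a percentage rounded to one decimal place."""
    return round(100 * k / n, 1)


# Per-group shares of papers reporting significant behavioural effects.
shares = {name: percent(k, n) for name, (k, n) in tallies.items()}

# Pooled share across all 110 papers in the table.
total_k = sum(k for k, _ in tallies.values())
total_n = sum(n for _, n in tallies.values())
overall = percent(total_k, total_n)

# Meta-analysis of Clements et al. (2020): 64 of 95 studies with strong or weak effects.
meta = percent(64, 95)

print(shares)    # per-group percentages
print(overall)   # pooled percentage for the 110 papers
print(meta)      # percentage for the 95-study meta-analysis
```

On these counts, roughly nine in ten coral reef fish papers and two in three of the remaining papers reported effects, giving a pooled share of about three-quarters, against about two-thirds in the meta-analysis.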

Taking account of response variability
A recent IPCC assessment (Bindoff et al., 2019) confirmed the pervasive and complex effects of high CO2 and warming, not only on marine organisms and ecosystems but also on ecosystem services and society. Improved knowledge of all these response levels is crucial for effective mitigation and adaptation. This increasing appreciation of the interactions between ocean acidification and other biochemical, physiological, behavioural, ecological and physical processes is both scientifically exciting and sobering, showing the difficulty in developing a comprehensive understanding of this important component of ocean climate change. The complexity of these interactions should, however, not be surprising, since marine species have experienced natural variability in pH and CO2 levels throughout their evolution and also in their diverse habitats (Kapsenberg and Cyronak, 2019). Species will inherently have different vulnerabilities and different ways of responding, and response differences can therefore be expected to occur in experimental studies.
Recognition of widespread response variability in ocean acidification experiments is not novel. It was noted for studies on survival, calcification, growth and reproduction in early meta-analyses (Kroeker et al., 2013) and subsequently provided the focus for much national and international research. It is therefore now well established that closely related marine species can respond very differently to experimental pH treatments and that the magnitude of single-species responses can be affected by many factors. These influences include length of exposure, population-level genetic differences due to local adaptation, food availability, interactions with other stressors, seasonality, energy partitioning, life stage and the sex of the organisms used in experiments (e.g. Thomsen et al., 2012; Suckling et al., 2014; Sunday et al., 2014; Breitburg et al., 2015; Vargas et al., 2017; Ellis et al., 2017; Dahlke et al., 2018) as well as physico-chemical conditions (Riebesell et al., 2011).
Given this known variability, the results from any single ocean acidification study cannot provide the final word, overriding the consensus of other findings. Whilst many important uncertainties remain (Busch et al., 2016; Baumann, 2019; Hurd et al., 2020), we consider that scientific progress can be hindered by the sequence of polarizing criticisms (Clark, 2017; Clark et al., 2020a), rebuttal (Munday et al., 2020a), reply (Clark et al., 2020b) and a further point-by-point response (Munday et al., 2020b). A more constructive approach would involve experimental co-design in a collaborative, comparative framework (Boyd et al., 2018), with appropriate rigour (Cornwall and Hurd, 2016), which can still be consistent with scientific scepticism, replication tests and the reporting of negative results (Browman, 2016). Future ocean acidification experiments would also benefit from an update of Riebesell et al. (2011), to provide improved guidance on the key parameters that can affect laboratory results. Since a very wide range of factors are potentially important, pragmatism will be needed with regard to associated issues of resource deployment and measurement accuracy, recognizing that chemists and biologists may have different priorities on such matters.

Wider implications
The concept of generalizability (Nosek and Errington, 2020a)   therefore enabling the underlying hypothesis to be tested, and potentially disproved, by the latter? The scientific benefits of that framing are greatest when the outcome of a replicability test is accepted by two research groups that initially favour different hypotheses, thereby requiring a more nuanced, non-confrontational framework for experimental planning, analysis and interpretation (Fanelli, 2018; Nosek and Errington, 2020a, b).
Figure 1 provides a diagrammatic summary of these issues, with situation (a) showing close congruence between two experimental studies, carried out by two research groups. If both groups recognize that there is a very close match when Study no. 2 is planned (following the arrangements proposed by Nosek and Errington, 2020b), the replication provides a valid test of any hypotheses arising from Study no. 1. In contrast, situation (b) shows a pair of studies that only partly overlap; i.e. they differ in many regards, and prior agreement between the research groups on their congruence may not have been achieved. If results from both studies in situation (b) are consistent, the generalizability of Study no. 1 is extended. However, if they are inconsistent, the generalizability of Study no. 1 and Study no. 2 will each be constrained to its specific experimental conditions, with evidence from other studies providing the context for interpretation of the different outcomes. A range of intermediate situations between (a) and (b) can also occur.
The above proposals for clearer "rules of engagement" for future replication studies could be greatly encouraged if research funders not only recognized that major insights can arise from closely similar or repeated work, but also required liaison between competing research teams as a condition of award in such circumstances.Our final recommendation is that high-profile publishers should give additional attention to the quality control of potentially controversial papers, whilst also providing the opportunity for rapid, and preferably simultaneous, publication of responses by other researchers who may consider that their work has been unfairly criticized.

Figure 1. Visual summary of contrasting situations relating to (a) very close matching and (b) part-matching of pairs of studies where Study no. 2 is intended to provide a test of repeatability (and generalizability) of Study no. 1. Whilst "other studies" are also relevant to situation (a), their importance is increased when interpreting results from situation (b). See text for more detailed explanation and discussion, including the importance of experimental co-design between research groups with contradictory hypotheses.