Reviews and syntheses: The promise of big soil data, moving current practices towards future potential

In the age of big data, soil data are more available than ever but, outside of a few large soil survey resources, remain largely unusable for informing soil management and understanding Earth system processes outside of the original study. Data science has promised a fully reusable research pipeline where data from past studies are used to contextualize new findings and reanalyzed for global relevance. Yet synthesis projects encounter challenges at all steps of the data reuse pipeline, including unavailable data, labor-intensive transcription of datasets, incomplete metadata, and a lack of communication between collaborators. Here, using insights from a diversity of soil, data, and climate scientists, we summarize current practices in soil data synthesis across all stages of database creation: data discovery, input, harmonization, curation, and publication.

https://doi.org/10.5194/bg-2021-323 Preprint. Discussion started: 15 December 2021. © Author(s) 2021. CC BY 4.0 License.


Figure: Database pipeline with pain points (indicated by "-") and suggestions for improvement (indicated by "+") to conform more closely to FAIR Data Principles. Data sources can be diverse, including published (online repositories and scientific literature) and unpublished sources (direct from the principal investigator; PI). After these sources have been discovered, the data must be accessed, harmonized according to a standard format or data model (internal to the project or a community-driven standard). The aggregated data must then be curated before ultimately being published for reuse.

Availability
Reproducible analysis is fundamental to robust science and data analysis. Naively, a newcomer to the field of 21st-century science might be forgiven for assuming that a published peer-reviewed journal article would, by default, be accompanied by a published dataset in a machine-readable format. In the authors' experience, this is uncommon, for a number of reasons.
While some peer-reviewed journals and funding organizations require that data be deposited in a trusted repository that supports the FAIR principles (Fox et al., 2021), confirming that data meet these high standards is often overlooked during the review process. Indeed, there is often confusion in the field as to what exactly such 'high standards' are. Key contextual data for one study may be mostly irrelevant for a second. Anticipating these contextual data needs is challenging and leads many data providers who would otherwise support data sharing to become frustrated with the existing guidance (Couture et al., 2018).
On the data aggregation side, many data aggregators are challenged by unclear data documentation and metadata. Interactions between data providers and data aggregators therefore vary from no contact (e.g. providers who have left the field due to career changes, retirement, or death, or who are unwilling to interact) to high contact (providers collaborating with data aggregators to fill out a harmonized template). Intermediate points along this gradient include reaching out to data providers to confirm variable ranges, address possible errors, request specific unpublished measurements, or clarify ambiguous descriptions.
Data providers have unique knowledge about their systems and can be instrumental in expanding or modifying the scope of the resulting database analysis. Data-centered collaborations can lead to new communities of practice and better science (see Future recommendations section).
In addition to these benefits there are also tradeoffs. The acknowledgement and level of visibility of original data contributions remain an open question. Data providers may expect to be listed as co-authors upon reuse of their data, in recognition of their past effort collecting the data, despite having limited or no engagement in the reuse project. This is often frustrating to a data aggregator who, in turn, is left with an ever-expanding list of co-authors with varying levels of involvement (see Publication section). This can lead to conflict and a lack of trust in the community (Longo and Drazen, 2016).
In general, however, we feel that direct collaboration between data providers and data aggregators is a critical relationship to nurture. As the community continues to converge on shared tenets of good data governance (FAIR, Wilkinson et al., 2016; TRUST, Lin et al., 2020; and CARE, Carroll et al., 2020), it is becoming increasingly clear that 'just put it on a repository' is only the beginning.

Data input and harmonization
Datasets typically reflect the purposes of the original study; however, examining those same data in a broader context often requires a different data format. Harmonizing one data contribution with a broader collection entails merging or breaking apart data tables, renaming columns, and occasionally converting units of observation (see Section 2.3 below). Translation of the data can shift from one method to another as a database grows. Regardless of which method is used, one of the primary goals is to maintain data provenance that allows each data point to be traced to an original study or author.
Manual transcription is the most common method and typically entails data being taken from the original source and entered into a common template by either the data provider or the aggregator (see Sections A1 and A2). Asking the data provider to fill out these data templates is often identified as a major hurdle to contribution, yet data aggregators may be unfamiliar with the data provided and thus capture an incomplete or incorrect translation of the original data into the new format without the help of the data provider. Regardless of who fills out the template, human transcription of data is error prone. In some cases this is unavoidable when the data are not available in a machine-readable format. A number of software tools that allow data extraction from figures (e.g. Web Plot Digitizer (Rohatgi, 2021), Data Thief (Tummers, 2006), and metaDigitise (Pick et al., 2018)) and tables (pdftools (Ooms, 2021), tabula (Aristarán et al., 2012-2020)) can reduce human error in transcribing these machine-hostile formats. Despite its flaws, manual transcription is flexible and easy to set up, making it a frequent choice for data aggregation studies with a tight timeline.
An alternative approach to manual transcription is scripted transcription. Scripted template transcription involves writing a computer program, customized to the specific data being ingested, to reformat the data tables and column names to match a target data standard or template (see Section A3). This approach requires familiarity with both soil science (to understand the measurements) and programming (to write the scripts). In practice, the authors find such a skill combination unusual for any individual researcher, necessitating the use of interdisciplinary teams and adding organizational complexity. The codebase can also become unwieldy if written on a case-by-case basis for each input dataset. These costs are countered by an increase in accuracy, transparency, and reproducibility when compared with manual transcription.
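A scripted transcription step can be sketched in a few lines. The column names and template fields below are hypothetical illustrations, not a published soil data standard:

```python
# Minimal sketch of scripted transcription: rename a dataset's columns to
# match a target template and record provenance. All names are assumptions.
SOURCE_TO_TEMPLATE = {
    "SOC_pct": "organic_carbon_percent",
    "lat": "latitude",
    "lon": "longitude",
    "depth_cm": "layer_top_cm",
}

def transcribe(rows, source_id):
    """Reformat one contributed dataset (list of dicts) to the template."""
    harmonized = []
    for row in rows:
        new_row = {template: row[src] for src, template in SOURCE_TO_TEMPLATE.items()}
        new_row["source_id"] = source_id  # trace each record back to its study
        harmonized.append(new_row)
    return harmonized

rows = [{"SOC_pct": 1.2, "lat": 45.0, "lon": -93.2, "depth_cm": 0}]
harmonized = transcribe(rows, "Smith_2019")
```

In practice each input dataset needs its own version of this mapping, which is where the case-by-case codebase growth mentioned above comes from.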

Keyed translation is the most general approach and, as a result, requires the most extensive informatics work. Keyed translation is related to scripted transcription but uses a dictionary to define relationships between the input data format and the target data format; for example, mapping Column 1 to Column A or Table A to Table B. Keyed translation combines metadata about each dataset with a generalized conversion script to generate a harmonized database (see A4). Such a generalized approach can be more easily extended to expand the number of data sources. However, there is currently no broadly agreed-upon annotation vocabulary, making it necessary to annotate each dataset individually within each project. In addition, the computational expertise needed for this approach is the highest of the three outlined here. While we feel keyed translation holds great promise for future studies, it remains uncommon due to these challenges.
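The distinction from scripted transcription is that the per-dataset mapping is stored as data (the "key") and consumed by one generic conversion routine. A minimal sketch, with made-up dataset names and columns:

```python
# Keyed translation sketch: each dataset carries a dictionary ("key") that
# maps its columns to target columns; one generic function applies any key.
# The keys and column names here are illustrative, not a community vocabulary.
KEYS = {
    "dataset_A": {"carbon": "organic_carbon_percent", "top": "layer_top_cm"},
    "dataset_B": {"OC%": "organic_carbon_percent", "upper_depth": "layer_top_cm"},
}

def translate(rows, dataset_id):
    """Apply a dataset's key to its rows; the function itself is dataset-agnostic."""
    key = KEYS[dataset_id]
    out = []
    for row in rows:
        new_row = {target: row[source] for source, target in key.items()}
        new_row["source_id"] = dataset_id  # maintain provenance
        out.append(new_row)
    return out

# Heterogeneous inputs become comparable records in one harmonized database:
database = (translate([{"carbon": 2.1, "top": 0}], "dataset_A")
            + translate([{"OC%": 0.8, "upper_depth": 10}], "dataset_B"))
```

Adding a new data source then means writing a new entry in `KEYS` rather than a new script, which is why this approach scales better across many contributions.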

Curation
Data that are in an integrated database often still need to be curated to ensure accuracy, convert units, address missing data or gaps, and reduce or aggregate data to derive relevant data products. While scripts are often used extensively at this phase, expert interpretation and review is a critical component. Finally, reuse of databases often requires a repeat of this curation phase: what may be appropriate for one question or purpose may not be appropriate for another.
Scripting is heavily utilized to augment expert review for data quality control. These scripts both automate and document quality control criteria; however, setting those criteria often requires extensive knowledge of the system and measurement methods. For example, the ISRaD database has an automated quality control protocol, accessible via a web interface, which ensures that values are within reasonable ranges and checks that records and critical metadata are appropriately linked across database tables (Lawrence et al., 2020). Following this initial filter, a manual 'expert review' is conducted by a trained ISRaD volunteer (see A2). These extensive quality control procedures require time and diverse expertise, making them unattractive for many open source database projects without broad recognition of service by the field.
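An automated range check of the kind described above can be very simple. The field names and plausible ranges below are illustrative assumptions, not ISRaD's actual criteria:

```python
# Sketch of an automated QC pass: flag missing or out-of-range values.
# Ranges and field names are assumptions for illustration only.
RANGES = {
    "organic_carbon_percent": (0.0, 60.0),
    "ph": (2.0, 11.0),
    "latitude": (-90.0, 90.0),
}

def quality_check(record):
    """Return a list of human-readable problems; an empty list means pass."""
    problems = []
    for field, (lo, hi) in RANGES.items():
        value = record.get(field)
        if value is None:
            problems.append(f"missing {field}")
        elif not lo <= value <= hi:
            problems.append(f"{field}={value} outside [{lo}, {hi}]")
    return problems

# An out-of-range carbon value is flagged for expert review rather than
# silently dropped:
flags = quality_check({"organic_carbon_percent": 95.0, "ph": 6.5, "latitude": 45.0})
```

Records that fail such a script are the natural queue for the manual expert-review step, so automation and human review complement rather than replace one another.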

Gap-filling expands data coverage and can include a number of strategies to fill in missing data at both the layer or horizon level and the profile or site level. Strategies include linear interpolation, pedotransfer functions, georeferenced data extraction, and more sophisticated machine learning algorithms. Given the wide variability in gap-filling practices and objectives, these methods must be extensively documented, must clearly state use-case restrictions, and must include uncertainty estimates.
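The simplest of the listed strategies, linear interpolation within a profile, can be sketched as follows; the depths and carbon values are made up for illustration:

```python
# Gap-filling sketch: linearly interpolate a missing mid-profile value
# between its nearest known neighbors. Assumes gaps are interior to the
# profile (known values exist above and below each gap).
def interpolate_gap(depths, values):
    """Fill None entries by linear interpolation against depth."""
    filled = list(values)
    for i, v in enumerate(filled):
        if v is None:
            lo = max(j for j in range(i) if filled[j] is not None)
            hi = min(j for j in range(i + 1, len(filled)) if filled[j] is not None)
            frac = (depths[i] - depths[lo]) / (depths[hi] - depths[lo])
            filled[i] = filled[lo] + frac * (filled[hi] - filled[lo])
    return filled

# Organic carbon (%) at 0, 10, 20, 30 cm with the 20 cm layer missing:
result = interpolate_gap([0, 10, 20, 30], [3.0, 2.0, None, 1.0])  # -> [3.0, 2.0, 1.5, 1.0]
```

Even this trivial method needs the documentation the text calls for: a filled value is a model estimate, not a measurement, and should be flagged as such in the data product.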
Finally, a common step is to remove unnecessary data from the data product via pruning. Pruning removes samples based on location or type of measurement after database compilation, typically to cater to specific data user needs, and reduces the size of the data product. Both pruning and gap-filling highlight the importance of maintaining an intermediary harmonized database alongside the final data product, both to preserve the original contextual data and to enable reuse of those data for alternative projects.
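Pruning is typically a thin filter over the intact harmonized database. A minimal sketch, with illustrative field names:

```python
# Pruning sketch: subset the harmonized database (which is kept intact
# elsewhere) into a user-specific data product. Field names are assumptions.
def prune(records, bbox=None, required_field=None):
    """bbox = (min_lat, max_lat, min_lon, max_lon); both filters are optional."""
    out = []
    for r in records:
        if bbox is not None:
            min_lat, max_lat, min_lon, max_lon = bbox
            if not (min_lat <= r["latitude"] <= max_lat
                    and min_lon <= r["longitude"] <= max_lon):
                continue
        if required_field is not None and r.get(required_field) is None:
            continue
        out.append(r)
    return out

records = [
    {"latitude": 45.0, "longitude": -93.0, "bulk_density": 1.2},
    {"latitude": 10.0, "longitude": 5.0, "bulk_density": None},
]
regional = prune(records, bbox=(40, 50, -100, -90))  # keeps only the first record
```

Because pruning is non-destructive here (it returns a new subset), the full harmonized database remains available for the alternative reuses the text describes.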

Publication
Authorship issues are common in data aggregation projects due to unclear expectations and conflicting conventions across what are often large teams of collaborators. Indeed, these problems are common in large collaborative projects (Cooke and Hilton, 2015). They can be mitigated with an expanded co-author list that includes data providers, aggregators, and reanalysis teams, but this requires significant project management and organizational overhead. As always with larger team science, we highly recommend establishing a formal authorship policy prior to beginning database compilation, including the role of each contributor and who will occupy lead manuscript positions, be listed as co-authors, and be listed in the acknowledgements. While it can be tempting in a data aggregation project to fall back on what you are legally entitled to (i.e. if the data are there, use them), we strongly feel that collaborative projects build trust within the scientific community, leading to better data interpretation and seeding future collaborations.
Related to the issues of credit and co-authorship described above are issues with data licenses. Similar to manuscripts, data are often released under a specific legal license with requested reuse considerations, which may hinder the inclusion of otherwise 'public' data in a data synthesis. There is a tension in choosing between an adequately restrictive license, which can help ensure that a specific project and its data providers are given credit, and a permissive license, which can increase data reuse. Creative Commons provides a framework to examine these considerations, but there are many other standard and custom licenses. The most permissive is the CC0 public domain dedication, which puts no restriction on data use. An 'Attribution' (BY) clause requests that the original data source be acknowledged in the derivative product in some way (sometimes this acknowledgement is specified, sometimes not). A 'Non-Commercial' (NC) clause restricts the sale of the data for commercial purposes. Finally, a 'Share-Alike' (SA) or copy-left clause states that the data may be reused only if the derivative is released under the same license. A CC-BY license is probably the closest to the traditional academic practice of research citation, and many scientific repositories, including the Environmental Data Initiative and PANGAEA, encourage data providers to select this option.

In all cases, database creation does not have to be a single push but is ideally part of an ongoing synthesis effort, leading to the need for database versioning. The COSORE database is an example of such an approach (Bond-Lamberty et al., 2020).
After each major change (release), the database receives a new DOI and is permanently archived in a repository. This allows maximal transparency, letting data users reproduce an analysis from a given version while making it easy to find the newest version of the database.

The hope of big data is to have any data collected at any time, anywhere in the world, at your fingertips (see Section B). For soil science, the potential for long-term (multi-decadal) understanding is particularly exciting. Long temporal coverage of soil data could lead to a better understanding of soil carbon sequestration potential to mitigate climate change, or to better management of soils for crops. How do we attain these futures, where data reuse is valued equally with data production?

We recommend implementing a core set of measurements and processes to facilitate soil data reuse. The recommendations in Section 3.1 are aimed at researchers collecting soil data who wish to ensure the long-term value and reusability of their datasets. These recommendations are also relevant for journals and peer reviewers of soil science research as a short checklist of key details that should be reported or addressed. Section 3.2 outlines recommendations for researchers who wish to participate in the data harmonization process. These recommendations encompass both technical and social considerations for data harmonization efforts and focus on what can be done right now to further soil data exchange.

What to measure and report?
Soils are inherently rooted in time and space, making high-resolution spatial and temporal information (including sampling date; latitude, longitude, and geographic datum; and depth of sample) critical for building context and enabling data reuse. Data providers often ask 'what should I measure?' to be relevant to data aggregation efforts, and there are efforts to provide such guidance (Billings et al., 2021). We have chosen instead to focus on critical temporal-location information that allows data to be expanded, contextualized, and annotated. The issue is not that researchers do not know how to record this information, but rather that conflicting objectives may prevent its recording.
Geospatial metadata may present a privacy concern, for example when soil measurements are tied to the economic valuation of the land, as in agricultural systems. For data collected on privately owned land, such as on-farm research and observations, researchers may not be at liberty to release detailed location information publicly, in order to protect landowner privacy (Richardson et al., 2015). There are efforts to bridge data sharing and data privacy. For example, the platform under development by the International Agroinformatics Alliance will integrate secure data storage, granular data permissions, and options to register privately hosted data to facilitate data discovery and sharing while protecting privacy (Gustafson et al., 2017). Clearly this is an ongoing discussion that will require more research and conversations with stakeholders. The advantages of high-precision geolocations are significant, and regardless of the precision, the level of uncertainty in the provided geolocation is critical and often missing in archived datasets. Location information enables soil data to be joined with the growing number of gridded global datasets that can provide key contextual information for interpretation and modeling.
While there are privacy concerns in some locations, not reporting the location of a sample collection should be the exception and not the rule, especially in publicly funded research data.
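One hedged middle ground between full disclosure and no location at all, not drawn from the cited platforms, is to publish coordinates at a deliberately coarser precision and report the resulting uncertainty explicitly:

```python
# Sketch: coarsen a geolocation for publication while recording its
# positional uncertainty. The 111 km/degree figure is the approximate
# length of one degree of latitude; this is an illustrative approach,
# not an established privacy standard.
def coarsen_location(lat, lon, decimals=2):
    """Round coordinates and return an approximate uncertainty radius (km)."""
    uncertainty_km = 111.0 / 10 ** decimals  # one unit in the last kept digit
    return round(lat, decimals), round(lon, decimals), uncertainty_km

coarse = coarsen_location(44.98567, -93.23456, decimals=2)
# Two decimals keeps the location good to roughly a kilometer.
```

Reporting the uncertainty radius alongside the rounded coordinates addresses the complaint above that archived datasets often omit geolocation uncertainty entirely.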

In addition to location, sample depth is also critical due to the variation of soil properties and processes with depth. Unfortunately, over 60% of studies fail to report the sample depth (i.e. the layer's upper and lower bounds) associated with soil data (Yost and Hartemink, 2020). This is particularly critical to advancing our understanding of deeper soil properties and functions, but it is also relevant to the effects of surface tillage and grazing on managed lands. Soils are temporally dynamic, and time of collection can provide key insights into decadal-level changes in soils. Soils change over time owing to pedogenesis, historical land use, and, increasingly, global climate change (Tugel et al., 2005; Richter et al., 2011; Ellis, 2011; Harden et al., 2018). Recording the time of collection for modern datasets can produce valuable returns in future reanalysis. Older datasets, consisting of historic measurements and archived samples, are increasingly valuable for tracking soil responses to global change. Such datasets can provide a window into the dynamics of how soil properties change and should be a high priority for data rescue and documentation.

Decadal-scale soil records provide valuable information for the study of global change and land management, and therefore sites associated with older observations should be prioritized for re-survey efforts (Hawkins et al., 2013). Decadal-scale datasets have their own challenges. These data are often held by a particular researcher or group, typically released only at the retirement of the principal investigator, and likely to have been reformatted multiple times across several generations of storage systems and lab staff; they may even be on analog storage and not digitized. Thus, re-survey efforts can be combined with data rescue, involving digitization of paper copies or structured interviews of personnel to enrich the metadata of prior observations (Karasti et al., 2006), to create an incredibly valuable data product. In the opinion of the authors, data rescue efforts are an underutilized resource in the field.
By adopting these recommendations to record geolocation, depth of sample, and collection date, we can greatly increase the value of soil data, extending measurement reusability for future analysis.

How to harmonize?
We touched on several common approaches to data harmonization in this paper. Often driven by a single research question or objective, data harmonization has historically been a laborious process carried out by a single researcher or a small group for a specific project. Based on our experiences in various harmonization projects, we propose a more community-centered approach moving forward, founded on the principles of open and transparent science. Outputs from these groups should include semantic tools like ontologies and shared vocabulary lists with clear and transparent governance, as well as a new community-centered approach to the practice of data harmonization and the resulting databases.

Community tools
There is an understandable tendency by many scientists towards standards. If all soil data adhered to a common template, or data model, with uniform tables and column names, then it would be trivial to append the data from one study to the data from a second. Unfortunately, due to the diversity of soil types and methodologies, as well as ever-evolving measurement technologies, we feel that this is impractical for soil research data, although several valiant efforts are underway (Nave et al., 2016; Lawrence et al., 2020). In practice, researchers will continue to develop their own data tables and internal conventions that make sense for their experimental structure, location, and measurement type. However, semantic tools and standards are still useful (Onerhime, 2021).

Annotating datasets with a common vocabulary forms the theoretical backbone of all data harmonization work. Whether this is a manual copy-paste from a source data table into a common data template, or the creation of a thesaurus that cross-references given data columns to some internal standard name, both processes rely on a vocabulary. Classical soil glossaries and lab manuals have been printed as dictionary-style references that are difficult to transform into digital resources for both copyright and technical reasons. This vocabulary could be a valuable community resource but would require ongoing engagement with the research community to remain accessible, relevant, and up to date. Further extending this vocabulary into an ontology, which captures the relationships between terms in addition to their definitions, could drive the next generation of data-driven machine learning. Community-developed ontologies and vocabulary lists like ENVO (Buttigieg et al., 2016), CSDMS (CSDMS, 2019), GLOSIS (Palma et al., 2020), and CF (Hassell et al., 2017) could provide reusable resources that are currently missing and underutilized in the soil community. The soil community as a whole needs to engage with these broader resources to ensure the informatics reflect new developments in the understanding of soil science and the measurements being made.

Community practice
Before data can talk, communities need to talk. Based on the experience of the authors, developing, adopting, and maintaining semantic resources is beyond the scope of any one lab or organization and requires a community. In the development phase, a diverse community can ensure that the broadest possible needs are being addressed. Adoption is more likely if the resource addresses the needs of the community and that community has ownership over the resource. Finally, maintaining semantic resources requires ongoing updates and revisions as methods shift. All of this requires a new type of community, one centered on data and on tools to support the interoperability of that data.
Successful data-centered communities are open, transparent, diverse, and rewarding (Cooke and Hilton, 2015). They are open in the sense that anyone can join or contribute and is empowered through educational activities to participate. Transparency ensures that it is easy to contribute and to understand decision-making processes. Diverse communities can draw on a wide range of skill sets, from experience in soil processes to knowledge representation. They are also rewarding, furthering members' careers through the creation of tangible products (for example, citations or grant dollars) and opportunities for scientific leadership and service. While there are several approaches to achieving this, one possible workflow for establishing a new harmonized database might look like the following. One of the first challenges with an interdisciplinary team is establishing agreement on goals and methods. This requires developing a shared understanding and vocabulary (e.g. through educational activities on computational tools or on soil surveys and measurements). In an academic community, shared purpose is most easily motivated by a synthesis paper or research question. Data may provide a clear shared motivation, but their uses and governance processes need to be clearly identified and revisited regularly.
Sustainable creation and curation of the harmonized database is essential to create relevant data products that serve the database's specific purpose and to enable future reuse to address a variety of questions. Accessing, annotating, and merging the datasets is a well-established technical process once the community tools and community of practice are in place. Curation of databases could be patterned after the manuscript review process, where domain researchers review proposed database additions to ensure the accuracy of new contributions. This review process should keep the diverse needs and practices of the soil community in mind, including soil surveyors, field and lab experimentalists, and land managers. In the end, the synthesis of existing data is not the goal: it is their application to scientific problems. In that regard, successful product development from a database can encourage growth and adoption of the data resource by others.

Soils are the foundation of our food and fiber systems as well as a significant component of the global carbon cycle. As such, information and measurements of the soil system, from hydrological conductivity to soil carbon stocks to changes in nutrient content, are a key public good for a varied group of users. However, this valuable scientific resource is currently underutilized due to many of the issues outlined above. We suggest that data use and reuse could be facilitated by addressing issues along the database construction pipeline.

We outlined database creation as a common set of steps: generation, discovery, access, curation, and publication. While this pipeline can look different depending on the skill sets, timeline, and funding structure of the researchers involved, we summarized common pain points throughout this process, which can reduce the accuracy and usability of a database. Data collection, synthesis, and use are inherently human endeavors, and as such, breaks in this pipeline are often driven by a lack of community awareness and practices.

We put forth recommendations, ranging from measurement prioritization to data harmonization decisions, that can help move forward community practices around soil data. We recommend that contextual information like geolocation, depth of sample, observation time, and management history all be reported with soil measurements. Soil data harmonization requires the development of new semantic tools like vocabulary lists and ontologies that are co-produced by data and soil scientists. Building the capacity to create and maintain these tools requires communities of practice, including open application periods to recruit diverse participants, established goals, and clear outcomes. The creation of such communities is not an easy task, but it is a needed one.
Ultimately, soil data are an invaluable resource generated and used by a diversity of groups. Given this value, we hope that the work of advancing soil information systems will increasingly be recognized and rewarded as a critical component of the research process. To achieve this, we need not only new tools and practices but also shifts in the broader incentive structure for conducting this kind of work. Our review provides a path forward to enhance community practice around soil data so that we can begin to tackle the vast array of research and management problems, and their solutions, that lie beneath our feet.

Appendix A: Current soil projects
Below are a series of snapshots compiled to represent the range of approaches groups currently take to aggregating soil databases. These four snapshots include a manually compiled database of field warming experiments (Section A1: Crowther et al. (2016)), a database using manual transcription combined with scripted curation of soil radiocarbon measurements (Section A2: Lawrence et al. (2020)), a manual-scripted combination for coastal soils (Section A3: Holmquist (2021)), and a keyed translation database of long-term observations (Section A4: Wieder et al. (2021, 2020)). This appendix is not meant to be exhaustive. For a living list of researcher-driven soil databases, please see Todd-Brown (2021).

A1 Field-warmed soils
The template-driven approach to data harmonization is exemplified by Crowther et al. (2016). In this study, individual researchers who had collected data of interest (in this case, soil field-warming manipulations) were contacted directly and invited to collaborate in a meta-analysis. A post-doc was tasked with creating a data template and working with those collaborators to capture a representation of each study. These data were then appended into an integrated set of data tables and analyzed. By working with researchers directly, this approach captured both published and unpublished data and ensured a nuanced interpretation of the study results. This careful one-on-one approach, combined with co-authorship on a high-profile journal article, ensured that researchers were comfortable sharing data that they might otherwise have withheld from a joint publication.
One challenge with this approach is patchy secondary data. Secondary data like climate and soil physicochemical characteristics may not be critical to a small study at a single site but become fundamental to a larger cross-site analysis. Crowther et al. (2016) addressed this by extracting site-level environmental covariates from gridded geospatial files generated from global modeled predictions covering an array of climate (e.g. WorldClim (Fick and Hijmans, 2017)) and soil physicochemical characteristics (e.g. SoilGrids (Hengl et al., 2017)). Although these global predictions can carry considerable uncertainty, especially at the local scale, they at least ensure a full set of standardized metadata for every location.
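The covariate-extraction step amounts to looking up the grid cell containing each site's coordinates. A minimal sketch with a tiny made-up grid (a real workflow would read WorldClim or SoilGrids rasters, e.g. from GeoTIFF files):

```python
# Sketch of extracting a site-level covariate from a gridded product.
# The grid below is fabricated for illustration; grid rows run north to
# south from the origin (lat0, lon0) at resolution `res` degrees.
def sample_grid(grid, lat, lon, lat0, lon0, res):
    """Return the value of the grid cell containing (lat, lon)."""
    row = int((lat0 - lat) / res)
    col = int((lon - lon0) / res)
    return grid[row][col]

# Tiny fake 0.5-degree "mean annual temperature" grid covering 46-44 N,
# 94-92 W:
grid = [
    [6.0, 6.1, 6.2, 6.3],
    [6.4, 6.5, 6.6, 6.7],
    [6.8, 6.9, 7.0, 7.1],
    [7.2, 7.3, 7.4, 7.5],
]
mat = sample_grid(grid, lat=45.2, lon=-93.1, lat0=46.0, lon0=-94.0, res=0.5)
```

Because every site gets a value from the same gridded product, the resulting covariates are standardized across studies, which is exactly the property the text highlights, at the cost of inheriting the grid's local-scale uncertainty.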

A2 International Soil Radiocarbon Database (ISRaD)
The International Soil Radiocarbon Database (ISRaD) is an open-source database of soil and soil-related radiocarbon data. Additionally, ISRaD provides a continually developing library of tools for data access and manipulation based in R (Lawrence et al., 2020). Radiocarbon data production is an expensive and time-consuming process but provides unique information on longer-timescale in situ processes. In addition, a global pulse in atmospheric radiocarbon content in the 1960s provides unique analytical power for data collected in the intervening decades to constrain models.

Initial funding was provided by the USGS Powell Center and the USDA NIFA FACT program. Currently, the Max Planck Institute for Biogeochemistry provides both ongoing funding and staffing. Since 2015, over 300 studies have been compiled, though the data collection process is active and ongoing, and user-submitted data are also welcomed.
The project utilizes a standardized Excel template for data ingestion, which each contributor fills out and submits to a designated ISRaD coordinator. The core unit of a data entry is the "profile", which is a unique spatial and temporal identifier. All data must be matched to a profile, which is in turn matched to a "site" and to the uppermost level, "entry", which identifies the publication from which the data originate. This hierarchy is best preserved through vetting with both an automated and a human-led QA/QC process. Therefore, prior to ingestion, data undergo both an automated quality check and expert review for metadata consistency and data quality. All database and data handling tools are built in the open-access R computational language, and an official ISRaD code library is available through CRAN, the R package repository. All code and data are available in an open Git repository (Beem-Miller et al., 2021). New functions and explanatory vignettes can be submitted by users for inclusion in the R package. The project website contains information, links, guides, and updates on the project (ISRaD, 2018-2021).

A3 Coastal Carbon Research Coordination Network
The Coastal Carbon Research Coordination Network (CCRCN) was formed to accelerate the pace of discovery in coastal wetland carbon science by providing the community with access to data, analysis tools, and synthesis opportunities. Funded as a National Science Foundation Research Coordination Network, the project's primary staff includes a funded research scientist as well as several part-time data technicians. Besides organizing topical working groups and community events, one of the primary engagements of the CCRCN is the development and maintenance of a database of carbon stocks and sequestration in coastal marshes, mangroves, swamps, scrub/shrub, and seagrass.

Both the database and its software are hosted on GitHub, with structure and naming conventions documented in Holmquist (2018). Ingestion of datasets into the database is via scripted transcription, by which curation to CCRCN standards is performed in a unique "hook" script tailored to each dataset. A suite of helper tools aids unit conversion, quality control, and spatiotemporal processing specific to soil carbon data. Datasets are joined together to construct the multi-level database, partitioned primarily by scale of observation (depth-series, core/plot, site, and methods levels). An automatically generated bibliography tracks primary citations of data contributors alongside the secondary citation of the database itself. A post-synthesis QA/QC script identifies possible duplicate plot-level entries between datasets. Internally facing visuals and reports, generated via Markdown, track database growth as well as geographic and biophysical gaps in the database. Finally, the online version of the database feeds the backend of its primary public interface, the Coastal Carbon Atlas. This R Shiny app allows anyone to explore global representation, query desired data according to a variety of environmental and methodological parameters, and then download the data and (importantly) the corresponding citations.
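The per-dataset "hook" pattern can be sketched as follows. This is a hedged illustration in Python (the CCRCN's actual hooks are R scripts); the column names, unit factor, and schema are assumptions, not the CCRCN's real conventions:

```python
# Illustrative per-dataset "hook" pattern: each contributed dataset gets its
# own curation function mapping source-specific columns and units onto an
# assumed standard schema; a builder joins hook outputs into one table.

def hook_smith_2020(raw_rows):
    """Curate one (hypothetical) contributed dataset: rename columns and
    convert depths from inches to centimeters."""
    curated = []
    for row in raw_rows:
        curated.append({
            "core_id": row["CoreName"],
            "depth_min_cm": round(row["TopDepth_in"] * 2.54, 2),
            "depth_max_cm": round(row["BottomDepth_in"] * 2.54, 2),
            "dry_bulk_density_g_cm3": row["BD"],
        })
    return curated

def build_database(hooks, raw_sources):
    """Run every dataset's hook and join the outputs into one flat table."""
    table = []
    for name, hook in hooks.items():
        table.extend(hook(raw_sources[name]))
    return table
```

The design choice is that dataset-specific quirks live entirely inside the hook, so the joined database only ever sees standardized records.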
Parallel to synthesizing the database has been a concerted effort to generate data releases, each with its own DOI, as a service to data submitters as well as the coastal carbon community. This has included dedicated staff time for outreach, formatting datasets, generating metadata (based on the Ecological Metadata Language standard), and assigning DOIs, which has so far resulted in the public release of 22 datasets on the Smithsonian Figshare repository.

A4 SoDaH
The Soils Data Harmonization (SoDaH) and Synthesis project features a tool suite for harmonizing soil organic matter data from disparate sources into a common data model, and a database of harmonized soil organic matter data and related variables that, as of this writing, includes data from over seventy unique studies (Wieder et al., 2020, 2021). The product of a Long-Term Ecological Research (LTER) synthesis working group (LTER Soil Organic Matter Working Group, a), the project brought together soil scientists with diverse backgrounds and affiliations with scientific research networks to refine and evaluate theories of soil organic matter dynamics, and to produce a soil organic matter dataset that spans a wide range of environmental and experimental conditions. SoDaH employs a keyed-translation approach, combining metadata about the data with conversion scripts to translate contributed data tables into the common data model. Metadata are organized at multiple levels, including the study level (e.g., study location, data provider) and the variable level, which is subdivided into profile, layer, and fraction categories. Additional metadata fields capture experiment details and study design, allowing users to, for example, query only data associated with specific manipulations or control conditions. The harmonization script (LTER Soil Organic Matter Working Group, b) maps user-provided metadata and data into new flat file(s) in which variable names and, where relevant, units are standardized, with appropriate quality control applied. All output conforms to the specifications of the SoDaH data model, thereby enabling the aggregation of output from disparate studies into a single data file.
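The keyed-translation idea above can be sketched minimally: a "key" records, for each contributor-specific column, the standard name and a unit-conversion factor, and a generic function applies it. This Python sketch is an assumption-laden illustration (SoDaH's real key format, variable names, and R implementation differ):

```python
# Minimal keyed translation: map contributed column names onto a common
# data model and standardize units. KEY is a hypothetical example key;
# unmapped columns are simply dropped.

KEY = {
    "soilC":     {"standard_name": "soc_percent",  "factor": 1.0},
    "depth_top": {"standard_name": "layer_top_cm", "factor": 100.0},  # m -> cm
    "depth_bot": {"standard_name": "layer_bot_cm", "factor": 100.0},  # m -> cm
}

def translate(rows, key):
    """Produce flat records conforming to the (assumed) common data model."""
    out = []
    for row in rows:
        out.append({spec["standard_name"]: row[col] * spec["factor"]
                    for col, spec in key.items() if col in row})
    return out
```

Because every contributed table passes through the same function with its own key, the outputs from disparate studies can be concatenated directly into a single file, which is the property the SoDaH data model relies on.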

B1 A new data-savvy world
In the late 21st century, soil data are effortlessly collected and collated. There are still divisions between groups on exactly how certain missing data should be filled in, but in general most instruments come with their own connection interfaces that ensure interoperability between sensor data. It took field scientists a little longer to get behind standardized digital records, but most researchers now collaborate with their data librarians and archivists to adhere to established data standards.

Occasionally a researcher will design a completely novel method or experimental treatment that requires additions or modifications to soil semantic tools. These cases are highly sought after by informaticians, and extending an existing international standard has been known to launch the careers of young researchers. More often, researchers will review new data archives to vet their annotations and report on them as they do in manuscript reviews. Researchers of course complain about this additional workload, but recognize that if they are going to generate data, then there is an obligation to review their colleagues' data. Once or twice in their career a researcher will serve as a domain expert on a relevant ontology board, providing updates and revisions to these key international resources.
Digital data archives are now entirely annotated, and the idea of putting data online without reference to an international ontology is considered irresponsible. Data are annotated with one or more ontologies from different sub-domains, and a range of AI/ML tools can leverage these annotations to create an integrated database. Model development and meta-analysis studies now spend the bulk of their time honing hypotheses instead of cleaning data.
The great data rescue projects of decades past are, sadly, mostly done now. An entire generation of researchers cut their teeth combing through old paper archives and fighting with optical character recognition. New researchers lament that this highly fruitful line of 'new' research data is now mostly spent. Designing a new sensor processing pipeline just isn't as romantic as speculating on the nature of that old coffee-stained field journal.

Contrary to popular belief, the 'traditional' skills of soil observation (hand-texturing soils, matching horizon colors) are more in demand than ever. The ongoing climate crisis has now defined several generations of researchers and reignited interest in soils beyond an agricultural context. Knowing your soil and how you impact it is as important as knowing your water quality, and soil science is a fundamental curriculum element not only for foresters and conservation majors but also for urban planners and backyard gardeners. New passionate generations of young students have grown up on soil judging competitions, and soil reports are common for any land or home purchase.

B2 The era of big data rescue
In the middle of the 21st century, we are in the heyday of Big Data Rescue. Driven by the need to understand the impact of legacy management, and aided by new data scrubbing technologies, researchers have dived deep into the filing cabinets and paper archives of the past century. In addition to the traditional literature review, new graduate students now complete targeted data rescue chapters as part of their dissertations. These data rescue projects have also drawn new researchers from the library, data, and other sciences into soils, and data rescue is a common undergraduate research project.
Big corporations and governments have taken on the task of parcelling out data to the general public; stripping out sensitive information reduces the utility of the data but is seen as a necessary evil. Ethicists are still debating whether this gives the corporations and governments too much knowledge, and double-blind methods are being developed to obfuscate sensitive data even from the data holders.
Ontologies and other semantic resources are increasingly being adapted and extended by domain scientists. Unfortunately, there are several competing standards, reflecting national, domain, and broader political divisions in the research community.
However, there are several clearly identified mature semantic resources that most disciplines agree are pretty good. Data management plans from funders now require identifying semantic resources in addition to the final data archive.

B3 A post-pandemic, better-connected world
Over the next few years, the soils community fully recognizes that we have a data problem, and that it is really a community problem. Collecting and publishing science in isolated labs has become increasingly frustrating to new researchers used to instantaneous web results. The COVID-19 pandemic forced a rapid shift in how science is done, moving what might have been a few-day workshop into a longer, slow-burn virtual collaboration over months. This led to a new kind of decentralized project management, in which most projects are now connected to similar researchers through regular virtual seminars and working groups.
This increase in researcher interactions has led to an increase in data interactions. As researchers interact more online, there has been a corresponding increase in comparing data from their own studies with their colleagues' results. This has led to an informal common vocabulary and shared data methodologies that increasingly show up in newly archived data. Some graduate students are starting to dive into data rescue operations, further expanding these vocabularies to include older methodologies. Combining automated optical character recognition of scanned documents with manual corrections, these older data are providing valuable insights into climate change.