the Creative Commons Attribution 4.0 License.
Results from a multi-laboratory ocean metaproteomic intercomparison: effects of LC-MS acquisition and data analysis procedures
Jaclyn K. Saunders
Matthew R. McIlvin
Erin M. Bertrand
John A. Breier
Margaret Mars Brisbin
Sophie M. Colston
Jaimee R. Compton
Tim J. Griffin
W. Judson Hervey
Robert L. Hettich
Pratik D. Jagtap
Michael Janech
Rod Johnson
Rick Keil
Hugo Kleikamp
Dagmar Leary
Lennart Martens
J. Scott P. McCain
Eli Moore
Subina Mehta
Dawn M. Moran
Jaqui Neibauer
Benjamin A. Neely
Michael V. Jakuba
Jim Johnson
Megan Duffy
Gerhard J. Herndl
Richard Giannone
Ryan Mueller
Brook L. Nunn
Martin Pabst
Samantha Peters
Andrew Rajczewski
Elden Rowland
Brian Searle
Tim Van Den Bossche
Gary J. Vora
Jacob R. Waldbauer
Haiyan Zheng
Zihao Zhao
Download
- Final revised paper (published on 08 Nov 2024)
- Supplement to the final revised paper
- Preprint (discussion started on 16 Jan 2024)
- Supplement to the preprint
Interactive discussion
Status: closed
- RC1: 'Comment on egusphere-2023-3148', Anonymous Referee #1, 26 Feb 2024
The authors describe the results of an intercomparison study of ocean discovery metaproteomics data, in which 9 labs participated (representing a very large fraction of the current ocean proteomics scientific community). This is important and timely work for the ocean omics community, which is embarking on multiple, large, coordinated efforts to collect data about ocean metabolism. The authors detail that, despite challenges in comparing data outputs from multiple labs and despite labs differing in almost all steps of proteomic analysis, the big picture microbial community composition and functions are largely consistent across lab groups. There is also evidence that a collaborative metaproteomics approach (i.e. data collected from multiple labs) can assist in the discovery of rarer proteins. This is encouraging news.
The intercomparison is focused on 1D DDA, discovery metaproteomics methods implemented on Orbitrap instruments. This is the most common and accessible proteomics approach available to ocean researchers (in part due to the utility of Orbitrap instruments for other ocean disciplines such as geochemistry, meaning they tend to be accessible in ocean science labs and departments). While developments in 2D fractionation, DIA methods, targeted methods, and non-Orbitrap mass analyzers are exciting and will no doubt influence the future development of ocean metaproteomics, the authors' approach of focusing on currently popular methods makes sense at this time.
I understand that this kind of study is challenging, the field is young, and that the authors do not wish to be prescriptive. However, one clear message was the difficulty of comparing metaproteomics data across labs, and this is striking even though the labs were provided with the same samples and sequence database. This necessitated a re-analysis by the main arbiters of the intercomparison. I wonder if the authors could include some more details about what could have helped them in that comparison. For instance, would it have been beneficial to have multiple results from different types of searches (e.g. with and without protein groupings, with and without razor peptides?) If so, this would mean that with relatively little effort in re-analyzing their own data multiple ways, metaproteomics researchers could make their data more useful and re-usable.
The rest of my comments focus on methodological details and opportunities for further discussion.
General comments
I’m finding some figures to be fuzzy and in most cases the figure fonts to be quite small. This could be because of formatting in the preprint, but I suggest checking this in future versions.
I’m curious to know how similar the relative abundances of the major organisms (cyanobacteria, alphaproteobacteria, and gammaproteobacteria) are in the metaproteomic data versus the matched metagenomes. Metagenome read recruiting is commonly used to assess microbial community composition, but there is evidence that metaproteomics can also provide good or better information.
It would be useful to point out the special significance of the depths that were sampled and whether there were expected differences in microbial community composition among them (i.e. were the 80 and 120m samples the DCM?)
The informatics intercomparison became a large portion of this work, and the authors note that the settings and nuances of the different informatics pipelines play a huge role in the metaproteomics results. I would therefore like to see the informatics methods for the labs come into the main text somehow (methods or a descriptive overview of the different informatics methods used), instead of only being in the supplement, even though they are long.
Specific major comments
Line 240: I’m hesitant on a blanket parent/fragment mass tolerance being applied the same way to all data, despite differences in mass analyzers and resolution for MS1 and MS2 used across the labs. Can the authors at least comment on this caveat?
Line 241/242: Percolator. What were the filtering settings used? Is it possible that there was a double FDR filtration (one in percolator and one in scaffold, which could result in fewer true positives being identified?)
Line 245: The protein inference and quantitation was conducted within Scaffold? Does this mean minimum evidence for a protein is 1 peptide? Was there any normalization applied?
Line 405: There is often discussion about whether perfectly matched metaG or metaT sequences are required for highest quality metaP analyses. Can the authors comment on why they believe that the same metagenome from the initial intercomparison study is still appropriate for the new samples taken a year later?
Line 556: I think it’s important to point out why ocean metaP will need to adopt metadata standards that are different than the ones that are already developed/being developed for other microbiome fields (i.e. the need to include geographical information, collection data such as filter sizes).
Data availability: Can the authors provide the annotation files for the metagenome fasta file? As currently provided in the PRIDE submission, it would be impossible to re-analyze for function/taxonomy in the same way as the authors, and I could see that being of interest in the future, i.e. for comparing annotation pipelines. This could be a flat file in the supplement or included in the GenBank project.
Specific minor comments
Line 74 – what is the measure of complexity being referenced? (dynamic range, proteins identified?)
Line 86 – Authors may find it useful to point out the particular usefulness of proteomics when applied to field/environmental samples, for revealing aspects of in situ biology that might not be resolved when organisms are isolated in the lab (e.g. Kleiner et al., 2019)
Line 123 – somewhere before or in this paragraph, it would be useful to further introduce that this study is focused on “discovery” or “global” or “shotgun” or “bottom up” metaproteomics, where the organisms and functions to be identified are not known in detail nor selected ahead of time
Line 173: Did lab 438 do the analysis or were the procedures of lab 438 performed in another lab?
Line 280 please add short detail about sample: e.g. North Atlantic Ocean 80m
Line 290 Did the participants know anything at all about the samples, e.g. that they were from the oligotrophic open ocean?
Line 308: I’m interested in the proportion of shared peptides/proteins compared to the total number of unique peptides/proteins identified across the participants. Are identifications overlapping by 20%? 50%? In terms of the number of peptides/proteins ID’d when participants used their own pipelines, what was the range seen across laboratories?
Line 314: specific peptide and protein 1% FDR
Line 336: Does this imply that the number of peptides/proteins ID’d is a good measurement for overall data quality (especially if one wants to compare to other datasets)?
Line 352: Related to above comment, among the deeper analyses, was the proportion of ID’s that overlapped greater than the proportion in the 1D samples?
Line 357: The sentence starting with “This indicated…” was difficult for me to understand. I suggest breaking into two sentences to clarify that if there was a fall off, it would be due to peptides being mapped to already discovered proteins.
Line 376: Is this quantitative consistency or the result of normalization?
Line 387: PstS from which organism?
Line 403: relative quantitative abundances
Citation: https://doi.org/10.5194/egusphere-2023-3148-RC1
- AC1: 'Reply on RC1', Mak Saito, 09 Apr 2024
We thank both reviewers for their highly constructive and insightful comments on the manuscript, as well as their overall favorable support of the intercomparison effort. Below, the reviewers’ comments are in quotes and our responses follow. Based on our reading of these constructive comments, we plan to complete all of the suggested revisions. We plan to prepare a manuscript that incorporates all of the suggested changes if the Editor approves this comment and invites revision.
Reviewer #1
R1 commented “However, one clear message was the difficulty of comparing metaproteomics data across labs, and this is striking even though the labs were provided with the same samples and sequence database. This necessitated a re-analysis by the main arbiters of the intercomparison. I wonder if the authors could include some more details about what could have helped them in that comparison. For instance, would it have been beneficial to have multiple results from different types of searches (e.g. with and without protein groupings, with and without razor peptides?) If so, this would mean that with relatively little effort in re-analyzing their own data multiple ways, metaproteomics researchers could make their data more useful and re-usable.” We will include further discussion about the challenges of data output from different software packages.
General comments:
“figures to be fuzzy and in most cases the figure fonts to be quite small” We will address this issue.
“I’m curious to know how similar the relative abundances of the major organisms (cyanobacteria, alphaproteobacteria, and gammaproteobacteria) are in the metaproteomic data versus the matched metagenomes.” While the community in the Northwest Atlantic Ocean is relatively well characterized in this size fraction, we will look into conducting a direct comparison from our metagenomic sample as well.
“It would be useful to point out the special significance of the depths that were sampled and whether there were expected differences in microbial community composition among them (i.e. were the 80 and 120m samples the DCM?).” We will add this environmental context.
“I would therefore like to see the informatics methods for the labs come into the main text somehow (methods or a descriptive overview of the different informatics methods used), instead of only being in the supplement, even though they are long.” We will aim to accommodate this request. To our knowledge BG does not have a page limit but does have page charges.
Specific major comments
“I’m hesitant on a blanket parent/fragment mass tolerance being applied the same way to all data, despite differences in mass analyzers and resolution for MS1 and MS2 used across the labs. Can the authors at least comment on this caveat?” This is a good point and we will check that this application of mass tolerance is consistent with the instrument platforms being employed and add a comment about this.
“Percolator. What were the filtering settings used? Is it possible that there was a double FDR filtration (one in percolator and one in scaffold, which could result in fewer true positives being identified?)” We will investigate this possibility and comment on it.
“Line 245: The protein inference and quantitation was conducted within Scaffold? Does this mean minimum evidence for a protein is 1 peptide? Was there any normalization applied?” As commented in this section we used 1 peptide per protein as the minimum threshold. We will provide information about the normalization.
“Line 405: There is often discussion about whether perfectly matched metaG or metaT sequences are required for highest quality metaP analyses. Can the authors comment on why they believe that the same metagenome from the initial intercomparison study is still appropriate for the new samples taken a year later?” This is a good point for clarification, and we will add some discussion on the topic.
“Line 556: I think it’s important to point out why ocean metaP will need to adopt metadata standards that are different than the ones that are already developed/being developed for other microbiome fields (i.e. the need to include geographical information, collection data such as filter sizes).” We will add this comment. There is a prior manuscript on this specific topic: doi.org/10.1021/acs.jproteome.8b00761.
“Data availability: Can the authors provide the annotation files for the metagenome fasta file? As currently provided in the PRIDE submission, it would be impossible to re-analyze for function/taxonomy in the same way as the authors, and I could see that being of interest in the future, i.e. for comparing annotation pipelines. This could be a flat file in the supplement or included in the GenBank project.” Thank you for this request; we will add this file to the repositories.
All specific minor comments will also be addressed. In particular, this point is a useful one, and was our intent with the breakdown to peptide level analysis:
“Line 308: I’m interested in the proportion of shared peptides/proteins compared to the total number of unique peptides/proteins identified across the participants. Are identifications overlapping by 20%? 50%? In terms of the number of peptides/proteins ID’d when participants used their own pipelines, what was the range seen across laboratories?” We will aim to follow up with these additional analyses.
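The overlap statistic asked about in this comment can be sketched as follows. This is only an illustrative calculation, not the analysis used in the manuscript: the function name `overlap_summary`, the lab labels, and the peptide strings are all hypothetical, and it assumes each laboratory's identifications are available as a plain list of peptide sequences.

```python
from itertools import combinations


def overlap_summary(peptides_by_lab):
    """Summarize how peptide identifications overlap across laboratories.

    peptides_by_lab maps a lab label to an iterable of peptide sequences
    (hypothetical input format, assumed for illustration).
    """
    sets = {lab: set(peps) for lab, peps in peptides_by_lab.items()}
    if not sets:
        return {"total_unique": 0, "shared_by_all": 0,
                "fraction_shared_by_all": 0.0, "pairwise_jaccard": {}}

    union = set().union(*sets.values())                # every peptide seen by any lab
    shared_by_all = set.intersection(*sets.values())   # peptides seen by every lab

    # Pairwise Jaccard index: |A ∩ B| / |A ∪ B| for each pair of labs.
    pairwise = {}
    for a, b in combinations(sorted(sets), 2):
        either = sets[a] | sets[b]
        pairwise[(a, b)] = len(sets[a] & sets[b]) / len(either) if either else 0.0

    return {
        "total_unique": len(union),
        "shared_by_all": len(shared_by_all),
        "fraction_shared_by_all": len(shared_by_all) / len(union) if union else 0.0,
        "pairwise_jaccard": pairwise,
    }


# Made-up example data; these are not results from the study.
example = {
    "lab_A": ["PEPTIDEK", "SAMPLER", "OCEANK"],
    "lab_B": ["PEPTIDEK", "SAMPLER"],
    "lab_C": ["PEPTIDEK", "MARINEK"],
}
summary = overlap_summary(example)
print(summary["total_unique"], summary["fraction_shared_by_all"])
```

Reporting both the fraction shared by all labs and the pairwise values is a design choice: the former shrinks quickly as more labs are added, while the pairwise index stays comparable between any two participants.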
Citation: https://doi.org/10.5194/egusphere-2023-3148-AC1
- RC2: 'Comment on egusphere-2023-3148', Anonymous Referee #2, 23 Mar 2024
Saito et al. present a comparison of metaproteomics sample preparation, measurement and analysis for ocean samples across several laboratories. The comparability of metaproteomics data across laboratories, especially of highly complex communities, is important to discuss, given also the increasing research efforts in this direction. As detailed in the manuscript, overall comparability is, despite a wide variety in applied methods, quite high between laboratories. At the same time, the manuscript also outlines room for further development.
General comments:
A clear rationale is given for focusing on 1D DDA, and for setting some boundaries for the analysis. The manuscript gives detailed insights into the comparisons. At the same time, it becomes at times unclear whether a reference is made to the informatics or wetlab comparison part of the study. Additionally, the manuscript would benefit from some shortening and focusing, especially in the discussion. For example, references to DIA are made at several places in the discussion, which could likely be condensed. On the other hand, some insights into reasons for differences would be interesting: Are there, e.g., indications that certain setups promote higher metaproteome coverage? Were there some re-runs by the same laboratory, which might be used to assess reproducibility within the same sample (given that it is highly complex) and laboratory?
In addition, I second the comments of Reviewer 1 regarding the re-usability of data generated by different laboratories - which parameters could or should be fixed? These insights would also be immensely helpful in striving to agree upon general (meta)proteomics standards. I would also be curious to see a brief comment on how one common search strategy might have impacted the results coming from different MS instruments with different settings. Also along the lines of comments of Reviewer 1, please increase the size of figures, figure captions and labels where appropriate.
Specific comments:
Abstract
Line 43: I’d argue that metaproteomics has not only the potential for contribution to ocean ecology, but already did so - and would like to see some references for that, to make clear from the beginning why this manuscript is an important contribution to the field.
Line 52: While R2=0.83 does indeed indicate good reproducibility, a value of 0.43 does less so - maybe give a (very brief) rationale for this discrepancy.
The term “informatic” appears a bit ambiguous at times - maybe consider replacing this by “bioinformatic” or “computational”.
Line 58: To what does reproducibility refer here?
Introduction
Lines 66-67: please give references.
Lines 73 ff: There is a jump in content: first, metaproteomics is introduced, and then proteomics (not metaproteomics) is detailed again.
Lines 83-86: Maybe check whether a focus can be set specifically for marine environments
Line 88: “measurement of microbial proteins” - this could refer to many different methods, but probably refers to metaproteomics? Please formulate more specifically.
Line 92: please consider replacing “the metaproteomics datatype”, e.g., by “metaproteomics data”
Methods
Line 245: There seems to be a word missing after “protein”
Results
Lines 273, 277: please replace “activities” and “aims” with “components” (or another consistently used term) to aid in understanding. Potentially, you could add separate names or labels for the two study parts.
Lines 314-315: please remove the surplus “see”
Line 340: Do you mean “shared” instead of “showed”?
Lines 342-346: This sentence is somewhat hard to read, maybe split.
Line 364: Do you mean “consensus” in place of “coherence”?
Lines 367-368: Isn’t a unity line always observed when comparing a dataset to itself?
Line 374: Do you mean “consensus” in place of “coherence”?
Line 383: How abundant are these organisms? Please give some estimations.
Line 387: do you mean proteins or protein functional groups, as this refers to a KEGG analysis?
Line 388: What does “functional” refer to here?
Lines 394 ff: Please re-phrase/simplify, e.g., “Variability at the protein level was lower than at the peptide level”
Lines 404-405: Please briefly elaborate on the use of the same database, given that the North Atlantic is a highly variable and shifting environment
Lines 410 ff: Maybe the general pipelines could be briefly presented in the main text.
Discussion
Lines 424-426: Please shorten and focus this sentence.
Lines 434-437: Please shorten.
Line 439: Did you use a specific cutoff to determine abundant proteins?
Lines 440 ff. Please formulate more clearly, e.g., “Probable reasons for this discrepancy are: …”
Lines 478-485: Please shorten.
Citation: https://doi.org/10.5194/egusphere-2023-3148-RC2
- AC2: 'Reply on RC2', Mak Saito, 09 Apr 2024
We thank both reviewers for their highly constructive and insightful comments on the manuscript, as well as their overall favorable support of the intercomparison effort. Below, the reviewers’ comments are in quotes and our responses follow. Based on our reading of these constructive comments, we plan to complete all of the suggested revisions. We plan to prepare a manuscript that incorporates all of the suggested changes if the Editor approves this comment and invites revision.
Reviewer #2
R2 commented: “At the same time, it becomes at times unclear whether a reference is made to the informatics or wetlab comparison part of the study. Additionally, the manuscript would benefit from some shortening and focusing, especially in the discussion. For example, references to DIA are made at several places in the discussion, which could likely be condensed.” We will make these edits.
“On the other hand, some insights into reasons for differences would be interesting: Are there, e.g., indications that certain setups promote higher metaproteome coverage? Were there some re-runs by the same laboratory, which might be used to assess reproducibility within the same sample (given that it is highly complex) and laboratory?” We will add this commentary.
“In addition, I second the comments of Reviewer 1 regarding the re-usability of data generated by different laboratories - which parameters could or should be fixed? These insights would also be immensely helpful in striving to agree upon general (meta)proteomics standards.” We will add this commentary.
“I would also be curious to see a brief comment on how one common search strategy might have impacted the results coming from different MS instruments with different settings.” This is essentially what we have done in the reanalysis, albeit within the realm of varying types of Orbitrap MS instruments (no participants used other instruments). The point is well taken that other types of instruments should be intercompared (e.g. TOF instruments), and we can add a sentence on this point.
“Also along the lines of comments of Reviewer 1, please increase the size of figures, figure captions and labels where appropriate.” We will do this.
All specific comments will also be addressed. In particular this comment is useful for us to expand on: “Line 43: I’d argue that metaproteomics has not only the potential for contribution to ocean ecology, but already did so - and would like to see some references for that, to make clear from the beginning why this manuscript is an important contribution to the field.”
Citation: https://doi.org/10.5194/egusphere-2023-3148-AC2