
Investigation And Application Of Quality Control Algorithms For Peptide And PTM Identification Using Tandem Mass Spectrometry Data

Posted on: 2016-12-16
Degree: Doctor
Type: Dissertation
Country: China
Candidate: C P Zhang
Full Text: PDF
GTID: 1220330461991115
Subject: Biochemistry and Molecular Biology
Abstract/Summary:
Proteins are important components of cell structure and the direct executors of cellular function. The primary mission of proteomics is the qualitative and quantitative analysis of all the proteins present in organelles, tissues, and cell lines, including their expression, cellular localization, interactions, and post-translational modifications. Protein post-translational modifications (PTMs) are widespread in eukaryotic cells and have a significant influence on the structure and function of proteins. Phosphorylation, which affects a wide range of important cellular processes, including cell signaling and metabolism as well as cell growth, differentiation, and proliferation, is one of the most important and well-studied modifications in proteomics research. The rapid development of tandem mass spectrometry (MS/MS) has provided a sensitive and accurate platform for proteomics. “Bottom-up” protein analysis refers to the characterization of peptides released by digestion of a protein mixture. It enables rapid, high-throughput identification of peptides, proteins, and PTMs, and has become the major strategy of MS/MS-based proteomic studies.

Database searching is the major bioinformatics strategy for interpreting MS/MS data; it determines the best-matching peptide for each spectrum. Because MS/MS data are complex, quality control of the results returned by sequence search engines is necessary. The target-decoy search strategy, which estimates the proportion of false positive matches within an entire dataset, has been widely used in large-scale proteomic studies. However, as the scale of proteomics data grows, existing quality control algorithms face more challenges.

1) Multiple database search engines are used to analyze MS/MS data, so quality control algorithms must provide interfaces for identification results in different data formats.
In addition, one dataset may be searched by more than one engine, and the results need to be integrated to ensure the accuracy and sensitivity of the identified peptides.

2) Localizing phosphorylation sites within peptide sequences is necessary in phosphoproteomics analysis. Several algorithms and tools have been implemented to rescore high-confidence peptide-spectrum matches (PSMs) and estimate the false localization rate (FLR) of the assigned sites. However, most of these algorithms were developed and evaluated on synthetic phosphorylated peptide sets that are too small to guarantee accuracy in large-scale phosphoproteomics analysis.

3) Because of the neutral loss of phosphorylation and the influence of noise ions, MS/MS spectra may not always provide enough information to localize phosphorylation among neighboring potential sites. A large proportion of spectra yield uncertain phosphorylation sites even when they produce high-confidence peptide matches.

4) Proteome coverage of human protein-coding genes can exceed 60% when multiple samples and experimental strategies are employed. Quality control algorithms therefore need to integrate proteomic datasets from different sources and eliminate the false positive identifications that accumulate across multiple large-scale proteomics datasets.

Our research mainly focused on quality control algorithms for the database search results of large-scale MS/MS datasets. By improving and optimizing peptide quality control algorithms, we constructed a workflow named Phospho Distiller, which proved to be highly sensitive and accurate for quality control of phosphorylated peptides and phosphorylation sites. The workflow was applied to the analysis of multiple large-scale proteome datasets.
A strategy for integrating proteome datasets from different sources was implemented at the protein level, which exports a high-confidence protein list for subsequent biological analysis.

First, we optimized the semi-supervised quality control algorithm Pep Distiller, which had been shown to have high sensitivity for peptide identification. By defining a standard input format and revising the feature calculation strategy, we enabled the modified Pep Distiller to analyze identification results from any search engine. Models for different fragmentation techniques, such as CID, HCD, and ETD, were covered to make the algorithm more general. Combining the results from multiple search engines significantly improved the sensitivity of peptide identification.

Building on this peptide quality control platform, we developed Phospho Distiller, a workflow for analyzing database search results for phosphorylated peptides and phosphorylation sites. Phospho Distiller facilitates the quality control of large-scale phosphoproteomics datasets by integrating MS/MS features and motif sequences, and ensures the accuracy and sensitivity of the results.

For phosphorylated peptides, by inheriting the merits of Pep Distiller and taking into account the phosphorylation-specific feature of neutral loss, Phospho Distiller identifies phosphorylated peptides with high sensitivity. Identification results from different fractions of an experiment are analyzed together, which eliminates the false positive identifications introduced by fractions containing few phosphorylated peptides. The algorithm was evaluated on a large-scale synthetic phosphopeptide reference library.
The estimated FDR was higher than the true FDR, so the estimate is conservative and the accuracy of peptide identification is ensured.

Based on the peptide quality control results, a probability-based score and a motif PEP score are used to localize phosphorylation sites on the peptide sequence. The probability-based score evaluates the similarity of a spectrum to the theoretical fragmentation spectra of all isoforms. Cumulative binomial probabilities are calculated for each isoform from the numbers of total and matched site-determining ions, and are then converted into a probability-based score that estimates the accuracy of phosphorylation site localization. The probability-based score is then corrected to reduce the influence of noise on the algorithm and to keep the estimated FLR consistent with the true FLR.

The disadvantage of the probability-based algorithm is that nearly half of the high-confidence phosphorylated PSMs fail to yield an unambiguous site, mainly because of phosphoisoforms, the absence of site-determining ions, and spectrum noise. We therefore introduced the motif sequence as a feature to improve the sensitivity of site localization. The algorithm first calculates a score for each motif from the number of matched phosphorylated spectra with an unambiguous site and the number of matched non-phosphorylated spectra, which reflects the activity of the kinase associated with that motif in the sample. The motif score is then integrated into the probability-based algorithm to resolve the ambiguous sites. Bayes' formula is used to calculate the motif PEP, which reflects the probability of a site being phosphorylated by an activated kinase. This strategy links information across different spectra and reduces the reliance on spectrum quality.
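
The site localization idea described above can be illustrated with a small sketch. The abstract does not give the exact formulas, so everything specific here is an assumption made for illustration: the random-match probability `p`, the conversion of binomial tail probabilities into normalized site probabilities, the pseudo-counted motif score in `motif_prior`, and all function names. It is not the Phospho Distiller implementation.

```python
from math import comb


def binomial_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p): chance of k or more random matches."""
    if not 0 <= k <= n:
        raise ValueError("k must satisfy 0 <= k <= n")
    return sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k, n + 1))


def motif_prior(n_phospho, n_nonphospho, pseudo=1.0):
    """Hypothetical motif score: smoothed fraction of spectra matching the
    motif that carry an unambiguous phosphorylation site."""
    return (n_phospho + pseudo) / (n_phospho + n_nonphospho + 2 * pseudo)


def localization_probabilities(isoforms, p=0.05, priors=None):
    """isoforms maps a candidate site to (n_site_determining_ions, n_matched).

    Each isoform is weighted by prior / tail probability (an assumed
    conversion), then the weights are normalized to sum to 1, which is a
    Bayes-style posterior over the candidate sites.
    """
    if priors is None:
        priors = {site: 1.0 for site in isoforms}  # flat prior: spectrum only
    weights = {}
    for site, (n_ions, n_matched) in isoforms.items():
        weights[site] = priors[site] / binomial_tail(n_ions, n_matched, p)
    total = sum(weights.values())
    return {site: w / total for site, w in weights.items()}


# Spectrum evidence alone separates the two candidate sites.
clear = localization_probabilities({"S3": (4, 3), "T5": (4, 1)})

# Identical ion evidence: ambiguous without motif information.
tied = localization_probabilities({"S3": (4, 2), "T5": (4, 2)})

# The same tie resolved with motif priors built from made-up spectrum counts.
priors = {"S3": motif_prior(30, 10), "T5": motif_prior(2, 38)}
resolved = localization_probabilities({"S3": (4, 2), "T5": (4, 2)}, priors=priors)
```

In this toy setup the tied case assigns equal probability to both sites, and the motif prior shifts the weight toward the site whose motif is frequently observed as phosphorylated, which mirrors how motif information can rescue spectra that lack site-determining ions.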
Integrating the probability-based score and the motif PEP score increased the number of localized phosphorylation sites by about 15%.

The quality control workflow was applied to the analysis of large-scale proteomics datasets for the Chromosome-Centric Human Proteome Project (C-HPP). To integrate datasets from different MS/MS instruments and different samples, a protein-level quality control strategy was implemented based on the high-confidence peptides identified in each experiment. The FDR of the final protein list was controlled below 1%, which ensures the accuracy of our proteome dataset. The proteome dataset included 12,740 high-confidence proteins and covered 15.4% of the missing proteins according to the official C-HPP definition. Then, using the genes identified by RNA-seq analysis as the background, we designed a simulation strategy to analyze the ceiling of large-scale MS/MS-based proteomics experiments, which could serve as a guideline for subsequent C-HPP research. Besides the human proteome datasets, the algorithm was also used to analyze a near-complete yeast proteome dataset, which covers 83.5% of the protein-coding genes in the yeast genome. Our results suggested that almost all transcribed genes could yield protein-level evidence, which would help us explore the expression patterns of protein-coding genes in yeast.

In summary, our research addressed the shortcomings of peptide and PTM quality control algorithms for the analysis of large-scale MS/MS datasets and constructed a quality control workflow named Phospho Distiller based on the target-decoy strategy. Phospho Distiller can be used to analyze identification results from multiple instruments and database search engines, extracts peptides and PTMs with high sensitivity and accuracy, and increases the efficiency with which MS/MS data are used to localize modification sites.
The workflow was applied to the analysis of C-HPP proteome datasets and near-complete yeast proteome datasets, which would greatly facilitate subsequent quantitative and functional analysis.
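
As a reminder of what the target-decoy strategy underlying the workflow computes, here is a minimal sketch. It assumes a concatenated target-decoy search in which FDR at a score threshold is estimated as the number of decoy hits divided by the number of target hits. The function names and the toy scores are hypothetical and do not come from the dissertation.

```python
def qvalues(psms):
    """psms: list of (score, is_decoy) pairs, higher score = better match.

    Returns (score, is_decoy, q_value) tuples sorted by descending score.
    FDR at each rank is estimated as decoys / targets (assumed estimator).
    """
    ranked = sorted(psms, key=lambda psm: psm[0], reverse=True)
    targets = decoys = 0
    fdrs = []
    for _, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        fdrs.append(decoys / max(targets, 1))

    # q-value: the smallest FDR at this threshold or any more permissive one.
    q_values = []
    running_min = float("inf")
    for fdr in reversed(fdrs):
        running_min = min(running_min, fdr)
        q_values.append(running_min)
    q_values.reverse()

    return [(score, is_decoy, q) for (score, is_decoy), q in zip(ranked, q_values)]


def accepted_targets(psms, max_fdr=0.01):
    """Scores of target PSMs whose q-value is within the FDR limit."""
    return [score for score, is_decoy, q in qvalues(psms) if not is_decoy and q <= max_fdr]


# Toy example: True marks a decoy hit.
psms = [
    (9.1, False), (8.7, False), (8.0, False), (7.2, True),
    (6.9, False), (5.5, True), (5.1, False),
]
strict = accepted_targets(psms, max_fdr=0.01)
relaxed = accepted_targets(psms, max_fdr=0.25)
```

With this toy input, the strict 1% limit keeps only the three target hits that score above the first decoy, while relaxing the limit to 25% admits one more target.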
Keywords/Search Tags: Proteomics, tandem mass spectrometry, quality control, post-translational modification