Machine learning algorithms for peptide identification and protein quantification in proteomics

Posted on:2017-09-29

Degree:Ph.D

Type:Thesis

University:Indiana University

Candidate:Ji, Chao

Full Text:PDF

GTID:2460390011497689

Subject:Computer Science

Abstract/Summary:

Mass spectrometry (MS)-based proteomics has become an indispensable technology for modern biomedical research. One of its goals is to characterize the species and relative quantities of proteins present in complex samples. Thanks to advances in MS instrumentation and the development of new informatics techniques, proteins are now being studied at unprecedented scale and with increasing accuracy. However, despite the success of existing methods in advancing the state of the art of MS-based proteomics, a number of non-trivial problems remain unsolved and call for novel approaches. In this thesis, I address problems related to the identification and quantification of proteins using MS data.

First, a neighbor-based approach is proposed to predict the peak intensities of fragment ion spectra of previously unobserved peptides. The prediction is a weighted average of the peak intensities in the spectra of a number of similar peptides, with more similar peptides assigned greater weights. The similarity between peptides is determined using Support Vector Machine models that take features extracted from the peptides' amino acid sequences. Using spectra from real proteomics datasets, I show that the predicted spectra are practically useful for peptide identification in spectral library search.

Second, a database searching algorithm for identifying cross-linked peptides is proposed. It relies on a novel data-driven scoring scheme that estimates the marginal probability of correctly identifying either peptide, as well as the joint probability that both peptides in a cross-link are correctly identified. I show that this scoring scheme effectively distinguishes true positives from a type of false positive that existing methods commonly mistake for a true positive. Using datasets from previous cross-link studies, I demonstrate that the proposed algorithm achieves more identifications than existing methods.

Third, an iterative learning algorithm is proposed to estimate absolute protein abundance in a complex proteome. The estimate is obtained by maximizing a likelihood function of protein abundance given the observed peptide signal intensities. Notably, the peptide signal intensities are calibrated by a property called Peptide Response Rate (PRR), which quantifies the signal intensity detected for a peptide ion at a given abundance level. The effectiveness of PRR in correcting the bias of detected peptide signal intensity and improving the accuracy of protein quantification is demonstrated by comparison against a baseline model without PRR calibration and against alternative protein quantification methods.
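
The similarity-weighted averaging behind the neighbor-based spectrum prediction can be illustrated with a minimal sketch. It is not the thesis implementation: the function name `predict_spectrum`, the dictionary representation of a spectrum, and the rule that an ion missing from a neighbor counts as zero intensity are all assumptions made for illustration. The similarity scores are taken as given inputs here, whereas the thesis derives them from Support Vector Machine models.

```python
from typing import Dict, List, Tuple

# Hypothetical representation: fragment ion label (e.g. "b2", "y3") -> intensity.
Spectrum = Dict[str, float]


def predict_spectrum(neighbors: List[Tuple[float, Spectrum]]) -> Spectrum:
    """Return the similarity-weighted average of the neighbor spectra.

    Each neighbor is a (similarity, spectrum) pair. An ion absent from a
    neighbor's spectrum contributes zero intensity (an assumption).
    """
    total_weight = sum(weight for weight, _ in neighbors)
    if total_weight <= 0:
        raise ValueError("at least one neighbor needs a positive similarity")

    weighted_sum: Spectrum = {}
    for weight, spectrum in neighbors:
        for ion, intensity in spectrum.items():
            weighted_sum[ion] = weighted_sum.get(ion, 0.0) + weight * intensity

    # Normalize by the total weight so more similar peptides dominate.
    return {ion: value / total_weight for ion, value in weighted_sum.items()}


# Two made-up neighbor peptides; the first is three times as similar.
example = predict_spectrum([
    (3.0, {"b2": 1.0, "y3": 0.5}),
    (1.0, {"b2": 0.2}),
])
```

In this toy input the more similar neighbor pulls the predicted `b2` intensity toward its own value, which is the weighting behavior the abstract describes.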

Keywords/Search Tags:

Protein quantification, Peptide, Proteomics, PRR, Identification, Algorithm, Methods

Related items

1	Applications of Probabilistic Models on Peptide MS/MS Spectra Identification and Protein Quantification
2	Isobaric Stable Isotope Phosphorylation Labeling For Protein Quantification And Its Application In Targets Identification For HIV Latency Activator
3	Research On Quantitative Proteomics Method For DIA Mass Spectrometry Data
4	New Protein Discovery Algorithm Based On The Experimental Data Of Mass-Spectrometry
5	Statistical learning algorithms for protein inference and quantification in proteomics
6	Evaluation Of Different Protein Probability Calculating Methods Using A Semi-random Sampling Model And Other MS Bioinformatics Studies
7	A New Non-restrictive Enzymatic Proteomics Analytic Strategy With The De Novo Peptide Sequencing
8	Research And Application Of Protein Theoretical Spectrum And Peptide Fragmentation Event Model
9	Studies On The Peptide Identification Algorithms By Tandem Mass Spectrometry In Proteomics
10	Statistical physics inspired methods to assign statistical significance in bioinformatics and proteomics: From sequence comparison to mass spectrometry based peptide sequencing