Machine learning algorithms for peptide identification and protein quantification in proteomics

Posted on: 2017-09-29 | Degree: Ph.D. | Type: Thesis
University: Indiana University | Candidate: Ji, Chao
GTID: 2460390011497689 | Subject: Computer Science

Abstract/Summary:

Mass spectrometry (MS)-based proteomics has become an indispensable technology for modern biomedical research. One of its goals is to characterize the species and relative quantities of proteins present in complex samples. Thanks to advances in MS instrumentation and the development of new informatics techniques, proteins are now studied at unprecedented scale and with increasing accuracy. However, despite the success of existing methods in advancing the state of the art of MS-based proteomics, a number of non-trivial problems remain unsolved and call for novel approaches. In this thesis, I address problems related to the identification and quantification of proteins using MS data.

First, a neighbor-based approach is proposed to predict the peak intensities of fragment ion spectra of previously unobserved peptides. The prediction is obtained as a weighted average of the peak intensities in the spectra of a number of similar peptides, with more similar peptides assigned greater weights. The similarity between peptides is determined using Support Vector Machine models that take as input features extracted from the peptides' amino acid sequences. Using spectra from real proteomics datasets, I show that the predicted spectra are practically useful for peptide identification in spectral library search.

Second, a database searching algorithm for identifying cross-linked peptides is proposed.
It relies on a novel data-driven scoring scheme that estimates the marginal probability of correctly identifying either peptide, as well as the joint probability that both peptides in a cross-link are correctly identified. I show that this scoring scheme effectively distinguishes true positives from a type of false positive that existing methods commonly mistake for a true positive. The advantage of the proposed algorithm in achieving more identifications than existing methods is demonstrated using datasets from previous cross-link studies.

Third, an iterative learning algorithm is proposed to estimate absolute protein abundance in a complex proteome. The estimated protein abundance is determined by maximizing a likelihood function of protein abundance given the observed peptide signal intensities. Notably, the peptide signal intensities are calibrated by a property called Peptide Response Rate (PRR), which quantifies the signal intensity detected for a peptide ion at a given abundance level. The effectiveness of PRR in correcting the bias of detected peptide signal intensity and improving the accuracy of protein quantification is demonstrated by comparison against a baseline model without PRR calibration and against alternative protein quantification methods.

Keywords/Search Tags: Protein quantification, Peptide, Proteomics, PRR, Identification, Algorithm, Methods
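
The neighbor-based prediction described in the abstract (a weighted average over the spectra of similar peptides, with greater weight for more similar peptides) can be illustrated with a minimal sketch. This is not the thesis's implementation. The function name `predict_intensities`, the parameter `k`, the assumption that spectra are already aligned as equal-length intensity vectors, and the choice of weights proportional to the similarity score are all hypothetical. The similarity scores are taken as given, whereas the thesis derives them from Support Vector Machine models.

```python
def predict_intensities(neighbor_spectra, similarities, k=3):
    """Predict peak intensities as a similarity-weighted average.

    neighbor_spectra: list of equal-length intensity lists, one per known
        peptide, aligned so that position j is the same fragment ion
        (an assumption of this sketch).
    similarities: one non-negative similarity score per known peptide
        (assumed to come from an upstream model).
    k: number of most similar peptides to average over (hypothetical).
    """
    if len(neighbor_spectra) != len(similarities):
        raise ValueError("need exactly one similarity score per spectrum")

    # Keep the indices of the k most similar peptides.
    ranked = sorted(range(len(similarities)),
                    key=lambda i: similarities[i],
                    reverse=True)[:k]

    total = sum(similarities[i] for i in ranked)
    if total <= 0:
        raise ValueError("selected neighbors need a positive total similarity")

    n_peaks = len(neighbor_spectra[ranked[0]])
    predicted = [0.0] * n_peaks
    for i in ranked:
        weight = similarities[i] / total  # more similar => larger weight
        for j, intensity in enumerate(neighbor_spectra[i]):
            predicted[j] += weight * intensity
    return predicted


# Toy example: three known peptides, three fragment-ion peaks each.
spectra = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.5], [0.2, 0.2, 0.2]]
scores = [3.0, 1.0, 0.5]
print(predict_intensities(spectra, scores, k=2))  # [0.75, 0.25, 0.5]
```

With `k=2`, only the two most similar peptides contribute, with weights 0.75 and 0.25. Normalizing the weights to sum to one keeps the predicted intensities on the same scale as the input spectra.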