
Research on Protein Inference Algorithms and Statistical Validation of Protein Identifications

Posted on: 2015-06-05
Degree: Master
Type: Thesis
Country: China
Candidate: T Huang
Full Text: PDF
GTID: 2180330467985763
Subject: Computer application technology
Abstract/Summary:
Shotgun proteomics has emerged as the most powerful technique for comprehensively mapping a proteome. Reconstructing protein identities from raw mass spectrometric data is a cornerstone of any shotgun proteomics workflow. Protein identification comprises two steps: peptide identification and protein inference. The objective of protein inference is to assemble the identified peptides into a list of proteins. However, the inherent uncertainty of mass spectrometric data and the complexity of a proteome make protein inference a non-trivial task. In addition, because the proteins present in the sample are unknown, estimating the statistical significance of protein identifications remains a subject of ongoing research.

We first present a linear programming model for protein inference. The model uses, as its variables, a transformation of the joint probability that each peptide/protein pair is present in the sample. Both the peptide probability and the protein probability can then be expressed as linear combinations of these variables. The protein inference problem is thus formulated as an optimization problem: minimize the number of proteins with non-zero probabilities, subject to the constraint that the difference between the calculated peptide probability and the input peptide probability is less than a given threshold. Experimental results on six datasets show that our method is competitive with state-of-the-art protein inference algorithms.

We also propose a decoy-free method for estimating the protein-level false discovery rate (FDR). Under the null hypothesis that each candidate protein matches an identified peptide entirely at random, we assign statistical significance to protein identifications in terms of the permutation P value and use these P values to calculate the FDR. Our method consists of three key steps: (i) generating random bipartite graphs with the same structure; (ii) calculating the protein scores on these random graphs; and (iii) calculating the permutation P value and FDR. Because running a protein inference algorithm thousands of times in step (ii) is time-consuming, we train a linear regression model and use it as a substitute for the original protein inference method to predict protein scores on the shuffled graphs. The results show that our method is comparable to state-of-the-art algorithms in terms of estimation accuracy.
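
The three steps of the FDR estimation can be illustrated with a minimal Python sketch. It is not the thesis implementation, and several pieces are assumptions made only for illustration: `toy_score` (a sum of peptide probabilities) stands in for the learned regression model, "same structure" is read as keeping each protein's number of peptide edges, and the Benjamini-Hochberg procedure is used to turn P values into FDR estimates, since the abstract does not say which procedure is used. All function names are hypothetical.

```python
import random


def toy_score(peptide_probs, peptides):
    """Hypothetical stand-in for the protein score: sum of peptide probabilities."""
    return sum(peptide_probs[p] for p in peptides)


def shuffle_graph(graph, peptide_ids, rng):
    """Step (i): random bipartite graph that keeps each protein's degree.

    Assumption: 'same structure' is read here as 'same number of edges per
    protein'; peptide degrees are not preserved by this simple shuffle.
    """
    return {prot: rng.sample(peptide_ids, len(peps)) for prot, peps in graph.items()}


def permutation_p_values(graph, peptide_probs, n_perm=1000, seed=0):
    """Steps (ii) and (iii): score shuffled graphs and compute permutation P values."""
    rng = random.Random(seed)
    peptide_ids = sorted(peptide_probs)
    observed = {prot: toy_score(peptide_probs, peps) for prot, peps in graph.items()}
    exceed = {prot: 0 for prot in graph}
    for _ in range(n_perm):
        shuffled = shuffle_graph(graph, peptide_ids, rng)
        for prot, peps in shuffled.items():
            # Small tolerance guards against floating-point summation order.
            if toy_score(peptide_probs, peps) >= observed[prot] - 1e-12:
                exceed[prot] += 1
    # The add-one correction keeps every P value strictly positive.
    return {prot: (exceed[prot] + 1) / (n_perm + 1) for prot in graph}


def bh_fdr(p_values):
    """Benjamini-Hochberg adjusted P values (an assumed choice of FDR procedure)."""
    items = sorted(p_values.items(), key=lambda kv: kv[1])
    m = len(items)
    adjusted = {}
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        prot, p = items[rank - 1]
        running_min = min(running_min, p * m / rank)
        adjusted[prot] = running_min
    return adjusted


if __name__ == "__main__":
    probs = {"pep1": 0.99, "pep2": 0.95, "pep3": 0.10, "pep4": 0.05, "pep5": 0.02}
    graph = {"A": ["pep1", "pep2"], "B": ["pep4", "pep5"]}
    p_vals = permutation_p_values(graph, probs, n_perm=200, seed=1)
    print(p_vals)
    print(bh_fdr(p_vals))
```

In this toy input, protein A is linked to the two highest-probability peptides, so few shuffled graphs reach its observed score and its P value is small. Protein B is linked to the two lowest-probability peptides, so every shuffled graph matches or exceeds its score.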
Keywords/Search Tags: Shotgun proteomics, Protein inference, Validation of protein identifications, Linear programming