
Study On Protein Identification Algorithms Based On Tandem Mass Spectrometry

Posted on: 2010-08-13
Degree: Doctor
Type: Dissertation
Country: China
Candidate: C Y Yu
GTID: 1100360302477801
Subject: Computer software and theory
Abstract/Summary:
In recent years, proteomics has attracted wide attention for its application prospects and has become one of the most important research topics of the post-genome era. Protein identification is a particularly important step in proteomics, because it provides the basis for analyzing protein function and cooperation between proteins. At present, tandem mass spectrometry is recognized as one of the most powerful tools for large-scale, rapid and accurate protein identification because of its high sensitivity and accuracy. However, the characteristics of high-resolution tandem mass spectrometry data make the data difficult to process computationally, which poses a new challenge for algorithm design. The main sources of difficulty are:

(1) Ion complexity. After CID (collision-induced dissociation) fragmentation, the ions in a tandem mass spectrum are highly heterogeneous: noise peaks, peaks of known ion types (such as the N-terminal a, b, c ions and the C-terminal x, y, z ions), isotope peaks, and peaks of unknown ion types. This complexity raises the probability of misidentifying ion types and of matching peaks to the wrong ions, which produces many false positives.

(2) Incompleteness of MS/MS data. In CID, a peptide may fail to fragment at some of its peptide bonds, so the corresponding information is missing from the tandem mass spectrum. As a result, an algorithm may be unable to infer the correct sequence, or may discard the correct sequence because of its low score, which increases the number of false negatives.

(3) Post-translational modification. Mutations and post-translational modifications of a peptide can shift the ion peaks in its tandem mass spectrum, which further complicates the interpretation of the data.

This dissertation studies these difficulties in protein identification based on tandem mass spectrometry in depth.
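To make the ion types in item (1) concrete, the following minimal sketch computes the theoretical singly charged b-ion and y-ion m/z values of a peptide from standard monoisotopic residue masses. It is a generic textbook calculation, not code from the dissertation; the function names `b_ions` and `y_ions` are chosen here for illustration only.

```python
# Monoisotopic residue masses (Da) of the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
PROTON = 1.007276   # mass of a proton (Da)
WATER = 18.010565   # monoisotopic mass of H2O (Da)


def b_ions(peptide):
    """Singly charged b-ion m/z values b1..b(n-1) (N-terminal prefixes)."""
    values, total = [], 0.0
    for residue in peptide[:-1]:
        total += RESIDUE_MASS[residue]
        values.append(total + PROTON)
    return values


def y_ions(peptide):
    """Singly charged y-ion m/z values y1..y(n-1) (C-terminal suffixes)."""
    values, total = [], 0.0
    for residue in reversed(peptide[1:]):
        total += RESIDUE_MASS[residue]
        values.append(total + WATER + PROTON)
    return values
```

For a peptide of n residues, the values of b_i and y_(n-i) add up to the neutral peptide mass plus two proton masses. A missing cleavage at one peptide bond, as described in item (2), removes both members of such a complementary pair from the spectrum.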
The contributions of this dissertation are summarized as follows.

(1) A series of algorithms is proposed to address the shortcomings of the spectrum graph model used in de novo peptide sequencing. First, the PShifter algorithm transforms the other ion types in a spectrum into b ions. Second, the SVM-based Ion-Classifier algorithm separates ions of the δ_i type from ions of other types. Finally, the b/y-Classifier algorithm, based on frequent pattern mining and decision trees, distinguishes b ions from y ions. Experimental results show that these algorithms perform well at filtering noise peaks and classifying ions, which improves the performance of existing de novo peptide sequencing algorithms.

(2) Several algorithms are proposed for scoring the match between a peptide sequence and a tandem mass spectrum. First, the ITPIA algorithm, based on entropy theory, calculates the entropy of each ion in the theoretical spectrum of a peptide to measure how well the peptide sequence matches the experimental spectrum. Second, a kNN-based scoring method makes good use of the intensity information in the spectrum and scores the sequence-spectrum match against a knowledge set. Finally, the ReCheck algorithm expands the information of one peptide bond to three sites, which helps overcome the incompleteness of MS/MS data. Experimental results show that these algorithms can be applied in database search algorithms and achieve good results on several datasets.

(3) The PepCheck algorithm is proposed for searching a protein database with peptide sequence tags. A spectrum graph is built, and the tag generation problem is converted into the problem of querying the longest parallel path and its complement path. In addition, an enumeration tree index and a score between the peptide sequence tag and the protein sequence are used to speed up database searching and increase accuracy. Experimental results show that PepCheck achieves a high level of accuracy.

(4) The Check-PTM algorithm, based on spectral alignment, is proposed for identifying protein post-translational modifications (PTMs). A more reasonable spectral alignment model and strategies for solving it are proposed, based on the characteristics of ion peak shifts in tandem mass spectra of modified peptides. An approximate algorithm is also proposed to improve the performance of the model. Furthermore, algorithms for discovering the modification type and site are proposed, based on the relations between the shift values and the set of shifted ions. Experimental results show that Check-PTM achieves a high level of accuracy.
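The peak shifts that a modification causes can be illustrated with a toy example. The sketch below is not the Check-PTM algorithm, whose details are not given in this abstract. It assumes a single modification and assumes the observed b ions have already been matched one-to-one with the theoretical b ions; the function name `locate_shift` and the tolerance value are hypothetical.

```python
def locate_shift(theoretical_b, observed_b, tol=0.01):
    """Find a single mass shift in an index-matched b-ion ladder.

    Returns (site, shift), where site is the 1-based index of the first
    shifted b ion and shift is the mass difference in Da. Returns None
    if no ion is shifted or if the later ions do not all carry the same
    shift (i.e. the single-modification assumption does not hold).
    """
    pairs = list(zip(theoretical_b, observed_b))
    for i, (theo, obs) in enumerate(pairs):
        delta = obs - theo
        if abs(delta) > tol:
            # A modification at this residue shifts this ion and every
            # longer b ion by the same amount.
            consistent = all(
                abs((o - t) - delta) <= tol for t, o in pairs[i:]
            )
            return (i + 1, delta) if consistent else None
    return None
```

For example, with theoretical values `[98.06, 227.10, 324.16, 425.20]` and observed values `[98.06, 227.10, 404.13, 505.17]`, the first two ions agree and the last two are both shifted by about 79.97 Da, which is close to the mass added by a phosphorylation. This places the shift at the third residue. Real spectra lack the ion-by-ion matching assumed here, which is part of what makes the alignment problem hard.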
Keywords/Search Tags: Protein identification, de novo peptide sequencing, tandem mass spectrometry, database search, spectral alignment, peptide sequence tag, post-translational modification