
Peptide Fragment Ion Intensity Modeling Based On Gradient Boosting Decision Trees

Posted on: 2018-01-07
Degree: Master
Type: Thesis
Country: China
Candidate: H Huai
GTID: 2310330515453232
Subject: Software engineering
Abstract/Summary:
Protein identification is an important branch of proteomics. Its goal is to determine the types and quantities of proteins in an organism. Tandem mass spectrometry has become one of the key technologies for protein sequence identification. Biological laboratories generate large volumes of mass spectrometry data every day, far more than can be processed manually. At present, there are three approaches to protein identification from tandem mass spectrometry data: database search, de novo sequencing, and peptide sequence tag query. Database search is among the most commonly used, and its core is a peptide-spectrum matching algorithm. The goal of tandem mass spectrometry identification is to infer the amino acid sequence from the given spectrum and then deduce the protein; the key point is predicting the theoretical mass spectrum correctly. However, a qualitative understanding of the fragmentation mechanism is not enough to make correct predictions. Factors such as the cleavage site and the attributes of the peptide fragments at that site still need to be analyzed quantitatively in order to improve the accuracy of theoretical spectrum prediction, and thereby the accuracy of protein identification.

In this thesis, I summarize the characteristics of peptide fragment ions from the literature, convert them into numerical features that are easy to compute, and use the gradient boosting decision tree (GBDT) algorithm to build a fragment ion intensity prediction model. First, the tandem mass spectrometry data are identified with the protein identification engine pFind. Second, the identification results are filtered to retain highly reliable peptide sequences. Third, the m/z values and attribute values of the fragment ions are calculated, the ion intensities are obtained by matching m/z against the experimental spectra, and the intensities and attribute values are combined to build the experimental data set. Fourth, a prediction model is built with the GBDT algorithm using training data and validation data. Finally, the model is used to predict the theoretical fragment ion intensities of peptide sequences derived from proteins.

The predicted ion intensities were compared with the ion intensities of the experimental mass spectra using similarity and Pearson correlation coefficients. The results show that the model has high accuracy, and that the ion characteristics with the greatest influence on intensity can be summarized from the decision trees.
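
The m/z calculation, the intensity matching by m/z, and the Pearson correlation check described above can be illustrated with a minimal Python sketch. It is not the thesis implementation. It assumes singly charged b and y ions only, standard monoisotopic residue masses, and a fixed matching tolerance of 0.02 Da. The function names, the tolerance, and the toy peak list are hypothetical and chosen only for illustration.

```python
from math import sqrt

# Monoisotopic residue masses in Da (standard values for the 20 amino acids).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
PROTON = 1.007276   # mass of a proton, Da
WATER = 18.010565   # mass of H2O, Da


def fragment_mz(peptide):
    """Return m/z of singly charged b and y ions, keyed by ion name."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    ions = {}
    for i in range(1, len(peptide)):
        ions[f"b{i}"] = sum(masses[:i]) + PROTON           # N-terminal fragment
        ions[f"y{i}"] = sum(masses[-i:]) + WATER + PROTON  # C-terminal fragment
    return ions


def match_intensities(ions, peaks, tol=0.02):
    """Assign each theoretical ion the most intense peak within tol Da.

    peaks is a list of (m/z, intensity) pairs; unmatched ions get 0.0.
    """
    matched = {}
    for name, mz in ions.items():
        hits = [inten for peak_mz, inten in peaks if abs(peak_mz - mz) <= tol]
        matched[name] = max(hits, default=0.0)
    return matched


def pearson(xs, ys):
    """Pearson correlation coefficient; 0.0 if either input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y) if sd_x and sd_y else 0.0


if __name__ == "__main__":
    ions = fragment_mz("PEPTIDE")
    # Toy peak list, invented for illustration only (not experimental data).
    toy_peaks = [(98.06, 120.0), (148.06, 300.0), (263.09, 80.0)]
    observed = match_intensities(ions, toy_peaks)
    for name in sorted(ions):
        print(name, round(ions[name], 4), observed[name])
```

In a workflow like the one described, the matched intensities would serve as regression targets for the GBDT model, and `pearson` would compare the model's predicted intensities with the observed ones for each spectrum.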
Keywords/Search Tags: tandem mass spectrometry, protein identification, peptide fragment ion intensity, Gradient Boosting Decision Tree, theoretical mass spectrometry