
The Analysis Of MS Data Based On Bayesian Method

Posted on: 2013-01-25
Degree: Master
Type: Thesis
Country: China
Candidate: K P Yin
GTID: 2214330374469899
Subject: Biomedical engineering
Abstract/Summary:
Genomics, which developed alongside the Human Genome Project (HGP), plays an important role in exploring the elements of life. In the course of that work it became clear that understanding life at the genetic level alone is inadequate, and proteomics emerged as a result. Mass spectrometry is an effective tool for such studies.

This thesis introduces the main pipelines and technologies for protein analysis and discusses common algorithms for mass spectra, such as SEQUEST, MASCOT and X!Tandem. It also summarizes the differences between, and the merits of, two strategies for quantitative protein analysis: isotopic labeling and the label-free method. In addition, it describes the common methods for discovering and identifying post-translational modifications of proteins.

Because different protein identification algorithms each have their own strengths, we attempt to integrate the existing methods by combining machine learning with naive Bayes theory. The machine learning approaches considered include SVM, LDA, logistic regression, KNN, Bayesian networks and artificial neural networks. We use the parameters reported by SEQUEST as the classification features. The training data sets come from the mass spectra of a known protein mixture distributed among 18 teams. Through machine learning we obtain the decision boundary of the classifier, and with the discriminant function we calculate the conditional distributions of the positive and negative samples. From these conditional distributions and the feature scores, we compute the posterior probability of each identification by applying Bayes' rule with a uniform prior. In cross validation, the accuracy of our method is between 80% and 90% and the recall is between 40% and 50%, which demonstrates the practical value of the new method.

Identifying the post-translational modifications of proteins has always been a key problem in proteomics.
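The posterior calculation in the integration step above can be illustrated with a minimal sketch. It is not the thesis implementation: it assumes the classifier's discriminant score is one-dimensional, models each class-conditional distribution as a Gaussian, and uses invented toy scores. The names `fit_gaussian` and `posterior_correct` are hypothetical.

```python
import math
from statistics import mean, stdev


def gaussian_pdf(x, mu, sigma):
    """Density of a normal distribution N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def fit_gaussian(scores):
    """Estimate (mean, standard deviation) of one class-conditional distribution."""
    return mean(scores), stdev(scores)


def posterior_correct(score, pos_params, neg_params, prior_pos=0.5):
    """Bayes' rule: P(correct | score).

    prior_pos=0.5 corresponds to a uniform prior over the two classes.
    The Gaussian likelihood model is an assumption of this sketch.
    """
    like_pos = gaussian_pdf(score, *pos_params)
    like_neg = gaussian_pdf(score, *neg_params)
    numerator = like_pos * prior_pos
    evidence = numerator + like_neg * (1.0 - prior_pos)
    # Fall back to the prior if both likelihoods underflow to zero.
    return numerator / evidence if evidence > 0.0 else prior_pos


# Toy discriminant scores (invented): correct vs. incorrect identifications.
pos_scores = [1.8, 2.1, 2.4, 1.9, 2.6, 2.2]
neg_scores = [-1.5, -0.9, -2.0, -1.2, -0.4, -1.7]

pos_params = fit_gaussian(pos_scores)
neg_params = fit_gaussian(neg_scores)

for s in (2.0, 0.5, -1.5):
    print(s, posterior_correct(s, pos_params, neg_params))
```

With a uniform prior the posterior reduces to the ratio of the positive-class likelihood to the sum of both likelihoods, so an identification is accepted only when its score is much more plausible under the positive distribution than under the negative one.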
The traditional approach to identifying unknown proteins from mass spectra is to search a protein database using machine learning methods. However, this is time-consuming and raises the false positive rate. We instead classify the mass spectra by projection distance and then look for post-translational modifications, which both reduces the time complexity and improves the accuracy. The projection direction is calculated by LDA and SVM so that distances within a cluster become smaller and distances between clusters become larger. We obtain a distance matrix from the projection and group the spectra with clustering algorithms. Peptides in the same class may derive from the same protein carrying different post-translational modifications, so comparing the peptides within a class lets us find latent post-translational modifications more quickly and efficiently. In cross validation, the accuracy and recall both reach about 70%.

Since Google put forward the concept of cloud computing, many kinds of distributed cloud computing platforms have emerged. Because proteomics calculations are high-throughput and parallel in nature, they can be assigned to a cloud computing platform. We therefore propose two strategies for doing so and describe the advantages and drawbacks of each.
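The projection-and-clustering idea described above can be sketched as follows. This is an illustration under stated assumptions, not the method from the thesis: it uses only Fisher's two-class LDA (the thesis also uses SVM), treats each spectrum as an invented two-dimensional feature vector, and groups spectra by a simple distance threshold (single linkage). All function names and the threshold value are hypothetical.

```python
def mean_vec(rows):
    """Component-wise mean of a list of 2-D points."""
    n = len(rows)
    return [sum(r[k] for r in rows) / n for k in range(2)]


def scatter(rows, m):
    """2x2 scatter matrix of the points around their mean m."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for r in rows:
        d = [r[0] - m[0], r[1] - m[1]]
        for i in range(2):
            for j in range(2):
                s[i][j] += d[i] * d[j]
    return s


def fisher_direction(class_a, class_b):
    """Unit vector w proportional to Sw^-1 (mean_b - mean_a) (Fisher's LDA)."""
    ma, mb = mean_vec(class_a), mean_vec(class_b)
    sa, sb = scatter(class_a, ma), scatter(class_b, mb)
    sw = [[sa[i][j] + sb[i][j] for j in range(2)] for i in range(2)]
    det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]
    diff = [mb[0] - ma[0], mb[1] - ma[1]]
    # Explicit inverse of the 2x2 within-class scatter matrix, applied to diff.
    w = [(sw[1][1] * diff[0] - sw[0][1] * diff[1]) / det,
         (-sw[1][0] * diff[0] + sw[0][0] * diff[1]) / det]
    norm = (w[0] ** 2 + w[1] ** 2) ** 0.5
    return [w[0] / norm, w[1] / norm]


def distance_matrix(points, w):
    """Pairwise distances between the 1-D projections of the points onto w."""
    proj = [p[0] * w[0] + p[1] * w[1] for p in points]
    return [[abs(a - b) for b in proj] for a in proj]


def cluster(dist, threshold):
    """Single-linkage grouping: link points whose distance is <= threshold."""
    labels = [-1] * len(dist)
    current = 0
    for start in range(len(dist)):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(len(dist)):
                if labels[j] == -1 and dist[i][j] <= threshold:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels


# Toy labeled feature vectors (invented) used to learn the projection.
class_a = [(1.0, 2.0), (1.2, 1.8), (0.8, 2.3), (1.1, 2.1)]
class_b = [(3.0, 0.9), (3.2, 1.1), (2.9, 0.7), (3.1, 1.2)]

w = fisher_direction(class_a, class_b)
dist = distance_matrix(class_a + class_b, w)
labels = cluster(dist, threshold=0.5)  # threshold is an arbitrary choice here
print(labels)
```

Projecting onto a single direction before computing distances is what keeps the clustering cheap: the distance matrix is built from scalar projections rather than from full spectra, and spectra that land in the same group become candidates for peptide-by-peptide comparison.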
Keywords: Proteomics, Mass Spectrogram, Bayesian and Machine Learning