
Predicting Protein-Protein Interactions Based on Machine Learning Algorithms; Using Logistic Regression Model to Improve Accuracy of Peptide Identification in Mass Spectrometry Analysis

Posted on: 2009-03-06
Degree: Doctor
Type: Dissertation
Country: China
Candidate: C Shao
GTID: 1100360275975425
Subject: Pathology and pathophysiology
Abstract/Summary:
Proteomics has become a hot subject in the post-genomic era. In recent years, high-throughput technologies such as biological mass spectrometry and protein chips have greatly promoted the development of proteomics. This dissertation aims to further improve the accuracy and efficiency of current experimental technologies by adopting bioinformatics methods, in order to reduce the cost of biological experiments and to obtain more comprehensive and accurate data.

Protein-protein interactions play an essential role in life processes. In past years, large numbers of interactions were found by various high-throughput biological experiments. However, many interactions remain unknown. Unfortunately, experimental screening for protein binding partners is not only labor intensive but also almost futile for low-abundance binding species, because they are suppressed by high-abundance ones. A more plausible way of studying protein-protein interactions is to use high-throughput computational prediction rather than experimental approaches to screen protein sequence databases for interactions, thereby directing the validating experiments towards the most promising peptides. Compared to traditional experimental assays, computational prediction offers a higher-throughput strategy for identifying interactions on a proteomic scale. It also offers a satisfactory solution to the abundance suppression problem.

A fairly large set of protein-protein interactions is mediated by families of peptide-binding domains known as peptide recognition modules (PRMs). The first chapter of this dissertation predicts protein-protein interactions by studying the binding selectivity of PRMs and their ligand peptides. Taking the PDZ domain family as an example, an integrated prediction system was set up to predict ligand peptides for PRMs based on both structural and sequence information. In this system, the amino acid residues on the interface of the interacting domain-ligand pairs were extracted and used in place of the full-length sequences. Next, three novel coding methods were devised to represent different aspects of the interactions between amino acid residue pairs. Support vector machines and artificial neural networks were employed as the machine learning algorithms, and three independent predictors were built to process the encoded data. The results of these three predictors were combined to make the final prediction. Evaluated by cross-validation, the combined system had a specificity of 0.99 and a sensitivity of 0.60. However, since the number of known ligands of a PRM is usually only a few dozen or a few hundred, which is far smaller than the size of a protein database (usually over ten thousand entries), cross-validation performance cannot represent the real performance when a whole protein database is screened. In this work, the trained system was used to screen the Swiss-Prot protein database for potential ligands of 3 PDZ domains. A large fraction of the predictions have already been experimentally confirmed by peptide SPOT array assays, indicating that the prediction system generalizes satisfactorily.

Tandem mass spectrometry (MS/MS) has been widely used in proteomics studies. In this approach, a protein mixture is first digested into a peptide mixture by enzymes, then ionized and fragmented to produce large numbers of MS/MS spectra. Database searching is a common method for processing MS/MS data: experimental spectra are compared with theoretical spectra predicted from the peptides in a target protein database, and the best matches are found according to a scoring method. Because of the complexity of mass spectrometry experiments and of the samples tested, MS/MS spectra contain a high level of noise, which makes processing MS/MS data difficult.
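The database-search idea described above can be illustrated with a minimal sketch. It is not the scoring function of SEQUEST, AMASS, or any other tool named in this abstract: the function names, the 0.5 Da tolerance, the candidate peptides, and the simple "fraction of matched ions" score are all assumptions chosen for illustration. Only singly charged b and y ions are considered.

```python
# Monoisotopic residue masses (Da) of the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
PROTON = 1.007276   # mass of a proton (Da)
WATER = 18.010565   # mass of H2O (Da)


def theoretical_ions(peptide):
    """Return (b_ions, y_ions): singly charged fragment m/z values."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    b_ions, y_ions = [], []
    for i in range(1, len(masses)):
        b_ions.append(sum(masses[:i]) + PROTON)          # N-terminal fragment
        y_ions.append(sum(masses[i:]) + WATER + PROTON)  # C-terminal fragment
    return b_ions, y_ions


def matched_fraction(peaks, peptide, tol=0.5):
    """Toy score: fraction of theoretical ions with a peak within tol Da."""
    b_ions, y_ions = theoretical_ions(peptide)
    theoretical = b_ions + y_ions
    hits = sum(1 for mz in theoretical
               if any(abs(mz - p) <= tol for p in peaks))
    return hits / len(theoretical)


def best_match(peaks, candidates, tol=0.5):
    """Return the candidate peptide with the highest toy score."""
    return max(candidates, key=lambda pep: matched_fraction(peaks, pep, tol))


# Hypothetical example: a noise-free spectrum of PEPTIDE plus two noise peaks.
b, y = theoretical_ions("PEPTIDE")
observed = b + y + [210.0, 333.3]
candidates = ["PEPTIDE", "PEPTIDK", "GASPVTC"]
print(best_match(observed, candidates))
```

Real search engines also use peak intensities, multiple charge states, and precursor mass filtering; the noise mentioned above is what makes such simple matching unreliable on its own.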
Currently, various algorithms have been developed to improve peptide identification from MS/MS spectra. However, correct and incorrect matches between experimental spectra and database peptides still cannot be distinguished very well. To guarantee confident peptide identification, strict scoring criteria have to be used, and the sensitivity of proteomics research has to be sacrificed.

In the second chapter of this dissertation, a new measure, Oscore, was developed by logistic regression on a training dataset produced from a mixture of 18 known proteins. Oscore directly estimates the probability of a correct peptide assignment for each MS/MS spectrum. The variables in this regression model were: the SEQUEST variables Xcorr, ΔCn, and Sp; the output variables MatchPct, Cont, and Rscore of the in-house software AMASS (Sun et al., Mol Cell Proteomics. 2004 Dec;3(12):1194-9); the peptide charge state; and the number of internal missed cleavage sites in the peptide (NIMCS). The AMASS variables supplement the SEQUEST variables by considering fragment ion intensity and b/y ion continuity. Because of the complicated associations among the AMASS and SEQUEST variables, combining them in one model, rather than applying each to a threshold model, improved the classification of correct and incorrect peptide identifications. Oscore achieved both a lower false negative rate and a lower false positive rate than PeptideProphet on datasets generated from the mixture of 18 known proteins and from several proteome-scale samples differing in complexity, database size, and separation method. The main contributor to Oscore's improvement is discussed through a three-way comparison among Oscore, PeptideProphet, and another logistic regression model that used only the PeptideProphet variables. At present, Oscore is restricted to identifying fully tryptic peptides. Extending Oscore to non-tryptic and partially tryptic peptides is left for future work.
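As a rough illustration of how a logistic regression turns search scores into a probability of correct assignment, the sketch below fits a two-variable model on synthetic data. It is not the Oscore model: the data are randomly generated, only two of the variables named above (Xcorr and ΔCn) are used, the score distributions are invented, and the fitting routine is plain gradient descent chosen here for self-containment. All function names are hypothetical.

```python
import math
import random
import statistics


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(X, y, lr=0.5, epochs=300):
    """Fit weights and bias by batch gradient descent on the log-loss."""
    n, k = len(X), len(X[0])
    w, b = [0.0] * k, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * k, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(k):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


# Synthetic training set (assumption): correct matches tend to have
# higher Xcorr and delta-Cn than incorrect ones.
random.seed(0)
raw, labels = [], []
for _ in range(200):
    raw.append([random.gauss(3.5, 0.6), max(0.0, random.gauss(0.30, 0.08))])
    labels.append(1)  # correct assignment
    raw.append([random.gauss(1.8, 0.6), max(0.0, random.gauss(0.08, 0.05))])
    labels.append(0)  # incorrect assignment

# Standardize each feature so gradient descent converges quickly.
means = [statistics.mean(col) for col in zip(*raw)]
stds = [statistics.pstdev(col) for col in zip(*raw)]
X = [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in raw]
w, b = fit_logistic(X, labels)


def probability_correct(xcorr, delta_cn):
    """Estimated probability that a match with these scores is correct."""
    x = [(v - m) / s for v, m, s in zip([xcorr, delta_cn], means, stds)]
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


p_high = probability_correct(4.0, 0.35)  # strong-looking match
p_low = probability_correct(1.5, 0.05)   # weak-looking match
print(p_high, p_low)
```

The point of combining the variables in one model, as the abstract argues, is that each match receives a single probability instead of having to pass a separate threshold on every score.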
Keywords/Search Tags:proteomics, protein-protein interactions, bioinformatics, tandem mass spectrometry, machine learning algorithms