
Studies On The Simulated Generation Of Proteomic Data By Mass Spectrometry

Posted on: 2016-08-19
Degree: Doctor
Type: Dissertation
Country: China
Candidate: H Liu
Full Text: PDF
GTID: 1220330509461056
Subject: Control Science and Engineering
Abstract/Summary:
Mass spectrometry (MS) is the most comprehensive and versatile tool in large-scale proteomics, and MS data are the primary source for proteome information mining. Many tools for MS data processing and analysis now exist, and they have contributed significantly to MS-based proteomics. To create efficient and robust methods and tools, developers need benchmark data against which to compare and validate their approaches. Such standard data are difficult to obtain: carefully compiled databases of annotated test data are scarce in MS-based proteomics, and only a few data sets are available. In addition, to test the robustness of algorithms and tools, the data sets need to include different levels of noise. This dissertation therefore pursues these objectives by simulating MS data. The aim of the simulation is not to create a detailed physical model of mass spectra but to produce simulated data that are reasonably close to real MS data. The simulation takes protein sequences as input, and its content includes the peptides cleaved by digestion; the retention times, isotopic distributions, charge states, and chromatographic elution curves of peptides; the mass-to-charge ratios of peptide ion signals and of fragment ions; background noise; and peptide detectability. The main goal of this dissertation is to explore the realization of a prototype simulation system. We focused mainly on the following issues:

(1) The calculation of protein cleavage probability based on a Markov chain. We present a model for calculating the protein cleavage probability. The cleavage probability is calculated from information on the amino acid residues in close proximity to the cleavage site.
To test the capability of our model, we used two data sets from different laboratories. The results show that the model has good predictive performance and robustness.

(2) The prediction of retention times and the simulation of peak shapes for the chromatographic separation of peptides. Liquid chromatography is one of the most important means of separating biological macromolecules, and the simulation of the peptide chromatographic process consists of two parts: the prediction of retention time and the prediction of peak shape. The retention time model relies on the summation of the retention coefficients of the individual amino acids, with additional terms that depend on the retention coefficients of the amino acids at the N-terminus of the peptide and on the peptide length. The peak shape model is described by the exponentially modified Gaussian (EMG) function, which introduces an asymmetry factor so that it can describe both symmetrical and asymmetrical chromatographic peaks. In the model tests, the correlation coefficient between predicted and experimental retention times was 0.94, and the correlation coefficient between simulated and experimental chromatographic peak shapes was 0.98. These results indicate that the simulated chromatographic process is close to experimental reality.

(3) The prediction of peptide charge states in the electrospray ionization (ESI) process. We propose a method that predicts the charge state of a peptide from its amino acid composition using a combination of linear regression and a multi-normal distribution function. To test the performance of the model, we used two data sets from different laboratories and applied 5-fold cross-validation. The results show that the model performs well, with a predictive accuracy of 96.89%.
The model also meets application requirements across different data sets, with a predictive accuracy of more than 88%.

(4) The prediction of peptide detectability based on logistic regression. We summarize the factors that affect peptide detectability and propose a prediction model based on logistic regression, selecting six peptide properties as model parameters. To test the model performance, we used two data sets from different sources and nested cross-validation. The results indicate that the average area under the ROC curve is 0.9466 and the prediction accuracy is 0.87. In addition, we compared our model with other methods reported in the literature, and its performance is equal to or better than theirs. The proposed model performs well enough to predict the probability that a peptide will be detected in an experiment.

(5) The generation of simulated proteomic MS experimental data. The simulated MS data mainly cover the mixed peptide list from trypsin digestion, predicted retention times, simulated chromatographic peak shapes, isotopic peak distributions of peptides, charge states, m/z values of peptide ions, m/z values of fragment ions with their corresponding intensities, background noise, and calculated peptide detectability. We chose four data sets from different sources and analyzed the similarity between simulated and experimental data with respect to digested peptides, peptide detectability, charge states, isotopic peak distributions, MS/MS spectra, and noise distribution. The simulated data reflect the characteristics of the experimental data. We also tested existing database search tools with our simulated data; the results show that simulated data, with or without different levels of noise, can be used to test the performance of different tools and algorithms.
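
A few of the simulation stages listed in (5) can be illustrated with a minimal Python sketch. It is not the dissertation's implementation, and it rests on several assumptions: digestion uses the simple textbook trypsin rule (cleave after K or R unless followed by P) in place of the Markov-chain cleavage probability model, peptide masses are monoisotopic sums of standard residue masses, and the elution curve uses the common area-normalized EMG parameterization with hypothetical parameters `mu`, `sigma`, and `tau`. The function names are illustrative only.

```python
import math

# Monoisotopic residue masses in Da for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565   # mass of H2O added for the peptide termini
PROTON = 1.007276   # mass of the charge carrier in positive-mode ESI


def digest_trypsin(sequence):
    """Split a protein after K or R, except before P (simplified rule,
    assumed here; no missed cleavages and no cleavage probabilities)."""
    peptides, start = [], 0
    for i, residue in enumerate(sequence):
        is_last = i == len(sequence) - 1
        if is_last or (residue in "KR" and sequence[i + 1] != "P"):
            peptides.append(sequence[start:i + 1])
            start = i + 1
    return peptides


def peptide_mass(peptide):
    """Monoisotopic mass of the neutral peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def mz(peptide, charge):
    """Mass-to-charge ratio of the peptide ion with `charge` protons."""
    return (peptide_mass(peptide) + charge * PROTON) / charge


def emg(t, mu, sigma, tau):
    """Area-normalized exponentially modified Gaussian at time t.
    mu and sigma describe the Gaussian part; tau > 0 is the time constant
    of the exponential tail (a larger tau gives a more asymmetric peak)."""
    lam = 1.0 / tau
    exponent = 0.5 * lam * (2.0 * mu + lam * sigma ** 2 - 2.0 * t)
    tail = math.erfc((mu + lam * sigma ** 2 - t) / (math.sqrt(2.0) * sigma))
    return 0.5 * lam * math.exp(exponent) * tail


if __name__ == "__main__":
    protein = "MKWVTFISLLRPAK"  # arbitrary example sequence
    for pep in digest_trypsin(protein):
        ions = ", ".join(f"{z}+: {mz(pep, z):.4f}" for z in (1, 2, 3))
        print(pep, ions)
    # Sample a hypothetical elution profile centred near t = 20 min.
    profile = [emg(t, mu=20.0, sigma=1.0, tau=2.0) for t in range(15, 31)]
    print([round(v, 4) for v in profile])
```

In a fuller simulator, each of these stages would be replaced by the corresponding predictive model described above, for example by sampling cleavage events from predicted probabilities and placing `mu` at the predicted retention time.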
Keywords/Search Tags: Bioinformatics, Prediction, Mass spectrometry, Proteome, Simulated data, Protein cleavage, Charge state, Peptide detectability