
Research On Prediction Of Protein Domains And Prediction Of Saliva Secreted Protein And Applications

Posted on: 2014-08-23
Degree: Doctor
Type: Dissertation
Country: China
Candidate: J X Wang
GTID: 1260330425465897
Subject: Computer Science and Technology
Abstract/Summary:
In the post-genomic era, with advances in technology and the gradual progress of human exploration in the life sciences, omics data of all kinds grew exponentially, and the genomes and proteomes of more and more species were sequenced. Traditional experimental methods could no longer meet the demand of processing and analyzing this explosively growing omics data. Researchers therefore applied computer science to the processing of life science data. In the course of this research, an advanced interdisciplinary field, bioinformatics, emerged; it mainly aims to apply computational methods to storing, processing, analyzing and interpreting biological data.

Among the various omics studies, proteomics, as an important life science discipline, played an important role in revealing the mysteries of life. After Nature and Science published the draft human genome in February 2001, the status of proteomics was raised to an unprecedented height. It could be said that the life activities of almost any organism were closely related to proteins. To understand the mystery of life, we must study proteins, and first of all protein structure. In recent years the number of sequenced proteins grew by 10 million annually, but their structures remained unknown. Traditional experimental methods for determining protein structure were not only expensive but also required much time and effort; they could not meet the requirements of the post-genomic era, and effective computational methods were therefore needed to explore protein structure. In the study of protein structure, the most important step was to study protein domains. Using computational methods to predict protein domains was challenging. Support Vector Machine (SVM) is a machine learning method that has been widely used in bioinformatics in recent years.
As a supervised learning method, SVM was widely used in statistical classification and pattern recognition; in bioinformatics, it gave particularly good results on two-class problems and small-sample problems. In this paper, we proposed an ab initio machine learning method to predict protein domains. By analyzing protein characteristics, we summarized the long-range correlations in protein structure. We used protein secondary structure, solvent accessibility, the position-specific scoring matrix (PSSM) and carbon atom coordinates as features, combined them with a support vector machine to predict protein domains, and compared the results with existing classification tools. The proposed method improved the prediction of protein domains.

Saliva proteomics, as a branch of proteomics, made considerable progress in the past decade. Saliva proteome research could be applied to the diagnosis and prediction of human systemic diseases, and it was of positive significance for predicting major human diseases, especially for the early detection of cancer. More than 20,000 human proteins have been found, and only a small fraction of them can be secreted into saliva. In the past, saliva proteome data were obtained by mass spectrometry, liquid chromatography (LC), two-dimensional gel electrophoresis, matrix-assisted laser desorption ionization, and combinations of these experimental methods. This paper applied a machine learning-based model to predict proteins secreted by the salivary glands and proteins transferred from blood into saliva. This method provides experimental scientists not only with a novel analysis and comparison tool for checking experimental results, but also with an important basis and computational method for finding disease biomarkers in human saliva. The main procedures of the model were as follows:
1. Data collection.
In the Sys-BodyFluid database, we found a total of 2161 proteins labeled as saliva-secreted proteins; we then took the intersection of these 2161 proteins with the proteins labeled as saliva proteins in the UniProt and SPD databases, which gave 308 proteins that were used as the training set.
2. Feature selection. We collated and summarized 34 protein features in four categories (general sequence features, physicochemical properties, domains/motifs, structural properties). With these 34 features, each protein can be expressed as a 1523-dimensional feature vector. We first used the t-test to keep the vector elements with p-value <= 0.05, which removed irrelevant elements and left 632. We then used the SVM-RFE method to rank the remaining features, and finally obtained 83 effective features.
3. Classification model based on Support Vector Machine. The 83 features were used to train classifiers and obtain the classification model.
4. Model testing. We used an independent data set to verify the model. Through literature search and data collection, we found 102 saliva-secreted proteins that were not in the training set. Sensitivity, specificity, precision, accuracy, MCC and AUC were used to assess the model; the results were 81.37%, 96.02%, 74.11%, 94.22%, 0.74 and 0.90 with the RBF kernel, and 64.71%, 95.19%, 65.35%, 93.54%, 0.60 and 0.87 with the linear kernel.

In the course of the saliva proteomics research, this paper also examined which proteins could be transferred from the blood circulation to the salivary glands. Certain sequence and physicochemical characteristics allow some proteins to enter the blood circulation, then move into the salivary glands by active transport, passive diffusion or ultrafiltration, and then be secreted. By measuring the proteins in saliva, we could detect protein biomarkers of disease from distant organs.
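
The threshold-based indicators named in the model testing step (sensitivity, specificity, precision, accuracy and MCC) all follow from the four counts of a binary confusion matrix. The sketch below shows one way to compute them; the function name `binary_metrics` and the example counts are hypothetical and are not taken from the dissertation. AUC is left out because it needs the classifier's continuous scores rather than counts.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Compute standard indicators from confusion-matrix counts.

    tp/fn: saliva-secreted proteins predicted correctly / missed.
    tn/fp: other proteins predicted correctly / wrongly flagged.
    """
    def ratio(num, den):
        # Return 0.0 instead of raising when a denominator is empty.
        return num / den if den else 0.0

    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "sensitivity": ratio(tp, tp + fn),  # recall on the positive class
        "specificity": ratio(tn, tn + fp),
        "precision": ratio(tp, tp + fp),
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "mcc": ratio(tp * tn - fp * fn, mcc_den),
    }


# Hypothetical counts, for illustration only (not the reported results).
example = binary_metrics(tp=8, fp=5, tn=85, fn=2)
for name, value in example.items():
    print(f"{name}: {value:.4f}")
```

Note that MCC ranges from -1 to 1 and AUC from 0 to 1, so both are reported as plain numbers rather than percentages.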
Our model could accurately detect and predict proteins that moved from the blood circulation into saliva. First, a permutation test and the SVM-RFE feature selection method were used to select 55 effective protein features; SVM classifiers trained on these 55 features achieved a recall of 88.56% and an average precision of 90.76%, which verified the success of the feature selection. Then, we used these 55 features to train a ranking algorithm model; combined with differential protein expression data from the disease group and the healthy group, this model could effectively predict potential blood biomarkers of certain human diseases. The model was used to rank all 20,209 human proteins. In this paper we proposed 31 candidate protein biomarkers for breast cancer; these are breast cancer biomarkers that could be transferred from blood to saliva.
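
As an illustration of the permutation-test filtering mentioned above, the following minimal sketch scores a single feature by repeatedly shuffling class labels. The abstract does not state which test statistic was used, so the absolute difference of class means is an assumption here, and the function name `permutation_p_value` and the toy values are hypothetical.

```python
import random
from statistics import mean


def permutation_p_value(pos, neg, n_perm=2000, seed=0):
    """Two-sided permutation test on the difference of class means.

    pos/neg: values of one feature for the positive and negative class.
    Returns the estimated probability of a difference at least as large
    as the observed one when class labels are assigned at random.
    """
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    observed = abs(mean(pos) - mean(neg))
    pooled = list(pos) + list(neg)
    n_pos = len(pos)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # random relabelling of the pooled values
        diff = abs(mean(pooled[:n_pos]) - mean(pooled[n_pos:]))
        if diff >= observed - 1e-12:  # tolerance for float rounding
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)


# Toy feature values, for illustration only.
separating = permutation_p_value([5.1, 4.9, 5.3, 5.0, 5.2, 4.8],
                                 [1.0, 1.2, 0.9, 1.1, 1.3, 0.8])
uninformative = permutation_p_value([1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
                                    [1.5, 2.5, 3.5, 4.5, 5.5, 0.5])
print(separating, uninformative)
```

A feature would be kept when its p-value falls below a chosen threshold; the features that survive could then be passed to a ranking step such as SVM-RFE.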
Keywords/Search Tags: Support vector machine, Protein domain boundary prediction, Salivary protein, Disease diagnosis, Biomarker