
Research On Relevant Problems Of Molecular Biology System Modeling And Protein Function Prediction

Posted on: 2017-05-21
Degree: Doctor
Type: Dissertation
Country: China
Candidate: R T Yan
Full Text: PDF
GTID: 1310330512489944
Subject: Control theory and control engineering
Abstract/Summary:
Life science is the branch of science that investigates the laws of life activities and development, the essence of life, the interrelationships among living organisms, and the relationships between living organisms and their environment. Since the beginning of the 21st century, life science has developed rapidly and achieved several major breakthroughs. It has broad application prospects: it not only helps to reveal the basic laws of life activities, but also provides an important theoretical basis for the diagnosis and treatment of diseases. The establishment of the DNA double helix model made molecular biology an important branch of life science and opened a new era in which biological phenomena are investigated at the molecular level. Gene expression, the transfer of genetic information from DNA to protein, is the theoretical foundation of molecular biology. The investigation of the biological mechanism of gene expression laid a theoretical foundation for the birth of DNA computing. Molecular biology has developed rapidly since the 20th century. However, biochemical experimental methods are relatively costly. It is therefore urgent to establish effective molecular biology system models for analyzing and forecasting biological problems, thereby revealing the mysteries of life. Proteins are the physical basis and the final executors of life activities, and directly reflect the life phenomena and physiological functions of the body. Protein function prediction directly helps to reveal the essence of life phenomena at the molecular, cellular, and organism levels, opening an entirely new way to explore the mechanisms of disease and to develop drugs. In addition, protein function prediction gives a strong push to research in many fields, such as food, agriculture, and environmental monitoring. Currently, determining protein functions by experimental methods is still inefficient and costly. Since the 1980s, the sustained progress of genome
sequencing projects has produced vast numbers of protein sequences, and their growth rate keeps accelerating. The functional knowledge of proteins determined by experimental methods lags far behind the growth of newly discovered protein sequences. Relying solely on experimental studies cannot meet the need for protein function annotation across whole genomes. How to narrow the gap between the number of protein sequences and the number of fully characterized proteins has become an important research topic in molecular biology. To assist experimental techniques, it is imperative to adopt advanced and efficient computational methods to establish protein function prediction models and online prediction platforms. Although research on molecular biology system modeling and protein function prediction has made great progress in the past ten years, much remains to be explored in these areas. Based on fundamental mathematical theory and machine learning theory, this dissertation investigates related problems of molecular biology system modeling and protein function prediction. The detailed contents of the research are summarized as follows.

(1) At present, the biological mechanism, characteristics, and significance of the genetic code have been studied in depth. However, for lack of a precise mathematical model of the genetic code, it is difficult to further investigate the relationships between codons and other living organisms or biological processes. Exploiting the ability of group models to depict symmetrical and complementary characteristics, a mathematical model of the genetic code is constructed in the complex plane, together with correspondences between the genetic code, amino acids, and group elements. Based on this model, we obtain some valuable propositions, in particular defining a function that describes the relationship between codons encoding the same amino acid. This model will provide a reference for
quantitative analysis and understanding of gene expression processes. It also helps to analyze the impact of mutations on protein synthesis, thereby revealing the operational mechanisms of complex biological systems.

(2) DNA computing has broad application prospects. To reduce the cost of experiments, it is necessary to simulate DNA computing algorithms on a computer before carrying out the corresponding DNA computing experiments. Mathematical models have the potential to depict biological characteristics, describe biological processes, and accurately calculate the dynamic evolution of biological systems. Based on the mathematical theory of Yuanjian, the dissertation models the experimental processes by which DNA computing solves the Hamiltonian Path Problem (HPP). An encoding rule is then given from a mathematical point of view, and a generalized Yuanjian model for solving the multi-node HPP is derived. The simulations verify the effectiveness of the proposed model, which can be applied as a new bionic algorithm to solve the HPP for digraphs of any size. Furthermore, the presented Yuanjian model helps to combine DNA computing with computer simulation, providing a model basis for a DNA computing workflow that first conducts simulations and then conducts experiments.

(3) The diversity of ECM (Extracellular Matrix) proteins underlies the regulatory roles of the ECM in numerous biological events such as tissue morphogenesis, differentiation, and homeostasis. AFPs (Antifreeze Proteins) can adsorb onto the surface of ice crystals and inhibit their growth, which is a precondition for overwintering organisms to survive in cold environments. Protein class prediction is an important research branch of protein function prediction. Prediction of ECM proteins will contribute to understanding ECM-protein-based biological processes and to developing drugs. Prediction of AFPs may provide important clues for deciphering the underlying mechanisms of AFPs in ice binding. The existing
ECM protein and AFP prediction systems are based on a single-classifier prediction algorithm, which limits prediction performance to a certain extent. In this dissertation, ECM protein and AFP prediction systems are constructed based on ensemble learning. Experimental results show that these prediction systems perform far better than the existing methods.

(4) The main function of the GA (Golgi Apparatus) is to store, package, and distribute proteins. GA proteins are usually classified into cis-Golgi proteins and trans-Golgi proteins, which lead proteins into and out of the GA. Malfunctions in GA proteins can lead to malnutrition, diabetes, cancer, and other genetic diseases. In view of the shortcomings of existing methods, a computational method is developed to distinguish cis-Golgi proteins from trans-Golgi proteins. Accurate prediction of GA protein types will help to elucidate the functions of the GA in various cellular processes and provide useful clues for understanding the mechanisms of diseases. The prediction model is built on the concept of CSP (Common Spatial Patterns). Experimental results show that the prediction performance of the CSP-based feature extraction method is comparable to that of traditional feature extraction methods, while it uses only 1/20 as many features, which greatly reduces the computational complexity. Considering both prediction performance and feature dimension, CSP is an effective feature extraction method. The imbalanced dataset problem is addressed with SMOTE (Synthetic Minority Over-sampling Technique), and recursive feature elimination is adopted to exclude redundant features and improve prediction performance. Comparison with existing methods confirms the strong predictive power of the proposed method.

(5) The binding sites on protein surfaces usually interact with other
biological molecules, which is very important for the realization of protein functions. Another important research direction in protein function prediction is therefore to distinguish binding sites from other surface areas. Taking FIRs (Flavin Adenine Dinucleotide Interacting Residues) as the research object, a new protein binding site prediction model is constructed using a variety of feature extraction strategies. Given the interdependence of adjacent residues, the PSSM (Position-Specific Scoring Matrix) is smoothed when extracting evolutionary information. To further understand the mechanisms of FIR formation, we conduct a quantitative analysis of the various types of features. The results indicate that the extracted features distinguish FIRs from non-FIRs well. To reduce the computational complexity and improve the accuracy of the prediction model, a feature selection technique is employed to select the optimal feature set. Analysis of the optimal features reveals, to some extent, the mechanisms of FAD-protein interactions. Cross-validation results on the training set show that the proposed method is significantly better than existing methods.

(6) PSSM is widely used to extract evolutionary information from protein sequences. Many web servers now exist for extracting protein sequence information. However, no web server has been developed for extracting evolutionary features via PSSM, which greatly limits the practical value of these features. First, the PSSM-derived protein representation methods are divided into 3 categories. Then, based on these feature extraction methods, we develop a web server called PSSM-PROREP to extract evolutionary features from protein sequences. Finally, a step-by-step guide is provided on how to use the web server. PSSM-PROREP makes these features readily accessible to expert and non-expert users alike through its highly flexible, configurable, and user-friendly design. Users can easily assess the predictive power of these features and select relevant ones to develop robust
classifiers. It is anticipated that PSSM-PROREP may become a very useful tool for predicting various protein functions.
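
The PSSM smoothing mentioned in (5) can be illustrated with a short sketch. It is not the dissertation's implementation: the function name `smooth_pssm`, the default window size, the use of summation (rather than averaging) inside the window, and the truncation of the window at the sequence ends are all assumptions, since the abstract does not specify them. The idea is only that each residue's row of scores is combined with the rows of its sequence neighbours, so that the feature for one residue reflects its local context.

```python
from typing import List

Matrix = List[List[float]]


def smooth_pssm(pssm: Matrix, window: int = 3) -> Matrix:
    """Return a smoothed copy of a PSSM (hypothetical helper).

    Each row (one residue, one score per amino-acid column) is replaced by
    the column-wise sum of the rows inside a window centred on that residue.
    Near the sequence ends the window is simply truncated, which is
    equivalent to padding the matrix with zero rows.
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    half = window // 2
    n_rows = len(pssm)
    smoothed: Matrix = []
    for i in range(n_rows):
        lo = max(0, i - half)            # first neighbour row inside the window
        hi = min(n_rows, i + half + 1)   # one past the last neighbour row
        # zip(*rows) iterates column by column over the selected rows
        smoothed.append([sum(col) for col in zip(*pssm[lo:hi])])
    return smoothed


# Toy 4-residue matrix with 2 columns (a real PSSM has 20 columns).
toy = [[1.0, 0.0], [2.0, -1.0], [0.0, 3.0], [-2.0, 1.0]]
result = smooth_pssm(toy, window=3)
```

With `window=1` the matrix is returned unchanged; larger odd windows mix in more neighbouring residues, and the window size would normally be tuned by cross-validation.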
Keywords/Search Tags:Molecular biology system modeling, Protein function prediction, Genetic code, DNA computing, Machine learning, Web server