
The Study Of Protein Amino Acid Residues' Solvent Accessibility Prediction And Gene Expression Profile Analysis

Posted on: 2008-08-14
Degree: Doctor
Type: Dissertation
Country: China
Candidate: X Wang
Full Text: PDF
GTID: 1100360212998578
Subject: Biomedical engineering
Abstract/Summary:
Computers and the World Wide Web are rapidly and dramatically changing the face of biological research. Bioinformatics is a new interdisciplinary research area that marries information technology with biology: it applies information technology to the storage, management and analysis of biological data. Bioinformatics is at the cutting edge of the life and natural sciences today and will be one of the most important research areas of the 21st century. Systems biology, like bioinformatics, is a new interdisciplinary research area that has emerged with the development of information technology and biological research. It is closely related to the Human Genome Project and builds on progress in genomics and proteomics. Systems biology studies all the components inside an organism, including genes, mRNA and proteins, as well as the relationships between these components under specific conditions. Traditional biological techniques cannot meet the needs of systems biology research, whereas progress in mathematics, physics and informatics provides powerful technical support for it. The emergence of large-scale computers has made computation on large-scale data a reality. Systems biology will become a core driving force in medicine and biology in the 21st century. Advances in bioinformatics will be a force for change in the current life sciences: not only basic research but also agriculture, medicine and public health, the food industry and other fields will benefit.
An urgent task for bioinformatics and systems biology researchers is to develop efficient machine learning methods for predicting from and analysing the mountains of data deposited in public databases. Compared with traditional bench experiments, the advantages of machine learning approaches are apparent: they are fast, automatic, and economical in time and labor, especially for high-throughput, large-scale biological data analysis. The original research in this dissertation can be summarized as follows: (1) Predicting a protein's 3D structure from its amino acid sequence is one of the most difficult problems in bioinformatics. Predicting the solvent accessibility of amino acid residues, as a supplementary approach to this problem, has attracted researchers' attention. The relative solvent accessibility (RSA) of a residue describes the degree to which the residue is exposed to solvent in the protein's 3D structure, and can be regarded as a characteristic indicator of protein tertiary structure and functional sites. Residues in protein sequences can be divided into two classes (exposed/buried) or three classes (exposed/intermediate/buried) according to their RSA. Several window lengths and parameters were explored to achieve the best performance. The prediction accuracies of the support vector machine (SVM) at different cut-off thresholds are analyzed and compared with those of other methods, showing that the SVM outperforms neural network and information theory methods on the same dataset. The best accuracy reaches 79.0% for the two-class problem and 67.5% for the three-class problem. These results show that the SVM is an effective method for predicting protein solvent accessibility. (2) DNA microarray technology is a recently developed high-throughput experimental technique in biology.
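As a minimal sketch of the labelling and windowing described in item (1), the Python below maps an RSA value to a two-class or three-class label and one-hot encodes a sliding window of residues as classifier input. The function names, the 21-slot encoding (20 amino acids plus one padding slot) and any threshold values passed in are illustrative assumptions; the abstract does not state the thresholds, window lengths or feature encoding actually used.

```python
# Illustrative sketch only: names, encoding and thresholds are assumptions,
# not the dissertation's actual settings.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def classify_rsa(rsa, thresholds):
    """Map a relative solvent accessibility value (0..1) to a class label.

    One threshold gives two classes (buried/exposed); two thresholds give
    three classes (buried/intermediate/exposed).
    """
    labels = {
        1: ["buried", "exposed"],
        2: ["buried", "intermediate", "exposed"],
    }[len(thresholds)]
    # The class index is the number of thresholds the value meets or exceeds.
    rank = sum(rsa >= t for t in sorted(thresholds))
    return labels[rank]


def window_features(sequence, center, window):
    """One-hot encode the residues in an odd-length window around `center`.

    Each position uses 21 slots: 20 amino acids plus one slot that marks
    padding beyond the sequence ends or an unrecognised residue symbol.
    """
    half = window // 2
    features = []
    for pos in range(center - half, center + half + 1):
        slot = [0] * (len(AMINO_ACIDS) + 1)
        if 0 <= pos < len(sequence) and sequence[pos] in AMINO_ACIDS:
            slot[AMINO_ACIDS.index(sequence[pos])] = 1
        else:
            slot[-1] = 1  # padding / unknown residue
        features.extend(slot)
    return features
```

Feature vectors built this way, paired with the class labels, could then be passed to any SVM implementation; varying `window` and the thresholds corresponds to the parameter exploration the abstract mentions.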
This technology makes it possible to analyze gene expression profiles, patients' genotypes, drug metabolism, and the occurrence and development of diseases at the genomic scale. It also lets scientists analyze the whole genome of an organism in a single experiment. However, large gene expression data sets always contain missing values caused by various factors, such as insufficient resolution, image corruption, or simply dust or scratches on the slide. These missing values affect downstream microarray analysis algorithms. In this dissertation, we propose a new approach based on Support Vector Regression (SVR) to estimate the missing values, and use an orthogonal input coding scheme to handle multiple missing values in one row of an expression profile. To evaluate the proposed method, six microarray datasets were tested with various parameter settings. By using the orthogonal input coding scheme, our approach makes the most of the missing-value information in the whole gene expression matrix. Moreover, SVR is based on the structural risk minimization principle of statistical learning theory and is a powerful tool for general-purpose machine learning problems. Its superior performance compared with the KNN, BPCA and LLS imputation methods indicates the promising estimation ability of the method and its robustness against noise. (3) Gene expression profiles obtained from DNA microarray experiments contain abundant biological information. How to uncover the hidden information in these raw data and construct the related biological networks is one of the questions of concern to systems biologists. We use gene expression profiles to reconstruct the gene regulatory network, using a Bayesian network for structure inference.
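The abstract does not spell out the orthogonal input coding scheme used in item (2), so the Python below is only one plausible reading, offered as a sketch: each expression value becomes a pair (value, missing flag), so that a missing entry is represented by its own orthogonal indicator instead of a guessed number. The function names `encode_row` and `training_pairs` and the use of `None` for missing entries are assumptions made for illustration.

```python
# Illustrative sketch only: this pairing is an assumed interpretation of
# "orthogonal input coding", not the dissertation's documented scheme.

def encode_row(row):
    """Encode each expression value as a (value, missing_flag) pair.

    Observed values become [value, 0.0]; missing values (None) become
    [0.0, 1.0], so rows with several gaps can still be used as input.
    """
    encoded = []
    for value in row:
        if value is None:
            encoded.extend([0.0, 1.0])
        else:
            encoded.extend([float(value), 0.0])
    return encoded


def training_pairs(matrix, target_col):
    """Build (input vector, target value) pairs for one target column.

    Only rows whose target entry is observed are used for training; the
    remaining columns of each row are encoded with `encode_row`.
    """
    pairs = []
    for row in matrix:
        if row[target_col] is None:
            continue  # nothing to learn from when the target is missing
        inputs = row[:target_col] + row[target_col + 1:]
        pairs.append((encode_row(inputs), float(row[target_col])))
    return pairs
```

The resulting pairs could be used to fit a regressor such as an SVR for the target column, whose predictions on rows with a missing target would then serve as the imputed values.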
Both discrete and continuous data are tried as input, and different approximation approaches are used in the Bayesian network structure inference. The method was tested on a data set from a Saccharomyces cerevisiae (yeast) DNA microarray experiment. The results show that the different approximation approaches produce similar network topologies. We analyze part of the resulting network topology against known biological knowledge. The network topology obtained from Bayesian network structure inference turns out to be well explained by biological knowledge, and it can guide biologists in experiment design.
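To illustrate what score-based Bayesian network structure inference on discretized expression data involves, the Python sketch below computes a BIC score for one gene given a candidate set of parent genes; a structure search would compare such scores across candidate parent sets. BIC is chosen here purely as an assumption for illustration, since the abstract does not name the scoring or approximation approaches that were used, and the function name `family_bic` is hypothetical.

```python
# Illustrative sketch only: BIC is an assumed scoring choice, not
# necessarily one of the approaches used in the dissertation.
import math
from collections import Counter


def family_bic(data, child, parents):
    """BIC score of `child` given `parents` for discrete data.

    `data` is a list of dicts mapping variable name -> discrete state.
    Higher scores indicate a better trade-off between fit and complexity.
    """
    n = len(data)
    # Counts of (parent configuration, child state) and of parent configurations.
    joint = Counter((tuple(row[p] for p in parents), row[child]) for row in data)
    parent_counts = Counter(tuple(row[p] for p in parents) for row in data)
    # Maximum-likelihood log-likelihood of the child given its parents.
    loglik = sum(
        count * math.log(count / parent_counts[config])
        for (config, _state), count in joint.items()
    )
    # Free parameters: (child states - 1) per parent configuration, where the
    # number of configurations is the product of observed parent state counts.
    child_states = len({row[child] for row in data})
    configs = 1
    for p in parents:
        configs *= len({row[p] for row in data})
    n_params = (child_states - 1) * configs
    return loglik - 0.5 * math.log(n) * n_params
```

A parent set that genuinely explains the child's state should score higher than the empty parent set, while an unrelated parent only adds a complexity penalty.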
Keywords/Search Tags: Bioinformatics, Systems Biology, Amino acid residue, Solvent accessibility, DNA microarray, Gene expression profile, Missing value estimation, Machine learning, Support Vector Machine, Support Vector Regression, Bayesian network