An Efficient Method For Dealing With Missing Data In Homology Modeling

Posted on: 2013-02-14 | Degree: Master | Type: Thesis
Country: China | Candidate: A P Liu | Full Text: PDF
GTID: 2230330374965034 | Subject: Theoretical Physics
Abstract/Summary:
Protein molecules usually consist of hundreds or even thousands of amino acids. Describing such a structure in full requires a 6N-dimensional space, where N is the number of atoms. A space of this size is difficult to work with, and for some problems it is not feasible at all. However, proteins of the same family that descend from a common ancestor retain genetic information that is conserved during evolution. As a result, some segments of these proteins have very similar three-dimensional structures; these segments are usually called conserved regions. The structures of conserved regions can be treated with principal component analysis (PCA), which reduces the dimension of the space as far as the required precision allows and so simplifies the problem.

Applying PCA to proteins involves three steps: first, perform a multiple sequence alignment of the protein molecules in the same family; then identify the conserved regions according to their degree of dispersion; finally, use PCA to obtain the sample space of the conserved-region structures. Because conserved regions carry genetic information that is useful in many problems, we want the PCA calculation to cover conserved regions that are as large as possible. Some regions, however, are only slightly less conserved: they contain a small number of vacant positions yet still carry a great deal of genetic information. If these vacancies can be filled with suitable numerical values, PCA can be applied to these regions as well.

When dealing with missing values, people usually delete or ignore them, or even fill them with 0. Ignoring missing values is generally not a serious problem when the sample set is very large, but for biological data on protein molecules it is a serious shortcoming, because such treatment does not make full use of the valuable information these data sets contain.
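The PCA step and the cost of zero-filling described above can be illustrated with a minimal sketch. It assumes each aligned conserved-region structure has already been flattened into one row of a coordinate matrix. The data are synthetic, and the names `pca`, `n_proteins`, `n_coords` and `n_modes` are hypothetical choices made for this illustration, not taken from the thesis.

```python
import numpy as np

def pca(X, k):
    """PCA via SVD of the centered data matrix.

    X has shape (n_samples, n_features). Returns the feature means,
    the first k principal axes, the sample scores on those axes, and
    the fraction of total variance the k axes capture.
    """
    mean = X.mean(axis=0)
    Xc = X - mean
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    variance = s ** 2
    ratio = variance[:k].sum() / variance.sum()
    return mean, Vt[:k], Xc @ Vt[:k].T, ratio

# Synthetic stand-in for aligned conserved-region coordinates
# (an assumption of this sketch): each row is one protein, and the
# rows differ mainly along a few collective directions.
rng = np.random.default_rng(0)
n_proteins, n_coords, n_modes = 30, 120, 3
base = 10.0 * rng.normal(size=n_coords)
modes = rng.normal(size=(n_modes, n_coords))
amplitudes = rng.normal(size=(n_proteins, n_modes))
X = base + amplitudes @ modes + 0.01 * rng.normal(size=(n_proteins, n_coords))

# A few principal components reproduce the complete data closely.
mean, axes, scores, ratio = pca(X, n_modes)
X_rec = mean + scores @ axes
rms_error = np.sqrt(np.mean((X - X_rec) ** 2))

# Replacing 5% of the entries with 0 spreads variance over many
# spurious directions, so the same number of components explains less.
gap_mask = rng.random(X.shape) < 0.05
X_zero = np.where(gap_mask, 0.0, X)
ratio_zero = pca(X_zero, n_modes)[3]

print(f"variance captured, complete data: {ratio:.4f}")
print(f"variance captured, zero-filled:   {ratio_zero:.4f}")
print(f"RMS reconstruction error:         {rms_error:.4f}")
```

In this toy setting the zero-filled matrix needs many more components to reach the same precision, which is the loss of information the abstract attributes to naive treatment of missing values.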
Therefore, missing values should be filled with reasonable values derived from the relationships among the data. This thesis presents an efficient method for dealing with missing data in homology modeling. Unlike traditional iterative methods, it requires no iterative computation: only two matrix operations are needed. It is a fully linear method, and it therefore avoids the effect that iterative algorithms have on the reliability of the original data. The method applies not only to filling missing data in the sample space of protein homology modeling but also to missing-value problems in other research fields, so it is of broad significance.
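The abstract does not say which two matrix operations the method uses, so the sketch below is not the thesis method. It only illustrates the general idea of a non-iterative, linear fill: missing entries are predicted from the observed ones by a least-squares fit on the complete samples, using one solve and one multiplication per incomplete row. The function name `linear_fill` and the synthetic data are assumptions made for this illustration.

```python
import numpy as np

def linear_fill(X):
    """Fill NaN entries of X (n_samples, n_features) in a single pass.

    For each incomplete row, the missing columns are regressed on the
    observed columns using only the complete rows, and the fitted
    linear map is applied to that row. No iteration is involved.
    """
    X = np.asarray(X, dtype=float)
    filled = X.copy()
    incomplete = np.isnan(X).any(axis=1)
    complete_rows = X[~incomplete]
    if len(complete_rows) < 2:
        raise ValueError("need at least two complete rows to fit the map")
    mean = complete_rows.mean(axis=0)
    C = complete_rows - mean
    for i in np.flatnonzero(incomplete):
        miss = np.isnan(X[i])
        obs = ~miss
        if not obs.any():
            # Nothing observed in this row: fall back to the column means.
            filled[i] = mean
            continue
        # Least-squares map from observed to missing columns.
        B, *_ = np.linalg.lstsq(C[:, obs], C[:, miss], rcond=None)
        filled[i, miss] = mean[miss] + (X[i, obs] - mean[obs]) @ B
    return filled

# Synthetic low-rank data with a few gaps (an assumption of this sketch).
rng = np.random.default_rng(1)
latent = rng.normal(size=(80, 3))
loadings = rng.normal(size=(3, 12))
X_true = 5.0 + latent @ loadings + 0.01 * rng.normal(size=(80, 12))

X_miss = X_true.copy()
for r in rng.choice(80, size=10, replace=False):
    X_miss[r, rng.choice(12, size=2, replace=False)] = np.nan
gap_mask = np.isnan(X_miss)

X_filled = linear_fill(X_miss)
err_linear = np.sqrt(np.mean((X_filled[gap_mask] - X_true[gap_mask]) ** 2))
err_zero = np.sqrt(np.mean(X_true[gap_mask] ** 2))  # error of filling with 0

print(f"RMS error, linear fill: {err_linear:.4f}")
print(f"RMS error, zero fill:   {err_zero:.4f}")
```

The sketch relies on there being more complete samples than observed features. With fewer complete samples, as is common for small protein families, the `rcond` cutoff of `np.linalg.lstsq` or some other regularization would be needed to keep the fit from amplifying noise.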
Keywords/Search Tags: Homology Modeling, Missing Data, EMPCA, KNN