Studies On Missing Value Problems In Protein Homology Modeling

Posted on: 2015-04-13
Degree: Master
Type: Thesis
Country: China
Candidate: G Q Wu
Full Text: PDF
GTID: 2180330431982432
Subject: Theoretical Physics
Abstract/Summary:
Protein structure and function are intimately linked, which makes structure determination important. Evolutionarily related proteins have similar sequences, and naturally occurring homologous proteins have similar structures. Homology modeling, a method for the computational prediction of a protein's structure from its sequence, builds on this observation: it constructs a 3D model for a protein of unknown structure from the experimental structures of evolutionarily related proteins. The standard procedure first aligns the target protein sequence (the query sequence) to candidate proteins with known structures, and then extracts the core of the proteins, the region in which their sequences are similar. To further extract the evolutionary information, principal component analysis (PCA) is applied to the core. Unfortunately, in some fragments of the core, insertions and deletions in the multiple structural alignment cause missing values, and standard PCA cannot handle a core with missing values. The gapped positions must be imputed with reasonable values so that the evolutionary information can be fully used to build a low-dimensional feature space. Obtaining the low-dimensional feature space of the enlarged core promises a refinement in homology modeling.
In this work, the k-nearest neighbor algorithm (KNN), self-organizing maps (SOM), and a back-propagation (BP) network are proposed to impute the missing values. MAMMOTH-mult is used to align the other proteins from the same superfamily with the target. From the alignment, evolutionary cores are identified and classified into strict cores and loose cores. The strict core is the set of gapless positions for which the Cα atoms are present in all proteins and all pairwise distances are smaller than 4 Å, while the loose core is the set of positions for which at least 2/3 of all proteins have a Cα atom within 3 Å of each other.
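To make the imputation step concrete, the following is a minimal sketch of KNN imputation on a gapped data matrix. It is an illustration under stated assumptions, not the thesis's implementation: rows are assumed to be proteins, columns are assumed to be aligned coordinate features, gaps are assumed to be encoded as NaN, and the function name `knn_impute` and the parameter `k` are hypothetical.

```python
import numpy as np

def knn_impute(X, k=3):
    """Fill NaN entries with the mean of the k most similar rows.

    Assumption: X has shape (n_proteins, n_features) and NaN marks a gap.
    Similarity is the root mean square difference over the columns that
    both rows have observed.
    """
    X = np.asarray(X, dtype=float)
    filled = X.copy()
    missing = np.isnan(X)
    n_rows = X.shape[0]

    for i in np.where(missing.any(axis=1))[0]:
        observed_i = ~missing[i]
        dists = np.full(n_rows, np.inf)
        for j in range(n_rows):
            if j == i:
                continue
            shared = observed_i & ~missing[j]
            if shared.any():
                diff = X[i, shared] - X[j, shared]
                dists[j] = np.sqrt(np.mean(diff ** 2))
        order = np.argsort(dists)

        for c in np.where(missing[i])[0]:
            # Donors are the nearest rows that actually observe column c.
            donors = [j for j in order
                      if np.isfinite(dists[j]) and not missing[j, c]][:k]
            if donors:
                filled[i, c] = X[donors, c].mean()
            # If no donor exists, the entry is left as NaN.
    return filled

# Toy data: the second row has a gap in its last column.
example = np.array([
    [1.0, 2.0, 3.0],
    [1.0, 2.0, np.nan],
    [1.1, 2.1, 3.2],
    [10.0, 10.0, 10.0],
])
completed = knn_impute(example, k=2)
```

Once every gap is filled, a standard PCA (for example via a singular value decomposition of the mean-centered matrix) can be run on the completed matrix.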
After the gapped positions in the loose core are imputed, PCA can be applied to the enlarged regions. To fully preserve the evolutionary information in the original principal components (PCs) and to ensure numerical stability, a simplified expectation-maximization (EM) technique is proposed. To obtain a more accurate model, the anisotropic network model (ANM) is applied to construct an evolutionary and vibrational harmonics (EVA) space for protein modeling.
Applied to a set of 33 superfamilies with low pairwise sequence identity (SID), the method enlarged the modeling region by an average of 30%. After the missing values are filled, the regions are greatly enlarged, and the average coverage increases from 62.9% to 82.7%. Meanwhile, the quality of the low-dimensional sampling spaces is found to be quite satisfactory, as measured by the root mean square deviation (RMSD). The RMSD decreases from 1.65 Å to 1.08 Å with KNN, 1.08 Å with SOM, and 1.12 Å with BP. After application of ANM, the RMSD decreases further to 0.88 Å (KNN), 0.89 Å (SOM), and 0.93 Å (BP). This implies that the sampling spaces obtained are suitable for further applications in protein structural research. More broadly, the proposed methods for dealing with missing values are applicable in various other fields.
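The abstract does not spell out the simplified EM technique, so the sketch below shows only the generic EM-style idea commonly used for PCA with missing values: alternate between fitting a low-rank PCA model to the completed matrix and replacing the missing entries with the model's reconstruction. The function name `em_pca_impute`, its parameters, and the NaN encoding of gaps are assumptions made for illustration and are not taken from the thesis.

```python
import numpy as np

def em_pca_impute(X, n_components=1, n_iter=200, tol=1e-10):
    """Iteratively impute NaN entries using a rank-limited PCA model.

    Assumption: every column has at least one observed value.
    """
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    # Initial guess: column means of the observed values.
    filled = np.where(missing, np.nanmean(X, axis=0), X)

    for _ in range(n_iter):
        # Fit step: PCA of the current completed matrix via SVD.
        mean = filled.mean(axis=0)
        U, s, Vt = np.linalg.svd(filled - mean, full_matrices=False)
        recon = (U[:, :n_components] * s[:n_components]) @ Vt[:n_components] + mean
        # Update step: overwrite only the missing entries.
        updated = np.where(missing, recon, X)
        change = np.max(np.abs(updated - filled))
        filled = updated
        if change < tol:
            break
    return filled

# Toy data lying on a single line, with one entry hidden.
t = np.array([1.0, 2.0, 3.0, 5.0, 8.0])
truth = np.outer(t, [1.0, 2.0, 3.0])
gapped = truth.copy()
gapped[1, 2] = np.nan
recovered = em_pca_impute(gapped, n_components=1)
```

Observed entries are never modified, so the principal components end up being estimated from the measured coordinates plus values that are consistent with the model itself.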
Keywords/Search Tags: homology modeling, missing value problem, k-nearest neighbor algorithm, self-organizing maps, back propagation network, expectation-maximization algorithm, normal mode analysis