Comparative Research On Non-linear Kernel Entropy Component Analysis And Kernel Principal Component Analysis In Protein Subcellular Localization

Posted on: 2017-02-21 | Degree: Master | Type: Thesis
Country: China | Candidate: D S Xu | Full Text: PDF
GTID: 2180330488464495 | Subject: Computer technology
Abstract/Summary:
With the development and application of high-throughput biotechnology in recent years, a growing number of protein sequences have been discovered, yet they are still annotated by biological experiments. Processing this massive biological data by computer could therefore accelerate annotation, especially of subcellular location, which is closely correlated with function. This paper applies pattern recognition, a widely used approach, to predict the subcellular locations of human proteins. To represent a protein sequence, this paper extracts features with a comprehensive representation named pseudo amino acid position-specific scoring matrix (PseAAPSSM), which yields high-dimensional data full of redundancy and noise. This paper therefore introduces a dimension-reduction algorithm not previously reported for this task, kernel entropy component analysis (KECA), and compares it with the traditional kernel principal component analysis (KPCA). KECA weights the contribution of each component by entropy, which is calculated from both the eigenvalues and the eigenvectors of the kernel matrix, whereas KPCA takes only the eigenvalues into consideration and so neglects the eigenvectors' effect on the projection. After dimension reduction, this paper uses the traditional k-nearest neighbors (KNN) classifier and the multi-label one-vs-rest k-nearest neighbors (OVR-KNN) classifier, respectively, to predict subcellular locations. Under jackknife validation with the KNN classifier, the classification algorithm based on KECA with a Gaussian kernel outperforms the one based on KPCA in most subcellular locations. In some subcellular locations, however, especially the centriole, the prediction precision of the KECA-based algorithm is lower, and even zero.
With OVR-KNN, by contrast, the classification algorithm based on KECA with a Gaussian kernel is superior to the one based on KPCA over a wide range of parameter settings. To study the effectiveness of kernels in KECA, this paper further investigates kernel selection. The two kernels serve different functions: the Gaussian kernel mainly retains local information, while the polynomial kernel mainly retains global information. This paper therefore fuses the two kinds of kernels and compares prediction precision on the multi-label dataset, and concludes that the combined kernel function is superior to a single kernel function in terms of preserving data features.
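The difference between the two component-selection rules described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the thesis's implementation: the abstract refers only to "a given formula" for the entropy weighting, so the sketch uses the standard KECA score from the literature, in which component i contributes lambda_i * (e_i^T 1)^2 to a Renyi quadratic entropy estimate. The abstract also does not say how the Gaussian and polynomial kernels are fused, so the weighted sum in `combined_kernel` is an assumption. All function names and parameters (`sigma`, `degree`, `coef0`, `weight`) are hypothetical.

```python
import numpy as np

def gaussian_kernel(X, Y, sigma=1.0):
    # Squared Euclidean distances between all rows of X and Y.
    sq = (np.sum(X**2, axis=1)[:, None]
          + np.sum(Y**2, axis=1)[None, :]
          - 2.0 * X @ Y.T)
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))

def polynomial_kernel(X, Y, degree=2, coef0=1.0):
    return (X @ Y.T + coef0) ** degree

def combined_kernel(X, Y, weight=0.5, sigma=1.0, degree=2):
    # Assumption: the kernels are fused as a convex combination.
    return (weight * gaussian_kernel(X, Y, sigma)
            + (1.0 - weight) * polynomial_kernel(X, Y, degree))

def rank_components(K, method="keca"):
    """Return component order (best first), eigenvalues, eigenvectors, scores."""
    if method == "kpca":
        # KPCA conventionally works on the centered kernel matrix.
        n = K.shape[0]
        H = np.eye(n) - np.ones((n, n)) / n
        K = H @ K @ H
    eigvals, eigvecs = np.linalg.eigh(K)
    eigvals = np.clip(eigvals, 0.0, None)  # drop tiny negative round-off
    if method == "kpca":
        scores = eigvals                               # eigenvalue only
    elif method == "keca":
        scores = eigvals * eigvecs.sum(axis=0) ** 2    # lambda_i * (e_i^T 1)^2
    else:
        raise ValueError(f"unknown method: {method}")
    order = np.argsort(scores)[::-1]
    return order, eigvals, eigvecs, scores

def project(K, n_components, method="keca"):
    """Project the training points onto the top-ranked components."""
    order, eigvals, eigvecs, _ = rank_components(K, method)
    idx = order[:n_components]
    return eigvecs[:, idx] * np.sqrt(eigvals[idx])

# Toy data standing in for PseAAPSSM feature vectors (random, illustration only).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(30, 8))
K_demo = gaussian_kernel(X_demo, X_demo, sigma=2.0)
Z_keca = project(K_demo, n_components=5, method="keca")
Z_kpca = project(K_demo, n_components=5, method="kpca")
```

The reduced matrices `Z_keca` and `Z_kpca` could then be passed to any k-nearest-neighbors classifier. Replacing `gaussian_kernel` with `combined_kernel` when building the kernel matrix gives the fused-kernel variant under the weighted-sum assumption.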
Keywords/Search Tags: Protein subcellular localization, Kernel entropy component analysis, Kernel principal component analysis, Combined kernel function, Multi-label classification