
Research on Protein Subcellular Location Prediction Based on GO Semantic Similarity

Posted on: 2016-09-12
Degree: Master
Type: Thesis
Country: China
Candidate: X L Zhang
Full Text: PDF
GTID: 2370330473965670
Subject: Software engineering
Abstract/Summary:
The completion of human genome sequencing has led to exponential growth in protein sequence information. This massive influx of sequences has widened the gap between proteins of known sequence and proteins of known function, and the imbalance severely restricts proteome research and new drug development. A protein's function is closely tied to its subcellular localization: a protein must reach the correct location to function normally, and mislocalization can harm the organism. Because subcellular localization information helps to predict protein function, the study of protein subcellular localization is becoming increasingly important in proteomics.

In the conventional biological view, the relationship among genes, proteins and subcellular locations is one-to-one: a gene corresponds to one protein, and a protein corresponds to one subcellular location. Most protein subcellular localization prediction methods are based on this view. Although such methods have achieved some success, research on single-site proteins alone cannot meet the demand, because multi-site proteins also need to be predicted. Multi-site proteins may in fact be more significant. For example, they are more prone to abnormal localization, which is more likely to cause disease. Research on multi-site protein subcellular localization prediction has begun, but it is still at an early stage and is not yet comprehensive.

Research has shown that more comprehensive and more representative features help to improve the accuracy of protein subcellular localization prediction, and many researchers have improved accuracy in this way. For example, combining GO information with annotation information has produced good experimental results. In theory, the more comprehensive the biological information, the more it helps to improve the experimental results. How to choose more representative features is therefore worth exploring, and it is the subject of this thesis. If a group of genes share the same biological function and belong to the same regulatory mechanism, their GO terms are similar. The existing literature considers only whether a GO term appears, without considering the similarity between GO terms. We call the resulting feature vector the traditional GO feature vector. In the traditional GO feature vector, 1 indicates that a GO term is present and 0 that it is absent. In the GO feature vector based on GO semantic similarity, each 0 in the traditional vector is replaced by a newly calculated value, while each 1 remains unchanged. The new GO feature vector thus supplements the traditional one and is a more comprehensive feature.

The main steps of protein subcellular localization prediction are feature extraction and classification, and this thesis focuses on both. The main work is as follows. We propose a multi-label subcellular location predictor, GSS-mPloc, that considers not only GO terms but also the relationships between terms. Given a protein, we obtain its set of GO terms by searching the Gene Ontology database. If the protein is annotated with a GO term, the value for that term is 1; otherwise it is 0. This yields the protein's GO feature vector (6749 dimensions), in which each dimension is 0 or 1. We then use the semantic similarity between GO terms to improve this original vector: for each absent GO term, the semantic similarity between that term and every present GO term is averaged, and the average becomes the new value for the absent term. This yields the new feature vector (6749 dimensions), in which each dimension takes a value between 0 and 1. A multi-label multi-class support vector machine classification algorithm (ML-SVM) is then used to classify the new feature vectors. On the standard human data sets, the absolute accuracy of protein subcellular location prediction is 71.8%, which is 3.6 percent higher than that of state-of-the-art predictors. Experiments show that GO semantic similarity features are superior to traditional GO features and that SVM-based classification outperforms KNN.
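
The feature construction described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the GSS-mPloc implementation: the abstract does not say which GO semantic similarity measure is used, so `toy_similarity`, its lookup table, the four-term `VOCABULARY` (standing in for the 6749 dimensions), and the placeholder term IDs are all hypothetical. The handling of a protein with no GO terms (an all-zero vector) is also an assumption.

```python
from typing import Callable, List, Sequence, Set

# Hypothetical four-term vocabulary standing in for the 6749 GO terms.
VOCABULARY = ["GO:A", "GO:B", "GO:C", "GO:D"]

# Hypothetical pairwise similarities in [0, 1]; a real system would compute
# these from the GO graph with a semantic similarity measure.
TOY_SIMILARITY = {
    frozenset({"GO:A", "GO:C"}): 0.75,
    frozenset({"GO:B", "GO:C"}): 0.25,
    frozenset({"GO:A", "GO:D"}): 0.5,
    frozenset({"GO:B", "GO:D"}): 0.0,
}


def toy_similarity(term_a: str, term_b: str) -> float:
    """Symmetric lookup; unknown pairs are treated as similarity 0 (assumption)."""
    return TOY_SIMILARITY.get(frozenset({term_a, term_b}), 0.0)


def binary_go_vector(protein_terms: Set[str], vocabulary: Sequence[str]) -> List[float]:
    """Traditional GO feature vector: 1 if the term is present, else 0."""
    return [1.0 if term in protein_terms else 0.0 for term in vocabulary]


def similarity_go_vector(
    protein_terms: Set[str],
    vocabulary: Sequence[str],
    similarity: Callable[[str, str], float],
) -> List[float]:
    """Keep 1 for present terms; replace each 0 with the mean similarity
    between that absent term and all present terms."""
    present = [term for term in vocabulary if term in protein_terms]
    vector = []
    for term in vocabulary:
        if term in protein_terms:
            vector.append(1.0)
        elif present:
            total = sum(similarity(term, other) for other in present)
            vector.append(total / len(present))
        else:
            # No present terms to average over: leave the entry at 0 (assumption).
            vector.append(0.0)
    return vector


protein = {"GO:A", "GO:B"}
print(binary_go_vector(protein, VOCABULARY))      # [1.0, 1.0, 0.0, 0.0]
print(similarity_go_vector(protein, VOCABULARY, toy_similarity))  # [1.0, 1.0, 0.5, 0.25]
```

In the toy example, the absent term `GO:C` receives (0.75 + 0.25) / 2 = 0.5 and `GO:D` receives (0.5 + 0.0) / 2 = 0.25, while the two present terms keep the value 1. The resulting vectors would then be passed to a multi-label classifier.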
Keywords/Search Tags: Gene Ontology, semantic similarity, SVM, subcellular location