| G protein-coupled receptors (GPCRs) are among the most important drug targets. Determining the category of a GPCR helps in understanding its structure and function, which in turn guides further experiments and applications. Predicting the category of GPCR sequences whose functions are not yet clear is therefore an active research topic in bioinformatics. Because GPCR sequences are highly similar and redundant, it is difficult to extract valuable features accurately from the original data. This paper proposes a feature extraction algorithm based on clustering by amino acid evolutionary similarity. The algorithm uses the substitution scores in an amino acid substitution matrix to evaluate the evolutionary correlation between features and clusters candidate features into groups of synonymous phrases. Each cluster is then merged and represented by a unique key function word, and the retained key function words form a feature knowledge base. Finally, before the training and testing phases, each original GPCR sequence is converted into a feature vector using an improved bag-of-words model built on the features in the knowledge base. After analyzing the resulting feature vectors, this paper applies feature selection and dimensionality reduction algorithms to process them further. To verify the effectiveness of the method, several classification models were constructed by combining the proposed feature extraction algorithm with the nearest neighbor algorithm, random forest, support vector machine, and multilayer perceptron, and experiments were conducted on two public data sets containing 8354 and 12731 GPCR sequences, respectively. Compared with existing methods, the proposed method significantly improves the accuracy of GPCR sequence classification. This work demonstrates the potential of a new feature extraction strategy and provides an effective theoretical design for the classification of G protein-coupled receptors. |
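
The pipeline summarized above (cluster evolutionarily similar features, keep one key word per cluster, then count key words per sequence) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes that features are fixed-length k-mers, that clustering is a simple greedy pass over k-mers ordered by frequency, and that the final representation is a plain bag-of-words count vector rather than the improved model the paper describes. The score table `TOY_SCORES`, the constants `MATCH_SCORE` and `DEFAULT_MISMATCH`, the threshold, and all function names are hypothetical; a real run would substitute an actual substitution matrix such as BLOSUM62.

```python
from collections import Counter

# Illustrative pairwise scores only: NOT a real substitution matrix.
TOY_SCORES = {
    ("I", "L"): 2, ("I", "V"): 3, ("L", "V"): 1,
    ("K", "R"): 2, ("D", "E"): 2,
}
MATCH_SCORE = 4        # assumed score for identical residues
DEFAULT_MISMATCH = -2  # assumed score for any pair not listed above


def pair_score(a, b):
    """Score one residue pair, treating the table as symmetric."""
    if a == b:
        return MATCH_SCORE
    return TOY_SCORES.get((a, b), TOY_SCORES.get((b, a), DEFAULT_MISMATCH))


def kmer_score(x, y):
    """Sum of position-wise residue scores for two equal-length k-mers."""
    return sum(pair_score(a, b) for a, b in zip(x, y))


def kmers(seq, k):
    """All overlapping k-mers of a sequence, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_knowledge_base(sequences, k, threshold):
    """Greedily cluster k-mers; return (key words, k-mer -> key word map)."""
    counts = Counter(km for s in sequences for km in kmers(s, k))
    keys, key_of = [], {}
    # Most frequent k-mers first; ties broken alphabetically for determinism.
    for km, _ in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        for key in keys:
            if kmer_score(km, key) >= threshold:
                key_of[km] = key  # join the first sufficiently similar cluster
                break
        else:
            keys.append(km)       # no similar cluster: km becomes a key word
            key_of[km] = km
    return keys, key_of


def to_vector(seq, k, keys, key_of):
    """Bag-of-words vector: occurrences of each key word in the sequence."""
    counts = Counter(key_of[km] for km in kmers(seq, k) if km in key_of)
    return [counts[key] for key in keys]


sequences = ["MKVLD", "MRILE"]  # made-up toy sequences
keys, key_of = build_knowledge_base(sequences, k=2, threshold=6)
vectors = [to_vector(s, 2, keys, key_of) for s in sequences]
```

With these toy inputs, the 2-mers `VL` and `IL` fall into one cluster because their summed score reaches the threshold, so both sequences contribute to the same vector dimension. The resulting vectors could then be passed to any of the classifiers named in the abstract.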