| With the rapid development of sequencing technology, a large number of protein sequences have emerged, and traditional biological experiments can no longer meet the demand for accurate prediction. Bioinformatics-based prediction of protein subcellular localization has become one of the most important approaches to location prediction and an important part of proteomics. Predicting protein subcellular localization is important not only for the study of protein structure and function but also for the design and development of new drugs. This paper studies feature extraction methods and classification algorithms for protein subcellular localization in depth. The main work is as follows: 1. Based on protein evolutionary information and the segmented distribution of data, a new method for predicting the subcellular localization of single-location proteins is explored. Focusing on the evolutionary information of homologous protein sequences, a novel feature extraction method, PSSM-GSD, is proposed on the basis of the protein position-specific scoring matrix (PSSM). This feature reflects the segmented distribution of each amino acid's evolutionary information along the protein sequence and thereby adds more local information. PSSM-GSD is fused with the AAO and AAPSSM features and fed into a support vector machine for subcellular localization. To address dataset imbalance, the SMOTE algorithm is used to generate synthetic samples for the minority classes. Finally, experiments on the Gram-positive Gpos-mPLoc and Gram-negative Gneg-mPLoc protein datasets yield overall accuracies of 82.0% and 79.5%, respectively. 2. Based on feature selection and a dynamic classifier chain algorithm, a novel method for predicting the subcellular localization of multiple-location proteins is further explored. There is considerable evidence that some proteins exist at two or more subcellular sites, and the localization of these proteins is particularly important. In this paper, MULocEL is constructed, which is a novel ensemble classifier for subcellular 
localization of multi-label proteins. The AAOD, SDPP, and CSPPC feature extraction methods are proposed based on the central tendency and dispersion of the data. Seven feature extraction methods are integrated to extract protein sequence information, evolutionary information, and amino acid physicochemical information. The PageRank algorithm is used to integrate multiple feature selection methods, and a forward adding strategy is used to select an optimal 106-dimensional feature subset from the original 702 dimensions. Building on the traditional classifier chain algorithm, the dynamic classifier chain (DCC) algorithm is proposed, in which the label order is adjusted dynamically according to a CEF index constructed from conditional entropy and the F1 value. The final classification result is then obtained by Bagging. MULocEL achieves an overall accuracy of 84.0% on the Gram-negative Gneg-mPLoc protein dataset and can effectively predict the subcellular locations of multi-label proteins. |
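
The forward adding strategy mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not state the evaluation criterion or the exact procedure, so the following assumes a common variant: features arrive already ranked (for instance by an aggregated ranking), they are added one at a time in that order, and the prefix with the best score is kept. The names `forward_add`, `toy_score`, and `ranked_features` are hypothetical, and `toy_score` is only a stand-in for a real criterion such as cross-validated accuracy.

```python
from typing import Callable, List, Sequence, Tuple


def forward_add(
    ranked: Sequence[int],
    score: Callable[[List[int]], float],
) -> Tuple[List[int], float]:
    """Add ranked features one at a time and keep the best-scoring prefix."""
    best_subset: List[int] = []
    best_score = float("-inf")
    current: List[int] = []
    for feature in ranked:
        current.append(feature)
        s = score(current)
        # Strict '>' keeps the smaller subset when two prefixes tie.
        if s > best_score:
            best_score = s
            best_subset = list(current)
    return best_subset, best_score


def toy_score(subset: List[int]) -> float:
    """Hypothetical criterion: features 4 and 2 help, each feature costs 0.1."""
    informative = {4, 2}
    return len(informative & set(subset)) - 0.1 * len(subset)


# Assumed input: feature indices sorted from most to least important.
ranked_features = [4, 2, 7, 0, 5]
subset, value = forward_add(ranked_features, toy_score)
print(subset, value)
```

In this toy run the search stops improving once the two informative features are included, so the selected subset is the two-feature prefix. In a real pipeline the score function would train and validate a classifier on the candidate subset, which is far more expensive than the stand-in used here.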