| Feature processing is an important part of identifying protein sequences with machine learning methods. By generating optimal feature sets, improving recognition accuracy, and discovering important feature segments, it provides a strong reference and support for traditional experiments. This thesis proposes three protein sequence feature processing algorithms from the perspectives of multi-task learning, combinatorial optimization, and directed graph decomposition: (1) Multi-task protein feature selection algorithm based on data set structure information: To address the optimal feature selection problem in building protein sequence recognition models, a multi-task feature selection algorithm is proposed. During feature selection, the algorithm constructs multiple SVM models with different objective functions according to the structure information of the data set, then trains and optimizes these models through parameter sharing to determine the optimal feature set. Under leave-one-out cross-validation, the algorithm recognizes cell lyase with Accuracy, Sensitivity, Specificity, Matthews correlation coefficient, and AUC values of 0.93, 0.853, 0.948, 0.775, and 0.9, respectively. (2) Protein feature subset search algorithm based on elimination strategy: To avoid the combinatorial explosion of feature combinations, two subset search algorithms based on an elimination strategy are proposed, one based on direct elimination and one based on cache elimination. Both use the elimination strategy to find new feature combinations, avoiding the human factors in current mainstream feature selection methods and their over-reliance on feature ranking results. On 21 sets of feature ranking data, the algorithms achieve high model evaluation scores with low-dimensional optimal feature sets. (3) Protein feature ranking algorithm based on ranking integration 
strategy: Based on quantifying the global and local ranking factors of the base rankings, a feature ranking integration algorithm based on weight quantification is proposed. Specifically, following the central limit theorem, the normality of the feature score distribution in each base ranking to be integrated is quantified to generate a weight for that ranking. A weighted directed graph is then generated, and HodgeRank is used to obtain the final global ranking. In two-thirds of 56 experiments, the algorithm outperforms comparable algorithms. All three proposed feature processing algorithms aim to generate optimal feature sets and train models with high recognition rates. Feature selection builds on feature ranking, and feature subset search is an important part of feature selection. Individually or in concert, they can address the feature processing problem in protein sequence recognition. |
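
The ranking integration step in (3) can be illustrated with a minimal sketch. It is not the thesis implementation: the function name `hodge_rank_complete`, the example scores, and the example weights are hypothetical, and the normality-based weighting is not reproduced here (the weights are simply passed in). The sketch also assumes that every base ranking scores every feature, so the comparison graph is complete with equal edge weights; in that special case the least-squares potential computed by HodgeRank reduces to each feature's mean net pairwise flow.

```python
def hodge_rank_complete(score_lists, weights):
    """Aggregate weighted base rankings into one global ranking.

    score_lists[k][i]: score that base ranking k gives feature i (higher is better).
    weights[k]: non-negative weight of base ranking k (assumed given; the
                normality-based weighting from the abstract is not reproduced).

    Assumes a complete comparison graph with equal edge weights, where the
    HodgeRank least-squares solution is the mean net pairwise flow.
    """
    n = len(score_lists[0])
    total = float(sum(weights))

    # flow[i][j] > 0 means the weighted majority prefers feature j over feature i.
    # The matrix is skew-symmetric: flow[i][j] == -flow[j][i].
    flow = [[0.0] * n for _ in range(n)]
    for scores, w in zip(score_lists, weights):
        for i in range(n):
            for j in range(n):
                if scores[j] > scores[i]:
                    flow[i][j] += w / total
                    flow[j][i] -= w / total

    # Global potential: average flow pointing into each feature (sums to zero).
    potential = [sum(flow[j][i] for j in range(n)) / n for i in range(n)]

    # Feature indices ordered from best to worst.
    order = sorted(range(n), key=lambda i: -potential[i])
    return potential, order


# Hypothetical example: three base rankings over three features.
example_scores = [
    [0.9, 0.5, 0.1],  # base ranking 1: f0 > f1 > f2
    [0.8, 0.2, 0.4],  # base ranking 2: f0 > f2 > f1
    [0.3, 0.7, 0.6],  # base ranking 3: f1 > f2 > f0
]
example_weights = [0.5, 0.3, 0.2]
potential, order = hodge_rank_complete(example_scores, example_weights)
```

For an incomplete or unevenly weighted graph, the same potential would instead be obtained by solving the weighted graph-Laplacian least-squares system, which this sketch omits for brevity.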