| Functional biological sequences carry much of the information essential to life: deoxyribonucleic acid (DNA) sequences direct protein synthesis through transcription and translation, and thereby control biological reactions and traits. Identifying functional biological sequences and using them effectively is therefore important for progress in biology. Biological sequence identification covers two main kinds of problems: protein sequence identification and nucleic acid sequence identification (including DNA and ribonucleic acid, RNA). Traditional wet-lab methods have shortcomings such as low accuracy and high cost. This thesis therefore addresses one representative problem of each kind. Based on multiple kernel learning and the Hilbert-Schmidt independence criterion (HSIC), we propose a multiple kernel support vector machine (MKSVM) model for identifying antifungal peptides and a k-nearest-neighbor multi-label classification model for predicting the subcellular localization of non-coding RNA. We first review related work and then present the MKSVM model. We extract feature vectors with five sequence-based feature descriptors, construct a kernel matrix from each with the Gaussian kernel function, and combine these matrices by multiple kernel learning and HSIC to build the multi-kernel support vector machine model. Experimental results show that multiple kernel learning improves predictive performance and accuracy, and that our model outperforms other state-of-the-art models. We then propose the multiple kernel graph regularized k-local hyperplane distance nearest neighbor algorithm (MKGHkNN). We encode RNA sequences with five sequence-based features. Because subcellular localization prediction is a multi-label problem, we apply the One-vs-Rest strategy to decompose it into multiple binary classification problems and use MKGHkNN to obtain the predictions. Experimental results show that our improved model outperforms the original one and that, with multiple kernel learning, it performs better than existing models. Our study has the following limitations. First, we extract features only from sequence-based information and physicochemical properties; other features, such as structural properties and features extracted by deep learning, should also be considered to improve prediction accuracy. Second, the k-nearest-neighbor model is a lazy learning approach, which is less efficient on large datasets. Third, the One-vs-Rest strategy ignores the relationships between labels; other methods, such as classifier chains, could be tried to improve the model in future work. |
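
The kernel-combination step described above can be illustrated with a minimal sketch. It builds one Gaussian kernel matrix per feature set, scores each matrix by its empirical HSIC with a label kernel, and forms a weighted sum. The weighting rule (weights proportional to clipped HSIC scores), the function names, the `gamma` value, and the toy data are all assumptions made for illustration; the abstract does not specify the optimization the thesis actually uses to learn kernel weights.

```python
import math

def gaussian_kernel(X, gamma):
    """Gaussian (RBF) kernel matrix: K[i][j] = exp(-gamma * ||x_i - x_j||^2)."""
    return [[math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(xi, xj)))
             for xj in X] for xi in X]

def center(K):
    """Double-center K, i.e. compute H K H with H = I - (1/n) * ones."""
    n = len(K)
    row = [sum(K[i]) / n for i in range(n)]
    col = [sum(K[i][j] for i in range(n)) / n for j in range(n)]
    total = sum(row) / n
    return [[K[i][j] - row[i] - col[j] + total for j in range(n)]
            for i in range(n)]

def hsic(K, L):
    """Biased empirical HSIC estimate: trace(K H L H) / (n - 1)^2."""
    n = len(K)
    Kc, Lc = center(K), center(L)
    # For symmetric matrices, trace(Kc @ Lc) equals the elementwise sum.
    total = sum(Kc[i][j] * Lc[i][j] for i in range(n) for j in range(n))
    return total / (n - 1) ** 2

def hsic_weights(kernels, y):
    """ASSUMPTION (illustrative heuristic, not the thesis's optimization):
    weight each kernel in proportion to its HSIC with the label kernel."""
    L = [[yi * yj for yj in y] for yi in y]  # label kernel for y in {+1, -1}
    scores = [max(hsic(K, L), 0.0) for K in kernels]
    s = sum(scores)
    if s == 0.0:  # no kernel shows dependence on the labels: fall back to uniform
        return [1.0 / len(kernels)] * len(kernels)
    return [v / s for v in scores]

def combine(kernels, weights):
    """Weighted sum of kernel matrices."""
    n = len(kernels[0])
    return [[sum(w * K[i][j] for w, K in zip(weights, kernels))
             for j in range(n)] for i in range(n)]

# Hypothetical toy data: one feature set separates the classes, one does not.
y = [1, 1, -1, -1]
X_informative = [[0.0], [0.1], [1.0], [1.1]]
X_noise = [[0.0], [1.0], [0.1], [1.1]]

kernels = [gaussian_kernel(X_informative, gamma=1.0),
           gaussian_kernel(X_noise, gamma=1.0)]
weights = hsic_weights(kernels, y)
K_combined = combine(kernels, weights)
print(weights)
```

The combined matrix could then be passed to any SVM implementation that accepts a precomputed kernel, for example scikit-learn's `SVC(kernel="precomputed")`. Plain lists keep the sketch dependency-free; an array library would be the natural choice for real datasets.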