
Protein Subcellular Localization Prediction From Multi-label Learning

Posted on: 2022-11-11    Degree: Master    Type: Thesis
Country: China    Candidate: S P Jin    Full Text: PDF
GTID: 2480306770991019    Subject: Automation Technology
Abstract/Summary:
A protein is an organic compound formed by the spatial folding of a peptide chain. As an important part of human cells and tissues, it provides the nutrients needed to maintain normal human life activities. A protein can function only when it is in the correct subcellular location, so accurately determining that location is an important prerequisite for understanding protein function. As research has deepened, more and more multi-label protein sequences (proteins found in more than one subcellular location) have been identified in databases. Traditional prediction methods are susceptible to external environmental interference and are inefficient, so they cannot meet the needs of multi-label research. Many researchers now apply machine learning to multi-label protein subcellular localization (SCL) prediction, because it allows better predictive models to be built. This thesis therefore uses machine learning to predict multi-label protein subcellular location. The research contents are as follows:

1. A multi-label protein subcellular localization method based on distance metric learning, called ML-locCOMMU, is proposed. First, five feature extraction methods, pseudo amino acid composition (PseAAC), compositional distribution coding (CTD), gene ontology (GO), residue probing transformation (RPT) and evolutionary distance transformation (EDT), are used to convert protein sequences into numerical feature vectors, which are then fused. Because the high-dimensional space produced by multi-information fusion increases the interference of redundant information with the prediction results, the multi-label informed latent semantic indexing (MLSI) method is used to suppress that interference. Finally, a multi-label K-nearest-neighbor classifier based on distance metric learning (MLKNN-Commu) is adopted to make the predictions. The overall actual accuracy (OAA) and overall location accuracy (OLA) of ML-locCOMMU on the training set are 97.41% and 97.69%, respectively. On the three test sets, the OAA values are 93.94%, 82.69% and 80.00%, and the OLA values are 93.00%, 86.01% and 79.19%. These results show that the ML-locCOMMU model performs well in predicting the subcellular location of multi-label proteins.

2. A multi-label learning method called ML-locMLFE is proposed. First, six feature extraction methods are used to obtain effective protein information: pseudo amino acid composition (PseAAC), encoding based on grouped weight (EBGW), gene ontology (GO), multi-scale continuous and discontinuous (MCD), residue probing transformation (RPT) and evolutionary distance transformation (EDT). Next, the MLSI method is used to reduce the dimension of the feature space and avoid the interference of redundant information. Finally, multi-label learning with feature-induced labeling information enrichment (MLFE) is adopted to predict the multi-label protein SCL. The Gram-positive bacteria dataset is chosen as the training set, while the Gram-negative bacteria dataset, the virus dataset, the new plant dataset and the SARS-CoV-2 dataset serve as the test sets. Under leave-one-out cross-validation (LOOCV), the OAA values on the first four datasets are 99.23%, 93.82%, 93.24% and 96.72%. On the SARS-CoV-2 dataset, the OAA of the predictor is 72.73%. The results indicate that ML-locMLFE has clear advantages in predicting the SCL of multi-label proteins and provides new ideas for further research on the topic.
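
The abstract reports results as OAA and OLA without defining them. The sketch below assumes the definitions commonly used in the multi-label localization literature: OAA is the fraction of proteins whose predicted location set exactly matches the true set, and OLA is the fraction of all true locations that are correctly predicted. The function names and the toy data are hypothetical and are not taken from the thesis.

```python
def overall_actual_accuracy(true_sets, pred_sets):
    """Fraction of proteins whose predicted location set equals the true set exactly."""
    exact = sum(1 for t, p in zip(true_sets, pred_sets) if set(t) == set(p))
    return exact / len(true_sets)


def overall_location_accuracy(true_sets, pred_sets):
    """Fraction of all true (protein, location) pairs that are correctly predicted."""
    hits = sum(len(set(t) & set(p)) for t, p in zip(true_sets, pred_sets))
    total = sum(len(set(t)) for t in true_sets)
    return hits / total


# Toy example (made-up labels): four proteins, the second one is only
# partially predicted, so it counts as a miss for OAA but still
# contributes one correct location to OLA.
TRUE_SETS = [
    {"cytoplasm"},
    {"cytoplasm", "cell membrane"},
    {"extracellular"},
    {"cell membrane", "extracellular"},
]
PRED_SETS = [
    {"cytoplasm"},
    {"cytoplasm"},
    {"extracellular"},
    {"cell membrane", "extracellular"},
]

oaa = overall_actual_accuracy(TRUE_SETS, PRED_SETS)    # 3 of 4 exact matches
ola = overall_location_accuracy(TRUE_SETS, PRED_SETS)  # 5 of 6 true locations hit
print(f"OAA = {oaa:.4f}, OLA = {ola:.4f}")
```

Under these assumed definitions OAA is the stricter of the two measures, since a protein with any missing or extra predicted location counts as wrong.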
Keywords/Search Tags:SARS-CoV-2, Multi-label protein subcellular localization, MLSI dimension reduction, MDDMf dimension reduction, MLFE classifier, MLKNN-Commu classifier