| Mycobacterium tuberculosis (MTB), also known as the tubercle bacillus, is a slightly curved, rod-shaped aerobic bacterium. Protected by its cell wall lipids and capsule, it is highly resistant to the external environment, and it has been proven to be the causative pathogen of the contagious disease tuberculosis (TB). China is one of the countries bearing the heaviest TB burden in the world, and the disease kills as many as 1 million people every year. As a chronic respiratory infectious disease, TB has inconspicuous early symptoms and a long treatment cycle, which allows it to spread easily through the population and get out of control. Over the past ten decades, although many medical experts have devoted themselves to studying the molecular structure, toxicity and pathology of MTB, no drug can yet fully prevent or cure the disease, owing to the complex membrane structure and frequent gene mutations of MTB. Recent studies suggest that secretory protein antigens can be used to detect antibodies in infected specimens, so distinguishing secretory proteins from non-secretory proteins is of great importance for tracing the real pathogenic factors and developing vaccines or drugs against TB. In this work, we developed an algorithm to recognize the secretory proteins of MTB and provided it as an online service. First, we constructed standard datasets of MTB proteins collected from the experimentally confirmed records of UniProt. After removing redundant sequences as far as possible with the CD-HIT online service, we obtained a positive dataset containing 35 samples and a negative dataset containing 266 samples. Then, we extracted g-gapped dipeptide compositions and physicochemical property features to encode each protein sequence as a feature vector. Finally, we built and trained the model with the popular SVM algorithm; to further improve its predictive power, we performed feature selection on the basis of the
optimal model parameters. As a result, each protein sequence was represented by a 374-dimensional feature vector comprising 9-gapped dipeptide compositions and hydrophilic/hydrophobic properties. Validated by the jackknife test, the proposed algorithm achieved an average accuracy of 87.18%, and the area under the receiver operating characteristic curve reached 0.93. To demonstrate the advantage of the SVM-based model, we rebuilt the model on the same standard dataset using Random Forest, Bayes Network and RBF Network, all of which are implemented in the Weka software. The jackknife test again showed that the SVM-based model outperforms the other three in predicting the secretory proteins of MTB. To help researchers in related fields communicate research progress and share scientific achievements, the user-friendly online service MycoSec (http://lin.uestc.edu.cn/server/MycoSec/) is open and free for non-commercial use. |
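
The g-gapped dipeptide composition mentioned in the abstract can be illustrated with a minimal sketch. It assumes the commonly used definition: the frequency of each ordered pair of the 20 standard amino acids separated by exactly g intervening residues, giving 400 values per sequence. The abstract does not spell out the exact formula, so the function name `g_gap_dipeptide_composition`, the normalisation by the number of valid pairs, and the toy sequence are illustrative assumptions rather than the authors' implementation.

```python
from itertools import product

# The 20 standard amino acids and all 400 ordered residue pairs.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]


def g_gap_dipeptide_composition(sequence, g):
    """Return 400 frequencies of residue pairs separated by g residues.

    Assumption: pairs containing a non-standard residue (e.g. 'X') are
    skipped, and frequencies are normalised by the number of valid pairs.
    """
    seq = sequence.upper()
    counts = dict.fromkeys(DIPEPTIDES, 0)
    total = 0
    # Position i is paired with position i + g + 1 (g residues in between).
    for i in range(len(seq) - g - 1):
        pair = seq[i] + seq[i + g + 1]
        if pair in counts:
            counts[pair] += 1
            total += 1
    if total == 0:
        # Sequence too short for this gap, or no valid pairs at all.
        return [0.0] * len(DIPEPTIDES)
    return [counts[d] / total for d in DIPEPTIDES]


# Toy sequence (hypothetical, not taken from the MTB dataset).
features = g_gap_dipeptide_composition("MKTAYIAKQRQISFVKSHFSRQ", g=9)
print(len(features))
```

With g = 0 this reduces to the ordinary adjacent dipeptide composition. Under this definition, the 9-gapped features named in the abstract would correspond to calling the function with `g=9`.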