Research On The Feature Analysis And Prediction Of Bacterial Essential Genes

Posted on: 2017-10-09
Degree: Master
Type: Thesis
Country: China
Candidate: B J Wang
Full Text: PDF
GTID: 2310330509953966
Subject: Communication and Information System
Abstract/Summary:
Essential genes are genes that are indispensable for survival. Identifying them is important for understanding the minimal requirements for cellular survival, and it is useful in biological and biomedical studies such as research on the origin of life, drug design, and the treatment of disease. At present, essential genes are determined by experimental techniques. However, experimental methods are time-consuming and expensive, and different experimental methods may yield different results. Computational prediction methods offer a good alternative.

Many computational methods, especially machine learning-based methods, have been proposed in recent years for predicting essential genes. Many features that describe gene essentiality, including high-throughput experimental features and topology features, have been adopted to increase prediction accuracy. Two problems remain. First, the computational models were proposed and verified on one or a few organisms and were not applied to a sufficient range of organisms. Second, previous models used experimental features that cannot be derived directly from the sequence, whereas in practice only sequence-based features are commonly available for a newly sequenced genome. In our study, the bacterial species recorded in the Database of Essential Genes (DEG) were used as the objects of analysis. Sequence-based features were calculated, and Lasso was employed for feature selection. Finally, a support vector machine (SVM) and ensemble learning were used to predict essential genes. The main work was as follows:

(1) The Hurst exponent, a characteristic parameter that describes long-range correlation in DNA, was calculated and its distribution analyzed in 33 bacterial genomes. In most genomes (31 out of 33), the significance levels of the Hurst exponents of the essential genes were significantly higher than those of the corresponding full gene set, whereas the significance levels of the Hurst exponents of the nonessential genes remained unchanged or increased only slightly. We therefore propose that the distribution of the Hurst exponents of essential genes can be used as a classification index for essential gene prediction in bacteria.

(2) The least absolute shrinkage and selection operator (Lasso) method was used to screen key sequence-based features related to gene essentiality. To assess their effect, the selected features were used to predict the essential genes of 31 bacterial species with a support vector machine classifier. For all 31 bacterial species (21 Gram-negative and 10 Gram-positive), the features in the three datasets were reduced from 57, 59, and 58 to 40, 37, and 38, respectively, without loss of prediction accuracy. The results showed that some features were redundant for gene essentiality and could be eliminated from future analyses. The selected features contained more complex biological information about gene essentiality.

(3) A support vector machine (SVM) was employed to predict essential genes. The numbers of essential and nonessential genes differed so greatly that it was very hard for any machine learning algorithm to obtain a balanced result, so a weighted support vector machine (WSVM) classifier was used, in which positive and negative samples were assigned different weights. The classifier was evaluated in four ways: self-test, cross-validation, leave-one-species-out, and cross-species testing.

(4) Ensemble learning was employed to predict essential genes. To increase prediction accuracy, two ensemble learning models were built. In the first, the nonessential genes (negative samples) were divided into several subsets, and a new training set was built from the positive samples and each negative subset. Several SVM classifiers were trained in this way, and the final outcome was obtained by combining their outputs using an unweighted average. In the second, four classifiers were used, namely SVM, naive Bayes, Bagging, and KNN. Each classifier independently predicted essential genes, and the final result was obtained by combining the outputs of these diverse classifiers using an unweighted average.
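The first ensemble scheme in item (4) can be sketched as follows: the negative samples are split into subsets, one base model is trained on all positives plus one negative subset, and the member scores are combined by an unweighted average. The thesis uses SVM base classifiers. To keep this sketch dependency-free, a simple nearest-centroid scorer is swapped in as a stand-in, and all function names, the `seed`, and the `threshold` are hypothetical.

```python
import math
import random


def split_negatives(negatives, n_subsets, seed=0):
    """Shuffle the negative samples and deal them into n_subsets parts."""
    shuffled = list(negatives)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::n_subsets] for i in range(n_subsets)]


def centroid(rows):
    """Component-wise mean of a list of equal-length feature vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def train_base(positives, negatives):
    """Stand-in base learner (NOT an SVM): stores the two class centroids."""
    return centroid(positives), centroid(negatives)


def score_base(model, x):
    """Score in [0, 1]; higher means x is closer to the positive centroid."""
    pos_c, neg_c = model
    d_pos = math.dist(x, pos_c)
    d_neg = math.dist(x, neg_c)
    total = d_pos + d_neg
    return 0.5 if total == 0 else d_neg / total


def train_ensemble(positives, negatives, n_subsets):
    """Train one base model per (all positives + one negative subset)."""
    if not 1 <= n_subsets <= len(negatives):
        raise ValueError("n_subsets must be between 1 and len(negatives)")
    return [train_base(positives, subset)
            for subset in split_negatives(negatives, n_subsets)]


def predict_ensemble(models, x, threshold=0.5):
    """Unweighted average of member scores, thresholded to a 0/1 label."""
    mean_score = sum(score_base(m, x) for m in models) / len(models)
    return mean_score, int(mean_score >= threshold)
```

Choosing `n_subsets` close to the ratio of negatives to positives makes each member's training set roughly balanced, which is the point of splitting the negatives. The second scheme in item (4) would use the same averaging step, with four different classifier types trained on the same data in place of the per-subset models.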
Keywords/Search Tags: essential genes, support vector machine, feature selection, computational prediction, ensemble learning