Prediction Of Disease-Resistant Gene In Rice Based On SVM-RFE

Posted on: 2011-09-19 | Degree: Master | Type: Thesis
Country: China | Candidate: J B Zhou | Full Text: PDF
GTID: 2143360305454748 | Subject: Bioinformatics
Abstract/Summary:
Rice, wheat, corn, potatoes and other food crops, which are humanity's main source of food, are threatened by a variety of pests, resulting in yield reductions of more than 10% and serious material and economic losses. Food shortages caused by pests also have a great impact on social stability.

China is the world's largest rice-producing country and suffers serious rice yield losses due to rice diseases. Owing to climate, cropping systems, cultivation patterns and other factors, frequently occurring diseases have broken out severely in recent years, and the damage from a number of minor and resurgent diseases has risen sharply. New diseases and emerging pathogen variants have also appeared, occurring over an area of 300,000 thousand hectares. Traditional disease-resistance breeding struggles to solve the disease problem quickly and effectively. Gene technology, with its speed and stability, has become an efficient way to improve crop traits and obtain superior varieties. Machine learning methods that analyze gene expression data are an important way to locate the relevant genes for gene technology.

Because microarray experiments are very expensive, only a few samples are available. The number of genes measured, however, is very large, often several thousand or even tens of thousands. Many disease-resistant genes lie within these data. Gene expression data thus pose a typical high-dimensional, small-sample problem. Support Vector Machine (SVM) is a machine learning method that has been widely used in recent years and performs well in the analysis of high-dimensional gene expression data. Unlike other traditional machine learning methods, SVM is based on the structural risk minimization (SRM) principle. SRM holds the empirical risk fixed and minimizes the confidence interval.
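
For reference, the SRM principle is usually stated through Vapnik's generalization bound. The symbols below are introduced here for illustration and are not taken from the thesis: R is the expected risk, R_emp the empirical (training) risk, h the VC dimension of the function class, n the number of training samples, and the bound holds with probability at least 1 - \eta.

```latex
R(f) \;\le\; R_{\mathrm{emp}}(f)
  + \sqrt{\frac{h\left(\ln\frac{2n}{h} + 1\right) - \ln\frac{\eta}{4}}{n}}
```

The square-root term is the confidence interval. It grows as n shrinks relative to h, which is why controlling it matters when only a few samples are available.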
Thus SVM has very good classification and generalization ability with limited samples, especially in small-sample cases. In addition, SVM kernel functions handle the non-linear case by mapping the problem into a high-dimensional space and applying linear classification there.

Another problem is that such data often contain noise introduced during microarray generation or sample preparation, or by impurities in the sample. If the training sample size is too small relative to the number of features, classifier performance may suffer. For high-dimensional data it is therefore necessary to reduce the dimension of the feature space. We use recursive feature elimination (RFE) for feature selection, which avoids the loss of original biological information that feature extraction methods cause.

The data in this paper come from the public GEO database. We chose the data on rice tungro disease infection, with disease-resistant and disease-susceptible categories. There are 21 samples in all, labeled +1 if disease-resistant and -1 if susceptible. Because the experimental data form a high-dimensional matrix, we use Matlab for computation and LIBSVM to search for the optimal SVM parameters.

First, we preprocessed the data: we removed invalid and noisy features and normalized the data to the interval [-1, 1]. In practice, we found that normalization strongly affects SVM training and prediction accuracy, and that classifier performance improved greatly after normalization.

Second, 11 samples were randomly selected as the training set and the remaining 10 as the test set. Starting from all features, SVM-RFE recursively removes the feature that is least significant for classification. The ranking score is given by the components of the SVM weight vector. Once all features have been eliminated, we obtain a ranked feature list.
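
The elimination loop described above can be sketched in a few lines. This is a minimal illustration, not the thesis's Matlab/LIBSVM implementation: it assumes a linear SVM trained by plain full-batch subgradient descent, uses the squared weight as the ranking score, and removes one feature per round. The function names, hyperparameters (`lam`, `lr`, `epochs`) and toy data are all hypothetical.

```python
def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Fit (w, b) of a linear SVM by full-batch subgradient descent
    on the L2-regularized hinge loss. Assumed stand-in for LIBSVM."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w = [lam * wj for wj in w]  # gradient of the regularizer
        grad_b = 0.0
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1.0:  # sample violates the margin
                for j in range(d):
                    grad_w[j] -= yi * xi[j] / n
                grad_b -= yi / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def svm_rfe(X, y):
    """Return feature indices ranked from most to least important."""
    remaining = list(range(len(X[0])))
    eliminated = []  # filled from least important to most important
    while remaining:
        X_sub = [[row[j] for j in remaining] for row in X]
        w, _ = train_linear_svm(X_sub, y)
        scores = [wj * wj for wj in w]  # ranking criterion: squared weight
        worst = min(range(len(remaining)), key=lambda k: scores[k])
        eliminated.append(remaining.pop(worst))
    return eliminated[::-1]


# Toy data (hypothetical): column 2 carries the class signal,
# the other columns are small noise values.
X_toy = [
    [0.1, -0.1, 1.0, 0.05],
    [-0.1, 0.1, 0.9, -0.05],
    [0.1, 0.1, 1.1, 0.0],
    [-0.1, -0.1, 1.0, 0.1],
    [0.1, -0.1, -1.0, 0.05],
    [-0.1, 0.1, -0.9, -0.1],
    [0.1, 0.1, -1.1, 0.0],
    [-0.1, -0.1, -1.0, 0.1],
]
y_toy = [1, 1, 1, 1, -1, -1, -1, -1]

ranking = svm_rfe(X_toy, y_toy)
print(ranking)  # the informative column (index 2) comes first
```

With tens of thousands of genes, removing one feature per round is slow; removing a fraction of the remaining features each round is a common shortcut, though the abstract does not say which schedule was used.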
Finally, from the ranked feature list we select nested feature subsets Q1 ⊂ Q2 ⊂ … ⊂ Qn, and train and test an SVM on each subset Qi. The smallest subset contains only the most important feature. The subset that achieves the highest prediction accuracy with the smallest size is the optimal feature subset. Every optimal feature subset has fewer than 30 features. Accuracy is highest when the subset size is about 10 and declines sharply when it exceeds 450. Moreover, the optimal feature subsets do not all contain the same features, and the order of the features changes as well.

To overcome the effects of small sample size and random selection, we improve SVM-RFE. On the one hand, we improve the evaluation function by introducing mutual information, so that it takes the relationships between genes into account. On the other hand, we keep the top 30 features of each optimal feature subset and score each one by its frequency and order in the optimal subsets. We then generate a new ranked list from the 50 highest-scoring features. With the improved method, the best feature subset has 8 features. These 8 features contribute strongly to prediction, and the average classification accuracy reaches 98.74%.

To evaluate whether the optimal feature subset contains disease-resistant genes, we searched for these genes in GO (Gene Ontology) and other databases. We located four genes related to rice disease resistance, at positions 1, 2, 8 and 10. The literature clearly shows that the first two are closely related to rice disease resistance. The other two are not significantly associated with the disease, but they play a role in plant stress response. For the other four genes, no study has reported their function in organisms.
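
The re-ranking by frequency and order across repeated runs might look like the sketch below. The abstract does not give the scoring formula, so this assumes a simple positional (Borda-style) score: a feature earns `top_k - position` points each time it appears in a run's top `top_k`, which rewards both frequent and early appearances. The function name, the scoring rule and the gene labels are hypothetical.

```python
from collections import defaultdict


def aggregate_rankings(runs, top_k=30, keep=50):
    """Merge per-run feature rankings into one list.

    runs  : list of rankings, each ordered from most to least important
    top_k : how many leading features of each run are counted
    keep  : how many features the merged list retains
    """
    scores = defaultdict(float)
    for ranking in runs:
        for pos, feat in enumerate(ranking[:top_k]):
            # Assumed score: earlier position and repeated appearance both add points.
            scores[feat] += top_k - pos
    # Highest score first; ties broken by feature name for determinism.
    ordered = sorted(scores, key=lambda f: (-scores[f], f))
    return ordered[:keep]


# Three hypothetical runs on different random train/test splits.
runs = [
    ["g1", "g2", "g3"],
    ["g2", "g1", "g4"],
    ["g1", "g3", "g2"],
]
merged = aggregate_rankings(runs, top_k=3, keep=3)
print(merged)  # ['g1', 'g2', 'g3']
```

Here g1 scores 8, g2 scores 6, g3 scores 3 and g4 scores 1, so g4 falls outside the three retained features.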
These results show that the SVM-RFE method is of some use in predicting rice disease-resistance genes.

At present, only a few studies apply feature selection to crop gene expression data. We expect that the results obtained here with SVM-RFE can guide the corresponding experiments, and ideally provide useful reference information for shortening the experimental period and reducing cost.
Keywords/Search Tags: disease-resistant gene, support vector machine, feature selection, recursive feature elimination