| RNA interference (ribonucleic acid interference) is a biological technique that silences the expression of a specific gene by introducing double-stranded RNA into cells. Designing siRNAs with high inhibition is an important precondition for RNA interference. Because designing efficient siRNAs purely by biological experiment is costly, slow, and inefficient, using bioinformatics to assist the optimized design of highly inhibitory siRNAs has become a reliable approach. Bioinformatics-based siRNA design applies machine learning to laboratory data sets to build models that predict siRNA inhibition. The user inputs a target mRNA sequence, the program outputs all candidate siRNAs with high predicted inhibition, and researchers then need only a few biological experiments for validation. Several siRNA prediction programs exist at present, but most rely only on the sequence characteristics of the siRNA, which leads to low prediction accuracy; others use a comprehensive feature set but skip feature selection, an important data preprocessing step. As a result, building the prediction model is very time-consuming and its prediction accuracy is not high enough. In practical machine learning tasks, feature selection should be carried out after data acquisition; it also improves efficiency in the later training stage. Filter feature selection first selects features from the data set and then trains the learner, so the selection process is independent of the subsequent learner. A filter algorithm evaluates the features by assigning each one a weight score, and it does so without building a model. Once every feature has a weight score, features whose weight is below a threshold are removed and features whose weight is above the threshold are retained; the retained features are then used for feature analysis, classification, and model construction. In this paper, according to the actual distribution of the experimental data set, we design a concrete implementation of the Relief feature selection algorithm and apply it to the current 107 siRNA features. The experiment selected 88 relevant features and removed 19 irrelevant ones. Using these 88 relevant features to train a Random Forest prediction model, the correlation coefficient increased from 0.629 to 0.640 under 10-fold cross-validation, and the random forest model was built more efficiently, which also lowers the time cost of the siRNA prediction software. We also found a clear positive correlation between siRNA inhibition and the energy difference between the 5' ends of the double-stranded siRNA: the higher the energy difference, the higher the inhibition, and the lower the energy difference, the lower the inhibition. We then performed a statistical analysis of the experimental data set of Dieter Huesken. The results are as follows: (1) the first position of the siRNA guide strand, counted from the 5' end, should be A or U, not G or C; (2) the second position should be A or U, not G or C; (3) the seventh position should not be C; (4) the fourteenth position should not be G; ... |
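
The filter step described above (score every feature, then keep those above a threshold) can be illustrated with a minimal sketch of the classic two-class Relief algorithm. This is not the concrete program designed in the paper, whose details are not given here. It assumes that inhibition values have been binarized into a high class and a low class, that features are numeric, and that each class contains at least two samples. The names `relief_weights` and `select_features` and the default threshold of 0 are hypothetical choices made for illustration.

```python
import numpy as np


def relief_weights(X, y, n_samples=None, seed=0):
    """Classic two-class Relief: return one weight per feature.

    X: (n, d) numeric array; y: (n,) array of class labels (two classes,
    each with at least two samples, which is an assumption of this sketch).
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    n, d = X.shape

    # Scale every feature to [0, 1] so per-feature differences are comparable.
    lo = X.min(axis=0)
    span = X.max(axis=0) - lo
    span[span == 0] = 1.0  # constant features: avoid division by zero
    Xs = (X - lo) / span

    rng = np.random.default_rng(seed)
    m = n if n_samples is None else n_samples
    picks = rng.choice(n, size=m, replace=m > n)

    w = np.zeros(d)
    for i in picks:
        dist = np.abs(Xs - Xs[i]).sum(axis=1)  # Manhattan distance to every row
        dist[i] = np.inf  # never use the sample itself as a neighbour
        same = y == y[i]
        hit = np.argmin(np.where(same, dist, np.inf))    # nearest same-class row
        miss = np.argmin(np.where(~same, dist, np.inf))  # nearest other-class row
        # Reward features that differ across classes,
        # penalise features that differ within a class.
        w += np.abs(Xs[i] - Xs[miss]) - np.abs(Xs[i] - Xs[hit])
    return w / m


def select_features(weights, threshold=0.0):
    """Return indices of features whose weight is above the threshold."""
    return np.flatnonzero(np.asarray(weights) > threshold)


# Toy data (made up for illustration): column 0 tracks the class, column 1 does not.
X_toy = np.array([[0, 0.1], [0, 0.9], [0, 0.5], [1, 0.2], [1, 0.8], [1, 0.4]])
y_toy = np.array([0, 0, 0, 1, 1, 1])
w_toy = relief_weights(X_toy, y_toy)
print(w_toy, select_features(w_toy))
```

Because the weights are computed from nearest neighbours alone, no learner is trained during selection, which is what makes this a filter method. The retained column indices would then be used to subset the feature matrix before training a model such as a random forest.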