Selection Of TB Susceptibility Genes Based On Improved Random Forest Algorithm

Posted on: 2021-04-13  Degree: Master  Type: Thesis
Country: China  Candidate: Y Yang  Full Text: PDF
GTID: 2404330611462873  Subject: Applied statistics
Abstract/Summary:
This thesis aims to improve the random forest algorithm so that a smaller set of genes can be identified to support analysis of the disease. The number of human genes is huge, and it is not easy to quickly find the differential genes for a given disease. Because random forest can calculate the importance of each gene for classification, this thesis uses random forest to screen genes.

Random forest is random in both sample selection and feature selection when building decision trees. The feature importances it calculates are therefore affected by noise, and even important feature genes may be overwhelmed by noise. To reduce the negative impact of noise on the screening results, the algorithm is improved based on the idea of backward elimination in random forest, combined with multivariate statistics, and the parameters are not standardized.

The improved algorithm combines K-fold cross-validation with random forest construction: each time a new training sample is generated, a random forest model is built. An error increment is introduced as a threshold, used mainly to decide whether to stop building models during cross-validation. If the error increment exceeds the threshold parameter, the procedure stops, and the model with the highest accuracy is selected to calculate gene importance. The genes are sorted by importance in descending order, and a fixed proportion of the genes at the bottom of the ranking is eliminated. These steps are repeated on the remaining gene data, building new random forest models and screening again, until the required number of genes remains.

To assess the advantages of the improved random forest algorithm, the standard random forest algorithm and a traditional screening algorithm are also used to screen and analyze the gene data, and a support vector machine is then used to classify samples based on the screened genes. After empirical research on TB gene data, the main conclusions of this thesis are as follows:

1. Feature selection: A traditional feature selection algorithm, random forest, and the improved random forest algorithm are each used to screen 8068 gene features. Comparing the genes screened by the three algorithms shows that the selected genes and their expression differ considerably, indicating large differences between the three algorithms.
2. Discriminant classification: Support vector machines are used to classify the test samples using the feature genes screened by the traditional screening algorithm, the random forest algorithm, and the improved random forest algorithm. The classification accuracy of the top 13 genes screened by the improved random forest algorithm is 90%, which is significantly higher than that of the random forest algorithm and the traditional screening algorithm.
3. Further improving the random forest algorithm makes up for some of the defects of the previous improved algorithm. Together with the comparative analysis of the genes screened by the three algorithms and the discrimination results of the support vector machine, this shows that the improved algorithm has a great advantage in gene selection.
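The screening loop described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the thesis's implementation: it uses scikit-learn and synthetic data in place of the TB gene data, and the names `best_fold_importances`, `backward_eliminate`, `max_error_increment`, and `drop_fraction` are hypothetical. The abstract does not define the error increment precisely; here it is assumed to be the rise in validation error from one fold to the next.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, train_test_split
from sklearn.svm import SVC


def best_fold_importances(X, y, n_splits=5, max_error_increment=0.05,
                          n_estimators=100, seed=0):
    """Fit one forest per CV fold and return the importances of the most
    accurate one. Stops early when the validation error rises by more than
    max_error_increment (an assumed reading of the abstract's threshold)."""
    folds = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    best_acc, best_importances, prev_error = -1.0, None, None
    for train_idx, val_idx in folds.split(X, y):
        forest = RandomForestClassifier(n_estimators=n_estimators,
                                        random_state=seed)
        forest.fit(X[train_idx], y[train_idx])
        acc = forest.score(X[val_idx], y[val_idx])
        if acc > best_acc:
            best_acc, best_importances = acc, forest.feature_importances_
        error = 1.0 - acc
        if prev_error is not None and error - prev_error > max_error_increment:
            break  # error increment exceeded the threshold: stop building models
        prev_error = error
    return best_importances


def backward_eliminate(X, y, n_keep, drop_fraction=0.2, **kwargs):
    """Repeatedly rank the remaining features and drop the least important
    fraction until n_keep features remain. Returns original column indices,
    ordered by importance in the final round."""
    kept = np.arange(X.shape[1])
    while kept.size > n_keep:
        importances = best_fold_importances(X[:, kept], y, **kwargs)
        order = np.argsort(importances)[::-1]           # most important first
        n_drop = max(1, int(kept.size * drop_fraction))
        n_next = max(n_keep, kept.size - n_drop)         # never go below n_keep
        kept = kept[order[:n_next]]
    return kept


if __name__ == "__main__":
    # Synthetic stand-in for a gene expression matrix (samples x genes).
    X, y = make_classification(n_samples=120, n_features=60, n_informative=5,
                               random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                              stratify=y, random_state=0)
    selected = backward_eliminate(X_tr, y_tr, n_keep=13, n_estimators=50)
    # Classify held-out samples with an SVM restricted to the selected features.
    svm = SVC(kernel="linear").fit(X_tr[:, selected], y_tr)
    print("selected features:", selected)
    print("SVM test accuracy:", svm.score(X_te[:, selected], y_te))
```

Selection is run on the training split only, so the held-out SVM accuracy is not inflated by features chosen with knowledge of the test samples. The kernel, the 20% drop fraction, and the 0.05 threshold are placeholder choices; the abstract does not state the values used.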
Keywords/Search Tags: Random forest, Feature screening, Support Vector Machines, K-fold cross validation