Selection Of TB Susceptible Genes Based On Improved Random Forest Algorithm

Posted on:2021-04-13

Degree:Master

Type:Thesis

Country:China

Candidate:Y Yang

Full Text:PDF

GTID:2404330611462873

Subject:Applied statistics

Abstract/Summary:

This thesis improves the random forest algorithm in order to find a smaller set of genes that helps in analyzing the disease. The number of human genes is huge, and it is not easy to quickly find the differential genes for a given disease. Because random forest can calculate the importance of each gene for classification, this thesis uses random forest to screen genes.

Random forest is random in both sample selection and feature selection when it builds decision trees. The feature importance it calculates is therefore affected by noise, and even important feature genes may be overwhelmed by noise. To reduce the negative impact of noise on the screening results, the algorithm is improved based on the idea of backward elimination in random forest, combined with multivariate statistics, and the parameters are not standardized.

K-fold cross-validation is combined with the construction of random forests: each time a new training sample is generated, a random forest model is built. An error increment is introduced as a threshold parameter, used mainly to decide whether to stop building models during cross-validation. If the error increment exceeds the parameter, the procedure stops, and the model with the highest accuracy is selected to calculate gene importance. The genes are sorted by importance in descending order, and a certain proportion of genes at the end of the ranking is eliminated. These steps are repeated on the remaining gene data, building a new random forest model each time, until the required number of genes remains.

To assess the advantages of the improved random forest algorithm, the standard random forest algorithm and a traditional screening algorithm are also used to screen and analyze the gene data, and a support vector machine is then used for discrimination based on the screened genes. After empirical research on TB gene data, the main conclusions of this thesis are as follows:

1. Feature selection: a traditional feature selection algorithm, random forest, and the improved random forest algorithm are each used to screen 8068 gene features. Comparing the genes screened by the three algorithms shows that the selected genes and their gene expression differ considerably, indicating large differences between the three algorithms.

2. Discriminant classification: support vector machines are used to classify the test samples using the feature genes screened by the traditional screening algorithm, the random forest algorithm, and the improved random forest algorithm, and the classification accuracy is calculated. The results show that the classification accuracy of the first 13 genes screened by the improved random forest algorithm is 90%, which is significantly higher than that of the random forest algorithm and the traditional screening algorithm.

3. Further improving the random forest algorithm makes up for some of the defects of the previous improved algorithm. Together with the comparative analysis of the genes screened by the three algorithms and the discrimination results of the support vector machine, this shows that the improved algorithm has a great advantage in the process of gene selection.
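The screening procedure described in the abstract (one model per cross-validation fold, an early stop on the error increment, importance taken from the most accurate model, then elimination of the lowest-ranked genes) can be sketched roughly as follows. This is a minimal illustration and not the thesis's implementation. The random forest itself is abstracted behind a hypothetical callable `fit`, and the names `k_fold_indices`, `best_fold_importance`, `backward_eliminate`, `drop_fraction` and `max_error_increase` are assumptions introduced here. In particular, "error increment" is read as the rise in validation error relative to the previous fold, which the abstract does not specify.

```python
import random
from typing import Callable, Dict, List, Optional, Sequence, Tuple

Fold = Tuple[List[int], List[int]]
# Hypothetical interface: fit a random forest on the `train` rows using only
# `features`, then return (validation error, importance per feature).
FitFn = Callable[[List[int], List[int], List[int]], Tuple[float, Dict[int, float]]]


def k_fold_indices(n_samples: int, k: int, seed: int = 0) -> List[Fold]:
    """Split shuffled row indices into k (train, validation) pairs."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    return [(sorted(set(idx) - set(fold)), sorted(fold)) for fold in folds]


def best_fold_importance(
    fit: FitFn, features: List[int], folds: List[Fold], max_error_increase: float
) -> Tuple[Optional[float], Optional[Dict[int, float]]]:
    """Build one model per fold; stop early if the error rises too much."""
    best_err, best_imp, prev_err = None, None, None
    for train, valid in folds:
        err, imp = fit(train, valid, features)
        if best_err is None or err < best_err:
            best_err, best_imp = err, imp
        # Assumed reading of "error increment": rise over the previous fold.
        if prev_err is not None and err - prev_err > max_error_increase:
            break
        prev_err = err
    return best_err, best_imp


def backward_eliminate(
    fit: FitFn,
    features: Sequence[int],
    folds: List[Fold],
    n_keep: int,
    drop_fraction: float = 0.2,
    max_error_increase: float = 0.05,
) -> List[int]:
    """Repeatedly drop the least important features until n_keep remain."""
    current = list(features)
    while len(current) > n_keep:
        _, imp = best_fold_importance(fit, current, folds, max_error_increase)
        # Rank from most to least important, then cut the tail.
        ranked = sorted(current, key=lambda f: imp[f], reverse=True)
        n_drop = max(1, int(len(ranked) * drop_fraction))
        n_drop = min(n_drop, len(ranked) - n_keep)  # never overshoot n_keep
        current = ranked[: len(ranked) - n_drop]
    return current
```

Keeping the model behind `fit` leaves the elimination loop independent of any particular library. One plausible choice is a wrapper around scikit-learn's `RandomForestClassifier` that returns one minus the validation accuracy together with the fitted model's `feature_importances_`. The retained genes would then be passed to a separate classifier, such as a support vector machine, for the discrimination step.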

Keywords/Search Tags:

Random forest, Feature screening, Support Vector Machines, K-fold cross validation

Related items

1	The Application Of Random Forest And Support Vector Machine In High Dimensional Transcriptome Data Of Breast Cancer
2	Research On Risk Prediction Of Diabetes Based On Random Forest And Support Vector
3	Analysis Of Cancer Gene Data Base On Random Forest And Support Vector Machine
4	Classification Of Common Carotid Arterial Plaques Based On Ultrasonic Characteristic Parameters Of The Gamma Hybrid Model
5	Study On The Prediction Of Intracranial Hypertension Based On Waveform Feature Extraction And Support Vector Machines Classification
6	Research Of MRI Image Segmentation Based On Support Vector Machines
7	3D Reconstruction Of Head MRI Based On One-Class Support Vector Machine With Immune Algorithm
8	Research On ECG Signal Processing Method Based On Machine Learning
9	Preliminary Study Of Enhanced CT Radiomics Models For The Differential Diagnosis Of Mucinous Ovarian Cancer
10	Research And Application Of Support Vector Machine Technology In Computer-aided Medical Diagnosing System For Breast Cancer