
Research On Classification Of Gene Expression Data Based On Statistical Learning Algorithm

Posted on: 2022-04-14
Degree: Master
Type: Thesis
Country: China
Candidate: L Y Gao
GTID: 2517306317480764
Subject: Statistics
Abstract/Summary:
Gene expression data are now central to diagnosing the type and severity of disease, which makes statistical methods for analysing them increasingly important. However, disease gene expression data are high-dimensional, have small sample sizes, and have imbalanced classes, and traditional machine learning methods struggle to mine the underlying correlation and causality because of overfitting, the effect of sample size, and related problems. In this thesis I apply newer statistical learning methods, including LightGBM, SMOTEENN, RFECV, and autoencoders, to classify gene expression data and to identify genes that may be associated with disease.

Starting from the differential gene screening methods commonly used in traditional bioinformatics, I use combined sampling to handle the class imbalance of gene expression data and optimize the classifier's hyperparameters. Dimensionality reduction of the high-dimensional features, through recursive feature elimination and an autoencoder, further improves the classifier's predictive performance and generalization ability. To verify the algorithms, gene expression data for breast cancer, lung adenocarcinoma, and sarcoidosis are used in the empirical analysis. The main conclusions are:

(1) The fold-change method is adopted for screening differential genes in order to reduce the running cost.

(2) To address the class imbalance of gene expression data, nine algorithms built on differential gene screening are compared at the data level and at the algorithm level. The case results show AUC values of 0.9804, 0.9495, and 1.0 on the GSE87080, GSE63459, and GSE42826 gene expression datasets, respectively, and the LightGBM classifier based on SMOTEENN gives the better classification performance.

(3) To optimize the classifier's performance, three hyperparameter optimization methods are compared: grid search, random search, and Bayesian optimization. The mean and standard deviation of AUC are obtained over 5000 iterations. LightGBM classification with Bayesian hyperparameter optimization is found to be more accurate and more stable.

(4) Three dimensionality reduction classification algorithms are compared: RFECV-LightGBM, AE-LightGBM, and RFECV-AE-LightGBM. The results show that RFECV-AE-LightGBM has the best classification performance. Over 1000 repeated samples, the mean AUC values of RFECV-AE-LightGBM are 0.9594, 0.9846, and 0.9203, with standard deviations of 0.0416, 0.0289, and 0.0614, respectively, which verifies the stability of the RFECV-AE-LightGBM classifier.

In summary, the comparison of statistical learning algorithms in this thesis provides a reference for accurate decision making in medical diagnosis.
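To make two of the building blocks mentioned in the abstract concrete, the following is a minimal sketch of fold-change screening of differential genes and of a rank-based AUC, using only the Python standard library. It is not the thesis's implementation: the function names, the toy gene names and values, the pseudo-count of 1.0, and the |log2 fold change| >= 1 threshold are all assumptions chosen for illustration.

```python
import math
import statistics


def log2_fold_change(case_values, control_values, pseudo_count=1.0):
    """log2 ratio of mean expression in cases versus controls.

    The pseudo-count (an assumption here) avoids division by zero
    for genes with zero mean expression.
    """
    case_mean = statistics.mean(case_values) + pseudo_count
    control_mean = statistics.mean(control_values) + pseudo_count
    return math.log2(case_mean / control_mean)


def screen_genes(expression, labels, threshold=1.0):
    """Keep genes whose |log2 fold change| is at least `threshold`.

    expression: dict mapping gene name -> list of values, one per sample
    labels:     list of 0 (control) / 1 (case), in the same sample order
    Returns a dict mapping each retained gene to its log2 fold change.
    """
    selected = {}
    for gene, values in expression.items():
        cases = [v for v, y in zip(values, labels) if y == 1]
        controls = [v for v, y in zip(values, labels) if y == 0]
        lfc = log2_fold_change(cases, controls)
        if abs(lfc) >= threshold:
            selected[gene] = lfc
    return selected


def auc(labels, scores):
    """Rank-based AUC: the fraction of (case, control) pairs in which
    the case has the higher score, with ties counted as one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: three cases followed by three controls.
labels = [1, 1, 1, 0, 0, 0]
expression = {
    "GENE_A": [15.0, 17.0, 13.0, 3.0, 2.0, 4.0],  # higher in cases
    "GENE_B": [5.0, 6.0, 4.0, 5.0, 4.0, 6.0],     # no difference
    "GENE_C": [1.0, 0.0, 2.0, 7.0, 8.0, 6.0],     # lower in cases
}

selected = screen_genes(expression, labels)
print(sorted(selected))  # genes passing the fold-change screen

# AUC of a single gene used directly as a score.
print(auc(labels, expression["GENE_A"]))
```

In a repeated-evaluation setting such as the one the abstract describes, the AUC from each repetition would be collected in a list and summarized with `statistics.mean` and `statistics.stdev`; the classifier producing the scores (LightGBM in the thesis) is outside the scope of this sketch.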
Keywords/Search Tags: statistical learning, gene expression data, classification, Bayesian hyperparameter optimization, autoencoder