
LASSO-based Methods With The False Discovery Rate Control And The Application In Survival Analysis Of High-dimensional Data

Posted on: 2018-08-29
Degree: Master
Type: Thesis
Country: China
Candidate: S H Xu
GTID: 2334330536474426
Subject: Epidemiology and Health Statistics
Abstract/Summary:
Objective: Traditional methods for selecting the LASSO tuning parameter have a very high false discovery rate (FDR). This thesis introduces the fundamental theory of three methods that select a suitable tuning parameter while controlling the FDR, and compares the variable selection performance of cross-validation (CV), penalized cross-validated log-likelihood (pcvl), the Extended Bayesian Information Criterion (EBIC) and Stability Selection under the LASSO-Cox model.

Methods: The tuning parameter selection methods for the LASSO-Cox model are introduced systematically. A simulation study examines how the censoring proportion of the survival data, the linear correlation between covariates and the sparsity scenario each affect the performance of every method. The simulations used six sample sizes, n = (100, 120, 140, 160, 180, 200), with 1000 covariates and covariance structure corr(x_i, x_j) = ρ^|i-j|, i ≠ j. Simulation one: |ρ| = (0, 0.3, 0.5, 0.8), L = (2, 3, 4, 5); the nonzero regression coefficients were β1* = 3, β51* = -1.5, β101* = 2, β151* = -3, β201* = 1.5, β251* = -2, and all other regression coefficients were zero. Simulation two: |ρ| = (0, 0.3, 0.5, 0.8), L = 3; the number of truly nonzero covariates was q = (4, 6, 8, 10), and the true nonzero regression coefficients were 2 or -2. Simulated data were generated and analysed in R. The FDR and the Positive Selection Rate (PSR) were the evaluation indices. For the real data analysis, a data set from the Gene Expression Omnibus (GEO) covering 420 patients with DLBCL and 54675 genes was used to identify prognostic genes. After a filtering step, 412 samples and 4947 genes were retained.

Results: In the simulations, with the sample size, the censoring proportion, the correlation coefficient and the sparsity scenario held fixed, the methods ranked in ascending order of FDR as: Stability Selection ≤ EBICγ1 < EBICγ2 < pcvl < CV. In descending order of PSR they ranked: CV ≥ pcvl ≥ Stability Selection ≥ EBICγ2 ≥ EBICγ1. As the censoring proportion decreased, the FDR of each method remained essentially unchanged and the PSR increased. As the correlation coefficients increased, the FDR of Stability Selection, pcvl and CV remained essentially unchanged, while the FDR of the EBIC increased slightly. As the sparsity level decreased, the FDR of Stability Selection was essentially unchanged and the FDR of pcvl increased slightly, while the performance of the EBIC was unstable. When the sample size was large, the PSR of CV, pcvl and Stability Selection remained unchanged as the sparsity level decreased. In the real data analysis, the EBIC identified only one gene. Stability Selection identified 13 genes, of which 12 were also selected by CV; 10 of those 12 genes were also identified by pcvl. The pcvl method identified 28 genes, of which 26 and 1, respectively, were shared with CV and the EBIC.

Conclusion: Whether the censoring proportion, the correlation coefficients and the sparsity scenarios are fixed or varied, Stability Selection controls the FDR better and more stably than the other methods while retaining relatively high power. The EBIC performs well when the correlation coefficients and the sparsity level are low, but it is conservative when the sample size is small. The pcvl method rarely misses important variables, but its FDR is still relatively high.
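
The covariate design and the two evaluation indices can be illustrated with a short sketch. The thesis reports using R; Python is used here only for illustration, and nothing below is the author's code. It assumes standard normal covariates, the first-order autoregressive reading of corr(xi,xj)=ρ^|i-j|, and the usual definitions of the two indices (FDR as the share of selected covariates that are truly zero, PSR as the share of truly nonzero covariates that are selected), which the abstract does not spell out. The function names `ar1_covariates` and `fdr_psr` are hypothetical.

```python
import math
import random


def ar1_covariates(n, p, rho, seed=0):
    """Draw n rows of p standard-normal covariates with corr(x_i, x_j) = rho**|i-j|.

    Assumption: an AR(1) recursion is used to obtain this correlation
    structure; the thesis abstract does not say how the data were drawn.
    """
    rng = random.Random(seed)
    scale = math.sqrt(1.0 - rho * rho)  # keeps every column at unit variance
    rows = []
    for _ in range(n):
        row = [rng.gauss(0.0, 1.0)]
        for _ in range(p - 1):
            row.append(rho * row[-1] + scale * rng.gauss(0.0, 1.0))
        rows.append(row)
    return rows


def fdr_psr(selected, true_support):
    """Return (FDR, PSR) for one fitted model.

    FDR = falsely selected / selected (0 when nothing is selected);
    PSR = correctly selected / truly nonzero. Both are assumed definitions.
    """
    selected, true_support = set(selected), set(true_support)
    true_pos = len(selected & true_support)
    fdr = (len(selected) - true_pos) / max(len(selected), 1)
    psr = true_pos / len(true_support)
    return fdr, psr


# Nonzero coefficients of simulation one, keyed by 1-based covariate index.
TRUE_BETA = {1: 3.0, 51: -1.5, 101: 2.0, 151: -3.0, 201: 1.5, 251: -2.0}

if __name__ == "__main__":
    X = ar1_covariates(n=100, p=1000, rho=0.5)
    # Hypothetical selection: three true covariates and two false ones.
    print(fdr_psr([1, 51, 101, 300, 400], TRUE_BETA))
```

In a full simulation these indices would be averaged over repeated data sets for each tuning parameter selection method; fitting the LASSO-Cox model itself is outside this sketch.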
Keywords/Search Tags:LASSO, survival analysis, tuning parameter, false discovery rate, stability selection