The Impact Of Data Missingness And Related Factors On Stepwise Variable Selection

Posted on: 2012-07-31
Degree: Master
Type: Thesis
Country: China
Candidate: H M Liao
GTID: 2154330335497952
Subject: Epidemiology and Health Statistics
Abstract/Summary:
Objectives: To study the impact of different missing-data mechanisms and patterns on stepwise variable selection, and to investigate the effects of other factors on stepwise selection: the correlation between candidate variables, the number of candidate variables, the goodness of fit of the true model, the sample size (or events per variable, EPV), and the significance levels for variable entry and removal.

Methods: Monte Carlo simulations were performed in SAS. True models, both linear and probit, were constructed to generate six kinds of data structures by combining four missing-data mechanisms (complete, missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR)) with two missing patterns (linear and convex): complete data, MCAR data, MAR(linear) data, MAR(convex) data, MNAR(linear) data, and MNAR(convex) data. Stepwise variable selection was applied to these data sets, and the results were evaluated using five criteria: 1) the average number of authentic variables selected into the models; 2) the average number of noise variables selected into the models; 3) the G index, which reflects the overall performance of stepwise selection in identifying authentic variables and discarding noise variables, defined as G = sensitivity × specificity, where sensitivity = (number of authentic variables entered) / (number of authentic variables in the candidate pool) and specificity = 1 - (number of noise variables entered) / (number of noise variables in the candidate pool); 4) the percentage of true models obtained; 5) the bias of the estimated regression coefficients. The first four criteria reflect the ability of stepwise selection to distinguish authentic variables from noise variables, while the fifth evaluates the accuracy of the estimated coefficients. The same criteria were used for the probit models.
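The G index defined above can be computed directly from the set of variables a stepwise run selects. The following Python sketch is only an illustration of that definition, not the thesis's SAS code; the function name `g_index` and the variable names in the example pool are hypothetical.

```python
def g_index(selected, authentic, noise):
    """G = sensitivity * specificity for a single selected model.

    selected:  names of the variables the stepwise procedure kept
    authentic: names of the true predictors in the candidate pool
    noise:     names of the pure-noise candidates in the pool
    """
    selected, authentic, noise = set(selected), set(authentic), set(noise)
    # Share of authentic variables that entered the model.
    sensitivity = len(selected & authentic) / len(authentic)
    # One minus the share of noise variables that entered the model.
    specificity = 1 - len(selected & noise) / len(noise)
    return sensitivity * specificity


# Hypothetical candidate pool: 4 authentic and 6 noise variables.
authentic = {"x1", "x2", "x3", "x4"}
noise = {"z1", "z2", "z3", "z4", "z5", "z6"}

# A model that keeps 3 of 4 authentic variables and 1 of 6 noise variables:
# sensitivity = 3/4, specificity = 5/6, so G is approximately 0.625.
g = g_index({"x1", "x2", "x3", "z1"}, authentic, noise)
print(g)
```

G equals 1 only when every authentic variable enters and no noise variable does, so averaging it over replications summarizes both kinds of selection error in one number.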
We simulated 1000 replications under each combination of factors.

Results: Missingness mainly affected the entry of authentic variables in stepwise selection and the estimation of regression coefficients; it had little impact on the entry of noise variables. The proportion of missing data had a larger effect than the missing mechanisms and patterns: the higher the proportion of missingness, the fewer authentic variables entered the models and the larger the bias of the regression coefficients. When the proportion of missing data was low, the different missing mechanisms and patterns showed little effect, and most of the effect of missing data was attributable to the reduced sample size; as the proportion of missing data increased, the differences between missing mechanisms and patterns began to show. For the entry of authentic variables: 1) MCAR showed little advantage over the other missing mechanisms; 2) under the same missing pattern, MAR performed slightly better than MNAR; 3) under the same missing mechanism, the linear missing pattern performed better than the convex missing pattern; 4) MNAR(convex) performed worst. For the accuracy of the regression coefficient estimates, we did not observe a consistent trend across missing mechanisms and patterns.
We also found that, compared with other factors, missingness was not among the most influential factors in stepwise selection: 1) for the entry of authentic variables, the most influential factors were the goodness of fit of the true models, the correlation between candidate variables, the significance levels of stepwise selection, and the sample size; 2) for the entry of noise variables, the most influential factors were the number of candidate variables and the significance levels of stepwise selection; 3) for the accuracy of the regression coefficient estimates and the percentage of true models selected, the goodness of fit of the true models and the correlation between candidate variables were the most influential factors.

Conclusions: 1) When stepwise variable selection is performed on data sets with missing entries, the impact of missingness is reflected in its effect on the entry of authentic variables and on the estimation of regression coefficients; its effect on the entry of noise variables is not evident. Moreover, if the proportion of missing data is small (e.g. lower than 25%), the different missing mechanisms and patterns make little difference.
In this case, most of the effect of missing data can be estimated from the loss in sample size; however, if the proportion of missing data is large, attention should be paid not only to the missing mechanism but also to the missing pattern, since these have very different effects on stepwise selection.

2) When stepwise selection is performed on data sets, with or without missing entries, attention should be paid to factors such as the correlation between candidate variables, the number of candidate variables, the sample size, and the significance levels of stepwise selection, since these relate to inherent shortcomings of the stepwise method reported in the literature: it tends to omit authentic variables, select noise variables, and produce biased estimates of regression coefficients, especially when the correlations between candidate variables are large, the significance levels are set arbitrarily without regard to the objectives of variable selection, and/or the number of candidate variables is large. Under such circumstances, stepwise selection is not recommended.
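The missing-data mechanisms and patterns compared in this abstract can be illustrated with a small simulation. The abstract does not give the exact deletion rules, so the Python sketch below rests on assumptions: "linear" is taken to mean a deletion probability that rises linearly with a driving variable, "convex" is taken to mean a U-shaped probability that is highest at both extremes, the driver is the observed covariate `x` under MAR and the deleted variable `y` itself under MNAR, and all function names are hypothetical.

```python
import random


def missing_prob(driver, pattern, rate):
    """Deletion probability for a driver value scaled to [0, 1].

    Both shapes average to `rate` when the driver is uniform on [0, 1].
    """
    if pattern == "linear":
        # Rises linearly from 0 to 2 * rate.
        return min(1.0, 2 * rate * driver)
    if pattern == "convex":
        # U-shaped: 0 at the middle, 3 * rate at either extreme.
        return min(1.0, 12 * rate * (driver - 0.5) ** 2)
    raise ValueError(f"unknown pattern: {pattern}")


def rank_scale(values):
    """Map values to (0, 1) by rank so the driver is evenly spread."""
    order = sorted(range(len(values)), key=values.__getitem__)
    scaled = [0.0] * len(values)
    for rank, idx in enumerate(order):
        scaled[idx] = (rank + 0.5) / len(values)
    return scaled


def delete_values(x, y, mechanism, pattern="linear", rate=0.25, seed=0):
    """Return a copy of y with None where a value was deleted."""
    rng = random.Random(seed)
    if mechanism == "MCAR":
        probs = [rate] * len(y)          # unrelated to any variable
    elif mechanism == "MAR":
        probs = [missing_prob(d, pattern, rate) for d in rank_scale(x)]
    elif mechanism == "MNAR":
        probs = [missing_prob(d, pattern, rate) for d in rank_scale(y)]
    else:
        raise ValueError(f"unknown mechanism: {mechanism}")
    return [None if rng.random() < p else v for v, p in zip(y, probs)]


# Illustrative data: y depends on x plus noise.
data_rng = random.Random(1)
x = [data_rng.gauss(0, 1) for _ in range(2000)]
y = [xi + data_rng.gauss(0, 1) for xi in x]

# Realized share of deleted values under each data structure.
fractions = {}
for mech, pat in [("MCAR", "linear"), ("MAR", "linear"), ("MAR", "convex"),
                  ("MNAR", "linear"), ("MNAR", "convex")]:
    y_obs = delete_values(x, y, mech, pat, rate=0.25)
    fractions[(mech, pat)] = sum(v is None for v in y_obs) / len(y_obs)
print(fractions)
```

Holding the overall missing rate fixed across the five structures, as this sketch does, separates the effect of the mechanism and pattern from the effect of the reduced sample size.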
Keywords/Search Tags: missing data, stepwise, variable selection