BACKGROUND
At present there is no practical theoretical method for estimating the sample size for logistic regression, and an empirical method is widely accepted instead. This empirical method, known as the events per variable (EPV) method, requires that the number of events of the outcome variable (the smaller of the number of events and the number of non-events) be no less than the number of independent variables in the model multiplied by the EPV. The EPV method has been studied extensively, for example by Harrell (1984), Concato (1995), Peduzzi (1995), and Vittinghoff et al. (2006). Their simulation studies led to the rule of thumb that at least 5, 10, or sometimes even 20 EPV are required to obtain robust regression results from the Wald method based on the maximum likelihood estimate. Xiaoyan Yang (2005) recommended an EPV of no less than 10.

However, little research has addressed the number of events needed in the independent variables (EIV), namely the smaller of the number of occurrences and the number of non-occurrences of a binary independent variable included in the model. Yet this is precisely the problem commonly encountered in real data. If the EIV is too small, the results of logistic regression are inaccurate and unstable. Relying on the EPV method alone is therefore not enough to determine the sample size; it must be combined with the EIV.
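As a minimal illustration of the two quantities defined above, the following Python sketch computes EPV and EIV from 0/1 coded data. The function names and the toy data are hypothetical and are not taken from the study.

```python
def events_per_variable(outcome, n_predictors):
    """EPV: the smaller of events / non-events of the outcome, per predictor."""
    events = sum(outcome)
    limiting = min(events, len(outcome) - events)
    return limiting / n_predictors


def events_in_variable(x):
    """EIV: the smaller of occurrences / non-occurrences of a binary predictor."""
    ones = sum(x)
    return min(ones, len(x) - ones)


# Toy data (hypothetical): 100 subjects, 30 outcome events,
# and a binary predictor that is 1 for only 8 subjects.
y = [1] * 30 + [0] * 70
x = [1] * 8 + [0] * 92

epv = events_per_variable(y, n_predictors=3)
eiv = events_in_variable(x)
print(epv, eiv)
```

In this toy case the EPV of 10 would satisfy the usual rule of thumb, while the EIV of 8 is small, which is the situation the EPV method alone does not detect.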
To this end, this study uses simulation to explore the impact of EIV on the logistic regression model and to derive a strategy for determining the cut-off value of EIV, thereby providing a more complete empirical method for sample size determination.

OBJECTIVE
Using the Monte Carlo simulation technique, the present study explored the stability of regression results from the perspective of EIV and established a method to determine the cut-off value of EIV.

METHODS
Maximum likelihood estimation (MLE) is the most commonly used method of parameter estimation. Alternatives include penalized likelihood estimation (PLE), exact logistic regression, and rare-event logistic regression. PLE was proposed to solve the problem that the log likelihood converges while at least one parameter estimate is positive or negative infinity, which mainly happens when the numbers of occurrences and non-occurrences of an independent variable are unbalanced or when high-risk factors are present. It has been suggested that PLE gives reliable parameter estimates by correcting the bias of MLE, and that it performs better than exact regression and MLE. However, these methods have mainly been described in the statistical literature and have rarely been applied to empirical data. The rationale of rare-event logistic regression is to correct the estimated probability of the outcome event in order to make the results robust, but in the simulation results of Xiaoyan Yang this method improved the model only slightly. For confidence interval estimation and hypothesis testing, the Wald method is well known. The profile likelihood method, however, is more robust and controls the type I error more strictly than the Wald method and the bootstrap, and it is also more powerful than the Wald method.
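The infinite-estimate problem that motivates PLE can be seen in the simplest case of one binary independent variable, where the MLE of the coefficient is the log odds ratio of a 2x2 table. The sketch below assumes that PLE refers to a Firth-type (Jeffreys prior) penalty, which the text does not state explicitly; under that assumption, and only for this single-binary-predictor model, the penalized estimate is obtained by adding 0.5 to each cell. The function name and the cell counts are hypothetical.

```python
import math


def log_odds_ratio(a, b, c, d, penalized=False):
    """Slope estimate for a logistic model with one binary predictor.

    a, b: events / non-events among subjects with x = 1
    c, d: events / non-events among subjects with x = 0
    penalized=True adds 0.5 to every cell (assumed Firth-type penalty).
    """
    if penalized:
        a, b, c, d = (v + 0.5 for v in (a, b, c, d))
    num, den = a * d, b * c
    if num == 0 and den == 0:
        return math.nan          # estimate undefined
    if den == 0:
        return math.inf          # MLE diverges to +infinity
    if num == 0:
        return -math.inf         # MLE diverges to -infinity
    return math.log(num / den)


# Hypothetical table: only 3 subjects have x = 1 and all of them had the event.
mle = log_odds_ratio(3, 0, 20, 77)
ple = log_odds_ratio(3, 0, 20, 77, penalized=True)
print(mle, ple)
```

With an empty cell the unpenalized estimate is infinite, while the penalized estimate stays finite. Models with several independent variables need an iterative fit and are not covered by this shortcut.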
Thus, in the present study, MLE and PLE were selected to estimate the regression coefficients, and the Wald method and the profile likelihood method were chosen for confidence interval estimation and hypothesis testing.

The Monte Carlo technique was used, and all simulations and calculations were carried out in R 3.1.2. First, the binary outcome variable and the independent variables of the logistic model were generated under six types of simulation settings: the number of independent variables (1, 4, 8); the absolute value of the regression coefficient (0, 1, 2); the sample size (50, 70, 80, 90, 100, 200, 300, 400, 500); the EIV (1, 2, 3, 4, 5, 7, 10, 12, 14, 16, 18, 20, 25, 35, 45, 50, 60, 70, 80, 90, 100, 150, 200, 250); the correlation among variables (0, 0.5, 0.8); and the event rate (5%, 10%, 15%, 30%, 50%). The simulation did not cover the full combination of these six settings: the EIV was at most half of the sample size, the minimum EIV under MLE was 5, no correlation existed in the model with one independent variable, and event rates were varied only in the model with eight independent variables. The outcome variable was sampled from a binomial distribution whose probability was computed from the preset regression coefficient β and the simulated independent variables. Each setting was run with 10000 replications.

Then, the parameter estimates were obtained by MLE and PLE, and the confidence intervals and hypothesis tests were computed by the Wald method and the profile likelihood method.

Last, the converged parameter estimates were evaluated by type I error, mean square error (MSE), accuracy, precision, and confidence interval coverage probability (CI coverage), in comparison with the preset regression coefficient β, to explore the impact of EIV on the model results.
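The simulation loop described above can be sketched in a much reduced form. The Python code below is not the study's R code: it assumes a single binary independent variable with a fixed EIV, uses the closed-form MLE and Wald interval of the 2x2 table, and evaluates only MSE and CI coverage over the converged replications. The function name, the default seed, and the parameter values are illustrative assumptions.

```python
import math
import random


def simulate_wald_mle(n, eiv, beta0, beta1, reps, seed=0):
    """Monte Carlo evaluation of the Wald/MLE slope for one binary predictor."""
    rng = random.Random(seed)
    p1 = 1 / (1 + math.exp(-(beta0 + beta1)))   # event probability when x = 1
    p0 = 1 / (1 + math.exp(-beta0))             # event probability when x = 0
    converged = covered = 0
    sq_err = 0.0
    for _ in range(reps):
        a = sum(rng.random() < p1 for _ in range(eiv))      # events, x = 1
        c = sum(rng.random() < p0 for _ in range(n - eiv))  # events, x = 0
        b, d = eiv - a, n - eiv - c                         # non-events
        if min(a, b, c, d) == 0:
            continue                  # infinite MLE: count as not converged
        est = math.log(a * d / (b * c))
        se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
        converged += 1
        sq_err += (est - beta1) ** 2
        covered += abs(est - beta1) <= 1.96 * se   # 95% Wald CI contains beta1
    if converged == 0:
        return {"converged": 0, "mse": math.nan, "coverage": math.nan}
    return {"converged": converged,
            "mse": sq_err / converged,
            "coverage": covered / converged}


# With beta1 = 0 the type I error of the Wald test is 1 - coverage.
result = simulate_wald_mle(n=100, eiv=20, beta0=0.0, beta1=0.0, reps=2000, seed=1)
print(result, "type I error:", 1 - result["coverage"])
```

Repeating the call over a grid of EIV values and looking for the point where the indices stabilize mirrors the cut-off strategy in spirit, although the study itself also varied the number of variables, their correlation, and the estimation method.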
When the evaluation indices reached the desired value or became relatively stable, the corresponding number of events was taken as the cut-off value of EIV.

RESULTS
EIV had a direct and regular impact on the logistic regression results, whereas the event rates of the independent variables acted only indirectly, in combination with the sample size. Table 1 displays the specific cut-off values of EIV under the different methods and the five assessment indices.

Two methods, the Wald method based on MLE and the profile likelihood method based on PLE, were able to control the type I error, with the latter superior to the former. The Wald method based on MLE required an EIV of 20 or more for the type I error to remain stably between 4% and 6%; the profile likelihood method based on PLE needed an EIV of 12 or more for the type I error to remain stably near 5%. The profile likelihood method based on MLE needed an EIV of 12 or more together with a sample size larger than 200 to keep the type I error stably near 5%, and the Wald method based on PLE needed an EIV of 45 or more with a sample size larger than 200.

The second step was to quantify the strength of the risk factors, namely the accuracy of the parameter estimates. With MLE, EIV needed to reach 18, 12, and 16 to obtain stable mean square error, accuracy, and precision, respectively. With PLE, EIV needed to reach 12, 12, and 7, respectively.

Last, in terms of confidence interval coverage probability, the confidence interval estimates of logistic regression are more reliable when the coverage probability is well controlled around 95%. As with the type I error, the Wald method based on MLE and the profile likelihood method based on PLE both kept the coverage probability around 95% as expected, and again the latter was superior to the former.
The Wald method based on MLE required an EIV of 30 or more for the coverage probability to remain stably between 94% and 96%; the profile likelihood method based on PLE needed an EIV of 14 or more for the coverage probability to remain stably near 95%. The other two combinations were strongly influenced by other factors: the profile likelihood method based on MLE mostly failed to reach the expected coverage, and the Wald method based on PLE required an EIV of at least 45 with a sample size larger than 200. In addition, the cut-off value of EIV varied somewhat with the number of independent variables, the absolute value of the regression coefficient, the sample size, and the correlation among independent variables, and the strength and direction of these effects differed slightly.

CONCLUSION
In practical applications of logistic regression models, EPV and EIV should be combined to determine the sample size. EIV should not be less than 12. When EIV ranges from 12 to 20, the profile likelihood method based on PLE is recommended for better control of the type I error and for accurate parameter estimates; when EIV is larger than 20, the profile likelihood method based on PLE and the Wald method based on MLE are both suitable. Further, when EIV ranges from 14 to 30, the profile likelihood method based on PLE is recommended for better control of the confidence interval coverage probability; when EIV is larger than 30, the profile likelihood method based on PLE and the Wald method based on MLE are both applicable. Both methods can be used within their recommended EIV ranges, with priority given to the profile likelihood method based on PLE. When EIV is small and the sample size cannot be enlarged, the independent variable can be removed from the logistic regression model to avoid biased results.
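The recommendations above can be read as a simple lookup. The Python sketch below encodes one reading of them; the function name, the goal labels, and the treatment of EIV values below the lower threshold (returning an empty list) are assumptions rather than part of the study.

```python
def recommend_methods(eiv, goal="type1"):
    """Return the methods suggested for a given EIV.

    goal="type1":    control of type I error and accurate parameter estimates
    goal="coverage": control of confidence interval coverage probability
    """
    # (lower, upper): PLE-based profile likelihood alone from lower to upper,
    # both methods above upper. Thresholds are taken from the conclusion.
    lower, upper = {"type1": (12, 20), "coverage": (14, 30)}[goal]
    if eiv < lower:
        return []   # below the cut-off: enlarge the sample or drop the variable
    if eiv <= upper:
        return ["profile likelihood + PLE"]
    return ["profile likelihood + PLE", "Wald + MLE"]


print(recommend_methods(15))
print(recommend_methods(25, goal="coverage"))
print(recommend_methods(40, goal="coverage"))
```

The list is ordered so that the PLE-based profile likelihood method, which the conclusion gives priority to, always comes first.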