Study Of Statistical Models In Zero-inflated Data

Posted on: 2012-01-08
Degree: Doctor
Type: Dissertation
Country: China
Candidate: T Xu
GTID: 1484303350969409
Subject: Epidemiology and Health Statistics
Abstract/Summary:
Background

Zero-inflated count data, such as the number of sub-health symptoms, are very common in medical research. These data are discrete counts whose values are zero or positive integers. Nearly half of the observed values, and sometimes most of them, are zero, which makes the data over-dispersed. Zero inflation degrades the goodness of fit of Poisson and negative binomial regression models, and ignoring the excess zeroes biases the estimates of the regression parameters. To handle this, the raw data can be treated as a mixture of an all-zeroes subset and a subset drawn from a Poisson or negative binomial distribution. This is the zero-inflated (ZI) model.

Previous studies of ZI models, in China and abroad, only applied them to a single actual sample and compared them with traditional Poisson and negative binomial regression models. None examined goodness of fit across various proportions of zero counts, and none established when ZI models outperform the traditional ones or what proportion of zeroes should be regarded as zero inflation. In this study, bootstrap sampling from a large-scale sub-health sample was used to generate simulation samples with various proportions of zeroes. The optimum regression model was identified for each simulation sample, and the applicability of ZI models to sub-health symptom data was examined.

Methods

ZI models can handle over-dispersion and zero inflation at the same time. In medicine, ZI models can be used to describe a two-stage disease process. In a ZI model, zero counts are treated as arising from two groups. The first group of zeroes comes from individuals who are not at risk of the event at all or are at low risk.
The parameter estimates for this group are interpreted as in a binary logistic regression model and show whether covariates affect the occurrence of events. The other group of zeroes comes from individuals who are at high risk of events but produced none, with counts following a Poisson or negative binomial distribution. The parameter estimates for this group are interpreted as in a traditional Poisson or negative binomial regression model and show how covariates affect the number of events.

In this study, the response variable was the number of sub-health symptoms, and the explanatory variables were age, sex, marital status, race, occupation, smoking, alcohol drinking, high blood pressure and obesity. Poisson regression, negative binomial regression, ZI models and an ordinal regression model were fitted to every bootstrap sample with various proportions of zeroes using SAS 9.12. The dispersion coefficient (α), the O test and the Vuong test were used to assess over-dispersion and zero inflation. The likelihood ratio, AIC, BIC and the model-predicted probabilities of counts were used to compare the goodness of fit of the models, and the optimum model was identified for each proportion of zero counts.

Results

In the sub-health sample, 43.3% of all 11227 cases had no sub-health symptoms. The dispersion coefficient (α) was 1.013 (95% CI: 0.965-1.063), indicating that α was significantly larger than 0. The mean number of sub-health symptoms was 2.90±3.85 and the over-dispersion statistic O was 308.011 (P<0.001), which suggested that the response variable was over-dispersed and did not follow a Poisson distribution. The Z statistic of the Vuong test was 31.93 (P<0.001), indicating that there were too many zeroes to be explained by a traditional negative binomial distribution. The log likelihood (-22170.741) was largest for the ZINB model, while its AIC (44363.482) and BIC (44444.069) were smallest.
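As a worked check of how the reported fit statistics relate to each other, the sketch below recomputes AIC and BIC from the reported ZINB log likelihood and sample size using the standard definitions. The parameter count of 11 is an assumption: it is not stated in the abstract and was inferred here because it is the value that makes the standard formulas agree with the reported figures.

```python
import math

def aic(log_lik, k):
    """Akaike information criterion: 2k - 2*logL."""
    return 2 * k - 2 * log_lik

def bic(log_lik, k, n):
    """Bayesian information criterion: k*ln(n) - 2*logL."""
    return k * math.log(n) - 2 * log_lik

LOG_LIK = -22170.741   # ZINB log likelihood reported in the abstract
N_OBS = 11227          # number of cases reported in the abstract
K_PARAMS = 11          # ASSUMPTION: inferred, not stated in the abstract

zinb_aic = aic(LOG_LIK, K_PARAMS)
zinb_bic = bic(LOG_LIK, K_PARAMS, N_OBS)
print(round(zinb_aic, 3), round(zinb_bic, 3))
```

Under that assumed parameter count, both values agree with the reported AIC and BIC to within rounding. Smaller AIC and BIC indicate a better trade-off between fit and model size, which is why the ZINB model is preferred here.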
The predicted probabilities of every count under the ZINB model were the most consistent with the observed frequencies of the number of sub-health symptoms. In short, the ZINB model was the best model for studying factors associated with the response variable.

In the logit part of the ZINB model, higher age (β=-0.436, P<0.001) and Korean nationality (β=-2.253, P<0.001) were risk factors for the occurrence of sub-health symptoms, whereas individuals who were single or mental laborers were not susceptible to sub-health symptoms. The negative binomial part indicated that age, sex, occupation, tobacco and marital status affected the number of sub-health symptoms. Among individuals with any sub-health symptoms, female subjects (β=0.280, P<0.001), regular alcohol drinkers (β=0.098, P=0.008) and divorced or widowed subjects (β=0.200, P<0.001) had more sub-health symptoms, whereas older individuals and mental laborers had fewer.

Across the bootstrap samples with various proportions of zeroes, the goodness of fit of ZINB models and traditional negative binomial models was similar when the proportion of zeroes was lower than 15%. When the proportion of zeroes was 20% or higher, ZINB models were the optimum models, outperforming all others in goodness of fit, predictive probability and interpretability of results. In particular, when the proportion of zeroes was higher than 70%, the predicted probabilities of all counts under ZINB models were completely consistent with the observed frequencies of the response variable.

When the proportion of zeroes was 85% or higher, ordinal logistic models also had favorable log likelihood ratio and AIC values. However, the predicted probabilities of all counts under ordinal models were not consistent with the observed frequencies at any proportion of zeroes. These findings showed that ordinal regression was not the best choice for analyzing zero-inflated count data.
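The abstract does not say how the bootstrap samples with a given proportion of zeroes were drawn. One plausible way, shown here purely as an assumption, is to resample the zero and non-zero cases separately with replacement and combine them in the target ratio. The function name `bootstrap_with_zero_share` and its parameters are hypothetical.

```python
import random

def bootstrap_with_zero_share(records, zero_share, size, count_of=lambda r: r, seed=None):
    """Draw a bootstrap sample in which a fixed share of records has count 0.

    ASSUMPTION: stratified resampling with replacement; the source does not
    describe its exact procedure. `count_of` extracts the count from a record,
    so whole rows (count plus covariates) can be resampled together.
    """
    rng = random.Random(seed)
    zeros = [r for r in records if count_of(r) == 0]
    positives = [r for r in records if count_of(r) > 0]
    n_zero = round(size * zero_share)
    n_pos = size - n_zero
    if (n_zero and not zeros) or (n_pos and not positives):
        raise ValueError("source data lacks the records needed for this share")
    # Sample each stratum with replacement, then mix the two strata.
    sample = rng.choices(zeros, k=n_zero) + rng.choices(positives, k=n_pos)
    rng.shuffle(sample)
    return sample

# Toy symptom counts (illustrative values only, not data from the study).
toy_counts = [0, 0, 1, 2, 5, 0, 3]
sample = bootstrap_with_zero_share(toy_counts, zero_share=0.2, size=1000, seed=1)
print(sample.count(0) / len(sample))
```

Repeating this for a grid of `zero_share` values would give one simulated data set per proportion of zeroes, to which each candidate model could then be fitted.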
In addition, at every proportion of zeroes, Poisson regression models and ZIP models fitted poorly because of over-dispersion.

Conclusion

The likelihood ratio tests, over-dispersion test, zero-inflation test and model-predicted probabilities all suggested that the ZINB model was the best regression model for sub-health symptom data when the proportion of zeroes was larger than 20%. This study provides theoretical support for applying ZINB models to zero-inflated count data.
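The mixture idea behind the ZINB model can be sketched as follows. This is a minimal illustration, assuming the common mean and dispersion parameterization of the negative binomial (variance = mu + alpha * mu^2). The names `pi`, `mu` and `alpha` and the numeric values are illustrative assumptions, not estimates from the study.

```python
import math

def nb_pmf(y, mu, alpha):
    """Negative binomial probability of count y with mean mu and dispersion alpha > 0."""
    r = 1.0 / alpha
    log_p = (math.lgamma(y + r) - math.lgamma(r) - math.lgamma(y + 1)
             + r * math.log(r / (r + mu)) + y * math.log(mu / (r + mu)))
    return math.exp(log_p)

def zinb_pmf(y, pi, mu, alpha):
    """Zero-inflated NB: with probability pi a structural zero, otherwise an NB count."""
    base = (1.0 - pi) * nb_pmf(y, mu, alpha)
    # Zeroes come from both groups: structural zeroes plus zeroes of the count process.
    return pi + base if y == 0 else base

# Hypothetical parameter values, chosen only to show the effect of zero inflation.
PI, MU, ALPHA = 0.3, 3.0, 1.0
print(nb_pmf(0, MU, ALPHA), zinb_pmf(0, PI, MU, ALPHA))
```

With these illustrative values, the zero-inflated model assigns a higher probability to a zero count than the plain negative binomial does, which is the behavior the ZI model relies on when the data contain excess zeroes.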
Keywords/Search Tags: Count data, Poisson distribution, Negative binomial distribution, Zero-inflated model, Sub-health