Application Of Generative Adversarial Imputation Nets Enhancing Multiple Imputation Methods

Posted on: 2024-03-12
Degree: Master
Type: Thesis
Country: China
Candidate: R H Zhu
GTID: 2544306914490094
Subject: Epidemiology and Health Statistics
Abstract:
Background: In recent years, real-world studies (RWS) have attracted increasing attention at home and abroad. Because most real-world data (RWD) are collected, recorded, and stored without strict research-grade quality control, they may suffer from missing key variables, missing data, and inaccurate records, which complicates the acquisition and application of real-world evidence (RWE). Missing data are one of the unavoidable problems in real-world studies. They not only complicate statistical analysis and the interpretation of results, but may also bias research conclusions and undermine their representativeness and validity. Directly deleting or ignoring missing data may cause selection bias and waste available information. Because real-world data are strongly heterogeneous and subject to many confounding and interfering factors, missing data most often occur in covariates, affect several variables at once, and follow special missing patterns. Single-value imputation methods such as mean imputation, regression imputation, and last observation carried forward are therefore of limited use for real-world data. Multiple imputation (MI) is currently a commonly used method for multivariate missing data that fully accounts for the variability of the missing values. MI imputes multivariate missing data by assuming either a multivariate normal distribution among the variables (Markov chain Monte Carlo, MCMC) or a joint distribution specified through conditional models (fully conditional specification, FCS), and it allows the range of imputed values to be preset according to domain knowledge. However, the assumption of a multivariate normal distribution or a joint distribution among the variables cannot always be satisfied, which may bias the estimates. In recent years, generative adversarial imputation nets (GAIN), a relatively cutting-edge method based on deep learning, have been applied to missing data imputation. GAIN builds on the generative adversarial network: the generator and the discriminator are optimized against each other until the generator produces imputed values that the discriminator can hardly distinguish from observed data. GAIN requires neither supervision nor pre-training and can handle multivariate missing data. However, GAIN ultimately imputes a single value for each missing observation, so it is also a single-value imputation method that does not account for the variability of missing data. Moreover, the initial imputed value in GAIN is 0, which carries no prior information. In summary, both multiple imputation and generative adversarial imputation nets can handle multivariate missing data, and their main strengths and weaknesses complement each other. Exploring and constructing a method that combines the two may therefore improve the imputation of multivariate missing data, offer a richer set of methodological options for data imputation, and improve the robustness and reliability of evidence from real-world studies.

Objective: This study aims to construct a generative adversarial imputation nets enhanced multiple imputation method (GAIN-enhanced MI, GEM) for the problem of multiple covariates with missing data in real-world studies. In a simulation study, we compare and evaluate the imputation performance of GEM, GAIN, and MCMC across different simulation scenarios, and then apply the three methods to a real example, providing a richer set of methodological options for imputing multivariate missing data more effectively in real-world studies.

Methods: In the simulation study, real-world data were simulated using the Monte Carlo method, with one continuous outcome, one dichotomous group variable, and seven covariates: three continuous covariates, two dichotomous covariates, one multi-category covariate, and one ordinal categorical covariate. We set up 120 simulation scenarios by crossing 3 missing mechanisms (missing completely at random, MCAR; missing at random, MAR; and missing not at random, MNAR), 4 sample sizes (500, 1000, 2000, 5000), and 10 missing rates (1% to 10%). In each scenario, three covariates were set to have missing data: one continuous covariate, one dichotomous covariate, and one ordinal categorical covariate. Imputation performance for these three covariates was evaluated from two aspects: recovery of the true values of the missing observations, and estimation of the model parameters. Using the covariate values in the complete dataset as the gold standard, we evaluated the imputation of missing observations by the normalized root mean square error (NRMSE) and the proportion of falsely classified (PFC). Using the covariate coefficients estimated by a linear regression model as the gold standard, we evaluated the parameter estimates by bias, mean absolute error (MAE), and 95% confidence interval (CI) coverage. In the example application, based on real-world data, we compared the total length of hospital stay between children with bronchopulmonary dysplasia (BPD) who underwent tracheostomy and those who did not, and used four methods, complete case analysis (CCA), GEM, GAIN, and MCMC, to handle multivariate missing data and improve the robustness of the results.

Results:
1. Simulation study
(1) The effect of imputation in missing observations
1) Normalized root mean square error: Under all three missing mechanisms (MCAR, MAR, and MNAR), for the continuous covariate, the NRMSE of GEM and GAIN was lower than that of MCMC, and the difference increased with the missing rate, while there was no significant difference between GEM and GAIN.
2) Proportion of falsely classified: Under the different missing mechanisms, for the dichotomous covariate, the PFC of GEM and GAIN was lower than that of MCMC, and the difference increased with the missing rate, while there was no significant difference between GEM and GAIN. For the ordinal categorical covariate, there was no significant difference in PFC among GEM, GAIN, and MCMC.
(2) The performance of the parameter estimates
1) Bias: For the continuous covariate and the ordinal categorical covariate, under the different missing mechanisms, the biases of GEM, GAIN, and MCMC were all close to 0, with no significant difference among the three methods. For the dichotomous covariate, under the MCAR mechanism, the biases of GEM, GAIN, and MCMC were close to 0, with no significant difference among the three methods. Under the MAR mechanism, the bias of MCMC was close to 0 and smaller than that of GEM and GAIN, while there was no significant difference between GEM and GAIN. Under the MNAR mechanism, the absolute biases of GEM and GAIN were smaller than that of MCMC, and the difference first decreased and then increased as the missing rate increased, while there was no significant difference between GEM and GAIN.
2) Mean absolute error: Under the different missing mechanisms, for the continuous covariate and the ordinal categorical covariate, there was no significant difference in MAE among GEM, GAIN, and MCMC. For the dichotomous covariate, when the missing mechanism was MCAR and the missing rate was high, the MAE of GAIN was greater than that of GEM, which in turn was greater than that of MCMC, and the differences among the three increased with the missing rate. When the missing mechanism was MAR, the MAE of GAIN was higher than that of GEM, which in turn was higher than that of MCMC, and the difference increased with the missing rate and the sample size. When the missing mechanism was MNAR, the MAE of GAIN was higher than that of GEM and MCMC, and the difference increased with the missing rate. The MAE of GEM was close to that of MCMC when the missing rate was low and slightly higher than that of MCMC when the missing rate was high, but the difference between the two decreased as the sample size increased. When the sample size was 5000, there was no significant difference between GEM and MCMC.
3) 95% confidence interval coverage: For the continuous covariate, the 95% CI coverages of GEM and MCMC under the different missing mechanisms were all around 100%. Under the MCAR mechanism, when the sample size was large and the missing rate was high, the coverages of GAIN were closer to 95%. Under the MAR or MNAR mechanism, the coverage of GAIN moved closer to 95% as the sample size and the missing rate increased. However, when the sample size was 5000 and the missing rates were 9% to 10% (MAR), or the sample size was 5000 and the missing rates were 7% to 10% (MNAR), the coverage of GAIN was less than 90%. For the dichotomous covariate, the 95% CI coverages of MCMC under the different missing mechanisms were all around 100%. When the missing mechanism was MCAR or MNAR, the coverages of GEM were around 100%. Under the MAR mechanism, when the sample size was 5000 and the missing rate was 10%, the coverage of GEM was closer to 95%. Under all three missing mechanisms, the coverages of GAIN moved closer to 95% as the sample size and the missing rate increased. However, when the sample size was 5000 and the missing rate was 10% (MCAR), or the sample size was 2000 and the missing rate was 10%, or the sample size was 5000 and the missing rates were 8% to 10% (MAR/MNAR), the coverages of GAIN were less than 90%. For the ordinal categorical covariate, the 95% CI coverages of GEM and MCMC under the different missing mechanisms were all around 100%. When the missing mechanism was MCAR, the coverage of GAIN was around 100%. When the missing mechanism was MAR or MNAR, the coverage of GAIN moved closer to 95% as the sample size and the missing rate increased.
2. Example application
In the example application, a total of 6 covariates contained missing data, with missing rates ranging from 0.10% to 14.79%. The covariates with missing data were of continuous, dichotomous, ordinal categorical, and unordered categorical types. CCA, GEM, GAIN, and MCMC were used to handle the covariates with missing data, and a covariate-adjusted linear regression model was used to explore the relationship between tracheostomy and total length of hospital stay in children with BPD. The linear regression models fitted after each of the four methods led to the same statistical conclusion: tracheostomy was associated with a longer total length of hospital stay in children with BPD. Among children with BPD who underwent tracheostomy, the median total length of hospital stay was 124 days, approximately 100 days longer than among children with BPD who did not undergo tracheostomy.

Conclusion: Under all three missing mechanisms (MCAR, MAR, and MNAR), for the continuous covariate and the dichotomous covariate, GEM was better than MCMC at imputing missing observations, and there was no significant difference between GEM and GAIN. For the ordinal categorical covariate, there was no significant difference among GEM, GAIN, and MCMC in imputing missing observations. For the continuous covariate and the ordinal categorical covariate, GEM, GAIN, and MCMC all gave accurate parameter estimates under the three missing mechanisms. For the dichotomous covariate, when the missing mechanism was MCAR, the point estimates obtained by the three methods were all accurate, while the mean absolute errors of MCMC were relatively smaller. When the missing mechanism was MAR, the point estimates obtained by MCMC were more accurate and its mean absolute errors were smaller. When the missing mechanism was MNAR, the point estimates obtained by GEM were more accurate and its mean absolute errors were smaller.
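The evaluation criteria named in the Methods (NRMSE and PFC for the imputed observations; bias, MAE, and 95% CI coverage for the parameter estimates) can be illustrated with a minimal sketch. This is not the thesis code: the function names are hypothetical, and the abstract does not say how the RMSE is normalized, so dividing by the standard deviation of the true values is an assumption made here for illustration only.

```python
import numpy as np


def nrmse(true_vals, imputed_vals):
    """RMSE between imputed and true values of a continuous covariate,
    divided by the SD of the true values.

    ASSUMPTION: the normalizer (population SD) is chosen for illustration;
    the abstract does not state which normalization was used.
    """
    t = np.asarray(true_vals, dtype=float)
    m = np.asarray(imputed_vals, dtype=float)
    rmse = np.sqrt(np.mean((m - t) ** 2))
    return float(rmse / np.std(t))


def pfc(true_labels, imputed_labels):
    """Proportion of imputed categorical entries that differ from the truth."""
    t = np.asarray(true_labels)
    m = np.asarray(imputed_labels)
    return float(np.mean(t != m))


def parameter_metrics(estimates, ci_lower, ci_upper, gold):
    """Bias, MAE, and CI coverage of one coefficient across replicates.

    `gold` is the coefficient from the complete data; each array holds one
    entry per simulation replicate.
    """
    est = np.asarray(estimates, dtype=float)
    lo = np.asarray(ci_lower, dtype=float)
    hi = np.asarray(ci_upper, dtype=float)
    return {
        "bias": float(np.mean(est - gold)),
        "mae": float(np.mean(np.abs(est - gold))),
        "coverage": float(np.mean((lo <= gold) & (gold <= hi))),
    }


# Toy values, made up purely to exercise the functions.
toy_nrmse = nrmse([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 6.0])
toy_pfc = pfc([0, 1, 1, 0], [0, 1, 0, 0])
toy_params = parameter_metrics(
    estimates=[0.9, 1.1, 1.3],
    ci_lower=[0.5, 0.8, 1.05],
    ci_upper=[1.3, 1.4, 1.55],
    gold=1.0,
)
print(toy_nrmse, toy_pfc, toy_params)
```

In a simulation like the one described, the first two functions would be applied only to the entries that were set to missing, and the third to the regression coefficients collected over the replicates of each scenario.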
Keywords: real world study, missing data, multivariate, multiple imputation, generative adversarial imputation nets