Research On Multiple Imputation In Propensity Score With Partially Observed Covariates And Its Application In Real-World Studies Of Adverse Drug Reactions

Posted on: 2023-01-23, Degree: Doctor, Type: Dissertation
Country: China, Candidate: Y H Zhai
GTID: 1524307316454634, Subject: Internal Medicine
Abstract/Summary:
Background: Observational studies can compensate for the limitations of randomized controlled trials in some situations and have attracted increasing interest from researchers. A wealth of medical evidence is now derived from observational studies. However, because they lack randomization and draw on diverse data sources, observational studies are inherently susceptible to confounding and missing data. Correctly addressing confounding and missingness, and thus obtaining a robust estimate of the treatment effect, has become an important topic in observational research. The propensity score (PS) is the most widely used causal inference method in observational research. Since a subject’s propensity score is defined as the probability of treatment conditional on all observed covariates, missing values in important covariates can challenge propensity score estimation, decrease statistical power, increase research complexity, and ultimately lead to uncertainty in research conclusions. How to deal effectively with missing covariates and make the most of the sample information, so as to obtain a valid estimate of the exposure effect, is therefore an urgent problem in PS research. Compared with the vigorous development of PS methodology, research in this area is still relatively lacking. Multiple imputation (MI) is a mature approach applicable to a variety of data types and complex scenarios. It has become a standard method for handling missingness in many fields and has been increasingly applied in PS analysis in recent years. Using MI in the context of PS involves multiple steps. Each step requires the researcher to make a decision, and different choices may affect the accuracy and precision of the final estimate. Controversy remains over the approaches to combining PS and MI, and there is no unified guidance on the optimal strategy. The choice of a specific MI method in the presence of different types of missing covariates is also still inconclusive. These issues are
worthy of further exploration and research.

Objective: This study focused on implementing MI in the context of propensity score analysis with partially observed covariates for causal inference in observational studies. Simulation studies under a wide range of scenarios were conducted to systematically investigate the selection of imputation and integration approaches when combining PS and MI. Additionally, we aimed to compare the performance of multiple MI methods in different situations and to explore the conditions under which each method is applicable. In this way, we expected to provide references and suggestions for handling missing important baseline covariates in PS analysis, and to provide a basis for future related research.

Methods: To meet the objectives above, the research was carried out in four steps: data simulation, method research, method evaluation, and empirical application. Simulation research was the core of this work and comprised two parts, both of which followed the structure “A→D→E→M→P”.

1. Simulation study:

(1) Investigation of the various approaches to combining PS and MI

Observational data with dichotomous treatment and outcome were simulated using the Monte Carlo method. A wide range of scenarios was considered, varying in missing mechanisms (MCAR, MAR), propensity score estimators (stabilized inverse probability of treatment weighting (SIPTW), stratification (STR), nearest neighbor matching (NNM)), sample sizes (250, 500, 1000), and missing proportions (20%, 35%, 55%, 75%). A total of 13 different combinations of MI imputation and integration approaches were involved: imputation strategy 1-random sampling (MI1_RSAMP), imputation strategy 1-averaging covariates (MI1_AVCOV), imputation strategy 1-averaging propensity score (MI1_AVPS), imputation strategy 1-averaging PS model (MI1_AVMOD), imputation strategy 1-averaging treatment effect (MI1_AVEFF), imputation strategy 2-random sampling (MI2_RSAMP), imputation strategy 2-averaging
covariates (MI2_AVCOV), imputation strategy 2-averaging propensity score (MI2_AVPS), imputation strategy 2-averaging treatment effect (MI2_AVEFF), imputation strategy 3-random sampling (MI3_RSAMP), imputation strategy 3-averaging propensity score (MI3_AVPS), imputation strategy 3-averaging PS model (MI3_AVMOD), and imputation strategy 3-averaging treatment effect (MI3_AVEFF). The PS estimates obtained from the full dataset before missing values were generated were used as the gold standard. Relative bias (RB), mean squared error (MSE), empirical standard error (Emp SE), 95% confidence interval coverage (95% CI Cov), and the average length of the 95% confidence interval (95% CI AL) were used to assess the relative performance of the different combinations under the different simulation scenarios.

(2) Research on MI methods for handling missingness in different covariate types

The Monte Carlo technique was used to mimic real-world observational data with dichotomous treatment and outcome. The sample size was fixed at 500. We created simulation scenarios varying in the type of covariate with missing values (continuous, dichotomous, ordinal, nominal), missing mechanisms (MCAR, MAR), propensity score estimators (SIPTW, STR), and missing proportions (25%, 45%, 70%). To cover the MI methods suited to the different covariate types, we included predictive mean matching (PMM), Bayesian linear regression (BYSLR), bootstrap linear regression (BOTLR), random linear regression imputation (IMELR), logistic regression (LG), bootstrap logistic regression (BOTLG), linear discriminant analysis (LDA), the proportional odds model (POM), polytomous logistic regression (PLR), classification and regression trees (CART), and random forest (RF). The performance of RF with 10 trees (RF_10), 20 trees (RF_20), and 40 trees (RF_40) was also investigated. Similarly, the PS estimation results with complete data were used as the gold standard for evaluation. RB, MSE, Emp SE, 95% CI Cov, and 95% CI AL were employed
to compare the comprehensive performance of the MI methods under different data conditions.

2. Empirical application

The U.S. Food and Drug Administration Adverse Event Reporting System (FAERS) was used as the data source for this part. Records relating to chronic lymphocytic leukemia (CLL) were screened out, and a traditional signal detection method was employed to mine signals of ibrutinib-associated cardiotoxic events. For the detected signals, the PS method was used to control confounding, and MI was performed on the basis of the simulation results and the actual missingness in the data to obtain the final effect estimate. Finally, we examined and demonstrated the applicability of these research approaches through an empirical analysis. All the above analyses were performed using the statistical software SAS 9.4 and R 4.1.2.

Results:

1. Simulation study:

(1) Investigation of the various approaches to combining PS and MI

1) In general, the performance of the different approaches under the MCAR and MAR mechanisms was nearly consistent. Missing rate and sample size could affect the performance of each approach: (1) At a given sample size, as the missing rate increased, the absolute bias of most approaches still tended to increase significantly, even though positive and negative biases could offset each other. The MSE and Emp SE of most approaches tended to increase as missingness increased, and the corresponding 95% CI Cov decreased. This trend worsened as missing rates increased; (2) Increasing the sample size led, in most simulation scenarios, to lower MSE and Emp SE estimates and a narrower 95% CI AL for the different approaches. The overall performance of the approaches improved, and the variation in exposure effect estimates between approaches was reduced.

2) In the context of different PS methods, the estimation results obtained by the different MI methods differed: (1) Using SIPTW: MI3_AVMOD, MI1_AVMOD, MI1_AVPS, and MI1_AVEFF performed relatively well. MI3_AVMOD
outperformed the others when the mechanism was MCAR and the sample size was large, but was slightly unsatisfactory when the mechanism was MAR and the sample size was large, where the MSE and Emp SE increased significantly under high missing rates. By contrast, the other three performed relatively stably across situations; (2) Using STR: MI3_AVEFF, MI1_AVEFF, and MI2_AVEFF performed well. Among these three approaches, MI1_AVEFF demonstrated better performance overall, especially when the mechanism was MAR and the sample size was large. In such situations it was superior to the other approaches on the two core indicators, MSE and Emp SE; (3) Using NNM: MI3_AVEFF, MI1_AVEFF, and MI2_AVEFF obtained the smallest MSE and Emp SE estimates in most simulated scenarios. The results of MI1_AVEFF and MI3_AVEFF were comparable and slightly better than those of MI2_AVEFF in terms of MSE and Emp SE.

3) The three approaches that pool the treatment effect using Rubin’s rule performed better regarding MSE and Emp SE in most simulation settings. Since both the between-imputation variance and the within-imputation variance were taken into consideration, the standard error obtained by these three approaches tended to be larger, the corresponding coverage rate could easily exceed 95%, and the interval was therefore significantly wider than for the other approaches. This trend was particularly evident in the NNM context. Increasing the sample size or the number of imputations could shorten the interval and thus enhance the accuracy of the estimation result. Among the three approaches, MI1_AVEFF presented the best comprehensive performance and was robust in terms of MSE, Emp SE, and coverage. The accuracy and efficiency of the estimates obtained by MI1_AVEFF were high, and the estimates tended to be conservative but more reliable. The approaches based on random sampling, averaging covariates, averaging PS, and averaging the PS model did not take into account the within-imputation variance, which would underestimate the standard error, and
the resulting interval tended to be too narrow. Most of these approaches tended to have poor coverage as the missing rate increased, and were prone to type I errors and to overestimating the accuracy of the estimated results. Although MI3_AVMOD, MI1_AVMOD, and MI1_AVPS performed well in the SIPTW context, they performed poorly in the STR and NNM settings in terms of MSE and Emp SE. Additionally, in most cases, the 95% CI Cov decreased significantly as missingness increased, and the interval was significantly narrower than that of the approaches based on averaging the treatment effect.

(2) Research on MI methods for handling missingness in different covariate types

1) Overall, under the different missing mechanisms and PS methods, the performance indicators of the different MI methods varied in relatively similar patterns. The missing proportion significantly affected the performance of the MI methods. When the missing proportion was small, the performance of all MI methods was comparable. As the missing proportion increased, differences among the MI methods gradually emerged. In general, MSE, Emp SE, and 95% CI AL showed an overall increasing trend as missingness increased, while the coverage rate did not change much. In addition, under the different simulation scenarios, most methods showed a trend of increasing bias as the missing proportion became larger. Overall, at high missing rates, the performance of each MI method deteriorated in most simulation scenarios, the accuracy and efficiency of imputation decreased, and the corresponding effect estimates gradually deviated from the gold standard. This trend worsened as missingness increased.

2) The simulation results indicated that the performance of the MI methods varied with the type of covariate that was missing: (1) When continuous variables were missing: the RF-based methods performed best. The advantage of these methods was particularly evident at high missing rates, where they outperformed other methods concerning
the accuracy and precision of effect estimation. CART and IMELR were comparable to RF regarding 95% CI AL, but performed poorly regarding MSE and Emp SE; (2) When binary variables were missing: under the MCAR mechanism, the RF-based methods and CART performed relatively well, and the three RF-based methods demonstrated the best overall performance; under the MAR mechanism, the RF-based methods had the best performance in terms of MSE, Emp SE, and 95% CI AL; (3) When ordinal variables were missing: the results under both missing mechanisms showed that the RF-based methods performed best and could obtain stable MSE, Emp SE, and 95% CI AL estimates even at high missing rates; (4) When nominal variables were missing: under the MCAR mechanism, the overall performance of PMM, CART, and PLR was better, although in the context of SIPTW the performance of PLR regarding MSE and Emp SE was slightly worse; under the MAR mechanism, LDA had the worst performance, and the remaining methods performed similarly regarding MSE and Emp SE. Based on the results of all indicators, PLR and CART were better.

3) In most simulation scenarios, RF_10, RF_20, and RF_40 performed equivalently well in terms of RB, MSE, Emp SE, 95% CI Cov, and 95% CI AL. Increasing the number of trees, however, significantly prolonged the computation time.

2. Empirical application

A total of 36,045 records with an indication of CLL were included. A total of 5562 reports were involved, of which 2975 (53.49%) received ibrutinib and 2587 (46.51%) did not. The SIPTW method was used to further explore the adverse events of atrial fibrillation and cardiac disorder detected by the disproportionality analysis. Considering the actual data and the simulation results, MI1_AVPS, MI1_AVEFF, and MI1_AVMOD, which had relatively stable performance under MAR, SIPTW, and a large sample size, were selected. RF_10 was used to implement multiple imputation for the missing age and gender variables. Complete case analysis (CCA) and unconditional mean imputation (UMI) were performed for
comparison. The results obtained by CCA differed most from those of the other methods and could easily lead to false research conclusions; UMI was slightly better than CCA but underestimated the correlation between variables. The point estimates obtained by UMI differed considerably from those of the three MI approaches. The results of the three MI approaches were largely consistent and statistically significant: in the CLL population, patients treated with ibrutinib had higher rates of atrial fibrillation and cardiac events than patients who were not; for atrial fibrillation in particular, the OR point estimates for the three approaches were all above 3.0.

Conclusions: In real-world studies, when applying MI in PS analysis with partially observed important covariates, different combining approaches and different MI methods may affect the accuracy and precision of effect estimates. In general, MI1_AVEFF performs relatively robustly across different PS methods, and its results are slightly conservative but more credible. When the PS method is NNM, the widening of the confidence interval is particularly pronounced. Therefore, using MI to impute missing values with NNM is not recommended when the sample size is small. Overall, the two MI methods based on machine learning hold certain advantages over conventional parametric and semi-parametric MI methods. The RF method is particularly noteworthy for its good performance in most simulation scenarios. It is suitable for missing covariates of various types in PS analysis (continuous, binary, and ordinal) and can obtain more robust estimates even at a high missing rate. The current simulation results do not yet indicate that increasing the number of trees in the RF method improves the accuracy of imputation, and thus the accuracy and precision of the treatment effect estimate, when combined with PS.
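
As a minimal sketch of the pooling step behind the "averaging treatment effect" approaches (MI1_AVEFF, MI2_AVEFF, MI3_AVEFF), the Python snippet below applies Rubin's rules to per-imputation estimates. It is not the author's code (the analyses were run in SAS 9.4 and R 4.1.2). The function names pool_rubin and normal_ci, the example log odds ratios, and the normal-approximation interval are all illustrative assumptions.

```python
import math
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool per-imputation estimates (e.g. log odds ratios) with Rubin's rules.

    estimates: point estimate from each of the m imputed datasets
    variances: squared standard error of each estimate
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need m >= 2 estimates with matching variances")
    q_bar = mean(estimates)         # pooled point estimate
    within = mean(variances)        # within-imputation variance W
    between = variance(estimates)   # between-imputation variance B (m - 1 denominator)
    total = within + (1 + 1 / m) * between  # total variance T = W + (1 + 1/m) B
    return q_bar, within, between, total


def normal_ci(q_bar, total, z=1.96):
    """Approximate 95% CI; a simplification of Rubin's t reference distribution."""
    half = z * math.sqrt(total)
    return q_bar - half, q_bar + half


# Hypothetical log odds ratios and variances from m = 3 imputed datasets.
log_or = [1.10, 1.20, 1.30]
var_log_or = [0.04, 0.04, 0.04]

q_bar, within, between, total = pool_rubin(log_or, var_log_or)
low, high = normal_ci(q_bar, total)
print(f"pooled OR = {math.exp(q_bar):.2f}, "
      f"95% CI = ({math.exp(low):.2f}, {math.exp(high):.2f})")
```

Because the total variance adds a between-imputation component to the average within-imputation variance, the pooled interval is wider than one computed from a single imputed dataset, which is consistent with the wider intervals reported above for the approaches that average the treatment effect.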
Keywords/Search Tags: observational study, causal inference, propensity score, missing data, missing completely at random, missing at random, multiple imputation, fully conditional specification, classification and regression trees, random forest