| Background: Drug safety evaluation and adverse drug reaction (ADR) signal mining are important for clinical diagnosis and treatment, drug evaluation, and policy formulation. Since the last century, drug safety has become one of the most important indicators in drug evaluation because of the frequent occurrence of drug-related incidents in China and abroad, which burden both patients and society. ADRs are one of the most important causes of drug-related incidents, so it is necessary to strengthen pharmacovigilance, conduct pharmacoepidemiological studies, and detect ADR signals in a timely manner. The field of pharmacovigilance and postmarketing drug evaluation has developed a system of methods that includes active and passive surveillance. Compared with passive surveillance, which has a long history of application and relies on spontaneous reporting of ADR events, active surveillance with real-world data is currently the more promising research direction. Longitudinal data are a common form of data in active surveillance: they combine observations obtained from individuals at different follow-up time points during long-term continuous surveillance, are used to make inferences about clinical outcomes or future trends, and appear in biology, gene expression studies, and clinical medicine. Confounding that varies over time in longitudinal active-surveillance data is called time-dependent confounding. Time-dependent confounding cannot be corrected by traditional linear models and can lead to biased effect estimates, making monitoring and early warning less reliable. Overcoming time-dependent confounding bias is therefore an important methodological problem. The most commonly used method is the marginal structural model, which uses propensity scores for inverse probability weighting to construct a virtual 
population to achieve confounding control, as proposed by Robins et al. Traditional propensity score estimation requires the variables in the model to be specified appropriately and cannot adapt the model automatically; it is highly sensitive to model specification, variable selection, and interaction settings, requires manual selection of variables and interaction terms for different linearity and additivity conditions, and is difficult to use in large high-dimensional databases. Machine learning algorithms have the advantage that they can be applied to large, high-dimensional datasets with many covariates and adapt automatically to different linearity and additivity conditions. Over-fitting can also be avoided by using linear combinations of regression trees together with shrinkage, which reduces the chance of estimated probabilities of 0 or 1 and thus avoids extreme weights. Objective: To address the limitations of time-dependent confounding control and of parametric-model-based propensity score estimation in longitudinal data studies by using the gradient boosting decision tree (GBDT), XGBoost, and LightGBM machine learning algorithms for propensity score estimation in a high-dimensional framework. To some extent, exploring machine learning algorithms amounts to choosing parameter combinations for the data at hand. The algorithms are applied to simulated and real data with different data scenarios, variable settings, and sample sizes, and model performance in each scenario is evaluated and compared using point estimates, standard errors, confidence interval widths, and coverage rates, together with the true values set in the simulated data, to provide methodological support for controlling time-dependent confounding in longitudinal data. Methods: 1. Construction of marginal structural models combined with boosting tree algorithms. The definitions of terms such as 
time-varying confounding and causal inference are clarified in a literature review, and the concepts, application scenarios, basic assumptions, and limitations of marginal structural models (MSMs) based on the counterfactual framework are reviewed. The theory of gradient boosting tree algorithms based on residual fitting is combined with an understanding of the strengths and weaknesses of each algorithm. The algorithms are implemented in software and integrated with propensity scoring. Algorithm parameters are tuned according to practical and empirical values, yielding marginal structural models combined with the boosting tree algorithms GBDT, XGBoost, and LightGBM. 2. Simulation study and model effectiveness test. A Monte Carlo (MC) simulation was used to evaluate the method: a simulation dataset was generated based on the database features of a real active pharmacovigilance system and served as a standard library. The simulation dataset contains seven data scenarios with different causal relationships depending on the parameters, and each scenario is simulated at five sample sizes (500, 1000, 5000, 10,000, and 20,000). As a gold standard for assessing model validity, the true value of each regression coefficient was determined from the parameters of the simulated dataset. An MSM without interaction terms, an MSM with manually specified interaction terms, a GBDT-MSM, an XGBoost-MSM, and a LightGBM-MSM were built to estimate the effects. To reduce random error and improve the robustness of the results, 1000 simulated datasets were generated for each calculation and the results were averaged to obtain the final estimates. The metrics used to evaluate the models were the absolute bias of the mean difference between the point estimate of the effect and the true value, the relative bias of the mean 
difference, the root mean square error, and the coverage of the 95% confidence interval. Model performance was compared across data scenarios and sample sizes. 3. Case study applied to an active surveillance system for adverse drug reactions. Data cleaning and normalization were performed on the admission data submitted from June 1, 2018 to June 1, 2022 by two sentinel hospitals of the Adverse Drug Reaction Surveillance Sentinel Alliance (CASSA) in the China Hospital Pharmacovigilance System (CHPS) and the Shanghai Health Information Center Drug Health Professional Database. Based on the normalized data, a retrospective patient cohort was established and an active surveillance study of the endocrine toxicity of PD-1/PD-L1 immune checkpoint inhibitors was performed. Exposure factors and outcomes were identified from the dataset variables, the presence of time-varying confounders was determined, drug effect estimates were compared after correction for confounders using different models, and drug effects were assessed. Results: 1. In the simulation study, for longitudinal data with a dichotomous exposure and a continuous outcome subject to both time-invariant and time-dependent confounding, a marginal structural model that uses propensity scores to construct a virtual population was able to overcome confounding and estimate the causal effect. 2. The marginal structural models with propensity scores estimated by machine learning algorithms outperformed those with propensity scores estimated by parametric models in terms of covariate balance, absolute bias, and confidence interval width. 3. The relative advantage of the machine learning algorithms becomes more pronounced as the sample size increases. At small sample sizes, the joint machine learning model performs 
close to the marginal structural model with correctly specified interaction terms. When the sample size is large, the joint model outperforms the traditional model in most data scenarios. 4. Among the three machine learning algorithms, GBDT-MSM performs slightly better than XGBoost-MSM, and both are generally better than LightGBM-MSM, but the computation time of LightGBM-MSM is much shorter than that of the other two. 5. The effect of anti-PD-1/PD-L1 treatment on patients’ thyroid function, expressed as FT4 values, was assessed using real data from the China Hospital Pharmacovigilance System. MSM, GBDT-MSM, XGBoost-MSM, and LightGBM-MSM all showed that anti-PD-1/PD-L1 treatment had a negative effect on patients’ thyroid function. In the short term, FT4 levels tended to decrease. In the long term, however, this effect was not significant, and FT4 levels partially rebounded at several follow-up visits. Among the covariates, the effects of age and gender were relatively small and mostly not statistically significant. Baseline TSH and baseline FT4 levels, on the other hand, had a small effect on FT4 levels at follow-up. In the case study, GBDT, XGBoost, and LightGBM provided narrower confidence intervals and smaller standard errors than the traditional marginal structural model. Conclusion: Using simulated data and a case study, this study investigates time-dependent confounding control for longitudinal data under China's active ADR surveillance model, taking into account the characteristics of drug safety assessment work and adverse drug reaction reporting data. Several machine learning algorithms are used to estimate propensity scores, which are combined with inverse probability weighting to generate a virtual population and construct a marginal structural model; model performance is tested with several evaluation metrics and further validated in the case study. In the dichotomous exposure and continuous outcome 
scenarios of the simulation study, the models combined with machine learning algorithms provided more accurate and less biased propensity score estimates and more stable results than the traditional parametric models. Among them, GBDT-MSM performed best and XGBoost-MSM second. LightGBM-MSM sacrifices some accuracy for a substantial gain in modeling speed. The case study also demonstrates the effectiveness of the joint machine learning models on real data. Taken together, the simulation and case study results show that the marginal structural model combined with gradient boosting tree algorithms provides relatively robust estimates of causal effects in longitudinal data with time-dependent confounding, with tighter confidence intervals. Compared with traditional MSMs, the approach performs consistently even in the presence of interactions and nonlinear terms, and also adapts better to unobserved variables, missing data, and other such situations. |
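
The inverse probability weighting step described in the abstract can be illustrated with a minimal sketch. It assumes that per-visit treatment probabilities have already been estimated by some model (for example a gradient boosting classifier); the function name `stabilized_weights` and its arguments are hypothetical and are not taken from the study. The sketch uses stabilized weights, a common variant in which the numerator model conditions only on exposure history and the denominator model also conditions on the time-varying confounders; the abstract does not state which weight form the authors used.

```python
def stabilized_weights(treatments, p_numerator, p_denominator):
    """Cumulative stabilized inverse probability weights for one subject.

    treatments    -- observed 0/1 exposure at each visit
    p_numerator   -- estimated P(exposed at visit t | exposure history)
    p_denominator -- estimated P(exposed at visit t | exposure history,
                     time-varying confounders); this is the propensity score
    All three arguments are hypothetical inputs of equal length.
    """
    if not (len(treatments) == len(p_numerator) == len(p_denominator)):
        raise ValueError("all inputs must have one entry per visit")

    weights = []
    cumulative = 1.0
    for a, p_num, p_den in zip(treatments, p_numerator, p_denominator):
        # Probability of the exposure that was actually observed.
        num = p_num if a == 1 else 1.0 - p_num
        den = p_den if a == 1 else 1.0 - p_den
        if den <= 0.0:
            raise ValueError("propensity of the observed exposure must be > 0")
        # The weight at visit t is the product of the ratios up to visit t.
        cumulative *= num / den
        weights.append(cumulative)
    return weights
```

For example, `stabilized_weights([1, 0], [0.5, 0.5], [0.25, 0.5])` yields one weight per visit; a weighted regression of the outcome on exposure using these weights would then play the role of the marginal structural model. Propensities close to 0 or 1 produce extreme weights, which is the problem the abstract says shrinkage in boosted trees helps to limit.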