| The randomized controlled trial (RCT) is considered the gold standard for estimating treatment effects, but randomization does not guarantee that all studied covariates are balanced between groups. In addition, the external validity of an RCT is limited by its inclusion and exclusion criteria, and RCTs are not applicable in every clinical setting. Non-randomized studies are increasingly used in clinical trials, real-world studies (RWS), and investigator-initiated studies. Regulatory agencies such as the FDA, EMA, and NMPA have paid increasing attention to real-world studies and to the relevant guidelines. The design of an RWS, especially the use of sound statistical methods for non-randomized data, is one of the key elements in converting real-world data (RWD) into real-world evidence (RWE). In non-randomized studies, systematic differences in observed covariates between groups can bias the estimated treatment effect. Historically, regression modeling, stratification, and matching have been used to reduce this bias, but all are limited by the number of covariates they can handle. The counterfactual framework is the theoretical basis for causal inference, and Rubin generalized it to non-randomized studies. The propensity score is a newer approach to non-randomized research guided by the counterfactual framework. It is defined as the probability of treatment assignment conditional on observed baseline characteristics, and the propensity score model reduces multidimensional covariates to this one-dimensional score. Commonly used propensity score methods include matching on the propensity score, inverse probability of treatment weighting (IPTW) using the propensity score, stratification on the propensity score, and covariate adjustment using the propensity score. The propensity score has been widely used in non-randomized studies to reduce or eliminate confounding between groups. The propensity score method is increasingly used in practice, but it lacks standardized step-by-step procedures. If the procedures 
and statistical analysis methods are not standardized, the estimated treatment effects may be biased. For example, it must be decided which matching algorithm and statistical analysis method to use with propensity score matching, and which weights and statistical analysis method to use with IPTW. For propensity score matching, the robustness of the matching algorithm, the assessment of balance in baseline covariates by the standardized difference, and the choice of the optimal caliper for three treatment groups should also be discussed. The use of propensity score methods in studies with small sample sizes also deserves consideration. The primary objective of the current study is to discuss and address these issues. The study comprises four sections. The main results and conclusions are summarized as follows. The first section discussed and compared different IPTW methods using the propensity score, including traditional IPTW, SIPTW, and TIPTW. The simulation compared 10 different statistical analysis methods across sample sizes of 100, 200, 600, and 1000 subjects and allocation ratios of 1:1, 1:2, and 1:3. We summarize our findings as follows. 1. IPTW and SIPTW had correct type I error rates, while TIPTW possibly had an inflated type I error rate. 2. The following three methods were overly conservative, with type I error rates below 0.05 (e.g., 0.01 or 0.015): IPTW using a weighted t test, IPTW using weighted regression analysis without covariates, and SIPTW using weighted regression analysis with the propensity score as the only covariate. 3. IPTW using a weighted t test and SIPTW using weighted regression analysis with the propensity score as the only covariate had type I error rates closer to the nominal level than the other methods. In brief, IPTW using the propensity score controls the type I error rate well if the propensity score model has been adequately 
specified. Similar conclusions were found with small sample sizes. The second section compared global optimal matching and caliper matching. The simulation compared 8 different statistical analysis methods across sample sizes of 100, 200, 600, and 1000 subjects and allocation ratios of 1:1, 1:2, and 1:3. We summarize our findings as follows. 1. In general, caliper matching had type I error rates closer to the nominal level than global optimal matching. 2. With caliper matching, the t test or regression analysis with the propensity score as the only covariate had type I error rates closer to the nominal level than the paired t test or traditional regression analysis. 3. Propensity score matching followed by the t test or regression analysis with the propensity score as the only covariate had type I error rates closer to the nominal level under various combinations of covariates. In brief, we recommend caliper matching combined with the t test or regression analysis with no covariates other than the propensity score. Similar conclusions were found with small sample sizes. The third section examined propensity score caliper matching for three treatment groups, with sample sizes of 400, 600, 800, and 1000 subjects and allocation ratios of 1:2:7, 1:2:3, and 5:6:9. We summarize our findings as follows. 1. For multiple treatment groups, standardized differences were calculated for each pairwise comparison, and the largest of these was used to assess overall balance. The simulation results suggest that a standardized difference of less than 15% indicates negligible imbalance in each covariate among groups. 2. Caliper widths of 10% to 30% of the pooled standard deviation of the logit of the propensity score resulted in lower relative bias, within 15%. Caliper widths of 20% to 40% of the pooled standard 
deviation of the logit of the propensity score resulted in lower MSE. In summary, our findings suggest that a standardized difference of less than 15% can be taken to indicate a negligible difference among multiple treatment groups, and that caliper widths of 20% to 30% of the standard deviation of the logit of the propensity score are optimal. The last section proposed and summarized standardized step-by-step procedures for IPTW using the propensity score and for caliper matching, based on the current study and simulation results and illustrated with an example. R programs and SAS macro programs are also provided. |
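
The quantities named in the abstract can be illustrated with a minimal Python sketch. It is not the study's R or SAS code, and every function name in it is hypothetical. It assumes the textbook definitions: traditional IPTW weights of 1/e for treated and 1/(1-e) for untreated subjects, stabilized weights that multiply these by the marginal treatment probability (whether this matches the study's "SIPTW" exactly is an assumption), the standardized difference for a continuous covariate, and a caliper taken as a fraction of the pooled standard deviation of the logit of the propensity score. Pooling by averaging the group variances is also an assumption.

```python
import math
from itertools import combinations
from statistics import mean, variance


def iptw_weights(treated, ps):
    """Traditional IPTW: 1/e for treated (1), 1/(1 - e) for untreated (0)."""
    return [1.0 / e if t == 1 else 1.0 / (1.0 - e) for t, e in zip(treated, ps)]


def stabilized_weights(treated, ps):
    """Stabilized weights: IPTW scaled by the marginal probability of the
    treatment actually received (assumed textbook form)."""
    p_treat = mean(treated)
    return [
        p_treat / e if t == 1 else (1.0 - p_treat) / (1.0 - e)
        for t, e in zip(treated, ps)
    ]


def standardized_difference(x_a, x_b):
    """Absolute standardized difference (in percent) of a continuous
    covariate between two groups, using sample variances."""
    pooled_sd = math.sqrt((variance(x_a) + variance(x_b)) / 2.0)
    return 100.0 * abs(mean(x_a) - mean(x_b)) / pooled_sd


def max_pairwise_std_diff(groups):
    """Largest pairwise standardized difference across several groups,
    used as the overall balance measure for one covariate."""
    return max(standardized_difference(a, b) for a, b in combinations(groups, 2))


def caliper_width(ps_by_group, fraction=0.2):
    """Caliper as a fraction of the pooled SD of logit(propensity score).
    Pooling = square root of the mean of the group variances (assumption)."""
    group_vars = [
        variance([math.log(e / (1.0 - e)) for e in ps]) for ps in ps_by_group
    ]
    return fraction * math.sqrt(mean(group_vars))


if __name__ == "__main__":
    treated = [1, 0, 1, 0]
    ps = [0.5, 0.5, 0.25, 0.2]
    print(iptw_weights(treated, ps))
    print(stabilized_weights(treated, ps))
    print(max_pairwise_std_diff([[1, 2, 3], [2, 3, 4], [3, 4, 5]]))
    print(caliper_width([[0.2, 0.8], [0.2, 0.8]], fraction=0.2))
```

Under the abstract's recommendations, the value returned by `max_pairwise_std_diff` would be compared against 15%, and `fraction` would be chosen between 0.2 and 0.3.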