| High-throughput omics techniques have been widely adopted in biomedical research, for example in gene signature inference, potential drug target discovery, and clinical prediction model establishment. However, the analysis of omics data is often hampered by batch effects. Batch effects are technical confounders that impede high-throughput omics research by causing both false positives and false negatives. This, in turn, creates severe problems for clinical applications. This thesis studies viable strategies for combating batch effects in high-throughput omics data, focusing on evaluating the practical limitations of batch effect correction algorithms and on providing viable correction strategies to the field. The main contents are listed as follows. In chapter 1, we provided an overview of how batch effects arise, commonly used batch effect correction algorithms, batch effect detection approaches, and the pressing open problems. In chapter 2, we examined the practical limits of batch effect correction algorithms: when should you care about batch effects?
Several batch effect correction algorithms (BECAs) have been devised for resolving batch effects, and various comparative evaluations of BECAs exist, but the practical limits of BECAs remain to be clarified. Using two different methods for simulating class and batch effects, and taking various representative datasets across both transcriptomic and proteomic platforms to confirm consistency, we demonstrated that when sample classes and batch factors are moderately confounded, most BECAs are remarkably robust and only weakly affected by upstream normalization procedures. BECAs do have limits: when sample classes and batch factors are strongly confounded, BECA performance declines markedly, with variable performance in precision, recall, and batch correction. We also reported that removing batch effects is no guarantee of optimal functional analysis. These observations are consistently supported across the multitude of test datasets. Overall, this study suggests that all these BECAs have certain limitations, and that there is no universally best BECA. In chapter 3, we proposed a class-specific ComBat and demonstrated that it is more robust against batch-class confounding than the existing ComBat. One of the most widely used BECAs is ComBat, which is based on an empirical Bayes approach for correcting batch effects. However, when it is used for batch effect correction, batch information is supplied but class (phenotype) information is generally ignored. In situations where batch and class effects are confounded due to experimental design imbalances, this may lead to performance issues. We propose an alternative flavor of ComBat, which we call class-specific ComBat (CS-ComBat). CS-ComBat corrects batch effects within each class independently before merging all the corrected classes back together. We performed a comprehensive comparative study of CS-ComBat and other BECAs. We demonstrated that CS-ComBat outperforms the standard
ComBat, as well as other popular BECAs, on both real and simulated data with batch-class confounding, achieving a better trade-off between batch effect correction and class effect preservation. In addition, the rebalancing approach Synthetic Minority Oversampling TEchnique (SMOTE) synergizes with BECAs, substantially improving their performance on batch-class imbalanced datasets. In summary, CS-ComBat is a potentially effective method for dealing with batch-class confounding, and it performs better in synergy with SMOTE. In chapter 4, we investigated CS-ComBat for correcting batch effects in high-throughput omics data, with a focus on the small sample size scenario. Because of time, expense, and sample limitations, many high-throughput omics datasets have relatively small sample sizes, and most batch effect correction algorithms are not suitable for small sample sizes. ComBat is one of the few algorithms that can be used to process small sample size data. However, ComBat is also imperfect: because it is blind to class information, it may lead to improper batch effect inference and removal. In contrast, our previously proposed CS-ComBat may correct batch effects better because it considers class information. Thus, in this study, we focused on evaluating the performance of CS-ComBat and other ComBat variants under the challenging small sample size scenario. Across both genomics and proteomics test datasets with real and simulated batch effects under small sample size scenarios, CS-ComBat consistently removes batch effects more thoroughly and provides better recall and inter-sampling similarity. Whether to use ComBat or CS-ComBat depends on the analytical need: when high precision is favored (e.g., when designing a drug target), ComBat is a good option, but when high recall is preferred (e.g., when seeking to understand the mechanism of a disease), CS-ComBat is a better choice. In chapter 5, the work of this thesis is summarized, future directions are discussed, and the contributions of the thesis are emphasized. In
summary, this thesis comprehensively investigated the practical limitations of batch effect correction algorithms for high-throughput omics data and provided potentially effective correction strategies. These findings should encourage researchers to pay more attention to batch effects and help them mitigate the impact of batch effects on their experiments, with potential applications in identifying gene signatures and drug targets. |
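
The class-specific idea described for chapter 3 (correct batch effects within each class independently, then merge the corrected classes) can be illustrated with a minimal sketch. This is not the ComBat or CS-ComBat implementation: the empirical Bayes step is replaced by a plain per-batch mean-centering of a single feature, and the function names `center_batches` and `class_specific_correct`, as well as the toy values, are hypothetical and chosen only for illustration.

```python
from statistics import mean


def center_batches(values, batches):
    """Simplified stand-in for a BECA (NOT ComBat): shift every batch so that
    its mean matches the overall mean of the supplied values."""
    grand = mean(values)
    batch_means = {
        b: mean(v for v, vb in zip(values, batches) if vb == b)
        for b in set(batches)
    }
    return [v - batch_means[b] + grand for v, b in zip(values, batches)]


def class_specific_correct(values, batches, classes, correct=center_batches):
    """Apply `correct` within each class separately, then merge the corrected
    values back into the original sample order."""
    corrected = [None] * len(values)
    for c in set(classes):
        idx = [i for i, vc in enumerate(classes) if vc == c]
        sub = correct([values[i] for i in idx], [batches[i] for i in idx])
        for i, v in zip(idx, sub):
            corrected[i] = v
    return corrected


# Toy, batch-class confounded design for one feature (assumed values):
# class B adds 2, batch 2 adds 1; batch 1 is mostly class A, batch 2 mostly B.
values = [0.0, 0.0, 0.0, 2.0, 1.0, 3.0, 3.0, 3.0]
batches = [1, 1, 1, 1, 2, 2, 2, 2]
classes = ["A", "A", "A", "B", "A", "B", "B", "B"]

class_agnostic = center_batches(values, batches)
class_specific = class_specific_correct(values, batches, classes)
```

In this toy design, the class-agnostic centering absorbs part of the class difference into the batch correction, whereas the class-specific variant removes the batch shift inside each class and leaves each class's observed mean unchanged. The sketch says nothing about precision or recall, which the thesis evaluates on real and simulated datasets.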