| Background: Malignant tumors are a lethal and extremely complex disease, distinguished by the heterogeneity of tumor cells and by modifications to their surrounding microenvironment. Most cancer therapy drugs are currently developed with a focus on particular biomarkers. Even in groups where biomarker expression is enriched, individual variations in drug sensitivity arise because of the great complexity of malignancies. The efficacy of anti-cancer drug therapies may be increased by applying human genome sequencing technology to examine the genetic makeup of patients’ tumors, forecast how well patients will respond to anti-cancer drugs, and create individualized treatment regimens for each patient. Thanks to developments in high-throughput sequencing technology, several international organizations and institutes have carried out studies and made databases of anti-cancer drugs tested against hundreds of tumor cell lines available to the public. This has given researchers a strong data foundation on which to build prediction models. Cell-drug pair modeling has progressively replaced drug-based modeling in current prediction algorithms. While prediction accuracy and computing efficiency have improved, issues with poor generalization performance and restricted clinical application still exist. Investigating the latent properties of drugs and tumor cell lines and skillfully incorporating these features into drug response prediction models are critical to ensuring predictive accuracy. At present, there are several approaches to learning representations of drug characteristics and cell lines. One pressing scientific issue is the selection of effective drug and cell features to improve the predictive capacity of models. In practice, the interactions between drugs and cells involve heterogeneous data with a variety of shapes and features. Their interactions need to be characterized using 
graph-structured data with spatial and structural properties; they cannot be simply represented as links between vertices with the same qualities. Consequently, the drug response prediction problem can be treated as a link prediction task between vertices. However, capturing the connection characteristics of non-Euclidean geometric data is difficult for typical neural network methods and machine learning techniques. Graph Convolutional Neural Networks (GCNs) show great potential for extracting features from data whose topological structures are associated. Graph construction presents the biggest obstacle to developing GCN-based models. Most existing GCN-based anti-cancer drug response prediction models use cell and drug similarity networks as graph-structured data for transductive learning. This approach restricts the model’s clinical usefulness and generalizability. Furthermore, in contrast to the general modeling procedure, models based on transductive learning permit the use of unlabeled test data for model training. During model validation, this might cause information from the validation data to leak. Consequently, more thorough investigation is required to incorporate inductive learning into the construction of GCN models and to justify the model validation scenario. Objectives: Based on the aforementioned scientific issues, this study will evaluate the validity of the experimental scenarios used to validate predictive models, using publicly available cancer cell-drug response databases. Furthermore, by employing effective model evaluation strategies, this study will compare the impact of different drug and cell feature representation methods on the predictive capabilities of the model. Additionally, by employing efficient drug and tumor cell feature representation methods, the study will further explore the possibility of combining transductive learning and inductive learning to construct 
an anti-tumor drug response prediction model based on GCN, aiming to enhance the model’s generalization ability and clinical applicability. Contents and Results: 1. Evaluation of experimental scenarios for drug response prediction model validation. In this study, the Genomics of Drug Sensitivity in Cancer (GDSC), Cancer Cell Line Encyclopedia (CCLE), Patient-Derived Tumor Xenograft (PDX), and The Cancer Genome Atlas (TCGA) datasets were used, and background correction, data cleaning, and standardization were performed on all omics data. The logIC50 value was employed as the indicator of drug response. Based on database-specific drug response thresholds and relevant literature reports, the logIC50 values were binarized: values exceeding the threshold were classified as resistant, while those below the threshold were classified as sensitive. Existing research has proposed various machine learning and deep learning methods that model the relationship between cell line representations and drug representations in order to predict Cell Line-Drug Response (CDR). The dataset is split into training and testing sets using splitting strategies that correspond to different application scenarios. Commonly used data splitting strategies reported in the literature include mixed data splitting, "cell-blind" data splitting, and "drug-blind" data splitting. By replicating and examining mainstream methods in depth, this study found that model effectiveness was overstated because of differing levels of data leakage under the commonly employed data splitting strategies, and that the model’s ability to generalize could not be accurately assessed. To address this issue, we designed an evaluation of the mixed data splitting strategy using random matrix replacement and random guessing. Additionally, we proposed a simple baseline method that relies solely on the Ground Truth Response (GTR) of the training data for evaluating the "cell-blind" and "drug-blind" data splitting strategies. The results 
indicated that, under the mixed data splitting strategy, a similarity-network-based cell line-drug response prediction model trained on a generated pseudo-data matrix in place of the real data matrix showed no significant difference in evaluation from the model trained on the actual cell-drug features. Furthermore, in the "cell-blind" and "drug-blind" split validations, using the mean of the training cell/drug responses as the predicted value yielded results that showed no significant difference from the predictions obtained through model training. This suggests that the model did not extract useful information from the input cell line and drug representations for prediction. The mixed, "cell-blind," and "drug-blind" splitting strategies all suffer from data leakage and thus fail to effectively validate the model’s predictive and generalization capabilities. Because of these data leakage issues, we proposed new "cell-blind," "drug-blind," and double-blind data splitting strategies to validate the CDR prediction model. These strategies aim to address the problems inherent in the existing data splitting methods and lay the groundwork for subsequent algorithm research and model validation. 2. Comparative study of drug representation learning methods. To compare different drug representation learning methods, particularly graph-based representations and pre-trained large-scale compound models, we first used Bayesian optimization and nested cross-validation to select a classifier. Using the classifier with the optimal parameter combination, we input features obtained from six distinct drug representation learning methods into the classifier to train the CDR prediction model. The model’s performance was evaluated using the novel cell-blind data splitting method, and the contribution of each drug representation method to the model was assessed by comparing 
its performance. This process aims to enhance our understanding of the significance of different drug representation learning methods in predicting drug responses. During classifier selection, AdaBoost, Elastic Net, k-Nearest Neighbors, Random Forest, Multilayer Perceptron, and Deep Neural Network (DNN) were compared in terms of their effectiveness in drug response prediction when using gene expression and drug features from the same cells as input data. DNN achieved the best predictive performance among these classifiers. Subsequently, with DNN as the classifier, the study compared the roles of SMILES, fingerprints, molecular descriptors, molecular graph data structures, and pre-trained large-scale compound models in the prediction model. The results revealed that when graph data structures and pre-trained large-scale compound models were used as drug input features, the model exhibited superior predictive performance, with the pre-trained large-scale compound models producing the best results. The AUC of the model’s predictions was approximately 0.89, and the mean average precision (mAP) was 0.88. 3. Representation study of a deep fusion algorithm for multi-omics data of cancer cell lines. The deep integration of multi-omics data, including somatic mutations, copy number variations, and gene transcriptome data, enables a more comprehensive understanding of complex biological systems and disease mechanisms. GCN is an emerging and promising algorithm for achieving this deep integration of multi-omics data. The primary challenge lies in effectively conducting marginal and joint representation learning for different modalities within the GCN framework to address data redundancy and complementarity. Therefore, further research was conducted to explore effective methods of integrating marginal and joint representations. Building upon the findings of the previous sections, a two-layer GCN model was 
constructed. To validate its performance, the model was evaluated under a novel "drug-blind" data splitting strategy, and the roles of three different methods of marginal and joint representation learning in constructing predictive models were assessed. This analysis aims to identify the optimal integration strategy for achieving effective fusion of multi-omics data. The results indicated that, compared to using only marginal representation or only joint representation, a deep fusion strategy that integrates marginal and joint representations yields the best performance in extracting cell features for drug response prediction modeling. The AUC values of the five drugs for three targeted therapies and four chemotherapeutic drugs were the highest under this strategy, indicating the best performance in drug response prediction. Furthermore, when compared to single-omics data or pairwise combination strategies, the cell representation method based on three omics features demonstrated the best predictive performance. 4. Research on a drug response prediction algorithm based on a bipartite heterogeneous graph convolutional neural network. In this section of the study, a Bipartite Heterogeneous Graph (BHG) data structure was constructed with cells and drugs as vertices, and the R-HGCN model for predicting anti-tumor drug response was developed by combining transductive learning and inductive learning. This model treats the prediction task as link prediction on the BHG, using transductive learning to characterize drug and cell features and inductive learning to generate prediction results. The algorithmic workflow of R-HGCN includes the following steps: first, constructing a BHG network based on interactions between drugs and cells; second, training a two-layer HGCN network to output drug node features fused with cell feature mappings, and obtaining cell line features using simple convolutional layers; finally, concatenating drug and cell features and using a DNN classifier to produce prediction results. The model 
training results indicated that on the GDSC dataset, R-HGCN achieved a prediction accuracy of approximately 0.87 and an AUC of 0.85, with an average precision (AP) and F1 score of around 0.87 and 0.80, respectively. On the CCLE dataset, the model’s prediction accuracy improved further, reaching approximately 0.9. Compared to all baseline models selected in this study, R-HGCN showed some improvement in prediction accuracy and precision. During robustness validation using the PDX and TCGA datasets, the model demonstrated better generalization than existing methods. Additionally, when the R-HGCN model interpreter was used to generate cell gene scores, the target genes of specific drugs ranked highly, indicating the model’s biological interpretability. Conclusion: 1. Common drug response prediction model validation experiments were found to suffer from data leakage, and a new validation data splitting strategy based on cell-blind, drug-blind, and double-blind principles was proposed. 2. Compared to drug representation methods based on linear structures and physicochemical properties, methods based on graph representation and pre-trained large-scale compound models better represent drug features. 3. Integrating multi-omics data through a fusion strategy that combines marginal representation and joint representation learning yields a better feature representation of cells. Compared to genetic mutations and copy number variations, gene transcriptome data plays a more significant predictive role in the fused data. 4. The anti-tumor drug response prediction model R-HGCN, constructed based on a BHG neural network, demonstrates significantly improved accuracy and generalization compared to baseline methods. Innovation: 1. Based on current CDR prediction models, limitations of the three commonly used model validation data splitting strategies were identified, and feasible data validation splitting strategies were 
proposed. 2. A novel CDR prediction model, R-HGCN, which combines inductive and transductive feature learning approaches, was introduced. The model’s generalization performance and interpretability were analyzed. |
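
The simple Ground Truth Response (GTR) baseline described for the "cell-blind" split can be illustrated with a short sketch. This is not the study's implementation: the function names (`cell_blind_split`, `gtr_mean_baseline`, `binarize`), the toy response matrix, and the thresholds are all hypothetical and chosen only for illustration. The sketch assumes a cells-by-drugs matrix of logIC50 values, holds out whole cell lines, and predicts each held-out cell's response to a drug as that drug's mean response over the training cells.

```python
import numpy as np


def cell_blind_split(n_cells, test_fraction, rng):
    """Split cell indices so that no cell line appears in both sets."""
    perm = rng.permutation(n_cells)
    n_test = max(1, int(round(n_cells * test_fraction)))
    return np.sort(perm[n_test:]), np.sort(perm[:n_test])


def gtr_mean_baseline(response, train_cells, test_cells):
    """Predict held-out cells from training responses only.

    response: (n_cells, n_drugs) array of logIC50 values, NaN if unmeasured.
    Each held-out cell receives, for every drug, the mean logIC50 of that
    drug over the training cells. No cell or drug features are used.
    """
    drug_means = np.nanmean(response[train_cells], axis=0)
    return np.tile(drug_means, (len(test_cells), 1))


def binarize(log_ic50, thresholds):
    """Label 1 (resistant) above the per-drug threshold, else 0 (sensitive)."""
    return (np.asarray(log_ic50) > np.asarray(thresholds)).astype(int)


# Toy data (hypothetical): 4 cell lines x 2 drugs.
response = np.array([
    [1.0, -2.0],
    [3.0, 0.0],
    [2.0, -1.0],
    [0.0, 1.0],
])
train_cells = np.array([0, 1, 2])
test_cells = np.array([3])

predicted = gtr_mean_baseline(response, train_cells, test_cells)
labels = binarize(predicted, thresholds=[1.5, 0.0])
print(predicted, labels)
```

If a trained model scores no better than this feature-free baseline on the same split, that is consistent with the observation above that the model is not extracting useful information from the cell line and drug representations. The "drug-blind" variant would follow the same pattern with rows and columns swapped.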