Drug discovery is a high-risk, high-investment process with a very low success rate. To improve the efficiency of drug development and reduce its risk and cost, deep learning has been successfully applied to various stages of drug discovery. In particular, rapid and accurate prediction of molecular properties with deep learning can significantly accelerate the discovery and optimization of lead compounds, and it plays an indispensable role in drug discovery. However, current deep-learning-based methods for molecular property prediction still have many problems: most existing models lack interpretability, prediction models generalize poorly, and molecular characteristics are not described comprehensively enough. Among these, the molecular representation is an important factor affecting the accuracy and generalization ability of a model. In view of the shortcomings of existing molecular representations, this paper focuses on the following two tasks:

1. To build a comprehensive molecular representation, this work proposes a molecular property prediction model based on a multi-dimensional representation with an attention mechanism, the Attention based Sequence and Graph Encoder (ASGE). The model uses the Frequent Continuous Subsequence (FCS) algorithm to decompose the SMILES sequence into smaller subsequence structures or single atoms, and builds a molecular sequence encoder on the Transformer architecture to encode molecular sequence features. In addition, RDKit is used to convert SMILES into a molecular graph, and the graph attention network Attentive FP encodes the molecular graph to extract the atom and bond structure information contained in the molecule and to learn its key nodes and connections. The sequence and graph features are then fused to predict
molecular properties through the designed feature decoder. This work uses 8 datasets from MoleculeNet to train, validate, and test the model, and obtains the best performance on 6 of them. For example, the AUC of ASGE on the ClinTox dataset is 0.081 higher than that of FP-GNN, and its AUC on the BACE dataset is 0.082 higher than that of GraSeq. Visualizing the key nodes that drive a property within the molecule also provides some guidance for further molecular optimization and design. Ablation experiments verify that multi-dimensional encoding of molecules is necessary for the performance gain, and confirm that the molecular features encoded by ASGE with multi-dimensional representations are more comprehensive and accurate. The model offers a new approach to molecular feature fusion and can be widely used in other drug discovery modeling tasks.

2. By fusing molecular graph information with combined molecular fingerprint information and 3D spatial information to encode a multidimensional characterization of drug molecules, we developed a graph-neural-network-based molecular property prediction model, the 3D Spatial Structure and Molecular Fingerprint Graph Network (3DF-GNN). The model considers molecular representation along three dimensions simultaneously. It uses RDKit to process the two-dimensional molecular graph and the three-dimensional spatial information of the molecule, and encodes them with a convolutional neural network that incorporates the External Attention mechanism to capture important molecular features. In addition, two kinds of molecular fingerprints with different emphases are combined and learned by a deep neural network, and finally all features are fused to predict molecular properties. We conducted experiments on 7 widely used benchmark datasets for classification and regression, and obtained
the best performance on 5 datasets and the second-best performance on 1 dataset. For example, on the FreeSolv dataset the RMSE of 3DF-GNN is 0.671 lower than that of Attentive FP, and on the HIV dataset the AUC of 3DF-GNN is 0.042 higher than that of FP-GNN. Extensive ablation experiments verified the importance of multidimensional encoding of drug molecules that takes spatial information into account, demonstrating the superiority of 3DF-GNN. Analysis of the visualized key nodes in the molecule also provides a reference for further optimization of molecular design. According to our literature survey, 3DF-GNN is the first work that integrates 3D spatial information, the molecular graph, and complementary combined molecular fingerprints to predict molecular properties. In addition to predicting molecular properties accurately, our ideas provide guidance for further exploration of such models, and 3DF-GNN can serve as a powerful and effective computational tool for the challenging problem of molecular representation learning. Finally, we have also created a website for this model that pharmaceutical researchers can use.

In summary, this paper carried out two research works on characterizing drug molecules through multidimensional encoding and developed two neural network architectures. In particular, the fusion of multi-dimensional molecular features and the consideration of high-dimensional molecular structural features proposed here characterize molecules more comprehensively and provide new ideas for molecular representation learning, which has important theoretical and practical value.
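
As an illustration of the sequence branch in the first task, the decomposition of a SMILES string into subsequences or single characters might be sketched as follows. This is a minimal sketch and not the implementation used in this work: it assumes the vocabulary of frequent subsequences has already been mined, and it substitutes a simple greedy longest-match rule for the FCS procedure. The function name `decompose_smiles` and the vocabulary `VOCAB` are hypothetical.

```python
def decompose_smiles(smiles, vocab):
    """Split a SMILES string into vocabulary subsequences, longest match first.

    Assumption: `vocab` is a precomputed set of frequent subsequences.
    Any position not covered by the vocabulary falls back to one character.
    """
    max_len = max((len(entry) for entry in vocab), default=1)
    tokens = []
    i = 0
    while i < len(smiles):
        # Try the longest candidate first, shrinking down to a single character.
        for size in range(min(max_len, len(smiles) - i), 0, -1):
            piece = smiles[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


# Hypothetical vocabulary, chosen only for illustration.
VOCAB = {"CC(=O)", "c1ccccc1", "Cl"}

tokens = decompose_smiles("CC(=O)Oc1ccccc1Cl", VOCAB)
print(tokens)
```

Concatenating the returned tokens always reproduces the input string, so the token sequence can be passed to a sequence encoder without losing information. A character-level fallback is the simplest choice here; treating multi-character atoms such as `Br` as single units would require adding them to the vocabulary.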