With the continuously increasing cost and difficulty of drug research and development, researchers have reached a consensus that computational methods should be used to speed up the entire development process. Meanwhile, breakthroughs in Artificial Intelligence (AI) technology in recent years have given rise to new ideas in many fields, AI-based research has become a focus of attention in each of them, and drug discovery has benefited as well. Thanks to their powerful learning, understanding and modelling capabilities, AI technologies have led researchers to realize that computing is no longer just an aid to drug development: AI-based approaches now span the entire development process, and AI-driven drug design will be an important part of drug development in the future. Although antibodies are a newer class of drugs that is currently favored, small molecule drugs remain the mainstream of drug development at this stage. This paper provides a comprehensive and in-depth study of AI-driven small molecule drug design and optimization. The introductory first chapter details the dilemmas faced by traditional drug development methods and the rapid growth of AI-driven drug companies, and systematically discusses the definition and current research status of AI small molecule drug design, as well as the shortcomings of the field as a whole. The paper then investigates AI-based small molecule drug design and optimization in four directions: X-MOL, a large-scale pre-training model for molecular representation and specific tasks; EIAA (Edge Induced Atom-wised Attention), a molecular graph network model; DUAL-MOL, a molecular dual learning framework; and Drug Line, a preliminary general framework for AI-based small molecule drug design. Finally, the paper concludes with thoughts and perspectives on the future development of this field. The specific work in this paper is as
follows:

(1) Large-scale pre-training model for molecular representation and specific tasks, X-MOL. In Chapter 2, this paper develops X-MOL, a very large-scale molecular pre-training model based on an attention mechanism. Specifically, X-MOL uses a special Encoder-Decoder structure with shared parameters, which reduces the size of the model and effectively increases computational speed. With a pre-training task designed around the characteristics of SMILES, X-MOL learns the SMILES grammar rules from more than 1.1 billion small molecules and, on this basis, achieves a good understanding of molecules. X-MOL is then fine-tuned on five categories of downstream tasks (molecular property prediction, chemical reaction prediction, drug-drug interactions, de novo molecular generation, and molecular optimization), for a total of 12 specific tasks. In each of these tasks, X-MOL performs at the current state-of-the-art level. In addition, in the compound-target interaction (CPI) prediction task, a model that takes X-MOL’s molecular representation as its small molecule input also shows a significant performance improvement over the original model. These experimental results demonstrate the excellent performance of X-MOL and the great potential of molecular representation learning with large-scale pre-training in small molecule drug design. This performance also means that subsequent researchers can model problems in AI drug development rapidly and with good results, without designing a separate model for each. In a further study, this paper first visualizes the attention mechanisms of X-MOL in various tasks, demonstrating the ability of X-MOL to understand molecules and revealing how its understanding differs across tasks. Finally, this paper explores whether adding auxiliary information helps X-MOL achieve better performance in downstream tasks; ablation
experiments show that X-MOL already learns enough information on its own and that the additional information may even have a negative impact on it.

(2) Molecular graph network model, EIAA. In Chapter 3, in response to the lack of consideration of chemical bonds in current molecular graph networks, this paper proposes a graph network model designed specifically for small molecules, the Edge Induced Atom-wised Attention (EIAA) network. EIAA classifies all interatomic connections into seven types and uses a special attention mechanism to simulate computationally the interactions between atoms within a molecule through different chemical bonds. Meanwhile, the bonds themselves are updated using a "bond updating mechanism with the influence of atoms at both ends". In addition, EIAA introduces a virtual super-node for whole-graph pooling. The unidirectional connection between the super-node and the atoms ensures that the super-node can extract whole-molecule-level features without affecting the normal updating of the atoms. Compared with the traditional graph attention network (GAT), EIAA shows significant advantages on all 19 tasks of QM9. This chapter also discusses the performance of deeper EIAA models: experiments with 4 to 8 layers on the QM9 dataset show that increasing the number of layers effectively improves the final performance of EIAA on complex tasks.

(3) Molecular dual learning framework, DUAL-MOL. In Chapter 4, this paper proposes DUAL-MOL, a novel dual learning framework for training molecular tasks. DUAL-MOL exploits the duality between two different types of molecular tasks and couples their training processes through a simple regularization term, allowing the two models to constrain each other during training and ultimately achieve better performance. In this chapter, the DUAL-MOL framework is constructed with the molecular property prediction task and the de novo molecular generation task as exemplars, and is experimentally validated on 140
tasks screened from ChEMBL. The extensive experimental results show that, compared with the conventional strategy of training each task separately, the DUAL-MOL framework effectively improves performance on both the molecular property prediction task and the de novo molecular generation task. In a later comparison with SSVAE, another framework that combines prediction tasks to facilitate molecule generation, DUAL-MOL also performs better and takes much less time to complete the same task. Finally, this paper studies how the performance improvement of DUAL-MOL varies across tasks. The results show that the smaller the amount of data, the greater the improvement achieved by DUAL-MOL, which is of great significance for drug design, a field that lacks labeled data.

(4) Preliminary development of a general framework for AI-based small molecule drug design, Drug Line. In Chapter 5, this paper designs and implements Drug Line, a general AI-driven small molecule drug design framework that can realize a "one-click AI drug development process". To address the differences and connections between different AI drug design works, this paper devises an original core architecture for Drug Line called Package&Pool. Package is a way of packaging the methods proposed by different researchers. Its design modularizes the whole framework and gives Drug Line a high degree of flexibility and extensibility. Pool is a set of storage standards and read logic for the framework's internal data. By standardizing different kinds of data and making them accessible in real time, Pool enables drug design workflows to run automatically within the framework, and its standard file system also makes Drug Line compatible with new methods that may appear in the future. At the implementation level, the configuration file design and the Python package format further lower the barrier to use.
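
The Package&Pool idea can be illustrated with a minimal Python sketch. This is not the Drug Line implementation: the class names `Pool` and `Package`, the function `run_workflow`, the key names, and the toy `generate` and `score` methods are all hypothetical and chosen only for illustration. The sketch assumes that a Package declares which Pool entries it reads and which entry it writes, so that a workflow can run automatically once the first input is stored.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


class Pool:
    """Hypothetical shared store: every item is saved under a standardized key."""

    def __init__(self) -> None:
        self._items: Dict[str, Any] = {}

    def put(self, key: str, value: Any) -> None:
        self._items[key] = value

    def get(self, key: str) -> Any:
        if key not in self._items:
            raise KeyError(f"Pool has no entry named {key!r}")
        return self._items[key]


@dataclass
class Package:
    """Hypothetical wrapper that gives any method a uniform interface."""

    name: str
    inputs: List[str]        # Pool keys the wrapped method reads
    output: str              # Pool key the wrapped method writes
    run: Callable[..., Any]  # the wrapped method itself

    def execute(self, pool: Pool) -> None:
        args = [pool.get(key) for key in self.inputs]
        pool.put(self.output, self.run(*args))


def run_workflow(packages: List[Package], pool: Pool) -> Pool:
    """Run packages in order; each one finds its inputs in the Pool."""
    for package in packages:
        package.execute(pool)
    return pool


# Toy stand-ins for real generation and scoring methods (illustrative only).
def generate(seed: str) -> List[str]:
    return [seed, seed + "C", seed + "O"]


def score(molecules: List[str]) -> Dict[str, int]:
    return {smiles: len(smiles) for smiles in molecules}


workflow = [
    Package("generator", ["seed_smiles"], "candidates", generate),
    Package("scorer", ["candidates"], "scores", score),
]

pool = Pool()
pool.put("seed_smiles", "CC")
run_workflow(workflow, pool)
print(pool.get("scores"))
```

In this sketch a new method is added by wrapping it in another `Package`, without touching the existing ones, which is one plausible reading of the modularity and extensibility described above.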