Drug design and development is a costly, long-term, and risky process that typically takes about 10 to 15 years and costs billions of dollars. Because biological systems are complex, extensive experiments are required to evaluate the bioactivity, toxicity, and pharmacokinetic properties of candidate drugs during drug discovery. Measuring these properties experimentally for a large collection of candidate drugs is time-consuming, labor-intensive, and expensive. The vast biomedical literature contains a large number of quantitative structure-activity relationships obtained through complex experiments, which can help reduce the reliance on expensive experiments. Unfortunately, molecular structures in the literature are usually presented as images, which computers cannot interpret directly. Molecular image recognition algorithms can therefore help build automated literature mining tools that extract as much relevant molecular property data as possible, thereby accelerating drug discovery.

However, compared with the space of potentially druggable molecules, the currently available drug data are still scarce, so it is also important to build accurate and efficient computational models for predicting molecular properties. Traditional molecular dynamics simulation methods based on density functional theory are difficult to apply at large scale because of their extremely high computational cost. Traditional machine learning methods can greatly improve efficiency, but they depend on substantial expertise and experimental data to construct explicit molecular representations and therefore lack generality. Deep learning methods can automatically learn relevant features from raw data through end-to-end training, which suits molecular data with diverse structures and strong heterogeneity. Deep learning models often require
a sufficient amount of labeled data to be adequately trained. However, drug property data are scarce because they are costly to obtain.

To address the problem of molecular property prediction, this thesis pursues two lines of research. On the one hand, it studies molecular image recognition algorithms that can effectively support the acquisition of drug property data from the literature. On the other hand, it proposes a variety of learning strategies for building deep learning models for molecular property prediction that use data efficiently, generalize well, and are easy to train and deploy, so as to accelerate drug design and development. The main research contents of this thesis are as follows.

First, a molecular image recognition method based on divide and conquer is proposed according to the characteristics of molecular images. Although molecules are complex and diverse, the types of their constituent elements, namely atoms and bonds, are very limited. This thesis therefore proposes to represent atoms and bonds by their center points, and to detect these center points and predict their associated properties with a keypoint-recognition neural network. Once the positions and properties of the atoms and bonds have been identified, the overall molecular structure can be built by a simple reconstruction algorithm and then converted into a machine-readable molecular representation. Through this divide-and-conquer strategy, the model simplifies the recognition task and improves its robustness, which greatly improves recognition accuracy on real molecular images.

Second, a data augmentation strategy for molecular property prediction is proposed. To alleviate the scarcity of labeled molecular data, this thesis proposes a data augmentation strategy that increases the quantity and diversity of the available molecular data, based on the fact that there are several corresponding
SMILES string forms for a given molecule. In addition, this thesis proposes augmenting the test data and fusing the prediction results in the test phase to improve the robustness of the predictions. It further proposes a multi-step attention model based on a bidirectional LSTM to improve feature extraction and data utilization efficiency. Comprehensive comparative experiments demonstrate the effectiveness of the proposed model and strategies.

Third, a molecular graph pre-training method for molecular property prediction is proposed. Based on the graph structure of molecules, this thesis proposes a model named MG-BERT that can be pre-trained on molecular graphs. The model captures environmental information within molecules, which improves its performance on downstream tasks with small amounts of data. Specifically, this thesis integrates the self-attention mechanism of the BERT model with the local message-passing mechanism to construct a new graph neural network variant. Part of the atoms in the input molecular graph are then masked or randomly replaced, and the model is trained to recover the corrupted atom types, which forces it to mine intramolecular environment information. Experiments show that the pre-trained MG-BERT model generates context-sensitive atomic representations, which helps explain the effect of pre-training. In addition, the large-scale pre-trained MG-BERT model is fine-tuned on multiple molecular property prediction tasks and compared with several classical molecular property prediction models. The results demonstrate the effectiveness of the model.

Fourth, a multi-task learning method for molecular property prediction is proposed. In this part, a multi-task learning model named MTL-BERT is constructed to exploit the correlations between different tasks and improve overall prediction performance. The model uses a shared-backbone neural
network architecture, which can perform multiple tasks simultaneously and is therefore efficient. The model also integrates the innovations of the previous two chapters. On the one hand, it uses pre-training to mine latent information hidden in unlabeled data and thereby improve predictive ability on labeled data. On the other hand, it applies data augmentation in the pre-training, fine-tuning, and testing phases to fully exploit the diversity of the data and improve the robustness of its predictions. Applying the model to a large multi-task dataset verifies its effectiveness.
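As an illustration of the atom-corruption step used for pre-training on molecular graphs (masking or randomly replacing part of the atoms and training the model to recover their types), the following is a minimal sketch and not the implementation used in this thesis. It works only on a list of atom symbols and leaves the bonds untouched. The selection rate of 15%, the 80/10/10 split between masking, random replacement, and leaving an atom unchanged, the `[MASK]` token, the atom vocabulary, and the function name `corrupt_atoms` are all assumptions borrowed from common BERT-style practice; the abstract does not specify them.

```python
import random

MASK = "[MASK]"  # hypothetical placeholder token (assumption)
ATOM_VOCAB = ["C", "N", "O", "S", "F", "Cl", "Br", "P", "I"]  # illustrative vocabulary (assumption)


def corrupt_atoms(atoms, select_rate=0.15, rng=None):
    """Corrupt a fraction of atom types for a recovery objective.

    Returns (corrupted, targets): `corrupted` is a copy of `atoms` with the
    selected positions masked, randomly replaced, or left unchanged;
    `targets` maps each selected position to its original atom type, which
    is what the model would be trained to predict.
    """
    if not atoms:
        return [], {}
    rng = rng or random.Random()
    # Select at least one atom, and never more than the molecule contains.
    n_select = min(len(atoms), max(1, round(len(atoms) * select_rate)))
    selected = rng.sample(range(len(atoms)), n_select)

    corrupted = list(atoms)
    targets = {}
    for i in selected:
        targets[i] = atoms[i]
        roll = rng.random()
        if roll < 0.8:        # assumed: 80% of selected atoms are masked
            corrupted[i] = MASK
        elif roll < 0.9:      # assumed: 10% are replaced by a random atom type
            corrupted[i] = rng.choice(ATOM_VOCAB)
        # assumed: the remaining 10% are left unchanged but still predicted
    return corrupted, targets


if __name__ == "__main__":
    # Heavy atoms of ethyl acetate (CCOC(C)=O), listed in SMILES order.
    atoms = ["C", "C", "O", "C", "C", "O"]
    corrupted, targets = corrupt_atoms(atoms, rng=random.Random(0))
    print(corrupted, targets)
```

In this sketch the training loss would be computed only at the positions stored in `targets`, so the model has to infer each hidden atom type from the neighbouring atoms.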