| Drug therapy plays a vital role in human life and health. However, drug research and development is a long and complex process that demands substantial manpower and funding. High research and development costs ultimately affect drug prices and the treatment options available to patients. Molecular property prediction is an important task in drug discovery: it helps researchers identify drug candidates and speeds up development, thereby reducing its cost. Deep learning has already produced some results in drug research and development, and continuously improving the accuracy and reliability of molecular property prediction has become a main pursuit of researchers. This thesis studies molecular property prediction. To alleviate the scarcity of labeled compound data and extract effective molecular representations, it uses pre-trained language models to learn compound knowledge from a large-scale unlabeled compound corpus and then transfers that knowledge to small labeled datasets. The main research work is divided into the following two parts. First, to encode the substructural features of molecules, a molecular-fingerprint-based molecular property prediction model (FP-BERT) is proposed, which uses stacked Transformer encoders to learn bidirectional molecular representations from a compound corpus. Each compound in the labeled dataset is represented as a set of molecular substructures, and its learned representation is obtained by encoding the substructures in the molecular fingerprint with the pre-trained FP-BERT model. A CNN-based prediction model is then built for supervised learning. Second, to construct a more comprehensive molecular representation, a multi-view molecular property prediction model, MV-Mol BERT, is proposed, which integrates information from different molecular representations. MV-Mol BERT encodes each compound from two views, SMILES (Simplified Molecular Input Line Entry System) and molecular fingerprints, and extracts high-dimensional features from each view with a CNN. The representations of the two views are then concatenated into a multi-view molecular representation, on which a neural network prediction model is built for supervised learning of molecular properties. The predictive performance of FP-BERT and MV-Mol BERT was evaluated on a classification dataset (HIV) and on regression datasets (ESOL, FreeSolv, Lipophilicity, Malaria, CEP). The experimental results demonstrate the ability of FP-BERT to extract molecular fingerprint features. In addition, the multi-view model MV-Mol BERT achieves better performance than FP-BERT. |
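The multi-view idea described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis implementation: `toy_encoder` is a hypothetical stand-in for the pre-trained encoder plus CNN feature extractor, the character-level SMILES tokenization and the `fp_<bit>` token naming are simplifying assumptions, and the fingerprint is assumed to be given as a list of set-bit indices. Only the overall flow (two views encoded separately, then concatenated) follows the text.

```python
from typing import List, Sequence


def fingerprint_to_tokens(on_bits: Sequence[int]) -> List[str]:
    """Turn the set bits of a fingerprint into a token 'sentence'.

    Assumption: each set bit stands for one molecular substructure.
    """
    return ["[CLS]"] + [f"fp_{bit}" for bit in sorted(set(on_bits))]


def smiles_to_tokens(smiles: str) -> List[str]:
    """Character-level SMILES tokenization (a simplification)."""
    return ["[CLS]"] + list(smiles)


def toy_encoder(tokens: Sequence[str], dim: int = 4) -> List[float]:
    """Hypothetical stand-in for a pre-trained encoder followed by a CNN.

    Maps a token sequence to a fixed-length vector by bucketing tokens
    deterministically and normalizing the bucket counts.
    """
    vec = [0.0] * dim
    for tok in tokens:
        vec[sum(map(ord, tok)) % dim] += 1.0
    return [v / len(tokens) for v in vec]


def multi_view_representation(
    smiles: str, on_bits: Sequence[int], dim: int = 4
) -> List[float]:
    """Encode the SMILES view and the fingerprint view, then concatenate."""
    smiles_view = toy_encoder(smiles_to_tokens(smiles), dim)
    fp_view = toy_encoder(fingerprint_to_tokens(on_bits), dim)
    return smiles_view + fp_view  # list concatenation: 2 * dim features


if __name__ == "__main__":
    # Ethanol as SMILES, with made-up fingerprint bit indices.
    features = multi_view_representation("CCO", [3, 17, 42])
    print(len(features), features)
```

In a real pipeline the concatenated vector would feed a supervised prediction head (classification or regression); here it is only printed.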