As the starting point of drug research and development, the discovery of hit compounds plays a crucial role in the whole process. Virtual screening is an important technology for hit discovery: it uses computation to rapidly screen potentially active molecules for a specific protein target out of massive compound libraries, significantly reducing the number of compounds that must be tested in the biochemical screening stage. As more and more three-dimensional protein structures are resolved, structure-based virtual screening has shown clear advantages in hit compound discovery. Structure-based virtual screening relies on molecular docking. However, existing molecular docking theory is imperfect, and the many available docking programs differ in performance. If compounds are ranked and screened only by the scoring function of the docking software, a stable and high success rate is hard to achieve. It is therefore of great significance to improve the success rate of structure-based virtual screening by optimizing the molecular docking program and designing the screening scheme.

In this study, machine learning is used to optimize the virtual screening scheme and to improve the success rate and efficiency of hit compound discovery in three respects. First, the molecular docking method is improved to simulate the binding poses of small molecules docked to target proteins. Second, a compound activity classification method is established to predict the activity of small molecules, which are preliminarily screened according to their simulated binding poses. Third, a prediction model of protein-ligand binding affinity is constructed to predict the binding strength between the preliminarily screened small molecules and the target proteins; the hit compounds are then determined after this fine screening. In terms of the above
three processes, the main research contents of this thesis are as follows.

1. A conformation search method based on the fireworks algorithm is proposed. First, conformation search in molecular docking is formulated as an optimization problem. Second, the core strategies for applying the fireworks algorithm to molecular docking are designed, including the explosion operator, the mutation operator, and the fireworks selection strategy. Third, following memetic algorithm theory, the fireworks algorithm is combined with the BFGS quasi-Newton search algorithm: the fireworks algorithm serves as the global optimizer that quickly locates promising regions of the search space, while BFGS performs the local fine search, improving the convergence speed and increasing the chance of finding the optimal solution. Finally, the method is implemented within the AutoDock Vina framework to build the molecular docking program FWAVina, which is tested on the standard test dataset. The results show that, compared with the classic docking program AutoDock Vina, FWAVina converges faster and docks more accurately.

2. Based on ensemble learning and the Spark platform, a compound activity classification method named ENS-VS is proposed. First, ensemble learning is used to fuse protein-ligand interaction features with ligand structural features and to integrate three classification algorithms: support vector machine, naive Bayes, and decision tree. The ensemble is designed to improve the applicability and stability of ENS-VS across different target proteins and, at the same time, to address the severe class imbalance between active and inactive compounds. Second, the method is implemented in parallel on the Spark platform to improve the efficiency of screening active compounds from massive compound libraries. Finally, on the basis of the DUD-E standard database, the
protein family-specific model, the target-specific model, and the general model are constructed, and the situations in which each model applies are summarized. When many active compounds are known for a target, the target-specific model should be adopted. When few active compounds are known for a target, the protein family-specific model should be adopted. When a new target protein appears, the general model should be adopted. The experimental results indicate that, compared with the classic molecular docking program, ENS-VS effectively improves the success rate of active compound screening and can be combined with any molecular docking program.

3. Based on the graph attention network, a prediction model of protein-ligand binding affinity, Complex-Net, is proposed. First, molecular structure data are represented as graphs so that features can be learned automatically at the atomic level. Second, the graph attention network is improved in two ways. On the one hand, a dynamic node feature mechanism is designed: edge information is dynamically added to the node features, and each node feature changes dynamically depending on the aggregating node, which resolves the difficulty of processing edge information. On the other hand, virtual super nodes are introduced as the graph-level feature aggregation mechanism: node-level feature representations are aggregated into a graph-level feature, so that the network can be used for graph-level prediction. Third, multi-task learning with hard parameter sharing is introduced into the model. The prediction of the root-mean-square deviation (RMSD) between the three-dimensional structures of decoys and the native ligand is taken as an auxiliary task, and the dataset is expanded to improve the generalization ability of Complex-Net. Finally, four schemes are used to test the model
performance. The results show that the Pearson and Spearman correlation coefficients obtained by Complex-Net are higher than those of the benchmark method RF-Score and of Pafnucy, a representative method based on convolutional neural networks.

In this thesis, machine learning is used to improve the molecular docking program, to establish the compound activity classification method, and to construct the protein-ligand binding affinity prediction model. The performance of virtual screening is improved in these three respects.
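
The memetic scheme described in item 1 (a fireworks-style global search combined with a local fine search) can be illustrated with a minimal Python sketch. This is not the FWAVina implementation. The objective `toy_score` is a hypothetical stand-in for a docking scoring function, the explosion, mutation, and selection rules are simplified, and a finite-difference gradient descent is swapped in for the BFGS quasi-Newton step so that the example needs only the standard library. All function and parameter names are assumptions made for illustration.

```python
import random


def toy_score(x):
    """Hypothetical stand-in for a docking score (lower is better); minimum at x_i = 1."""
    return sum((xi - 1.0) ** 2 for xi in x)


def local_refine(f, x, steps=20, h=1e-6):
    """Local fine search by finite-difference gradient descent with backtracking.

    Swapped in for the BFGS quasi-Newton step named in the text; it only ever
    accepts a move that lowers f.
    """
    x = list(x)
    fx = f(x)
    for _ in range(steps):
        grad = []
        for i in range(len(x)):
            shifted = x[:]
            shifted[i] += h
            grad.append((f(shifted) - fx) / h)
        step = 1.0
        while step > 1e-10:
            cand = [xi - step * gi for xi, gi in zip(x, grad)]
            f_cand = f(cand)
            if f_cand < fx:
                x, fx = cand, f_cand
                break
            step *= 0.5
        else:
            break  # no improving step size found: treat as converged
    return x


def fireworks_search(f, dim, lo=-5.0, hi=5.0, n_fireworks=5, n_sparks=30,
                     max_amp=2.0, min_amp=0.05, iters=30, seed=0):
    """Simplified fireworks search with local refinement of the elite (memetic)."""
    rng = random.Random(seed)
    eps = 1e-12

    def clip(v):
        return min(hi, max(lo, v))

    pop = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_fireworks)]
    for _ in range(iters):
        fit = [f(p) for p in pop]
        best, worst = min(fit), max(fit)
        total_quality = sum(worst - v for v in fit) + eps
        total_badness = sum(v - best for v in fit) + eps
        candidates = [p[:] for p in pop]
        for p, v in zip(pop, fit):
            # Explosion operator: better fireworks emit more sparks
            # within a smaller amplitude.
            count = max(1, round(n_sparks * (worst - v + eps) / total_quality))
            amp = max(min_amp, max_amp * (v - best + eps) / total_badness)
            for _ in range(count):
                candidates.append([
                    clip(xi + rng.uniform(-amp, amp)) if rng.random() < 0.5 else xi
                    for xi in p
                ])
        # Mutation operator: Gaussian rescaling of randomly chosen fireworks.
        for _ in range(n_fireworks):
            g = rng.gauss(1.0, 1.0)
            candidates.append([clip(xi * g) for xi in rng.choice(pop)])
        # Selection: refine and keep the elite, fill the rest at random for diversity.
        candidates.sort(key=f)
        elite = local_refine(f, candidates[0])
        pop = [elite] + rng.sample(candidates[1:], n_fireworks - 1)
    return min(pop, key=f)


if __name__ == "__main__":
    best = fireworks_search(toy_score, dim=3)
    print(best, toy_score(best))
```

In a real docking setting the search vector would encode ligand position, orientation, and torsion angles, and the objective would be the docking scoring function. The division of labor is the point of the sketch: sparks explore broadly, and the local step polishes the most promising candidate.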
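
The evaluation in item 3 compares models by the Pearson and Spearman correlation coefficients between predicted and experimental binding affinities. As a reminder of what the two metrics measure, the following standard-library sketch computes both. The function names are illustrative, and the affinity values in the usage section are made up for demonstration; they are not results from this thesis.

```python
import math


def pearson(x, y):
    """Pearson correlation: strength of the linear relationship.

    Assumes equal-length, non-constant inputs.
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def average_ranks(values):
    """Ranks starting at 1; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks (monotonic agreement)."""
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    # Made-up affinities (e.g. pK values) purely for illustration.
    experimental = [4.2, 5.1, 6.3, 7.0, 8.4]
    predicted = [4.8, 5.0, 6.9, 6.5, 8.0]
    print("Pearson :", pearson(experimental, predicted))
    print("Spearman:", spearman(experimental, predicted))
```

Pearson rewards predictions that are linearly related to the measured values, whereas Spearman only checks whether compounds are ranked in the right order, which is often what matters when prioritizing hits.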