Molecular docking has become a routine technique in modern lead discovery and optimization, but its real-world performance remains unsatisfactory because of the inaccuracy of scoring functions (SFs). One possible explanation is that classical SFs typically assume an additive functional form that represents a linear relationship between the binding affinity and the features characterizing a protein–ligand complex. In recent years, machine learning-based scoring functions (MLSFs) have been proposed as a more flexible alternative that can automatically learn generalized nonlinear functional forms from training data instead of relying on a predetermined functional form. However, despite the rapid development of MLSFs and their substantially better performance than classical methods in most previous studies, the current assessment environment is unfair, so whether MLSFs can consistently outperform classical methods still needs further exploration. On the other hand, the emergence of novel featurization strategies for protein–ligand complexes and the rapid advances in the new generation of artificial intelligence (AI) algorithms, such as ensemble learning and deep learning, have brought new opportunities for the further development of MLSFs. Herein, we systematically explored the performance of MLSFs in terms of the three major capabilities of an SF, i.e., scoring power, docking power and screening power. The main contents and conclusions are as follows: (1) To better recognize the potential of classical SFs, a comparative assessment of 25 commonly used SFs was conducted. The scoring power was systematically estimated by refitting the individual energy terms with state-of-the-art ML methods in place of the original linear regression. The results show that the newly developed MLSFs consistently performed better than the classical ones. In particular, gradient boosting decision tree (GBDT) and random forest (RF) achieved the best predictions in most cases. Structural and sequence 
similarities between the training and test proteins could exert a significant impact on the final performance, but the superiority of MLSFs could be fully guaranteed when the training set contained sufficient similar targets. Moreover, the effect of combining features from multiple SFs was explored, and the results indicated that combining NNscore2.0 with one to four other classical SFs could yield the best scoring power. (2) Because users may care more about the performance of an SF in virtual screening (VS), a systematic assessment was carried out to re-evaluate the effectiveness of 14 reported generic MLSFs in VS. Overall, most of these MLSFs could hardly achieve satisfactory results on any dataset, and they could not even outperform the baseline of the classical Glide SP. RFscore-VS trained on the DUD-E dataset showed its superiority for most targets. However, in most cases, it showed rather limited performance on targets that were dissimilar to the proteins in the corresponding training sets. Taken together, generic MLSFs may generalize too poorly to be applicable to all real VS campaigns. Therefore, this type of method should be used for VS with considerable caution. (3) To benchmark the VS performance of target-specific MLSFs on a relatively unbiased dataset, MLSFs trained on three representative protein–ligand interaction representations were assessed on the LIT-PCBA dataset, and the classical Glide SP SF and three types of ligand-based quantitative structure–activity relationship (QSAR) models were also used for comparison. Two major aspects of VS campaigns, prediction accuracy and hit novelty, were systematically explored. The results illustrate that the tested target-specific MLSFs generally outperformed the classical Glide SP SF, but they could hardly outperform the 2D fingerprint-based QSAR models. In terms of the correlations between the hit ranks or the structures of the 
top-ranked hits, the MLSFs developed with different featurization strategies were able to identify quite different hits. (4) Based on several cross-docking datasets purpose-built from the PDBbind database, several MLSFs designed for protein–ligand binding pose prediction were developed. The results illustrate that using ECIF, Vina energy terms and docking pose ranks as features achieves the best performance in most validation tests. Our results also highlight the importance of incorporating cross-docked poses into the training of SFs intended to have a wide application domain and high robustness for binding pose prediction. The source code and the newly developed cross-docking datasets are freely available under an open-source license. We believe that our study may provide valuable guidance for the development and assessment of new MLSFs for the prediction of protein–ligand binding poses. To sum up, we have explored several crucial issues in the training and validation of MLSFs from the perspectives of scoring, docking and screening, and constructed several MLSFs and datasets for others to use. This study could provide valuable guidance not only for the development and evaluation of novel MLSFs but also for their potential applications in real-world drug development.
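
As an illustration of the three capabilities named above, the following is a minimal sketch of how scoring, docking and screening power are often quantified. It assumes the Pearson correlation for scoring power, the fraction of top-ranked poses within an RMSD cutoff (2 Å here) for docking power, and the enrichment factor for screening power. These metric choices, the function names and the convention that a higher score means a better prediction are assumptions made for this example and are not taken from the study itself.

```python
import math


def pearson_r(predicted, measured):
    """Scoring power (assumed metric): Pearson correlation between
    predicted scores and experimentally measured affinities."""
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_m = sum(measured) / n
    cov = sum((p - mean_p) * (m - mean_m) for p, m in zip(predicted, measured))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_m = sum((m - mean_m) ** 2 for m in measured)
    return cov / math.sqrt(var_p * var_m)


def docking_success_rate(top_pose_rmsds, cutoff=2.0):
    """Docking power (assumed metric): fraction of complexes whose
    top-ranked pose lies within `cutoff` angstroms RMSD of the native pose."""
    hits = sum(1 for rmsd in top_pose_rmsds if rmsd <= cutoff)
    return hits / len(top_pose_rmsds)


def enrichment_factor(scores, is_active, fraction=0.01):
    """Screening power (assumed metric): active rate in the top-scored
    `fraction` of the library divided by the active rate in the whole library.
    Assumes a higher score means a more favorable prediction."""
    ranked = sorted(zip(scores, is_active), key=lambda pair: pair[0], reverse=True)
    n_top = max(1, int(round(len(ranked) * fraction)))
    actives_in_top = sum(1 for _, active in ranked[:n_top] if active)
    total_actives = sum(1 for active in is_active if active)
    return (actives_in_top / n_top) / (total_actives / len(ranked))
```

For example, calling `enrichment_factor(scores, labels, fraction=0.01)` on a ranked screening library gives the enrichment in the top 1%; a value of 1 corresponds to random selection under this definition.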