| Proteins and DNA are biological macromolecules that play essential roles in organisms, and their interactions, i.e., protein-DNA interactions, are central to many biological activities. Studying protein-DNA interactions helps in understanding biological functions and supports the research and development of new drugs. Only a small fraction of residues, called hot spots, contribute most of the binding free energy in protein-DNA interactions. Identifying hot spot residues therefore offers a new direction for studying the potential binding mechanisms and stability of protein-DNA complexes. Alanine scanning, an experimental technique used to determine the contribution of a specific residue to the stability or function of a given protein, has been widely used to measure the binding free energy of residues in protein-DNA complexes. The technique mutates residues at the protein-DNA interface to alanine and then identifies hot spots from the resulting changes in binding free energy. However, because experimental techniques for identifying hot spots are relatively inefficient and labor-intensive, computational approaches for predicting hot spots are needed. Most existing computational methods identify hot spots mainly on the basis of molecular mechanics, but limitations in the available protein structures and in predictive efficiency restrict the applicability of these methods. In contrast, machine learning-based approaches can overcome these shortcomings. We proposed two computational methods, PrPDH (Prediction of Protein-DNA binding Hot spot) and PrPDH-V2, which are feature-based machine learning approaches for predicting hot spots at protein-DNA interfaces. A user-friendly web server is freely available at http://bioinfo.ahu.edu.cn:8080/PrPDH/. In the first part of our work (PrPDH), we obtained 108 protein-DNA complexes with 414 amino acid mutations from dbAMEPNI and SAMPDI. To reduce the bias between the two different original
datasets, we processed the data and obtained a final dataset of 150 mutations for training and 64 mutations for testing. We systematically assessed 114 features derived from protein sequence, structure, network, and solvent accessibility information, together with their combinations, using various feature selection strategies for hot spot prediction. We then trained and compared four commonly used machine learning models, namely SVM, random forest, Naive Bayes, and K-nearest neighbor, for hot spot identification using 10-fold cross-validation and the independent test set. Finally, we developed a feature-based predictor called PrPDH, which identifies hot spots using an SVM built on the 10 selected optimal features. Comparative results on benchmark datasets indicate that our predictor generally outperforms state-of-the-art predictors in hot spot prediction, yielding an F1 score of 0.721 and an AUC of 0.803 on the training set, and an F1 score of 0.706 and an AUC of 0.764 on the test set. In the second part of our work (PrPDH-V2), we proposed a feature encoding method based on residue neighbor information to improve the hot spot prediction model. Based on the protein-DNA binding mechanism, we hypothesized that the change in a residue's features between the complex and the monomer might reflect the binding free energy of that residue. Moreover, previous studies suggest that neighbor information can help achieve better performance. We calculated the number of hydrogen bonds a residue forms as a donor when binding to DNA. In addition, we proposed a feature encoding method based on neighbor information. We obtained a total of 41 features and selected 8 optimized features using the SVM-based recursive feature elimination (SVM-RFE) feature selection method. Finally, we established an improved predictor called PrPDH-V2 using the SVM algorithm. Comparative results on benchmark datasets show that PrPDH-V2 achieves a substantial improvement in
performance compared with PrPDH and other methods, yielding an F1 score of 0.787 and an AUC of 0.871 on the training set, and an F1 score of 0.755 and an AUC of 0.852 on the test set. The better performance of PrPDH-V2 indicates the advantage of the feature encoding method for identifying hot spot residues. |
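
As a minimal sketch of two steps mentioned in the abstract, the following shows how alanine-scanning free energy changes could be turned into hot spot labels and how an F1 score is computed from binary predictions. It is an illustration under stated assumptions, not the implementation used in PrPDH or PrPDH-V2: the cutoff of 1.0 kcal/mol, the residue identifiers, the example values, and the function names `label_hot_spots` and `f1_score` are all hypothetical, since the abstract does not specify them.

```python
# Illustrative sketch only; not the code behind PrPDH or PrPDH-V2.

# ASSUMPTION: the abstract does not state the free energy cutoff that
# defines a hot spot; 1.0 kcal/mol is used here purely for illustration.
DDG_CUTOFF = 1.0


def label_hot_spots(ddg_by_residue, cutoff=DDG_CUTOFF):
    """Label each residue as a hot spot (True) if the binding free energy
    change upon mutation to alanine is at least `cutoff`."""
    return {res: ddg >= cutoff for res, ddg in ddg_by_residue.items()}


def f1_score(y_true, y_pred):
    """F1 score (harmonic mean of precision and recall) for binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    if tp == 0:
        # No true positives: precision and recall are zero or undefined.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up example values (hypothetical residue IDs and energies).
ddg = {"A:R12": 2.3, "A:K45": 0.4, "A:S77": 1.0, "A:E90": -0.2}
labels = label_hot_spots(ddg)

# Hypothetical classifier output for the same four residues.
predicted = {"A:R12": True, "A:K45": True, "A:S77": False, "A:E90": False}

residues = sorted(ddg)
score = f1_score([labels[r] for r in residues],
                 [predicted[r] for r in residues])
# One true positive, one false positive, one false negative:
# precision = recall = 0.5, so the F1 score is 0.5.
```

In this framing, the predictor replaces the hypothetical `predicted` dictionary with the output of a trained classifier, and the same F1 computation applies to both the cross-validation folds and the independent test set.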