| Knowledge of protein binding sites contributes to the understanding of protein function at the molecular level, disease mechanisms, and drug development. Although current methods provide modestly accurate predictions, they still face several issues: 1) low accuracy: current predictors struggle to distinguish different types of binding sites or predict non-binding sites as binding sites, and predictors of protein-nucleic acid binding sites for intrinsically disordered proteins are urgently needed; 2) poor generalization ability: structure-trained predictors perform worse on intrinsically disordered proteins, and vice versa. This thesis proposes a set of sequence-based predictors to address or alleviate these issues. The main research contents and innovations are as follows: (1) To alleviate cross-prediction in protein-protein binding site prediction, a dynamic selection predictor named PROBselect is proposed. A comprehensive analysis at the dataset level and the protein level indicates that the performance of different predictors varies across proteins. Based on this analysis, PROBselect uses an SVR model to estimate the confidence of the predictions from individual methods. It then dynamically selects the predictor for each input protein based on the confidence score and the proportion of binding residues predicted by SCRIBER. The results show that PROBselect reduces cross-prediction rates and further improves prediction accuracy. (2) Disorder-trained predictors are less accurate and generate higher false positive rates. Moreover, only one predictor provides protein-nucleic acid binding predictions for intrinsically disordered proteins. Thus, a multitask learning-based predictor named DeepDISOBind is proposed to improve the accuracy of predictions of protein-, DNA-, and RNA-binding sites in intrinsically disordered proteins. DeepDISOBind learns features shared by the three prediction tasks in a common layer and then progressively builds individual modules for the prediction of protein-, DNA-, and RNA-binding sites. The results show that DeepDISOBind improves accuracy, reduces false positive rates, and outperforms single-task learning predictors. (3) A modular deep learning model, DeepPRObind, is proposed to alleviate the high false positive rates of protein-protein binding site prediction and to address the problem that current predictors rarely achieve accurate predictions on both structure-annotated and disorder-annotated proteins. Using residue-level features and aggregated features (window-level and protein-level) as inputs, DeepPRObind constructs separate predictive models based on residue attention modules for structure-annotated and disorder-annotated proteins. It then integrates the outputs of these two models through a non-learning fusion strategy. DeepPRObind achieves the best results on the whole dataset, the structure-annotated sub-dataset, and the disorder-annotated sub-dataset while reducing cross-prediction and over-prediction rates. (4) A meta-predictor, HybridRNAbind, is proposed to address the problem that current predictors of protein-RNA binding sites rarely achieve accurate predictions on both structure-annotated and disorder-annotated proteins. The performance of current predictors indicates that the best structure-trained predictors perform worse on disorder-annotated proteins, and vice versa. Based on these observations, HybridRNAbind uses DeepDISOBind and NCBRPred as base models and integrates their results with the Random Forest algorithm. The results show that HybridRNAbind achieves accurate predictions across structure-annotated and disorder-annotated proteins and alleviates the problem of over-prediction. |
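
The per-protein dynamic selection described for PROBselect in item (1) can be illustrated with a minimal sketch. This is not the thesis implementation. It assumes that each base predictor already comes with an estimated confidence for the protein at hand (in the thesis these estimates come from an SVR model; here they are supplied as plain numbers), and it omits the role of the proportion of binding residues predicted by SCRIBER, which the abstract does not specify in detail. The names `Prediction`, `select_predictor`, `dynamic_selection`, `predictor_a`, and `predictor_b` are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Prediction:
    """Output of one base predictor for one protein (hypothetical container)."""
    propensities: List[float]  # per-residue binding propensities in [0, 1]
    confidence: float          # estimated quality of this prediction (assumed given)


def select_predictor(predictions: Dict[str, Prediction]) -> str:
    """Return the name of the base predictor with the highest estimated confidence."""
    if not predictions:
        raise ValueError("at least one base prediction is required")
    return max(predictions, key=lambda name: predictions[name].confidence)


def dynamic_selection(
    proteins: Dict[str, Dict[str, Prediction]]
) -> Dict[str, List[float]]:
    """For each protein, keep the propensities of the predictor selected for it."""
    selected = {}
    for protein_id, predictions in proteins.items():
        chosen = select_predictor(predictions)
        selected[protein_id] = predictions[chosen].propensities
    return selected


# Toy input: two proteins, each scored by two hypothetical base predictors.
example = {
    "P1": {
        "predictor_a": Prediction([0.9, 0.1, 0.2], confidence=0.41),
        "predictor_b": Prediction([0.6, 0.3, 0.1], confidence=0.73),
    },
    "P2": {
        "predictor_a": Prediction([0.2, 0.8], confidence=0.66),
        "predictor_b": Prediction([0.5, 0.5], confidence=0.30),
    },
}

selected = dynamic_selection(example)
```

In this toy input the selection differs between the two proteins, which is the point of choosing a predictor per protein rather than one predictor for the whole dataset.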