| G-quadruplex (G4) is a nucleic acid secondary structure formed by stacked guanine tetrads. It plays an important role in biological processes such as transcription and translation. Biological experiments and bioinformatics analyses have shown that G-quadruplexes can recruit functional proteins, which then act on specific biological processes; we call these G4-binding proteins. Only one G4-binding protein database (G4IPDB), published in 2016, is available, and because detection is technically difficult and costly, the number of identified G4-binding proteins remains small. This paper therefore focuses on human G4-binding proteins: it builds a database for subsequent studies, analyzes the sequence features of G4-binding proteins, and builds predictive models based on those features. First, a database of human G4-binding proteins was built. Data were collected through literature review and database searches, and the database was implemented with Django, MySQL, and Bootstrap, with browsing, searching, and downloading functions. It contains 273 G4-binding protein entries, each of which includes protein and G-quadruplex information. Second, the sequence features of G4-binding proteins were analyzed, including amino acid composition, difference analysis, and motif prediction. The amino acid composition of the G4-binding protein group was similar to that of the nucleic acid-binding protein group but different from that of the human protein group. Motif prediction found two typical motif patterns: one corresponds to the RGG domain, and the other contains lysine (K), glutamic acid (E), and arginine (R). These results indicate that G4-binding protein sequences are specific and can serve as model features. Finally, predictive models of G4-binding proteins were built. G4-binding proteins served as positive samples, human proteins other than G4-binding proteins served as negative samples, and the sequences served as features. The models are based on support vector machines (SVM) and deep learning (CNN-BiLSTM). The data were split 4:1 for training and validation, yielding an SVM model (accuracy: 0.6667; precision: 0.7692; recall: 0.6451; AUC: 0.6565) and a CNN-BiLSTM model (accuracy: 0.9315; precision: 0.4286; recall: 0.7391; AUC: 0.7650), both with good predictive performance. The two types of model are suited to G4-binding protein prediction on small-sample and large-sample data, respectively. Compared with earlier work that uses an RGG domain score to predict G4-binding proteins, this work is the first to use machine-learning algorithms to construct a prediction model for G4-binding proteins, with sequence information as the predictive feature. This work is innovative and forward-looking. |
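
The sequence-based workflow summarized above (amino acid composition, a 4:1 training/validation split, and evaluation by accuracy, precision, and recall) can be illustrated with a minimal sketch. The abstract does not state how sequences were encoded or how the split was drawn, so the 20-dimensional composition vector, the random shuffle, and the names `aa_composition`, `split_4_to_1`, and `binary_metrics` are assumptions made for illustration, and the sequence in the usage lines is a made-up toy string rather than data from the database.

```python
import random
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def aa_composition(sequence):
    """Return the fraction of each standard amino acid in a protein sequence.

    Assumption: non-standard characters (e.g. X, gaps) are simply ignored.
    """
    counts = Counter(ch for ch in sequence.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [counts[aa] / total for aa in AMINO_ACIDS]


def split_4_to_1(samples, seed=0):
    """Shuffle samples and split them 4:1 into training and validation parts."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * 4 // 5
    return shuffled[:cut], shuffled[cut:]


def binary_metrics(y_true, y_pred):
    """Accuracy, precision, and recall for 0/1 labels (1 = G4-binding)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs) if pairs else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Toy usage with a made-up RGG-rich sequence (not a real protein).
features = aa_composition("RGGRGGFGG")
train, val = split_4_to_1(range(10))
print(len(features), len(train), len(val))
print(binary_metrics([1, 1, 0, 0, 1], [1, 0, 0, 1, 1]))
```

Composition vectors of this kind could be passed to any standard classifier. Because precision and recall are computed only from the positive class, they can diverge sharply from accuracy when negative samples far outnumber positive ones, which is worth keeping in mind when comparing the two sets of reported metrics.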