| A protein-DNA binding site is a segment of DNA sequence that can interact with a protein. Finding protein-DNA binding sites can help predict the function of regulatory genes, understand the regulatory processes in biological systems, and identify pathogenic variants. More importantly, protein-DNA binding sites can help in designing drugs that promote or inhibit the expression of a target gene. Accurately identifying protein-DNA binding sites from a DNA sequence is therefore an important task. Traditional methods of identifying protein-DNA binding sites rely on biological experiments and are costly and time-consuming, so a method that predicts protein-DNA binding sites in DNA sequences using deep learning is needed. Specifically, this work approaches the prediction of protein-DNA binding sites from several perspectives and achieves good results on ChIP-seq datasets, datasets from different species, and datasets from different cell lines. The main work of this article is as follows: (1) This work proposes a Multi-Nucleotide One-Hot (MNOH) encoding method. The main idea of MNOH is to take into account the interrelationship between nucleotides at adjacent positions in the protein-DNA binding site and to encode the adjacent nucleotides into a single vector. DNA sequences are processed with the MNOH encoding so that the model can make fuller use of the sequence information during training, which improves the prediction of protein-DNA binding sites from the DNA sequence to a certain extent. (2) This work proposes a fusion of multi-scale convolutional neural networks and a long short-term memory (LSTM) model to predict protein-DNA binding sites in DNA sequences. Because protein-DNA binding sites vary in length, this work uses a multi-scale convolutional neural network to automatically learn multi-scale features of the primary DNA sequence and capture the characteristics of binding sites of different lengths, and then uses the LSTM model to generate the discriminative information for protein-DNA binding sites. (3) This work proposes a predictor named DeepTF, which combines the MNOH encoding method with the fusion model. To verify the effectiveness of DeepTF in predicting protein-DNA binding sites, this work compares different network structures on the ChIP-seq datasets, the datasets from different species, and the datasets from different cell lines. The comparative experimental results show that the fusion model performs well in the task of predicting binding sites. The sequence-based prediction method for protein-DNA binding sites proposed in this work makes full use of the relevant characteristics of DNA sequences and achieves good prediction results on multiple datasets. |
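
The abstract describes the MNOH encoding only at a high level, so the following is a minimal sketch of one plausible reading rather than the authors' implementation. It assumes that each window of `k` adjacent nucleotides is mapped to a single one-hot vector of length 4^k, that windows slide one position at a time, and that windows containing an ambiguous base (such as `N`) are encoded as all zeros. The function name `mnoh_encode` and the parameter `k` are hypothetical.

```python
from itertools import product

NUCLEOTIDES = "ACGT"


def mnoh_encode(sequence, k=2):
    """One-hot encode each window of k adjacent nucleotides.

    Assumed reading of MNOH: every k-mer gets its own position in a
    vector of length 4**k. With k=1 this is ordinary one-hot encoding.
    """
    # Map every possible k-mer to an index, in lexicographic A, C, G, T order.
    index = {"".join(p): i for i, p in enumerate(product(NUCLEOTIDES, repeat=k))}
    sequence = sequence.upper()
    encoded = []
    # Slide a window of width k over the sequence with stride 1 (assumption).
    for start in range(len(sequence) - k + 1):
        vector = [0] * (4 ** k)
        kmer = sequence[start:start + k]
        # Windows containing ambiguous bases (e.g. N) are left all-zero (assumption).
        if kmer in index:
            vector[index[kmer]] = 1
        encoded.append(vector)
    return encoded


if __name__ == "__main__":
    rows = mnoh_encode("ACGT", k=2)
    # Report which position is set in each window's vector.
    print([row.index(1) for row in rows])
```

Under these assumptions, a sequence of length L yields L - k + 1 vectors, each with at most one non-zero entry. Such a matrix could then be passed to convolutional layers in the same way as a conventional one-hot matrix.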