| Protein macromolecules enable gene regulation by binding to specific DNA sequence segments and influencing gene expression patterns and transcriptional activity. Identifying protein-DNA binding sites helps explain the mechanisms of gene regulation and aids in designing targeted drugs that promote or inhibit the expression of target genes. With the explosive growth of genomic data, traditional biological sequencing methods are time-consuming and costly, which significantly reduces the efficiency of utilizing genomic data. Machine learning methods that handle sequence information well are therefore widely used in protein-DNA binding site prediction. However, existing methods ignore the fact that a DNA sequence may contain multiple transcription factor binding sites, they do not fully extract the features of DNA sequences, and their prediction accuracy still needs to be improved. This thesis conducts an in-depth study of protein-DNA binding site prediction methods based on DNA sequence information, combined with current advanced deep learning algorithms. The main work is summarized as follows: This thesis proposes a protein-DNA binding site prediction method based on weakly supervised learning and hybrid neural networks. Because a DNA sequence may contain multiple transcription factor binding sites, the method adopts multiple-instance learning: a sliding window partitions each input DNA sequence into multiple overlapping fragments, each fragment is treated as an instance, and each fragment is modeled separately. To account for the complex dependencies between nucleotides, MNOH encoding is used to map them into high-dimensional features. An attention mechanism is added to the CNN-BiLSTM network to give more weight to the important positions in the sequences, so that both the global and the local information of DNA sequences is captured better. A fully connected classifier is then used to obtain the final prediction results. To
further improve feature extraction from DNA sequences, this thesis proposes a BERT-based protein-DNA binding site prediction method that uses a DNABERT model pretrained on large-scale DNA datasets to improve the quality of the extracted DNA sequence features. The model contains a DNA sequence feature extraction module based on multi-head self-attention and a classifier with a fully connected network structure. The k-mer representation segments DNA sequences into short fragments of fixed length, and each fragment is treated as a word to construct a DNA vocabulary. The resulting tokens are passed into the DNA feature extraction module to capture contextual DNA sequence information, and the extracted features are then passed into the classifier for prediction. The experimental results show that the method achieves an accuracy of 83.85% on the Broad dataset. Building on these two works, this thesis proposes a feature fusion model with a BERT-CNN structure to further improve binding site prediction performance. The model combines the global information in the DNA sequence features extracted by BERT with the remaining sequence features processed by the CNN to model DNA features and achieve higher binding site prediction accuracy. The experimental results show that the proposed method has good classification ability for protein-DNA binding sites. |
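
The two sequence preprocessing steps described in the abstract, sliding-window partitioning into overlapping instances and k-mer segmentation into a DNA vocabulary, can be illustrated with a minimal sketch. The function names, the window size, the stride, and the value of k are hypothetical choices made for illustration. The abstract does not give these values, and it does not say whether the k-mers overlap; the sketch assumes overlapping k-mers with a step of one nucleotide.

```python
from typing import Dict, List


def sliding_windows(seq: str, window: int, stride: int) -> List[str]:
    """Partition a DNA sequence into overlapping fragments.

    Each fragment plays the role of one instance in a multiple-instance
    bag. Window and stride are illustrative parameters, not values taken
    from the thesis. A tail shorter than one stride is not covered.
    """
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    if len(seq) <= window:
        # A sequence no longer than the window forms a single instance.
        return [seq]
    return [seq[i:i + window] for i in range(0, len(seq) - window + 1, stride)]


def kmer_tokens(seq: str, k: int) -> List[str]:
    """Segment a DNA sequence into fixed-length k-mers ("words").

    Assumption: consecutive k-mers overlap and advance by one nucleotide.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_vocab(tokens: List[str]) -> Dict[str, int]:
    """Map each distinct k-mer to an integer id, in sorted order."""
    return {tok: idx for idx, tok in enumerate(sorted(set(tokens)))}


if __name__ == "__main__":
    instances = sliding_windows("ACGTACGT", window=4, stride=2)
    tokens = kmer_tokens("ACGTAC", k=3)
    print(instances)
    print(tokens)
    print(build_vocab(tokens))
```

In a full pipeline, the fragments would be encoded and scored individually before their predictions are aggregated per sequence, and the k-mer ids would be fed to the pretrained feature extraction module. Neither step is shown here.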