Research On SSR Prediction Method Based On BERT-CNN

Posted on:2024-02-16

Degree:Master

Type:Thesis

Country:China

Candidate:D Zhang

GTID:2530307106965329

Subject:Computer Science and Technology

Abstract/Summary:

Simple sequence repeats (SSRs) are a class of DNA sequences that are widely present in genomes. Because of their high variability and polymorphism, they are closely related to phenotypic changes and the regulation of gene expression. Efficient and accurate identification of SSRs is therefore of great significance for research in fields such as genetic diversity analysis, genetic identification, and DNA fingerprinting. Progress in SSR identification has been limited by the drawbacks of traditional biological experiments, such as high cost, long turnaround time, and low throughput. In recent years, researchers have used sequence-based computational methods to identify SSRs, but these methods are generally limited by memory and have difficulty processing high-throughput sequencing data. To address this, this thesis uses deep learning methods to predict SSRs, aiming to provide a more efficient and accurate solution for SSR prediction. The specific research work of this thesis is as follows:

(1) To address the lack of SSR datasets, a comprehensive experimental dataset containing SSRs and sequencing sequences was constructed. Sequences were selected from the PPSD, MSDB, and NCBI databases according to a set of collection criteria. A total of 25,000 SSRs and sequencing data for four species (Camellia sinensis, Citrus sinensis, Oryza sativa, and Homo sapiens) were collected. After data preprocessing, a total of 20,000 sequences were used to construct positive and negative datasets for model training.

(2) To address the vectorization of biological sequences, the sequence data were feature-encoded and converted into feature vectors. The BERT model was used to encode the sequence features; it can learn the semantic information of a sequence and generate feature vectors that serve as inputs to the SSR prediction model. BERT was compared with four other encoding methods: one-hot encoding, k-mer encoding, Word2Vec, and FastText. BERT achieved the highest scores on all four evaluation metrics, indicating that it has better feature extraction ability and better encoding performance.

(3) To address the insufficient feature extraction of the BERT model on its own, a BERT-CNN SSR prediction model was constructed. After the features are encoded by the BERT model, the feature vectors are fed into a CNN to further extract local features of the sequence and improve the accuracy of the model's predictions. The predictions of the BERT-CNN model were compared with those of the BERT-RNN, BERT-LSTM, and BERT models. The experimental results showed that the BERT-CNN model improved sensitivity, specificity, and accuracy (ACC) by 4.9%, 6.4%, and 7.1%, respectively, compared with the BERT model. This indicates the superiority of BERT-CNN in sequence feature extraction and verifies its applicability and accuracy in SSR prediction.
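As a rough illustration of two of the baseline encodings named in the abstract (one-hot and k-mer) and of the three reported metrics (sensitivity, specificity, and ACC), a minimal Python sketch follows. It is not the thesis implementation. The function names, the choice of k = 3, the stride of 1, the all-zero row for unknown bases, and the use of the standard confusion-matrix definitions of the metrics are all assumptions made here for illustration; the abstract does not specify them.

```python
from typing import Dict, List, Sequence

NUCLEOTIDES = "ACGT"  # assumed alphabet; other symbols (e.g. N) are treated as unknown


def one_hot_encode(seq: str) -> List[List[int]]:
    """Map each base to a 4-dimensional indicator vector ordered A, C, G, T.

    Unknown symbols become an all-zero row (an assumption, not from the thesis).
    """
    index = {base: i for i, base in enumerate(NUCLEOTIDES)}
    rows = []
    for base in seq.upper():
        row = [0] * len(NUCLEOTIDES)
        if base in index:
            row[index[base]] = 1
        rows.append(row)
    return rows


def kmer_tokens(seq: str, k: int = 3) -> List[str]:
    """Split a sequence into overlapping k-mers with stride 1 (k=3 is hypothetical)."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Sensitivity, specificity, and accuracy for labels 1 (SSR) and 0 (non-SSR)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    total = tp + tn + fp + fn
    return {
        # Fraction of true SSRs that were predicted as SSRs.
        "sensitivity": tp / (tp + fn) if (tp + fn) else 0.0,
        # Fraction of non-SSRs that were predicted as non-SSRs.
        "specificity": tn / (tn + fp) if (tn + fp) else 0.0,
        # Fraction of all predictions that were correct.
        "acc": (tp + tn) / total if total else 0.0,
    }


if __name__ == "__main__":
    print(kmer_tokens("ATATAT", k=3))
    print(one_hot_encode("ACGT"))
    print(binary_metrics([1, 1, 0, 0], [1, 0, 0, 0]))
```

A dinucleotide repeat such as ATATAT produces only two distinct, alternating 3-mers, which is the kind of local regularity that a convolutional layer over token embeddings could in principle pick up.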

Keywords/Search Tags:

SSR prediction, feature encoding, BERT, convolutional neural network

Related items

1	The Protein Secondary Structure Prediction Based On Convolutional Neural Network
2	Research On Feature Extraction Algorithm Of Functional Peptide Prediction Problem Based On BERT Pre-trained Model
3	Research On EEG Signal Classification Method Based On Convolutional Neural Network
4	Research On Prediction Of Polyproline Type Ⅱ Structure Based On Multi-feature Fusion
5	Research On The Prediction Of Protein-ATP Binding Sites Based On Improved Convolutional Neural Network
6	Research On FMRI Visual Information Deep Neural Network Encoding Model Based On Feature Fusion
7	Research On EEG Emotion Recognition Methods Based On Convolutional Neural Networks
8	Brain Network Feature Analysis And Application Based On Graph Convolutional Neural Network
9	Research On Link Prediction Algorithm Based On Deep Convolutional Neural Network
10	The Prediction Of LncRNA Based On Multi-modal Deep Neural Network