Research On SSR Prediction Method Based On BERT-CNN

Posted on: 2024-02-16
Degree: Master
Type: Thesis
Country: China
Candidate: D Zhang
GTID: 2530307106965329
Subject: Computer Science and Technology
Abstract/Summary:
SSRs (Simple Sequence Repeats) are a class of DNA sequences that are widely present in genomes. Because of their high variability and polymorphism, they are closely related to phenotypic changes and the regulation of gene expression. Efficient and accurate identification of SSRs is therefore of great significance for researchers in fields such as genetic diversity analysis, genetic identification, and DNA fingerprinting. Progress in SSR identification has been limited by the drawbacks of traditional biological experiments, such as high cost, long cycles, and low throughput. In recent years, researchers have used sequence-based computational methods to identify SSRs, but these methods are generally limited by memory and have difficulty processing high-throughput sequencing data. This thesis therefore uses deep learning methods to predict SSRs, aiming to provide a more efficient and accurate solution for SSR prediction. The specific research work of this thesis is as follows:

(1) To address the lack of SSR datasets, a comprehensive experimental dataset containing SSRs and sequencing sequences was constructed. Sequences were selected from the PPSD, MSDB, and NCBI databases according to defined collection criteria. In total, 25,000 SSRs and sequencing records were collected for four species (Camellia sinensis, Citrus sinensis, Oryza sativa, and Homo sapiens). After preprocessing, 20,000 sequences were used to construct the positive and negative datasets for model training.

(2) To address the vectorization of biological sequences, the sequence data were feature-encoded and converted into feature vectors. The BERT model was used to encode the sequence features; it can learn the semantic information of a sequence and generate feature vectors that serve as inputs to the SSR prediction model. BERT was compared with four other encoding methods: one-hot encoding, k-mer encoding, Word2Vec, and FastText. BERT achieved the highest scores on all four evaluation metrics, indicating that it has better feature extraction ability and better encoding performance than the other methods.

(3) To address the insufficient feature extraction of the BERT model alone, a BERT-CNN SSR prediction model was constructed. After the features were encoded by the BERT model, the feature vectors were fed into a CNN to further extract local features of the sequence and improve prediction accuracy. The predictions of the BERT-CNN model were compared with those of the BERT-RNN, BERT-LSTM, and BERT models. The experimental results showed that, compared with the BERT model, the BERT-CNN model improved Sensitivity, Specificity, and ACC by 4.9%, 6.4%, and 7.1%, respectively. This indicates the advantage of BERT-CNN in sequence feature extraction and supports its applicability and accuracy in SSR prediction.
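The three metrics named in the comparison above (Sensitivity, Specificity, and ACC) have standard definitions for binary classification. The following is a minimal sketch of how they are usually computed from predicted and true labels; it is not taken from the thesis. The function name `classification_metrics`, the label convention (1 = SSR, 0 = non-SSR), and the toy labels are assumptions made for illustration only.

```python
def classification_metrics(y_true, y_pred):
    """Compute sensitivity, specificity, and accuracy for binary labels.

    Assumed convention (not stated in the abstract): 1 = SSR (positive),
    0 = non-SSR (negative).
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    total = tp + tn + fp + fn
    # Guard against empty classes so the function never divides by zero.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0  # TP / (TP + FN)
    specificity = tn / (tn + fp) if (tn + fp) else 0.0  # TN / (TN + FP)
    accuracy = (tp + tn) / total if total else 0.0      # (TP + TN) / all
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
    }


if __name__ == "__main__":
    # Toy labels, invented for illustration; not data from the thesis.
    y_true = [1, 1, 1, 0, 0, 0, 1, 0]
    y_pred = [1, 0, 1, 0, 1, 0, 1, 0]
    print(classification_metrics(y_true, y_pred))
```

Sensitivity measures how many true SSRs the model recovers, while specificity measures how many non-SSR sequences it correctly rejects, so reporting both alongside ACC shows whether a gain comes from the positive class, the negative class, or both.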
Keywords/Search Tags: SSR prediction, feature encoding, BERT, convolutional neural network