Research On LncRNA-Protein Interaction Prediction Methods Based On Sequence And Word Embedding

Posted on:2024-03-06

Degree:Master

Type:Thesis

Country:China

Candidate:D F Xia

Full Text:PDF

GTID:2530306917988089

Subject:Electronic information

Abstract/Summary:

Long non-coding ribonucleic acids (LncRNAs) do not encode proteins, are more than 200 nucleotides long, and have a relatively large molecular weight. LncRNAs bind to proteins and affect chromosome structure and gene transcription, participating in key cellular processes such as signal transduction, chromosome replication, substance transport, and mitosis. LncRNA-protein interactions (LPIs) are diverse: LncRNAs mediate epigenetic modification of proteins, such as acetylation, ubiquitination, and phosphorylation; proteins affect the rate of LncRNA degradation and mediate the intracellular localization of LncRNAs; and LncRNAs mediate the intranuclear localization of proteins and affect their three-dimensional structure. During cancer development, LncRNAs regulate tumorigenesis by binding to the polycomb complex or by participating in gene transcription mediated by DNA methylation and acetylation. In addition, infectious diseases including COVID-19 are closely associated with LPI dysregulation. Accurate identification of LPIs is therefore crucial for annotating LncRNAs and understanding the molecular mechanisms of disease, and it will contribute to some extent to advances in disease diagnosis and the development of therapies.

Because LPIs are numerous and varied, detailed experimental verification of each one is impractical and costly in time and resources. Computational methods for LPI prediction have therefore emerged. However, most of these methods are evaluated on a single dataset, so their generalization ability cannot be assessed. In addition, how to fully mine feature information that reflects the functional properties of LncRNAs and proteins, and how to apply deep learning to integrate high-level features learned from the original input features of proteins and RNA, remain important research challenges in this field. In view of the limitations of existing methods, this dissertation draws on deep learning and natural language processing to propose solutions for the key steps in constructing LPI prediction models. The main research contents are as follows:

(1) A combination of deep learning and machine learning is adopted to predict LPIs based on sequence and structure information. First, the k-mer feature extraction method is used to convert biological sequence information into feature vectors that a computer can process. Second, a stacked denoising autoencoder is used as the base model, and a stacking ensemble strategy combines Random Forest, Support Vector Machine, and Gradient Boosting Decision Tree into an LPI ensemble learning model. Finally, the model is fine-tuned using logistic regression. Experimental results show that this ensemble model can automatically learn high-level abstract features from a full range of simple sequence composition features and is more robust than a single model. The method achieves an accuracy (Acc) of 91.8% on the RPI488 dataset, exceeding existing methods by 1.3%-2.1%, and also generalizes well to the RPI1807 and RPI2241 datasets.

(2) An LncRNA-protein association prediction model is constructed based on feature fusion and an improved residual network. First, RNA sequence features are extracted using k-mer, gapped k-mer, and reverse complementary sequences, and protein sequence features are extracted using binary profile features. Second, feature fusion is performed using a Long Short-Term Memory network and a self-attention mechanism. Finally, the fused features are input to the improved residual network, and the performance of the model is evaluated by 5-fold cross-validation. The experimental results show that the model can effectively integrate heterogeneous features and capture new distinctive features, with overall performance far superior to existing methods on plant datasets.

(3) Drawing on ideas from natural language processing, biological sequences are treated as a special kind of text, and a Transformer model is used to predict associations between LncRNAs and proteins. First, to strengthen feature representation, RNA sequence features are extracted using k-tuple nucleotide frequency components, pseudo-dinucleotide composition, and physicochemical properties, and protein sequence features are extracted using Word2vec. The protein and RNA features are then fed into an improved encoder and decoder, respectively. Specifically, to suit a small dataset, the attention mechanism of the encoder is replaced by a gated linear unit. To improve the model's use of the context before and after each position in the sequence, a convolutional attention network replaces the masked multi-head self-attention mechanism of the decoder. The experimental results show that this model has certain advantages on the plant dataset and other datasets.
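
The k-mer feature extraction and reverse complementary sequences mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the dissertation's implementation: the function names `kmer_frequencies` and `reverse_complement`, the default `k=3`, the RNA alphabet `ACGU`, and the choice to normalize counts into frequencies are all assumptions made here for illustration.

```python
from itertools import product


def kmer_frequencies(sequence, k=3, alphabet="ACGU"):
    """Return normalized k-mer frequencies in a fixed lexicographic order.

    Hypothetical helper: k, the alphabet, and the normalization are
    illustrative assumptions, not values taken from the dissertation.
    """
    sequence = sequence.upper().replace("T", "U")
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = 0
    # Slide a window of width k across the sequence, one position at a time.
    for i in range(len(sequence) - k + 1):
        window = sequence[i:i + k]
        if window in counts:  # skip windows with ambiguous bases such as N
            counts[window] += 1
            total += 1
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[kmer] / total for kmer in kmers]


def reverse_complement(sequence):
    """Return the reverse complement of an RNA sequence (unknown bases become N)."""
    complement = {"A": "U", "U": "A", "G": "C", "C": "G"}
    rna = sequence.upper().replace("T", "U")
    return "".join(complement.get(base, "N") for base in reversed(rna))


# A toy sequence, not real data: 4**2 = 16 features for k = 2.
features = kmer_frequencies("AUGGCUAUGG", k=2)
rc_features = kmer_frequencies(reverse_complement("AUGGCUAUGG"), k=2)
```

Each sequence maps to a vector of length 4^k whatever its own length, which is what lets variable-length sequences feed a fixed-size model input. How the dissertation combines forward and reverse-complement features is not stated in the abstract, so the two vectors are left separate here.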

Keywords/Search Tags:

Long Non-coding Ribonucleic Acid, Protein, Long Short-Term Memory Network, Residual Network, Transformer

Related items

1	Research On Meteorological Prediction Based On Long Short-term Memory Network
2	Forecasting Of Ionospheric TEC Using Long Short-Term Memory Network
3	Application Of Long Short-term Memory Network In Short-term Rainfall
4	Reconstruction Of Central Arterial Pressure Signal Based On Long Short-term Memory Network
5	Prediction Of Protein-DNA Binding Site Based On CNN-LSTM
6	Portforlio Selection Based On Long-short Term Memory Neural Network
7	A Precursor MicroRNA Identification Method Based On Convolutional And Long Short-Term Memory Networks
8	Research On Lithofacies Identification Method Based On Residual Recurrent Neural Network
9	Research On Short Term Forecast Of Fog Based On Deep-Learning
10	Research On Water Level Prediction Model Of Luoma Lake Based On Long-short-term Memory Recurrent Networks And Its Variants