| Long non-coding ribonucleic acids (LncRNAs) do not encode proteins and are more than 200 nucleotides long, with a relatively large molecular weight. LncRNAs bind to proteins and affect chromosome structure and gene transcription, participating in key cellular processes such as signal transduction, chromosome replication, substance transport, and mitosis. LncRNA-protein interactions (LPIs) are diverse: LncRNAs mediate epigenetic modification of proteins, such as acetylation, ubiquitination, and phosphorylation; proteins affect the rate of LncRNA degradation and mediate the intracellular localization of LncRNAs; and LncRNAs mediate the intra-nuclear localization of proteins and affect their three-dimensional structure. During cancer development, LncRNAs regulate tumorigenesis by binding to polycomb complexes or by participating in gene transcription mediated by DNA methylation and acetylation. In addition, infectious diseases, including COVID-19, are closely associated with LPI dysregulation. Accurate identification of LPIs is therefore crucial for annotating LncRNAs and understanding the molecular mechanisms of diseases, and can contribute to advances in disease diagnosis and the development of therapies. Because LPIs are numerous and varied, experimentally verifying each one in detail is impractical and costly in time and resources. Computational methods for LPI prediction have therefore emerged. However, most of these methods are evaluated on a single dataset, which does not measure a model's generalization ability. In addition, how to fully mine feature information that reflects functional properties from LncRNAs and proteins, and how to apply deep learning to integrate high-level features learned from the raw input features of proteins and RNA, remain important research challenges in this field. In view of the limitations of existing methods, and based on deep learning and natural language 
processing, this dissertation proposes effective solutions to the key steps of LPI prediction model construction. The main research contents are as follows: (1) A combination of deep learning and machine learning is adopted to predict LPIs based on sequence and structure information. First, the k-mer feature extraction method converts biological sequence information into feature vectors that a computer can process. Second, a stacked denoising autoencoder is used as the base model, and a stacking ensemble strategy combines Random Forest, Support Vector Machine, and Gradient Boosting Decision Tree into an LPI ensemble learning model. Finally, the model is fine-tuned using logistic regression. Experimental results show that this ensemble model can automatically learn high-level abstract features from a full range of simple sequence composition features and is more robust than a single model. The method achieves an accuracy (Acc) of 91.8% on the RPI488 dataset, exceeding existing methods by 1.3%-2.1%, and also generalizes well to the RPI1807 and RPI2241 datasets. (2) An LncRNA-protein association prediction model is constructed based on feature fusion and an improved residual network. First, RNA sequence features are extracted using k-mer, gapped k-mer, and reverse-complement sequences, and protein sequence features are extracted using binary profile features. Second, feature fusion is performed using a Long Short-Term Memory (LSTM) network and a self-attention mechanism. Finally, the fused features are fed into the improved residual network, and model performance is evaluated by 5-fold cross-validation. The experimental results show that the model can effectively integrate heterogeneous features and capture new, distinctive features, with overall performance far superior to existing methods on plant datasets. (3) Borrowing ideas from natural language processing, biological sequences are regarded as special 
texts, and a Transformer model is used to predict the association between LncRNAs and proteins. First, to enhance feature representation, RNA sequence features are extracted using k-tuple nucleotide frequency components, pseudo-dinucleotide composition, and physicochemical properties, and protein sequence features are extracted using Word2vec. The protein and RNA features are then fed into an improved encoder and decoder, respectively. Specifically, to suit a small dataset, the attention mechanism of the encoder is replaced by a gated linear unit. To enhance the model's mining of contextual information before and after each position in the sequence, a convolutional attention network replaces the masked multi-head self-attention mechanism of the decoder. The experimental results show that this model has certain advantages on the plant dataset and on other datasets. |
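
The k-mer feature extraction named in contributions (1) and (2) can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the dissertation's implementation: the function name `kmer_frequencies`, the default `k=3`, the RNA alphabet `ACGU`, the mapping of `T` to `U`, and the normalization by the number of valid windows are all hypothetical choices made here.

```python
from itertools import product


def kmer_frequencies(sequence, k=3, alphabet="ACGU"):
    """Return the normalized k-mer frequency vector of a sequence.

    Hypothetical helper: the vector has len(alphabet) ** k entries,
    ordered lexicographically by the order of symbols in `alphabet`.
    """
    # Enumerate every possible k-mer once so the vector layout is fixed.
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    index = {kmer: i for i, kmer in enumerate(kmers)}

    # Assumption: DNA-style input is accepted by mapping T to U.
    seq = sequence.upper().replace("T", "U")

    counts = [0] * len(kmers)
    total = 0
    # Slide a window of width k across the sequence, one position at a time.
    for start in range(len(seq) - k + 1):
        position = index.get(seq[start:start + k])
        if position is not None:  # skip windows with symbols outside the alphabet
            counts[position] += 1
            total += 1

    if total == 0:
        return [0.0] * len(kmers)
    return [count / total for count in counts]


features = kmer_frequencies("AUGGCUACGUAGC", k=3)
print(len(features))  # 4 ** 3 entries for the RNA alphabet
```

The same function could be applied to protein sequences by passing a different `alphabet`, for example a reduced amino-acid alphabet; that choice is likewise an assumption and is not specified in the text above.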