Research On Rna Functional Target Prediction Using Word Embedding Representation And Deep Learning Technology

Posted on: 2024-08-14; Degree: Doctor; Type: Dissertation
Country: China; Candidate: Y N Yang
GTID: 1520307313950969; Subject: Computer application technology
Abstract/Summary:
Ribonucleic acid (RNA) is a key component of cell biology, and its functional targets play an important role in post-transcriptional regulation. These targets profoundly affect various biological processes in eukaryotes by interacting with regulatory RNA-binding proteins (RBPs). As an essential part of post-transcriptional regulation, RNA splicing converts the precursor messenger RNA (pre-mRNA) transcribed from most cellular genes into mature coding RNA through linear splicing, which then directs the ribosomes in the cell to synthesize proteins. However, recent studies have revealed that the termini of pre-mRNA molecules can also produce, through back-splicing, a class of endogenous non-coding RNAs with a closed-loop structure, namely circular RNAs (circRNAs). RBPs interact with and adsorb circRNAs, and these interactions regulate gene expression and influence cellular functions. Meanwhile, the majority of these interactions occur in the 3' untranslated regions of coding RNA sequences and involve a variety of key biological processes such as transcription, modification, and subcellular localization. Therefore, in-depth exploration of the functional binding mechanism between RNA and RBPs has important theoretical and practical value for basic biological research and disease treatment.

Benefiting from the rapid development of high-throughput sequencing technology, massive numbers of high-resolution functional targets have been identified. However, current in vivo experimental studies of RNA functional targets are still limited to transcripts expressed in specific experimental cell types, so computational methods are required to reveal binding information that cannot be observed directly. With the emergence of deep learning in genomics and other research fields, the development of new methods for RNA functional target prediction has advanced rapidly. Efficient intelligent computing methods provide powerful guidance for accurately targeting post-transcriptional regulatory functions by uncovering and summarizing patterns and relationships within large-scale complex data.

This study relied on RNA-protein relationships to represent RNA sequence information with word embedding techniques from natural language processing, and constructed a deep learning model tailored to the research problem to accurately predict the binding preferences of targets. To broaden the range of applications, this study further constructed a generic model to systematically assess the propensity of any functional binding event, and added the ability to recognize binding motifs in order to interpret the patterns of target interactions. Finally, this research proposed an advanced language model, trained on abundant unannotated sequences, that addressed tasks related to post-transcriptional regulation, and provided biological explanations for the model's effectiveness. The detailed research work is summarized as follows:

(1) Compared with linear RNAs, circRNAs differ significantly in structural and biological properties, and their covalently closed-loop structure increases the complexity of prediction. Existing computational approaches represented circRNA sequences with a single strategy that did not consider biological correlation, and their frameworks failed to fully explore the information underlying the interactions, resulting in a performance bottleneck. Therefore, this work developed a novel computational method called iCircRBP-DHN, which uses a deep hierarchical network to identify circRNA-RBP interactions. Since the regulation of functional targets depends on the synergy between nucleotides at different ranges, this study first used static word embedding to model circRNA sequences and obtained their continuous distributed vectors through unsupervised learning as a feature of global context. Meanwhile, K-tuple nucleotide frequency patterns were combined to express local relationships, followed by a self-attention mechanism that integrated the diverse linkages between bases at each position and provided a basis for model discrimination. The experimental results showed that iCircRBP-DHN significantly improved the identification of circRNA functional targets, especially on small-scale datasets. This improvement came from the correspondence between binding motif patterns and word frequencies in the corresponding vocabulary, and the pretrained language model showed the potential to extract specific binding patterns from the semantics and grammar implied by the sequences. This study also verified that the word embedding encoding scheme could greatly improve the automation efficiency and precision of scientific research, providing new perspectives for revealing the intrinsic regularities of biological sequences and understanding biological phenomena.

(2) Static word embedding generates only fixed vectors for base information, so it cannot capture the variations that specific segments exhibit in different sequence contexts, which limits its representational power on large-scale datasets. Furthermore, most existing prediction methods construct a separate model for each RBP to identify its functional targets; this strategy cannot be applied to circRNA sequences that lack known target RBPs. Considering these limitations, we presented a dual computational method called HCRNet for identifying RBP binding events on circRNAs. To capture the hierarchical relationships of sequences, this study fused multi-source biological features, including dynamic and static word embedding representations, and constructed a deep temporal convolutional network with global expectation pooling to ensure effective induction of discriminative information without losing model robustness. The experimental results showed that the proposed method was flexibly applicable, handling target prediction tasks at different data scales, and outperformed existing computational approaches overall. Notably, the model based on the generic strategy also maintained superior generalizability and robustness over 150 novel RBPs. Furthermore, HCRNet not only assessed the probability of RNA binding events but also located the interaction positions at the same time, which provided important practical value for identifying novel functional targets. Finally, drawing on the feature learning capability of HCRNet, this work employed motif analysis to reveal the specific binding patterns of different RBPs at the circRNA level. Significant differences were found in the binding motifs of the same RBP on different types of RNA, further emphasizing the importance of a dual prediction strategy.

(3) Unlike circRNAs, which arise from back-splicing, the pre-mRNA transcribed from most cellular genes must undergo splicing, capping, and tailing to mature into mRNA. These mRNAs serve as protein-coding templates and are essential for maintaining cell function and the normal functioning of organs. Research has shown that the functional interaction targets of regulatory sequences are mainly located in the 3' untranslated region of coding RNA, which is essential for regulating gene and protein expression at the post-transcriptional level. Cis-regulatory elements in this region are recognized and bound by trans-acting factors such as RBPs to modulate RNA modification, abundance, and cellular localization. Even though a multitude of high-throughput biochemical assays have been used to characterize these sequences, the complexity of transcriptomes poses great challenges to the thorough interpretation of their functions. Meanwhile, existing computational methods rely heavily on sequence labeling information of RNA and suffer from a common problem of limited generalizability. Their model architectures require not only deliberate design but also optimization and tuning for specific tasks, which makes them difficult to transfer to other related studies. In this study, we developed an advanced language model called 3UTRBERT. Trained purely on regulatory sequences, this model could be applied to various downstream tasks in post-transcriptional regulation through fine-tuning with slight modifications to the model architecture, including RNA-RBP interaction prediction, RNA modification site identification, and subcellular localization. Moreover, 3UTRBERT made it possible to visualize semantic relationships and contributions at the nucleotide level, thereby identifying conserved binding motifs and deciphering the biological functions involved in post-transcriptional regulation, which offered insight into the widely criticized ‘black box’ problem in deep learning.
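As a rough illustration of two sequence representations mentioned in part (1), the sketch below splits an RNA sequence into overlapping k-mer "words" (a common input unit for word embedding models) and computes a K-tuple nucleotide frequency vector. It is a minimal sketch, not the dissertation's implementation: the function names, the choice of k = 3, the stride of 1, and the A/C/G/U alphabet are all assumptions made here for illustration.

```python
from collections import Counter
from itertools import product

ALPHABET = "ACGU"  # assumed RNA alphabet; not specified in the abstract


def kmer_tokens(seq: str, k: int = 3) -> list[str]:
    """Split a sequence into overlapping k-mers ("words") with stride 1."""
    seq = seq.upper().replace("T", "U")  # accept DNA-style input as well
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def ktuple_frequencies(seq: str, k: int = 3) -> list[float]:
    """Return relative frequencies of all 4**k k-tuples in lexicographic order."""
    tokens = kmer_tokens(seq, k)
    counts = Counter(tokens)
    vocab = ["".join(p) for p in product(ALPHABET, repeat=k)]
    total = len(tokens)
    if total == 0:  # sequence shorter than k
        return [0.0] * len(vocab)
    return [counts[word] / total for word in vocab]


# Hypothetical toy sequence, used only to exercise the functions.
tokens = kmer_tokens("AUGGCUAUG", k=3)
freqs = ktuple_frequencies("AUGGCUAUG", k=3)
```

In a pipeline of the kind the abstract describes, the token list would serve as the "sentence" passed to a word embedding model to capture global context, while the frequency vector would act as a complementary local feature; how the dissertation actually combines them is not detailed here.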
Keywords/Search Tags: RNA-binding proteins, RNA functional targets, Post-transcriptional regulation, Natural language processing technology, Deep learning, Interpretability analysis