| The genome is often called the book of life, in which nature records the entire life process through biological sequences. Biological sequences are the primary basis for understanding the nature of life at the molecular level. Natural language and biological sequences have remarkable similarities. Using natural language processing (NLP) technology to reveal the meaning of the “book of life” contributes to a comprehensive understanding of the functions and structures encoded by biological sequences and has greatly facilitated biological sequence analysis research. The performance of NLP and machine learning depends heavily on the quality of the data representation. Nucleic acids and proteins are the two most important biological macromolecules. Nucleic acid and protein molecules have a wide variety of sequences; however, their sequence composition and structure are highly similar, so different sequences can be characterized and encoded in similar ways. Therefore, in this dissertation, several typical nucleic acid and protein molecules were selected as study objects for biological sequence representation learning, including plant resistance (R) proteins, mRNA, ncRNA-protein interactions (NPI), and anti-coronavirus peptide (ACVP) sequences, with traditional feature engineering, static representation, dynamic pre-trained representation, and automated feature engineering as the technical tools for sequence representation learning. The main work is summarized as follows: (1) A model for predicting plant R proteins based on pairwise energy content. Plant R proteins recognize effector proteins secreted by pathogenic microorganisms and trigger immune responses to infection by those microorganisms. Accurately identifying plant R proteins is an important research topic in plant pathology. Most existing computational models focus on animal rather than plant proteins. Moreover, protein sequence representation methods rely mainly on amino acid frequency features, ignoring the
inter-amino acid properties. In this work, we propose StackRPred, a plant R protein prediction model based on traditional feature engineering. First, a pairwise energy content matrix of amino acid residues is introduced and used to derive two representation learning methods for plant R proteins. Then, the resulting sequence representations are fed into a two-layer Stacking ensemble learning framework, which is trained to predict plant R proteins. The accuracy reached 97.5% under five-fold cross-validation and 96.7% on the independent test set, indicating that StackRPred is an effective tool for predicting plant R proteins. (2) A predictive model of mRNA subcellular localization based on a multi-scale interpretation mechanism. mRNA plays a vital role in the post-transcriptional regulation of genes and is the direct template for protein biosynthesis. However, little work has addressed the prediction of mRNA subcellular localization, and representation learning methods for mRNA sequences, especially interpretable prediction methods, need further exploration. Hence, this dissertation constructs a multi-scale self-attention interpretation mechanism, named mRNA-CLA, for predicting multi-label mRNA subcellular localization. The model obtains sequence features at different locations through multi-scale convolutional layers. Moreover, it uses the self-attention scores generated by the self-attention layer for each sequence, together with the position weight matrix extracted from the CNN layer, to provide interpretable model analysis. In particular, sequence base analysis was performed to obtain the base specificity of mRNA sequences at different positions. The evaluation results show that mRNA-CLA improves the predictive performance of mRNA subcellular localization while increasing the interpretability of the model by visualizing the mRNA sequences. (3) An NPI prediction method based on graph representation learning and community detection.
ncRNA-protein interactions are involved in essential life processes, so exploring them is crucial. Existing methods are mainly based on sequence or structural feature vectors of the ncRNA or protein molecules themselves. In contrast, feature analysis of their interactions is relatively rare, and in particular the potential of graph neural network (GNN) methods for predicting NPI has not yet been well developed. To this end, a GNN-based NPI prediction model is proposed that transforms the NPI prediction problem into a binary classification problem on subgraphs. Specifically, two groups of structured labels are used to distinguish the two types of nodes, ncRNA and protein, which alleviates the problem of over-coupling in the graph network. Subsequently, the representations of ncRNA and proteins are optimized based on the community membership of the nodes in the graph. Moreover, the model applies a self-attention mechanism to preserve the graph topology and reduce information loss during pooling. Finally, experimental validation is conducted on two dense and two sparse graphs. The experimental results show that the proposed method achieves the best prediction accuracy on dense graphs compared with existing methods, with the prediction accuracy exceeding 90% in both cases; it also shows good overall performance on sparse graphs. (4) An automated feature engineering-based ACVP prediction method. The COVID-19 pandemic has severely affected people’s daily lives. ACVP prediction models based on sequence representation learning strategies can help in developing anti-coronavirus drugs. However, although much antiviral peptide (AVP) data has been identified, there are still few experimentally validated samples of ACVP, which is a subclass of AVP. In addition, existing prediction methods mostly rely on experience for feature selection and model parameter setting, which is prone to subjective bias and makes model training increasingly “expensive”. Automated feature
engineering methods can optimize features and models and help alleviate the above problems. Therefore, this dissertation proposes ACVP-Auto, an ACVP prediction method based on automated feature engineering. The proposed method first learns multi-view representations of AVP and ACVP sequences. Then it introduces Bayesian techniques to optimize the search space and select the best combination of features and models. The model is trained on the AVP dataset together with a small amount of ACVP data. In conclusion, building an automated machine learning model helps improve prediction performance and effectively avoids the inefficient manual, experience-based setting of hyperparameters, which allows the approach to scale up. |
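
The contrast drawn in item (1), between amino acid frequency features and features that capture relationships between residues, can be illustrated with a minimal sketch. This is not the StackRPred representation: the pairwise energy content matrix is not reproduced here, and adjacent-pair (dipeptide) frequencies are used only as a simple stand-in for pair-level information. The function names and the two toy peptides are hypothetical, and sequences are assumed to be non-empty and made up of the 20 standard residues.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def composition(seq: str) -> list[float]:
    """Frequency of each residue; ignores residue order entirely."""
    counts = Counter(seq)
    return [counts[a] / len(seq) for a in AMINO_ACIDS]


def dipeptide_composition(seq: str) -> list[float]:
    """Frequency of each ordered adjacent residue pair (400 values)."""
    pairs = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    total = max(len(seq) - 1, 1)  # avoid dividing by zero for length-1 input
    return [pairs[a + b] / total for a, b in product(AMINO_ACIDS, repeat=2)]


# Two toy peptides (hypothetical) with identical residue composition.
s1, s2 = "ACCA", "CAAC"
same_composition = composition(s1) == composition(s2)
same_dipeptides = dipeptide_composition(s1) == dipeptide_composition(s2)
print(same_composition, same_dipeptides)  # prints: True False
```

The two peptides are indistinguishable by composition alone but differ once adjacent pairs are counted, which is the kind of information a frequency-only representation discards.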