| Studies have shown that lncRNAs play an important role in transcription, splicing, and gene expression, and are closely related to the development of complex diseases. Investigating the functions of lncRNAs is therefore of great significance for understanding the mechanisms underlying biological processes. However, the current understanding of lncRNA functions in humans remains limited. Typically, lncRNAs must bind to partner proteins to exert their effects, so identifying interactions between lncRNAs and proteins is important for understanding these mechanisms. Experimental methods for identifying lncRNA-protein interactions include ChIRP, CHART, RIP, RIP-ChIP/Seq, and CLIP, among others. However, these experimental methods are time-consuming and expensive. With the development of machine learning (ML) techniques, the use of computational methods to predict lncRNA-protein interactions has been widely studied. Computer-aided prediction guides experimental work by identifying potential interactions and reducing blind trial and error in the experimental process. Developing efficient and accurate computational models to predict lncRNA-protein interactions is therefore of great significance. Current lncRNA-protein interaction prediction methods fall mainly into two categories. The first is based on sequence information, structural information, evolutionary knowledge, or physicochemical properties of lncRNAs and proteins. Such models can characterize and predict lncRNAs and proteins outside the prior network. However, because of their poor predictive performance, they do not provide effective guidance for experiments. The second is based on network representation learning, a technology that has achieved remarkable results in tasks such as social network analysis, product recommendation, and biological network analysis. Despite its excellent performance, the lncRNA-protein interaction 
prediction model based on this method usually only reconstructs the prior network and cannot infer new information from external samples. Considering the above problems, this thesis focuses on developing machine learning models with high performance and inductive ability for lncRNA-protein interaction prediction. Using network representation learning, our models learn the rich connectivity patterns of the lncRNA-protein interaction network and then predict potential associations between lncRNAs and proteins. The specific research contents are as follows: (1) This thesis presents an approach to predicting lncRNA-protein interactions with a heterogeneous network embedding model called LncPNet (Predicting lncRNA-Protein Interactions by Heterogeneous Network Embedding). LncPNet uses lncRNA-protein interaction data to construct a heterogeneous network and then learns node representations within the network in two steps. The first step uses random walk sampling to generate contexts that capture the semantic relationships of the network. The second step applies the skip-gram model to learn these context relationships and generate vector representations of the network nodes. LncPNet achieves AUCs of 0.911, 0.998, and 0.999 on NPInter v2.0, RAID v2.0, and NPInter v4.0, respectively. The experimental results show that this method has better predictive performance. (2) However, experimental results demonstrate that even relatively simple ML methods can predict lncRNA-protein interactions accurately. This thesis shows that the lncRNA-protein interaction network is scale-free, which biases estimates of model performance. In particular, negative samples must be prepared before an ML model can be trained, yet lncRNA-protein interaction databases contain no negative samples. Traditionally, negative samples are generated through random sampling. Since the lncRNA-protein interaction network is scale-free, there is a clear 
difference in degree distribution between the generated negative samples and the true positive samples. An ML model can learn this difference, which is ultimately reflected in the model’s prediction preference. To eliminate this bias in model evaluation, this thesis proposes a negative sample generation method based on a degree distribution balance (DDB) strategy, which balances the degree distributions of positive and negative samples. Experiments on multiple datasets demonstrate that this method effectively resolves the prediction bias and makes model evaluation more objective. (3) The LncPNet model proposed in research (1) is a transductive rather than an inductive learning method. That is, LncPNet cannot represent or make inferences about lncRNA and protein nodes that do not appear in the known network. To address this issue, this thesis further proposes an inductive network representation learning method for predicting lncRNA-protein interactions, called iLncPNet. The DDB method introduced in research (2) is employed to generate negative samples for unbiased model evaluation. iLncPNet not only outperforms LncPNet and other existing models, but can also use the prior network to infer and predict external interaction pairs. |
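
The first step described for LncPNet in (1), random walk sampling over the lncRNA-protein network, can be illustrated with a minimal sketch. This is not the thesis implementation: the function names `build_adjacency` and `random_walks`, the parameters `walk_length` and `walks_per_node`, and the toy node identifiers are all assumptions made for illustration, and the walks here are plain uniform walks on a bipartite graph. The resulting node sequences play the role of "sentences" that a skip-gram model would then consume to produce node vectors; that second step is not shown.

```python
import random
from collections import defaultdict


def build_adjacency(interactions):
    """Build an undirected adjacency list from (lncRNA, protein) pairs."""
    adj = defaultdict(list)
    for lnc, prot in interactions:
        adj[lnc].append(prot)
        adj[prot].append(lnc)
    return adj


def random_walks(adj, walk_length=5, walks_per_node=2, seed=0):
    """Generate uniform random walks starting from every node.

    Each walk alternates between lncRNA and protein nodes because the
    network is bipartite. Hypothetical helper, for illustration only.
    """
    rng = random.Random(seed)
    walks = []
    for start in sorted(adj):
        for _ in range(walks_per_node):
            walk = [start]
            while len(walk) < walk_length:
                # Step to a uniformly chosen neighbour of the current node.
                walk.append(rng.choice(adj[walk[-1]]))
            walks.append(walk)
    return walks


# Toy interaction list (made-up identifiers, not real data).
toy_interactions = [("L1", "P1"), ("L1", "P2"), ("L2", "P1"), ("L3", "P3")]
toy_adj = build_adjacency(toy_interactions)
toy_walks = random_walks(toy_adj, walk_length=5, walks_per_node=2, seed=42)
```

Each element of `toy_walks` is a list of node identifiers of length `walk_length`; nodes that co-occur within a window of such walks are the contexts a skip-gram model would learn from.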
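
The degree bias discussed in (2) can also be made concrete. The sketch below is one possible way to draw negative samples whose endpoints follow the degree distribution of the positive samples; it is an assumption-laden illustration, not necessarily the DDB procedure developed in the thesis. The function name `degree_balanced_negatives`, its parameters, and the toy pairs are hypothetical. The idea is to sample each endpoint with probability proportional to its degree in the positive network and to reject any pair that is a known interaction.

```python
import random


def degree_balanced_negatives(positives, n_samples, seed=0, max_tries=100_000):
    """Sample non-interacting (lncRNA, protein) pairs, degree-proportionally.

    Illustrative sketch only: endpoints are drawn with probability
    proportional to their degree among the positive pairs, so hubs appear
    in the negatives about as often as they do in the positives.
    """
    rng = random.Random(seed)
    positive_set = set(positives)
    # Listing each endpoint once per positive pair makes a uniform draw
    # from these lists equivalent to degree-proportional sampling.
    lnc_stubs = [lnc for lnc, _ in positives]
    prot_stubs = [prot for _, prot in positives]
    negatives = set()
    tries = 0
    while len(negatives) < n_samples and tries < max_tries:
        tries += 1
        pair = (rng.choice(lnc_stubs), rng.choice(prot_stubs))
        if pair not in positive_set:  # never emit a known interaction
            negatives.add(pair)
    # May return fewer than n_samples if max_tries is exhausted.
    return sorted(negatives)


# Toy positives with two hubs, L1 and P1 (made-up identifiers).
toy_positives = [
    ("L1", "P1"), ("L1", "P2"), ("L1", "P3"),
    ("L2", "P1"), ("L3", "P1"), ("L4", "P4"),
]
toy_negatives = degree_balanced_negatives(toy_positives, n_samples=4, seed=1)
```

By contrast, uniform random sampling over all lncRNA-protein pairs would rarely select hub nodes, which is the degree mismatch between positive and negative samples that a model can exploit.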