| SnoRNA is a class of short non-coding RNA, usually between 60 and 300 nucleotides long. Although snoRNA is not directly involved in protein synthesis, it can regulate mRNA translation by participating in the post-transcriptional modification of other important RNAs, such as rRNA. The classical function of snoRNA is to guide two types of post-transcriptional chemical modification: methylation, guided by C/D box snoRNA, and pseudouridylation, guided by H/ACA box snoRNA. In addition, some studies indicate that snoRNA may function similarly to siRNA, which suppresses mRNA expression by pairing with and binding to the target mRNA. Because snoRNA is involved in fundamental biological processes, it has been reported to be associated with various diseases. Traditional methods for studying the association between a snoRNA and a disease usually require collecting tissue samples from patients and running experiments, such as RNA sequencing, to assess the strength of the association. As snoRNA research has developed, many publicly available snoRNA-disease associations have been collected and organized into dedicated databases, which lay the foundation for predicting snoRNA-disease associations computationally. Although a computed prediction cannot confirm a real association between a snoRNA and a disease, predictions made by reasonable algorithms can suggest directions for research. Currently, research on snoRNA-disease associations is still relatively limited. Against this background, this article explores a matrix completion method for predicting snoRNA-disease associations. By collecting experimentally confirmed snoRNA-disease associations from the RNADisease database, the snoRNAs and diseases involved are identified. Subsequently, the nucleic acid sequences of the snoRNAs are downloaded from public databases, and five numerical feature descriptors of snoRNA are extracted, including 3-mer, Z-curve nucleotide frequency, and mutual information. The similarity between each pair of snoRNAs is then calculated as the absolute value of the Tanimoto coefficient, yielding a snoRNA similarity matrix. In addition, by searching the topological structures of all diseases in the MeSH and DO databases, a directed acyclic graph is constructed for each disease, and a semantic similarity matrix between diseases is calculated from these graphs. Building on the widely used semantic similarity calculation method for diseases, this article proposes an improved semantic similarity calculation method, whose advantage is that the similarity values are more dispersed and carry more information. Based on the snoRNA similarity matrix and the disease similarity matrix, this article constructs two types of input datasets, one for machine learning algorithms and one for matrix completion algorithms. The machine learning algorithms include random forest, logistic regression, multilayer perceptron, and support vector machine models. For matrix completion, we explore a bounded nuclear norm regularized matrix completion method, which restricts the output values to a fixed range and imposes constraints on the observed associations, making it better suited to association prediction problems. Evaluated with standard machine learning metrics, the machine learning models all reach AUROC values above 0.83, while the matrix completion model reaches an AUROC of 0.95. Moreover, for the problem explored in this article, sensitivity (the recall on class 1 samples) is the more important metric; it is 0.84 for the machine learning model and 0.89 for the matrix completion model. These indicators show that the models built in this article have sound association prediction ability. |
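
The snoRNA similarity step described above can be illustrated with a minimal sketch. It covers only the 3-mer descriptor (the Z-curve and mutual information descriptors are omitted) and assumes the common continuous form of the Tanimoto coefficient, T(a, b) = a·b / (|a|² + |b|² − a·b), since the abstract does not state the exact formula. The function names `kmer_frequencies`, `tanimoto`, and `similarity_matrix` and the toy sequences are hypothetical, introduced here for illustration only.

```python
from itertools import product

import numpy as np

K = 3
ALPHABET = "ACGU"
# All 4^3 = 64 possible 3-mers, in a fixed order, mapped to vector positions.
KMERS = ["".join(p) for p in product(ALPHABET, repeat=K)]
KMER_INDEX = {kmer: i for i, kmer in enumerate(KMERS)}


def kmer_frequencies(seq):
    """Return the normalized 3-mer frequency vector of an RNA sequence.

    Assumption: DNA-style input is accepted by mapping T to U; windows
    containing other characters (e.g. N) are skipped.
    """
    seq = seq.upper().replace("T", "U")
    counts = np.zeros(len(KMERS))
    for i in range(len(seq) - K + 1):
        idx = KMER_INDEX.get(seq[i:i + K])
        if idx is not None:
            counts[idx] += 1
    total = counts.sum()
    return counts / total if total > 0 else counts


def tanimoto(a, b):
    """Absolute value of the continuous Tanimoto coefficient of two vectors."""
    dot = float(np.dot(a, b))
    denom = float(np.dot(a, a) + np.dot(b, b) - dot)
    return abs(dot / denom) if denom != 0 else 0.0


def similarity_matrix(features):
    """Build the symmetric pairwise similarity matrix for a list of vectors."""
    n = len(features)
    sim = np.zeros((n, n))
    for i in range(n):
        for j in range(i, n):
            sim[i, j] = sim[j, i] = tanimoto(features[i], features[j])
    return sim


if __name__ == "__main__":
    # Toy sequences, not real snoRNAs.
    sequences = ["ACGUACGUAC", "ACGUUCGUAC", "GGGGGGGG"]
    features = [kmer_frequencies(s) for s in sequences]
    print(similarity_matrix(features))
```

In a fuller pipeline the five descriptors would presumably be combined into one feature vector per snoRNA before the pairwise similarity is computed; how they are combined is not specified in the abstract.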