Deep Forest Ensemble Learning For Classification Of Alignments Of Non-coding RNA Sequences Based On Multi-view Structure Representations

Posted on: 2022-07-31
Degree: Master
Type: Thesis
Country: China
Candidate: Q Zhang
GTID: 2480306329974469
Subject: Computer application technology
Abstract/Summary:
Since the completion of human genome sequencing, researchers have found that, apart from a small number of protein-coding gene sequences, most of the genome does not take part in protein coding and is only transcribed into RNA. These transcripts are called non-coding RNAs (ncRNAs). It has also been found that the higher a species' evolutionary level, the higher the proportion of non-coding RNA in its genome (up to 98% in the human genome). A growing number of studies show that ncRNAs play key regulatory roles in many important physiological and pathological processes, are closely associated with a variety of diseases, and can serve as molecular targets for disease diagnosis and treatment. There are many types of ncRNA; common families include miRNA, piRNA, snoRNA, snRNA, siRNA, and lncRNA.

With the widespread use of next-generation sequencing, a large number of ncRNAs have been discovered, most of them with unknown functions and molecular mechanisms, leaving a large gap between the volume of ncRNA data and the annotation available for it. Experimental methods often require substantial labor and funding and have long turnaround times, which makes them difficult to apply to large-scale data analysis. It is therefore urgent to use machine learning and information technology to build efficient and accurate models for the functional analysis of ncRNAs. Determining how different ncRNAs are related in sequence, and especially in structure, is one of the most important ways to understand and infer their function. Existing algorithms for determining these relationships are mainly based on unsupervised learning, and because the sequence conservation of most ncRNAs is very low, extracting and making better use of ncRNA structural information remains one of the research challenges.

In this thesis, building on a deep fusion learning framework that combines a convolutional neural network with the deep forest algorithm (gcForest), we propose an ncRNA classification and recognition model that integrates multiple sequence-structure alignment features, called GCFM (gcForest Fusion Method). Compared with unsupervised learning algorithms, GCFM is based on a supervised learning framework and can therefore make better use of known ncRNA family information, which helps uncover the complex and abstract relationships among ncRNAs. GCFM consists of a multi-view structure representation module and a deep integration module:

(i) Multi-view structure representation module: three multi-view representation methods are proposed, namely sequence-structure alignment coding representation, structure image representation, and local structure shape alignment coding representation. Aligning and representing structural characteristics from these different angles and levels enables GCFM to capture potentially specific properties shared between ncRNAs.

(ii) Deep integration module: a deep integration model based on a convolutional neural network and the deep forest algorithm is proposed. The convolution module learns higher-level feature representations. The cascade forest module of the deep forest algorithm trains the final classification model, and each cascade layer is composed of XGBoost, Random Forest, and Extra-Trees learners. Compared with other deep learning architectures, the deep forest algorithm does not require tuning a large number of parameters and achieves better classification and prediction accuracy.

Compared with existing alignment-based ncRNA classification methods, GCFM improves the F value by 6%. In addition, comprehensive and systematic experiments examine the effectiveness of the multi-view structure feature representation and of the deep integration architecture, and the time cost of GCFM relative to a method containing only the convolution module is analyzed.

To further evaluate the validity and usability of the GCFM model, we designed applications of GCFM to three ncRNA tasks: GCFM-based ncRNA clustering, GCFM-inferred ncRNA phylogenetic trees, and GCFM-predicted RNA interactions. In the ncRNA family clustering experiment, a variety of clustering methods were applied to the classification matrix generated by GCFM, using its row vectors as features, to obtain the final clustering results. Compared with existing ncRNA clustering methods (RNAclust, Ensembleclust, and CNNclust), the GCFM-based method improves accuracy by 20% in clustering studies involving unknown ncRNA families. In the phylogenetic trees constructed with GCFM, most ncRNAs were placed correctly. In the RNA interaction study, the prediction accuracy of the GCFM-based method is 90.63%. Finally, to make the method as widely available as possible, we developed a GCFM online server (http://bmbl.sdstate.edu/gcfm/); the source code and related data are available on the server.
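
The cascade forest described in the abstract, where every cascade layer combines several tree ensembles and passes its class-probability vectors on to the next layer, can be illustrated with a short sketch. This is not the thesis implementation. It assumes scikit-learn, uses `GradientBoostingClassifier` as a stand-in for XGBoost to avoid an extra dependency, trains on synthetic data in place of ncRNA features, and fixes the cascade depth instead of growing it adaptively. The names `make_layer`, `fit_cascade`, and `predict_cascade` and all hyperparameters are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    ExtraTreesClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
)
from sklearn.model_selection import cross_val_predict, train_test_split


def make_layer(seed):
    """One cascade layer: three tree ensembles (hypothetical settings)."""
    return [
        # Assumption: gradient boosting stands in for XGBoost here.
        GradientBoostingClassifier(n_estimators=50, random_state=seed),
        RandomForestClassifier(n_estimators=50, random_state=seed),
        ExtraTreesClassifier(n_estimators=50, random_state=seed),
    ]


def fit_cascade(X, y, n_layers=2, seed=0):
    """Fit a fixed-depth cascade. Each layer after the first sees the raw
    features plus the class probabilities produced by the previous layer."""
    layers, augmented = [], X
    for depth in range(n_layers):
        layer = make_layer(seed + depth)
        # Out-of-fold probabilities, so the next layer is not trained on
        # predictions made for samples the model has already seen.
        probas = [
            cross_val_predict(m, augmented, y, cv=3, method="predict_proba")
            for m in layer
        ]
        for m in layer:
            m.fit(augmented, y)
        layers.append(layer)
        augmented = np.hstack([X] + probas)
    return layers


def predict_cascade(layers, X):
    """Pass samples through every layer; average the last layer's probabilities."""
    augmented = X
    for layer in layers:
        probas = [m.predict_proba(augmented) for m in layer]
        augmented = np.hstack([X] + probas)
    best = np.mean(probas, axis=0).argmax(axis=1)
    return layers[-1][0].classes_[best]


# Synthetic two-class data as a placeholder for real alignment features.
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=8, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
layers = fit_cascade(X_tr, y_tr)
pred = predict_cascade(layers, X_te)
accuracy = float((pred == y_te).mean())
print(f"held-out accuracy: {accuracy:.2f}")
```

In GCFM the input to the cascade would be the feature representation learned by the convolution module; the sketch omits that stage entirely.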
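
The clustering step, in which each row of the GCFM classification matrix serves as the feature vector of one ncRNA, can be sketched in the same spirit. The abstract says only that a variety of clustering methods were used, so the average-linkage agglomerative clustering below is an illustrative choice. The score matrix is invented for the example, and `average_linkage` is a hypothetical helper name.

```python
from itertools import combinations
from math import dist

# Hypothetical pairwise scores: entry [i][j] is an assumed probability that
# ncRNA i and ncRNA j belong to the same family (values invented).
scores = [
    [1.00, 0.92, 0.88, 0.10, 0.05],
    [0.92, 1.00, 0.90, 0.12, 0.08],
    [0.88, 0.90, 1.00, 0.15, 0.11],
    [0.10, 0.12, 0.15, 1.00, 0.95],
    [0.05, 0.08, 0.11, 0.95, 1.00],
]


def average_linkage(rows, n_clusters):
    """Agglomerative clustering that uses each matrix row as a feature vector."""
    clusters = [[i] for i in range(len(rows))]

    def linkage(a, b):
        # Mean Euclidean distance over all cross-cluster pairs of rows.
        total = sum(dist(rows[i], rows[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > n_clusters:
        # Merge the two clusters with the smallest average distance.
        a, b = min(combinations(clusters, 2), key=lambda pair: linkage(*pair))
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(sorted(a + b))
    return sorted(clusters)


families = average_linkage(scores, n_clusters=2)
print(families)  # → [[0, 1, 2], [3, 4]]
```

Because rows rather than single entries are compared, two ncRNAs end up in the same cluster when they relate to all other sequences in a similar way, even if their direct pairwise score is only moderate.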
Keywords/Search Tags: pairwise ncRNAs classification, ncRNAs clustering, multi-view structure feature representation, gcForest, deep fusion framework