RNA molecules are indispensable in biology, participating in many of the cell's most fundamental processes, including catalysis, RNA splicing, RNA editing, and the regulation of transcription and translation. The function of an RNA molecule is closely related to its structure. However, experimental determination of RNA structure is expensive and time-consuming, and computational approaches to RNA tertiary structure are so far less than optimal. Computational methods for modeling RNA secondary structure have proven valuable for determining the tertiary structure and function of an RNA molecule. Predicting RNA secondary structure from a free energy model has the problem that the true structure may be a suboptimal structure within an energy increment above the minimum free energy. Prediction accuracy can be improved by grouping suboptimal structures into a small number of clusters and computing a representative structure for each cluster. In this paper, we study clustering algorithms for RNA secondary structure prediction; the main contributions are as follows.

First, because the distribution and the number of clusters of RNA structures are unknown, we propose a density-based clustering algorithm with extensible radius, dubbed ER-DBSCAN, to cluster RNA suboptimal structures. The algorithm selects different initial radii for clusters with different densities, and clustering proceeds from higher-density points toward lower-density points. It selects the unclassified point of highest density as the starting point of a new cluster, and the radius of the cluster is adjusted automatically during cluster expansion according to the density distribution and density variation.
This method not only allows proper density variations within clusters, but also detects clusters separated by regions of different density.

Second, this study introduces a density clustering algorithm based on feature selection, called RSFS-ER, to cluster RNA suboptimal structures. The RSFS-ER algorithm uses a cluster ensemble to generate a consensus matrix, which reflects the internal structure of the data set. It evaluates the importance of each feature for clustering by comparing the consensus matrix with the similarity matrix of each feature. We then run the ER-DBSCAN algorithm on the data set restricted to the optimal feature subset to ensure the quality of the clustering results.

Finally, this study uses the RBP score as a measure of RNA secondary structure characteristics and calculates the RBP matrix using the RBP algorithm. The ER-DBSCAN and RSFS-ER algorithms are applied to cluster RNA secondary structures using the RBP matrix as their input, and the experimental results are analyzed.
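
As a rough illustration of the density-ordered seeding described for ER-DBSCAN, the following Python sketch grows clusters from the densest unclassified point outward and gives each cluster its own radius. It is not the ER-DBSCAN algorithm itself. The function names, the parameters `k`, `slack`, and `min_size`, and the use of the k-th nearest-neighbor distance as a density proxy are all assumptions made for illustration. The sketch also keeps the radius fixed once a cluster is seeded, whereas the method above adjusts it during expansion, and it works on plain Euclidean points rather than an RBP matrix.

```python
from collections import deque
from math import dist


def kth_nn_distance(points, i, k):
    """Distance from points[i] to its k-th nearest neighbor (small = dense)."""
    distances = sorted(dist(points[i], q) for j, q in enumerate(points) if j != i)
    return distances[k - 1]


def density_seeded_clusters(points, k=2, slack=1.5, min_size=3):
    """Toy density-ordered clustering with a per-cluster radius.

    Assumed simplifications: density is approximated by the k-th
    nearest-neighbor distance, and each cluster's radius is fixed at
    slack * (k-th NN distance of its seed) instead of being adapted
    during expansion. Returns one label per point; -1 marks noise.
    """
    n = len(points)
    knn = [kth_nn_distance(points, i, k) for i in range(n)]
    labels = [None] * n
    next_label = 0

    # Visit candidate seeds from the densest point to the sparsest.
    for seed in sorted(range(n), key=lambda i: knn[i]):
        if labels[seed] is not None:
            continue
        radius = slack * knn[seed]  # dense seeds get small radii
        labels[seed] = next_label
        members = [seed]
        queue = deque([seed])
        # Breadth-first expansion over still-unclassified points.
        while queue:
            p = queue.popleft()
            for q in range(n):
                if labels[q] is None and dist(points[p], points[q]) <= radius:
                    labels[q] = next_label
                    members.append(q)
                    queue.append(q)
        if len(members) < min_size:
            for m in members:  # too small to count as a cluster
                labels[m] = -1
        else:
            next_label += 1
    return labels


# A tight group, a looser group, and one isolated point.
points = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),
    (5.0, 5.0), (6.0, 5.0), (5.0, 6.0), (6.0, 6.0),
    (20.0, 20.0),
]
labels = density_seeded_clusters(points)
```

Because the radius is tied to the seed's local density, the tight group and the looser group are each recovered with a radius suited to their own spacing, which a single global radius would not guarantee.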