| In biological research, traditional bulk RNA sequencing (RNA-seq) processes thousands of cells at a time and therefore captures only an average expression level across them. RNA sequencing technology has since improved, but the accompanying bioinformatics analysis remains weak: KEGG pathway analysis and gene set enrichment analysis (GSEA) alone rarely yield substantial, useful information. Single-cell RNA sequencing (scRNA-seq) has greatly improved our understanding of biological systems and has revolutionized transcriptomic research by revealing single-cell heterogeneity at high resolution. Clustering methods have become routine analytical tools for identifying cell types, characterizing their functions, and inferring the underlying cellular dynamics. Several clustering methods exist for scRNA-seq data, but almost all of them build the similarity matrix from traditional similarity measures, which cannot deal effectively with high-dimensional datasets, especially high-dimensional sparse ones. To address this drawback, this paper constructs new similarity measures from two different perspectives. The main research contents are as follows: (1) Based on the concept of shared nearest neighbors, an improved secondary (quadratic) similarity measure Y(xi,xj) is constructed. Existing secondary similarity measures focus mainly on how shared nearest neighbors affect the similarity between data samples and ignore the similarity between the two samples themselves. When calculating the similarity between two data samples, Y(xi,xj) therefore considers both the contribution of the shared nearest neighbor samples and the direct similarity of the two samples. To verify the effectiveness of this measure, two relatively simple clustering methods, K-means and k-medoids, are applied to both the widely used traditional distance-based similarity measure and the secondary similarity measure Y(xi,xj). The silhouette results show that Y(xi,xj) generally outperforms the traditional distance measure and, to some extent, overcomes the low reliability of traditional similarity measures on high-dimensional datasets. (2) The existing secondary similarity measure w(xi,xj) and Y(xi,xj) share a common problem: they consider only the effect of individual neighbor samples among the shared nearest neighbors on the similarity of the two data samples. To measure the similarity of two data samples more comprehensively and accurately, a new secondary similarity measure Z(xi,xj) is constructed, also based on the concept of shared nearest neighbors. This measure fully accounts for the influence of multiple shared nearest neighbors, and so measures the similarity between high-dimensional data samples more accurately. To verify the effectiveness of Z(xi,xj), K-means and k-medoids are used to cluster with, and compare, the similarity measures w(xi,xj), Y(xi,xj), and Z(xi,xj). The results show that Z(xi,xj) is effective and stable. In summary, to overcome the low reliability of traditional similarity measures on high-dimensional sparse data and to improve on the existing secondary similarity measures, this paper proposes two secondary similarity measures, Y(xi,xj) and Z(xi,xj), which provide a more effective similarity measure for clustering high-dimensional sparse datasets and favorable conditions for further clustering research and analysis. |
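
The abstract does not give the formulas for Y(xi,xj) or Z(xi,xj), so the following is only a minimal sketch of the general idea behind a shared-nearest-neighbor secondary (quadratic) similarity that also accounts for the direct similarity of the two samples. The function names, the `1 / (1 + distance)` direct-similarity term, the overlap ratio, and the mixing weight `alpha` are all illustrative assumptions, not the measures defined in this paper.

```python
import math


def euclidean(a, b):
    """Plain Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_sets(data, k):
    """For each sample index, return the set of indices of its k nearest neighbours."""
    neighbours = []
    for i, xi in enumerate(data):
        others = sorted(
            (j for j in range(len(data)) if j != i),
            key=lambda j: euclidean(xi, data[j]),
        )
        neighbours.append(set(others[:k]))
    return neighbours


def snn_similarity(data, k, alpha=0.5):
    """Hypothetical secondary similarity (NOT the paper's Y or Z).

    Blends two terms for every pair (i, j):
      direct: 1 / (1 + distance), the similarity of the two samples themselves
      shared: fraction of the k nearest neighbours that the two samples share
    alpha is an assumed mixing weight between the two terms.
    """
    nn = knn_sets(data, k)
    n = len(data)
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                sim[i][j] = 1.0
                continue
            direct = 1.0 / (1.0 + euclidean(data[i], data[j]))
            shared = len(nn[i] & nn[j]) / k
            sim[i][j] = alpha * direct + (1.0 - alpha) * shared
    return sim


# Toy data: two well-separated groups of three points each.
toy = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
       (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
toy_sim = snn_similarity(toy, k=2)
print("within-group similarity :", toy_sim[0][1])
print("between-group similarity:", toy_sim[0][3])
```

On the toy data, samples from the same group share neighbors and are close to each other, so both terms push their similarity up, while samples from different groups share no neighbors and score low. A matrix of this kind could then be passed to a method such as k-medoids, which works directly on pairwise similarities or dissimilarities.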