Faced with the massive data growing in genomics, physics, political science, economics, and many other fields, people increasingly rely on computers to intelligently extract useful information for solving problems. As two effective techniques of intelligent data analysis, correlation detection and dimensionality reduction have attracted wide attention from researchers. Correlation detection methods can automatically find relationships in large datasets; dimensionality reduction methods map data from a high-dimensional feature space to a low-dimensional one, which better reflects the nature of the data and reduces the computational complexity of subsequent processing. In this thesis, we report theoretical and practical results of our research on correlation detection and dimensionality reduction methods. The main work can be summarized as follows:

1. Detecting Multivariable Correlation with Maximal Information Entropy

For k variables, first, the maximal information matrix R is constructed from the maximal information coefficient (MIC) scores of all pairs of variables; then, the maximal information joint entropy H_R^k of the k variables is computed from the positive eigenvalues of R; finally, the maximal information entropy (MIE), which measures the correlation degree of the k variables, is calculated as 1 - H_R^k. Simulation results show that MIE can detect one-dimensional manifold dependence among triplets. Applications to World Health Organization datasets further verify the validity and feasibility of MIE.

2. A Detection Measure for Trivariate One-dimensional Manifold Dependences

MIC is a good measure for detecting linear and nonlinear correlation between pairs of variables, but it is not directly applicable to triplets. Based on the idea of MIC and the concept of total correlation, we propose the maximal total correlation coefficient (MTCC), which measures a one-dimensional manifold dependence among three variables with a score in [0, 1]. MTCC is based on the idea
that if a relationship exists among a triplet, then a 3D grid can be drawn on the scatterplot of the triplet that partitions the data so as to encapsulate that relationship. Following the strategy used to compute MIC, we also present an efficient dynamic programming method to approximate the true value of MTCC in practice. Experimental results on simulated and real datasets verify the generality, equitability, and validity of MTCC.

3. Feature Clustering Dimensionality Reduction Based on Affinity Propagation

Feature clustering (FC) is a powerful technique for dimensionality reduction. However, existing approaches require the number of clusters to be given in advance or controlled by parameters. By combining feature clustering with affinity propagation (AP), we propose a new feature clustering algorithm, called APFC, for dimensionality reduction. For a given training dataset, the original features automatically form a set of clusters via AP. A new feature can then be extracted from each cluster in three different ways to reduce the dimensionality of the original data. APFC requires no specification of the number of clusters (or extracted features) beforehand. Moreover, it avoids computing the eigenvalues and eigenvectors of the covariance matrix. Extensive experiments on UCI datasets, in terms of classification accuracy and computational time, demonstrate the effectiveness and efficiency of APFC.

4. Fisher Information Metric Based Stochastic Neighbor Embedding

To improve the accuracy of text classification, Fisher information metric based stochastic neighbor embedding (FIMSNE) is proposed. Text word-frequency vectors are taken as probability density functions, i.e., points on a statistical manifold, and the distance between them is defined by the Fisher information metric. From the viewpoint of information geometry, t-distributed stochastic neighbor embedding (t-SNE) is thereby improved to FIMSNE. With 2D-embedding and classification tasks on the 20Newsgroups, TDT2, and Reuters21578 datasets, we verify that FIMSNE outperforms t-SNE, Fisher
information nonparametric embedding (FINE), and PCA on the whole.

5. Deep Pearson Embedding for Dimensionality Reduction

A new correlation-structure-preserving embedding technique, called deep Pearson embedding (DPE), is proposed for dimensionality reduction. DPE learns a parametric mapping from the high-dimensional data space to a low-dimensional latent space by minimizing the Pearson correlation coefficient between the similarity matrix S in the original data space and the Euclidean distance matrix D in the latent space. In this way, the correlation structure of the data is preserved as well as possible in the latent space. Visualization and classification experiments on the MNIST, COIL-20, Extended Yale B, and AR datasets demonstrate the effectiveness and superiority of DPE.
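The MIE computation of item 1 can be sketched in a few lines. This is a minimal illustration, not the thesis's exact recipe: the absolute Pearson coefficient stands in for the MIC score of each variable pair, and the log-base-k normalization of the entropy is an assumption made so that MIE lands in [0, 1].

```python
import numpy as np

def maximal_information_entropy(X, pairwise_score=None):
    """MIE sketch for the k variables given as columns of X.

    Placeholder choices (assumptions, not the thesis's definitions):
    |Pearson r| replaces the MIC score, and the eigenvalue entropy is
    normalized by log(k) so the result stays in [0, 1].
    """
    n, k = X.shape
    if pairwise_score is None:
        pairwise_score = lambda a, b: abs(np.corrcoef(a, b)[0, 1])
    # Maximal information matrix R: pairwise scores, ones on the diagonal.
    R = np.eye(k)
    for i in range(k):
        for j in range(i + 1, k):
            R[i, j] = R[j, i] = pairwise_score(X[:, i], X[:, j])
    # Joint entropy H_R^k from the positive eigenvalues of R.
    lam = np.linalg.eigvalsh(R)
    lam = lam[lam > 1e-12]
    p = lam / lam.sum()
    H = -(p * np.log(p)).sum() / np.log(k)
    return 1.0 - H  # MIE: near 1 for full dependence, near 0 for independence

rng = np.random.default_rng(0)
x = rng.normal(size=500)
dependent = np.column_stack([x, 2 * x + 1, -x])  # fully dependent triplet
independent = rng.normal(size=(500, 3))          # independent triplet
```

For the dependent triplet, R is (numerically) the all-ones matrix, so a single positive eigenvalue carries all the mass and MIE is 1; for independent columns, R is close to the identity and MIE is close to 0.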
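For item 4, the Fisher information distance between two word-frequency vectors, viewed as discrete (multinomial) distributions, has a well-known closed form via the Bhattacharyya coefficient; the sketch below shows only this distance, with the surrounding t-SNE machinery omitted.

```python
import numpy as np

def fisher_information_distance(p, q):
    """Geodesic distance between two discrete distributions under the
    Fisher information metric: 2 * arccos(sum_i sqrt(p_i * q_i)).
    p and q must be nonnegative and sum to 1, e.g. L1-normalized
    word-frequency vectors."""
    bc = np.sqrt(p * q).sum()                 # Bhattacharyya coefficient
    return 2.0 * np.arccos(np.clip(bc, 0.0, 1.0))  # clip guards rounding
```

A precomputed matrix of such distances can then replace the Euclidean distances that t-SNE uses when building its neighborhood probabilities, which is the substitution FIMSNE performs.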
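The DPE objective of item 5 reduces to a single scalar loss. The numpy sketch below shows only that loss; the deep network producing the embedding Z is omitted, and the particular choice of similarity matrix S in the usage example is an assumption for illustration.

```python
import numpy as np

def dpe_loss(S, Z):
    """Pearson correlation between off-diagonal similarities S (n x n,
    original space) and Euclidean distances of the embedding Z (n x d).
    DPE minimizes this value: a good embedding drives it toward -1,
    since high similarity should correspond to small distance."""
    diff = Z[:, None, :] - Z[None, :, :]
    D = np.sqrt((diff ** 2).sum(-1))        # pairwise distance matrix
    iu = np.triu_indices_from(S, k=1)       # strict upper triangle only
    s = S[iu] - S[iu].mean()
    d = D[iu] - D[iu].mean()
    return (s * d).sum() / np.sqrt((s ** 2).sum() * (d ** 2).sum())

rng = np.random.default_rng(1)
Z = rng.normal(size=(20, 2))
gaps = np.sqrt(((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1))
S = np.exp(-gaps)   # hypothetical similarity that decays with distance
```

With this S, the embedding Z is already consistent with the similarities, so the loss is strongly negative; training the parametric mapping pushes the loss of an arbitrary initialization toward this regime.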