Font Size: a A A

Algorithm Analysis And Improvement Of The Maximal Information Coefficient

Posted on:2020-04-02Degree:MasterType:Thesis
Country:ChinaCandidate:Y WangFull Text:PDF
GTID:2370330602954282Subject:Engineering
Abstract/Summary:PDF Full Text Request
In today's era of information explosion,massive data has become one of the most prominent features in the world today.The correlation between research data has become a research hotspot in the scientific community.In order to measure whether or not things are related and how they are related,statistical correlation analysis emerges as the times require.Among them,Pearson coefficient,Spearman coefficient and Kendall coefficient are widely used,but these correlation analysis methods cannot detection extensive relationship types due to their limitations.Therefore,in 2011,Reshef et al.introduced a new correlation analysis method,the maximal information coefficient MIC,which has been widely discussed in the scientific community.The maximal information coefficient MIC has two good properties-generality and equitability-relative to other statistics.However,as a computer-intensive method,the calculation of the exact solution of the maximal information coefficient MIC is very difficult.In order to obtain an approximate solution of the maximal information coefficient,Reshef et al.proposed a two-variable MIC approximation algorithm.This paper mainly analyzes the definition of two-variable MIC proposed by Reshef et al.and the approximation algorithm,and improves the defects of its existence.Firstly,combined with relevant literature,this paper analyzes the background of statistical correlation field and the research status at home and abroad,focusing on the related research of the maximal information coefficient MIC,including the definition of the maximal information coefficient MIC of two variables,the existing approximation algorithm and two of them,and compared with other popular correlation analysis methods to further enhance the understanding of the maximal information coefficient MIC.Secondly,through further research and analysis,this paper introduces the concept of "granularity" to explain the nature of the maximal information coefficient method,and to illustrate the essence of normalization in the MIC calculation process and the algorithm core of calculating MIC--The essence of meshing is given and the principle of meshing is given.At the same time,through the experimental analysis,the two properties of the maximal information coefficient MIC,the generality and equitability,are visualized,and the understanding of the maximal information coefficient MIC is further strengthened.Then,aiming at the insufficiency of the existing two-variable MIC approximation algorithm under the big data set,this paper proposes a two-variable MIC clustering algorithm adapted to massive data sets based on K-Means clustering algorithm,and calculates the two-variable MIC.The time complexity of the algorithm has been reduced from O(n2.4to O(n1.6,which improves the computational efficiency,and the design experiment verifies the generality and equitability of the algorithm.Finally,this paper extends the existing two-variable maximal information coefficient MIC definition to the multivariate level,and gives the definition of the maximal information coefficient MIC between the corresponding univariate and multivariate,and proposes univariate and multivariate The approximation algorithm of the MIC between the two,the design experiment to verify the generality and equitability of the algorithm,verify the effectiveness of the algorithm,so that it is possible to mine MIC between univariate and multivariate in the big data set.
Keywords/Search Tags:MIC, essence, computational efficiency, the multivariate
PDF Full Text Request
Related items