Correlation And Compressibility Analysis Of High-dimensional Classification Data

Posted on: 2019-03-04
Degree: Master
Type: Thesis
Country: China
Candidate: L L Xu
GTID: 2370330548987460
Subject: Statistics
Abstract/Summary:
Statistical data are usually divided, according to measurement scale, into qualitative data and quantitative data. Qualitative data describe the nature or type of things in non-numeric form and comprise categorical (classification) data and ordinal data. For categorical data, apart from regression analysis based on the logistic model, most research concerns the relationships and correlations between variables. In the multi-dimensional case especially, the complex relationships among variables and their correlation structure are both the focus of research and its main difficulty.

Categorical data are common throughout the social sciences, especially in questionnaire surveys, and in medical and psychological data. Because many questions and factors are involved, the results are often presented as high-dimensional contingency tables. Analyzing a high-dimensional contingency table directly is difficult and cumbersome, so the table needs to be simplified. This requires studying the relationships between the variables: Are they independent? Are they correlated? If so, what is the structure of the correlation, and how can it be expressed?

We therefore start from the most basic question, the test of independence. The chi-square test and the likelihood ratio test of the independence hypothesis are valid and stable only for large samples, whereas a high-dimensional contingency table inevitably reduces the frequency in each cell. There are two ways to address this problem. One is to collect more samples and so increase the count in each cell. The other is to reduce the number of cells by compressing variables, that is, by compressing the original high-dimensional contingency table.

For the first idea, the key is to increase the sample size. When some of the variables are abstract and do not exist in any available data set, the corresponding data must be collected by questionnaire, and the amount of data obtained by such labor-intensive and time-consuming methods is usually very limited. As for the second idea, ever since Simpson described Simpson's paradox in 1951, the compression of high-dimensional contingency tables has been a worthwhile research problem. If a high-dimensional contingency table is compressed improperly, problems such as spurious correlation, spurious independence, and Simpson's paradox arise.

This thesis addresses both ideas and proposes corresponding solutions. The main contents are as follows:

(1) Based on the log-linear model of the three-dimensional contingency table, a theorem on the compressibility of the contingency table is given and then extended to the high-dimensional contingency table, where the conclusion remains valid. The theorem not only describes the relationships between variables but also explains, to some extent, when "homogeneity" occurs, that is, when the ratio between two variables does not change with the values of the other variables.

(2) Building on existing compressibility theorems for three- and four-dimensional contingency tables, we study association and compressibility theorems for high-dimensional contingency tables by means of the relationship between the log-linear model and the correlation graph. Compared with previous results, our method extends naturally to five and more dimensions. We also establish a more intuitive compressibility theorem stated in terms of the correlation graph, which shows which variables are compressible and which are not.

(3) Building on the existing mutual-information-based ranking of variable importance for three- and four-dimensional contingency tables, we further study a ranking of variables based on conditional mutual information. The results show that the two rankings are not consistent. Besides the compressibility theorems based on the log-linear model and the correlation graph proposed here, there are other criteria for deciding whether a variable is compressible, such as compression analysis based on the linear information model or on entropy, and these criteria may give different answers. The compressibility ranking presented in this thesis serves as a yardstick against which their results can be measured.

(4) For categorical variables whose data are hard to collect, the available sample is usually limited. To obtain more effective samples, this thesis proposes using the Bootstrap sampling method to generate a number of data sets, fitting a log-linear model to each to obtain an estimate vector of the model parameters, and clustering these vectors to obtain several representative parameter estimate vectors that provide candidate models for prediction. The experimental results show that, even when individual parameters differ from those of the true model, the KL distance between the probability distribution of the model corresponding to several of the parameter estimate vectors and that of the true model is small, that is, the two distributions are very close. Moreover, the closer the entries of an estimate vector are to the confidence intervals of the corresponding parameters, the smaller the KL distance to the true probability distribution.

Exploring the relationships between categorical variables and building models of them is very important. The high-dimensional contingency tables that are common in categorical data, and insufficient sample sizes, not only make the analysis more difficult but also make the estimated relationships between variables, and the resulting models, unreliable. To address this, the thesis proposes compressibility theorems, a compressibility ranking, and a method of adding samples using the Bootstrap sampling method.
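The risk of improper compression mentioned in the abstract can be illustrated with a small numerical sketch. The code below is not the thesis's method or its compressibility theorem. It only compares the odds ratio between two binary variables X and Y within each level of a third variable Z against the odds ratio obtained after collapsing (summing) the table over Z. The counts, the variable roles, and the function name `odds_ratio` are illustrative assumptions, not values from the thesis.

```python
import numpy as np

def odds_ratio(t):
    """Odds ratio of a 2x2 table of counts t[x, y]."""
    return (t[0, 0] * t[1, 1]) / (t[0, 1] * t[1, 0])

# Illustrative 2x2x2 table of counts, indexed as table[x, y, z].
# These numbers are assumptions chosen to show the effect; they are
# not taken from the thesis.
table = np.array(
    [
        [[81.0, 192.0], [6.0, 71.0]],    # x = 0: (y = 0; z = 0, 1), (y = 1; z = 0, 1)
        [[234.0, 55.0], [36.0, 25.0]],   # x = 1
    ]
)

# Odds ratio between X and Y within each level of Z (the conditional association).
conditional = [odds_ratio(table[:, :, z]) for z in range(table.shape[2])]

# Odds ratio between X and Y after compressing the table over Z (the marginal association).
marginal = odds_ratio(table.sum(axis=2))

print("conditional odds ratios:", [round(v, 3) for v in conditional])
print("marginal odds ratio:", round(marginal, 3))
```

With these counts, both conditional odds ratios are greater than 1 while the marginal odds ratio is less than 1, so collapsing over Z reverses the direction of the association. This is the Simpson's paradox situation that a compressibility criterion is meant to rule out before a variable is removed from the table.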
Keywords/Search Tags:Contingency Table, Log-linear Model, Correlation Graph, Compressibility Theorem, Compressibility Ranking, Bootstrap Sampling