
Statistical Analysis Of Massive Imbalanced Data With Multiclass Logistic Regression

Posted on: 2016-07-06
Degree: Master
Type: Thesis
Country: China
Candidate: X J Chen
Full Text: PDF
GTID: 2180330464456297
Subject: Statistics
Abstract/Summary:
In recent years, with the rapid development of information technology and the Internet, we have entered the era of big data. Analyzing data at this scale challenges traditional statistical methods and computation. When a dataset is too large, traditional statistical estimation methods cannot be carried out on an ordinary computer: the data may exceed memory, or the results cannot be obtained within a tolerable time. These barriers have greatly limited the application of advanced statistical techniques.

There are two main ways to meet the challenge of analyzing massive data. The first is to deploy a distributed processing system such as Hadoop or Spark on a large computer cluster and run parallel computations based on the MapReduce algorithm. This approach is expensive for ordinary users. The second is subsampling: rather than analyzing the massive full dataset, one analyzes a smaller subsample drawn from it in a reasonable way, thereby reducing the computational cost.

Sampling is a challenging problem in classification learning when the dataset is imbalanced. The popular uniform random sampling method has a serious flaw: because the classes are distributed so unevenly, a subsample drawn uniformly at random may contain only a very small number of minority-class observations, or may miss some classes entirely. Applying popular classification algorithms directly to such a subsample is then no longer valid.

This thesis studies effective sampling strategies for massive imbalanced multiclass data under the multinomial logistic regression model. We prove that, under such sampling, the intercept parameters must be corrected according to the ratio of the sampling probabilities. We also present the correction formula and use numerical simulation to study the effectiveness of the sampling strategy. The main work is as follows:

1. We propose a subsampling method based on the case-control study design for the multinomial regression model with imbalanced categorical data, and present the correction formula. We use numerical simulation to compare the new method with the popular uniform random sampling method.
2. We propose a new estimation method for the multinomial regression model on large-scale multiclass imbalanced data, based on multiple binomial regressions combined with the idea of case-control sampling, and use random simulation to study its effectiveness on massive datasets with a large number of classes.
3. We study the effectiveness of estimation under various sampling methods through statistical simulation, and compare the efficiency loss of the subsamples relative to the full sample.
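The abstract does not state the correction formula itself. As a rough illustration only, the sketch below uses the standard case-control result for logistic models: if class k is kept with probability r_k, the subsample intercept of class k relative to a baseline class is shifted by log(r_k / r_baseline), while the slopes are unchanged. Whether this matches the thesis's formula exactly is an assumption. All function names, sampling rates, and class proportions are hypothetical, and an intercept-only model (whose maximum likelihood estimate is the log ratio of class counts) stands in for a full multinomial fit so that the example needs only the Python standard library.

```python
import math
import random
from collections import Counter


def case_control_subsample(labels, rates, rng):
    """Keep each observation with a probability that depends only on its class.

    labels: list of class labels; rates: dict mapping class -> keep probability
    (hypothetical names). Returns the indices of the retained observations.
    """
    return [i for i, y in enumerate(labels) if rng.random() < rates[y]]


def correct_intercepts(sub_intercepts, rates, baseline):
    """Map subsample intercepts (log-odds vs. baseline) back to the full-data scale.

    Assumed correction: class-dependent sampling shifts the intercept of
    class k by log(rates[k] / rates[baseline]), so that shift is subtracted.
    """
    return {
        k: b - math.log(rates[k] / rates[baseline])
        for k, b in sub_intercepts.items()
    }


def intercept_only_fit(labels, baseline):
    """MLE of an intercept-only multinomial logit: log count ratios vs. baseline."""
    counts = Counter(labels)
    return {
        k: math.log(c / counts[baseline])
        for k, c in counts.items()
        if k != baseline
    }


rng = random.Random(0)
# Imbalanced toy data: class 0 dominates, classes 1 and 2 are rare.
labels = rng.choices([0, 1, 2], weights=[0.96, 0.03, 0.01], k=200_000)

# Keep every rare-class observation, but only 5% of the majority class.
rates = {0: 0.05, 1: 1.0, 2: 1.0}
kept = case_control_subsample(labels, rates, rng)
sub_labels = [labels[i] for i in kept]

full = intercept_only_fit(labels, baseline=0)       # reference: full data
naive = intercept_only_fit(sub_labels, baseline=0)  # biased by the sampling
corrected = correct_intercepts(naive, rates, baseline=0)

# Columns: class, full-data intercept, naive subsample intercept, corrected one.
for k in sorted(full):
    print(k, round(full[k], 3), round(naive[k], 3), round(corrected[k], 3))
```

In this toy setting the uncorrected subsample intercepts overstate the rare classes by roughly log(1 / 0.05), and the corrected values should fall close to the full-data ones while using only a small fraction of the majority class. With covariates, the same intercept shift would be applied after fitting the multinomial model on the subsample.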
Keywords/Search Tags: Imbalanced data, Logistic regression, Subsampling, Multi-classification