Distance-weighted Discrimination Methods For High-dimensional And Imbalanced Multi-classification Problems

Posted on: 2022-02-03
Degree: Master
Type: Thesis
Country: China
Candidate: Q Hu
Full Text: PDF
GTID: 2557306323969699
Subject: Statistics
Abstract/Summary:
With the development of the information age, large volumes of high-dimensional classification data have become available, and these data often have imbalanced categories. In patient genetic data, for example, the numbers of sick and healthy people differ greatly, and the number of people suffering from each disease also differs. Multi-classification models have been studied extensively, but many are built on the assumption that the sample size exceeds the number of variables (n > p), and they do not correct for the imbalance. This paper mainly explores the high-dimensional imbalanced multi-classification problem under the condition p > n.

To solve the high-dimensional multi-classification problem, this paper proposes two multi-class distance-weighted discrimination (DWD) models. One is the SMDWD model, based on pairwise comparison of categories; the other is the ASMDWD model, based on angles. An elastic-net penalty is added to the models for variable screening. To address the imbalance problem, two processing methods are proposed: one uses proportional weighting to correct the loss function, and the other uses the idea of ensemble models to carry out data sampling. Both methods are combined with the two models. The models are solved by coordinate descent.

Several multi-classification problems are simulated, and the models are evaluated on the accuracy of class prediction and of variable screening. The results show that when high-dimensional data contain many redundant variables, the proposed methods outperform the existing multi-classification distance-weighted discrimination model: they screen out the important variables and improve classification accuracy. SMDWD and ASMDWD each have their own advantages in classification accuracy and variable selection, depending on the distribution of the data set. The computational complexity of ASMDWD does not increase linearly as the number of categories grows, so it has an advantage in computing speed. In addition, the simulation results show that the two imbalance correction methods proposed in this paper can be applied effectively to imbalanced classification problems.

Finally, the methods are applied to a real lung tumor classification task. The objective is to classify tumor cells according to genetic data, a typical high-dimensional multi-classification data set with imbalanced categories. The results show that the weighted sparse DWD model presented in this paper achieves the best classification accuracy and can screen out the important genes in the data set.
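The abstract does not give the loss function or the weighting formula, so the following is only a minimal sketch of how a proportionally weighted DWD loss with an elastic-net penalty might be written. It assumes the q = 1 generalized DWD loss from the DWD literature, and it assumes inverse-class-frequency weights; the function names (`dwd_loss`, `proportional_weights`, `elastic_net`, `weighted_objective`) and the exact weighting rule are hypothetical and are not taken from the thesis.

```python
from collections import Counter


def dwd_loss(u):
    """Generalized DWD loss with q = 1 (assumed form): 1 - u if u <= 1/2, else 1 / (4u)."""
    return 1.0 - u if u <= 0.5 else 1.0 / (4.0 * u)


def proportional_weights(labels):
    """Hypothetical weighting: inverse class frequency, scaled so the
    per-sample weights sum to the number of samples."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * m) for c, m in counts.items()}


def elastic_net(beta, lam1, lam2):
    """Elastic-net penalty: lam1 * ||beta||_1 + (lam2 / 2) * ||beta||_2^2."""
    l1 = sum(abs(b) for b in beta)
    l2 = sum(b * b for b in beta)
    return lam1 * l1 + 0.5 * lam2 * l2


def weighted_objective(margins, labels, beta, lam1, lam2):
    """Average class-weighted DWD loss over the samples, plus the penalty.

    `margins` holds one functional margin per sample (assumed precomputed).
    """
    w = proportional_weights(labels)
    data_term = sum(w[y] * dwd_loss(u) for u, y in zip(margins, labels))
    return data_term / len(labels) + elastic_net(beta, lam1, lam2)
```

With three samples in one class and one in another, the minority sample receives weight 2.0 and each majority sample 2/3, so a misclassified minority sample contributes more to the objective than a majority one would.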
Keywords/Search Tags:Multi-classification Problems, Sparse DWD Model, High-dimensional Data, Imbalanced Classification Problems