Research And Application Of High-dimensional Multi-class Imbalanced Classification Algorithm Based On Decomposition Strategy

Posted on: 2022-04-03
Degree: Doctor
Type: Dissertation
Country: China
Candidate: J T Wang
GTID: 1484306311466914
Subject: Epidemiology and Health Statistics
Abstract/Summary:
Background: The classification of cancer subtypes based on high-dimensional omics data, such as genomics, metabolomics and radiomics data, is an important problem with practical implications. It plays a major role in cancer diagnosis and in early cancer prediction and prevention. Since the prevalence rates of different disease types vary greatly, these high-dimensional data are usually multi-class imbalanced data sets, in which some classes (majority classes) have significantly more samples than others (minority classes, usually the more important ones). Class imbalance may bias learning towards the majority classes, which can reduce prediction performance on the minority classes. In particular, when the dimensionality is very high, as in microarray data, the imbalance problem is aggravated for many classification algorithms. Therefore, in the fields of bioinformatics, machine learning and pattern classification, the classification of multi-class high-dimensional data remains one of the most challenging tasks. Moreover, constructing a multi-class classifier is typically more difficult than constructing a binary classifier, since the decision boundary of a multi-class classifier is much more complicated. A growing number of researchers have tried to solve multi-class imbalanced classification problems by combining a decomposition strategy with binary imbalanced classification algorithms. The decomposition strategy helps to generate simple classifiers, and many approaches based on it have been designed in recent years. However, these approaches have two major shortcomings. First, they achieve good prediction ability only when the base classifiers achieve both high discrimination and good calibration. Second, they fit the base classifiers independently and ignore that all the base classifiers influence the prediction results
together. As a result, the final prediction results may not be optimal. To address these two shortcomings of previous decomposition-based algorithms, the current study proposed two novel multi-class imbalanced classification algorithms based on the decomposition strategy and investigated their prediction abilities through simulation experiments and real data analysis. In addition, in the application research, the current study used radiomics technology to construct a radiomics database of common intracranial tumors, and used this typical multi-class imbalanced data set together with the proposed algorithm to construct a diagnosis model of common intracranial tumors. The auxiliary diagnosis model provides a new tool for the diagnosis of, and decision-making about, intracranial tumors.

Methods: In the theoretical part, the current study first proposed two novel multi-class imbalanced classification algorithms based on the decomposition strategy: the ACM algorithm and the APM algorithm. Simulation experiments were then used to explore the influence of sample size and imbalance ratio on the discrimination performance of the ACM algorithm, the APM algorithm and seven other common decomposition-based multi-class imbalance algorithms. In the simulation experiments, imbalanced three-class data were generated from a joint normal distribution under different sample sizes and imbalance ratios. The imbalance ratio was set to 1:2:3, 1:3:5, 1:5:7 and 1:7:11, respectively. The sample size of the smallest class was set to 20, 30 and 50, respectively, and the sample sizes of the other classes were set according to the sample size of the smallest class and the imbalance ratio. Finally, this study compared the discrimination performance of the ACM algorithm, the APM algorithm and the benchmark algorithms on five recognized multi-class imbalanced gene microarray data sets (TOX-171, MLL, SRBCT, Lymphoma and Breast) using three metrics: F-measure, G-mean and MAUC. In the application part, the current
study collected data on 474 patients with newly diagnosed, pathologically confirmed intracranial tumors treated at Linyi People's Hospital from 2011 to 2016. The data comprised the patients' medical records, MRI images (four sequences: T2 FLAIR, T1 enhanced sagittal, T1 enhanced coronal and T1 enhanced axial) and imaging inspection reports. A total of 336 radiomics features were extracted from the four MRI sequences using radiomics technology, and the tumor invasion location was extracted by structuring the imaging inspection reports. Based on these data, the APM algorithm proposed in this research was used to construct the auxiliary diagnosis model of common intracranial tumors.

Results: The simulation experiments showed that the ACM algorithm and the APM algorithm always achieved better discrimination performance than the benchmark algorithms under different sample sizes and imbalance ratios. When the sample size was small (N_minimum = 20), the F-measure of all algorithms decreased as the imbalance ratio increased, but the ACM and APM algorithms were less affected. Starting from an imbalance ratio of 1:3:5, the advantage of the ACM and APM algorithms over the other algorithms became increasingly evident. In particular, at an imbalance ratio of 1:5:7, the F-measures of all algorithms except ACM and APM were below 0.7. When the sample size increased (N_minimum = 50), the F-measure of all algorithms increased, and the F-measures of the ACM and APM algorithms were higher than those of the other algorithms under all imbalance ratios. Except for the two algorithms proposed in the current study, the F-measures of all the other algorithms were lower than 0.8.

For the five gene microarray data sets, the discrimination performance of the ACM algorithm and the APM algorithm was generally better than that of the other algorithms. For the TOX-171 data set, when the F-measure was used as the
evaluation metric, the APM algorithm achieved the best discrimination performance (0.845), while the ACM algorithm achieved the best discrimination performance (0.836) when the G-mean was used as the evaluation metric. For the MLL data set, the F-measure, G-mean and MAUC of the APM algorithm were 0.951, 0.951 and 0.966, respectively, higher than those of all the other algorithms. The F-measure, G-mean and MAUC of the ACM algorithm were 0.943, 0.941 and 0.960, respectively, lower only than those of the APM algorithm. For the SRBCT data set, the F-measure, G-mean and MAUC of the ACM algorithm were 0.996, 0.996 and 0.997, respectively, higher than those of all the other algorithms. The F-measure, G-mean and MAUC of the APM algorithm were 0.992, 0.990 and 0.994, respectively, lower only than those of the ACM algorithm. For the Lymphoma data set, the F-measure, G-mean and MAUC of the APM algorithm were 0.993, 0.997 and 0.997, respectively, higher than those of all the other algorithms. The F-measure, G-mean and MAUC of the ACM algorithm were 0.989, 0.995 and 0.996, respectively, lower only than those of the APM algorithm. For the Breast data set, when the F-measure or MAUC was used as the evaluation metric, the ACM algorithm achieved the best discrimination performance (0.888 and 0.926, respectively), while the APM algorithm achieved the best discrimination performance (0.890) when the G-mean was used as the evaluation metric.

In the application research, the pathological types of the 474 patients with intracranial tumors were: 260 cases of meningioma, 118 cases of diffuse astrocytic and oligodendroglial tumours, 40 cases of tumours of the cranial and paraspinal nerves, 38 cases of tumours of the sellar region, and 18 cases of mesenchymal, non-meningothelial tumours. This data set is therefore a typical multi-class imbalanced data set. A total of 336 radiomics features were extracted from the four MRI sequences. Of these, 306 (91.07%) differed significantly among the five intracranial tumor types (FDR < 0.0001). In addition, the
frequency with which the different tumor types invaded different locations in the brain also differed significantly. The auxiliary diagnosis model constructed using the radiomics features, the tumor invasion locations and the APM algorithm achieved good tumor discrimination ability (F-measure: 0.844). By tumor type, the F-measure was 0.884 for diffuse astrocytic and oligodendroglial tumours, 0.959 for meningiomas, 0.621 for mesenchymal, non-meningothelial tumours, 0.925 for tumours of the cranial and paraspinal nerves, and 0.886 for tumours of the sellar region.

Conclusions: The ACM algorithm proposed in this study reduces the requirements on the calibration of the base classifiers by adaptively adjusting the codewords of the classes; the APM algorithm optimizes all the base classifiers simultaneously according to the performance of the final prediction result, thereby improving the discrimination performance of the whole decomposition framework. Comparison on the simulation data and the five microarray data sets showed that the discrimination performance of the ACM algorithm and the APM algorithm was better than that of the other algorithms. In the application research, the current study used radiomics techniques to analyze medical images quantitatively, and the results showed that radiomics features have a strong ability to distinguish common intracranial tumor types. The location of tumor invasion also has a strong ability to discriminate intracranial tumors. The auxiliary diagnosis of intracranial tumors is a typical multi-class imbalanced classification problem. The APM algorithm proposed in this research was used to construct auxiliary diagnosis models of common intracranial tumors, which achieved good discrimination performance, superior to that of the other algorithms. The diagnosis model could help reduce the complexity and workload of diagnosis, and provides a reference for optimizing surgical plans before craniotomies.
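
The simulation design and two of the evaluation metrics described in the Methods can be illustrated with a short sketch. It is not the author's code, and the abstract does not state how the multi-class F-measure and G-mean were averaged. The sketch assumes the common definitions: the F-measure as the macro-average of per-class F1 scores, and the G-mean as the geometric mean of per-class recalls. The function names `class_sizes`, `macro_f_measure` and `g_mean` are hypothetical, and MAUC is omitted because it needs predicted scores rather than a confusion matrix.

```python
import math
from typing import List, Sequence, Tuple


def class_sizes(ratio: Sequence[int], n_min: int) -> List[int]:
    """Scale an imbalance ratio so its smallest part equals n_min samples."""
    smallest = min(ratio)
    # Round in case n_min * r is not an exact multiple of the smallest part.
    return [round(n_min * r / smallest) for r in ratio]


def precision_recall(
    confusion: Sequence[Sequence[int]],
) -> Tuple[List[float], List[float]]:
    """Per-class precision and recall.

    confusion[i][j] counts samples of true class i predicted as class j.
    """
    k = len(confusion)
    precision, recall = [], []
    for c in range(k):
        tp = confusion[c][c]
        predicted_c = sum(confusion[i][c] for i in range(k))
        actual_c = sum(confusion[c])
        precision.append(tp / predicted_c if predicted_c else 0.0)
        recall.append(tp / actual_c if actual_c else 0.0)
    return precision, recall


def macro_f_measure(confusion: Sequence[Sequence[int]]) -> float:
    """Assumed definition: unweighted mean of the per-class F1 scores."""
    precision, recall = precision_recall(confusion)
    f_values = [
        2 * p * r / (p + r) if (p + r) else 0.0
        for p, r in zip(precision, recall)
    ]
    return sum(f_values) / len(f_values)


def g_mean(confusion: Sequence[Sequence[int]]) -> float:
    """Assumed definition: geometric mean of the per-class recalls."""
    _, recall = precision_recall(confusion)
    return math.prod(recall) ** (1.0 / len(recall))


if __name__ == "__main__":
    # Smallest class of 20 samples at an imbalance ratio of 1:2:3.
    sizes = class_sizes((1, 2, 3), 20)
    print(sizes)  # [20, 40, 60]

    # Hypothetical classifier that sends half of the minority class
    # to the second class and classifies everything else correctly.
    cm = [[10, 10, 0], [0, 40, 0], [0, 0, 60]]
    print(macro_f_measure(cm), g_mean(cm))
```

In the hypothetical confusion matrix, 110 of 120 samples are classified correctly, yet the G-mean falls well below that accuracy because only half of the minority class is recovered. This is why metrics of this kind, rather than overall accuracy, are used to compare algorithms on imbalanced data.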
Keywords/Search Tags:multi-class classification, imbalance, decomposition, radiomics, diagnosis model