Research On Feature Selection And Classification For Medical Imbalanced Data

Posted on: 2019-06-10
Degree: Doctor
Type: Dissertation
Country: China
Candidate: M Zhu
Full Text: PDF
GTID: 1364330572488001
Subject: Biomedical engineering
Abstract/Summary:
With the development of information technology and medical technology, the medical industry has entered the era of artificial intelligence. How to find valuable medical information in massive data (knowledge discovery) has become a hot topic in statistics, machine learning, and artificial intelligence. Classification is an important technique in knowledge discovery: based on the characteristics of a dataset, it builds a function or model (called a classifier) that assigns each sample in the dataset to a given class. Medical datasets are characterized by class imbalance. The class with the larger number of samples is usually called the Majority Class, and the other the Minority Class. In addition, the features of medical datasets are complex: a dataset may have dozens or even tens of thousands of features, with complex associations among them.

The classification task consists of two important processes: feature selection and classifier building. Using all the features of a medical dataset directly to construct a classifier is not only time-consuming but also reduces classification performance. A feature selection method should therefore be introduced so that the most effective features are selected to build the classifier. Furthermore, in clinical applications, correctly classifying Minority Class samples is more valuable than correctly classifying Majority Class samples, and it is the main objective of the classification task. However, current feature selection and classifier building techniques often take maximum overall classification accuracy as their criterion. When classifying imbalanced data, this makes it difficult to select an effective feature subset and build a high-performance classifier, and it limits classification performance on minority samples. Feature selection and classifier building for imbalanced medical datasets therefore still face a series of challenges. To address these problems, this thesis studies feature selection and classification algorithms for imbalanced medical data and proposes optimized algorithms that aim to improve classification performance, especially for samples in the Minority Class. The main work is as follows:

(1) To address the generalization of feature search, an Improved Self-Adaptive Niche Genetic Algorithm (INGA) was developed. It employs a self-adaptive niche-culling operation in the construction of the niche environment to improve population diversity. This addresses the problems of the traditional genetic algorithm: insufficient population diversity, slow convergence, and a tendency to fall into local optima. INGA improves convergence speed, avoids local optima, and obtains the optimal feature subset. INGA was verified on a mathematical model for early warning of severe infection/septic shock built from 497 sepsis patients, for which 9 features were selected. The results showed that the accuracy and AUC (area under the ROC curve) obtained with INGA were higher than those of APACHE-II (Acute Physiology and Chronic Health Evaluation II) and SOFA (Sequential Organ Failure Assessment).

(2) To address the reasonableness of cascade training parameter settings, the MRDC (Minimum Risk Decision Consumption) algorithm was proposed to optimize system risk decision consumption. Taking system risk decision consumption as its objective, the algorithm builds a cascade feature selection and classification algorithm. To minimize system risk decision consumption, it uses the Bayesian minimum feature selection probability and the posterior probabilities of the strong classifiers. The training parameters of the cascade training system are dynamically combined to calculate the optimal strong-classifier thresholds. This addresses the problem of mismatched cascade training parameters. Compared with DAC (Discrete AdaBoost Cascade), MRDC trains more efficiently and performs feature selection more reasonably, which improves classification performance. MRDC was verified on a fracture risk assessment model for feature selection and classification of bone mass reduction, using 243 patients and healthy individuals; 19 features were selected. The results showed that with MRDC the overall accuracy, Minority Class accuracy, Majority Class accuracy, AUC, F1, and recall were higher than with DAC.

(3) To address the reasonableness of classifier weight settings, the CWsRF (Class Weights Voting Random Forest) algorithm was proposed. Different weights per class are obtained from the empirical error of the different classifiers. This addresses the lack of appropriate class weights, which makes it hard to distinguish the majority class from the minority class. Compared with the traditional RF (Random Forest) and WRF (Weight Random Forest), CWsRF improves recognition performance for the minority class while maintaining that for the majority class. CWsRF was verified on a classification diagnosis model using 4 datasets (breast lump, breast cancer, etc.) from UCI (University of California Irvine), tested at different imbalance ratios. The results showed that with CWsRF the overall accuracy, Minority Class accuracy, Majority Class accuracy, AUC, F1, and recall were higher than with RF and WRF.

The thesis has important theoretical significance: it can help improve existing clinical feature selection and diagnostic techniques and promote the application of machine learning in the medical field. Furthermore, it will help to explore disease risk factors and provide a scientific basis for clinical diagnosis and medical research. Preliminary clinical application has verified the effectiveness of the proposed methods.
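The per-class weighted voting idea described for CWsRF can be illustrated with a minimal Python sketch. The abstract does not give the weighting formula, so everything here is an assumption made for illustration: each classifier's weight for a class is taken to be one minus its empirical error on that class (its per-class recall on a held-out set), and the helper names `per_class_recall`, `plain_majority_vote`, and `class_weighted_vote` are hypothetical, not taken from the thesis. The toy predictions stand in for the trees of a forest.

```python
from collections import Counter

def per_class_recall(y_true, y_pred, classes):
    """Fraction of the samples of each class that were predicted correctly."""
    recall = {}
    for c in classes:
        total = sum(1 for t in y_true if t == c)
        hits = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        recall[c] = hits / total if total else 0.0
    return recall

def plain_majority_vote(votes):
    """Unweighted vote: the label predicted by the most classifiers wins."""
    return Counter(votes).most_common(1)[0][0]

def class_weighted_vote(votes, weights, classes):
    """Each classifier adds its own weight for the class it votes for.

    votes   -- one predicted label per classifier
    weights -- one {class: weight} dict per classifier
    Ties are broken in favour of the class listed first in `classes`.
    """
    score = {c: 0.0 for c in classes}
    for label, w in zip(votes, weights):
        score[label] += w[label]
    return max(classes, key=lambda c: score[c])

# Toy held-out set: 8 majority-class samples (0) and 2 minority-class samples (1).
classes = [0, 1]
y_val = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]

# Held-out predictions of three hypothetical base classifiers.
pred_a = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]
pred_b = [0, 0, 0, 1, 1, 1, 1, 1, 0, 1]
pred_c = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]

# ASSUMPTION: weight for a class = 1 - empirical error on that class.
weights = [per_class_recall(y_val, p, classes) for p in (pred_a, pred_b, pred_c)]

# Votes of the three classifiers on one new sample.
votes = [0, 0, 1]
unweighted = plain_majority_vote(votes)
weighted = class_weighted_vote(votes, weights, classes)
print("unweighted:", unweighted, "weighted:", weighted)
```

In this toy case the two classifiers that vote for the majority class have a poor record on it, so their combined weight is smaller than that of the single classifier voting for the minority class, and the weighted vote differs from the plain majority vote. Other choices of weight, such as per-class precision, would fit the abstract's description equally well.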
Keywords/Search Tags: Medical Diagnosis, Imbalanced Data, Feature Selection, Classifier, Ensemble Technology