The Research On Feature Selection For Cost-Sensitive Multi-Label Data

Posted on: 2020-05-13
Degree: Master
Type: Thesis
Country: China
Candidate: Q Huang
GTID: 2428330578470903
Subject: Computer Science and Technology
Abstract/Summary:
With the development of big data, data keeps growing in size, takes more varied forms, and carries richer semantics, especially in high-dimensional settings. In traditional single-label classification, an instance is associated with only one label, which does not suit multi-label situations. Better multi-label classification is much needed to describe a wide variety of data sources, and the analysis and mining of multi-label data has become a hot topic in machine learning and data mining. The curse of dimensionality severely degrades the performance of multi-label classifiers, so one of the most urgent problems is feature selection (feature reduction) for multi-label learning. At present, most feature selection algorithms assume a complete data set. However, numerical data is increasingly common in many application fields, and data is often incomplete for reasons such as diagnostic cost and privacy protection. In addition, acquiring data is costly. Research on feature selection models and algorithms for cost-sensitive multi-label data therefore has both theoretical and practical significance. The contributions of this thesis are summarized as follows.

First, a feature selection algorithm for multi-label incomplete data is proposed. The algorithm applies the neighborhood rough set model to the feature selection of multi-label incomplete data, computes the neighborhood granularity of multi-label incomplete data according to a tolerance neighborhood threshold, and defines a feature importance measure for multi-label incomplete data based on that neighborhood granularity. This importance criterion is then used to design the feature selection algorithm, which has the advantage of processing multi-label incomplete data effectively. Experimental results on four real data sets verify the effectiveness and feasibility of the algorithm.

Second, from the viewpoint of cost-sensitive learning, a cost-sensitive feature selection algorithm for multi-label incomplete data is designed. The algorithm uses the rough set model to calculate the neighborhood granularity of multi-label incomplete data and uses two distribution functions, the uniform distribution and the normal distribution, to generate the cost of each feature. Based on the core (kernel) features, a new method is designed to calculate feature importance under test cost. The algorithm addresses the feature cost problem for incomplete data, and experiments show that it achieves good classification performance.

Third, to analyze the uncertainty of multi-label data, information entropy is used to measure the correlation between features and labels. A feature importance criterion based on test cost is redefined, together with a criterion based on the feature importance under the normal distribution and the standard deviation of the feature cost. A reasonable threshold selection method is set to eliminate redundant and irrelevant features and to obtain a feature subset with the lowest cost. The effectiveness and feasibility of this algorithm are further verified by experimental results on Mulan data sets for multi-label learning.

Finally, to demonstrate the feasibility and effectiveness of the three proposed methods, extensive experiments are carried out on data sets downloaded from Mulan.
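The abstract does not give the formulas behind these algorithms, so the following Python sketch only illustrates the general idea under stated assumptions, and is not the thesis's method. It assumes that missing values are stored as None and are treated as compatible with any value (a tolerance relation), that a sample counts as consistent when every sample in its neighborhood carries the same label set, and that test cost enters as a simple linear penalty. The names tolerant_neighbors, dependency, sample_costs, greedy_select and the parameters delta and lam are hypothetical.

```python
import random

MISSING = None  # assumed marker for a missing feature value


def tolerant_neighbors(data, i, features, delta):
    """Indices of samples within delta of sample i on every feature in
    `features`; a missing value on either side is treated as a match."""
    result = []
    for j, row in enumerate(data):
        close = True
        for f in features:
            a, b = data[i][f], row[f]
            if a is MISSING or b is MISSING:
                continue  # tolerance: unknown values are compatible with anything
            if abs(a - b) > delta:
                close = False
                break
        if close:
            result.append(j)
    return result


def dependency(data, labels, features, delta):
    """Fraction of samples whose whole neighborhood shares their label set
    (an assumed, simplified multi-label consistency measure)."""
    if not features:
        return 0.0
    consistent = 0
    for i in range(len(data)):
        neighbors = tolerant_neighbors(data, i, features, delta)
        if all(labels[j] == labels[i] for j in neighbors):
            consistent += 1
    return consistent / len(data)


def sample_costs(n_features, dist="uniform", low=1.0, high=10.0, seed=0):
    """Draw one synthetic test cost per feature from a uniform or a
    (clipped) normal distribution over [low, high]."""
    rng = random.Random(seed)
    if dist == "uniform":
        return [rng.uniform(low, high) for _ in range(n_features)]
    mu, sigma = (low + high) / 2, (high - low) / 6
    return [min(high, max(low, rng.gauss(mu, sigma))) for _ in range(n_features)]


def greedy_select(data, labels, costs, delta, lam=0.01):
    """Forward selection: repeatedly add the feature with the best
    dependency gain minus lam * cost, stopping when no score is positive."""
    selected, remaining, current = [], list(range(len(costs))), 0.0
    while remaining:
        best, best_score, best_dep = None, 0.0, current
        for f in remaining:
            dep = dependency(data, labels, selected + [f], delta)
            score = (dep - current) - lam * costs[f]
            if score > best_score:
                best, best_score, best_dep = f, score, dep
        if best is None:
            break
        selected.append(best)
        remaining.remove(best)
        current = best_dep
    return selected


# Toy data: 4 samples, 3 features, 2 labels per sample.
data = [
    [0.1, 0.5, None],
    [0.2, 0.5, 0.9],
    [0.8, 0.5, 0.1],
    [0.9, None, 0.2],
]
labels = [(1, 0), (1, 0), (0, 1), (0, 1)]
costs = [5.0, 1.0, 2.0]  # hand-picked here; sample_costs() could generate them
print(greedy_select(data, labels, costs, delta=0.2))
```

On this toy data the first feature alone separates the two label sets, so it is kept despite its higher cost, while the constant second feature and the partly missing third feature add no dependency gain and are discarded. A larger lam would shift the trade-off toward cheaper features.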
Keywords/Search Tags:Multi-label, Rough set, Feature selection, Cost-sensitive