Data in the real world often have imbalanced distributions, and in some practical applications the imbalance is extreme. In these problems, the minority samples are the focus of attention. Research on the classification of extremely imbalanced data falls mainly into two categories: data-level and algorithm-level methods. The former is further divided into sampling, feature, and cost approaches, while the latter usually adopts ensemble methods. This paper aims to find the best combination strategy among these methods based on a greedy algorithm, so as to improve classification performance. It is observed that deep forest produces the best results on classification tasks with extremely imbalanced data. To further enhance its performance, this paper studies how to improve deep forest from the perspectives of distribution, feature, and cost. The main work is summarized as follows:

(1) From the perspective of data, resampling is an effective way to handle the classification of imbalanced data. This paper combines mixed resampling methods with deep forest: an over-sampling method adds minority samples, while an under-sampling method eliminates noise samples, thereby increasing the effective information carried by the minority class. Experimental results show that this method improves classification performance compared with using over-sampling or under-sampling alone.

(2) From the perspective of features, an imbalanced data distribution is often accompanied by an imbalanced feature distribution, which leads to an imbalanced distribution of information at the feature level. Feature selection can be used to choose an appropriate feature subset and thus increase the discrimination between minority and majority samples. This paper proposes an improved deep forest algorithm for feature extraction and selection based on a Top-K greedy method, in which new features are extracted by common anomaly detection algorithms and appended to the raw data. Experimental results show that this method achieves the best classification performance among all method combinations.

(3) From the perspective of cost, cost-sensitive learning is a common approach to classifying imbalanced data. Deep forest adopts a cascade structure in which each layer consists of multiple base classifiers (decision-tree forests are usually chosen). This paper combines direct cost-sensitive learning with deep forest and proposes an improved deep forest algorithm based on cost-sensitive learning. Experimental results show that deep forest with a cost-sensitive factor outperforms the other cost-sensitive methods on extremely imbalanced data.
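The mixed resampling idea in (1) can be sketched in a few lines. This is a minimal toy version, not the thesis implementation: the "noise" criterion (a majority sample whose nearest neighbour is a minority sample, Tomek-link style) and duplication-based over-sampling are simplifying assumptions.

```python
import numpy as np

def mixed_resample(X, y, minority=1, seed=0):
    """Toy mixed resampling: under-sample noisy majority points, then
    over-sample the minority class by duplication until balanced."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)

    # Under-sampling: drop majority samples whose nearest neighbour is a
    # minority sample (treated here as boundary noise, Tomek-link style).
    keep = np.ones(len(y), dtype=bool)
    for i in np.where(y != minority)[0]:
        d = np.linalg.norm(X - X[i], axis=1)
        d[i] = np.inf
        if y[int(np.argmin(d))] == minority:
            keep[i] = False
    X, y = X[keep], y[keep]

    # Over-sampling: duplicate randomly chosen minority samples until the
    # classes balance (assumes the minority class is still the smaller one).
    mi = np.where(y == minority)[0]
    ma = np.where(y != minority)[0]
    extra = rng.choice(mi, size=len(ma) - len(mi), replace=True)
    return np.vstack([X, X[extra]]), np.concatenate([y, y[extra]])
```

The resampled data would then be fed to the deep forest cascade in place of the raw training set.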
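The feature-level idea in (2) can likewise be sketched, assuming a kNN-distance score as the anomaly detector and a Fisher score as the greedy ranking criterion; both are illustrative stand-ins, not the specific algorithms used in the thesis.

```python
import numpy as np

def augment_and_select(X, y, k=3, top_k=5):
    """Toy sketch: append an anomaly-score feature, then greedily keep
    the top_k features ranked by class separation."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)

    # Feature extraction: mean distance to the k nearest neighbours,
    # a simple stand-in for a "common anomaly detection algorithm".
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(D, np.inf)
    score = np.sort(D, axis=1)[:, :k].mean(axis=1)
    X_aug = np.hstack([X, score[:, None]])

    # Top-K greedy selection: rank features by a Fisher-style score
    # (between-class mean gap over within-class variance) and keep the
    # best top_k, picking one feature at a time.
    mi, ma = X_aug[y == 1], X_aug[y == 0]
    fisher = (mi.mean(0) - ma.mean(0)) ** 2 / (mi.var(0) + ma.var(0) + 1e-12)
    order = np.argsort(fisher)[::-1][:top_k]
    return X_aug[:, order], order
```

The selected subset (original features plus anomaly scores) is what the improved deep forest would train on.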
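For the cost-sensitive factor in (3), a minimal sketch is to reweight the class probabilities produced by a layer's base classifiers at decision time; the specific costs and decision rule below are illustrative assumptions, not the thesis's algorithm.

```python
import numpy as np

def cost_sensitive_predict(proba, cost_fn=10.0, cost_fp=1.0):
    """Toy cost-sensitive decision rule. proba[:, 1] is the predicted
    minority-class probability; a missed minority sample (false negative)
    costs cost_fn, a false alarm costs cost_fp. Predict the minority
    class whenever its cost-weighted probability dominates:
        proba_minority * cost_fn > proba_majority * cost_fp
    """
    proba = np.asarray(proba, dtype=float)
    return (proba[:, 1] * cost_fn > proba[:, 0] * cost_fp).astype(int)
```

With `cost_fn` well above `cost_fp`, borderline samples are pushed toward the minority class, which is the usual effect sought on extremely imbalanced data.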