With the continuous development of the Internet and information technology, data collection has become increasingly convenient, resulting in a sharp rise in data volume. How to mine more valuable information from massive data has become an important problem in pattern recognition, data mining, machine learning and other fields. Feature selection can effectively reduce the feature dimension, ease problem analysis, and alleviate the problems caused by the curse of dimensionality. When a data set is imbalanced, the result of feature selection tends to favor the majority samples while ignoring the information contained in the minority samples, so designing an effective feature selection method for imbalanced data is becoming increasingly important. In the pixel classification task for sandstone CT images in oil fields, pore pixels must be distinguished from non-pore pixels. The number of pore pixels is much smaller than the number of non-pore pixels, which makes this a typical imbalanced data set that traditional machine learning methods cannot process effectively. A targeted feature selection method can be designed to screen out feature subsets that effectively distinguish pore from non-pore pixels, reduce the number of pore pixel features, and improve the generalization accuracy of the model. This paper focuses on feature selection algorithms for imbalanced data and applies the proposed algorithms to the pixel classification task for sandstone CT images in order to screen feature subsets with high classification performance. The main research contents are as follows:

(1) A feature selection algorithm called SWAFS, based on sample coefficients and AUC, is proposed. The traditional ReliefF algorithm cannot deal effectively with imbalanced data, is easily disturbed by noisy data, and does not rank effective features prominently. By integrating the class distribution information into the feature weight calculation, the feature weight update pays more attention to the information of minority samples. A noise processing mechanism reduces the effect of noisy data on the algorithm, and each single feature is evaluated by its AUC value to improve the ranking of effective features.

(2) A new feature evaluation criterion for imbalanced data is proposed and combined with the mRMR feature selection framework. Traditional mutual information cannot effectively evaluate the relationship between variables in imbalanced data sets. This paper incorporates the class distribution of the original data into the calculation of mutual information to increase the attention paid to minority samples, and the resulting evaluation criterion is combined with the mRMR feature selection framework to select feature subsets with high class relevance and low redundancy.

(3) In view of the imbalanced class distribution in the pixel classification task for sandstone CT images, the two feature selection algorithms proposed in this paper for imbalanced data are applied to reduce the number of pore pixel features and to screen feature subsets that effectively distinguish pore from non-pore pixels.

The experimental results show that the SWAFS and imRMR algorithms can effectively process imbalanced data sets. Both feature selection algorithms also perform well in identifying pore and non-pore pixels.
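The two scoring ideas summarized above can be illustrated with a small sketch. It is not the SWAFS or imRMR implementation: the function names are hypothetical, the AUC is computed in its pairwise (Mann-Whitney) form, and the class-balanced weighting `1 / (number of classes * class size)` is only one assumed way of bringing the class distribution into mutual information. The feature passed to the second function is assumed to be already discretized.

```python
from collections import Counter, defaultdict
from math import log


def single_feature_auc(values, labels, positive=1):
    """AUC obtained when one feature is used directly as the ranking score."""
    pos = [v for v, y in zip(values, labels) if y == positive]
    neg = [v for v, y in zip(values, labels) if y != positive]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    # Fraction of (positive, negative) pairs ranked correctly; ties count half.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    auc = wins / (len(pos) * len(neg))
    # A feature that ranks the classes in reverse order is equally useful.
    return max(auc, 1.0 - auc)


def class_balanced_mutual_information(values, labels):
    """Mutual information (nats) between a discrete feature and the class.

    Assumption for illustration: each sample is weighted by
    1 / (number of classes * size of its class), so every class
    contributes the same total probability mass.
    """
    class_sizes = Counter(labels)
    n_classes = len(class_sizes)
    joint = defaultdict(float)
    for v, y in zip(values, labels):
        joint[(v, y)] += 1.0 / (n_classes * class_sizes[y])
    # Marginals of the reweighted joint distribution.
    p_x = defaultdict(float)
    p_y = defaultdict(float)
    for (v, y), p in joint.items():
        p_x[v] += p
        p_y[y] += p
    return sum(p * log(p / (p_x[v] * p_y[y])) for (v, y), p in joint.items())


# Toy imbalanced data: 8 non-pore pixels (0) and 2 pore pixels (1).
labels = [0] * 8 + [1] * 2
intensity = [0.10, 0.20, 0.15, 0.30, 0.25, 0.05, 0.12, 0.22, 0.90, 0.80]
auc_score = single_feature_auc(intensity, labels)
mi_score = class_balanced_mutual_information([0] * 8 + [1] * 2, labels)
```

Because the AUC depends only on how positive and negative samples are ordered relative to each other, it is not dominated by the majority class, which is one reason it suits per-feature ranking on imbalanced data.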