| With the advent of the big data era, data has become increasingly important in everyday life, and mining valuable information from large-scale data has become a focus of research in recent years. In many practical fields, massive high-dimensional datasets contain only a small amount of labeled data. Feature selection is an important data processing method in data mining, and this scarcity of labeled data poses new challenges for traditional feature selection methods. In this paper, we propose two semi-supervised feature selection algorithms for datasets that are high-dimensional and only partially labeled. The main contributions are as follows: (1) A semi-supervised feature selection algorithm based on Relief-F is proposed. By introducing the idea of semi-supervised learning into Relief-F, we redefine the nearest-neighbor search mechanism for labeled data so that it accounts for the influence of both labeled and unlabeled data on the feature selection result. In addition, a rough-set-based measure is introduced for the nearest-neighbor search; it considers not only the differences between values within a feature but also the effect of the context in which the feature is located on those distances. The effectiveness of the new algorithm is verified by comparative experiments. (2) By introducing a granulation mechanism for symbolic data, a class label propagation algorithm for symbolic data is designed, and a semi-supervised feature selection algorithm based on this granulation mechanism is proposed. The algorithm draws on the idea of data granulation: it divides the dataset into multiple data granules by sequentially selecting the features with maximum complementary information entropy, and it labels the unlabeled data objects by analyzing relationships between the granules, such as the distance between them. On this basis, a new feature selection algorithm that can effectively handle datasets with few labels is proposed, and its effectiveness is verified through experiments. (3) A feature selection analysis system is designed. The system mainly comprises data import, feature selection, data collection, and other modules, and it can also compare accuracy before and after feature selection. The system allows researchers to select features from relevant datasets efficiently and conveniently. |
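
The granulation step in contribution (2) can be illustrated with a minimal sketch. It assumes the complementary entropy commonly used in rough set theory, E = Σ (|X_i|/|U|)(1 − |X_i|/|U|) over the equivalence classes X_i that a feature induces on the universe U; the abstract does not give the exact definition or selection procedure, so the helper names `complementary_entropy` and `granulate`, the parameter `k`, and the toy data are all hypothetical.

```python
from collections import Counter, defaultdict


def complementary_entropy(values):
    """Complementary entropy of the partition induced by one symbolic feature.

    Assumed definition: sum over value classes of p * (1 - p),
    where p is the fraction of objects taking that value.
    """
    n = len(values)
    return sum((c / n) * (1 - c / n) for c in Counter(values).values())


def granulate(rows, k):
    """Hypothetical helper, not the paper's algorithm.

    Picks the k features with the largest complementary entropy and groups
    objects that agree on all of them into one data granule.
    """
    n_features = len(rows[0])
    scores = [
        complementary_entropy([row[j] for row in rows])
        for j in range(n_features)
    ]
    # Features in decreasing order of complementary entropy.
    chosen = sorted(range(n_features), key=lambda j: scores[j], reverse=True)[:k]

    grains = defaultdict(list)
    for i, row in enumerate(rows):
        grains[tuple(row[j] for j in chosen)].append(i)
    return chosen, list(grains.values())


# Toy symbolic dataset (illustrative only): 6 objects, 3 features.
rows = [
    ("a", "x", "p"),
    ("a", "y", "p"),
    ("a", "z", "p"),
    ("b", "x", "p"),
    ("b", "y", "p"),
    ("b", "z", "q"),
]

chosen, grains = granulate(rows, k=1)
print("selected feature indices:", chosen)
print("granules (object indices):", grains)
```

In this sketch the feature whose values split the objects most evenly into the most classes is chosen first, so the resulting granules are as fine as a single feature allows. The label propagation between granules described in the abstract is not reproduced here.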