| A client’s water-consumption type is an important basis on which water supply companies charge for water usage. As a key factor in pricing, the water-usage label plays an extremely important role in the user’s declaration process, the operation of the water supply company, and the protection of the company’s profit. However, with the reform of the water supply industry and the growth of the user base, problems such as incorrect water-usage labels and the inability to update labels automatically according to a user’s actual type of water usage have become increasingly serious. Water supply companies have a large number of customers, and some of them change addresses frequently, which makes it difficult to keep water-usage labels up to date across such large amounts of data. This causes substantial losses for water supply companies and also costs clients considerable time when updating labels for different water-usage types. To address problems such as the limited diversity of water-usage data and the high cost of manually verifying water-usage labels, this thesis builds on clustering features and active learning, starting from the real data of the water supply company. It uses clients’ address information, historical water-usage data, and latitude and longitude positioning information, among other sources, and finally applies fuzzy labels for label cleaning, so as to provide decision support for users of water supply companies to update their water-usage types and modify the corresponding labels. First, this thesis collects historical water-consumption data for clients in Xinzhou District, Shangrao City, Jiangxi Province, and classifies the real data according to the "Water Classification Standard" and the three-level water-usage classification of local water supply companies. Second, outliers in the original water-consumption data are handled using the key thresholds of the box-plot method. The original data are transposed in the database, and a statistical feature model of the water users is generated through feature engineering on the historical water-consumption data. Then, a clustering algorithm is applied to the clients’ latitude and longitude coordinates to generate clustering features, and the cluster labels predicted by the algorithm are mapped into numerical features using one-hot encoding, thereby constructing clusters of users. Finally, on the verified data set, the traditional active learning method and CFAL, the active learning method based on clustering features, are compared under different classifiers and sample selection strategies to realize label cleaning for users. Experimental results show that CFAL can significantly reduce the cost of manual sampling and verification of water-type labels. Its performance in finding mislabeled samples improves by 8.2%, and the Micro F1 and Macro F1 scores of the classification model improve by up to 8.7% and 1.8%, respectively. |
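
As an illustration of the box-plot outlier step mentioned above, the following is a minimal sketch and not the thesis's actual implementation. It assumes the conventional fence multiplier of 1.5 and assumes that outliers are clipped to the fences rather than dropped; the abstract specifies neither. The function names and the sample readings are hypothetical.

```python
from statistics import quantiles


def boxplot_fences(values, k=1.5):
    """Return the (lower, upper) box-plot fences Q1 - k*IQR and Q3 + k*IQR.

    k=1.5 is the conventional choice and is an assumption here.
    """
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def clip_outliers(values, k=1.5):
    """Clip readings outside the fences to the nearest fence (assumed policy)."""
    lower, upper = boxplot_fences(values, k)
    return [min(max(v, lower), upper) for v in values]


# Hypothetical monthly readings for one client; 250 is an obvious outlier.
monthly_usage = [12, 14, 13, 15, 11, 16, 14, 250]
lower, upper = boxplot_fences(monthly_usage)
cleaned = clip_outliers(monthly_usage)
```

Clipping keeps every client's time series at the same length, which is convenient when statistical features are later computed per client; dropping or imputing the flagged readings would be equally plausible alternatives.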