| Labeling tasks are now common throughout industry. Massive datasets contain large numbers of unlabeled samples: unlabeled data is easy to collect, whereas labeled data is difficult to obtain. Collaborative training is a branch of semi-supervised learning whose main advantage is that it can exploit abundant unlabeled samples, assisted by a few labeled samples, to improve classifier performance. It has been widely applied in fields such as natural language processing, text classification, and image retrieval. However, collaborative training still suffers from several problems, including low accuracy, homogeneous classifiers, and low algorithmic efficiency. This paper proposes several methods for collaborative training and verifies their feasibility and effectiveness through experiments. The work is as follows: (1) To address noisy data in the initial samples of collaborative training, which reduces the accuracy of the initial classifier, a noise filtering method based on adaptive DBSCAN is proposed. The algorithm selects its optimal parameters using the silhouette coefficient and then eliminates the noise points. The results show that, compared with traditional collaborative training, collaborative training with adaptive DBSCAN denoising improves classification accuracy by 3.4% on average. (2) To address the problem that homogeneous classifiers lead to labeling errors, a difference measure based on weighted inconsistency is proposed. The algorithm introduces the idea of weighted distance and accounts for the differences caused by labeling errors in multi-class datasets. Compared with the traditional difference measure, the proposed method is shown to be effective and to improve efficiency. (3) To address the new noisy data that appears after collaborative training and degrades the algorithm's execution efficiency, a similarity measure based on a 
committee is proposed to measure confidence. The algorithm uses a Gaussian function to measure similarity from KNN distances. Then, taking the relationships between samples into account, the similarity is weighted by representativeness. Finally, the weighted similarity is combined with the learners' voting results to measure confidence. To assess the performance of the proposed algorithm, experiments on UCI and Kaggle datasets compare it with NBST and MCM. The results show that the proposed algorithm improves the accuracy of collaborative training. |
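
The noise filtering step in (1) can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract states only that DBSCAN parameters are chosen through the silhouette coefficient, so the grid search, the function names (`dbscan`, `silhouette`, `adaptive_dbscan_filter`), and the candidate grids below are assumptions made for illustration. The sketch uses only the Python standard library.

```python
import math
from itertools import product


def dbscan(points, eps, min_pts):
    """Plain DBSCAN; returns one label per point, with -1 marking noise.
    A point counts as its own neighbor when comparing against min_pts."""
    n = len(points)
    neighbors = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_pts:
            labels[i] = -1  # provisional noise; may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_pts:
                queue.extend(neighbors[j])  # expand only through core points
    return labels


def silhouette(points, labels):
    """Mean silhouette coefficient over clustered (non-noise) points."""
    clusters = {}
    for idx, lab in enumerate(labels):
        if lab != -1:
            clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        return -1.0  # undefined for fewer than two clusters; rank it last
    scores = []
    for lab, members in clusters.items():
        for i in members:
            if len(members) == 1:
                scores.append(0.0)
                continue
            # a: mean distance to the other members of the same cluster
            a = sum(math.dist(points[i], points[j])
                    for j in members if j != i) / (len(members) - 1)
            # b: mean distance to the nearest other cluster
            b = min(
                sum(math.dist(points[i], points[j]) for j in other) / len(other)
                for other_lab, other in clusters.items() if other_lab != lab
            )
            scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


def adaptive_dbscan_filter(points, eps_grid, min_pts_grid):
    """Assumed selection rule: try every (eps, min_pts) pair, keep the one
    with the highest silhouette score, and drop the points it marks as noise."""
    best = None
    for eps, min_pts in product(eps_grid, min_pts_grid):
        labels = dbscan(points, eps, min_pts)
        score = silhouette(points, labels)
        if best is None or score > best[0]:
            best = (score, eps, min_pts, labels)
    score, eps, min_pts, labels = best
    kept = [p for p, lab in zip(points, labels) if lab != -1]
    return kept, (eps, min_pts), score
```

In a collaborative training pipeline, the filtered samples returned as `kept` would then be used to train the initial classifiers. Selecting parameters by silhouette score alone can favor settings that discard many points, so the candidate grids should be chosen with the data scale in mind. With scikit-learn available, `sklearn.cluster.DBSCAN` and `sklearn.metrics.silhouette_score` could replace the two helper functions.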