
Research On Self-training Classification Algorithm With Data Editing

Posted on: 2024-03-03  Degree: Master  Type: Thesis
Country: China  Candidate: B Li  Full Text: PDF
GTID: 2568307061479454  Subject: Management Science and Engineering
Abstract/Summary:
With the development of the social economy, datasets are growing ever larger, yet only a small fraction of the data is labeled, and data annotation is time-consuming and expensive. Semi-supervised classification algorithms can learn from a small number of labeled samples together with a large number of unlabeled samples. Self-training, a classic semi-supervised learning framework, has become a research hotspot, but its performance depends mainly on the selection of high-confidence sample points: once noisy samples enter the iterative process, classification performance degrades sharply. To handle noisy or mislabeled samples, researchers have proposed many semi-supervised classification algorithms based on data editing. However, self-training algorithms typically use Euclidean distance to measure the distance between samples, and the time complexity of most editing algorithms is no less than O(n²), making them unsuitable for large-scale, high-dimensional datasets.

In summary, existing semi-supervised self-training algorithms have two problems: first, they lack a mechanism for handling noisy samples and incur high time complexity when selecting high-confidence sample points; second, Euclidean distance is prone to the curse of dimensionality on high-dimensional datasets. To address these two problems, this thesis proposes two algorithms.

(1) A semi-supervised self-training algorithm with fast ball-cluster partitioning and editing (EBSA) is proposed. EBSA divides the dataset into stable regions and controversial regions. On this basis, a ball-cluster partitioning and editing algorithm is proposed to identify and edit mislabeled sample points in the stable regions, improving the quality of high-confidence sample selection. In each iteration, EBSA only needs to compute the distance between a sample point and the centers of the ball clusters, which requires little computation and runs quickly. Experimental results show that, compared with
the baseline algorithms, EBSA not only runs faster but also achieves better classification performance.

(2) A block-estimation nearest-neighbor editing self-training algorithm (MDSF) is proposed. MDSF uses a dissimilarity metric to compute the distance between samples, defines a block-estimation neighborhood relationship, and then constructs a block-estimation neighborhood graph. On this basis, a block-estimation neighborhood editing algorithm is proposed to edit the data, improving the quality of high-confidence sample selection. Because it uses a dissimilarity measure rather than Euclidean distance, the algorithm performs better on high-dimensional datasets. Extensive experimental results show that MDSF significantly outperforms comparable algorithms on high-dimensional datasets.
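The core EBSA idea, selecting high-confidence points by their distance to cluster centers rather than by pairwise comparison with every sample, can be sketched as follows. This is a minimal illustration under simplifying assumptions, not the thesis algorithm: the names `self_train` and `centroid`, the use of one centroid per class, and the fixed distance threshold are all choices made for the example, whereas the actual EBSA partitions the data into ball clusters and edits stable and controversial regions.

```python
import math

def centroid(points):
    """Mean point of a list of equal-length vectors."""
    dim = len(points[0])
    return tuple(sum(p[i] for p in points) / len(points) for i in range(dim))

def dist(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def self_train(labeled, unlabeled, threshold, max_iter=10):
    """Self-training loop with centroid-based confidence.

    labeled:   {class_label: [points]}
    unlabeled: [points]
    Each round, every unlabeled point is compared ONLY against the class
    centroids (not against all samples); points whose distance to the
    nearest centroid is below `threshold` are treated as high-confidence
    and absorbed into that class.
    """
    unlabeled = list(unlabeled)
    for _ in range(max_iter):
        centroids = {c: centroid(pts) for c, pts in labeled.items()}
        newly, rest = [], []
        for p in unlabeled:
            c, d = min(((c, dist(p, m)) for c, m in centroids.items()),
                       key=lambda t: t[1])
            (newly if d < threshold else rest).append((c, p))
        if not newly:          # no confident points left: stop iterating
            break
        for c, p in newly:
            labeled[c].append(p)
        unlabeled = [p for _, p in rest]
    return labeled, unlabeled
```

Because each round costs only O(n · k) for n unlabeled points and k classes, rather than the O(n²) of pairwise editing, this captures why computing distances to cluster centers keeps the per-iteration cost low.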
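The MDSF-style editing step, measuring dissimilarity with something other than raw Euclidean distance and discarding points whose neighborhood disagrees with their label, can also be illustrated with a small sketch. This is not the thesis algorithm: cosine dissimilarity stands in for the block-estimation dissimilarity metric, and a plain k-nearest-neighbor majority vote stands in for the block-estimation neighborhood graph; `neighborhood_edit` and the parameter `k` are illustrative names introduced here.

```python
import math

def cosine_dissimilarity(a, b):
    """1 - cosine similarity (vectors assumed nonzero). Depends on
    direction rather than magnitude, which often behaves better than
    Euclidean distance in high-dimensional spaces."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (na * nb)

def neighborhood_edit(samples, labels, k=3):
    """Drop every sample whose k nearest neighbours (under the
    dissimilarity metric) mostly carry a different label; such points
    are likely noisy or mislabeled."""
    kept = []
    for i, (s, y) in enumerate(zip(samples, labels)):
        neigh = sorted((j for j in range(len(samples)) if j != i),
                       key=lambda j: cosine_dissimilarity(s, samples[j]))[:k]
        agree = sum(1 for j in neigh if labels[j] == y)
        if agree * 2 >= k:     # keep if at least half the neighbours agree
            kept.append((s, y))
    return kept
```

Running the edit before each self-training round removes points whose local neighborhood contradicts their label, which is the sense in which neighborhood editing "improves the quality of high-confidence sample selection" in the abstract.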
Keywords/Search Tags:Semi-supervision, Self-training, Data editing, Classification algorithm