
Research On Self-training Classification Algorithm With Data Editing

Posted on: 2024-03-03  Degree: Master  Type: Thesis
Country: China  Candidate: B Li  Full Text: PDF
GTID: 2568307061479454  Subject: Management Science and Engineering
Abstract/Summary:
With the development of the social economy, datasets are growing ever larger, yet only a small fraction of the data is labeled, and data annotation is time-consuming and expensive. Semi-supervised classification algorithms can learn from a small number of labeled samples together with a large number of unlabeled samples. Self-training, a classic semi-supervised learning framework, has become a research hotspot, but its performance depends mainly on the selection of high-confidence sample points: once noisy samples enter the iterative process, classification performance degrades sharply. To handle noisy or mislabeled samples, researchers have proposed many semi-supervised classification algorithms based on data editing. However, self-training algorithms typically use Euclidean distance to measure the distance between samples, and the time complexity of most editing algorithms is no less than O(n²), making them unsuitable for large-scale, high-dimensional datasets.

In summary, existing semi-supervised self-training algorithms have two problems: first, they lack a mechanism for handling noisy samples and incur high time complexity when selecting high-confidence sample points; second, Euclidean distance is prone to the curse of dimensionality on high-dimensional datasets. To address these two problems, this thesis proposes two algorithms.

(1) A semi-supervised self-training algorithm with fast ball-cluster partitioning and editing (EBSA) is proposed. EBSA divides the dataset into stable regions and controversial regions. On this basis, a ball-cluster partitioning and editing algorithm is proposed to identify and edit mislabeled sample points in the stable regions, improving the quality of high-confidence sample selection. In each iteration, EBSA only needs to compute the distance between a sample point and the centers of the ball clusters, which requires little computation and runs quickly. Experimental results show that, compared with
the baseline algorithms, EBSA not only runs faster but also achieves better classification performance.

(2) A block-estimation nearest-neighbor editing self-training algorithm (MDSF) is proposed. MDSF uses a dissimilarity metric to compute the distance between samples, defines a block-estimation neighborhood relationship, and then constructs a block-estimation neighborhood graph. On this basis, a block-estimation neighborhood editing algorithm is proposed to edit the data, improving the quality of high-confidence sample selection. Because it uses a dissimilarity measure rather than Euclidean distance, the algorithm performs better on high-dimensional datasets. Extensive experimental results show that MDSF significantly outperforms comparable algorithms on high-dimensional datasets.
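The core EBSA idea, selecting high-confidence points by their distance to cluster centers rather than by pairwise comparison with every sample, can be sketched as follows. This is a minimal illustration under simplifying assumptions, not the thesis algorithm: the names `self_train` and `centroid`, the use of one centroid per class, and the fixed distance threshold are all choices made for the example, whereas the actual EBSA partitions the data into ball clusters and edits stable and controversial regions.

```python
import math

def centroid(points):
    """Mean point of a list of equal-length vectors."""
    dim = len(points[0])
    return tuple(sum(p[i] for p in points) / len(points) for i in range(dim))

def dist(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def self_train(labeled, unlabeled, threshold, max_iter=10):
    """Self-training loop with centroid-based confidence.

    labeled:   {class_label: [points]}
    unlabeled: [points]
    Each round, every unlabeled point is compared ONLY against the class
    centroids (not against all samples); points whose distance to the
    nearest centroid is below `threshold` are treated as high-confidence
    and absorbed into that class.
    """
    unlabeled = list(unlabeled)
    for _ in range(max_iter):
        centroids = {c: centroid(pts) for c, pts in labeled.items()}
        newly, rest = [], []
        for p in unlabeled:
            c, d = min(((c, dist(p, m)) for c, m in centroids.items()),
                       key=lambda t: t[1])
            (newly if d < threshold else rest).append((c, p))
        if not newly:          # no confident points left: stop iterating
            break
        for c, p in newly:
            labeled[c].append(p)
        unlabeled = [p for _, p in rest]
    return labeled, unlabeled
```

Because each round costs only O(n · k) for n unlabeled points and k classes, rather than the O(n²) of pairwise editing, this captures why computing distances to cluster centers keeps the per-iteration cost low.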
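The MDSF-style editing step, measuring dissimilarity with something other than raw Euclidean distance and discarding points whose neighborhood disagrees with their label, can also be illustrated with a small sketch. This is not the thesis algorithm: cosine dissimilarity stands in for the block-estimation dissimilarity metric, and a plain k-nearest-neighbor majority vote stands in for the block-estimation neighborhood graph; `neighborhood_edit` and the parameter `k` are illustrative names introduced here.

```python
import math

def cosine_dissimilarity(a, b):
    """1 - cosine similarity (vectors assumed nonzero). Depends on
    direction rather than magnitude, which often behaves better than
    Euclidean distance in high-dimensional spaces."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (na * nb)

def neighborhood_edit(samples, labels, k=3):
    """Drop every sample whose k nearest neighbours (under the
    dissimilarity metric) mostly carry a different label; such points
    are likely noisy or mislabeled."""
    kept = []
    for i, (s, y) in enumerate(zip(samples, labels)):
        neigh = sorted((j for j in range(len(samples)) if j != i),
                       key=lambda j: cosine_dissimilarity(s, samples[j]))[:k]
        agree = sum(1 for j in neigh if labels[j] == y)
        if agree * 2 >= k:     # keep if at least half the neighbours agree
            kept.append((s, y))
    return kept
```

Running the edit before each self-training round removes points whose local neighborhood contradicts their label, which is the sense in which neighborhood editing "improves the quality of high-confidence sample selection" in the abstract.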
Keywords/Search Tags:Semi-supervision, Self-training, Data editing, Classification algorithm