Classification is a research hotspot in the fields of data mining and machine learning. In traditional supervised single-label classification tasks, the training data usually needs to satisfy two assumptions: each example is annotated with only one class label, and the annotation is accurate. However, in real-world applications these two assumptions are often difficult to satisfy, giving rise to two weakly supervised learning frameworks: multi-label learning and partial label learning.

Multi-label learning arises from the ambiguity of the learning object. In this framework, each training example can be associated with multiple semantic labels simultaneously, and labels are generally correlated, which makes data analysis more complicated. Driven by practical application needs, multi-label learning has attracted widespread attention from researchers and has become a hot topic in machine learning. Multi-label learning requires all examples to be accurately annotated, but obtaining accurate annotations is often time-consuming and expensive, which leads to partial label learning. In the partial label setting, each training example is associated with multiple candidate labels, among which only one is the true label, and the learning algorithm cannot access it directly. Because the supervision is ambiguous, classification on partial label data is more difficult.

In addition, with the rapid development of information technology, the training datasets used for multi-label and partial label learning are growing explosively, typically characterized by large sample sizes and high feature dimensionality. Large-scale training data brings new challenges to the learning process: first, learning from large-scale data incurs higher storage and computational costs; second, large-scale training data often contains redundant and noisy data, which contributes very little to the
classification process and may even interfere with the learning process.Data reduction techniques,including instance reduction in the horizontal direction and feature reduction in the vertical direction,have become key technologies for addressing these challenges.The focus of this paper is on multi-label and partial-label learning,as well as data reduction for such problems.The main research includes the following aspects:1.Aiming at the problem that current multi-label instance reduction methods generally ignore label correlation,a multi-label prototype selection algorithm combining label correlation is proposed.This algorithm first measures the correlation between pairwise labels based on label co-occurrence and improves the existing”One-Versus-Rest”data partition strategy based on co-occurrence,forming a new data partition that combines pairwise label correlation.Then,the generalized condensed nearest neighbor algorithm is used to realize the prototype selection process on the new data partition and generate the instance reduction set.Finally,the instance reduction set is used to replace the original training data for subsequent learning and classification processes.Comparative experiments with multiple advanced algorithms on multiple datasets confirm the effectiveness and superiority of the proposed method.2.Research on dual reduction of multi-label data instances and features.Existing multi-label data reduction methods either only perform instance reduction in the horizontal direction or feature reduction in the vertical direction,with few methods considering both simultaneously.To address this problem,this paper proposes a dual reduction method for multi-label instances and features,MLVQ-JMR.In the instance reduction stage,the method combines prototype generation and prototype selection.First,the learning vector quantization technique is used to perform prototype generation on the multi-label data after OVR partitioning.Then,the generated prototypes are used 
as a guide to select nearest-neighbor prototypes and form the instance reduction set. In the feature reduction stage, the Jaccard similarity coefficient is introduced to extend the ReliefF algorithm, originally applicable only to single-label feature selection, to multi-label data. Experimental results show that running the subsequent learning and classification process on the dually reduced dataset achieves better results than using the original dataset.

3. A multi-label classification algorithm. Existing multi-label learning algorithms based on One-Versus-All (OVA) or OVR strategies decompose the multi-label problem into single-label problems, ignoring the correlation between labels, which may degrade classification accuracy. In addition, model induction with traditional machine learning methods requires considerable time and effort. To improve classification accuracy and learning efficiency, we propose MLCI, a multi-label classification neural network based on a variance loss function and a label-correlation initialization layer. MLCI consists of a feature extraction module and a classifier module. The feature extraction module uses a variance loss to maintain intra-class compactness during feature mapping; the classifier module uses an initialized hidden layer and a pairwise label-correlation loss to jointly exploit global and local label correlations and improve classification accuracy. Comparative experiments validate that the proposed method achieves higher classification accuracy.

4. Dimensionality reduction for partial label learning. Because the supervision in partial label learning is ambiguous, dimensionality reduction for this type of data has been little studied, and existing methods remain susceptible to interference from false positive labels. To alleviate the negative impact of false positive labels, a dimensionality reduction algorithm called SDIMR is
proposed, which utilizes semantic difference information and manifold regularization. In brief, the algorithm consists of three stages: first, a graph model is constructed to describe the topological structure of the feature space, and semantic difference information among labels is introduced to reduce the impact of false positive labels; second, manifold regularization incorporating the semantic difference information is used to preserve the local manifold structure of the data during dimensionality reduction; finally, the k-nearest neighbor method is employed for disambiguation, and dimensionality reduction and disambiguation are performed iteratively to update the confidence matrix. Extensive experiments on real and artificial datasets validate the effectiveness of the proposed method.
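As a minimal sketch of the co-occurrence-based pairwise label correlation underlying the first contribution: the exact statistic used in the thesis is not stated here, so this illustration assumes a binary label matrix and measures each label pair by the Jaccard similarity of their co-occurrence counts (the function name and formula are assumptions for illustration only).

```python
import numpy as np

def label_cooccurrence_correlation(Y):
    """Pairwise label correlation from co-occurrence counts.

    Y: binary label matrix of shape (n_samples, n_labels);
       Y[i, j] = 1 iff label j is relevant to example i.
    Returns an (n_labels, n_labels) matrix whose (j, k) entry is the
    Jaccard similarity of labels j and k: the fraction of examples
    carrying either label that carry both.  (Illustrative choice; the
    thesis may use a different co-occurrence statistic.)
    """
    Y = np.asarray(Y, dtype=float)
    co = Y.T @ Y                       # co[j, k] = #examples with both j and k
    counts = np.diag(co)               # #examples carrying each label
    union = counts[:, None] + counts[None, :] - co
    with np.errstate(divide="ignore", invalid="ignore"):
        corr = np.where(union > 0, co / union, 0.0)
    return corr

# Toy example: labels 0 and 1 always co-occur, label 2 never joins them.
Y = np.array([[1, 1, 0],
              [1, 1, 0],
              [0, 0, 1]])
C = label_cooccurrence_correlation(Y)
```

Such a correlation matrix can then drive the improved OVR partition: label pairs with high co-occurrence scores are grouped so that the subsequent prototype selection sees correlated labels together rather than in isolation.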