| The 21st century is the era of big data, and its potential value has attracted wide attention. Revolutions and innovations in domains ranging from industry and academia to education and business are driven by big data, and machine learning, regarded as one of the indispensable techniques in data analysis, plays a crucial role in mining underlying patterns, predictable trends, significant correlations and interesting regularities. Although widely employed, machine learning rests on the strict assumption of perfect annotations, which becomes a limitation as the volume of data grows rapidly. Because perfectly annotating a large-scale dataset consumes enormous expert resources, money and time, weak annotation techniques such as crowdsourcing and semi-supervised learning are becoming the prevailing approach to dataset labelling. These annotation techniques in turn demand that models be more robust and adaptable to ubiquitous label noise. Recently, research on weakly supervised learning has attracted growing interest from experts in this domain. Two crucial topics are involved. One is to capture the complexity of the noise and model its mechanism, especially the process by which noisy labels are generated. The other is to design robust algorithms so that a model can accommodate a label-noise environment with less performance deterioration. In this paper, we study weakly supervised learning under CCN (class-conditional noise); that is, labels are weakly annotated through probabilistic flipping between classes. Combined with the recently proposed Importance Reweighting idea, investigations on both experiments and algorithms were carried out as follows: (1) Under CCN, the influence of noise on the performance of typical classifiers was analyzed from theoretical and experimental perspectives. Some interesting conclusions
from observations on UCIR can be drawn: NB (Naïve Bayes), SVM (Support Vector Machine) and Bagging tend to be more robust to CCN, while AdaBoost, KNN, etc. fluctuate sharply in performance due to the noise. Surprisingly, in some cases with lower noise levels, the performance of some classifiers may improve rather than decline. (2) Traditional risk theory in statistical machine learning was comprehensively reviewed, and we disentangled the inherent relationships among the expected risk minimization principle, the empirical risk minimization principle and the structural risk minimization principle. The potential bias of these risk estimates under CCN was further explained. The Importance Reweighting idea for bias correction was investigated in depth, and its optimality theorem provides the theoretical foundation for the subsequent research. Finally, a general framework embedding Importance Reweighting for weakly supervised learning was derived. (3) Under CCN, the noise rate matrix reflects the flipping patterns between classes and carries valuable information. To the best of our knowledge, this is the first comprehensive summary of noise rate matrix estimation methodologies. To address the estimation problem, especially in multiclass cases, a novel method called the Back-End algorithm was developed, in which the discriminative information between a large-scale noisy dataset and a small clean dataset is captured to learn the noise. Meanwhile, a state-of-the-art algorithm for binary noise matrix estimation called RP (RankPruning) was technically reviewed, and another novel algorithm called MRP (Multi-class RankPruning) was proposed for estimating the diagonal elements of the multiclass noise rate matrix. Both the Back-End algorithm and the RP algorithms performed well on the given metrics. (4) Aiming to design a novel algorithm robust to CCN, we embedded the Importance Reweighting idea into SVM
to derive IRSVM (Importance Reweighting SVM) and theoretically demonstrated the consistency between the cost-sensitive mechanism and the Importance Reweighting technique. The dual form of the Importance Reweighting embedded SVM was formulated mathematically to solve IRSVM. To generalize the model to multiclass cases, the OVR (One-vs-Rest) strategy, together with MRP, was applied to IRSVM to derive the extended model KIRSVM (K-class IRSVM). Results on both synthetic datasets and UCIR showed the effectiveness of IRSVM and KIRSVM under CCN. Further, a specific task of identifying the formation of a naval fleet was carried out. Weakly annotated datasets were generated from a combat simulation platform in certain scenarios, and the comparison results confirmed the advantage of KIRSVM in recalling examples from the minority class, a significant result for this specific military task. In conclusion, all of this research concerns the weakly supervised learning problem under CCN and the Importance Reweighting idea. Innovative algorithms were developed for both noise rate matrix estimation and robust model formulation, and the identification of naval fleet formations was solved to some extent. In addition, some personal perspectives on this domain are incorporated throughout the manuscript. In summarizing this work in the Conclusion section, we highlight challenges and open questions on this topic: exploring more complicated, probabilistically dependent label noise and learning from weakly annotated time series would be of great significance. |
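As a minimal illustration of the CCN assumption and the Importance Reweighting idea summarized above, the following sketch flips labels according to a noise rate matrix and computes a per-example weight for the binary case. It is not the implementation used in this work. The function names `flip_labels` and `importance_weight`, the row-stochastic layout of the matrix, and the weight formula (the commonly cited binary form, in which the estimated clean posterior is divided by the noisy posterior) are assumptions introduced here for illustration only.

```python
import random


def flip_labels(labels, noise_matrix, seed=0):
    """Corrupt integer class labels under class-conditional noise (CCN).

    Assumption: noise_matrix[i][j] is the probability that true class i
    is observed as class j, so every row must sum to 1.
    """
    for row in noise_matrix:
        if abs(sum(row) - 1.0) > 1e-9:
            raise ValueError("each row of the noise matrix must sum to 1")
    rng = random.Random(seed)
    classes = list(range(len(noise_matrix)))
    # The observed label depends only on the true class, not on the features.
    return [rng.choices(classes, weights=noise_matrix[y])[0] for y in labels]


def importance_weight(p_obs, observed_label, rho_pos, rho_neg):
    """Binary importance weight for one example (labels are +1 / -1).

    p_obs   : estimated P(noisy label = observed_label | x)
    rho_pos : assumed P(noisy = -1 | true = +1)
    rho_neg : assumed P(noisy = +1 | true = -1)
    """
    if rho_pos + rho_neg >= 1.0:
        raise ValueError("flip rates must satisfy rho_pos + rho_neg < 1")
    if p_obs <= 0.0:
        raise ValueError("p_obs must be positive")
    # Flip rate of the class opposite to the observed label.
    rho_opposite = rho_neg if observed_label == 1 else rho_pos
    # Clip at zero so a poor posterior estimate cannot give a negative weight.
    numerator = max(p_obs - rho_opposite, 0.0)
    return numerator / ((1.0 - rho_pos - rho_neg) * p_obs)


if __name__ == "__main__":
    clean = [0, 0, 1, 1, 1, 0]
    noisy = flip_labels(clean, [[0.8, 0.2], [0.3, 0.7]], seed=42)
    print("noisy labels:", noisy)
    print("weight:", importance_weight(0.5, +1, rho_pos=0.1, rho_neg=0.2))
```

In a reweighted classifier, such weights would multiply each example's loss term. With both flip rates set to zero every weight reduces to 1, which recovers ordinary training on clean labels.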