| The ideal machine learning paradigm learns from "perfect" data,which is usually assumed to be a set of vectors in an appropriate amount.However,real-world data appear in undesirable forms,such as massive,streaming,online,distributed,and contaminated data.These problems are compounded by each other,which is worse.In this paper,we study the problem of massive contaminated data.First we deal with the massiveness and a natural idea is to compress the data.The key concept we adopt here is the coreset,a weighted subset of the original dataset that approximates the objective value for arbitrary model parameter.However,it has been shown that the conventional coreset guaranteeing a global approximation requires a quite large sampling complexity—the coreset for a simple linear model like logistic regression takes at least Ω(n/polylog(n))samples(Mai et al.,2021).Recently researchers have proposed the "local" coreset,which approximates the objective value merely for the parameters inside a norm ball.It has been shown that the sampling complexity of a local coreset can be O(polylog(n)).So we also use the concept of local coreset in this paper.We then consider the issue of contamination.Specifically,we view this problem from a distributional perspective:the Wasserstein distance between the contaminated distribution and the true distribution is a bounded positive number.This scenario corresponds to the phenomenon of distribution shift,which has attracted much attention recently.Training under distribution shift can be formulated as the Wasserstein distributional robust optimization(WDRO)problem.Existing methods to solve WDRO problems can be costly for large-scale data,which motivates to study the coreset for WDRO problems.Due to the intractability of WDRO,the conventional coreset construction of WDRO is challenging.To remedy this issue,we exploit the strong duality of WDRO and define the "dual coreset",which is much easier to study than the conventional coreset.To construct the dual coreset,we propose a grid sampling approach that is particularly suitable for the dual formulation of WDRO.To provide a theoretical justification for our method,we prove that the dual coreset is also a qualified conventional coreset under some mild assumptions.We implement the coreset algorithm we propose and illustrate its effectiveness. |