
Research on Approximate Computing for Big Data Based on Sampling and Coresets

Posted on: 2023-04-26
Degree: Doctor
Type: Dissertation
Country: China
Candidate: H. J. Guo
Full Text: PDF
GTID: 1528307376482414
Subject: Computer software and theory
Abstract
With the rapid development of information technology and increasingly mature means of acquiring data, the volume of data faced by every industry has grown to the TB, PB, and even EB level. Such massive data is of great value to people's production and daily life, but how to process big data and explore its value is a major challenge. For big data computing problems, the time cost of exact computation is often high; with big data, even the time cost of scanning the whole dataset once can be intolerable. The definitions and methods for tractable problems in traditional complexity theory are difficult to apply to big data processing. In many big data application scenarios, the optimal solution is hard to compute or unnecessary, so approximate computation is an important means of processing big data that efficiently returns approximate results meeting the given requirements. Approximate computation over big data based on small data is one of the key methods. This dissertation studies approximate computation for big data through the computation and application of two kinds of small data: samples and coresets. Approximation algorithms are designed for the studied problems, and the time complexities and the approximation ratios or error bounds of the proposed algorithms are analyzed. The research issues and contributions of this dissertation are summarized as follows.

Firstly, this dissertation studies data source selection via sampling-based information coverage estimation. In the era of big data, with the rapid growth of useful information, datasets can be collected from a variety of sources. Because the number of data sources is large, the information redundancy among them is high, and their data quality varies, data sources must be selected before integrating big data. Existing work does not comprehensively consider the information coverage of data sources, the overlap between sources, data quality, and query cost when selecting data sources, and there is no purely sublinear approximation algorithm for data source selection. Taking all of these measures into account, this dissertation formalizes data source selection as two problems: gain maximization under a limit on the number of data sources, and gain maximization under a cost limit. Owing to the hardness of these problems, two approximation algorithms are designed to solve them, and their approximation ratios and complexities are analyzed. To further improve efficiency, a sampling-based randomized approximation algorithm is proposed to estimate the information coverage of data sources, which makes the time complexities of the proposed algorithms sublinear in the data size (a sketch of such an estimator follows below). Experimental results on both real-world and synthetic datasets show that our methods select sources providing the largest gain, and that the randomized approximation algorithm greatly improves efficiency while sacrificing very little gain.
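To make the sampling-based coverage estimation concrete, the following is a minimal Python sketch, assuming each data source is modeled as a set of item identifiers and that gain is measured by coverage alone (the dissertation's gain also accounts for overlap, quality, and cost). The names estimate_coverage and greedy_select are illustrative, not taken from the dissertation.

```python
import random

def estimate_coverage(selected, universe, sample_size, rng=random):
    """Estimate the fraction of `universe` covered by the union of the
    selected sources by probing only `sample_size` uniformly sampled
    items, instead of scanning the full universe."""
    if not selected:
        return 0.0
    sample = [rng.choice(universe) for _ in range(sample_size)]
    hits = sum(1 for item in sample if any(item in s for s in selected))
    return hits / sample_size

def greedy_select(sources, universe, k, sample_size=1000):
    """Greedily pick at most k sources, each round adding the source with
    the largest estimated marginal coverage gain."""
    chosen = []
    for _ in range(k):
        base = estimate_coverage(chosen, universe, sample_size)
        best, best_gain = None, 0.0
        for s in sources:
            if any(s is c for c in chosen):
                continue
            gain = estimate_coverage(chosen + [s], universe, sample_size) - base
            if gain > best_gain:
                best, best_gain = s, gain
        if best is None:  # no remaining source improves estimated coverage
            break
        chosen.append(best)
    return chosen

# Toy usage: three overlapping sources over a universe of six items.
sources = [{1, 2, 3}, {3, 4}, {5, 6}]
universe = [1, 2, 3, 4, 5, 6]
print(greedy_select(sources, universe, k=2, sample_size=500))
```

Because only sampled items are inspected, each gain estimate costs a number of set probes proportional to the sample size rather than to the universe size, which is the intuition behind the sublinear running time.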
Secondly, this dissertation studies sampling- and partitioning-based approximate range top-k queries. An approximate top-k query returns a list of k tuples with approximately the largest scores with respect to a user-given query. However, existing algorithms restrict the class of ranking functions to linear monotone functions and do not consider answering approximate top-k queries over a subset of the data. This dissertation proposes PSATop-k, a novel algorithm that combines partitioning and sampling techniques to answer approximate range top-k queries efficiently. PSATop-k supports queries with selection conditions and arbitrary ranking functions. It first determines the sampling size that meets the accuracy requirement, then draws sufficiently many random tuples to compute the result while accessing only a subset of the partitioned data (the sampling step is sketched below). Experimental results on both real-world and synthetic datasets show that PSATop-k efficiently answers approximate range top-k queries within the given error bound.

Thirdly, this dissertation studies the minimum ε-kernel computation problem. A kernel is a data summary carefully extracted from a large dataset: given a problem, the solution obtained from the kernel is an approximate version of the solution obtained from the whole dataset, with a provable approximation error. Kernels are widely used in geometric optimization, clustering, approximate query processing, and other areas, yet the minimum ε-kernel (MK) problem had not been studied. This dissertation first formalizes the MK problem and analyzes its complexity. Because the MK problem is NP-hard in three or more dimensions, a pseudo-sublinear-time approximation algorithm with an O(d log(1/ε)) approximation ratio is developed to solve it (the ε-kernel construction is sketched below). Then, for the open problem posed by Wang et al. of whether the minimum ε-coreset (MC) problem and the MK problem can be reduced to each other, we prove that the MC and MK problems are Turing-reducible to each other. Finally, we discuss updating the MK under insertion and deletion operations. Experimental results on both real-world and synthetic datasets show that the proposed algorithm computes the MK efficiently.

Finally, this dissertation studies coreset-based diversity- and freshness-aware regret minimization set queries. A regret minimization set query finds a small representative subset of a large dataset to support multi-criteria decision making. To address the shortcoming that the current definition of the regret minimization set guarantees only one result under one utility function, this dissertation defines the notions of the strong regret ratio and the strong regret minimization set, which guarantee k results under one utility function. Based on the new definitions, we study the minimum regret set problem (MS) and the maximum diversity and freshness problem (MSDF), and prove that both are NP-hard. For the MS problem, this dissertation transforms it into a set multi-cover problem and proposes an algorithm with an O(d log(1/δ)) approximation ratio. For the MSDF problem, this dissertation transforms it into a max-sum diversification problem (the greedy reduction target is sketched below); approximation algorithms with approximation ratios of 2/cos δ, 2, and 2 are designed for diversity measured under the Euclidean, Manhattan, and Chebyshev distances, respectively. Experimental results on both real-world and synthetic datasets show that the algorithm for the MS problem obtains the smallest strong regret minimization set, and the algorithm for the MSDF problem obtains the most diverse and fresh set.
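To make the sampling step of PSATop-k concrete, the following is a minimal Python sketch, assuming uniform sampling over the tuples that satisfy the selection condition. The Chernoff-style sample size is a deliberately loose, illustrative bound rather than the one derived in the dissertation, and the partitioning that lets PSATop-k access only relevant partitions is omitted; the function names are hypothetical.

```python
import heapq
import math
import random

def sample_size(k, eps, delta):
    """Loose Chernoff-style bound: a uniform sample of this size contains,
    with probability at least 1 - delta, at least k tuples from the true
    top eps*n. Illustrative only; the dissertation derives its own bound."""
    return math.ceil(8 * (k + math.log(1 / delta)) / eps)

def approx_top_k(tuples, k, score, eps=0.01, delta=0.05):
    """Return k tuples whose true ranks are, with probability >= 1 - delta,
    all within the top eps fraction of `tuples` under `score`."""
    m = min(len(tuples), sample_size(k, eps, delta))
    sample = random.sample(tuples, m)
    return heapq.nlargest(k, sample, key=score)

# Toy usage with an arbitrary (non-linear) ranking function.
data = [(random.random(), random.random()) for _ in range(100000)]
print(approx_top_k(data, k=5, score=lambda t: t[0] * t[1] ** 2))
```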
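For the minimum ε-kernel problem, the ε-kernel notion itself can be illustrated with the classic directions-based construction: keeping the extreme points along sufficiently many directions approximately preserves every directional width. The 2-D sketch below, under the assumption of a fat (well-rounded) point set and with an arbitrary constant in the direction count, shows what an ε-kernel is; it does not attempt to minimize the kernel size, which is the MK problem actually studied here.

```python
import math
import random

def eps_kernel_2d(points, eps):
    """Keep the extreme points of `points` along evenly spaced unit
    directions; for fat point sets this approximately preserves every
    directional width, the defining property of an eps-kernel. The
    O(1/sqrt(eps)) direction count follows the classic construction of
    Agarwal et al.; the constant 4 is an arbitrary illustrative pick."""
    m = max(4, math.ceil(4 / math.sqrt(eps)))
    kernel = set()
    for i in range(m):
        ux, uy = math.cos(math.pi * i / m), math.sin(math.pi * i / m)
        # Taking both max and min covers the opposite direction as well.
        kernel.add(max(points, key=lambda p: p[0] * ux + p[1] * uy))
        kernel.add(min(points, key=lambda p: p[0] * ux + p[1] * uy))
    return kernel

# Toy usage: the kernel of 10,000 points is far smaller than the input.
pts = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(10000)]
print(len(eps_kernel_2d(pts, eps=0.01)))
```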
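For the MSDF problem, the max-sum diversification target of the reduction admits a simple greedy under any metric: seed with the farthest pair, then repeatedly add the point with the largest total distance to the chosen set. The sketch below is this generic greedy, shown to illustrate the reduction target rather than the dissertation's algorithms that achieve the 2/cos δ, 2, and 2 ratios.

```python
import itertools

def greedy_max_sum_diversify(points, k, dist):
    """Greedy max-sum diversification: start from the farthest pair, then
    repeatedly add the point maximizing its summed distance to the chosen
    set. A classic constant-factor heuristic for metric distances."""
    if k < 2 or len(points) <= k:
        return list(points)
    chosen = list(max(itertools.combinations(points, 2),
                      key=lambda pair: dist(*pair)))
    remaining = [p for p in points if p not in chosen]
    while len(chosen) < k:
        nxt = max(remaining, key=lambda p: sum(dist(p, q) for q in chosen))
        chosen.append(nxt)
        remaining.remove(nxt)
    return chosen

# Toy usage under the Manhattan distance.
manhattan = lambda p, q: abs(p[0] - q[0]) + abs(p[1] - q[1])
pts = [(0, 0), (1, 0), (0, 1), (5, 5), (6, 5), (-4, 3)]
print(greedy_max_sum_diversify(pts, k=3, dist=manhattan))
```

Swapping in a Euclidean or Chebyshev distance function changes only the `dist` argument, which is why the reduction can cover all three diversity measures with one algorithmic skeleton.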
In summary, this dissertation first studies two sampling-based approximate computation problems: data source selection via sampling-based information coverage estimation, and sampling- and partitioning-based approximate range top-k queries. Data source selection concerns data acquisition and integration, and thus provides data preparation for approximate big data computation. The dissertation then studies two coreset-based approximation problems: the minimum ε-kernel computation problem, and coreset-based diversity- and freshness-aware regret minimization set queries.
Keywords/Search Tags:big data approximate computing, sampling, coreset, data source selection, approximate range top-k queries, regret minimization set queries