Research Of Truth Discovery Algorithms Based On Optimization Methods

Posted on: 2021-01-31
Degree: Doctor
Type: Dissertation
Country: China
Candidate: C Ye
Full Text: PDF
GTID: 1368330614950826
Subject: Computer software and theory
Abstract/Summary:
With the rapid development of information technology, all areas have ushered in the era of big data. One big challenge in analyzing the overwhelming volume of generated data is the veracity of the data. Data, even when describing the same object or event, can come from a variety of sources such as crowd workers and social media users. However, noisy pieces of data or information are unavoidable. Given the daunting scale of data, it is unrealistic to expect humans to “label” or tell which data source is more reliable. Hence it is crucial to identify correct and trustworthy information from multiple noisy information sources, which is the task of truth discovery.

At present, truth discovery research for multi-source data mainly faces two challenges. On the structural level, it is essential to consider the different characteristics of data composition and application scenarios and to define the truth discovery problem for different settings. On the algorithm level, the truth discovery task needs to consider different levels of information conflict and design efficient algorithms that mine more valuable information using multiple clues. Existing truth discovery methods have defects on both the structural level and the algorithm level, leaving the truth discovery problem far from solved.

In this dissertation, theories, techniques, and methods from data cleaning, data mining, and natural language processing are combined to study the truth discovery problem on multi-source data. This dissertation mainly focuses on three data models: the first is multi-source isomorphic data, which has a clear entity-attribute-source structure; the second is multi-source heterogeneous data, where the entities and attributes from different sources may have various representations; the third is text data, which does not directly reflect the entity-attribute-source structure and contains many irrelevant words. On the basis of these three data models, this dissertation studies the truth discovery
problem on multi-source data in terms of four important properties: relevance, inconsistency, sparseness, and heterogeneity. The main research contents are as follows.

Firstly, for multi-source isomorphic data, a novel automatic truth discovery approach, Auto Repair, is proposed to enrich the evidence by combining the advantages of source-reliability-estimation-based truth discovery methods and functional-dependency-based data repairing methods. Functional dependencies, one of the most common types of constraints, are used to detect violations, and the source reliability is used as evidence to discover and repair the errors among these violations. The repaired results are then used in turn to estimate the source reliability. As the source reliability is unknown in advance, this is modeled as an iterative process to ensure better performance. Extensive experiments are conducted on both simulated and real-world datasets. The results demonstrate the advantages of Auto Repair, which outperforms both recent truth discovery methods and rule-based data repairing methods.

Secondly, to integrate the information of attribute relations and external knowledge on multi-source isomorphic data, a novel truth discovery technique powered by integrity constraints and source reliability is proposed. The key component of the solution is to incorporate denial constraints, an expressive type of integrity constraint, into the process of truth discovery. The task is formulated as an optimization problem, and an iterative algorithm, CTD, is developed to solve it. With this algorithm, the truth discovery result is not only supported by reliable sources but also satisfies the denial constraints. Additionally, two optimization strategies are proposed to ensure that the algorithm scales to massive numbers of constraints. Experimental results on real-world datasets demonstrate the high accuracy and scalability of CTD.

Thirdly, to tackle the information shortage on the entity level and attribute level of multi-source heterogeneous
data, pattern discovery for truth discovery is introduced and formulated as an optimization problem. Entities that share the same pattern are treated as a group, and the problem is modeled as identifying the latent groups and the representative of each group within an optimization framework. The latent groups, the group-level representatives, the source reliability, and the property weights are identified simultaneously by defining them as four sets of unknown variables. To solve this problem, an algorithm called Pattern Finder, which jointly and iteratively learns these variables, is proposed. Experimental results on simulated and real-world datasets demonstrate the advantages of the proposed method, which outperforms the state-of-the-art baselines in terms of both effectiveness and efficiency.

Finally, considering that patterns can extract multiple facts from different sentences, both pattern reliability and fact trustworthiness are taken into account in addressing the truth discovery problem on text data. The correct attribute values can then be obtained from trustworthy facts. To learn the complex relationship between pattern reliability and fact trustworthiness, this dissertation proposes a novel deep learning model using a hybrid CNN and LSTM architecture. For fact embedding, the model adopts a CNN to extract a fixed-size representation of each component of the fact, i.e., entity, attribute, and value. For pattern embedding, the pattern is represented as a semantic composition of its extracted fact representations. To de-emphasize noisy facts, the framework considers both fact trustworthiness and frequency during pattern embedding, where the features of the fact trustworthiness information are extracted by a long short-term memory (LSTM) model. To learn the pattern-fact relational dependency, the model is trained with both pattern and fact labels. Extensive experiments on three real-world datasets demonstrate that the proposed model significantly improves the quality of
the patterns and the extracted facts in the pattern-based truth discovery task.
Keywords/Search Tags: Truth discovery, source reliability, integrity constraints, optimization framework, deep learning
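
The abstract describes source reliability and truths being estimated from each other in an iterative loop. The sketch below illustrates that general idea only; it is not Auto Repair, CTD, or Pattern Finder, whose details are not given here. It assumes categorical claims stored as (source, object, value) triples, weighted voting for the truth step, and a smoothed negative-log error rate for the reliability step. The function name `discover_truths` and the toy data are hypothetical.

```python
import math
from collections import defaultdict


def discover_truths(claims, iterations=10):
    """Alternate between weighted voting and source-weight updates.

    claims: list of (source, object, value) triples (assumed format).
    Returns (truths, weights): the estimated value per object and a
    reliability weight per source.
    """
    sources = {source for source, _, _ in claims}
    weights = {source: 1.0 for source in sources}  # all sources start equal
    truths = {}

    for _ in range(iterations):
        # Truth step: each object takes the value with the largest total weight.
        votes = defaultdict(lambda: defaultdict(float))
        for source, obj, value in claims:
            votes[obj][value] += weights[source]
        truths = {obj: max(vals, key=vals.get) for obj, vals in votes.items()}

        # Reliability step: count how often each source disagrees with the truths.
        errors = defaultdict(int)
        counts = defaultdict(int)
        for source, obj, value in claims:
            counts[source] += 1
            if value != truths[obj]:
                errors[source] += 1

        # Smoothed error rate stays strictly between 0 and 1, so the
        # negative log is a finite, positive weight.
        for source in sources:
            rate = (errors[source] + 0.5) / (counts[source] + 1.0)
            weights[source] = -math.log(rate)

    return truths, weights


# Toy data: A and B agree on o1..o5; C disagrees on o4 and o5.
# Only A and C report o6, so plain voting cannot separate them.
claims = [("A", f"o{i}", "a") for i in range(1, 7)]
claims += [("B", f"o{i}", "a") for i in range(1, 6)]
claims += [("C", "o1", "a"), ("C", "o2", "a"), ("C", "o3", "a"),
           ("C", "o4", "b"), ("C", "o5", "b"), ("C", "o6", "b")]

truths, weights = discover_truths(claims)
print(truths["o6"], weights["A"] > weights["C"])
```

In this toy input, the conflict on o6 is a one-against-one tie under equal weights; once C's disagreements on other objects lower its weight, the value claimed by A wins. Constraint-aware methods such as those summarized above would add further terms or repair steps that this sketch does not attempt.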